Increasing Vector Processor Pipeline Efficiency with a Thread-Interleaved Controller

Valeriu Codreanu, Lucian Petrică, Radu Hobincu

Abstract—Vector processors are a fast and energy-efficient way of executing code with large amounts of data parallelism. Programmability, however, has been a persistent difficulty and has limited vector processors to a few niche applications. Recent advances in compiler auto-vectorization promise to make vector processors relevant for general-purpose computing. However, compiler-generated code is often inefficient and makes poor use of vector resources, which are typically the most area- and power-consuming parts of a processing system. We propose adapting interleaved multi-threading, a proven technique for increasing the pipeline efficiency of scalar processors, to improve the utilization of vector resources, thereby providing gains in speed as well as a potential reduction in energy consumption.

Manuscript received June 18, 2011. This research was supported by POSDRU projects 7713 and 61178 and was carried out with the guidance and support of professor S.D. Cotofana of the Computer Science department of TU Delft and professor Gheorghe Ștefan of the DCAE department of "Politehnica" University of Bucharest. V. Codreanu is with the "Politehnica" University of Bucharest, Iuliu Maniu 1-3, Bucharest, Romania (corresponding author; phone: +4 0723 295 987; e-mail: [email protected]). L. Petrică is with the "Politehnica" University of Bucharest, Iuliu Maniu 1-3, Bucharest, Romania (e-mail: [email protected]). R. Hobincu is with the "Politehnica" University of Bucharest, Iuliu Maniu 1-3, Bucharest, Romania (e-mail: [email protected]).

I. INTRODUCTION

Vector processors have seen extensive research as a means of high-performance parallel computing. A typical vector processor architecture consists of a scalar processor feeding instructions to a SIMD (Single Instruction, Multiple Data) engine. All branches, as well as all scalar code, are executed on the scalar processor, while vector operations are passed to the SIMD array for execution [1]. This approach has the advantage of simplicity and area efficiency, since fetch and branch logic is centralized and does not grow with SIMD engine size. As a result, vector processors scale well to hundreds or thousands of cores and are more energy-efficient for certain workloads than MIMD (Multiple Instructions, Multiple Data) architectures [2].

With recent advances in auto-vectorizing compilers and optimization tools [3], software can more easily be written for, or converted to, vector implementations, with potentially significant increases in performance. However, the inherent advantages of vector processors are often wasted by inefficient compilers or simply by scalar code intermixed with vector code. Such control bubbles are inevitable in a vector processor and reduce pipeline utilization in the SIMD engine, lengthening computation and wasting energy.

Data locality is another issue that limits vector processing efficiency. For the required operands to be loaded into vector registers in the form required by the algorithm, time-consuming IO operations are needed which, in most cases, stall the pipeline and reduce overall efficiency. Similar delays appear when inter-cell communication is required, for example when copying data left or right in the same vector register [4].

We propose using a thread-interleaved scalar processor to alleviate the issue of low SIMD engine utilization. Thread-interleaved processors execute instructions from different threads sequentially, thereby eliminating some of the data and control dependencies between consecutive instructions. High-latency blocking operations such as memory loads can also be masked in this mode of execution by instructions from other threads, providing higher overall pipeline efficiency, as has been shown in prior art [5][6][7]. This paper is concerned with vector processor pipeline efficiency and the gains attainable by coupling a thread-interleaved scalar processor to a SIMD engine.

The outline of the paper is as follows. Section II presents the proposed architecture and a detailed explanation of the execution mechanism. Section III presents the test workloads and experimental setup. Section IV presents and discusses the experimental results. Section V offers concluding remarks, while Section VI suggests ideas for future work.

II. PROPOSED ARCHITECTURE

The proposed architecture is presented in Fig. 1. The scalar processor employs a 32-bit instruction set and supports four hardware threads. It is a custom implementation of the BEAM execution model presented in [6], developed as part of the authors' PhD research. The SIMD engine, ConnexArray™, is currently being developed as a commercial product by AllSeach Inc. The vector and the scalar processor share most of the same instruction set, which simplifies the design and implementation of a compiler for this architecture.

Instructions are fetched into separate buffers for each thread, and instructions are selected and issued in a round-robin fashion. Threads which encounter blocking operations are marked as inactive until they can be unblocked, at which time they are reinserted into the round-robin queue. Issued instructions which have a vector register as source or destination are pushed onto the SIMD engine instruction queue.

The SIMD engine consists of 128 processing elements (PEs). Each PE operates on 16-bit values and consists of a 16-entry register file, an ALU and a 1024-entry SRAM local store. A vector Load/Store Unit serves the SIMD engine globally and can perform DMA transfers between main memory and the local store.
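To make the issue mechanism concrete, the following C sketch models the round-robin thread selection and blocking behavior described above. It is an illustrative model under assumed names and types, not the processor's actual control logic.

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_THREADS 4

/* Per-thread state visible to the issue stage (illustrative). */
typedef struct {
    bool     active;  /* false while blocked on a long-latency operation */
    uint32_t pc;      /* per-thread program counter */
} thread_ctx_t;

static thread_ctx_t threads[NUM_THREADS];
static int rr_next = 0;  /* round-robin pointer */

/* Select the next active thread in round-robin order.
 * Returns -1 if every thread is blocked and the pipeline must bubble. */
int select_thread(void)
{
    for (int i = 0; i < NUM_THREADS; i++) {
        int t = (rr_next + i) % NUM_THREADS;
        if (threads[t].active) {
            rr_next = (t + 1) % NUM_THREADS;
            return t;
        }
    }
    return -1;
}

/* A thread issuing a blocking instruction (load, multiply/divide,
 * reduction) is marked inactive; the completion logic later
 * reactivates it, reinserting it into the round-robin rotation. */
void block_thread(int t)   { threads[t].active = false; }
void unblock_thread(int t) { threads[t].active = true;  }
```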

Vector-to-scalar and scalar-to-vector conversions are supported through the reduction and distribution networks, respectively. A 16-bit value can be converted to a vector by replicating it 128 times, and a vector can be reduced to a scalar by performing a logical OR, a maximum or a summation over all or some of the vector elements. All vector instructions execute in one clock cycle except multiply and divide (which are interpreted), reduction and load/store. Scalar instructions also execute in one clock cycle except multiply, divide and loads. Encountering any of these instructions removes the issuing thread from the running queue until the instruction has completed.
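As a behavioral reference (not a description of the hardware), the distribution and reduction semantics can be sketched as follows; the element mask stands in for the "all or some" selection, and all names are assumed.

```c
#include <stdint.h>

#define NUM_PES 128

/* Distribution: replicate one 16-bit scalar across all 128 lanes. */
void distribute(uint16_t scalar, uint16_t vec[NUM_PES])
{
    for (int i = 0; i < NUM_PES; i++)
        vec[i] = scalar;
}

typedef enum { RED_OR, RED_MAX, RED_SUM } red_op_t;

/* Reduction: combine the selected lanes (mask[i] != 0) into a scalar
 * with logical OR, maximum or summation. */
uint32_t reduce(const uint16_t vec[NUM_PES], const uint8_t mask[NUM_PES],
                red_op_t op)
{
    uint32_t acc = 0;  /* identity for OR, max and sum on unsigned data */
    for (int i = 0; i < NUM_PES; i++) {
        if (!mask[i])
            continue;
        switch (op) {
        case RED_OR:  acc |= vec[i];                      break;
        case RED_MAX: acc  = vec[i] > acc ? vec[i] : acc; break;
        case RED_SUM: acc += vec[i];                      break;
        }
    }
    return acc;
}
```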

Fig. 1. Vector processor structure

For the purpose of converting the initial system to an interleaved multi-threaded scheme, a clear resource partitioning scheme must be implemented to isolate threads from each other and to provide program coherency as well as system-level security. The scalar processor has a 64-entry register file and provides each thread with 16 registers. Register windows are enforced by hardware, and from a programming point of view threads are invisible to each other. Our goal was to implement a similar partitioning scheme for the vector resources while adding as little hardware as possible to the SIMD processing elements. This is achieved by restricting access from each controller thread to a subset of the vector resources: each controller thread has access to 16 scalar registers, 4 vector registers and 256 entries in the local store. Fig. 2 presents the memory resources available to each thread and the data paths between them.

Fig. 2. Vector processor memory resources

While this partitioning allows four separate threads to issue instructions to the SIMD engine, the total vector resources available to each thread are reduced by a factor of four, which may affect per-thread performance and power consumption by increasing variable swapping between SIMD registers, the local store and main memory. These factors must be included in the overall efficiency estimates. Section III evaluates each test workload with regard to resource utilization.
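A minimal sketch of this per-thread partitioning, assuming a simple base-plus-offset windowing scheme (the text does not detail the hardware mechanism, and the function names are ours):

```c
#include <stdint.h>

#define THREADS            4
#define SCALAR_REGS_TOTAL  64   /* 16 per thread  */
#define VREGS_PER_PE       16   /* 4 per thread   */
#define LOCAL_STORE_WORDS  1024 /* 256 per thread */

/* Translate a thread-local scalar register index (0..15) to a
 * physical index; hardware adds a per-thread window base. */
static inline int scalar_reg_phys(int thread, int reg)
{
    return thread * (SCALAR_REGS_TOTAL / THREADS) + (reg & 15);
}

/* Same idea for the 16-entry vector register file: 4 registers per
 * thread, selected by a 2-bit thread-local index. */
static inline int vector_reg_phys(int thread, int vreg)
{
    return thread * (VREGS_PER_PE / THREADS) + (vreg & 3);
}

/* Local-store addresses fall inside a 256-entry window per thread,
 * so threads cannot read or write each other's data. */
static inline int local_store_phys(int thread, int addr)
{
    return thread * (LOCAL_STORE_WORDS / THREADS) + (addr & 255);
}
```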

III. TEST SETUP AND BENCHMARKS

We have chosen several application kernels from different application classes to test the proposed architecture under different types of workloads. All programs were written in C and compiled with a custom port of the GNU Compiler Collection (GCC). The cycle-accurate simulation was generated from the original Verilog source code by Verilator, an open-source tool.

The autocorrelation test is a signal processing kernel used in many applications, including LPC (Linear Predictive Coding) voice encoding. The test consists of calculating the autocorrelation vector of 384 input sequences of 128 samples each. Samples are 16-bit fixed-point values. To simulate an LPC processing workload, only 8 values of the autocorrelation vector are computed and stored as potential input to a Levinson-Durbin algorithm implementation [8]. Profiling shows autocorrelation to be a MAC (Multiply and Accumulate) intensive kernel, where multiplication and reduction operations make up most of the vector instructions. Resource usage and IO requirements for this kernel are minimal: it requires only one entry in the local store for input data and three registers for intermediate results. All output is done through the reduction network.
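To make the kernel's instruction profile concrete, here is a scalar C reference of the computation each task performs (function and buffer names are ours). On the vector machine, the products for one lag are computed in parallel across the PEs and summed through the reduction network, hence the dominance of multiply and reduction instructions.

```c
#include <stdint.h>

#define N_SAMPLES 128
#define N_LAGS    8

/* Scalar reference for the autocorrelation kernel: for each lag k,
 * r[k] = sum over n of x[n] * x[n + k].  On the vector machine the
 * products for one lag are computed in parallel, one sample per PE,
 * and the sum is produced by the reduction network. */
void autocorr(const int16_t x[N_SAMPLES], int32_t r[N_LAGS])
{
    for (int k = 0; k < N_LAGS; k++) {
        int32_t acc = 0;
        for (int n = 0; n < N_SAMPLES - k; n++)
            acc += (int32_t)x[n] * x[n + k];
        r[k] = acc;  /* one vector multiply + one sum reduction per lag */
    }
}
```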

AES is one of the most widely used encryption algorithms and is widely considered secure. We have split the encryption of an AES block across 16 PEs of the SIMD engine. In addition, if CTR (Counter) chaining mode is used, all 128 PEs can be active simultaneously, encrypting or decrypting 8 blocks in parallel. The test consists of encrypting 4 kilobytes of data in CTR mode with a 128-bit key. AES consists mostly of bitwise logical operations, so most of its instructions are single-clock.

The FFT test consists of a batch of 64 128-point fast Fourier transforms. The last test is the application of a Gaussian blur filter to a 300K-pixel image.

For the purpose of testing the proposed architecture we have ported GCC to the new instruction set and implemented a custom assembler. Compiled and assembled code can be executed on a cycle-accurate simulator or an instruction-accurate emulator. For each test we measure controller CPI (Clocks Per Instruction) as well as SIMD CPI and utilization (the percentage of the total execution time in which the SIMD instruction queue is not empty). All tests have been set up as a pool of tasks which can be executed sequentially on a single thread or split between several threads. Instruction counts are generated by the emulator; cycle counts, as well as pipeline utilization statistics, by the simulator. The simulated memory interface has a 40-cycle latency and 3.2 GB/s of bandwidth in long sustained bursts.

IV. TEST RESULTS

Autocorrelation test results are presented in Fig. 3. They show a significant increase in SIMD engine utilization up to three threads. Utilization levels off at 94 percent for 3 and 4 threads, as do the SIMD CPI and the controller CPI. This is because most of the test is made up of multiplication and reduction instructions. Multiplication takes 8 cycles and reduction takes 10; both are blocking operations for the SIMD engine, while reduction also blocks the controller. This high density of blocking instructions sets a theoretical limit on the achievable CPI of both the controller and the SIMD engine. It must be noted that for most of the tests with 3 or 4 threads the SIMD engine is fully used, and total utilization is only 94% because of thread initialization overhead. Another important observation is that the 4-thread test takes more cycles to complete: the machine is already saturated performance-wise at 3 threads, and adding another thread introduces additional initialization overhead.

AES test results are presented in Fig. 4. The algorithm consists mainly of vector bitwise logical operations and does not contain any vector multiplication or reduction. About 30% of the total instruction count is scalar, and 10% are loads and stores. Because of this instruction distribution, the single-threaded AES test yields only 29% SIMD utilization. Utilization increases to 56% at 3 threads and levels off as the controller memory interface becomes saturated.
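Throughout these results, the metrics are derived from the two tools' outputs as sketched below; the counter and structure names are hypothetical, not the tools' actual interfaces.

```c
#include <stdint.h>

/* Metrics derived from the tool outputs described above; the struct
 * and field names are illustrative, not the tools' actual API. */
typedef struct {
    uint64_t total_cycles;      /* from the cycle-accurate simulator      */
    uint64_t simd_busy_cycles;  /* cycles with a non-empty SIMD queue     */
    uint64_t ctrl_instructions; /* from the instruction-accurate emulator */
    uint64_t simd_instructions;
} run_stats_t;

double controller_cpi(const run_stats_t *s)
{
    return (double)s->total_cycles / (double)s->ctrl_instructions;
}

double simd_cpi(const run_stats_t *s)
{
    return (double)s->total_cycles / (double)s->simd_instructions;
}

/* Utilization: fraction of total execution time in which the SIMD
 * instruction queue is not empty. */
double simd_utilization(const run_stats_t *s)
{
    return (double)s->simd_busy_cycles / (double)s->total_cycles;
}
```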

Fig. 3. Autocorrelation test results for different execution threads (clocks ×10,000, controller CPI and SIMD CPI; 1-4 threads)

Fig. 4. AES test results for different execution threads (clocks ×10,000, controller CPI and SIMD CPI; 1-4 threads)

The FFT implementation for the Connex system is described in [9] and the test results are presented in Fig. 5. The test is similar in characteristics to AES and scales in the same way, but it maintains a modest 5% increase in SIMD utilization from 3 to 4 threads because it contains vector multiplications, which use up more vector pipeline cycles.

Blur filter test results are presented in Fig. 6. They show no further performance increase beyond 2 threads, because the algorithm has a high ratio of IO to computation. Most of the time is spent waiting for vector DMA transfers between the local store and main memory, and utilization suffers accordingly.

Fig. 5. FFT test results for different execution threads (clocks ×100,000, controller CPI and SIMD CPI; 1-4 threads)

Fig. 6. Blur Filter test results for different execution threads (clocks ×100,000, controller CPI and SIMD CPI)

V. CONCLUSIONS

We have implemented a thread-interleaved controller for a vector processor. Tests on several computational kernels with different characteristics have shown significant increases in SIMD engine utilization compared to single-threaded execution, as shown in Fig. 7. All tests exhibit some limiting factor for performance beyond a certain number of threads, depending on the test's instruction profile and IO requirements. SIMD engine utilization was at least doubled in the multi-threaded scenario compared to single-threaded execution of the same workload, and the modifications required for the SIMD engine to support multi-threading are minimal and easily achieved. It has also been shown that several factors can limit the efficiency gains of multi-threaded execution, most notably IO DMA operations but also scalar loads and stores. These, however, depend strongly on memory interface bandwidth and latency, and future work should examine the effect of memory interface speed on multi-threaded pipeline efficiency.

Fig. 7. Array Utilization for each kernel and different execution threads

In real-life applications, the interleaved scalar CPU can also execute additional tasks if the SIMD engine is saturated with fewer than 4 threads, further increasing overall system throughput.

VI. FUTURE WORK

We are considering increasing efficiency, both in terms of speed and energy consumption, by matching threads with complementary characteristics. This can be achieved by a hybrid software-hardware scheduling mechanism that takes into consideration both the machine characteristics and the dynamic behavior of the algorithm. Reduction operations could also be masked by a suitable compiler, which would reorder instructions so that the reduction is performed in the background.

REFERENCES

[1] Asanovic, K., "Vector Microprocessors", PhD thesis, University of California, Berkeley, 1998.
[2] Stefan, G., "The CA1024: A massively parallel processor for cost-effective HDTV", Spring Processor Forum: Power-Efficient Design, 2006.
[3] Grosser, T., Zheng, H., Raghesh, A., Simburger, A., Grosslinger, A., and Pouchet, L.-N., "Polly - Polyhedral optimization in LLVM", First International Workshop on Polyhedral Compilation Techniques (IMPACT'11), 2011.
[4] Shahbahrami, A., Juurlink, B., Borodin, D., and Vassiliadis, S., "Avoiding conversion and rearrangement overhead in SIMD architectures", International Journal of Parallel Programming, 2006.
[5] Hennessy, J.L. and Patterson, D.A., "Computer Architecture: A Quantitative Approach", 4th ed., Morgan Kaufmann, 2007.
[6] Codreanu, V. and Hobincu, R., "Performance Gain from Data and Control Dependency Elimination in Embedded Processors", Proceedings of ISETC, 2010.
[7] Laudon, J., "Performance/Watt: the new server focus", ACM SIGARCH Computer Architecture News, vol. 33, no. 4, 2005.
[8] Markel, J.D. and Gray, A.H., "Linear Prediction of Speech", Springer-Verlag, Berlin, 1976.
[9] Lőrentz, I., Maliţa, M., and Andonie, R., "Fitting FFT onto an energy efficient massively parallel architecture", Proceedings of the Second International Forum on Next-Generation Multicore/Manycore Technologies (IFMT'10), 2010.
