Development and Code Partitioning in a Software Configurable Processor

Karan R Shetti
School of Computer Engineering, Nanyang Technological University, Singapore
[email protected]

C.L. Koh, M.T. Aung, T. Bretschneider
Real Time Embedded Systems, EADS Innovation Works, Singapore
{timo.bretschneider, augustine.koh, myo_tun.aung}@eads.net

Abstract—Most reconfigurable processors are not fully controlled by software; they are reconfigured using hardware description languages. By moving the data paths into the processor, the system architect can discard the external control logic, finite state machines and micro-sequencers. Examples of such processors are the members of the Stretch family of Software Configurable Processors, which embed a reconfigurable fabric inside the actual processor and thereby allow development in a high-level programming language like C. The paper provides an overview of the architecture and a case study based on the 2D FFT that illustrates the issue of software-hardware partitioning. A software-only solution is described first and serves as a benchmark for two code partitioning schemes proposed later. The designs of the partitions are described in detail together with their timing analyses. The primary criterion is the overall execution time, along with the frequency at which the reconfigurable fabric can be clocked. Comparing the results shows which partition performs better and which parameters contribute to the reduction in execution time. The paper also gives an overview of the challenges and benefits of software configurable processors.

Keywords: Software configurable processors, code partitioning, Fast Fourier Transform

I. INTRODUCTION

General Purpose Processors (GPP) provide a vast degree of flexibility as they are not tied to any particular development environment or problem domain. This makes them suitable for numerous applications, but their performance is sub-optimal when the scope is application specific. Alternatively, Application Specific Integrated Circuits (ASIC) are able to work efficiently under constraints of power and size while delivering high performance: by customizing circuits to a specific application, better performance is achieved. However, due to the complexity of the design process and rigid design constraints, ASIC implementations are on the decline in many areas. In order to reach the middle ground between the flexibility of a GPP and the efficiency of an ASIC, the concept of a reconfigurable processor has gained importance. There are two kinds of reconfigurable processors,

978-1-4244-4547-9/09/$26.00 ©2009 IEEE

namely the hardware reconfigurable processor and the Software Configurable Processor (SCP). Field Programmable Gate Arrays (FPGA) can be programmed to hold reconfigurable processors that use a fine-grained approach to develop applications in hardware. Typically, hardware description languages like VHDL or Verilog are used to implement the hardware portion, while special subsets of C or other high-level languages are employed for the software component running on the processor. This approach does allow reconfiguration, but only to a limited extent. Software configurable processors, in contrast, use only high-level languages or their subsets to program the reconfigurable fabric. The fabric can act as a co-processor or as a specialized functional unit [1], thereby accelerating only sections of the application. This approach provides the flexibility of a software solution and the efficiency of a hardware design in one framework.

II. LITERATURE REVIEW

FPGAs provide high performance, as shown by Dillon [2], who achieved 120 frames per second (fps) for the 2D Fast Fourier Transform on a 2048×2048 image. Two Virtex-II FPGAs were used to develop an economic solution. However, the implementation effort is time consuming and the solution cannot be easily modified or adapted. To achieve more flexibility, one possible solution is to couple a Central Processing Unit (CPU) with an FPGA. There are various levels of coupling [3], whereby more loosely coupled solutions offer more flexibility, because the reconfigurable logic executes more cycles without intervention from the CPU. Xilinx FPGAs have pre-installed Fast Fourier Transformation (FFT) cores like the Xilinx® LogiCORE™, and they deliver very good performance in terms of speed [4]. However, this design paradigm requires specialized knowledge of both the software and the hardware platform. An easier approach to achieve flexibility at a lower cost is to extend the architecture of a system using customized instructions. This idea has led to the advent of Application Specific Instruction Set Processors (ASIP) [5]. ASIPs allow multiple levels of parallelization and provide support for a wide variety of external peripherals [1]. There are two ways in which instructions can be customized, automatic and manual. Automatic customization of instructions is a well-specified problem, and various solutions using graph theory have been developed to solve it [7]. Using the Language for Instruction Set Architecture (LISA), an implementation of a new cached FFT algorithm was proposed by Atak et al. [6]. Cong et al. [8] show that ASIPs can be further accelerated by using shadow registers to solve the data bandwidth problem associated with ASIPs. Combining the benefits of ASIPs and FPGAs, Stretch Inc. proposed the SCP [9]. By adding pragma statements, a normal C function is converted into an instruction by the compiler and executed in a single cycle. As the development is carried out only in software, solutions can be very flexible. Reference [10] provides an overview of the development on a first-generation Stretch platform.

TENCON 2009

III. CASE STUDY: 2D FAST FOURIER TRANSFORM

Among the most common operations in image processing is the convolution of an image, with image filtering as a simple application. In order to reduce the processing time for larger data sets, the operation is best performed in Fourier space. For this purpose the FFT is one of the most widely accepted and efficient implementations. The basic principle of the transformation is the divide-and-conquer approach, reducing the computational load to O(N log N) operations; the complexity for the 2D case is O(N² log N).

A. 2D Fast Fourier Transform Algorithm

The Y component of a grayscale YUV image is processed by a 1D FFT along the row direction first, before the same function is applied along the columns. For an input image I of size N×N (N = 2^k, k ∈ ℕ) with complex values the pseudo-code is given in Algorithm 1.

Algorithm 1: 2D FFT decomposed into two 1D FFTs
FFT2D(I.real[N][N], I.imaginary[N][N])
1. for (p : 0 -> N-1)
   a. for (q : 0 -> N-1)
      i. Copy data to data RAM
      ii. 1DFFT()
   b. END
   c. for (q : 0 -> N-1)
      i. Copy data to local memory
      ii. Normalize
   d. END
2. END
3. for (r : 0 -> N-1)
   a. for (s : 0 -> N-1)
      i. Copy data to data RAM
      ii. 1DFFT()
   b. END
   c. for (s : 0 -> N-1)
      i. Copy data to local memory
      ii. Normalize
   d. END
4. END

B. 1D Fast Fourier Transform Algorithm

For an input sequence of length P, the number of stages required for the radix-2 FFT is M = log2(P), where M is the number of stages. Prior to the actual FFT, the input data is scrambled using bit-reverse sorting. In the given case, the hardware is most efficient for fixed-point arithmetic; hence, the image is first converted to fixed point before any operations are carried out. Then, for an N-point FFT with real part A and imaginary part B, Algorithm 2 is applied.

Algorithm 2: 1D FFT with data type conversion
1. for (i : 0 -> N-1)
   a. fixed-point conversion(A, B)
   b. bit-reverse sorting(A, B)
2. param1 = 1
3. for (i : 0 -> log2(N)-1)
   a. param2 = param1
   b. param1 = param1 * 2
   c. for (j : 0 -> N-1, j += param1)
      i. for (k : j -> j+param2-1)
         a. Butterfly(A[k], B[k], A[k+param2], B[k+param2])
      ii. END
   d. END
4. END

IV. ARCHITECTURE OF THE STRETCH S6105

A. Host Processor

The Stretch S6105 is a hybrid processor: it contains a 300 MHz RISC (Xtensa) processor coupled with a reconfigurable fabric called the Instruction Set Extension Fabric (ISEF). It also hosts an auxiliary processor with programmable accelerators that perform various tasks more efficiently, and it is attached to a number of peripherals enabling efficient I/O transfers [9]. This work focuses only on the SCP engine.

B. Instruction Set Extension Fabric (ISEF)

The ISEF is a reconfigurable fabric which can be programmed using the high-level language C. By adding the pragma statement SE_FUNC, C functions are converted into single instructions [9]. The ISEF can be configured to run at 100, 150 or 300 MHz depending on the complexity of the implemented functionality. The ISEF architecture is made up of arithmetic units, a dedicated multiplication array and other logic units. The arithmetic units perform operations such as addition and subtraction. The fabric also has 35 Kb of pipeline registers to create a deep pipeline of instructions, thereby enhancing performance. Additionally, 64 dedicated 8×16-bit multiplication units are structured such that they can perform multiple multiplications in one cycle, similar to an FPGA. These units can be easily cascaded to perform up to eight 32×32-bit multiplications. Three input ports and two output ports, each 128 bits wide, determine the number of parameters that can be passed to a function.

C. Memory Architecture

There are four types of memory available on the Stretch platform [11]:


• Local memory on the host processor (64 MB)
• Dual-port RAM (64 KB): storing data in this memory reduces the number of processor stalls due to cache misses.
• IRAM (64 KB): this memory is embedded inside the ISEF and can be accessed in one cycle. It is organized in 32 banks of 2 KB each.

Apart from these memory structures, the Stretch platform also has 32 banks of wide registers which are able to pass up to 128 bits each between the processor and the ISEF in one cycle.
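The SE_FUNC mechanism of Section IV.B can be illustrated with a butterfly written as an ordinary C function. This is a sketch, not the Stretch SDK's exact declaration syntax or types: the pragma placement is assumed from the paper's description, and a standard compiler simply ignores it off-target. The function performs one radix-2 butterfly on 32-bit fixed-point data with a Q15 twiddle factor, matching the data widths given in Section V.

```c
#include <stdint.h>

/* Candidate extension instruction: one radix-2 butterfly on 32-bit
   fixed-point data with a Q15 twiddle factor (wr, wi).  On the Stretch
   tool chain the SE_FUNC pragma marks the function for mapping onto
   the ISEF; elsewhere the pragma is ignored and the function runs as
   plain C, which is what makes the partitioning purely a software choice. */
#pragma SE_FUNC
void butterfly(int32_t *ar, int32_t *ai, int32_t *br, int32_t *bi,
               int16_t wr, int16_t wi)
{
    /* complex multiply b * w; Q15 twiddle, so shift right by 15 */
    int32_t tr = (int32_t)(((int64_t)*br * wr - (int64_t)*bi * wi) >> 15);
    int32_t ti = (int32_t)(((int64_t)*br * wi + (int64_t)*bi * wr) >> 15);
    int32_t sr = *ar, si = *ai;
    *ar = sr + tr;  *ai = si + ti;   /* a' = a + w*b */
    *br = sr - tr;  *bi = si - ti;   /* b' = a - w*b */
}
```

With wr = 16384 (0.5 in Q15) and wi = 0, the butterfly reduces to a' = a + b/2 and b' = a - b/2, which is easy to verify by hand.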

V. DESIGN ALTERNATIVES

At the most basic level, the FFT is characterized by the butterfly operations, which involve complex multiplication of the data with the twiddle factors. The data is represented in 32-bit fixed point and the twiddle factors occupy 16 bits each. A single butterfly operation involves four multiplications, five additions and five subtractions. Hence, the hardware resources available in the ISEF restrict the implementation to four simultaneous butterfly operations. In this paper the overall execution time is considered the main efficiency criterion, together with other parameters such as flexibility, the frequency at which the hardware can run, and the amount of resources used. All alternatives designed here use the same algorithm and similar computations except for data reusability and parallelization; in this regard the alternative configurations are comparable. In order to determine the best implementation, several options were explored. The simplest is to perform the entire algorithm on the host processor; this approach was used as a benchmark to assess the improvements made by the different hardware configurations. The next step was to port the butterfly operations to the ISEF, while the processor controlled the data transfers and address generation. This was done in a single configuration with single instructions performing the butterfly operations. The final alternative explored multiple configurations and multiple instructions.

A. Partitioning Choice 1

Performing the FFT on the host processor is straightforward. In order to obtain an optimized software solution, the dual-port data RAM was used to buffer one row/column of the image and then to perform the 32-bit fixed-point 1D FFT. This reduced the number of processor stalls due to data cache misses. Even though this approach incurs some overhead due to memory copy operations, the improvement in execution time favors this approach. The look-up tables for the twiddle factors were also stored in the data RAM. The execution flow of the algorithm is shown in Section III.A. The advantage of the software solution is the flexibility it provides by being able to change the size of the image at compile time or even at runtime.

B. Partitioning Choice 2

By profiling the code, it was observed that the butterfly operations are the computationally most intensive part of the algorithm. However, there are some difficulties in porting this part of the algorithm onto the ISEF. In order to achieve parallelization in the ISEF, loops in the program flow are unrolled, i.e. the loop variables need to be determined at compile time. This poses a challenge as the loop variables in the FFT are generated at runtime based on different parameters (Lines 3.c and 3.c.i in Section III.B). Moreover, the ISEF does not have the hardware resources to unroll all loops, as it can perform only four butterfly operations at a time. Another option is to hard-code the addresses (normally computed in the loop) using look-up tables, but this reduces the flexibility of a software-based solution. Since the ISEF cannot access processor local memory or the data RAM, the data must be transferred either via DMA to the IRAM or through the extended 128-bit registers (WR). These registers, however, do not allow random access and are small compared to the size of the image data. The IRAM, on the other hand, offers 64 KB of memory and supports random access, but concurrent read/write operations to the same memory bank cause stalls of one or more cycles in the ISEF.

Keeping these hardware limitations in mind, the FFT operation was partitioned between the host processor and the ISEF for optimal performance. The processor was tasked with generating all the indexes while the ISEF performs the complex multiplications. The generated parameters were passed to the ISEF through the extended registers in one cycle. The data, on the other hand, was transferred to the ISEF via DMA from the dual-port data RAM. Eight rows of complex values were sent at a time and stored in different banks. The FFT algorithm performs two consecutive read operations from the same row (bank) of data and calculates the FFT in place (Line 3.c.i in Section III.B). Therefore, performing four butterfly operations simultaneously using row data from the same memory bank severely degraded the performance due to ISEF stalls. Hence, eight rows of data were stored in the IRAM in different banks and accessed column-wise. This effectively allowed parallel execution on eight data rows and reduced the number of ISEF stalls. The execution flow of this configuration was:

1. DMA in the eight rows of data
2. If row-wise FFT
   a. Scale data & bit-reverse sort eight rows
3. Else
   a. Bit-reverse sort eight rows
4. Transfer parameters through WR registers
   a. Perform four butterfly operations
5. DMA data out

This implementation occupied 88% of the resources available on the ISEF and could run at a frequency of 100 MHz.
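The bank assignment that avoids the IRAM stalls can be sketched as follows. The mapping here is a simplified illustrative model, not the actual Stretch address layout: each of the eight DMA-ed rows is assumed to occupy its own bank, so the parallel butterflies working column-wise never address the same bank in one cycle.

```c
#define BANKS 32          /* IRAM: 32 banks of 2 KB each            */
#define ROWS   8          /* rows transferred in per DMA batch      */

/* Hypothetical placement: row r of the batch lives entirely in bank r. */
static int bank_of(int row, int col)
{
    (void)col;            /* the column does not change the bank    */
    return row % BANKS;
}

/* Worst-case number of simultaneous accesses to any single bank when
   the eight rows are processed column-wise in one parallel step.  A
   result of 1 means the step is conflict-free (no ISEF stall). */
int max_bank_conflicts(int col)
{
    int hits[BANKS] = {0};
    int worst = 0;
    for (int r = 0; r < ROWS; r++) {
        int b = bank_of(r, col);
        if (++hits[b] > worst) worst = hits[b];
    }
    return worst;
}
```

Had the eight rows been packed into a single bank, the same check would report eight-way conflicts, which is exactly the stall scenario the column-wise access pattern avoids.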

C. Partitioning Choice 3

In this partitioning scheme, a similar approach to the scheme illustrated in Section V.B was used in terms of dividing the workload between the host and the ISEF. However, the routing resources prevented any change to include more operations on the ISEF. In order to port more operations to the ISEF, the single configuration was split into two configurations: one configuration performs the FFT on data selected row-wise; afterwards the ISEF is reconfigured for the FFT operating on the data in the column direction. The reconfiguration time is typically 30-100 ms [9]. Using this approach, more operations such as double buffering and normalization can be carried out inside the ISEF at the same time.

A different approach was also used inside the ISEF. As mentioned in Section IV.B, the compiler converts C functions into single instructions. In order to achieve better performance, the complex instruction was reduced to multiple simpler instructions: the butterfly operation, previously performed as a single function, was split into two separate read operations, followed by the actual butterfly computation and two consecutive write operations. By using this reduced instruction set, the issue of IRAM stalls was reduced even further as each instruction accesses a particular bank only once. The execution sequences of the two configurations are:

Configuration 1: Row data
1. DMA in the eight rows of data
2. Scale the data and bit-reverse sort eight rows
3. Transfer parameters through WR registers
   a. Read Data 1
   b. Read Data 2
   c. Perform four butterfly operations
   d. Write Data 1
   e. Write Data 2
4. DMA data out

Configuration 2: Column data
1. DMA in the eight rows of data
2. Bit-reverse sort eight rows
3. Transfer parameters through WR registers
   a. Read Data 1
   b. Read Data 2
   c. Perform four butterfly operations
   d. Write Data 1
   e. Write Data 2
4. DMA data out

The implementations consumed 86% and 85% of the ISEF resources for Configuration 1 and 2, respectively. A theoretical operating frequency of 165 MHz was achieved; however, as the ISEF can run only at one of the three frequencies mentioned in Section IV.B, 150 MHz was chosen.

VI. RESULTS

Examining the first scheme presented in Section V.A, Figure 1 shows the change in overall execution time for the different image sizes. It also shows the breakdown of the time between the memory transfer operations (Lines 1.a.i, 1.c.i, 3.a.i and 3.c.i in Section III.A) and the actual computations (Lines 1.a.ii, 3.a.ii and 3.c.ii in Section III.A). All timing measurements are taken using the host (Xtensa) processor's built-in cycle count instruction. For an image of size 512×512 the memory copy operations are responsible for over 57% of the execution time whereas computation takes 39%. The remaining portion of the time is consumed by functional overheads and memory allocation operations.

Figure 1: Overall execution time vs. image size for Partitioning Choice 1

Figure 2 shows the execution time of the implementation presented in Section V.B, along with its breakdown. In this scheme there are two levels of memory transfers, one to copy to the dual-port data RAM and a second to transfer to the ISEF, which accounts for the increase in memory copy time. For an image of size 512×512, 62% of the time is consumed by memory copy operations and 37% is taken by the FFT operation. The overheads are reduced substantially owing to the fact that functions are now mapped as instructions.

Figure 2: Overall execution time vs. image size for Partitioning Choice 2


Examining Figure 3, which shows the overall execution time for the third choice described in Section V.C, an improvement in the memory transfer time can be observed. This can be attributed to using double buffering for the DMA transfers and to performing more operations in the ISEF. Memory transfers now take only 52% and computation about 47% of the time for an image size of 512×512. Even though the overall improvement is not significant, one implication is that the ISEF can now include other operations, such as image filtering, by reusing the data; a new configuration for this purpose, which would have incurred more memory overheads, is not required. Comparing the improvement in execution time for the two hardware partitioning choices, it can be observed that the third choice yields the best results. Furthermore, this choice uses fewer resources and performs at a higher frequency. As shown in Figure 4, for an image of size 512×512, Partitioning Choice 2 improves the execution time by a factor of 5.5, while Partitioning Choice 3 yields a 6-times improvement.
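The double-buffering scheme credited above for the improved transfer time follows a generic pattern that can be sketched in C. Here dma_in and compute are placeholders (a plain copy and a dummy increment), not Stretch SDK calls: while the ISEF works on one buffer, the DMA engine fills the other, hiding transfer time behind computation.

```c
#include <string.h>

#define CHUNK 8   /* elements per DMA transfer, e.g. eight data rows */

/* Placeholder for a DMA transfer into on-chip memory. */
static void dma_in(int *dst, const int *src)
{
    memcpy(dst, src, CHUNK * sizeof(int));
}

/* Placeholder for the ISEF computation on one buffer. */
static void compute(int *buf)
{
    for (int i = 0; i < CHUNK; i++) buf[i] += 1;
}

/* Ping-pong between two buffers: fetch chunk c+1 while chunk c is
   being processed, then write the finished chunk back. */
void process_stream(int *data, int nchunks)
{
    int buf[2][CHUNK];
    dma_in(buf[0], data);                      /* prefetch first chunk */
    for (int c = 0; c < nchunks; c++) {
        int cur = c % 2;
        if (c + 1 < nchunks)                   /* overlap: fetch next  */
            dma_in(buf[1 - cur], data + (c + 1) * CHUNK);
        compute(buf[cur]);                     /* work on current      */
        memcpy(data + c * CHUNK, buf[cur], CHUNK * sizeof(int));
    }
}
```

In this sequential model the overlap is only structural, but on the S6105 the same pattern lets the DMA engine and the ISEF run concurrently, which is where the transfer-time reduction comes from.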

Figure 3: Overall execution time vs. image size for Partitioning Choice 3

Figure 4: Execution time of the different choices for an image of size 512×512 pixels

VII. CONCLUSION AND FUTURE WORK

In this paper we have shown how different partitioning techniques between software and hardware, and within the hardware itself, result in different execution times. The total number of operations in the program remains the same throughout all the partitions, but by reorganizing the code and porting only hardware-efficient operations onto the hardware, we have shown how to accelerate an algorithm while maintaining the flexibility of a software solution. The results show that using multiple configurations and instructions improves both execution speed and resource usage. As Figure 2 shows, memory transfers take up the largest share of the execution time and are therefore the most significant parameter to address; in order to improve the performance, the memory copy operations between the processor and the ISEF need to be reduced. Even though its improvement in execution time is not substantially higher, Partitioning Choice 3 carries the important implication that code should be organized such that data reusability is high. Moreover, this configuration can perform extra operations, such as filtering, in parallel to the FFT without a drop in the operating frequency or the need for a new configuration. The current implementation can perform the FFT on different image sizes by changing parameters at compile time. To further exploit the flexible nature of the SCP, future studies will explore the possibility of performing FFTs of different sizes by reconfiguring with runtime parameters.

REFERENCES

[1] Keutzer, K., Malik, S., & Newton, A. (2002). ASIC to ASIP: The next design discontinuity. In Proceedings of the IEEE International Conference on VLSI in Computers and Processors, pp. 84-90.
[2] Dillon, T. (2001). Two Virtex-II FPGAs deliver fastest, cheapest, best high performance image processing system. Retrieved April 11, 2009, from http://www.dilloneng.com/documents/fft_success.pdf
[3] Compton, K., & Hauck, S. (2002). Reconfigurable computing: A survey of systems and software. ACM Computing Surveys, Volume 34, Issue 2, pp. 171-210.
[4] Xilinx. (2008, September 19). Xilinx FFT documentation. Retrieved May 12, 2009, from Xilinx: http://www.xilinx.com/support/documentation/ipdsp_transform_fft.htm
[5] Athanas, P. M., & Silverman, H. F. (1993). Processor reconfiguration through instruction-set metamorphosis. Computer, Volume 26, Issue 3, pp. 11-18.
[6] Atak, O., Atalar, A., Arikan, E., Ishebabi, H., Kammler, D., Ascheid, G., et al. (2006). Design of application specific processors for the cached FFT algorithm. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Volume 3.
[7] Ravi, S., Raghunathan, A., Jha, N., & Sun, F. (2004). Custom-instruction synthesis for extensible-processor platforms. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Volume 23, Issue 2, pp. 216-228.
[8] Cong, J., Fan, Y., Han, G., Jagannathan, A., Reinman, G., & Zhang, Z. (2005). Instruction set extension with shadow registers for configurable processors. In Proceedings of the ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pp. 99-106.
[9] Arnold, J. M. (2005). S5: The architecture and development flow of a software configurable processor. In Proceedings of the IEEE International Conference on Field-Programmable Technology, pp. 121-128.
[10] Gonzalez, R. E. (2006). A software-configurable processor architecture. IEEE Micro, Volume 26, Issue 5, pp. 42-51.
[11] Stretch Inc. (2007). Stretch documentation. Retrieved March 15, 2009, from Stretch Inc. corporate website: http://www.stretchinc.com/documentation/downloads.php

