This article has been accepted and published on J-STAGE in advance of copyediting. Content is final as presented. IEICE Electronics Express, Vol.*, No.*, 1–12

Floating-point Operation Based Reconfigurable Architecture for Radar Processing

Fan Feng1, Li Li1,a), Kun Wang1, Feng Han1, Baoning Zhang1, Guoqiang He2

1 School of Electronic Science and Engineering, Nanjing University, Nanjing 210023, China
2 Nanjing Research Institute of Electronics Technology, Nanjing 210013, China
a) [email protected]

Abstract: To meet the increasing demand for large bandwidth and high throughput in modern radar systems, we propose a reconfigurable application specified processor (RASP) tailored to the characteristics of radar digital signal processing applications. RASP is a reconfigurable coprocessor built on hierarchical floating-point operation elements that executes a set of fundamental subalgorithms; taking these subalgorithms as the minimal task nodes improves computational efficiency tremendously. Experimental results show that the processor outperforms a state-of-the-art TI DSP by 1.05x to 3.22x. Our reconfigurable processor can be integrated into customizable radar systems; it was fabricated in a TSMC 40nm CMOS process and occupies an area of 19.2 mm2.
Keywords: Coarse-grained reconfigurable architecture, floating-point unit, radar signal processing
Classification: Integrated circuits
References

©IEICE 2016 DOI: 10.1587/elex.13.20160893 Received September 8, 2016 Accepted October 3, 2016 Publicized October 17, 2016

[1] D. Rossi, et al.: "A Heterogeneous Digital Signal Processor for Dynamically Reconfigurable Computing," IEEE J. Solid-State Circuits 45 (2010) 1615 (DOI: 10.1109/JSSC.2010.2048149).
[2] B. De Sutter, et al.: "Design Space Exploration for Efficient Resource Utilization in Coarse-Grained Reconfigurable Architecture," IEEE Trans. VLSI Systems 18 (2010) 1471 (DOI: 10.1109/TVLSI.2009.2025280).
[3] Y. Kim, et al.: "Coarse-grained reconfigurable array architectures," in Handbook of Signal Processing Systems (Springer, 2013) 553.
[4] M. Stojilovic, et al.: "Selective Flexibility: Creating Domain-Specific Reconfigurable Arrays," IEEE Trans. Computer-Aided Design of Integrated Circuits and Systems 32 (2013) 681 (DOI: 10.1109/TCAD.2012.2235127).
[5] D. Novo, et al.: "Mapping a multiple antenna SDM-OFDM receiver on the ADRES coarse-grained reconfigurable processor," IEEE Workshop on Signal Processing Systems Design and Implementation (2005) 473 (DOI: 10.1109/SIPS.2005.1579915).
[6] Y. Hasegawa, et al.: "An adaptive cryptographic accelerator for IPsec on dynamically reconfigurable processor," IEEE International Conference on Field-Programmable Technology (2005) 163 (DOI: 10.1109/FPT.2005.1568541).
[7] H. Singh, et al.: "MorphoSys: an integrated reconfigurable system for data-parallel and computation-intensive applications," IEEE Trans. Computers 49 (2000) 465 (DOI: 10.1109/12.859540).
[8] D. Lee, et al.: "FloRA: Coarse-grained reconfigurable architecture with floating-point operation capability," IEEE International Conference on Field-Programmable Technology (2009) 376 (DOI: 10.1109/FPT.2009.5377609).
[9] M. Jo, et al.: "Design of a coarse-grained reconfigurable architecture with floating-point support and comparative study," Integration, the VLSI Journal 47 (2014) 232 (DOI: 10.1016/j.vlsi.2013.08.003).
[10] B. Liu, et al.: "A Configuration Compression Approach for Coarse-Grain Reconfigurable Architecture for Radar Signal Processing," International Conference on Cyber-Enabled Distributed Computing and Knowledge Discovery (2014) 448 (DOI: 10.1109/CyberC.2014.83).
[11] S. Yin, et al.: "Reducing configuration contexts for coarse-grained reconfigurable architecture," IEEE International Symposium on Circuits and Systems (ISCAS) (2012) 121 (DOI: 10.1109/ISCAS.2012.6271452).
[12] F. Han, et al.: "An ultra-long FFT architecture implemented in a reconfigurable application specified processor," IEICE Electronics Express 13 (2016) (DOI: 10.1587/elex.13.20160504).
[13] K. Wang, et al.: "Design and implementation of high performance matrix inversion based on reconfigurable processor," IEICE Electronics Express 13 (2016) (DOI: 10.1587/elex.13.20160579).
[14] TMS320C6672 Multicore Fixed and Floating-Point DSP (2014), Lit. no. SPRS708E, Texas Instruments.
[15] TMS320C66x DSP Library (2014), Lit. no. SPRC265, Texas Instruments.

1 Introduction

Coarse-grained reconfigurable architectures (CGRAs) have come into the spotlight with the enormous increase in demand for high-performance applications across various fields, owing to their flexibility and performance [1, 2, 3]. CGRAs reduce the overheads of typical fine-grained field programmable gate arrays (FPGAs) by replacing look-up tables (LUTs) with coarser computational blocks and by simplifying interconnect patterns. Coarse-grained reconfigurable logic has mainly been proposed for speeding up loops of multimedia and digital signal applications in embedded systems [4]. CGRAs consist of processing elements (PEs) with word-level data bitwidths connected by a reconfigurable interconnect network. Their coarse granularity greatly reduces delay, area, power consumption and reconfiguration time relative to FPGAs, at the cost of flexibility. In sum, a CGRA can not only boost performance by adopting a coarse-grained array but can also be reconfigured to adapt to the different characteristics of a class of applications.
The commonly used solutions for digital signal applications are hardwired application specific integrated circuits (ASICs) and commercial digital signal processors (DSPs), but an ASIC has limited applicability and hence incurs huge non-recurring engineering (NRE) cost, while a DSP suffers from poor energy


efficiency because of sequential software execution. CGRAs can therefore provide an alternative solution in terms of both flexibility and efficiency.
Streaming, cryptography, video decoding and baseband processing are the relatively mature real-life applications implemented on reconfigurable architectures, and many commercial products have been released in recent years, such as ADRES [5] for wireless communication, Awashima [6] for cryptography, and MorphoSys [7] for audio-visual data codecs. These processors share the common feature of integer-based operation, so they cannot handle floating-point based applications effectively. FloRA [8, 9] added an extra FSM to support floating-point operation, but the improvement is not significant. For radar applications, which must meet precision requirements and deal with large volumes of data, an integer-based architecture is insufficient. Considering these circumstances, we propose an architecture that uses floating-point algorithm intellectual properties (IPs) as the basic processing elements to construct a reconfigurable processor that performs floating-point operations efficiently.
In radar applications, the most frequently used kernel subalgorithms are FFT, FIR, correlation and so on [10]. Based on this characteristic, we decompose a general radar task into a set of subalgorithms (RASP supports 17 frequently used subalgorithms) and take them as the minimal task nodes. This solution provides the desired ASIC-like performance while maintaining a certain degree of programmability. The proposed processor is validated through real application benchmarks and a prototype chip test, and shows a performance boost of 1.05x to 3.22x over a state-of-the-art TI DSP.
The organization of the paper is as follows. Section 2 presents the base hardware architecture and the workflow of RASP.
Section 3 explains the implementation of algorithms on RASP, and performance analysis is carried out in Section 4. Section 5 presents the characteristics of the chip. Finally, Section 6 concludes the paper with some future work.

2 Architecture

2.1 Overall hardware architecture
RASP is a digital signal processing architecture based on a hierarchical array of coarse-grained computing elements called Reconfigurable Processing Elements (RPEs). It consists of a bus interface, a Direct Memory Access (DMA) unit, a main controller (MC), a reconfigure controller (RC), a data memory and a processing element array, as shown in Fig. 1. The bus interface connects the general purpose processor and performs data exchange with external memory. In this design, all components are coupled through high-bandwidth AXI4 buses. The main controller manages task allocation and synchronization in the system, the DMA transports data between the external memory and the local data memory, the reconfigure controller chooses and organizes processing elements to complete a given task, the data memory provides data storage, and the RPE array performs the regular computing.


Fig. 1. Overall structure of RASP.

The RPE array is the core part that takes charge of the entire computing task in the system, as shown in Fig. 1. Each RPE is a cluster of algorithm IPs that can be configured to perform dedicated operations; different RPEs have different computational resources, as listed in Table I. To reach a high operating frequency, all the IPs used are pipelined and have an intrinsic fixed delay of 4 clock cycles. The multiplexers in an RPE select input operands according to the configure port. Operation codes (OPCODEs) define the operations in the RPE and the interconnections among them. The RPE interconnection network is fully meshed, but the routing mode is relatively static: the communication between RPEs is determined during the configuration phase and cannot be changed at runtime. Through the interconnection network, two or four RPEs can be combined to perform more complex operations, such as the butterfly unit used in FFT. The data memory stores source data and intermediate results; it has a capacity of 2 MB and is divided into 32 banks to meet bandwidth and parallelism requirements.

2.2 Main controller and processor workflow
In this section, the detailed workflow and configuration information delivery are introduced. RASP is booted via an external host processor, which can be a RISC CPU. An Application Programming Interface (API) function library is developed for the host processor; after the host processor executes the API,

Table I. Computation resources in each RPE
  RPE1~RPE4: 1 complex multiplier, 4 complex adders, 1 real multiplier
  RPE5: 1 fix-to-float convertor, 2 real multipliers, 2 floating-point dividers
  RPE6: 1 real adder, 1 complex multiplier


[Fig. 2 diagram: input application (C++) -> RISC (API compiling) -> main controller (parsing) -> reconfigurable controller (configuration) -> RPEs reorganize -> execute, with feedback to the main controller.]

[Fig. 3 diagram: (a) the 64-bit initial configuration information holds the context field, source_1/source_2 addresses and lengths, destination address and length, DDR burst times, and parameter registers (Para_config, Matrix Mul, Other_para). (b) the detailed context word: bit 0 complex_mul request, bit 1 real_mul request, bits 5:2 complex_add requests, bits 8:6 complex_mul opcode, bits 10:9 real_mul opcode, bits 13:11, 16:14, 19:17 and 22:20 the opcodes of complex adders 1-4.]
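The bit positions given for the Fig. 3(b) context word can be decoded with plain shifts and masks. The struct and function names below are ours, only the fields shown in the figure are modeled, and the remaining bits of the 64-bit word are omitted; treat this as an illustrative sketch, not the chip's decoder.

```c
#include <stdint.h>

/* Decoded view of the low bits of a Fig. 3(b) context word (sketch). */
struct rpe_ctx {
    unsigned cmul_req;     /* bit 0      : complex multiplier request */
    unsigned rmul_req;     /* bit 1      : real multiplier request    */
    unsigned cadd_req;     /* bits 5:2   : complex adder requests     */
    unsigned cmul_op;      /* bits 8:6   : complex multiplier opcode  */
    unsigned rmul_op;      /* bits 10:9  : real multiplier opcode     */
    unsigned cadd_op[4];   /* bits 13:11, 16:14, 19:17, 22:20         */
};

struct rpe_ctx decode_ctx(uint32_t w)
{
    struct rpe_ctx c;
    c.cmul_req   = (w >> 0)  & 0x1;
    c.rmul_req   = (w >> 1)  & 0x1;
    c.cadd_req   = (w >> 2)  & 0xF;
    c.cmul_op    = (w >> 6)  & 0x7;
    c.rmul_op    = (w >> 9)  & 0x3;
    c.cadd_op[0] = (w >> 11) & 0x7;
    c.cadd_op[1] = (w >> 14) & 0x7;
    c.cadd_op[2] = (w >> 17) & 0x7;
    c.cadd_op[3] = (w >> 20) & 0x7;
    return c;
}
```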

Fig. 3. Configuration context structure: (a) initial configuration information; (b) detailed context.

configuration information will be generated. The entire configuration process is depicted in Fig. 2. The input application is described in a high-level language. First, the code is programmed manually with the help of API function calls, then compiled by the RISC, generating a bit stream written to external memory; this bit stream is the initial configuration information sent to RASP. RASP can receive the initial configuration information directly from the host processor or fetch it from external memory on its own initiative; the MC accepts and translates it. The translation results determine which subalgorithm is to be executed, what size it is, and so on. The RC acts as a secondary decoder in the system. It contains subalgorithm controllers, each of which drives the configure ports in the RPEs directly when its particular subalgorithm is chosen, while the interconnection between the RPEs is organized by the reconfigurable controller simultaneously. An example of the configuration context transition of RPE1 is shown in Fig. 3: Fig. 3(a) shows the configuration information after API compiling, stored in external memory, and Fig. 3(b) is the detailed configuration context structure,


[Fig. 4 diagram: (a) the radix-2 decimation-in-frequency butterfly computes g(n) = x(n) + x(n + N/2) and h(n) = (x(n) - x(n + N/2)) * W_N^r; (b) its detailed implementation in an RPE.]

Fig. 4. (a) Butterfly unit in radix-2 decimation-in-frequency FFT; (b) detailed implementation in an RPE.

they are sent to the RPE configure ports by the reconfigurable controller, and these contexts determine the RPE function directly. The connection network between RPEs is also managed by the configuration controller. Emerging configuration technologies such as time-multiplexing and runtime parallelism [11] can reduce configuration time, but they significantly increase the configuration memory requirements. Our proposed multistage configuration procedure greatly reduces the usage of configuration memory, at the cost of some hardware overhead included in the reconfigurable controller.

3 Implementation of Algorithm

In this section, the execution of a single subalgorithm and of combinational algorithms is discussed. The most commonly used function in radar digital signal processing is the Fourier transform, so we first interpret how this computing kernel is mapped onto the RPEs. We then give a brief description of complex combinational algorithm implementation in RASP, taking a frequently used real-life radar application as the example.
3.1 Single algorithm implementation
The fast Fourier transform (FFT) forms the basis of many radar signal processing algorithms. The basic computational element in the FFT is the butterfly unit shown in Fig. 4(a); this structure is referred to as a radix-2 butterfly. Higher-radix butterflies provide some computational savings; RASP provides flexible radix-2/4/8 butterflies, and mixed-radix FFT is also supported. During execution, the internal structure of the RPEs is reorganized according to the configuration context. These contexts determine which IPs are active and how they are connected. The input data of an RPE are allocated by an extra address generator module. A sample IP connection and data distribution in RPE1 when constituting a radix-2 structure is shown in Fig. 4(b). During the implementation of the FFT, the required data flows exceed the computational resources in the chip, so these flows are scheduled by time sharing. For example, it takes 8 cycles to complete a radix-8 butterfly unit under the control of the FFT finite-state machine (FSM).



Fig. 5. Ping-pong operation in the system.

3.2 Ping-pong operation for large-size processing
As mentioned above, RASP has a local memory of 2 MB and can store 256K floating-point numbers in total. The storage capacity limits the calculation size of a single operation, e.g. 64K points for matrix addition and 128K points for FFT. When these sizes are exceeded, RASP provides an alternative solution, called ping-pong operation, to accomplish the required calculation while reducing the waiting period for data transportation. The RAM is divided into 32 banks; 16 of them form a group, and the two groups act as a double buffer: one group is used for calculating while the other exchanges data with external memory, as shown schematically in Fig. 5. In most cases, data transportation between external memory and local memory occupies a large proportion of the whole execution process. When the data transportation cycles are comparable to the processing cycles, the vast majority of the data transportation can be hidden through ping-pong operation, thus improving resource utilization.
3.3 Combinational algorithm management
The strength of RASP originates from the combination of subalgorithms with a high-efficiency management mechanism and a simple compilation method. A complete radar task usually contains more than one subalgorithm; for instance, fast-convolution digital pulse compression includes FFT, point-by-point multiplication and IFFT, while moving target indication (MTI) includes matrix transposition, dot production and FIR. As mentioned above, the most computation-intensive and commonly used subalgorithms in radar systems are covered in RASP as subtasks. We take pulse compression as an example to show how RASP implements a complex task. Pulse compression is a signal processing technique commonly used by radar to increase the range resolution as well as the signal-to-noise ratio.
Compared with time-domain convolution, a more computationally efficient approach is frequency-domain complex multiplication; the procedure is shown in Fig. 6. As the diagram shows, the task is partitioned into a sequential process that contains three subalgorithms, which


[Fig. 6 diagram: the input signal (complex envelope) passes through an FFT, is multiplied by the FFT (with zero fill) of the reference waveform replica, and an inverse FFT yields the compressed pulse.]
Fig. 6. Procedure of digital pulse compression.

are FFT, multiply and IFFT. The three subtasks are called by the user compiler; the framework of the function construction is shown below. The code in the block is a framework of the application preparation process, and its three segments represent the three subalgorithms respectively. The subalgorithm functions are predefined in the API library. Their parameters carry basic information such as the algorithm type, processing points, source data addresses and destination data addresses, which, together with the function library, provides great convenience to programmers. Unlike DSP instructions that deal with common mathematical operations such as multiply-accumulates (MACs), one piece of RASP configuration corresponds to one function that takes charge of a macro-task. Even though DSP instruction sets are optimized for digital signal processing, the coarser configuration in RASP is supported by tailor-made hardware acceleration. How the FFT and matrix inversion modules exploit the reconfigurable computing resources can be seen in [12] and [13]; the other kernels go through similar optimization. By constructing functions, a complex task can be accomplished with high efficiency thanks to easy task decomposition and modularization.
Real-life radar task: digital pulse compression

    int DoDpcByRasp(raspid, resptype, fftlen, srcaddr, dstaddr, tmpaddr,
                    lfmcoefaddr, fftcoefaddr)  /* parameters used in DPC */
    {
        int xPhyRaspDesc;
        int xVirRaspDesc;
        xVirRaspDesc = xPhyRaspDesc - 0xa0000000;
        xPhyRaspDesc = xPhyRaspDesc + insindex*14*4*3;
        /* FFT function call */
        setFFTParamEx((u32*)(xVirRaspDesc+0*14*4), fftlen, 0, 0,
            VaToPaH(srcaddr), VaToPaL(srcaddr),
            VaToPaH(fftcoefaddr), VaToPaL(fftcoefaddr),
            VaToPaH(tmpaddr), VaToPaL(tmpaddr), 0, 1, 0, 0);
        /* Matrix dot-multiply function call */
        setMatrixDotMulEx((u32*)(xVirRaspDesc+1*14*4), fftlen,
            VaToPaH(tmpaddr), VaToPaL(tmpaddr),
            VaToPaH(lfmcoefaddr), VaToPaL(lfmcoefaddr),
            VaToPaH(tmpaddr+fftlen*8), VaToPaL(tmpaddr+fftlen*8), 0, 1, 0);
        /* IFFT function call */
        setFFTParamEx((u32*)(xVirRaspDesc+2*14*4), fftlen, 0, 1,
            VaToPaH(tmpaddr+fftlen*8), VaToPaL(tmpaddr+fftlen*8),
            VaToPaH(fftcoefaddr), VaToPaL(fftcoefaddr),
            VaToPaH(dstaddr), VaToPaL(dstaddr), resptype, 1, 1, 0);
    }


Fig. 7. The percentage of configuration, data transportation and calculation in the whole execution process; the numbers on the vertical bars are the detailed cycle counts. The FFT size ranges from 32 to 1M points.

4 Experiment and Comparison

In this section, we assess the performance of our processor. First we measure the configuration and computation efficiency of our method. Then some classic subalgorithms are implemented on RASP, the results are analyzed, and a comparison with a state-of-the-art DSP is made.
4.1 Efficiency of configuration and calculation
The total execution cycles of a task comprise three elements: configuration, calculation and data transportation. We choose the FFT as an example to analyze the efficiency of configuration and calculation; the cycles of the different phases are shown in Fig. 7. The FFT size under analysis varies from 32 points to 1M points. Since each piece of configuration information stored in external memory has a fixed length, the configuration cycle count stays at a relatively stable value of about 1200, as shown in Fig. 7; it is independent of the operation size and negligible for large-point operations. When the total number of processing points is relatively small, data transportation dominates the whole process, while for large point counts its proportion can be reduced significantly by adopting ping-pong operation. The 256K-point FFT, whose data fill the entire local memory, has the highest computing efficiency. The measurement results show a real-time FFT throughput of 97.55 Mpps and a data transmission bandwidth of 1.89 GB/s at 500 MHz. The utilization rates of the resources in RASP during the calculation phase


and the entire execution phase can be expressed as:

    U_Calculation = Computation amount / (Computation capacity x Calculation cycles)
    U_Execution = Computation amount / (Computation capacity x Execution cycles)    (1)

where the computation amount of a radix-2 N-point FFT is (N/2)*log2(N) complex multiplications and N*log2(N) complex additions. Substituting the detailed experimental values for N = 256K, with calculation cycles = 787212 and execution cycles = 1343733, the results are 74.9% and 43.9% respectively. The utilization rate is the average ratio of the number of active RPEs to the total number of RPEs over the corresponding cycles.
4.2 Results comparison
We chose the TI DSP TMS320C6672 [14] for comparison, as it has computing resources similar to RASP's. The TMS320C6672 is a multicore fixed- and floating-point DSP optimized for radar and electronic warfare applications; it consists of two C66x DSP cores, and its total number of floating-point multipliers equals RASP's. The main device characteristics are compared in Table II. The operating frequency of RASP reaches 500 MHz in the first-edition prototype chip, and 1 GHz in the second edition after updating the memory IP and optimizing the code. We selected three typical kernels that are available in commercial libraries, with benchmarks covering classical sizes. From Table III [15], we can see that for FIR and matrix multiply, RASP shows an improvement varying from 5% to 183% over the TMS320C6672; for the FFT, which went through deep optimization in our design phase, RASP achieves a higher speedup ratio (up to 3.22x) as the number of processing points increases.

5 Development Environment and Prototype Chip

Our work was developed on the Synopsys VCS platform, described in the Verilog hardware description language and implemented in TSMC 40nm CMOS technology. RASP consumes a chip area of 19.2 mm2, and the clock frequency of the chip is 500 MHz. The hardware layout is shown in Fig. 8; the SoC contains two DSP cores and one RASP core, with RASP acting as a coprocessor in the entire SoC.

6 Conclusion

Table II. Comparison between RASP and TMS320C6672

                  RASP          TMS320C6672
  Frequency       0.5 GHz       1/1.25/1.5 GHz
  Feature size    40 nm         40 nm
  Local memory    2 MB          1152 KB
  GFLOPS          70 @ 1 GHz    [email protected]

We have introduced a floating-point-unit based reconfigurable architecture tailored for radar signal processing. Taking subalgorithms as the minimum task nodes can

Table III. Performance comparison between RASP and TI TMS320C6672 on typical algorithms

  FIR (cycles)
  Point(1)       RASP      TI DSP    Speedup
  16/16          166       194       1.17
  1024/16        4198      10274     2.45
  1024/64        17446     22562     1.29
  1024/128       36902     28946     1.05
  65536/128      2101397   2490402   1.19
  131072/128     4198697   4980770   1.19

  Matrix Multiply (cycles)
  Point(1)       RASP      TI DSP    Speedup
  16/16/16       1058      2994      2.83
  64/64/64       65570     119442    1.82
  96/96/96       221218    378834    1.71
  96/1024/96     2369088   3595282   1.52
  60/3072/127    5930880   8847078   1.49

  FFT (cycles)
  Point(1)       RASP      TI DSP    Speedup
  128            588       801       1.36
  256            780       1457      1.87
  512            1164      2582      2.21
  1024           2572      5470      2.12
  2048           4620      12635     2.73
  4096           8176      26332     3.22

  (1) The Point column gives the algorithm size: in the FIR segment, 1024/64 means a 64-tap FIR on a 1024-point input, and so forth; in matrix multiply, the three values are the row count of matrix 1, the column count of matrix 1, and the column count of matrix 2, respectively; in FFT, the value is the number of processing points.
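The speedup column of Table III is simply the ratio of TI DSP cycles to RASP cycles, e.g. 801/588 ~ 1.36 for the 128-point FFT. A trivial helper (name ours) that reproduces the column:

```c
/* Speedup of RASP over the TI DSP: ratio of cycle counts. */
double speedup(long rasp_cycles, long ti_cycles)
{
    return (double)ti_cycles / (double)rasp_cycles;
}
```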

simplify the configuration process and improve configuration efficiency; the proposed multistage configuration procedure improves computation efficiency and reduces programming complexity. Ping-pong operation applied to large-size processing can hide the vast majority of the data transportation cycles. The configuration and computing efficiency were analyzed, and we compared the chip implementation of RASP with a state-of-the-art TI DSP running the same tasks.


Fig. 8. Layout of the entire SoC.

As future work, a larger RPE array with customizable capabilities should be used in our system, and the corresponding configuration method needs to be improved for higher flexibility and extensibility.

7 Acknowledgement

This work was supported by the National Natural Science Foundation of China under Grant No. 61176024; the Research Fund for the Doctoral Program of Higher Education of China under Grant No. 20120091110029; the Project on the Integration of Industry, Education and Research of Jiangsu Province under Grant BY2015069-05; the Priority Academic Program Development of Jiangsu Higher Education Institutions (PAPD); and the Fundamental Research Funds for the Central Universities.
