2013 12th IEEE International Conference on Trust, Security and Privacy in Computing and Communications
Automatic Loop-based Pipeline Optimization on Reconfigurable Platform
Qi Guo, Chao Wang, Xiaojing Feng
Suzhou Institute of Advanced Study, University of Science and Technology of China, Suzhou, Jiangsu, China
{gqustc, saintwc, bangyan}@mail.ustc.edu.cn

Xuehai Zhou
School of Computer Science, University of Science and Technology of China, Hefei, Anhui, China
[email protected]
Abstract—Pipelining is an effective technique for improving the performance of a loop by overlapping the execution of several iterations. In this paper we consider the pipeline scheduling of loops on reconfigurable platforms. A loop is abstracted as a weighted data flow graph (WDFG), where nodes represent tasks and edges stand for inter-task dependencies; the weights of nodes and edges indicate task execution times and communication overheads respectively. Based on this abstraction, we design a novel and flexible technique for scheduling loops on reconfigurable platforms using loop pipelining, which yields good parallelism for the loops. To evaluate the proposed technique, we have carried out experiments both in software simulation and in hardware on an FPGA-based reconfigurable platform. The experimental results show that our approach achieves satisfactory performance.

Keywords—loop; pipelining; optimization; reconfiguration

978-0-7695-5022-0/13 $26.00 © 2013 IEEE  DOI 10.1109/TrustCom.2013.112

I. INTRODUCTION

With the increasing interest in heterogeneous Multi-Processor System-on-Chip (MPSoC) architectures, making full use of the available on-chip cores is becoming ever more important. The problem is more pronounced on reconfigurable platforms such as FPGAs, where each function unit is much simpler and faster than a general purpose processor (GPP). Recent research [1] indicates that loops are usually the most time-critical parts of an application, so the parallelism embedded in the repetitive pattern of a loop needs to be exploited. Pipelining is the most widely used way of mapping loops, especially nested loops [1, 2], onto a hardware platform. With pipelining, the operations of a loop are mapped to function units in different pipeline stages; each function unit performs only the operation mapped onto it rather than an entire loop iteration. Furthermore, on a reconfigurable platform the function units can be dynamically configured according to the application, so the pipeline can be restructured to achieve a higher degree of parallelism. This suggests that traditional pipeline optimizations at the instruction level can be lifted to the task level through reconfiguration.

In this paper, we first analyse the key constraints in choosing the optimization strategy, which directly influences the performance of the pipeline. We then design algorithms that automatically choose the optimization strategy based on profiling of the application. Before implementing the algorithms, two prerequisites must be settled. First, what granularity should be chosen for the loop body, that is, at what abstraction level are the instructions in the loop body defined? Considering the features of a reconfigurable hardware platform, where each function unit performs a fixed job, we define the abstracted instructions in loops as tasks: a task is an operation that can be completed by one function unit, such as an IP core. Second, how is a task expressed on a reconfigurable MPSoC platform? We abstract tasks with the pattern OP:{write set},{read set}, where OP indicates the type of operation while the {write set} and {read set} indicate the output and input buffers respectively; on the reconfigurable hardware platform these buffers usually contain several registers that hold the data transferred to them. For illustration, we write this pattern as Task_op{{D1,D2,…,Dn},{S1,S2,…,Sn}}, where D1,D2,…,Dn are the destination registers of the write set and S1,S2,…,Sn are the source registers of the read set. A data dependency [3] may occur when a register appears in the read/write sets of different tasks. As shown in Fig. 1(a), register a exists in both the write set of task_1 and the read set of task_3, so a data dependency arises between task_1 and task_3: task_3 can start execution only after task_1 finishes. The same holds for task_3 with task_2 (register c), task_5 with task_3 (registers e and f), and task_5 with task_4 (registers g and h).

Data dependencies introduce another problem: how to model them, in other words, how to formalize the loops. In our approach, loops are modeled as a weighted data flow graph (WDFG), which extends the data flow graph (DFG) [4]. As illustrated in Fig. 1(b), the solid arrows indicate the data dependencies between tasks and the dotted arrows represent the loop iterations. The edges are labeled with the estimated communication overhead between tasks, which depends on the size of the tasks' output and input buffers. Each node in the WDFG is marked with a pair (index, weight), where index is the position of the task in the loop body and weight is its execution time. Take task_3 as an example: it is the third task in the loop body, so its index is 3, while its execution time is two time units, hence the pair (3,2) marks the corresponding node. Furthermore, there are two registers (e and f) in the write set of task_3 that are also in the read set of task_5. If transferring the data of one register takes one time slot, the estimated communication cost between task_3 and task_5 is two time slots, which labels the solid arrow between task_3 and task_5. Fig. 1(a) lists the loop:

    Loop{
      1:Task_1{{a},{b}};
      2:Task_2{{c},{d}};
      3:Task_3{{e,f},{a,c}};
      4:Task_4{{g,h},{i}};
      5:Task_5{{j},{e,f,g,h}};
    }
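To make the dependency rule concrete, here is a minimal Python sketch (the function name and tuple layout are ours, not part of the paper) that derives the WDFG edges of Fig. 1(b) from the read/write sets of Fig. 1(a):

```python
# Sketch: derive WDFG dependency edges from task read/write sets.
def dependencies(tasks):
    """tasks: list of (index, write_set, read_set) in loop-body order.
    Returns edges (i, j, cost): task j reads registers that task i wrote;
    cost = number of shared registers (one time slot per register)."""
    edges = []
    for a in range(len(tasks)):
        idx_a, write_a, _ = tasks[a]
        for b in range(a + 1, len(tasks)):
            idx_b, _, read_b = tasks[b]
            shared = write_a & read_b
            if shared:
                edges.append((idx_a, idx_b, len(shared)))
    return edges

# The loop of Fig. 1(a):
loop = [
    (1, {"a"}, {"b"}),
    (2, {"c"}, {"d"}),
    (3, {"e", "f"}, {"a", "c"}),
    (4, {"g", "h"}, {"i"}),
    (5, {"j"}, {"e", "f", "g", "h"}),
]
print(dependencies(loop))
# [(1, 3, 1), (2, 3, 1), (3, 5, 2), (4, 5, 2)] — the edge labels of Fig. 1(b)
```

The edge costs (one time slot per shared register) match the labels 1, 1, 2, 2 on the solid arrows of Fig. 1(b).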
Figure 1. Weighted Data Flow Graph (WDFG): (a) the loop, (b) its WDFG

Based on the WDFG model, we propose algorithms that perform this profiling: the WDFG acts as the input of the algorithms, while the optimized pipeline strategy acts as the output. The contributions of this paper are as follows. We propose a series of algorithms that automatically compute the optimized strategy for pipelining loops onto reconfigurable hardware and select the most suitable scheme to implement. The algorithms are based on profiling information from the WDFG of tasks, such as execution times and communication overheads. We apply the proposed approach to test cases from the EEMBC benchmark [20], pipelining their loop parts, and evaluate its performance in software simulation. For hardware evaluation, we use a case study of a classical application, JPEG encoding. The processing procedure of JPEG encoding is a loop containing the operations of color conversion (CC), discrete cosine transformation (DCT), quantization and Huffman coding. We have designed IP cores for each operation and implemented them on an FPGA prototype system.

The rest of this paper is organized as follows. Section 2 introduces the problem of loop pipelining and the motivation of our work. In section 3, we present the framework and the kernel algorithms. We illustrate experimental results in section 4, including software simulation and hardware evaluation on FPGA. Finally, related work and conclusions are presented in sections 5 and 6 respectively.

Figure 2. Optimized schemes of the pipeline targeting the reconfigurable platform
II. PROBLEM AND MOTIVATION
A. Pipelining of the loop
Since loops are the most time-critical parts of many applications, and resources are limited on an MPSoC system, loops are the best candidates for MPSoC-driven acceleration [1]. There are two basic methods that help schedule loops: unrolling [5] and pipelining. Loop unrolling allows operations from different iterations to be scheduled together by eliminating branches. But if we simply replicate the operations when unrolling, reuse of the same registers can prevent us from effectively scheduling the loop; we therefore want different registers for different iterations, which increases the required number of registers. This is inadvisable on an MPSoC system, where resources are limited. Pipelining, on the other hand, is the most popular method of accelerating loops by overlapping unrelated operations. However, some restrictions arise at the task level. The pipeline may stall because of the mismatched execution times of tasks and the communication overhead between tasks, degrading its performance; this is the bottleneck problem of the pipeline, and it can be observed in a time-space diagram. As Fig. 2(a) shows, in the time-space diagram of the original pipeline the horizontal axis represents time, while the vertical axis represents space and is marked with the indexes of the function units; the index indicates which task type executes on which function unit. We can see that task 4 has the longest execution time in the loop, while the other tasks may stall waiting for task 4 to finish execution; the bottleneck of this pipeline thus occurs at task 4. Similarly, task 3 is the bottleneck in Fig. 2(b), since it has the longest execution time there.

To avoid this problem, traditional instruction-level optimization schemes, such as splitting and duplexing of function units, can be applied at the task level, modified to target the reconfigurable platform. The most important characteristic of a reconfigurable platform is that the function units, such as IP cores, can be dynamically configured, which lets us restructure the pipeline. Hence the frequently-used methods for eliminating pipeline bottlenecks (duplexing a function unit, splitting a function unit) can be implemented by dynamic reconfiguration. Fig. 2 illustrates the two schemes, based on analysing the WDFG of loops. There are four columns in the figure: the first is the original WDFG abstracted directly from the loop source code; the second is the optimal WDFG produced from the original WDFG by our profiling; the third and fourth are the pipeline time-space diagrams obtained from the original and the optimal WDFG respectively.

Before elaborating the details of the two schemes, we define some performance indicators to evaluate the strategies. Assume that n is the number of iterations of the loop and each iteration contains k tasks whose execution times are t1, t2, …, tk respectively. The term max(t1, t2, …, tk) is the longest task execution time, by which we locate the bottleneck of the pipeline. Considering the communication cost on the MPSoC platform, we assume the average communication cost is tc. Based on these assumptions, the performance indicators of the pipeline, throughput rate (Tp), speedup ratio (S) and efficiency (E), can be formulated as follows.

Throughput rate (Tp):
$$T_p = \frac{n}{t_1 + t_2 + t_3 + \cdots + t_k + (n-1)(\max(t_1, t_2, \ldots, t_k) + t_c)}$$

Speedup ratio (S):

$$S = \frac{n(t_1 + t_2 + t_3 + \cdots + t_k + (k-1)t_c)}{t_1 + t_2 + t_3 + \cdots + t_k + (n-1)(\max(t_1, t_2, \ldots, t_k) + t_c)}$$

Efficiency (E):

$$E = \frac{t_1 + t_2 + t_3 + \cdots + t_k}{k(\max(t_1, t_2, \ldots, t_k) + t_c)}$$
B. Duplexing of function unit
Duplexing a function unit is a well-known method of eliminating pipeline bottlenecks. In our approach, a function unit is an IP core that can complete specific tasks in the loop. Inspired by the duplexing method, once we find the task in which the bottleneck exists, we can dynamically add a corresponding IP core through reconfiguration. As Fig. 2(a) shows, the bottleneck of the pipeline is task 4, since it has the longest execution time; another function unit 4', with the same functionality as function unit 4, is therefore added. Since more than one function unit can then execute the bottleneck task, max(t1, t2, …, tk) effectively decreases, and the throughput rate, speedup ratio and efficiency of the pipeline all improve. The obvious shortcoming of this method is its extra resource consumption; it is not preferable when hardware resources are limited, although it brings a good performance profit.

C. Splitting a function unit
When hardware resources are quite limited, another widely applied way of eliminating pipeline bottlenecks is to split the function unit. On our platform, where IP cores act as function units, splitting an IP core means dividing its function into several parts, implementing each part, and adding connections and buffers to link them. As Fig. 2(b) depicts, task 3 is the bottleneck of the pipeline, and function unit 3 can be divided into two parts, 3 and 3', each with a shorter execution time than the original function unit 3. As with duplexing, max(t1, t2, …, tk) decreases, so the throughput rate, speedup ratio and efficiency improve. However, the execution time of a single loop iteration increases, since communication overhead between 3 and 3' is newly introduced.

In summary, the speedup ratio, throughput rate and efficiency increase under both the duplexing and the splitting strategies, but duplexing requires extra hardware resources while splitting adds communication cost.

III. FRAMEWORK AND ALGORITHMS

In this section we present the framework and the kernel algorithms. The algorithms operate on the WDFG model, and applying them yields the pipeline strategy.

A. Framework
In our approach, we analyse the loop source code and map it to a WDFG, in which each node stands for an individual task. We then connect these nodes by detecting the data dependencies between tasks, comparing their write sets and read sets as described in section I. After that, the estimated execution time of each task is marked on the corresponding node. Finally, we check the endpoints of the edges and label them with the communication cost inferred from the size of the endpoint's write set. At this point a complete WDFG is built; it reflects the essential factors of the loop, including the number, correlation and execution times of its tasks. Given the WDFG of a loop, the core of our approach is its profiling. As Fig. 3 presents, the WDFG is the input of the profiling and the pipeline strategy is the output. In the profiling stage we have designed three sequential algorithms, each one's output serving as the input of the next. At the end of the sequence, two optimization strategies are the alternatives: duplexing a function unit and splitting a function unit. Both can be realized by dynamic reconfiguration.

B. Algorithms
As stated above, profiling has three steps, each involving a kernel algorithm. Algorithm 1 shows how to find the parallelizable tasks in the WDFG and arrange them in topological order. It extends standard topological sorting: instead of deleting a node as soon as it has no predecessors, we first find all nodes without predecessors and then delete them and their edges together. In algorithm 1, set R is the container of sorted tasks and set P contains the tasks that can execute in parallel. As in standard topological sorting, we insert a node into P if it has no predecessor; when all nodes left in the WDFG have predecessors, we delete the nodes of P and their edges from the WDFG and append P to the result set R. Assuming the WDFG contains N nodes, the complexity of algorithm 1 is O(N^2). This first part of the profiling can be implemented in software.
Algorithm 1. Extended topological sorting

Input: WDFG; Output: resultSet R;
1:  resultSet R;
2:  while(there is a node in WDFG)
3:    paraSet P;
4:    while(node a has no predecessor)
5:      P.insert(a);
6:      a := another node;
7:    endwhile
8:    delete the nodes in P and their edges from WDFG;
9:    R.append(P);
10: endwhile
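A runnable rendering of Algorithm 1, under a data layout of our own choosing (successor sets in a dict) and assuming an acyclic WDFG (the dotted iteration edges are ignored, as in Algorithm 1), might look like:

```python
# Sketch of Algorithm 1: level-by-level extended topological sorting.
def extended_topo_sort(nodes, edges):
    """Group nodes into levels of tasks with no remaining predecessors.
    nodes: task indexes; edges: dict node -> set of successor nodes.
    Returns R, a list of parallel sets, in topological order."""
    preds = {v: set() for v in nodes}
    for u, vs in edges.items():
        for v in vs:
            preds[v].add(u)
    remaining = set(nodes)
    R = []
    while remaining:
        # collect ALL nodes whose predecessors are already removed, ...
        P = {v for v in remaining if not (preds[v] & remaining)}
        R.append(P)
        remaining -= P   # ... then delete the whole level and its edges at once
    return R

# WDFG of Fig. 1: edges 1->3, 2->3, 3->5, 4->5
print(extended_topo_sort([1, 2, 3, 4, 5], {1: {3}, 2: {3}, 3: {5}, 4: {5}}))
# [{1, 2, 4}, {3}, {5}]
```

Deleting a whole level at a time is exactly what distinguishes this from standard topological sorting: each set in R contains tasks that may execute in parallel.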
The second step of profiling finds the maximum weight. As algorithm 2 presents, its input is the result set R from the previous step. First, within each parallel set P we find the node with the maximum weight and mark P with this weight. Then we search all parallel sets in the result set for the one with the maximum weight; we call it the bottleneck set Q, as it contains the node where the bottleneck may occur. Since every node in the WDFG is traversed once, the time complexity is O(N), where N is the number of nodes in the WDFG.

Algorithm 2. Find max
Input: resultSet R; Output: bottleneckSet Q;
1: for each P in R
2:   P.weight = max{a.weight | a ∈ P};
3: endfor
4: bottleneckSet Q = the P with max{P.weight | P ∈ R};
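Algorithm 2 reduces to a nested max; a small sketch with our own names (the weight table restates the (index, weight) pairs of Fig. 1):

```python
# Sketch of Algorithm 2: weight maps a task index to its execution time.
def bottleneck_set(R, weight):
    """Return the parallel set whose heaviest task is heaviest overall."""
    return max(R, key=lambda P: max(weight[a] for a in P))

# (index, weight) pairs of Fig. 1: tasks 3 and 5 both take two time units.
weight = {1: 1, 2: 1, 3: 2, 4: 1, 5: 2}
Q = bottleneck_set([{1, 2, 4}, {3}, {5}], weight)
print(Q)  # {3} — Python's max keeps the first maximal set it meets
```

When several sets tie on weight, this sketch returns the earliest one in topological order; the paper does not specify a tie-breaking rule, so that choice is ours.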
The last step of the profiling chooses the pipeline strategy based on the bottleneck set Q found in the previous two steps; algorithm 3 shows the process. First, if there is only one node in Q and there are sufficient hardware resources, we may duplicate the node in Q to solve the bottleneck problem. Otherwise, if there is more than one node in set Q, we find the one with the maximum weight; if that node can be split, we split it into several smaller nodes and arrange them in sequential order.

Figure 3. Framework of our approach
Algorithm 3. Choose strategy

Input: bottleneckSet Q; Output: pipeline strategy;
1: if(Q.nodeCount == 1 && there are sufficient hardware resources)
2:   duplicate the node in Q;
3: endif
4: else
5:   find the node a with the max weight in Q;
6:   if(a can be split)
7:     split a into smaller nodes;
8:   endif
9: endelse

IV. DATA ANALYSIS AND EXPERIMENT

We demonstrate our framework and algorithms on both a software simulation platform and a hardware prototype system. The software platform is a PC with an Intel Core Duo CPU at 2.93 GHz, while the hardware prototype system is based on a Xilinx Virtex-5 development board with an XC5VLX110T FPGA. We have chosen seven test cases from the EEMBC benchmark to evaluate our approach: FFT, FIR, IIR, iFFT, Road Speed Calculation, Table Lookup and Interpolation, and JPEG. The first six benchmarks are used for software simulation and the last one for hardware evaluation.

A. Software simulation
In software simulation, our experiments begin with analysing the structure of the benchmarks. Taking FFT as an example, we first obtain the execution flow chart of the FFT benchmark, shown in Fig. 4(a), and then map its loop part to a WDFG as described in section I; Fig. 4(b) depicts the corresponding WDFG. The nodes of the WDFG are marked with indexes and weights, that is, the positions of the operations in the loop and their execution times. We evaluate the execution time of every procedure by inserting timestamps into the benchmarks, using CPU cycles as the unit for more accurate results. Furthermore, because data is transferred between operations via shared variables or memory in software simulation, the communication cost between procedures is negligible; that is why the edges of this WDFG are not labeled with communication costs as introduced in section I. We have thereby obtained the WDFG that serves as the input of our framework. The next step is the profiling of the WDFG, in other words, applying the three kernel algorithms. Because there are no parallel nodes in this WDFG, the output of algorithm 1 is the set R = {{1},{2},{3},{4}}, where the numbers are the indexes of the nodes. According to algorithm 2, set {2} is selected as the set with the maximum weight. Considering the second operation of FFT, the FFT computation, we observe that it contains an inner loop. As algorithm 3 states, if the bottleneck set has only one node, we may duplicate or split it. Since this procedure contains a loop and is highly resource-consuming, duplexing would cost many more resources. By contrast, splitting is much easier: we simply divide the iterations of the inner loop into two parts; for example, if the loop has 100 iterations, the first and second parts each contain 50 iterations, obtained by changing the starting and ending positions of the loop control variable. As Fig. 4(c) depicts, we split node 2 into two parts, nodes 2 and 2', each with a smaller execution time, about half of the original. Consequently, we have obtained the optimized strategy: splitting the function unit.
Figure 4. FFT process flow: (a) flow chart, (b) WDFG with nodes (1,27661), (2,195823), (3,92498), (4,57695), (c) optimized WDFG with node 2 split into (2,101567) and (2',110045)
Next, for the simulation, we implement a multi-threaded program that simulates the pipeline execution, where each thread represents one type of operation and the threads communicate via shared memory to reduce the impact of data transfer overheads. In more detail, one thread stands for one pipeline stage. At the beginning of the pipeline, the thread assigned to the first stage starts execution while the others stay idle. When the first thread finishes, it sends a wake-up signal to the second thread and puts its data into the shared memory, so that the second thread, which represents the second pipeline stage, fetches the data from shared memory and starts execution; at the same moment, the first thread executes the first operation of the next iteration. As this continues, the
Figure 5. Results of software simulation: (a) speedup ratio, (b) efficiency, (c) throughput rate, (d) strategies. The chosen strategies are: FFT — Split; iFFT — Split; RSPEED — Split; TBLOOK — Duplex; IIR — Duplex; FIR — Duplex.
pipeline fills completely and the structure is built. We deal with the other test cases in a similar way. By this means, each test case executes 250 to 1000 times, and the average speedup, efficiency and throughput rate of the pipeline are then evaluated. Fig. 5 depicts the primary results. Performance indicators of both the original pipeline and the optimized one are shown for each test case, and the strategies obtained by the automatic profiling are listed in Fig. 5(d): splitting is applied to the FFT, iFFT and RSPEED test cases, and duplexing to TBLOOK, IIR and FIR. In Fig. 5(a) and Fig. 5(c) we can see that the speedup ratio and throughput rate increase by about 2 times after optimization, and in Fig. 5(b) the efficiency increases by 1.1-1.5 times. These satisfactory results show that the automatic profiling chooses an appropriate and efficient optimization strategy and achieves good performance.
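The stage-per-thread simulation described above can be sketched with standard Python threads and queues standing in for the shared memory and wake-up signals (the stage functions and all names here are ours, not the benchmark operations):

```python
# Miniature pipeline simulation: one thread per stage, queues as shared memory.
import threading
import queue

def run_pipeline(stages, items):
    """stages: list of per-stage functions; items: inputs to the first stage."""
    qs = [queue.Queue() for _ in range(len(stages) + 1)]

    def worker(f, qin, qout):
        while True:
            x = qin.get()
            if x is None:        # poison pill: shut this stage down
                qout.put(None)
                return
            qout.put(f(x))       # "wake up" the next stage with the result

    threads = [threading.Thread(target=worker, args=(f, qs[i], qs[i + 1]))
               for i, f in enumerate(stages)]
    for t in threads:
        t.start()
    for x in items:              # feed iterations into the first stage
        qs[0].put(x)
    qs[0].put(None)
    out = []
    while (y := qs[-1].get()) is not None:
        out.append(y)
    for t in threads:
        t.join()
    return out

# Two toy stages standing in for two loop-body tasks:
print(run_pipeline([lambda x: x + 1, lambda x: x * 2], range(3)))
# [2, 4, 6]
```

While stage i works on iteration j, stage i-1 already works on iteration j+1, which is the overlap the time-space diagrams of Fig. 2 illustrate.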
B. JPEG case study on FPGA
The software simulation verified that our profiling procedure and algorithms are effective. Turning to a real hardware platform, we first implement a prototype system, derived from [6], on a state-of-the-art FPGA board. Fig. 6 depicts the model diagram of our system; its components are as follows. Several heterogeneous IP cores are integrated as function units. All of them are packed in Xilinx Fast Simplex Link (FSL) master-slave manner and connect with the MicroBlaze processor through FSL links.
Figure 6. Pipeline hardware architecture (EDK 12.4, Virtex-5, part XC5VLX110T)
A MicroBlaze processor acts as the pipeline controller; it handles task distribution and data transmission between the IP cores. Some peripherals are connected through the Processor Local Bus (PLB), including the interrupt controller, timer controller, RS232 controller, MPMC interface and the reconfiguration controller. The interrupt controller is used to notify the processor when the tasks on the IP cores finish, and the RS232 port is used for result display; the MPMC provides the programmer's interface for SDRAM control, and the ICAP is used for reconfiguration.

The key technology for realising our approach is partial reconfiguration. We first preload the partial configuration bit stream onto a Compact Flash card. When the pipeline must be restructured to apply the optimized strategy, the program running on the MicroBlaze processor loads the partial bit stream from flash to SDRAM and transfers it over the PLB bus to the ICAP, the reconfiguration controller, which configures the FPGA to fit the new pipeline structure. The reconfiguration procedure obviously introduces extra overhead, but it happens only once, when the optimized pipeline structure is initialized, so it has no significant effect on our approach.

The software simulation showed that our optimization strategy achieves satisfactory results. On the real hardware platform, we choose a more complex test case from EEMBC, JPEG encoding, to perform the evaluation. As in the software simulation, we start by analysing the process flow of JPEG. As illustrated in Fig. 7(a), it consists of four major steps: colour conversion (CC), two-dimensional DCT, quantization and Huffman coding.

Figure 7. JPEG process flow: (a) origin process flow (Read 8x8 RGB block, CC, DCT-2D, Quant, ZZ/Huffman, Write compressed bits); (b) optimized process flow, with DCT-2D split into two DCT-1D stages

A JPEG case study [7] shows that the workload ratio of CC : 2D-DCT : Quant : Huffman is approximately 11.95% : 61.2% : 11.95% : 14.9%. We use these ratios as the node weights of the WDFG in our algorithms. The parallelism of the loop in the JPEG encoding procedure is also considered in [8]. In our approach, algorithms 1 and 2 recognise DCT-2D as the bottleneck task. On further observation, DCT-2D can be divided into two steps, vertical DCT and horizontal DCT, so algorithm 3 chooses splitting DCT-2D as the optimization strategy. Fig. 7(b) depicts the optimized process flow, with two DCT-1Ds between CC and Quant.

Based on the previous analyses, we evaluate both the original and the optimized pipeline. First, four IP cores have been arranged on the prototype system to perform the operations of the original process flow. As Fig. 6 depicts, each IP core is responsible for one operation of the JPEG process flow, and the pipeline is controlled by a controller running on the MicroBlaze processor. The processor first transfers data to the CC IP core; when the CC IP core finishes execution, it notifies the processor with an interrupt signal; on receiving this signal, the processor immediately reads the data from the output buffer of the CC IP core and transfers it to the input buffer of the second IP core, DCT-2D. Moreover, the data transmissions between different IP cores do not interfere with one another, since separate threads running on the processor handle the data transmissions, each thread dedicated to one kind of transfer, such as reading data from the output buffer of the DCT-2D IP core. Following this pattern, the pipeline is built on the hardware platform. Then, to evaluate the optimized pipeline, the partial bit stream is configured into the FPGA through the ICAP to restructure the pipeline; the other components of the prototype system, as well as the pipeline controller running on the MicroBlaze processor, need not change. By this means, the speedup ratio, throughput rate and efficiency of both the original and the optimized pipeline are evaluated.

Figure 8. Results of hardware evaluation

Fig. 8 presents the results of the hardware evaluation, which are close to those of the software simulation: the speedup and throughput rate increase by 2 times and the efficiency increases by 1.5 times. That is to say, our approach also achieves good performance on a real hardware platform.
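The bottleneck identification and the roughly 2x throughput figure can be checked directly from the workload ratios of [7]; a small sketch (names are ours, communication cost ignored for simplicity):

```python
# Checking the JPEG analysis against the workload ratios reported in [7].
ratios = {"CC": 11.95, "DCT-2D": 61.2, "Quant": 11.95, "Huffman": 14.9}

# Algorithms 1 and 2 on this four-stage chain pick the heaviest stage:
bottleneck = max(ratios, key=ratios.get)
print(bottleneck)  # DCT-2D

# Algorithm 3 splits DCT-2D into two DCT-1D halves of roughly equal weight:
split = {k: v for k, v in ratios.items() if k != "DCT-2D"}
split["DCT-1D_a"] = ratios["DCT-2D"] / 2
split["DCT-1D_b"] = ratios["DCT-2D"] / 2

# With tc ~ 0, throughput improves by max(before) / max(after):
print(max(ratios.values()) / max(split.values()))  # 2.0
```

The predicted factor of 2 is consistent with the measured speedup and throughput improvement reported in Fig. 8, ignoring the communication overhead the split introduces.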
V. RELATED WORK

Accelerating loops with pipelining has been studied on various architectures, such as ASICs and FPGAs. [5] aims at increasing the throughput of the pipeline system to accelerate nested loops. For VLIW processors, pipelining of loops can be performed in software, so-called software pipelining [9]. [10] extends and adapts single-dimension software pipelining to generate schedules for FPGAs. Pipeline vectorization, a method for synthesizing hardware pipelines in reconfigurable systems based on software vectorizing compilers, is presented in [11]. For DSP applications, [12] extends software pipelining from innermost loops to whole nested loops. Pipeline optimization has also been researched in many settings: [13] optimizes software pipelining by finding the minimum iteration initiation interval for each level of a nested loop; [4] introduces a rotation scheduling algorithm to optimize pipelining; [5] makes pipelining more efficient by unroll-and-squash, achieving a 2x improvement in area efficiency over the best known optimization techniques; [14] optimizes pipelines for power and performance, analysing a power-performance model to derive the optimal pipeline depth and obtain more than 10% additional power reduction. [15] extends VLIW software pipelining for the Garp compiler and architecture, where Garp is a rapidly reconfigurable coprocessor, obtaining 2x-4x speedups. What is more, prolog-epilog merging is also a widely used way of optimizing pipelines [1, 16, 17]. On reconfigurable platforms, [15] adapts software pipelining to the Garp compiler and architecture, and [1] pipelines nested loops on reconfigurable array processors, optimized with prolog-epilog merging. For FPGAs, [18] presents an optimized pipeline scheduling method based on pipeline types and scheduling principles, and the Multi-Pipeline Reconfigurable System (MPRS), which has multiple linear arrays for pipelined applications, is presented in [19].

VI. CONCLUSION

In this paper we present algorithms that automatically choose pipeline optimization strategies by profiling the WDFG of loops. The strategies extend traditional optimization methods according to the particularities of reconfigurable platforms, and we apply them at the task level. We evaluate our approach both in software simulation and on a hardware platform; both sets of results show that our approach achieves satisfactory performance.

ACKNOWLEDGMENT

This work was supported by the National Science Foundation of China under grants No. 61272131 and No. 61202053, China Postdoctoral Science Foundation grant No. BH0110000014, the Fundamental Research Funds for the Central Universities No. WK0110000034, and Jiangsu Provincial Natural Science Foundation grant No. SBK201240198. We owe many thanks to the anonymous reviewers and editors for their feedback and suggestions.

REFERENCES

[1] Y. Kim, et al., "Improving performance of nested loops on reconfigurable array processors", ACM Trans. Archit. Code Optim., 2012, 8(4), p. 1-23.
[2] A. Morvan, S. Derrien, and P. Quinton, "Efficient nested loop pipelining in high level synthesis using polyhedral bubble insertion", in Field-Programmable Technology (FPT), 2011 International Conference on, 2011.
[3] C. Wang, et al., "Detecting Data Hazards in Multi-Processor System-on-Chips on FPGA", in Parallel and Distributed Processing Symposium Workshops & PhD Forum (IPDPSW), 2012, p. 282-287.
[4] C. Liang-Fang, A.S. LaPaugh, and E.H.M. Sha, "Rotation scheduling: a loop pipelining algorithm", IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 1997, 16(3), p. 229-239.
[5] D. Petkov, R. Harr, and S. Amarasinghe, "Efficient pipelining of nested loops: unroll-and-squash", in Parallel and Distributed Processing Symposium (IPDPS), 2002.
[6] C. Wang, et al., "A star network approach in heterogeneous multiprocessors system on chip", The Journal of Supercomputing, 2012, 62(3), p. 1404-1424.
[7] J. Wu, et al., "Design Exploration for FPGA-Based Multiprocessor Architecture: JPEG Encoding Case Study", in Field Programmable Custom Computing Machines (FCCM '09), 17th IEEE Symposium on, 2009.
[8] C. Wang, et al., "MP-Tomasulo: A Dependency-Aware Automatic Parallel Execution Engine for Sequential Programs", ACM Trans. Archit. Code Optim., 2013, 10(2), p. 1-26.
[9] V.H. Allan, et al., "Software pipelining", ACM Comput. Surv., 1995, 27(3), p. 367-432.
[10] K. Turkington, et al., "Outer Loop Pipelining for Application Specific Datapaths in FPGAs", IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2008, 16(10), p. 1268-1280.
[11] M. Weinhardt and W. Luk, "Pipeline vectorization for reconfigurable systems", in Field-Programmable Custom Computing Machines (FCCM '99), Seventh Annual IEEE Symposium on, 1999.
[12] W. Jian and S. Bogong, "Software pipelining of nested loops for real-time DSP applications", in Acoustics, Speech and Signal Processing (ICASSP), 1998 IEEE International Conference on, 1998.
[13] J. Ramanujam, "Optimal software pipelining of nested loops", in Parallel Processing Symposium, Eighth International, 1994.
[14] V. Srinivasan, et al., "Optimizing pipelines for power and performance", in Microarchitecture (MICRO-35), 35th Annual IEEE/ACM International Symposium on, 2002.
[15] T.J. Callahan and J. Wawrzynek, "Adapting software pipelining for reconfigurable computing", ACM, 2000.
[16] K. Muthukumar and G. Doshi, "Software Pipelining of Nested Loops", in Compiler Construction, R. Wilhelm, Ed., Springer Berlin/Heidelberg, 2001, p. 165-181.
[17] M. Fellahi and A. Cohen, "Software Pipelining in Nested Loops with Prolog-Epilog Merging", in High Performance Embedded Architectures and Compilers, A. Seznec, et al., Eds., Springer Berlin/Heidelberg, 2009, p. 80-94.
[18] Q. Jin, et al., "The research of FPGA-based loop optimization pipeline scheduling technology", in Computer and Communication Technologies in Agriculture Engineering (CCTAE), 2010 International Conference On, 2010.
[19] Y.S. Yin, G.M. Du, and Y.K. Song, "Study on the Multi-pipeline Reconfigurable Computing System", in Computer Science and Software Engineering, 2008 International Conference on, 2008.
[20] J.A. Poovey, T.M. Conte, et al., "A Benchmark Characterization of the EEMBC Benchmark Suite", IEEE Micro, 2009, 29, p. 18-29.