Performance of Remote FPGA-based Coprocessors for Image-Processing Applications*
Domingo Benitez, University of Las Palmas G.C., Edificio de Informatica, 35017 Las Palmas, Spain. [email protected]

Abstract
This paper describes a performance evaluation of Image-Processing applications on FPGA-based coprocessors that are part of general-purpose computers. Our experiments show that the maximum speed-up depends on the amount of data processed by the coprocessor. Taking images with 256 x 256 pixels, a moderate FPGA capacity of 10E+5 CLBs provides two orders of magnitude of performance improvement over a Pentium III processor for most of our benchmarks. However, the memory organization and the host bus degrade these results. The benchmarks that can exhibit high performance improvement would require about 200 memory banks of 256 bytes and a host bandwidth as high as 30 GB/s. Our quantitative approach explains why some currently available FPGA-based coprocessors do not provide the achievable level of performance for some Image-Processing applications.
1: Introduction
Image Processing refers to an important class of data-parallel and computation-intensive applications that is becoming one of the dominant computing workloads ([5], [6]). Image-Processing applications operate on data to be presented visually, or extract image features for other computer-vision applications. Some FPGAs support many configurable data-paths that can provide higher performance than current processors for Image Processing [9]. In spite of the large amount of attention given to multimedia workloads by investigators in the domain of configurable computing, there is little quantitative understanding of the performance of such applications on FPGA-based coprocessors. A major challenge for such studies is the enormous cost of developing applications using standardized hardware description languages (e.g. VHDL, Verilog) [12].
The goal of this work is to explore the behavior of FPGA-based coprocessors that are part of general-purpose computer systems. An important motivation is the architectural characterization of reconfigurable systems that interact with current high-performance processors. We compare the performance of a successful Pentium III processor with various FPGA-based coprocessors. Further motivating our study is the large role the memory hierarchy plays in limiting performance. From our experiments, we can see how dependent the performance of reconfigurable coprocessors is on an efficient memory organization and on the associated bandwidth of data transmissions. We observe that Image-Processing applications demand not only a sufficient CLB count and memory size, but also a large number of memory banks and a high-bandwidth host bus. If these architectural elements are not well dimensioned, they limit the achievable performance. Since the variation of bus bandwidths encountered in contemporary computer systems is substantial, we suggest that reconfigurable architectures are more efficient when placed as close to the processor as possible without being part of its data-path.
* This work was supported by the Ministry of Education and Science of Spain under contract TIC98-0322-C03-02, Xilinx and Celoxica.
Proceedings of the Euromicro Symposium on Digital System Design (DSD'02) 0-7695-1790-0/02 $17.00 © 2002 IEEE
2: Related Work
Using the criterion of interface, we can divide reconfigurable architectures into local and remote systems. Local reconfigurable systems integrate a high-performance processor with a reconfigurable functional unit on the same chip (Garp [3], Remarc [12], Chimaera [20], PipeRench I-COP [4], MOLEN [18], etc.). Remote systems combine a matrix of fine- or coarse-grain reconfigurable processing elements, some blocks of RAM memory, a system interface, and optionally one or several general-purpose processors (MorphoSys [16], Xilinx Virtex FPGAs [19], PipeRench [7], etc.). Most of these reconfigurable systems have demonstrated their innovative characteristics on multimedia applications ([20], [12], [18], [16], etc.). Some studies show that local reconfigurable systems exhibit a maximum factor of 2X performance improvement over superscalar processors ([20]). On the other hand, remote systems can provide higher performance improvements for multimedia applications. However, the performance characterization is usually limited to a mention of the benefits. There is little quantitative understanding of the causes of performance benefits, of bottlenecks, or of the impact of alternative architectures. A quantitative analysis rarely appears as the justification of a solution when a reconfigurable system is involved.
Some researchers have claimed that there have been very few attempts to quantify the trade-offs in reconfigurable systems [17]. This paper therefore provides a quantitative analysis of systems that perform the same tasks under a conventional computing model and under a model where a reconfigurable coprocessor has been added. Regarding related work, quantitative studies of multimedia applications on general-purpose processors have received growing attention. Ranganathan et al. studied the performance of Image-Processing applications on general-purpose processors, with and without media ISA extensions [14]. They reported that with SIMD-style media instruction-set extensions and software prefetching, all their Image-Processing benchmarks are compute-bound. This effect limits the performance improvement, which can be enhanced by using remote reconfigurable coprocessors. Slingerland and Smith studied the caching behavior of multimedia applications [15]. They observed that multimedia applications exhibit low instruction miss ratios and comparable data miss ratios when contrasted with other workloads. The data reuse exhibited by Image-Processing applications (more than 98%) is highly exploited by the cache configurations of current processors. As will be described below, the computation model followed by FPGA-based systems can provide higher performance improvement if the memory organization can exploit more data locality than caches.
3: Experimental Methodology
EDGE: Edge detector performed by subtracting the pixel values for adjacent horizontal and vertical pixels, taking the absolute values and thresholding the result [13]. Data set: images of 256x256 16-bit pixels.
FDCT: Integer implementation of the Forward Discrete Cosine Transform (FDCT) [5] adopted by JPEG, MPEG and MediaBench [11]. Data set: images of 256x256 pixels, 1024 blocks of 8x8 16-bit data.
ME: Motion Estimation stage of MPEG-2 implemented by the Full Search Block Matching method [10]. Data set: 8-bit pixels, 1 search block of 24x24 pixels, and matching blocks of 16x16 pixels.
ATR: Automatic Target Recognition application that automatically detects, classifies, recognizes and identifies chromatic patterns [1]. Data set: color images of 256x256 24-bit pixels.
Table 1. Image-Processing benchmarks
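The EDGE kernel of Table 1 can be sketched as follows. This is a minimal Python model of the operation described above (the paper's actual implementations were in VC++ and Handel-C; the function name, the handling of the image border and the binary output are illustrative assumptions):

```python
def edge_detect(img, threshold):
    """Edge detector from Table 1: subtract adjacent horizontal and
    vertical pixel values, take the absolute values, and threshold.
    `img` is a list of rows of integer pixel values (e.g. 16-bit)."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h - 1):
        for x in range(w - 1):
            dh = abs(img[y][x] - img[y][x + 1])   # horizontal difference
            dv = abs(img[y][x] - img[y + 1][x])   # vertical difference
            out[y][x] = 1 if max(dh, dv) > threshold else 0
    return out

# A vertical step edge in a small test image is detected at the boundary.
img = [[0, 0, 9, 9]] * 4
edges = edge_detect(img, threshold=4)
```

The two absolute differences per pixel are independent, which is what makes the kernel a good candidate for the replicated, pipelined data-paths discussed later.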
3.2: Remote FPGA Coprocessor Architecture One of our goals is to study the reconfigurable system model shown in Figure 1, with an architecture that is composed of a current general-purpose processor (similar to the Intel Pentium III) and a reconfigurable coprocessor based on FPGAs. The reconfigurable coprocessor has a remote interface since the general-purpose processor and the reconfigurable hardware are connected through the “System Bus”.
3.1: Workloads
For our study, we employ the real benchmarks and data sets described in Table 1. These applications were chosen from the media-processing domain, where configurable computing may provide greater speed-up than other architectural solutions. They are representative kernels of Image-Processing applications. Note that we do not include applications such as JPEG or MPEG encoders/decoders, which are used in studies of general-purpose processors ([14], [15], [20]). Our benchmarks exhibit workload characteristics that differ from those of full applications, which in fact consist of such kernels. We found that each Image-Processing kernel exploits the system architecture in a different way. This fact is not very noticeable in the above-mentioned studies of processors. On the other hand, remote reconfigurable coprocessors can better exploit this non-uniform behavior in order to provide higher performance than general-purpose architectures with reconfigurable functional units. Our conjecture, shared by other authors, is that by exploiting parallelism at the kernel level, reconfigurable coprocessors can speed up application programs [12].
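The ME benchmark's Full Search Block Matching [10] can also be modelled compactly. The sketch below uses the sum of absolute differences (SAD) as the matching cost, a common choice for this method, although the paper does not state which cost function its implementation uses; names and block sizes are illustrative:

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equal-sized blocks."""
    return sum(abs(a - b) for row_a, row_b in zip(block_a, block_b)
                          for a, b in zip(row_a, row_b))

def full_search(search_area, ref_block):
    """Full Search Block Matching: slide the reference block over every
    position of the search area and return (best_y, best_x, best_sad)."""
    n = len(ref_block)
    best = None
    for y in range(len(search_area) - n + 1):
        for x in range(len(search_area[0]) - n + 1):
            cand = [row[x:x + n] for row in search_area[y:y + n]]
            cost = sad(cand, ref_block)
            if best is None or cost < best[2]:
                best = (y, x, cost)
    return best

# The reference block is found where it matches exactly (SAD = 0).
area = [[y * 7 + x * 3 for x in range(8)] for y in range(8)]
ref = [row[2:6] for row in area[1:5]]          # 4x4 block taken at (1, 2)
```

Every candidate position is evaluated independently, which is why ME maps naturally onto the many replicated one-cycle hardware paths reported later.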
Figure 1. FPGA-based System (block diagram: a general-purpose processor with cache is connected through the host bus and an Interface Controller (IC) to the FPGA-based coprocessor, whose FPGA accesses several local memory banks (M) over a local bus)
The instruction-set processor and the reconfigurable data-path may support data processing in parallel or concurrently. By customizing the hardware configuration of the coprocessor, higher performance can be provided. An important component is the Local Memory (M), which is analogous to a data cache. This memory is made up of several banks, all of which can be accessed in parallel by the reconfigurable data-path. The general-purpose microprocessor can access the memory banks through the "Local Bus". As we will see, a dedicated local memory is key for reconfigurable architectures to achieve high performance. However, this architectural element is not found in some reconfigurable systems. Another component is the memory interface, which is implemented through a system bus controller called the "Interface Controller (IC)". We suppose that its integrated DMA controller can handle data transfers with variable bandwidths. If all of these architectural features are combined with efficient programming abstractions and compiler techniques, then CPU performance can be exceeded on a range of problems. Many computation-intensive kernels and applications can be mapped onto this architectural model. For this quantitative study, our experiments use the Image-Processing benchmarks described earlier. We do not fix any implementation technology for the design of the FPGA-based coprocessor: all of its components may be put on a chip, on a board, or integrated with the processor on the same chip. Our architectural study tries to understand the impact of several factors of the system design on coprocessor performance: the number of configurable blocks, the reconfiguration time, the organization of the local memory, and the bandwidth of the host interface.
3.3: Analysis Tools
Our analysis of reconfigurable coprocessors uses a combined environment composed of simulators and real hardware. The experimental platform is composed of the following elements:
- A C compiler for FPGAs called Handel-C (v2.1). This compiler generates a netlist that is an intermediate description of the functionality of the reconfigurable hardware [8]. We also use its simulator, which allows us to verify the functionality of programs without using real hardware and to obtain the number of cycles that the FPGA spends processing a piece of code.
- The Xilinx Foundation design suite (v3.1). It was required for generating FPGA bitstreams from the netlist. These tools allow us to estimate the maximum clock speed of the synchronous hardware generated from the netlist.
- The Intel VTune™ Performance Analyzer 4.5, used to simulate the code generated by the MS VC++ 6.0 compiler for the Image-Processing kernels. We profiled the applications to identify the key procedures that spend more than 99% of the cycle count on a Pentium III. Then, we obtained the cycle count for every type of instruction. Our study compares these measurements with those obtained from simulations of an FPGA-based coprocessor that executes the same benchmarks.
- A PCI coprocessor board called RC1000-PP from Celoxica with a one-million-gate FPGA [2].
3.4: Coprocessor Usage Methodology
We manually coded the four benchmarks in the Handel-C programming language. Each benchmark was coded in five different ways, corresponding to five hardware microarchitectures. The compiled code was then simulated with the Handel-C simulator in order to collect the cycle counts provided by the FPGA. Every Handel-C program was tested to verify that it produces the same output results as the VC++ versions. Finally, we automatically synthesized the netlist files generated by the Handel-C compiler with the Xilinx development suite, and thus estimated the maximum operating frequency for each microarchitecture. Some of the implementations were also tested on the real FPGA-based board RC1000-PP. Not all of them could be tested on this board because the architectural resources it supports are limited. Nevertheless, at least one hardware design for each benchmark was implemented.
3.5: Performance Metrics
We use the speed-up of the reconfigurable coprocessor over the Pentium III as the primary metric to evaluate its performance benefits. The speed-up is evaluated by dividing the cycle counts provided by both architectures and scaling by the clock speeds. We assume that a representative Pentium III fabricated with 0,18 µm technology operates at 1 GHz. Additionally, we assume that the maximum clock rate for the FPGA is obtained from the place-and-route tools for a Xilinx Virtex device fabricated with 0,22 µm technology. The results shown in this paper would be even more optimistic for the reconfigurable coprocessor if the transistor density were the same.
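The metric above can be written out explicitly. The following sketch (our own formulation of the ratio described in this section, with illustrative default frequencies) divides the cycle counts and scales by the clock frequencies:

```python
def speedup(cpu_cycles, fpga_cycles, cpu_mhz=1000.0, fpga_mhz=100.0):
    """Speed-up of the coprocessor over the processor: the ratio of
    execution times, i.e. the cycle-count ratio scaled by the ratio
    of clock frequencies (Section 3.5)."""
    cpu_time = cpu_cycles / (cpu_mhz * 1e6)      # seconds on the CPU
    fpga_time = fpga_cycles / (fpga_mhz * 1e6)   # seconds on the FPGA
    return cpu_time / fpga_time
```

With equal cycle counts, a 10x slower FPGA clock yields a 0,1X "speed-up"; the FPGA must save cycles through parallelism and pipelining to come out ahead.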
3.6: Performance of Conventional ILP Features
Table 2 lists the instruction and cycle counts that characterize the workload simulation for the Pentium III. The percentages are divided into performance categories: branch instructions, Add/Sub instructions that operate on registers or immediate values (AddRR), Add/Sub instructions that operate on operands stored in memory (AddRM), logic instructions, shift instructions, load instructions, and store instructions. The variation of the average CPIs obtained for the processor means that the conventional ILP features exploit different levels of parallelism. On average, 61,2% of the instructions require memory access, and the associated cycle count is 68,4%. These results differ from those reported in other studies, where an average of 22% memory instructions is shown for multimedia workloads [15].

Name   Instruction references   Number of cycles   Average CPI
FDCT   2.842.757                2.188.923          0,77
ME     2.417.292                1.522.894          0,63
EDGE   2.513.359                2.136.355          0,85
ATR    3.949.593                3.989.089          1,01

% Instructions
Name     Branch   AddRR   AddRM   Logic   Shift   Loads   Stores
FDCT      0,68     8,14   15,56    4,68    4,03   44,27   22,37
EDGE     11,42    19,61   10,43    4,91   10,43   36,47    6,51
ME        5,79    16,64   21,87    2,71   21,69   24,90    5,61
ATR       8,30    11,63    3,33   11,64   11,62   39,82   13,27
Avera.    6,60    13,53   11,62    6,66   11,60   37,10   12,45

% Clock cycles
Name     Branch   AddRR   AddRM   Logic   Shift   Loads   Stores
FDCT      0,94     8,96   21,17    4,49    4,50   33,83   26,11
EDGE     17,91     9,95   12,27    3,13    3,31   44,63    8,33
ME       13,24    10,90   34,05    4,30    9,73   23,23    4,37
ATR       7,41     7,17    4,49   14,05    5,61   46,71   14,52
Avera.    9,15     8,75   14,47    8,04    5,50   39,76   14,18

Table 2. Workload simulation characteristics for Pentium III

On the other hand, 43,4% of the instructions use the ALU functional unit, and only 6,6% of the instructions are branches. The reduced percentage of branch instructions is associated with a relatively large percentage of stall cycles due to misprediction (9,15%). Overall, by reducing both the number of instructions and the stall time originated by data dependencies and memory latencies, FPGA-based coprocessors allow the performance of the computer system to be improved.

4: Improving System Performance
4.1: Alternative Architectures for Image-Processing on FPGA-based Coprocessors
We have used several microarchitectures to develop different implementations of the benchmarks. These microarchitectures are targeted at FPGA-based coprocessors. The hardware implementations apply a combination of well-known hardware techniques that improve performance: multicycle execution, pipelining, and replication of data-paths. We report results for five variations of the coprocessor microarchitecture, called "A1,...,A5". The combinations of hardware techniques used for each microarchitecture are shown in Table 3. Note that each of them can be multicycle or pipelined, and this architectural feature is used for one or several identical replications of the processing hardware, called "hw paths". Each hardware path exploits the data-level parallelism inherent to the respective benchmark.

Name   EDGE                     FDCT                     ME                        ATR
A1     Multicycle, 1 hw path    Multicycle, 1 hw path    One cycle, 1 hw path      Multicycle, 1 hw path
A2     Pipelined, 1 hw path     Pipelined, 1 hw path     One cycle, 2 hw paths     Pipelined, 1 hw path
A3     Pipelined, 2 hw paths    Pipelined, 2 hw paths    One cycle, 8 hw paths     Pipelined, 2 hw paths
A4     Pipelined, 8 hw paths    Pipelined, 8 hw paths    One cycle, 32 hw paths    Pipelined, 8 hw paths
A5     Pipelined, 32 hw paths   Pipelined, 32 hw paths   One cycle, 256 hw paths   Pipelined, 32 hw paths

Table 3. Characteristics of the synthesized microarchitectures
Each microarchitecture followed a development flow using the Handel-C tools and the Xilinx Foundation design suite. A Xilinx Virtex device was chosen as the FPGA. The maximum clock frequencies obtained in the logic synthesis are presented in Table 4. We observe that the microarchitectures process images at clock rates that vary from 35 MHz to 113 MHz.

Name   EDGE   FDCT   ME   ATR
A1     113    48     50   52
A2     113    55     47   77
A3     98     49     43   69
A4     92     42     40   62
A5     81     40     35   50

Table 4. Clock rates. Measurements in MHz

The variation in clock rates is explained by two factors. On the one hand, the complexity of the computation demanded by a benchmark determines the number of CLBs involved in the propagation of signals between registers, which are synchronized by the same clock signal. This complexity is the highest for FDCT and ME and the lowest for EDGE; thus, FDCT and ME exhibit the lowest clock rates. On the other hand, the higher demand for connecting CLBs exhibited by the microarchitectures that support more identical data-paths makes the propagation delay within the FPGA larger. We observed that the hardware compiler and synthesizer might influence these results; however, all microarchitectures were compiled and synthesized with the same options.
4.2: Impact of FPGA-based Architectures
Figure 2 presents speed-up measurements for the five variations of the coprocessor architecture. The speed-up is computed as the ratio of the cycle count taken by the Pentium III (shown in Table 2) to the cycle count taken by the reconfigurable microarchitectures Ax, x=1,...,5. These amounts have been scaled by the respective ratio of frequencies, taking 1 GHz for the Pentium III; the frequencies for the microarchitectures are shown in Table 4.

Figure 2. Performance improvement of the FPGA over a Pentium III processor (speed-up of A1-A5, on a logarithmic scale from 0,1 to 1000, for FDCT, EDGE, ME and ATR)

The reconfigurable coprocessor may provide significant performance improvement for all benchmarks, with maximum values that can reach factors of 180X (see Figure 2). On average, the addition of an FPGA-based coprocessor improves the performance of the Pentium III-1 GHz system by a factor of 114X. Figure 3 presents additional data. Each line represents the "FPGA I/O Bandwidth" for one of the benchmarks, i.e., the number of bytes that the microarchitectures load and store from/to local memory each second. Note that the I/O bandwidth becomes higher as the speed-up increases.

Figure 3. Impact of the FPGA I/O Bandwidth (speed-up versus I/O bandwidth in MB/s, both on logarithmic scales, for FDCT, EDGE, ME and ATR)

One justification of these results is the existence of pipelined hardware paths. All the benchmarks are essentially composed of loops. The operations of each loop can be divided into blocks that have no data dependence and can then be pipelined. Compared to the multicycle A1 implementations, the pipelined A2 microarchitectures can load new data and store results every clock cycle, thus demanding higher FPGA I/O bandwidth. The benchmarks FDCT, EDGE and ATR experience performance benefits from pipelining. We found that computation can be deeply pipelined in EDGE and ATR, and moderately in FDCT. For ME, the main computation can be reduced to one operation, so pipelining has no effect on the number of data items accessed each clock cycle. A second justification for these higher bandwidth requirements is that more parallelism may be applied, since multiple hardware paths can be synthesized. They operate in parallel on independent or shared data sets. Therefore, FPGA I/O throughput is higher because the Image-Processing workload allows the synthesis of FPGA data-paths that are massively parallel and fully pipelined.
Note that some reconfigurable microarchitectures may exhibit lower performance than the Pentium III (A1 for FDCT, ME and ATR; A2 for ME). A low FPGA I/O bandwidth does not provide better performance than current processors. This low I/O bandwidth may originate in FPGA microarchitectures that do not exploit enough parallelism, even though FPGAs make use of specialized operators and reduce the overhead of branch instructions. In these cases, the ILP features of current processors provide more benefits than FPGAs.
We observe a correlation between the level of performance experienced by three of the benchmarks and the respective distributions of retired instructions and cycle counts for the Pentium III. The benchmarks with a higher percentage of instruction and cycle counts in the Load, Store and AddRM categories demand higher FPGA I/O bandwidth (see Figure 3 and Table 2). Other studies on general-purpose processors have shown that Image-Processing kernels exhibit high memory stall time unaffected by larger caches; in those cases, after applying software prefetching, the benchmarks revert to being compute-bound [14].
FPGA systems can improve the performance by exploiting more parallelism than processors. Our results show that reconfigurable coprocessors with higher FPGA I/O bandwidth achieve better performance. The limit is associated with the number of iterations in each benchmark. Normally, it is the same as the number of pixels or data blocks. On average, using FPGA technology, all Image-Processing kernels revert to being I/O bound.
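The cycle-count argument above can be made concrete with a simple idealized model (our own sketch, not the synthesized designs): a multicycle path takes several cycles per element, while a pipeline of depth d with p replicated hardware paths finishes N elements in roughly d + ceil(N/p) - 1 cycles, accepting p new data items every cycle, which is exactly why it demands more I/O bandwidth.

```python
import math

def multicycle_cycles(n_items, cycles_per_item):
    """A1-style execution: one item at a time, several cycles each."""
    return n_items * cycles_per_item

def pipelined_cycles(n_items, depth, paths=1):
    """`paths` replicated pipelines, each `depth` stages deep, together
    accepting `paths` new items per cycle once the pipeline is full."""
    return depth + math.ceil(n_items / paths) - 1

# 65536 pixels (a 256x256 image) through a hypothetical 5-cycle operation:
n = 256 * 256
```

Under this model, pipelining alone cuts the cycle count by the pipeline depth, and each doubling of the path count roughly halves it again, matching the qualitative trend from A1 to A5.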
4.3: Hardware Cost
Figure 4 shows the area required by the microarchitectures on a Xilinx Virtex FPGA. These amounts are given in Xilinx CLB slices; each CLB is equivalent to approximately 90 gates. Our experiments show that, for each benchmark, an increase in area brings increased performance due to more parallelism. However, the increase in area does not correspond to an equivalent performance improvement. An average increase of 2,5X in CLBs (range of 1,6X to 2,9X) provides an average factor of 5,1X performance improvement (range of 3X to 8,1X). Therefore, combining these results with those shown previously, the benefits from FPGAs are more efficiently achieved when higher FPGA I/O bandwidth is supported.

Figure 4. Number of CLBs that are required to synthesize the microarchitectures on an FPGA (CLB counts of A1-A5, on a logarithmic scale from 1 to 10E+6, for FDCT, EDGE, ME and ATR)
4.4: Reconfiguration Time
An FPGA may need to be reconfigured to execute different applications. This is one of the disadvantages of using FPGAs. However, FPGA coprocessors can take advantage of the fact that Image-Processing applications repetitively run the same algorithm on multiple images. If the configuration of the FPGA is shared by the execution of the application on multiple images, the negative impact of the reconfiguration time is reduced. Supposing that each benchmark processes successive frames at a rate of 25 images/second (25 Hz), Figure 5 shows the speed-up of the reconfigurable coprocessor for the FPGA microarchitectures that ideally achieve higher performance than the Pentium III. These measurements are given as percentages of the maximum performance improvements (see Figure 2). Our analysis supposes that the reconfiguration process is static and is made before Image Processing starts. Additionally, we suppose a loading time of 11,6 µs per CLB; this is a realistic value taken from Virtex FPGAs. Examining the variation of the continuous processing time needed to reach a real speed-up of 1 across the microarchitectures and benchmarks, we observe that it correlates with the maximum speed-up in the same way as the CLB count.
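The amortization argument of this section can be sketched as a formula (our own simplified model under the stated assumptions: static reconfiguration, 11,6 µs per CLB, and a run of continuous processing over which the one-off configuration time is spread):

```python
def real_speedup(ideal_speedup, clbs, processing_seconds,
                 seconds_per_clb=11.6e-6):
    """Speed-up over the CPU once the one-off FPGA configuration time
    is charged against `processing_seconds` of continuous processing."""
    reconfig = clbs * seconds_per_clb            # one-off loading time
    cpu_time = processing_seconds * ideal_speedup
    return cpu_time / (processing_seconds + reconfig)
```

For example, a hypothetical 10.000-CLB design loads in about 0,116 s; if the continuous processing time equals the loading time, the real speed-up is exactly half the ideal one, and larger CLB counts push the break-even processing time up, consistent with the correlation noted above.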
5: Analysis of Memory System Performance
5.1: Impact of Memory Banks
From an architectural point of view, the motivation for memory banks is higher memory bandwidth. Each FPGA microarchitecture Ax requires a different number of local memory modules. For each benchmark, Figure 6 presents the bank count for the FPGA microarchitectures that provide performance improvement. Except in the case of ME, the total size of the local memory was kept fixed (EDGE, FDCT: 128 KB; ATR: 192 KB). In these cases, the total memory size depends on the image size: larger images would require larger memory. However, if the bank count is kept fixed, the memory size has no impact on the performance. Similarly to our results, [14] also found that the size of the data cache needed to exploit the reuse in superscalar processors depends on the image size.
Examining the performance improvement achieved by FPGA coprocessors, we observe that it is mainly due to data-level parallelism. The reduction of processing time comes from the increased number of parallel data-paths that operate on reduced data sets, which are stored in local memory banks. The ME application achieves performance improvement by replicating the reference block: each matching block is stored in an independent memory bank and assigned to a data-path in the FPGA. So, the size of the local memory for ME ranges from 1,25 KB to 65 KB. For this application, then, performance improvement is achieved by increasing both the total memory size and the bank count. Nevertheless, ME exhibits fewer benefits than the other benchmarks when the bank count is fixed.
Additionally, Figure 6 shows that the Image-Processing kernels demand different bank counts for a fixed level of performance. We observe that this corresponds to the variations of the CPIs for memory instructions on the Pentium III (see Table 2). ME exhibits the lowest CPI for memory instructions and demands the highest bank count; for ATR, we found the opposite. Therefore, the addition of memory banks improves performance by reducing the memory stall time exhibited by general-purpose processors. This improvement is linear with the bank count, as shown in Figure 6.
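The architectural motivation stated at the start of this section can be quantified with a trivial model (our own sketch; it assumes every bank serves one access per cycle in parallel, which is the idealized behavior the banked local memory is designed for):

```python
def local_memory_bandwidth(banks, bytes_per_access, clock_mhz):
    """Aggregate local-memory bandwidth (MB/s) when every bank serves
    one access per clock cycle in parallel with the others."""
    return banks * bytes_per_access * clock_mhz
```

A single 2-byte-wide bank at 100 MHz delivers 200 MB/s, while 200 such banks deliver 40.000 MB/s: bandwidth grows linearly with the bank count, which mirrors the linear relation between bank count and speed-up observed in Figure 6.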
5.2: Impact of Host Bus Bandwidth
Real coprocessors require an initial stage in which data are transmitted from the host memory to the coprocessor memory. So, the limited bandwidth of the host bus may degrade the performance improvement. In Figure 7, we present results for the impact of the host bus bandwidth on the speed-up of FPGA coprocessors relative to the base system with the Pentium III. Now, the computation of the speed-up adds the latency of data transmission to the cycle count for the FPGA; the reconfiguration time is not considered, although it would further degrade the performance improvement as described earlier. A minimum bus bandwidth of 400 MB/s is needed to get any performance improvement. With 1 GB/s, all applications see a speed-up reduction that is higher than 50%. A 10 GB/s system bus allows the speed-up to reach 70% of the maximum speed-up.
We observe that EDGE and FDCT require a higher minimum bandwidth to exhibit performance improvement than the remaining benchmarks. EDGE and FDCT operate on images and their results are images; thus, the higher impact of the host bandwidth is due to the transmission of complete images in both directions, from main memory to local memory and vice versa. On the other hand, ME and ATR operate on images but their results are small data structures, so these applications spend less time in data transmission. Therefore, the microarchitectures that ideally provide higher performance improvement require higher system bus bandwidth to reach the same real performance improvement. For larger images, the bandwidth must be increased proportionally in order to exhibit the same performance improvement. For example, taking the worst case (A2 for FDCT) and 1024x1024 2-byte images, we found that a speed-up equal to 1 requires a host bandwidth of 6,2 GB/s.

Figure 5. Impact of the reconfiguration time on performance of the reconfigurable coprocessor (percentage of the maximum speed-up versus continuous processing time in seconds, at 25 Hz, for microarchitectures uA1-uA5 of EDGE 256x256, ME 24x24, FDCT 256x256 and ATR 256x256)

Figure 6. Impact of the memory bank count (bank count versus speed-up, both on logarithmic scales from 1 to 1000, for EDGE, FDCT, ME and ATR)

6: Summary
6.1: FPGA-based Coprocessor Design
We have provided a quantitative analysis of three important parameters of FPGA coprocessors for supporting Image Processing:
- Hardware capacity. A moderate FPGA capacity of 10E+5 CLBs was found to provide two orders of magnitude of performance improvement over a Pentium III for most of our benchmarks.
- Memory bank count. We found performance improvement when the bank count of the coprocessor memory is increased. About 200 memory banks of 256 bytes can provide the maximum speed-up.
- Bus bandwidth. When the data transmission between host memory and local memory is considered, we found that the maximum speed-up depends on the size of the images. Taking images with 256 x 256 pixels and benchmarks that can exhibit a performance improvement of two orders of magnitude, the available bandwidth must be as high as 30 GB/s.
6.2: Conclusions
We can explain why some currently available FPGA coprocessors do not provide the achievable two orders of magnitude of performance improvement for some Image-Processing applications. This conclusion was verified using the PCI bus plug-in board called RC1000-PP. It has one Xilinx Virtex XCV1000 FPGA with four banks of local memory [2]. The FPGA device can provide a hardware complexity equivalent to 12288 CLBs, i.e. one million gates. The maximum bandwidth of the PCI system bus is 132 MB/s. So, the maximum achievable speed-up is 7,8X for ME and 2X for ATR (see Figure 7); for the remaining benchmarks, the limited bandwidth does not allow any performance improvement. The four banks of memory and the one million gates would otherwise achieve a factor of 2,8X (FDCT) to 9,4X (ATR) performance improvement. ME requires a minimum of eight memory banks to see a factor of 2,3X performance improvement, so the four memory banks supported by the RC1000-PP are not sufficient to get a minimum speed-up. Therefore, a data bandwidth that is inappropriate both for the local reconfigurable data-paths and for the system bus can enormously degrade the performance improvement achievable by current FPGAs.

Figure 7. Performance of the reconfigurable coprocessor taking the system bus bandwidth into account (speed-up, on a logarithmic scale from 0,1 to 100, versus system bus bandwidth from 0 to 1000 MB/s, for microarchitectures uA1-uA5 of EDGE 256x256, ME 24x24, FDCT 256x256 and ATR 256x256)
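The bus-limited behavior behind these RC1000-PP numbers follows from charging the host-bus transfer time against the compute time. The sketch below is our own simplified version of the Section 5.2 model (it assumes the transfer and the computation are not overlapped, and the byte count and times are illustrative):

```python
def bus_limited_speedup(cpu_seconds, fpga_seconds, transfer_bytes,
                        bus_mb_per_s):
    """Speed-up over the CPU once the input/output data must first
    cross the host bus (transfer not overlapped with computation)."""
    transfer = transfer_bytes / (bus_mb_per_s * 1e6)
    return cpu_seconds / (fpga_seconds + transfer)

# A 128 KB image over a 132 MB/s PCI bus costs about 1 ms - already more
# than many of the FPGA compute times, which caps the achievable speed-up
# far below the ideal value regardless of how fast the FPGA computes.
```

As the example comment suggests, once the transfer term dominates, adding FPGA parallelism no longer helps; only a faster host bus (or overlapping transfers with computation) raises the ceiling.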
Acknowledgement The author thanks Daniel Herrera and the referees for helpful feedback in preparing this paper.
References
1. D. Benitez; Modular Architecture for Custom-Built Systems Oriented to Real-Time Computer Vision: Application to Color Recognition; J. Systems Architecture, 42(8):709-723, 1997
2. Celoxica; celoxica.com
3. T.J. Callahan, J.R. Hauser, J. Wawrzynek; The Garp Architecture and C Compiler; IEEE Computer, 33(4):62-69, 2000
4. Y. Chou, P. Pillai, H. Schmit, J.P. Shen; PipeRench Implementation of the Instruction Path Coprocessor; Int. Symp. on Microarchitecture, pp. 147-158, 2000
5. T.M. Conte, P.K. Dubey, M.D. Jennings, R.B. Lee, A. Peleg, S. Rathnam, M.S. Schlansker, P. Song, A. Wolfe; Challenges to Combining General-Purpose and Multimedia Processors; IEEE Computer, 30(12):33-37, 1997
6. K. Diefendorff, P.K. Dubey; How Multimedia Workloads Will Change Processor Design; IEEE Computer, 30(9):43-45, 1997
7. S.C. Goldstein, H. Schmit, M. Budiu, S. Cadambi, M. Moe, R.R. Taylor; PipeRench: A Reconfigurable Architecture and Compiler; IEEE Computer, 33(4):70-77, 2000
8. Celoxica; Handel-C Language Reference Manual, 1998
9. R. Hartenstein; Reconfigurable Computing: a New Business Model - and its Impact on SoC Design (invited keynote); DSD'2001, Warsaw, Poland, September 2001
10. T. Komarek, P. Pirsch; Array Architectures for Block Matching Algorithms; IEEE Trans. Circuits & Systems, 36:1301-1308, 1989
11. C. Lee, M. Potkonjak, W. Mangione-Smith; MediaBench: A Tool for Evaluating and Synthesizing Multimedia and Communications Systems; Int. Symp. on Microarchitecture, pp. 330-335, 1997
12. T. Miyamori, K. Olukotun; A Quantitative Analysis of Reconfigurable Coprocessors for Multimedia Applications; IEEE Symp. on FCCM, 1998
13. W. Pratt; Digital Image Processing, 2nd ed.; John Wiley, 1991
14. P. Ranganathan, S.V. Adve, N.P. Jouppi; Performance of Image and Video Processing with General-Purpose Processors and Media ISA Extensions; ISCA-26, ACM Computer Architecture News, 27(2):124-135, 1999
15. N. Slingerland, A. Smith; Cache Performance for Multimedia Applications; 15th Intl. Conf. on Supercomputing, pp. 204-217, 2001
16. H. Singh, M. Lee, G. Lu, F.J. Kurdahi, N. Bagherzadeh, E.M. Chaves Filho; MorphoSys: An Integrated Reconfigurable System for Data-Parallel and Computation-Intensive Applications; IEEE Trans. Computers, 49(5):465-481, 2000
17. J. Villasenor, B. Hutchings; The Flexibility of Configurable Computing; IEEE Signal Processing Magazine, 15(5):67-84, 1998
18. S. Vassiliadis, S. Wong, S. Cotofana; The MOLEN ρμ-coded Processor; Proc. 11th Int. Conf. on FPL, pp. 275-285, 2001
19. Xilinx; Virtex-EM FIR Filter for Video Applications; Xilinx Application Note XAPP241, 2000 (xilinx.com)
20. Z.A. Ye, A. Moshovos, S. Hauck, P. Banerjee; CHIMAERA: A High-Performance Architecture with a Tightly-Coupled Reconfigurable Functional Unit; ISCA-27, pp. 225-235, 2000