Performance of Remote FPGA-based Coprocessors for Image-Processing Applications*
Domingo Benitez, University of Las Palmas G.C., Edificio de Informatica, 35017 Las Palmas, Spain. [email protected]

Abstract
This paper describes a performance evaluation of Image-Processing applications on FPGA-based coprocessors that are part of general-purpose computers. Our experiments show that the maximum speed-up depends on the amount of data processed by the coprocessor. Taking images with 256 x 256 pixels, a moderate FPGA capacity of 10E+5 CLBs provides two orders of magnitude of performance improvement over a Pentium III processor for most of our benchmarks. However, the memory organization and the host bus degrade these results. The benchmarks that can exhibit high performance improvement would require about 200 memory banks of 256 bytes and a host bandwidth as high as 30 GB/s. Our quantitative approach explains why some currently available FPGA-based coprocessors do not provide the achievable level of performance for some Image-Processing applications.
1: Introduction
Image Processing refers to an important class of data-parallel and computation-intensive applications that is becoming one of the dominant computing workloads ([5], [6]). Image-Processing applications operate on data to be presented visually, or extract image features for other computer-vision applications. Some FPGAs support many configurable data-paths that can provide higher performance than current processors for Image Processing [9]. In spite of the large amount of attention given to multimedia workloads by investigators in the domain of configurable computing, there is little quantitative understanding of the performance of such applications on FPGA-based coprocessors. A major challenge for such studies is the enormous cost of developing applications using standardized hardware description languages (e.g. VHDL, Verilog) [12].
The goal of this work is to explore the behavior of FPGA-based coprocessors that are part of general-purpose computer systems. An important motivation is the architectural characterization of reconfigurable systems that interact with current high-performance processors. We compare the performance of a successful Pentium III processor with various FPGA-based coprocessors. Further motivating our study is the large role the memory hierarchy plays in limiting performance. From our experiments, we can see how dependent the performance of reconfigurable coprocessors is on an efficient memory organization and on the associated bandwidth of data transmissions. We observe that Image-Processing applications demand not only a sufficient CLB count and memory size, but also a large number of memory banks and a high-bandwidth host bus. If these architectural elements are not well dimensioned, they limit the achievable performance. Since the variation of bus bandwidths encountered in contemporary computer systems is substantial, we suggest that reconfigurable architectures are more efficient when placed as close to the processor as possible without being part of its data-path.
* This work was supported by the Ministry of Education and Science of Spain under contract TIC98-0322-C03-02, Xilinx and Celoxica.
Proceedings of the Euromicro Symposium on Digital System Design (DSD'02) 0-7695-1790-0/02 $17.00 © 2002 IEEE
2: Related Work
Using the criterion of interface, we can divide reconfigurable architectures into local and remote systems. Local reconfigurable systems integrate a high-performance processor with a reconfigurable functional unit on the same chip (Garp [3], Remarc [12], Chimaera [20], PipeRench I-COP [4], MOLEN [18], etc.). Remote systems combine a matrix of fine- or coarse-grain reconfigurable processing elements, some blocks of RAM memory, a system interface, and optionally one or several general-purpose processors (MorphoSys [16], Xilinx Virtex FPGAs [19], PipeRench [7], etc.). Most of these reconfigurable systems have demonstrated their innovative characteristics on multimedia applications ([20], [12], [18], [16], etc.). Some studies show that local reconfigurable systems exhibit a maximum factor of 2X performance improvement over superscalar processors ([20]). On the other hand, remote systems can provide higher performance improvements for multimedia applications. However, the performance characterization is usually limited to a mention of the benefits. There is little quantitative understanding of the causes of performance benefits, of bottlenecks, or of the impact of alternative architectures. A quantitative analysis rarely appears as the justification of a solution when a reconfigurable system is involved.
Some researchers have claimed that there have been very few attempts to quantify the trade-offs in reconfigurable systems [17]. This paper therefore provides a quantitative analysis of systems that perform the same tasks under a conventional computing model and under a model where a reconfigurable coprocessor has been added. Regarding related work, quantitative studies of multimedia applications on general-purpose processors have received growing attention. Ranganathan et al. studied the performance of Image-Processing applications on general-purpose processors, with and without media ISA extensions [14]. They reported that with SIMD-style media instruction-set extensions and software prefetching, all their Image-Processing benchmarks are compute-bound. This effect limits the performance improvement, which can be enhanced by using remote reconfigurable coprocessors. Slingerland and Smith studied the caching behavior of multimedia applications [15]. They observed that multimedia applications exhibit low instruction miss ratios and comparable data miss ratios when contrasted with other workloads. The data reuse exhibited by Image-Processing applications (more than 98%) is highly exploited by the cache configurations of current processors. As will be described below, the computation model followed by FPGA-based systems can provide higher performance improvement if the memory organization can exploit more data locality than caches.
3: Experimental Methodology
EDGE: Edge detector performed by subtracting the pixel values for adjacent horizontal and vertical pixels, taking the absolute values and thresholding the result [13]. Data set: images of 256x256 16-bit pixels.
FDCT: Integer implementation of the Forward Discrete Cosine Transform (FDCT) [5] adopted by JPEG, MPEG and MediaBench [11]. Data set: images of 256x256 pixels, 1024 blocks of 8x8 16-bit data.
ME: Motion Estimation stage of MPEG-2 implemented by the Full Search Block Matching method [10]. Data set: 8-bit pixels, 1 search block of 24x24 pixels, and matching blocks of 16x16 pixels.
ATR: Automatic Target Recognition application that automatically detects, classifies, recognizes and identifies chromatic patterns [1]. Data set: color images of 256x256 24-bit pixels.
Table 1. Image-Processing benchmarks
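The EDGE kernel of Table 1 can be sketched as follows. This is a minimal Python model of the operation described above (the paper's actual implementations were in VC++ and Handel-C; the function name, the handling of the image border and the binary output are illustrative assumptions):

```python
def edge_detect(img, threshold):
    """Edge detector from Table 1: subtract adjacent horizontal and
    vertical pixel values, take the absolute values, and threshold.
    `img` is a list of rows of integer pixel values (e.g. 16-bit)."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h - 1):
        for x in range(w - 1):
            dh = abs(img[y][x] - img[y][x + 1])   # horizontal difference
            dv = abs(img[y][x] - img[y + 1][x])   # vertical difference
            out[y][x] = 1 if max(dh, dv) > threshold else 0
    return out

# A vertical step edge in a small test image is detected at the boundary.
img = [[0, 0, 9, 9]] * 4
edges = edge_detect(img, threshold=4)
```

The two absolute differences per pixel are independent, which is what makes the kernel a good candidate for the replicated, pipelined data-paths discussed later.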
3.2: Remote FPGA Coprocessor Architecture One of our goals is to study the reconfigurable system model shown in Figure 1, with an architecture that is composed of a current general-purpose processor (similar to the Intel Pentium III) and a reconfigurable coprocessor based on FPGAs. The reconfigurable coprocessor has a remote interface since the general-purpose processor and the reconfigurable hardware are connected through the “System Bus”.
3.1: Workloads
For our study, we employ the real benchmarks and data sets described in Table 1. These applications were chosen from the media-processing domain, where configurable computing may provide greater speed-up than other architectural solutions. They are representative kernels of Image-Processing applications. Note that we do not include applications such as JPEG or MPEG encoders/decoders, which are used in studies of general-purpose processors ([14], [15], [20]). Our benchmarks exhibit workload characteristics that differ from those of full applications, which in fact consist of such kernels. We found that each Image-Processing kernel exploits the system architecture in a different way. This fact is not very noticeable in the above-mentioned studies of processors. On the other hand, remote reconfigurable coprocessors can better exploit this non-uniform behavior in order to provide higher performance than general-purpose architectures with reconfigurable functional units. Our conjecture, shared by other authors, is that by exploiting parallelism at the kernel level, reconfigurable coprocessors can speed up application programs [12].
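The ME benchmark's Full Search Block Matching [10] can also be modelled compactly. The sketch below uses the sum of absolute differences (SAD) as the matching cost, a common choice for this method, although the paper does not state which cost function its implementation uses; names and block sizes are illustrative:

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equal-sized blocks."""
    return sum(abs(a - b) for row_a, row_b in zip(block_a, block_b)
                          for a, b in zip(row_a, row_b))

def full_search(search_area, ref_block):
    """Full Search Block Matching: slide the reference block over every
    position of the search area and return (best_y, best_x, best_sad)."""
    n = len(ref_block)
    best = None
    for y in range(len(search_area) - n + 1):
        for x in range(len(search_area[0]) - n + 1):
            cand = [row[x:x + n] for row in search_area[y:y + n]]
            cost = sad(cand, ref_block)
            if best is None or cost < best[2]:
                best = (y, x, cost)
    return best

# The reference block is found where it matches exactly (SAD = 0).
area = [[y * 7 + x * 3 for x in range(8)] for y in range(8)]
ref = [row[2:6] for row in area[1:5]]          # 4x4 block taken at (1, 2)
```

Every candidate position is evaluated independently, which is why ME maps naturally onto the many replicated one-cycle hardware paths reported later.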
Figure 1. FPGA-based System (block diagram: a general-purpose processor with cache is connected through the host bus and an Interface Controller (IC) to the FPGA-based coprocessor, whose FPGA accesses several local memory banks (M) over a local bus)
The instruction-set processor and the reconfigurable data-path may support data processing in parallel or concurrently. By customizing the hardware configuration of the coprocessor, higher performance can be provided. An important component is the Local Memory (M), which is analogous to a data cache. This memory is made up of several banks, all of which can be accessed in parallel by the reconfigurable data-path. The general-purpose microprocessor can access the memory banks through the "Local Bus". As we will see, a dedicated local memory is key for reconfigurable architectures to achieve high performance. However, this architectural element is not found in some reconfigurable systems. Another component is the memory interface, which is implemented through a system bus controller called the "Interface Controller (IC)". We suppose that its integrated DMA controller can handle data transfers with variable bandwidths. If all of these architectural features are combined with efficient programming abstractions and compiler techniques, then CPU performance can be exceeded on a range of problems. Many computation-intensive kernels and applications can be mapped onto this architectural model. For this quantitative study, our experiments use the Image-Processing benchmarks described earlier. We do not fix any implementation technology for the design of the FPGA-based coprocessor: all of its components may be put on a chip, on a board, or integrated with the processor on the same chip. Our architectural study tries to understand the impact of several factors of the system design on coprocessor performance: the number of configurable blocks, the reconfiguration time, the organization of the local memory, and the bandwidth of the host interface.
3.3: Analysis Tools
Our analysis of reconfigurable coprocessors uses a combined environment composed of simulators and real hardware. The experimental platform is composed of the following elements:
- A C compiler for FPGAs called Handel-C (v2.1). This compiler generates a netlist that is an intermediate description of the functionality of the reconfigurable hardware [8]. We also use its simulator, which allows us to verify the functionality of programs without using real hardware and to obtain the number of cycles that the FPGA spends processing a piece of code.
- The Xilinx Foundation design suite (v3.1). It was required for generating FPGA bitstreams from the netlist. These tools allow us to estimate the maximum clock speed of the synchronous hardware generated from the netlist.
- The Intel VTune™ Performance Analyzer 4.5, used to simulate the code generated by the MS VC++ 6.0 compiler for the Image-Processing kernels. We profiled the applications to identify the key procedures that spend more than 99% of the cycle count on a Pentium III. Then, we obtained the cycle count for every type of instruction. Our study compares these measurements with those obtained from simulations of an FPGA-based coprocessor that executes the same benchmarks.
- A PCI coprocessor board called RC1000-PP from Celoxica with a one-million-gate FPGA [2].
3.4: Coprocessor Usage Methodology
We manually coded the four benchmarks in the Handel-C programming language. Each benchmark was coded in five different ways, corresponding to five hardware microarchitectures. The compiled code was then simulated with the Handel-C simulator in order to collect the cycle counts provided by the FPGA. Every Handel-C program was tested to verify that it produces the same output results as the VC++ versions. Finally, we automatically synthesized the netlist files generated by the Handel-C compiler with the Xilinx development suite, and thus estimated the maximum operating frequency for each microarchitecture. Some of the implementations were also tested on the real FPGA-based board RC1000-PP. Not all of them could be tested on this board because the architectural resources it supports are limited. Nevertheless, at least one hardware design for each benchmark was implemented.
3.5: Performance Metrics
We use the speed-up of the reconfigurable coprocessor over the Pentium III as the primary metric to evaluate its performance benefits. The speed-up is evaluated by dividing the cycle counts provided by both architectures and scaling by the clock speeds. We assume that a representative Pentium III fabricated with 0,18 µm technology operates at 1 GHz. Additionally, we assume that the maximum clock rate for the FPGA is obtained from the place-and-route tools for a Xilinx Virtex device fabricated with 0,22 µm technology. The results shown in this paper would be even more optimistic for the reconfigurable coprocessor if the transistor density were the same.
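The metric above can be written out explicitly. The following sketch (our own formulation of the ratio described in this section, with illustrative default frequencies) divides the cycle counts and scales by the clock frequencies:

```python
def speedup(cpu_cycles, fpga_cycles, cpu_mhz=1000.0, fpga_mhz=100.0):
    """Speed-up of the coprocessor over the processor: the ratio of
    execution times, i.e. the cycle-count ratio scaled by the ratio
    of clock frequencies (Section 3.5)."""
    cpu_time = cpu_cycles / (cpu_mhz * 1e6)      # seconds on the CPU
    fpga_time = fpga_cycles / (fpga_mhz * 1e6)   # seconds on the FPGA
    return cpu_time / fpga_time
```

With equal cycle counts, a 10x slower FPGA clock yields a 0,1X "speed-up"; the FPGA must save cycles through parallelism and pipelining to come out ahead.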
3.6: Performance of Conventional ILP Features
Table 2 lists the instruction and cycle counts that characterize the workload simulation for the Pentium III. The percentages are divided into performance categories: branch instructions, Add/Sub instructions that operate on registers or immediate values (AddRR), Add/Sub instructions that operate on operands stored in memory (AddRM), logic instructions, shift instructions, load instructions, and store instructions. The variation of the average CPIs obtained for the processor means that the conventional ILP features exploit different levels of parallelism. On average, 61,2% of the instructions require memory access, and the associated cycle count is 68,4%. These results differ from those reported in other studies, where an average of 22% memory instructions is shown for multimedia workloads [15].

Name   Instruction references   Number of cycles   Average CPI
FDCT   2.842.757                2.188.923          0,77
ME     2.417.292                1.522.894          0,63
EDGE   2.513.359                2.136.355          0,85
ATR    3.949.593                3.989.089          1,01

% Instructions
Name     Branch   AddRR   AddRM   Logic   Shift   Loads   Stores
FDCT      0,68     8,14   15,56    4,68    4,03   44,27   22,37
EDGE     11,42    19,61   10,43    4,91   10,43   36,47    6,51
ME        5,79    16,64   21,87    2,71   21,69   24,90    5,61
ATR       8,30    11,63    3,33   11,64   11,62   39,82   13,27
Avera.    6,60    13,53   11,62    6,66   11,60   37,10   12,45

% Clock cycles
Name     Branch   AddRR   AddRM   Logic   Shift   Loads   Stores
FDCT      0,94     8,96   21,17    4,49    4,50   33,83   26,11
EDGE     17,91     9,95   12,27    3,13    3,31   44,63    8,33
ME       13,24    10,90   34,05    4,30    9,73   23,23    4,37
ATR       7,41     7,17    4,49   14,05    5,61   46,71   14,52
Avera.    9,15     8,75   14,47    8,04    5,50   39,76   14,18

Table 2. Workload simulation characteristics for Pentium III

On the other hand, 43,4% of the instructions use the ALU functional unit, and only 6,6% of the instructions are branches. The reduced percentage of branch instructions is associated with a relatively large percentage of stall cycles due to misprediction (9,15%). Overall, by reducing both the number of instructions and the stall time originated by data dependencies and memory latencies, FPGA-based coprocessors allow the performance of the computer system to be improved.

4: Improving System Performance
4.1: Alternative Architectures for Image-Processing on FPGA-based Coprocessors
We have used several microarchitectures to develop different implementations of the benchmarks. These microarchitectures are targeted at FPGA-based coprocessors. The hardware implementations apply a combination of well-known hardware techniques that improve performance: multicycle execution, pipelining, and replication of data-paths. We report results for five variations of the coprocessor microarchitecture, called "A1,...,A5". The combinations of hardware techniques used for each microarchitecture are shown in Table 3. Note that each of them can be multicycle or pipelined, and this architectural feature is used for one or several identical replications of the processing hardware, called "hw paths". Each hardware path exploits the data-level parallelism inherent to the respective benchmark.

Name   EDGE                     FDCT                     ME                        ATR
A1     Multicycle, 1 hw path    Multicycle, 1 hw path    One cycle, 1 hw path      Multicycle, 1 hw path
A2     Pipelined, 1 hw path     Pipelined, 1 hw path     One cycle, 2 hw paths     Pipelined, 1 hw path
A3     Pipelined, 2 hw paths    Pipelined, 2 hw paths    One cycle, 8 hw paths     Pipelined, 2 hw paths
A4     Pipelined, 8 hw paths    Pipelined, 8 hw paths    One cycle, 32 hw paths    Pipelined, 8 hw paths
A5     Pipelined, 32 hw paths   Pipelined, 32 hw paths   One cycle, 256 hw paths   Pipelined, 32 hw paths

Table 3. Characteristics of the synthesized microarchitectures
Each microarchitecture followed a development flow using the Handel-C tools and the Xilinx Foundation design suite. A Xilinx Virtex device was chosen as the FPGA. The maximum clock frequencies obtained in the logic synthesis are presented in Table 4. We observe that the microarchitectures process images at clock rates that vary from 35 MHz to 113 MHz.

Name   EDGE   FDCT   ME   ATR
A1     113    48     50   52
A2     113    55     47   77
A3     98     49     43   69
A4     92     42     40   62
A5     81     40     35   50

Table 4. Clock rates. Measurements in MHz

The variation in clock rates is explained by two factors. On the one hand, the complexity of the computation demanded by a benchmark determines the number of CLBs involved in the propagation of signals between registers, which are synchronized by the same clock signal. This complexity is the highest for FDCT and ME and the lowest for EDGE; thus, FDCT and ME exhibit the lowest clock rates. On the other hand, the higher demand for connecting CLBs exhibited by the microarchitectures that support more identical data-paths makes the propagation delay within the FPGA larger. We observed that the hardware compiler and synthesizer might influence these results; however, all microarchitectures were compiled and synthesized with the same options.
4.2: Impact of FPGA-based Architectures
Figure 2 presents speed-up measurements for the five variations of the coprocessor architecture. The speed-up is computed as the ratio of the cycle count taken by the Pentium III (shown in Table 2) to the cycle count taken by the reconfigurable microarchitectures Ax, x=1,...,5. These amounts have been scaled by the respective ratio of frequencies, taking 1 GHz for the Pentium III; the frequencies for the microarchitectures are shown in Table 4.

Figure 2. Performance improvement of the FPGA over a Pentium III processor (speed-up of A1-A5, on a logarithmic scale from 0,1 to 1000, for FDCT, EDGE, ME and ATR)

The reconfigurable coprocessor may provide significant performance improvement for all benchmarks, with maximum values that can reach factors of 180X (see Figure 2). On average, the addition of an FPGA-based coprocessor improves the performance of the Pentium III-1 GHz system by a factor of 114X. Figure 3 presents additional data. Each line represents the "FPGA I/O Bandwidth" for one of the benchmarks, i.e., the number of bytes that the microarchitectures load and store from/to local memory each second. Note that the I/O bandwidth becomes higher as the speed-up increases.

Figure 3. Impact of the FPGA I/O Bandwidth (speed-up versus I/O bandwidth in MB/s, both on logarithmic scales, for FDCT, EDGE, ME and ATR)

One justification of these results is the existence of pipelined hardware paths. All the benchmarks are essentially composed of loops. The operations of each loop can be divided into blocks that have no data dependence and can then be pipelined. Compared to the multicycle A1 implementations, the pipelined A2 microarchitectures can load new data and store results every clock cycle, thus demanding higher FPGA I/O bandwidth. The benchmarks FDCT, EDGE and ATR experience performance benefits from pipelining. We found that computation can be deeply pipelined in EDGE and ATR, and moderately in FDCT. For ME, the main computation can be reduced to one operation, so pipelining has no effect on the number of data items accessed each clock cycle. A second justification for these higher bandwidth requirements is that more parallelism may be applied, since multiple hardware paths can be synthesized. They operate in parallel on independent or shared data sets. Therefore, FPGA I/O throughput is higher because the Image-Processing workload allows the synthesis of FPGA data-paths that are massively parallel and fully pipelined.
Note that some reconfigurable microarchitectures may exhibit lower performance than the Pentium III (A1 for FDCT, ME and ATR; A2 for ME). A low FPGA I/O bandwidth does not provide better performance than current processors. This low I/O bandwidth may originate in FPGA microarchitectures that do not exploit enough parallelism, even though FPGAs make use of specialized operators and reduce the overhead of branch instructions. In these cases, the ILP features of current processors provide more benefits than FPGAs.
We observe a correlation between the level of performance experienced by three of the benchmarks and the respective distributions of retired instructions and cycle counts for the Pentium III. The benchmarks with a higher percentage of instruction and cycle counts in the Load, Store and AddRM categories demand higher FPGA I/O bandwidth (see Figure 3 and Table 2). Other studies on general-purpose processors have shown that Image-Processing kernels exhibit high memory stall time unaffected by larger caches; in those cases, after applying software prefetching, the benchmarks revert to being compute-bound [14].
FPGA systems can improve the performance by exploiting more parallelism than processors. Our results show that reconfigurable coprocessors with higher FPGA I/O bandwidth achieve better performance. The limit is associated with the number of iterations in each benchmark. Normally, it is the same as the number of pixels or data blocks. On average, using FPGA technology, all Image-Processing kernels revert to being I/O bound.
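The cycle-count argument above can be made concrete with a simple idealized model (our own sketch, not the synthesized designs): a multicycle path takes several cycles per element, while a pipeline of depth d with p replicated hardware paths finishes N elements in roughly d + ceil(N/p) - 1 cycles, accepting p new data items every cycle, which is exactly why it demands more I/O bandwidth.

```python
import math

def multicycle_cycles(n_items, cycles_per_item):
    """A1-style execution: one item at a time, several cycles each."""
    return n_items * cycles_per_item

def pipelined_cycles(n_items, depth, paths=1):
    """`paths` replicated pipelines, each `depth` stages deep, together
    accepting `paths` new items per cycle once the pipeline is full."""
    return depth + math.ceil(n_items / paths) - 1

# 65536 pixels (a 256x256 image) through a hypothetical 5-cycle operation:
n = 256 * 256
```

Under this model, pipelining alone cuts the cycle count by the pipeline depth, and each doubling of the path count roughly halves it again, matching the qualitative trend from A1 to A5.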
4.3: Hardware Cost
Figure 4 shows the area required by the microarchitectures on a Xilinx Virtex FPGA. These amounts are given in Xilinx CLB slices; each CLB is equivalent to approximately 90 gates. Our experiments show that, for each benchmark, an increase in area brings increased performance due to more parallelism. However, the increase in area does not correspond to an equivalent performance improvement. An average increase of 2,5X in CLBs (range of 1,6X to 2,9X) provides an average factor of 5,1X performance improvement (range of 3X to 8,1X). Therefore, combining these results with those shown previously, the benefits from FPGAs are more efficiently achieved when higher FPGA I/O bandwidth is supported.

Figure 4. Number of CLBs that are required to synthesize the microarchitectures on an FPGA (CLB counts of A1-A5, on a logarithmic scale from 1 to 10E+6, for FDCT, EDGE, ME and ATR)
4.4: Reconfiguration Time
An FPGA may need to be reconfigured to execute different applications. This is one of the disadvantages of using FPGAs. However, FPGA coprocessors can take advantage of the fact that Image-Processing applications repetitively run the same algorithm on multiple images. If the configuration of the FPGA is shared by the execution of the application on multiple images, the negative impact of the reconfiguration time is reduced. Supposing that each benchmark processes successive frames at a rate of 25 images/second (25 Hz), Figure 5 shows the speed-up of the reconfigurable coprocessor for the FPGA microarchitectures that ideally achieve higher performance than the Pentium III. These measurements are given as percentages of the maximum performance improvements (see Figure 2). Our analysis supposes that the reconfiguration process is static and is made before Image Processing starts. Additionally, we suppose a loading time of 11,6 µs per CLB; this is a realistic value taken from Virtex FPGAs. Examining the variation of the continuous processing time needed to reach a real speed-up of 1 across the microarchitectures and benchmarks, we observe that it correlates with the maximum speed-up in the same way as the CLB count.
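The amortization argument of this section can be sketched as a formula (our own simplified model under the stated assumptions: static reconfiguration, 11,6 µs per CLB, and a run of continuous processing over which the one-off configuration time is spread):

```python
def real_speedup(ideal_speedup, clbs, processing_seconds,
                 seconds_per_clb=11.6e-6):
    """Speed-up over the CPU once the one-off FPGA configuration time
    is charged against `processing_seconds` of continuous processing."""
    reconfig = clbs * seconds_per_clb            # one-off loading time
    cpu_time = processing_seconds * ideal_speedup
    return cpu_time / (processing_seconds + reconfig)
```

For example, a hypothetical 10.000-CLB design loads in about 0,116 s; if the continuous processing time equals the loading time, the real speed-up is exactly half the ideal one, and larger CLB counts push the break-even processing time up, consistent with the correlation noted above.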
5: Analysis of Memory System Performance
5.1: Impact of Memory Banks
From an architectural point of view, the motivation for memory banks is higher memory bandwidth. Each FPGA microarchitecture Ax requires a different number of local memory modules. For each benchmark, Figure 6 presents the bank count for the FPGA microarchitectures that provide performance improvement. Except in the case of ME, the total size of the local memory was kept fixed (EDGE, FDCT: 128 KB; ATR: 192 KB). In these cases, the total memory size depends on the image size: larger images would require larger memory. However, if the bank count is kept fixed, the memory size has no impact on the performance. Similarly to our results, [14] also found that the size of the data cache needed to exploit the reuse in superscalar processors depends on the image size.
Examining the performance improvement achieved by FPGA coprocessors, we observe that it is mainly due to data-level parallelism. The reduction of processing time comes from the increased number of parallel data-paths that operate on reduced data sets, which are stored in local memory banks. The ME application achieves performance improvement by replicating the reference block: each matching block is stored in an independent memory bank and assigned to a data-path in the FPGA. So, the size of the local memory for ME ranges from 1,25 KB to 65 KB. For this application, then, performance improvement is achieved by increasing both the total memory size and the bank count. Nevertheless, ME exhibits fewer benefits than the other benchmarks when the bank count is fixed.
Additionally, Figure 6 shows that the Image-Processing kernels demand different bank counts for a fixed level of performance. We observe that this corresponds to the variations of the CPIs for memory instructions on the Pentium III (see Table 2). ME exhibits the lowest CPI for memory instructions and demands the highest bank count; for ATR, we found the opposite. Therefore, the addition of memory banks improves performance by reducing the memory stall time exhibited by general-purpose processors. This improvement is linear with the bank count, as shown in Figure 6.
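The architectural motivation stated at the start of this section can be quantified with a trivial model (our own sketch; it assumes every bank serves one access per cycle in parallel, which is the idealized behavior the banked local memory is designed for):

```python
def local_memory_bandwidth(banks, bytes_per_access, clock_mhz):
    """Aggregate local-memory bandwidth (MB/s) when every bank serves
    one access per clock cycle in parallel with the others."""
    return banks * bytes_per_access * clock_mhz
```

A single 2-byte-wide bank at 100 MHz delivers 200 MB/s, while 200 such banks deliver 40.000 MB/s: bandwidth grows linearly with the bank count, which mirrors the linear relation between bank count and speed-up observed in Figure 6.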
5.2: Impact of Host Bus Bandwidth
Real coprocessors require an initial stage in which data are transmitted from the host memory to the coprocessor memory. So, the limited bandwidth of the host bus may degrade the performance improvement. In Figure 7, we present results for the impact of the host bus bandwidth on the speed-up of FPGA coprocessors relative to the base system with the Pentium III. Now, the computation of the speed-up adds the latency of data transmission to the cycle count for the FPGA; the reconfiguration time is not considered, although it would further degrade the performance improvement as described earlier. A minimum bus bandwidth of 400 MB/s is needed to get any performance improvement. With 1 GB/s, all applications see a speed-up reduction that is higher than 50%. A 10 GB/s system bus allows the speed-up to reach 70% of the maximum speed-up.
We observe that EDGE and FDCT require a higher minimum bandwidth to exhibit performance improvement than the remaining benchmarks. EDGE and FDCT operate on images and their results are images; thus, the higher impact of the host bandwidth is due to the transmission of complete images in both directions, from main memory to local memory and vice versa. On the other hand, ME and ATR operate on images but their results are small data structures, so these applications spend less time in data transmission. Therefore, the microarchitectures that ideally provide higher performance improvement require higher system bus bandwidth to reach the same real performance improvement. For larger images, the bandwidth must be increased proportionally in order to exhibit the same performance improvement. For example, taking the worst case (A2 for FDCT) and 1024x1024 2-byte images, we found that a speed-up equal to 1 requires a host bandwidth of 6,2 GB/s.

Figure 5. Impact of the reconfiguration time on performance of the reconfigurable coprocessor (percentage of the maximum speed-up versus continuous processing time in seconds, at 25 Hz, for microarchitectures uA1-uA5 of EDGE 256x256, ME 24x24, FDCT 256x256 and ATR 256x256)

Figure 6. Impact of the memory bank count (bank count versus speed-up, both on logarithmic scales from 1 to 1000, for EDGE, FDCT, ME and ATR)

6: Summary
6.1: FPGA-based Coprocessor Design
We have provided a quantitative analysis of three important parameters of FPGA coprocessors for supporting Image Processing:
- Hardware capacity. A moderate FPGA capacity of 10E+5 CLBs was found to provide two orders of magnitude of performance improvement over a Pentium III for most of our benchmarks.
- Memory bank count. We found performance improvement when the bank count of the coprocessor memory is increased. About 200 memory banks of 256 bytes can provide the maximum speed-up.
- Bus bandwidth. When the data transmission between host memory and local memory is considered, we found that the maximum speed-up depends on the size of the images. Taking images with 256 x 256 pixels and benchmarks that can exhibit a performance improvement of two orders of magnitude, the available bandwidth must be as high as 30 GB/s.
6.2: Conclusions
We can explain why some currently available FPGA coprocessors do not provide the achievable two orders of magnitude of performance improvement for some Image-Processing applications. This conclusion was verified using the PCI bus plug-in board called RC1000-PP. It has one Xilinx Virtex XCV1000 FPGA with four banks of local memory [2]. The FPGA device can provide a hardware complexity equivalent to 12288 CLBs, i.e. one million gates. The maximum bandwidth of the PCI system bus is 132 MB/s. So, the maximum achievable speed-up is 7,8X for ME and 2X for ATR (see Figure 7); for the remaining benchmarks, the limited bandwidth does not allow any performance improvement. The four banks of memory and the one million gates would otherwise achieve a factor of 2,8X (FDCT) to 9,4X (ATR) performance improvement. ME requires a minimum of eight memory banks to see a factor of 2,3X performance improvement, so the four memory banks supported by the RC1000-PP are not sufficient to get a minimum speed-up. Therefore, a data bandwidth that is inappropriate both for the local reconfigurable data-paths and for the system bus can enormously degrade the performance improvement achievable by current FPGAs.

Figure 7. Performance of the reconfigurable coprocessor taking the system bus bandwidth into account (speed-up, on a logarithmic scale from 0,1 to 100, versus system bus bandwidth from 0 to 1000 MB/s, for microarchitectures uA1-uA5 of EDGE 256x256, ME 24x24, FDCT 256x256 and ATR 256x256)
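The bus-limited behavior behind these RC1000-PP numbers follows from charging the host-bus transfer time against the compute time. The sketch below is our own simplified version of the Section 5.2 model (it assumes the transfer and the computation are not overlapped, and the byte count and times are illustrative):

```python
def bus_limited_speedup(cpu_seconds, fpga_seconds, transfer_bytes,
                        bus_mb_per_s):
    """Speed-up over the CPU once the input/output data must first
    cross the host bus (transfer not overlapped with computation)."""
    transfer = transfer_bytes / (bus_mb_per_s * 1e6)
    return cpu_seconds / (fpga_seconds + transfer)

# A 128 KB image over a 132 MB/s PCI bus costs about 1 ms - already more
# than many of the FPGA compute times, which caps the achievable speed-up
# far below the ideal value regardless of how fast the FPGA computes.
```

As the example comment suggests, once the transfer term dominates, adding FPGA parallelism no longer helps; only a faster host bus (or overlapping transfers with computation) raises the ceiling.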
Acknowledgement The author thanks Daniel Herrera and the referees for helpful feedback in preparing this paper.
References
1. D. Benitez; Modular Architecture for Custom-Built Systems Oriented to Real-Time Computer Vision: Application to Color Recognition; J. Systems Architecture, 42(8):709-723, 1997
2. Celoxica; celoxica.com
3. T.J. Callahan, J.R. Hauser, J. Wawrzynek; The Garp Architecture and C Compiler; IEEE Computer, 33(4):62-69, 2000
4. Y. Chou, P. Pillai, H. Schmit, J.P. Shen; PipeRench Implementation of the Instruction Path Coprocessor; Int. Symp. on Microarchitecture, pp. 147-158, 2000
5. T.M. Conte, P.K. Dubey, M.D. Jennings, R.B. Lee, A. Peleg, S. Rathnam, M.S. Schlansker, P. Song, A. Wolfe; Challenges to Combining General-Purpose and Multimedia Processors; IEEE Computer, 30(12):33-37, 1997
6. K. Diefendorff, P.K. Dubey; How Multimedia Workloads Will Change Processor Design; IEEE Computer, 30(9):43-45, 1997
7. S.C. Goldstein, H. Schmit, M. Budiu, S. Cadambi, M. Moe, R.R. Taylor; PipeRench: A Reconfigurable Architecture and Compiler; IEEE Computer, 33(4):70-77, 2000
8. Celoxica; Handel-C Language Reference Manual, 1998
9. R. Hartenstein; Reconfigurable Computing: a New Business Model - and its Impact on SoC Design (invited keynote); DSD'2001, Warsaw, Poland, September 2001
10. T. Komarek, P. Pirsch; Array Architectures for Block Matching Algorithms; IEEE Trans. Circuits & Systems, 36:1301-1308, 1989
11. C. Lee, M. Potkonjak, W. Mangione-Smith; MediaBench: A Tool for Evaluating and Synthesizing Multimedia and Communications Systems; Int. Symp. on Microarchitecture, pp. 330-335, 1997
12. T. Miyamori, K. Olukotun; A Quantitative Analysis of Reconfigurable Coprocessors for Multimedia Applications; IEEE Symp. on FCCM, 1998
13. W. Pratt; Digital Image Processing, 2nd ed.; John Wiley, 1991
14. P. Ranganathan, S.V. Adve, N.P. Jouppi; Performance of Image and Video Processing with General-Purpose Processors and Media ISA Extensions; ISCA-26, ACM Computer Architecture News, 27(2):124-135, 1999
15. N. Slingerland, A. Smith; Cache Performance for Multimedia Applications; 15th Intl. Conf. on Supercomputing, pp. 204-217, 2001
16. H. Singh, M. Lee, G. Lu, F.J. Kurdahi, N. Bagherzadeh, E.M. Chaves Filho; MorphoSys: An Integrated Reconfigurable System for Data-Parallel and Computation-Intensive Applications; IEEE Trans. Computers, 49(5):465-481, 2000
17. J. Villasenor, B. Hutchings; The Flexibility of Configurable Computing; IEEE Signal Processing Magazine, 15(5):67-84, 1998
18. S. Vassiliadis, S. Wong, S. Cotofana; The MOLEN ρμ-coded Processor; Proc. 11th Int. Conf. on FPL, pp. 275-285, 2001
19. Xilinx; Virtex-EM FIR Filter for Video Applications; Xilinx Application Note XAPP241, 2000 (xilinx.com)
20. Z.A. Ye, A. Moshovos, S. Hauck, P. Banerjee; CHIMAERA: A High-Performance Architecture with a Tightly-Coupled Reconfigurable Functional Unit; ISCA-27, pp. 225-235, 2000