Color Space Conversion for MPEG decoding on FPGA-augmented TriMedia Processor

Mihai Sima†‡, Stamatis Vassiliadis†, Sorin Cotofana†, Jos T.J. van Eijndhoven‡
† Delft University of Technology, Delft, The Netherlands
‡ Philips Research Laboratories, Eindhoven, The Netherlands
[email protected]
http://ce.et.tudelft.nl/~mihai

Abstract

A case study on Color Space Conversion (CSC) for MPEG decoding, carried out on an FPGA-augmented TriMedia processor, is presented. That is, a transform from the Y'CbCr color space to the R'G'B' color space is addressed. First, we outline the extension of the TriMedia architecture consisting of FPGA-based Reconfigurable Functional Units (RFU) and associated instructions. Then we analyse a CSC (RFU-specific) instruction which can process four pixels per call, and propose a scheme to implement the CSC operation on RFU(s). When mapped on an ACEX EP1K100 FPGA, the proposed CSC exhibits a latency of 10 and a recovery of 2 TriMedia@200 MHz cycles, and occupies 57% of the device. By configuring the CSC facility on the RFU(s) at application load-time, color space conversion can be computed on FPGA-augmented TriMedia with a 40% speed-up over the standard TriMedia.

1. Introduction

Enhancing a general-purpose processor with a reconfigurable core is a common theme addressed by computer architects [8, 15, 2]. The basic idea of this approach is to exploit both the processor flexibility to achieve medium performance for a large class of applications, and the FPGA capability to implement application-specific computations. An instance of such an enhanced processor is the TriMedia+FPGA hybrid [11], on which the user is given the freedom to define and use any computing facility subject to the FPGA size and TriMedia organization. Several applications implemented on this hybrid, e.g., Inverse Discrete Cosine Transform [10] and Entropy Decoding [12], showed promising results. In this paper, we address color space conversion, which is carried out at the end of the MPEG decoding process. The last stage of MPEG decoding consists of a color space conversion, which is a linear transform from the Y'CbCr color space to the R'G'B' color space. Since this transform exhibits large data- and instruction-level parallelism, it can be implemented on TriMedia with very high efficiency. Obtaining improvements for a task having a computational pattern which TriMedia has been optimised for is indeed challenging. In this paper we demonstrate that significant speed-up for Y'CbCr-to-R'G'B' color space conversion can be achieved on FPGA-enhanced TriMedia over standard TriMedia. The main idea is to configure a pipelined Color Space Converter (CSC) on the FPGA and to unroll the software loop issuing a CSC operation such that the penalty associated with firing up and flushing the CSC pipeline is reduced. In particular, we provide configurable-hardware support for a CSC operation which can

process four pixels per call. When mapped on an ACEX EP1K100 FPGA from Altera, the computing unit performing the CSC operation has a latency of 10 and a recovery of 2 TriMedia@200 MHz cycles, and occupies 57% of the device. The simulations carried out on a TriMedia cycle-accurate simulator indicate that, by configuring the CSC unit on the FPGA at application load-time, Y'CbCr-to-R'G'B' color space conversion can be computed on the extended TriMedia 40% faster than on the standard TriMedia. Given the fact that the experimental TriMedia is a 5 issue-slot 64-bit VLIW processor with a very rich multimedia instruction set [14], such an improvement within the target media processing domain indicates that the TriMedia+FPGA hybrid is a promising approach with respect to color space conversion. Summarizing, the paper contributions are:

- The syntax and the semantics of the CSC user-defined operation.
- The CSC computing facility implementation on an ACEX EP1K100 FPGA from Altera.
- A high-performance implementation of color space conversion on FPGA-augmented TriMedia.

The paper is organized as follows. We present several issues related to color space conversion in Section 2. Section 3 outlines the TriMedia architectural extension that incorporates the reconfigurable array. The FPGA-based implementation of a color space converter and its associated instructions are discussed in Section 4. The experimental framework and the results are presented in Section 5. Section 6 concludes the paper.

2. Background

MPEG is a digital compression standard for multimedia [6, 1]. As depicted in Figure 1, the last stage of the MPEG decoding process is a Y'CbCr-to-R'G'B' color space conversion. That is, gamma-corrected red, green, and blue (R', G', and B') are computed from a luminance-related quantity, Y', and two color-related quantities, Cb and Cr.

Figure 1. The video decoding process (Coded Data -> MPEG Decoding -> Y'CbCr video data -> Color Space Conversion -> R'G'B' video data -> Display)

To make the presentation self-contained, we will address some issues related to color space conversion, which is carried out at the end of the MPEG-2 [5] decoding process.

2.1. Color space conversion

According to the Trichromatic Theory, it is possible to match all of the colors in the visible spectrum by appropriate mixing of three primary colors. Which primary colors are used is not important as long as mixing two of them does not produce the third. For display systems that emit light, the Red-Green-Blue (RGB) primary system is used. A color space is a mathematical representation of a set of colors. In the sequel, we will present two color spaces: R'G'B' and Y'CbCr.

2.1.1. R'G'B' color space

Film, video, and computer-generated imagery all start with red, green, and blue intensity components. In video and computer graphics, the nonlinearity of the CRT monitor is compensated by applying a nonlinear transfer function to RGB intensities to form Gamma-Corrected Red, Green, and Blue (R'G'B'). The gamma-corrected red, green, and blue are defined on a relative scale from 0 to 1.0, chosen such that shades of gray are produced when E'_R = E'_G = E'_B, where E'_X denotes the analog gamma-pre-corrected signal associated with the primary X color. In digital video, the analog signal is uniformly quantized on 8 bits, so that 256 equally spaced quantization levels are specified. Coding range in computing has a de facto standard excursion, 0 to 255. Studio video provides footroom below the black code and headroom above the white code; its range is standardized from 16 to 235. However, values less than 16 and greater than 235 are allowed in order to accommodate the transients that result from filtering.

2.1.2. Y'CbCr color space

The data capacity accorded to color information in a video signal can be reduced as follows. First, R'G'B' is transformed into a luminance-related quantity called luma (Y'), and two color-difference components called chroma (Cb, Cr) [7]. Since the human visual system has poor color acuity, the color detail can then be reduced by subsampling (lowpass filtering) without the viewer noticing. The Y'CbCr color space was developed as part of ITU-R Recommendation BT.601 [4]. All components are represented as 8-bit unsigned integers. Y' is defined to have a nominal range of 16 to 235; Cb and Cr are defined to have a range of 16 to 240, with 128 equal to zero. It is Y', Cb, Cr values that are coded inside an MPEG string. It is worth mentioning that Y', Cb, Cr are represented as 16-bit signed integers after motion compensation.

2.1.3. Y'CbCr-to-R'G'B' conversion

If the gamma-corrected RGB data has a range of 0 to 255, as is commonly found in computer systems, the following equations describe the Y'CbCr-to-R'G'B' conversion:

  R' = 1.164 (Y' - 16) + 1.596 (Cr - 128)
  G' = 1.164 (Y' - 16) - 0.813 (Cr - 128) - 0.391 (Cb - 128)
  B' = 1.164 (Y' - 16) + 2.018 (Cb - 128)                          (1)

Even though Y' is defined to have a range of 16 to 235, while Cb and Cr have a range of 16 to 240, sample values outside the above-mentioned ranges may occasionally occur at the output of the MPEG-2 decoding process according to ITU-T Recommendation H.262 [5]. As a consequence, R'G'B' values must be saturated at the 0 and 255 levels after conversion to prevent any overflow and underflow errors. In connection with the subsequent experiment, we would like to mention that the mapping defined by Equation set 1 will benefit from configurable hardware support.

2.2. Y'CbCr sampling format conversion

As mentioned, since the eye is less sensitive to color information than to brightness, the chroma channels can have a lower sampling rate than the luma channel without a dramatic degradation of the perceptual quality. In MPEG, 2:1 horizontal downsampling with 2:1 vertical downsampling is

employed. That is, the Cb and Cr pixels lie between the Y' pixels on every other pixel on both the horizontal and vertical lines. Thus, a two-dimensional 2-fold upsampling has to be carried out before the proper color space conversion. The simplest upsampling method employs a zero-order hold. That is, each and every pixel is replicated to East, South-East, and South, as depicted in Figure 2. Consequently, no additional processing is needed for upsampling, at the expense of a poor frequency response characteristic. Indeed, since the zero-order hold does not possess a sharp cutoff frequency response characteristic, being a poor anti-image filter, it passes undesirable image components [13].

Figure 2. Two-dimensional 2-fold upsampling by replication

The filtering process is beyond the paper scope. Thus, we will consider that a zero-order hold is used, although this solution does not provide satisfactory image quality. Synthesizing, the following stages have to be performed in the process of color space conversion carried out at the end of MPEG decoding:

1. Chroma upsampling;
2. Y'CbCr-to-R'G'B' linear transform.
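The zero-order hold is trivial to express in software; the following sketch (our illustration, with a hypothetical function name) replicates each chroma sample to its East, South, and South-East neighbours, doubling both dimensions:

```c
#include <stddef.h>
#include <stdint.h>

/* Two-dimensional 2-fold upsampling by replication (zero-order hold).
 * src holds w x h chroma samples; dst must hold (2*w) x (2*h) samples. */
static void upsample2x_zoh(const int16_t *src, int16_t *dst,
                           size_t w, size_t h) {
    for (size_t y = 0; y < h; y++) {
        for (size_t x = 0; x < w; x++) {
            int16_t s = src[y * w + x];
            size_t i = (2 * y) * (2 * w) + 2 * x;
            dst[i]             = s;   /* original site */
            dst[i + 1]         = s;   /* East          */
            dst[i + 2 * w]     = s;   /* South         */
            dst[i + 2 * w + 1] = s;   /* South-East    */
        }
    }
}
```

As noted above, this filter has a poor frequency response; it is considered here only because it needs no extra processing.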

Before presenting the FPGA-based implementation of the CSC computing facility and its associated user-defined instruction, we will outline the architectural extension of the experimental TriMedia processor.

3. Architectural extension for TriMedia

From the TriMedia family, TriMedia/CPU64 will be used as the experimental platform in all the subsequent tests. TriMedia/CPU64 is a processor model whose architecture features a rich instruction set optimized for media processing. Specifically, it is a 5 issue-slot 64-bit VLIW core, launching a long instruction every clock cycle [14]. It has a uniform 64-bit wordsize through all functional units, the register file, load/store units, on-chip highway, and external memory. Each of the five operations in a single VLIW instruction can in principle read two register arguments and write one register result. The processor also supports 2-slot operations, or super-operations. Such a super-operation occupies two adjacent slots in the VLIW instruction, and maps to a double-width functional unit. This way, operations with more than 2 arguments and one result are possible. The architecture supports subword parallelism: for example, operations on 8-bit unsigned integer vectors, or on 16-bit signed integer vectors, are possible. Following the methodology described in [10, 12], the TriMedia processor can be augmented with one or more FPGA-based Reconfigurable Functional Units (RFU). An RFU is embedded into the TriMedia as any other hardwired functional unit, i.e., it receives instructions from the instruction decoder, and reads its input arguments from and writes the computed values back to the register file. Even though only 2-slot operations are supported by the current TriMedia simulator, we propose to extend the concept of super-operations and provide RFUs on which up to 5-slot operations can be executed. This extension will be very useful when vectorial operations are mapped on the configurable hardware. In order to use an RFU, new instructions are provided: SET and EXECUTE. Loading a new configuration into an RFU is performed under the command of a SET instruction, while EXECUTE

(generic) instructions launch the operations performed by the computing resources configured on the FPGA. With such an extension, the user is given the freedom to define and use any computing facility subject to the FPGA size and TriMedia organization. For more details regarding this issue we refer the reader to [11]. Several considerations about the latency of an RFU-configured computing resource are worth providing. Due to layout constraints, the RFU is likely to be located far away from the Register File (RF) in the floorplan of the TriMedia. The immediate effect is that there will be large delays in transferring data between the RFU and the RF, and the RFU will not benefit from the bypassing capabilities of the RF [14]. Consequently, read and write-back cycles have to be explicitly provided. In such circumstances, the latency of an RFU-based computing resource is composed of 1 cycle for read, the number of cycles corresponding to the FPGA delay, and 1 cycle for write-back. For the subsequent experiment, two instances of the TriMedia+FPGA hybrid are considered:

1. TriMedia @ 200 MHz augmented with a single RFU, which can run at maximum one half of the TriMedia clock frequency, that is, 100 MHz.
2. TriMedia @ 200 MHz augmented with two RFUs, each running at maximum one quarter of the TriMedia clock frequency, that is, 50 MHz.

In the sequel, the FPGA-based implementation of a CSC computing unit and its associated instruction are presented.

4. CSC custom instruction and computing unit

Since three values (red, green, and blue) are to be computed for each pixel, we propose to provide configurable-hardware support for a 3-slot CSC operation which reads the Y'CbCr triplet and returns the R'G'B' triplet:

    CSC  Y', Cb, Cr -> R', G', B'

where Y', Cb, Cr, R', G', B' are all 64-bit registers. Subject to the FPGA logic capacity and the number of FPGA I/O pins, a different number of pixels can be processed in parallel. Given the fact that the luma and chroma are represented as 16-bit signed integers, and gamma-corrected red, green, and blue are represented as 8-bit unsigned integers, at most four pixels can be processed in parallel. Indeed, the CSC is a 4-way SIMD operation which transforms three 16-bit signed integer vectors (Y', Cb, Cr) into three 8-bit unsigned integer vectors (R', G', B'). This translates to a number of 3 x 4 x 16 + 3 x 4 x 8 = 288 I/O pins, which is acceptable for most FPGAs in general, and for the ACEX EP1K100 device in particular. Since the current TriMedia simulator does not support super-operations on more than 2 slots, our 3-slot CSC operation has to be emulated by sequences of 1- and/or 2-slot operations. Therefore, we define two 2-slot CSC instructions: CSC_R, which performs the proper color space conversion and returns only the red information, and CSC_GB, which returns the green and blue information:

    CSC_R   Y', Cb, Cr -> R'
    CSC_GB             -> G', B'
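The semantics of CSC_R can be modelled in portable C as follows; this is a behavioural sketch of ours, not the RFU implementation. Four 16-bit signed elements are unpacked from each 64-bit register argument, converted per Equation set 1 with saturation, and the four 8-bit R' results are packed into the low bytes of the 64-bit result. CSC_GB would follow the same pattern for G' and B'.

```c
#include <stdint.h>

/* Saturate at the 0 and 255 levels. */
static uint8_t sat8(int v) { return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v; }

/* i-th 16-bit signed element of a 64-bit vector register. */
static int16_t elem16(uint64_t v, int i) { return (int16_t)(v >> (16 * i)); }

/* Behavioural model of the 2-slot CSC_R operation: four Y'CbCr triplets in,
 * four 8-bit unsigned R' values out. */
static uint64_t csc_r(uint64_t y, uint64_t cb, uint64_t cr) {
    uint64_t r = 0;
    (void)cb; /* R' does not depend on Cb in Equation set 1 */
    for (int i = 0; i < 4; i++) {
        int R = (int)(1.164 * (elem16(y, i) - 16)
                    + 1.596 * (elem16(cr, i) - 128) + 0.5);
        r |= (uint64_t)sat8(R) << (8 * i);
    }
    return r;
}
```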

Figure 3. The CSC implementation on FPGA (the Roman numerals indicate the pipeline stages). Stage I clips the 16-bit signed Y', Cb, Cr inputs to 8-bit unsigned values. Stage II multiplies them by the fixed-point coefficients 4a8h (1.164), 662h (1.596), 341h (0.813), 190h (0.391), and 812h (2.018). Stage III performs rounding and quantization back to 10-bit unsigned values. Stage IV adds and subtracts the partial products together with the folded constants 0dfh (223), 115h (277), and 088h (136), and clips the 12-bit signed results to the 8-bit unsigned R', G', B' outputs.

We have to emphasize that this approach is carried out only for experimental purposes. Fortunately, our choice does not generate overhead, since it is easier to schedule a single 3-slot instruction than multiple 1- and/or 2-slot instructions. The FPGA-based CSC implementing Equation set 1 is presented in Figure 3. By writing RTL-level VHDL code, we succeeded in identifying a four-stage pipelined implementation which can run at 100 MHz on the ACEX EP1K100 device. Adding the penalty of the extra read and write-back cycles for an RFU-based operation, the CSC has a latency of 10 and a recovery of 2 TriMedia@200MHz cycles if an RFU@100MHz is considered. For an RFU@50MHz, two pipeline stages can be merged into one, which translates into a CSC having a latency of 10 and a recovery of 4.
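The fixed-point datapath of Figure 3 can be cross-checked in plain C. The coefficient encodings (4a8h = 1192, i.e., 1.164 with 10 fractional bits, and so on) and the folded constants 223, 277, and 136 are taken from the figure; the exact rounding scheme is our assumption (round-to-nearest via (x + 512) >> 10), so results may differ from Equation set 1 by one least-significant bit:

```c
#include <stdint.h>

static uint8_t sat8(int v) { return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v; }

/* Assumed rounding & quantization stage: drop the 10 fractional bits
 * introduced by the coefficients, rounding to nearest. */
static int rq(int x) { return (x + 512) >> 10; }

/* Fixed-point Y'CbCr -> R'G'B' with the Figure 3 coefficients:
 * 4a8h = 1192 (1.164), 662h = 1634 (1.596), 341h = 833 (0.813),
 * 190h = 400 (0.391), 812h = 2066 (2.018), offsets 223, 136, 277. */
static void csc_fixed(uint8_t y, uint8_t cb, uint8_t cr,
                      uint8_t *r, uint8_t *g, uint8_t *b) {
    int ly = rq(1192 * y);               /* 1.164 * Y' */
    *r = sat8(ly + rq(1634 * cr) - 223);
    *g = sat8(ly - rq(833 * cr) - rq(400 * cb) + 136);
    *b = sat8(ly + rq(2066 * cb) - 277);
}
```

The offsets fold the -16 and -128 biases of Equation set 1 into single constants: 1.164 x 16 + 1.596 x 128 is approximately 223 for R', and similarly 136 and 277 for G' and B'.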

5. Experimental results

For the first extended TriMedia instance, a CSC computing unit having a latency of 10 and a recovery of 2 cycles is configured on the RFU@100MHz, while a CSC computing unit with a latency of 10 and a recovery of 4 cycles is configured on each of the two RFUs@50MHz in the second extended TriMedia instance. That is, a lower pipeline frequency at the expense of a double-size FPGA is the trade-off of the second instance. To perform color space conversion for an image, calls to CSC are issued within a software loop. The scheduled code when the RFU@100MHz is considered is presented in Figure 4. First, LOAD operations are issued to fetch the pixels in Y'CbCr format from memory. Then, pairs of CSC_R + CSC_GB operations are launched to perform color space conversion, four pixels per call. After eight pixels have been converted, PACK operations reorganize the R'G'B' information in 8-bit unsigned integer vectors. Finally, STORE operations send the results to a display FIFO. According to Figure 4, 16 pixels can be processed with a latency of 25 cycles. In order to keep the pipeline full, back-to-back CSC_R operations are needed. That is, a new CSC_R instruction has to be issued every two cycles (or, every four cycles in the RFU@50MHz-based instance). In

this way, color space conversion can be performed with a throughput of 16/8 = 2 pixels/cycle. Unfortunately, this figure corresponds to the ideal case of infinite loop unrolling, which can never be achieved in practice. For a finite loop unrolling, the overhead associated with firing up and flushing the CSC pipeline has to be taken into consideration. As a rule of thumb, the throughput drops to N x 16 / (latency + (N - 1) x 16 / (ideal throughput)), where N is the number of times the loop is unrolled. For example, the ideal throughput drops to 1.3 pixels/cycle for 4x loop unrolling, and to 0.64 pixels/cycle for a loop which is fully rolled. The same judgement can be carried out for the second, RFU@50MHz-based instance. Since the results are pretty much the same, we will not go into details.

The testing database for both pure-software and FPGA-based color space converters consists of a stream of two 256 x 256 pixel images, for which the Y', Cb, Cr components are stored in separate tables in the main memory. The data is organized as 16-bit signed integer vectors as resulted from motion compensation. The Y'CbCr-to-R'G'B' conversion is done in an SIMD fashion, by sequentially processing four triplets of Y', Cb, Cr values at a time. As mentioned, chroma 2-fold upsampling is performed by means of a zero-order hold; this way, no additional processing is needed. Then, the linear mapping defined by Equation set 1 is carried out, and the result is sent to a display FIFO. Since all the input data is stored in memory, the very first access to it (which is actually the single access in our application) will generate so-called compulsory read cache misses. The sum of the data cache stalls and the instruction cycles needed to perform the proper color space conversion translates to a worst-case scenario, which provides the lower bound of the performance improvements. We will also present the results according to the best-case scenario, in which all cache misses are counted as part of the motion compensation process. That is, improvements in terms of instruction cycles required to perform strictly color space conversion will be reported.

Figure 4. The scheduling result for a CSC unit having the latency of 10 and recovery of 2 (LOAD, CSC_R + CSC_GB, PACK, and STORE operations scheduled over 33 cycles; latency: 25 cycles for 16 pixels; throughput: 16/8 = 2 pixels/cycle)

The reference for evaluating the performance of the color space conversion carried out on FPGA-augmented TriMedia is a pure-software implementation on standard TriMedia [9]. The reference color space converter is implemented as a loop, where each iteration processes 16 pixels. Since
Table 1. Performance figures for Y'CbCr-to-R'G'B' conversion

                                          1 RFU@100MHz                2 RFUs@50MHz
  Experiment                  pure SW   FPGA-CSC4   FPGA-CSC4    FPGA-CSC4   FPGA-CSC4
                              rolled    rolled      unrolled 4x  rolled      unrolled 4x
  Instruction cycles          190,771   190,240     107,821      192,972     109,358
  Instruction cycles / pixel  1.46      1.45        0.82         1.47        0.83
  Pixels / instruction cycle  0.66      0.69        1.22         0.68        1.20
  Instruction-cache stalls    937       697         1,342        702         1,368
  Read data-cache stalls      90,112    90,112      90,112       90,112      90,112
  Issues / cycle              4.81      3.47        3.84         3.35        3.95
  I cycles + read D$ stalls   280,883   280,352     197,933      283,084     199,470
  Instruction cycles / pixel  2.14      2.14        1.51         2.16        1.52
  Pixels / instruction cycle  0.47      0.47        0.66         0.46        0.66

this pure-software implementation is beyond the paper scope, we will not go into further details. However, we still mention that by running our pure-software color space converter on a TriMedia cycle-accurate simulator, we determined that an iteration which processes 16 pixels can be scheduled into 24 cycles, which translates into 0.66 pixels/cycle. It is also worth mentioning that 4.8 of the 5 issue slots are filled in with operations in the pure-software implementation. This result is indeed a challenging reference for the TriMedia+FPGA hybrid. Therefore, our experiment includes two approaches: pure software and FPGA-based. As mentioned, 0.66 pixels/cycle are decoded in the pure software approach, while 2 pixels/cycle can be decoded in the FPGA-based approach if the loop is unrolled an infinite number of times. The configuration of the RFU is carried out at application load time. The Y'CbCr-to-R'G'B' performance evaluation has been carried out considering two FPGA-augmented TriMedia instances: TriMedia + 1 RFU@100 MHz and TriMedia + 2 RFUs@50 MHz. A program has been written in C, and further compiled and scheduled with the TriMedia development tools. To overcome the penalty associated with firing up and flushing the pipeline, two techniques can be employed: (1) loop unrolling, and (2) software pipelining. Since the TriMedia scheduler uses the decision tree as a scheduling unit [3], all operations return their results in the same decision tree in which they are issued, even though the TriMedia architecture does not forbid the contrary. This is the major limiting factor in generating deep software-pipelined loops containing long-latency operations. Hence, only the loop unrolling technique is considered in the sequel. The loop calling the CSC instruction has been manually unrolled different numbers of times. The best results, which correspond to 4x unrolling, are presented in Table 1. We would like to mention that a 2x unrolling does not suffice, since the firing-up and flushing overhead is still large. At the same time, an 8x unrolling generates long decision trees, which in turn translates into reduced performance due to register spilling. As can be easily observed, about the same figures are obtained for both FPGA-augmented TriMedia instances. The speed-up is (0.66 - 0.47) / 0.47 x 100, that is, approximately 40%, according to the worst-case scenario, and (1.22 - 0.66) / 0.66 x 100, approximately 85%, according to the best-case scenario. Finally, we would like to mention that a more realistic average-case scenario would assume a memory access pattern for reducing the number of read cache misses. However, this is subject to optimization at a complete MPEG decoder level, and therefore beyond the paper scope. Thus, we do not have statistically relevant data for the time being. Since we are not able to make a reliable estimation according to the average-case scenario, we proceed to a conservative evaluation

by taking into account the entire number of read cache misses (worst-case scenario), and claim that the FPGA-augmented TriMedia/CPU64 can perform color space conversion 40% faster than the standard TriMedia/CPU64. Given the fact that the experimental TriMedia is a 5 issue-slot VLIW processor with 64-bit datapaths and a very rich multimedia instruction set [14], such an improvement within the target media processing domain indicates that the TriMedia+FPGA hybrid is a promising approach with respect to color space conversion.

6. Conclusions and future work

We have described a Y'CbCr-to-R'G'B' converter on FPGA-augmented TriMedia. For such a task, the lower bound of the performance improvement over the standard TriMedia is 40% in terms of speed. The major lesson learned is that deep pipelines implemented on the RFU can provide significant improvements even for a performant VLIW processor within its target media domain. In future work, we intend to consider performant anti-image filtering in the process of chroma upsampling.

Acknowledgements

This project is supported by the doctoral fellowship RWC-061-PS-99047-ps from Philips Research Laboratories in Eindhoven, The Netherlands.

References

[1] B. G. Haskell, A. Puri, and A. N. Netravali. Digital Video: An Introduction to MPEG-2. Kluwer Academic Publishers, Norwell, Massachusetts, 1996.
[2] J. R. Hauser and J. Wawrzynek. Garp: A MIPS Processor with a Reconfigurable Coprocessor. In IEEE Symposium on FPGAs for Custom Computing Machines, pages 12-21, Napa Valley, California, April 1997. IEEE Computer Society Press.
[3] J. Hoogerbrugge and L. Augusteijn. Instruction Scheduling for TriMedia. Journal of Instruction-Level Parallelism, 1(1), February 1999.
[4] International Telecommunication Union. Studio Encoding Parameters of Digital Television for Standard 4:3 and Wide-Screen 16:9 Aspect Ratios. ITU-R Recommendation BT.601-5, October 1995.
[5] International Telecommunication Union. Information technology - Generic coding of moving pictures and associated audio information: Video. ITU-T Recommendation H.262, February 2000.
[6] J. L. Mitchell, W. B. Pennebaker, C. E. Fogg, and D. J. LeGall. MPEG Video Compression Standard. Chapman & Hall, New York, New York, 1996.
[7] C. Poynton. A Technical Introduction to Digital Video. John Wiley & Sons, January 1996.
[8] R. Razdan and M. D. Smith. A High Performance Microarchitecture with Hardware-Programmable Functional Units. In 27th Annual International Symposium on Microarchitecture - MICRO-27, pages 172-180, San Jose, California, November 1994.
[9] M. Sima. Color Space Conversion on VLIW platforms. Private Communication, August 2002.
[10] M. Sima, S. Cotofana, J. T. van Eijndhoven, S. Vassiliadis, and K. Vissers. 8x8 IDCT Implementation on an FPGA-augmented TriMedia. In K. L. Pocek and J. M. Arnold, editors, IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM 2001), Rohnert Park, California, April 2001. IEEE Computer Society Press.

[11] M. Sima, S. Cotofana, S. Vassiliadis, J. T. van Eijndhoven, and K. Vissers. MPEG Macroblock Parsing and Pel Reconstruction on an FPGA-augmented TriMedia Processor. In A. Jacobs, editor, IEEE International Conference on Computer Design, pages 425-430, Austin, Texas, September 2001. IEEE Computer Society Press.
[12] M. Sima, S. D. Cotofana, S. Vassiliadis, J. T. van Eijndhoven, and K. A. Vissers. MPEG-compliant Entropy Decoding on FPGA-augmented TriMedia/CPU64. In J. Arnold and K. L. Pocek, editors, IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM 2002), pages 261-270, Napa Valley, California, April 2002. IEEE Computer Society Press.
[13] P. Vaidyanathan. Multirate Systems and Filter Banks. Prentice Hall, Englewood Cliffs, New Jersey, 1993.
[14] J. T. J. van Eijndhoven, F. W. Sijstermans, K. A. Vissers, E.-J. D. Pol, M. J. A. Tromp, P. Struik, R. H. J. Bloks, P. van der Wolf, A. D. Pimentel, and H. P. E. Vranken. TriMedia CPU64 Architecture. In Proceedings of International Conference on Computer Design, pages 586-592, Austin, Texas, October 1999. IEEE Computer Society.
[15] R. D. Wittig and P. Chow. OneChip: An FPGA Processor With Reconfigurable Logic. In K. L. Pocek and J. M. Arnold, editors, IEEE Symposium on FPGAs for Custom Computing Machines, pages 126-135, Napa Valley, California, April 1996. IEEE Computer Society Press.
