
A Unified FPGA-Based System Architecture for 2-D Discrete Wavelet Transform Ishmael Sameen, Yoong Choon Chang, Mow Song Ng, Bok-Min Goi & Chee-Pun Ooi Journal of Signal Processing Systems for Signal, Image, and Video Technology (formerly the Journal of VLSI Signal Processing Systems for Signal, Image, and Video Technology) ISSN 1939-8018 Volume 71 Number 2 J Sign Process Syst (2013) 71:123-142 DOI 10.1007/s11265-012-0687-1



Received: 27 September 2011 / Accepted: 3 August 2012 / Published online: 23 August 2012 © Springer Science+Business Media, LLC 2012

I. Sameen (*): Datacenter and Connected Systems Group (DCSG), Intel Microelectronics Sdn. Bhd., Penang, Malaysia
Y. C. Chang, C.-P. Ooi: Faculty of Engineering, Multimedia University, Cyberjaya, Malaysia
M. S. Ng, B.-M. Goi: Faculty of Engineering and Science, Universiti Tunku Abdul Rahman, Kuala Lumpur, Malaysia

Abstract This paper presents a novel unified and programmable 2-D Discrete Wavelet Transform (DWT) system architecture, implemented using a Field Programmable Gate Array (FPGA)-based Nios II soft-core processor working in combination with custom hardware accelerators generated through high-level synthesis. The proposed system architecture, synthesized on an Altera DE3 Stratix III FPGA board, was developed through an iterative design space exploration methodology using Altera's C2H compiler. Experimental results show that the proposed system architecture is capable of real-time video processing performance for grayscale image resolutions of up to 1920×1080 (1080p) when run on the Altera DE3 board, and that it outperforms the existing 2-D DWT architecture implementations known in the literature by a considerable margin in terms of throughput. While satisfying real-time performance constraints, the proposed 2-D DWT system architecture can also perform both forward and inverse DWT, support a number of popular DWT filters used for image and video compression, and provide architecture programmability in terms of the number of decomposition levels as well as image width and height. Based on the design principles used to implement the proposed 2-D DWT system architecture, a system design guideline can be formulated for SoC designs which plan to incorporate dedicated 2-D DWT hardware acceleration.

Keywords Field-programmable gate arrays (FPGAs) . Discrete wavelet transform (DWT) . Design space exploration . High-level synthesis

1 Introduction

In the last two decades, image and video coding standards such as JPEG (Joint Photographic Experts Group), MPEG (Moving Picture Experts Group) 1/2/4 and H.264/AVC have been developed to meet the demanding requirements of today's increasingly interactive multimedia applications. While the JPEG standard has been the de facto image compression standard since 1992, emerging application areas in the rapidly developing field of mobile and internet communications have exposed the inefficiencies of the JPEG standard in those areas. As a result, the JPEG2000 standard was developed to address the inefficiencies of the JPEG standard for still image compression. The features of the JPEG2000 standard are based on the properties of the Discrete Wavelet Transform (DWT), in contrast to the Discrete Cosine Transform (DCT), which is the primary transform technique used by the JPEG standard. The decision to use DWT for the JPEG2000 standard was based on the observation that DWT-based image compression generally offers better subjective image quality than DCT-based image compression, as it does not suffer from the 'blocky' artifacts normally associated with DCT-based image compression.

Two different approaches exist for computing DWT: the convolution-based approach [1] and the lifting-based


approach [2]. The traditional convolution-based approach starts by convolving the signal with an FIR filter bank consisting of low-pass (g[n]) and high-pass (h[n]) analysis filters, as illustrated in Fig. 1. The signal is then downsampled by a factor of two. This approach is clearly inefficient, since half of the computed samples are discarded by the downsampling. To improve the performance of convolution-based DWT, a polyphase matrix representing the filter bank can be used instead, where the signal is first split into its polyphase components before passing through the filter bank. The lifting scheme [2], on the other hand, provides a more efficient approach for computing DWT. The lifting scheme is based on the spatial construction of second-generation wavelets by breaking up the analysis and synthesis filters into a sequence of elementary matrices, which in turn can be converted into a sequence of Predict and Update lifting steps, as shown in Fig. 2. Due to the advantages that the lifting scheme offers over the more traditional convolution-based approaches for performing DWT [3], the lifting-based approach is nowadays preferred over the convolution-based approach for the design and implementation of efficient DWT architectures.
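As a concrete illustration of the Predict and Update steps, the following Python sketch applies one level of lifting to an even-length 1-D signal and then inverts it. It uses the LeGall 5/3 lifting factors (−0.5 and 0.25) quoted in Section 3, in floating point and without the integer rounding used by the hardware. The function names and the index clamping used for symmetric boundary extension are illustrative assumptions, not the paper's code.

```python
def split_even_odd(signal):
    """Split a signal into its even- and odd-indexed samples (polyphase components)."""
    return list(signal[0::2]), list(signal[1::2])


def forward_lift_53(signal):
    """One level of LeGall 5/3 lifting (floating point, even-length input assumed)."""
    even, odd = split_even_odd(signal)
    n = len(odd)
    # Predict: estimate each odd sample from its two even neighbours.
    # The right boundary is mirrored by clamping the index (symmetric extension).
    detail = [odd[i] - 0.5 * (even[i] + even[min(i + 1, n - 1)]) for i in range(n)]
    # Update: correct each even sample using its two neighbouring details.
    # The left boundary is mirrored in the same way.
    smooth = [even[i] + 0.25 * (detail[max(i - 1, 0)] + detail[i]) for i in range(n)]
    return smooth, detail


def inverse_lift_53(smooth, detail):
    """Undo the lifting steps in reverse order with opposite signs, then merge."""
    n = len(detail)
    even = [smooth[i] - 0.25 * (detail[max(i - 1, 0)] + detail[i]) for i in range(n)]
    odd = [detail[i] + 0.5 * (even[i] + even[min(i + 1, n - 1)]) for i in range(n)]
    merged = []
    for e, o in zip(even, odd):
        merged.extend([e, o])
    return merged
```

Because every lifting step only adds a function of the other channel, the inverse is obtained by subtracting the same quantity, which is why the forward and inverse transforms can share one datapath.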

Figure 1 Convolution-based DWT using an FIR filter bank.

2 A Critical Evaluation of Existing DWT Architectures in Published Literature

In spite of the considerable advancements achieved over the last decade in making DWT algorithms efficient for practical real-time applications, the high memory and processing requirements of DWT computation make DWT unsuitable for embedded signal processing applications. To overcome DWT's high memory and processing requirements, a number of VLSI architectures for dedicated DWT processing have been proposed in [4–21]. It should be noted that VLSI architectures for 2-D DWT processing are more complex, since one has to take into account column-wise processing of the image as well as multilevel decomposition of the image.

The major disadvantage of the DWT architectures in [4–21] is their lack of architecture programmability, with the exception of the wavelet processing core architecture proposed in [14]. The architectures in [6, 15, 16, 21] support both CDF 9/7 and LeGall 5/3 lifting; however, dual implementations of these DWT architectures are required to perform both forward and inverse transforms. In other words, these DWT architectures do not have a unified implementation that performs both forward and inverse transforms. As a result, the design costs are considerably high if one desires both dedicated forward and inverse DWT processing capability with support for different filter types in an embedded application.

The lack of architecture programmability, in terms of image resolution, number of decomposition levels and filter mode, is a major issue in the FPGA implementation of the three major 2-D DWT computation schedules by Angelopoulou et al. [7], namely the Row-Column (RC), Line-Based (LB) [11] and Block-Based (BB) [13] architectures. As a result, an FPGA implementation of the RC, LB and BB architectures needs to be generated for each level of decomposition and image resolution. Hence, an ASIC migration of the RC, LB and BB architectures as implemented in [7] would not be feasible for practical embedded applications, as their ASIC versions would be restricted to a specific DWT functionality in terms of number of decomposition levels, image resolution (M × M) and filter mode (forward transform with LeGall 5/3 integer lifting).

Other issues to consider are the high memory capacity and bandwidth requirements for real-time 2-D DWT processing. The LB [11] and BB [13] architectures alleviate these memory performance issues by transferring lines or blocks of the image from off-chip memory to faster on-chip memory, and then performing column-wise DWT processing as soon as a sufficient number of DWT coefficients are made available after row-wise DWT processing on these on-chip buffers. The primary concern is that the output DWT coefficients are not grouped together in their respective sub-bands, due to the in-place nature of the LB and BB algorithms as shown in Fig. 3b. Thus, additional processing may need to be carried out to group the output coefficients together in their respective sub-bands, nullifying the performance gains of LB and BB over the conventional RC approach in terms of memory bandwidth. If block coding is used to encode these DWT coefficients, as in the case of the EBCOT algorithm proposed by Taubman [22] for JPEG2000, utilizing the in-place mapping scheme shown in Fig. 3b instead of contiguous DWT sub-band blocks as shown in Fig. 3a would slow down the overall performance significantly.
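The extra regrouping pass mentioned above can be sketched in Python as follows for a single decomposition level. The sketch assumes the usual in-place lifting layout, in which low-pass coefficients remain at even indices and high-pass coefficients at odd indices in each dimension; the function name is hypothetical.

```python
def regroup_in_place(coeffs):
    """Gather one-level in-place 2-D DWT output into four contiguous sub-band blocks.

    Assumption: in each dimension, even indices hold low-pass coefficients and
    odd indices hold high-pass coefficients (in-place mapping).
    """
    rows_low, rows_high = coeffs[0::2], coeffs[1::2]
    # Within each row, place the even-indexed (low-pass) columns before the odd ones.
    top = [row[0::2] + row[1::2] for row in rows_low]
    bottom = [row[0::2] + row[1::2] for row in rows_high]
    # Rows that were low-pass filtered vertically go on top, high-pass rows below.
    return top + bottom
```

Every coefficient is read and written once more by this pass, which is the additional memory traffic that offsets the bandwidth advantage of the in-place schedules.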


Figure 2 Lifting-based DWT with Predict (P) and Update (U) lifting steps.

Figure 3 2-D DWT sub-band grouping representations: (a) as contiguous sub-band blocks, (b) using the in-place mapping scheme.

The programmable and scalable RISC-based wavelet processing core architecture proposed by Lee and Lim [14] was designed to be software-programmable for various filter modes, thus overcoming the lack of architecture programmability of the previously discussed DWT architectures. A dedicated memory controller for specialized 2-D off-chip DRAM accesses addresses the memory bandwidth issues associated with high-capacity off-chip DDR-SDRAM. The architecture programmability of this wavelet processing core comes at a price, though: to match the performance of dedicated 2-D DWT architectures such as the one proposed by Zhang et al. [21], more wavelet processing engines (PEs) need to be added to the system in conjunction with the dedicated memory controller, incurring expensive hardware costs and high power consumption.

Based on the shortcomings of the DWT architectures discussed so far, there is a need for a programmable 2-D DWT system architecture which is optimized in terms of performance and hardware costs. Hence, the 2-D DWT system architecture proposed in this paper aims to overcome the limitations of the architectures reviewed by using a heterogeneous computing system consisting of a general-purpose microprocessor and specialized hardware accelerators. The general-purpose microprocessor coordinates these specialized programmable hardware accelerators, each designed to accelerate a stage of DWT lifting, to realize architecture programmability when determining the lifting mode required for JPEG2000. Thus, the proposed architecture is able to perform either forward or inverse transform for any specified level of decomposition, support a variety of lifting-based filter modes such as CDF 9/7 lifting and LeGall 5/3 lifting, and accommodate image resolutions up to 2044 × 2044, while maintaining real-time video processing performance for grayscale resolutions up to 1920×1080. The following sections describe the details of the proposed 2-D DWT system architecture through an architectural analysis mapping the 2-D DWT algorithm to its efficient design and hardware implementation.

3 Design of the Basic Lifting Unit

Taking into account the one-sample delay of the odd-indexed samples for the Update step, the computations which define both the Predict and Update lifting steps shown in Fig. 2 can be generalized using the following equation:

\[ y_{\mathrm{new}}[i] = M\,(x[i] + x[i+1]) + y[i], \tag{1} \]

where x = even or odd, y = odd|x=even or even|x=odd (that is, y holds the odd-indexed samples when x holds the even-indexed samples, and vice versa), i = 0, 1, 2, …, (n/2) − 1, and M represents the lifting coefficient. Rather than implementing the sequential hardware logic for both Predict and Update lifting steps as proposed in [15] and [16], we can instead implement the hardware module required to process Eq. 1 while using multiplexing logic to alternate between even- and odd-indexed input coefficients stored in separate memory buffers. The hardware module, implemented as a basic lifting unit, can compute both Predict and Update lifting steps through its coordinated sequential invocation by a general-purpose microprocessor which feeds the appropriate input multiplexor and register values. Thus, the decision to implement the proposed basic lifting unit which can be coordinated by a general-purpose microprocessor is influenced mainly by two criteria:

1. Architecture programmability: the basic lifting unit can be programmed accordingly for forward and inverse transforms with the JPEG2000 lifting filters, based on the input values it receives from the microprocessor.
2. Resource efficiency: since the basic lifting unit can be reused during its sequential invocation to perform both Predict and Update lifting steps, we can achieve almost full hardware utilization, thus minimizing hardware costs.

In order to derive an optimally efficient design for the basic lifting unit, we first need to visualize the lifting operations which would be performed by the proposed basic lifting unit. The fundamental steps required to complete lifting-based 1-D DWT and its inverse are Predict, Update and Scale, as illustrated in Fig. 4. Given that LP(z) and HP(z) are the low-pass ('smooth') and high-pass ('detail') output coefficients respectively, and that Xe(z²) and z⁻¹Xo(z²) represent the even- and odd-indexed components respectively of the input signal X(z), the forward DWT lifting process is represented by:

\[
\begin{bmatrix} LP(z) \\ HP(z) \end{bmatrix}
= \begin{bmatrix} S^{m}(z) \\ D^{m}(z) \end{bmatrix}
= \tilde{P}(z) \begin{bmatrix} S(z) \\ D(z) \end{bmatrix}
= \tilde{P}(z) \begin{bmatrix} X_e(z^2) \\ z^{-1} X_o(z^2) \end{bmatrix}
\tag{2}
\]

where

\[
\tilde{P}(z) = \begin{bmatrix} K & 0 \\ 0 & 1/K \end{bmatrix}
\prod_{k=m}^{1} \left\{
\begin{bmatrix} 1 & Q_{2k}(z) \\ 0 & 1 \end{bmatrix}
\begin{bmatrix} 1 & 0 \\ Q_{2k-1}(z) & 1 \end{bmatrix}
\right\}
\tag{3}
\]

and Q2k(z) and Q2k-1(z) are Laurent polynomials derived from factorization of the polyphase matrix representing the filter using the Euclidean algorithm [2], while K and 1/K are non-zero scaling factors. For the inverse DWT lifting process, the corresponding equation is given by:

\[
\begin{bmatrix} X_e(z^2) \\ z^{-1} X_o(z^2) \end{bmatrix}
= \begin{bmatrix} S(z) \\ D(z) \end{bmatrix}
= P(z) \begin{bmatrix} S^{m}(z) \\ D^{m}(z) \end{bmatrix}
\tag{4}
\]

where

\[
P(z) = \left[ \tilde{P}(z) \right]^{-1}
= \prod_{k=1}^{m} \left\{
\begin{bmatrix} 1 & 0 \\ -Q_{2k-1}(z) & 1 \end{bmatrix}
\begin{bmatrix} 1 & -Q_{2k}(z) \\ 0 & 1 \end{bmatrix}
\right\}
\begin{bmatrix} 1/K & 0 \\ 0 & K \end{bmatrix}.
\tag{5}
\]

Figure 4 Wiring diagrams representing the DWT lifting process: (a) forward DWT lifting, (b) inverse DWT lifting.

As explained earlier in this section, the Predict and Update lifting steps are generally computed in the form of Eq. 1 by the proposed basic lifting unit while alternating between even- and odd-indexed input coefficients stored in their respective memory buffers. We have not yet accounted for the scaling step in our design, which may take up an additional two multipliers in the case of the folded architecture [15]. Since multipliers are typically expensive logic resources, it is desirable to minimize the number of multipliers required for the proposed design. This can be achieved by enabling the proposed basic lifting unit to perform implicit scaling on both even- and odd-indexed coefficients during the Update and Predict lifting steps respectively. The


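reduction in the number of multipliers achieved by incorporating the scaling step implicitly into the Predict and Update lifting steps is made clear by the following mathematical manipulation of the polyphase matrices for CDF 9/7 DWT lifting. Applying the lifting scheme factorization given in Eq. 2 to the polyphase matrix of the CDF 9/7 analysis filter, as shown by Daubechies and Sweldens [2], leads to the following result:

\[
\tilde{P}(z) =
\begin{bmatrix} K & 0 \\ 0 & 1/K \end{bmatrix}
\begin{bmatrix} 1 & d(1+z) \\ 0 & 1 \end{bmatrix}
\begin{bmatrix} 1 & 0 \\ c(1+z^{-1}) & 1 \end{bmatrix}
\begin{bmatrix} 1 & b(1+z) \\ 0 & 1 \end{bmatrix}
\begin{bmatrix} 1 & 0 \\ a(1+z^{-1}) & 1 \end{bmatrix},
\]

where a = −1.586134342, b = −0.0529801185, c = 0.882911076, d = 0.443506852 and K = 1.149604398. The corresponding polyphase matrix factorization of the CDF 9/7 synthesis filter after applying Eq. 4 gives

\[
P(z) =
\begin{bmatrix} 1 & 0 \\ -a(1+z^{-1}) & 1 \end{bmatrix}
\begin{bmatrix} 1 & -b(1+z) \\ 0 & 1 \end{bmatrix}
\begin{bmatrix} 1 & 0 \\ -c(1+z^{-1}) & 1 \end{bmatrix}
\begin{bmatrix} 1 & -d(1+z) \\ 0 & 1 \end{bmatrix}
\begin{bmatrix} 1/K & 0 \\ 0 & K \end{bmatrix}.
\]

The \(\tilde{P}(z)\) and P(z) factorizations can be re-formulated as follows:

\[
\tilde{P}(z) =
\begin{bmatrix} K & dK^{2}(1+z) \\ 0 & 1 \end{bmatrix}
\begin{bmatrix} 1 & 0 \\ \frac{c}{K}(1+z^{-1}) & 1/K \end{bmatrix}
\begin{bmatrix} 1 & b(1+z) \\ 0 & 1 \end{bmatrix}
\begin{bmatrix} 1 & 0 \\ a(1+z^{-1}) & 1 \end{bmatrix},
\]

where dK² = 0.586134341 and c/K = 0.7680129595, and

\[
P(z) =
\begin{bmatrix} 1 & 0 \\ -a(1+z^{-1}) & 1 \end{bmatrix}
\begin{bmatrix} 1 & -b(1+z) \\ 0 & 1 \end{bmatrix}
\begin{bmatrix} 1 & 0 \\ -c(1+z^{-1}) & K \end{bmatrix}
\begin{bmatrix} 1/K & -dK(1+z) \\ 0 & 1 \end{bmatrix},
\]

where dK = 0.5098574276. In JPEG2000 lossless compression, the LeGall 5/3 lifting scheme is employed, where scaling is not actually required. The polyphase factorizations of the LeGall 5/3 analysis and synthesis filters are given as follows:

\[
\tilde{P}(z) =
\begin{bmatrix} 1 & 0.25(1+z) \\ 0 & 1 \end{bmatrix}
\begin{bmatrix} 1 & 0 \\ -0.5(1+z^{-1}) & 1 \end{bmatrix},
\qquad
P(z) =
\begin{bmatrix} 1 & 0 \\ 0.5(1+z^{-1}) & 1 \end{bmatrix}
\begin{bmatrix} 1 & -0.25(1+z) \\ 0 & 1 \end{bmatrix}.
\]

By applying the inverse z-transform to Eqs. 2 and 4 using the re-formulated \(\tilde{P}(z)\) and P(z) factorizations given above, we can generalize the fundamental lifting step to be performed by the proposed basic lifting unit into the following equation:

\[ y_{\mathrm{new}}[i] = M\,(x[i] + x[i+1]) + S\,y[i], \tag{6} \]

where S is the scale factor applied to y[i]; setting S = 1 recovers Eq. 1.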

Thus, the datapath of the proposed basic lifting unit shown in Fig. 5 is constructed based on Eq. 6. The proposed basic lifting unit requires only two multipliers overall to perform all of the fundamental Predict, Update and Scale steps required for DWT lifting. The lifting factors a, b, c, d and K used in the lifting equations for the CDF 9/7 filter are floating-point values, which would require a specialized floating-point unit for hardware computation. Since a floating-point unit is expensive to implement in hardware, and floating-point computations tend to be slower than integer computations, a fixed-point implementation was used which rounds the output coefficients to the nearest integer according to the following logic equation:

\[ \text{Approximated integer output} = \big( ((m \times v) \gg s) + 1 \big) \gg 1, \tag{7} \]

where m and s are the multiplier coefficient and power-of-two shift value respectively, and v is the input value. Assuming 16-bit word sizes for m and v, a 16×16-bit multiplier is required for hardware implementation of Eq. 7, with s set to a constant value of 13 for optimal fixed-point performance. The Multiply and Round units shown in Fig. 5 incorporate Eq. 7 to approximate the floating-point lifting factors used by the JPEG2000 filters by the integer values shown in Table 1. With s = 13, the total right shift in Eq. 7 is s + 1 = 14 bits, so each integer in Table 1 is the magnitude of the corresponding factor multiplied by 2^14 and rounded (for example, 1.58613 × 16384 ≈ 25987).

The advantages of a programmable basic DWT lifting unit become apparent in the context of modern embedded systems design, where IP cores are generally added to an embedded system to provide accelerated hardware functionality. Fig. 6 shows a typical shared system bus architecture which incorporates both forward and inverse DWT IP cores for complete DWT hardware functionality. Assuming the forward and inverse DWT IP cores are implemented based on the folded architecture [15], we would require a total of eight multipliers (four for each core). However, the proposed programmable basic DWT lifting unit requires only one multiplier for lifting, plus an additional multiplier for scaling. Hence, for complete DWT hardware functionality in JPEG2000, we can replace both DWT IP cores in Fig. 6 with a programmable basic DWT lifting unit which can be programmed by the master processor for both forward and inverse transform modes, resulting in a net logic resource saving of six multipliers. Since even a single multiplier is typically expensive to implement in hardware in terms of logic resources, the net saving of six multipliers resulting from the proposed design change in Fig. 6 is very welcome, at the expense of added program and control complexity.

Figure 5 Datapath of the proposed basic lifting unit for hardware computation of Predict and Update lifting steps.

4 The Proposed 2-D DWT System Architecture

Figure 7 shows a block diagram of the proposed 2-D DWT system architecture with all five DWT hardware accelerators, which are interconnected to their associated memory buffers via Avalon Memory-Mapped (MM) Master (M)-Slave (S) interfaces [23]. The master processor (Nios II) coordinates the operations of all five accelerators to perform either forward or inverse 2-D DWT for any specified level of decomposition, as well as to determine the DWT filter to use depending upon the application. The proposed system architecture supports 2-D DWT processing of 8-bit image components using 16-bit or 32-bit word sizes for each pixel; however, for the case of 16-bit word sizes, only five levels of decomposition are supported for CDF 9/7 lifting. This is mainly due to the overflow conditions that will occur when using 16-bit word sizes for DWT coefficient values obtained after five levels of decomposition for CDF 9/7 lifting. Hence, the proposed system architecture can only support hardware-accelerated 2-D DWT processing up to five decomposition levels for the CDF 9/7 filter mode. When the number of decomposition levels is specified beyond this limit (assuming a 32-bit input image), control is transferred back to the master processor to perform software-based 2-D DWT processing of the image in DRAM for the remaining decomposition levels.

Since the 2-D DWT is a complete image transform, large images or video frames are normally expected to be stored in inexpensive, high-capacity volatile off-chip memory such as DRAM (Dynamic Random Access Memory) for 2-D

Table 1 Integer approximation values used for fixed-point implementation of the JPEG2000 filters. M and S are the lifting factors; lifting_coefficient, scale_factor and sign are the multiplexor/register input values used for integer approximation (s = 13).

| Filter mode | M | S | lifting_coefficient | scale_factor | sign |
|---|---|---|---|---|---|
| Forward CDF 9/7 lifting, without scaling | −1.58613 | 1 | 25987 | 16384 | 1 |
| | −0.05298 | 1 | 868 | 16384 | 1 |
| | 0.88291 | 1 | 14466 | 16384 | 0 |
| | 0.44351 | 1 | 7266 | 16384 | 0 |
| Forward CDF 9/7 lifting, with scaling | 0.76801 | 0.8699 | 12583 | 14252 | 0 |
| | 0.58613 | 1.1496 | 9603 | 18835 | 0 |
| Inverse CDF 9/7 lifting, without scaling | 1.58613 | 1 | 25987 | 16384 | 0 |
| | 0.05298 | 1 | 868 | 16384 | 0 |
| | −0.88291 | 1 | 14466 | 16384 | 1 |
| | −0.44351 | 1 | 7266 | 16384 | 1 |
| Inverse CDF 9/7 lifting, with scaling | −0.50986 | 0.8699 | 8354 | 14252 | 1 |
| | −0.44351 | 1.1496 | 7266 | 18835 | 1 |
| Forward LeGall 5/3 lifting | −0.5 | 1 | 8192 | 16384 | 1 |
| | 0.25 | 1 | 4096 | 16384 | 0 |
| Inverse LeGall 5/3 lifting | −0.25 | 1 | 8192 | 16384 | 1 |
| | 0.5 | 1 | 4096 | 16384 | 0 |
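The relationship between Eq. 6, Eq. 7 and the forward-transform entries of Table 1 can be checked with a short Python sketch. It is a software model under stated assumptions rather than the authors' datapath: the function names are hypothetical, the sign input is assumed to select subtraction of the lifting product when set, and each of the two products is assumed to be rounded separately by its own Multiply and Round unit.

```python
SHIFT = 13  # s in Eq. 7


def to_fixed(factor, s=SHIFT):
    """Quantize the magnitude of a lifting or scale factor to an integer multiplier."""
    # Eq. 7 shifts right by s and then by 1, so the implied scale is 2**(s + 1).
    return round(abs(factor) * (1 << (s + 1)))


def multiply_round(m, v, s=SHIFT):
    """Eq. 7: multiply, shift right by s, then round via add-one-and-shift."""
    return (((m * v) >> s) + 1) >> 1


def lifting_step(x_i, x_next, y_i, lifting_coefficient, scale_factor, sign):
    """Eq. 6 in fixed point: y_new[i] = M * (x[i] + x[i+1]) + S * y[i]."""
    product = multiply_round(lifting_coefficient, x_i + x_next)
    scaled = multiply_round(scale_factor, y_i)
    # Assumption: sign == 1 marks a negative lifting factor M.
    return scaled - product if sign else scaled + product
```

For example, `lifting_step(10, 20, 17, 8192, 16384, 1)` models a forward LeGall 5/3 Predict step (M = −0.5, S = 1) and computes 17 − 0.5 × (10 + 20).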


DWT processing. The problem with processing images directly in off-chip DRAM is the high latency incurred when accessing non-sequential memory locations in DRAM, which drastically reduces performance during column-wise DWT processing. In order to mitigate the variable latency issues normally associated with DRAM, we localized memory accesses by transferring lines (rows or columns) of the original image to faster, fixed-latency on-chip memory buffers such as SRAM, and performed 1-D DWT processing on each of these line buffers. To further optimize 2-D DWT performance, dual-port memory was used to double the memory bandwidth. Other memory throughput optimizations for DRAM access, such as transposing images in DRAM to avoid column-wise accesses and consolidating multiple words into a single wider memory access, will be discussed in Section 4.2.

Figure 6 A conventional system bus architecture augmented with DWT IP cores for DWT hardware functionality.

Figure 7 Block diagram of the proposed 2-D DWT system architecture.

4.1 Hardware-Accelerated 2-D DWT Processing Algorithm

Figures 8 and 9 illustrate a single run of the CDF 9/7 lifting scheme for forward DWT and inverse DWT respectively, where each colored array represents a dual-port RAM being used for multiple memory accesses. The dashed boxes in Figs. 8 and 9 indicate the loop control parameters used to determine the operational mode of the accelerator working at each stage of DWT lifting. Before describing the DWT algorithm flow shown in Figs. 8 and 9, it may be useful to refer to the Nios II C2H Compiler User Guide [24] on how the Altera C2H compiler interprets pointer dereferencing in hardware, since the loop control parameters are used to derive the read and write addresses of the dual-port line buffers for the associated operational modes of the do_lifting_step, split_even_odd and merge_even_odd accelerators.

Each stage of DWT lifting, differentiated by the dashed lines shown in Figs. 8 and 9, is accelerated by one of three hardware accelerators: split_even_odd, do_lifting_step or merge_even_odd. The code snippets given in Fig. 10 describe the processing loops used by these accelerators to perform their respective operational modes during each stage of DWT lifting as shown in Figs. 8 and 9. The commented sections of the code snippets given in Fig. 10 provide the pseudocode showing how the various array pointers are set to their associated memory buffers according to the respective accelerator mode. To allow for concurrent memory accesses in dual-port line buffers and hence increase the memory throughput, non-aliasing pointers using the C __restrict__ qualifier were set to point to the memory locations to be accessed by the respective accelerator. For the do_lifting_step accelerator, three memory reads and one memory write are processed concurrently in both EvenBufferRAM and OddBufferRAM dual-port memories during the Predict and Update lifting steps. Consequently, after each Predict and Update lifting step the output coefficients are written one memory location to the right of their original memory location, as indicated in Figs. 8 and 9. To deal with signal boundaries, the symmetric extension scheme was employed as shown in Figs. 8 and 9.

Figure 8 Algorithm flow for forward CDF 9/7 DWT lifting.

Figure 9 Algorithm flow for inverse CDF 9/7 DWT lifting.

Figure 11 illustrates how the copy_DRAM_data_to_line_buffer DMA accelerator transfers rows or columns of the image stored in DRAM to LineBufferRAM1 based on the stride and line index values it receives from the master processor. The highlighted pixels of the 6×6 sample image in Fig. 11 differentiate between row-wise and column-wise processing, where the red-shaded pixels indicate the line index at which to start row-wise processing, while the blue-shaded pixels indicate the line index at which to start column-wise processing. On the other hand, the copy_buffer_data_to_DRAM accelerator does not have a DMA mechanism like the one used by the copy_DRAM_data_to_line_buffer accelerator to differentiate between row-wise and column-wise processing. Instead, the copy_buffer_data_to_DRAM accelerator only allows for sequential write transfers from LineBufferRAM2 to DRAM. The primary motive for the more restrictive DMA mechanism used by the copy_buffer_data_to_DRAM accelerator, as compared to the one used by the copy_DRAM_data_to_line_buffer accelerator, is faster performance, for reasons explained in detail in Section 4.2.
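The difference between the two DMA mechanisms can be sketched in Python as a strided gather on the read side and a contiguous write on the write side. The function names mirror the accelerator names, but their signatures, the row-major flat-list model of DRAM, and the mapping from line index to start address are assumptions made for illustration.

```python
def copy_dram_data_to_line_buffer(dram, start, stride, length):
    """Gather `length` words from `dram`, beginning at `start`, stepping by `stride`."""
    return [dram[start + k * stride] for k in range(length)]


def copy_buffer_data_to_dram(dram, line_buffer, start):
    """Write a line buffer back to consecutive DRAM addresses only."""
    dram[start:start + len(line_buffer)] = line_buffer


# A 6x6 image stored row-major, as in the sample image of Fig. 11.
WIDTH = HEIGHT = 6
image = list(range(WIDTH * HEIGHT))

# Row-wise processing: line index selects the row, stride is one word.
row_2 = copy_dram_data_to_line_buffer(image, 2 * WIDTH, 1, WIDTH)
# Column-wise processing: line index selects the column, stride is one image row.
column_2 = copy_dram_data_to_line_buffer(image, 2, WIDTH, HEIGHT)
```

Only the read side needs the stride parameter; keeping the write side strictly sequential is what allows the transposition scheme of Section 4.2 to avoid non-sequential DRAM writes.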

4.2 Performance Optimizations

We can further optimize the data throughput of the proposed architecture by taking advantage of the fact that the five accelerators can work in 'pipelined' fashion, i.e. multiple accelerators can operate in parallel during each stage of 2-D DWT processing differentiated by the dashed lines in Figs. 8 and 9. A flowchart indicating which accelerators are operating during each stage of the processing is given in Fig. 12. Since the even- and odd-indexed data are stored in EvenBufferRAM and OddBufferRAM memories respectively while the input line data is stored in LineBufferRAM1 memory, the copy_DRAM_data_to_line_buffer accelerator can fetch the next row or column of the image into LineBufferRAM1 while the do_lifting_step accelerator performs lifting on the even- and odd-indexed data. The output coefficients, stored in LineBufferRAM2 memory, are then transferred back to the image buffer in DRAM by the copy_buffer_data_to_DRAM accelerator, while the split_even_odd accelerator processes the next input line data stored in LineBufferRAM1 memory.

Figure 10 Code snippets representing the processing loop used in (a) the split_even_odd accelerator (Split and Sort modes), (b) the do_lifting_step accelerator (Predict and Update modes), (c) the merge_even_odd accelerator (Merge and Interleave modes).

One important consequence of the optimization shown in Fig. 12 is that simultaneous operation of the accelerators responsible for the DWT lifting process (do_lifting_step, split_even_odd and merge_even_odd) with the DMA accelerators copy_DRAM_data_to_line_buffer and copy_buffer_data_to_DRAM effectively hides the high latency normally involved with DMA transfers of image rows or columns to and from DRAM. This ensures that the do_lifting_step accelerator is almost constantly at work and not 'starved' for data at any stage of DWT lifting.

In order to mitigate the effect of the high latency incurred during non-sequential column-wise DRAM accesses, the following schemes for optimizing memory throughput were employed:

1. Coalescing DRAM write accesses: since the DRAM interface can support 32-bit transfers, we can consolidate


two 16-bit words into a single 32-bit DRAM write access. This can potentially improve the memory bandwidth utilization as well as power consumption efficiency in certain cases, since the number of memory write transfers to DRAM during 2-D DWT processing is effectively halved when the input image consists of 16-bit pixel data.
2. Pipelining DRAM read transfers: we can mitigate the DRAM variable latency effects through Avalon-MM pipelined read transfers [23]. Pipelining memory read transfers can 'hide' the high latency incurred during column-wise DRAM read accesses in 2-D DWT processing. Note that pipelined DRAM write transfers are not supported by the Avalon system interconnect fabric, thus we cannot 'hide' the high latency associated with non-sequential column-wise DRAM writes during 2-D DWT processing.
3. Transposing images in DRAM: we can completely avoid non-sequential writes in DRAM by reading columns of the input image first for column-wise 1-D DWT processing, then writing the output coefficients row-wise into a temporary transpose buffer in DRAM, thus transposing the intermediate DWT coefficients of the image. 2-D DWT processing is then completed by reading the intermediate transposed DWT coefficients column-wise again for 1-D DWT processing and then writing the final DWT coefficients row-wise back to the original image buffer, thus preserving the original image orientation for the DWT-processed image. Figure 13 illustrates how such transpositions are carried out. Since column-wise DRAM read accesses are pipelined, reading columns of the input image from DRAM does not incur as much of a performance penalty as non-sequential column-wise DRAM writes on the image, which are not pipelined.

Figure 11 DMA mechanism used by the copy_DRAM_data_to_line_buffer accelerator for 2-D memory accesses.

Figure 12 Flowchart showing the simultaneous operation of multiple accelerators during row-wise/column-wise processing.

Theoretically, since 2-D DWT can be described as a product of separable wavelet/scaling functions, 2-D DWT can be processed in row-column order (R-C ordering), i.e. as a 1-D DWT first performed on all image rows (row-wise 1-D DWT) followed by a 1-D DWT performed on all image columns (column-wise 1-D DWT). Likewise, 2-D DWT can be processed in column-row order (C-R ordering), i.e. column-wise 1-D DWT followed by row-wise 1-D DWT. The transposition process shown in Fig. 13a implies that C-R ordering is used for optimal memory throughput. For lossy compression, C-R ordering can be



Figure 13 Transposition process according to the ordering sequence of separable 1-D transforms; a Column-Row (C-R) ordering, b Row-Column (R-C) ordering.

applied for both forward and inverse transform modes. However, for lossless compression, the inverse 2-D DWT must apply the separable transforms in the reverse order of the forward 2-D DWT, e.g. if C-R ordering is used for the forward 2-D DWT, then R-C ordering has to be used for the inverse 2-D DWT. This is because the non-linearity introduced by the integer approximation of the DWT coefficients restricts the order in which the separable transforms are performed to guarantee perfect reconstruction [25]. Figure 13b illustrates the transpositions carried out for the R-C ordering scheme of 2-D DWT. The transpositions are necessary even for R-C ordering since the proposed system architecture implementation does not allow non-sequential writes to DRAM, which would otherwise heavily degrade throughput. It can be seen in Fig. 13b that R-C ordering requires an additional transposition to ensure that the DWT-processed image remains in the original input image orientation, thus R-C ordering is slower than C-R ordering for 2-D DWT processing. Therefore, unless lossless compression is desired, all DWT filter modes should use C-R ordering for faster 2-D DWT processing.
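The transposition scheme and the ordering constraint for lossless compression can be illustrated with a small software model. The following is a minimal sketch, not the C2H accelerator code: it assumes a single decomposition level, even image dimensions, the reversible integer LeGall 5/3 lifting steps with mirrored boundaries, and plain Python lists standing in for the DRAM image and transpose buffers. The function names (fwd53, inv53, columns, forward_cr, inverse_rc) are hypothetical and chosen only for this illustration.

```python
def fwd53(x):
    """Forward integer LeGall 5/3 lifting on an even-length list.

    Returns the low-pass coefficients followed by the high-pass coefficients.
    """
    n, half = len(x), len(x) // 2
    # Predict step; the right neighbour is mirrored at the signal edge.
    d = [x[2 * i + 1] - (x[2 * i] + x[min(2 * i + 2, n - 2)]) // 2
         for i in range(half)]
    # Update step; the left neighbour is mirrored at the signal edge.
    s = [x[2 * i] + (d[max(i - 1, 0)] + d[i] + 2) // 4 for i in range(half)]
    return s + d


def inv53(c):
    """Exact inverse of fwd53: undo the update step, then the predict step."""
    half = len(c) // 2
    s, d = c[:half], c[half:]
    even = [s[i] - (d[max(i - 1, 0)] + d[i] + 2) // 4 for i in range(half)]
    odd = [d[i] + (even[i] + even[min(i + 1, half - 1)]) // 2
           for i in range(half)]
    x = []
    for e, o in zip(even, odd):
        x += [e, o]
    return x


def columns(img):
    """Read an image column by column, i.e. return its transpose."""
    return [list(col) for col in zip(*img)]


def forward_cr(img):
    """C-R ordering: every pass reads columns and writes rows."""
    # Pass 1: column-wise 1-D DWT, written row-wise (transposed intermediate).
    t = [fwd53(col) for col in columns(img)]
    # Pass 2: columns of the transposed buffer are the rows of the image,
    # so writing row-wise again restores the original orientation.
    return [fwd53(col) for col in columns(t)]


def inverse_rc(coef):
    """R-C ordering for the inverse: undo the row pass, then the column pass."""
    rows = [inv53(row) for row in coef]
    cols = [inv53(col) for col in columns(rows)]
    # One more transposition is needed to return to the input orientation.
    return columns(cols)


image = [[(37 * r + 11 * c * c) % 256 for c in range(8)] for r in range(6)]
restored = inverse_rc(forward_cr(image))
print(restored == image)
```

Because the inverse undoes the row pass before the column pass, each integer rounding step is reversed exactly and the test image is recovered bit for bit. The extra call to columns() at the end of inverse_rc mirrors the additional transposition that makes R-C ordering slower in the proposed architecture.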

5 Results and Discussion

5.1 Experimental Setup

The platform used for the implementation and testing of the proposed design was the Altera DE3 FPGA board, which is a high performance development and education FPGA board

based on the Altera Stratix III FPGA (EP3SL150F1152C3 model). To effectively utilize the Altera DE3 FPGA board, we used tools such as SOPC builder available within Altera Quartus® II 9.0 to construct and synthesize the proposed design. The Nios II SOPC (System-On-a-Programmable Chip) design was then programmed using the Nios II EDS (Embedded Design Suite) 9.0, which is an Altera-customized C/C++ IDE (Integrated Development Environment) based on the Eclipse platform. The five hardware accelerators augmented to the Avalon System Interconnect Fabric, shown previously in Fig. 7, were generated using the Altera C2H compiler [26] which is available within the Nios II IDE. Of particular importance for this work was the use of Altera’s C2H compiler [26], which aided us greatly in simplifying the implementation of the proposed 2-D DWT system architecture. C2H generates hardware accelerators based on straightforward C-to-hardware mapping, then automates the integration of the generated hardware accelerators to the Avalon system interconnect fabric of the Nios II system. For optimal throughput, C2H pipelines memory accesses and logic implemented for loops, based on memory access latency and the amount of code that can operate in parallel. Through an iterative design space exploration process utilizing various C2H optimization techniques [27], the 2-D DWT hardware processing algorithm was written for optimal C2H generation and performance. The tests were carried out using a system frequency of 133 MHz and a DDR2 clock frequency of 266 MHz in our SOPC configuration. In other words, the five C2H DWT hardware accelerators together with the Nios II CPU run at a

specified system frequency of 133.333 MHz, while the high performance DDR2 memory controller IP [28] provided by Altera was configured to run at a specified SDRAM data clock frequency of 266.667 MHz with a specified capacity of 256 MB. An Altera Avalon performance counter [29] was included in the SOPC design to provide accurate performance figures for both the software and hardware implementations of 2-D DWT running on the Nios II system. A test program written in C was executed from internal FPGA code memory to initialize the Nios II system for 2-D DWT hardware/software processing of test images stored on an SD card and to perform functional verification of the design. The 2-D DWT software implementation was compiled using the nios2-elf-gcc compiler with optimization level -O2 (release version).

5.2 Implementation Results

The results shown in Table 2, obtained from the Altera Quartus II synthesis reports, show the logic resources taken up by the five C2H accelerators by comparing a basic Nios II system with and without the five C2H DWT accelerator modules. It can be seen in Table 2 that the five C2H accelerators take up a considerable amount of logic resources compared to a basic Nios II system with the Nios II CPU only. This is because the C2H compiler generates additional glue logic for integrating the five C2H accelerators into the Avalon system interconnect fabric, in addition to the registers, combinational logic and DSP blocks required to implement the five C2H accelerators. Table 3 gives a more abstract view of the resources required for the five C2H accelerators, as provided in the C2H compiler view of the Nios II IDE.

Table 2 Altera Stratix III resource utilization comparison of a Nios II system with and without the C2H DWT hardware modules.

| Resource Utilization (Altera EP3SL150F1152C3 FPGA) | Nios II CPU only | Nios II CPU+5 C2H DWT accelerators |
| Logic Utilization | 9,050/113,600 (8 %) | 14,188/113,600 (12 %) |
| Combinational ALUTs | 5,733 | 8,522 |
| Memory ALUTs | 320 | 320 |
| Dedicated Logic Registers | 5,450 | 9,889 |
| Total block memory bits | 1,141,472 | 1,239,776 |
| DSP block 18-bit elements | 4 | 8 (+2 18×18-bit multipliers) |

Table 3 Logic resource utilization of the five C2H DWT hardware modules as shown by the Nios II IDE C2H compiler view.

| C2H accelerator entity | 18×18-bit multipliers required | 16-bit Avalon masters required | 32-bit Avalon masters required |
| do_lifting_step | 2 | 6 | 0 |
| split_even_odd | 0 | 4 | 0 |
| merge_even_odd | 0 | 4 | 0 |
| copy_DRAM_data_to_line_buffer | 0 | 2 | 0 |
| copy_buffer_data_to_DRAM | 0 | 4 | 1 |

5.3 Frame Rate Performance Analysis

The performance results shown in Tables 4 and 5 indicate that the proposed 2-D DWT system shown in Fig. 7 can provide a frame rate speedup of over 30 times in the best cases (up to 36 times) when compared to an optimized 2-D DWT reference software implementation running on the same Nios II platform, depending on the image resolution processed as well as the ordering scheme and type of filter used. As the image width and pixel word size increase, the memory throughput decreases due to the effect of DRAM bank switching when performing row-wise stride accesses of image elements during column-wise 2-D DWT processing, which reduces the overall speedup obtained. Also, since both forward and inverse DWT lifting modes require the same number of stages for completion, as shown in Figs. 8 and 9, the frame rate results are similar for forward and inverse 2-D DWT processing under both R-C and C-R ordering schemes, as demonstrated in Tables 4 and 5. The graphs in Fig. 14 show the frame rate performance under increasing levels of decomposition for a 1280×720 image using 16- and 32-bit pixel word sizes. The graphical results shown in Fig. 14 apply to

Table 4 Software (SW) vs. Hardware (HW) performance of DWT lifting with five levels of decomposition for grayscale images using 16-bit pixel word sizes.

| Filter mode | Resolution (16-bit) | SW frame rate (fps) | HW frame rate (fps), C-R | HW frame rate (fps), R-C | Speedup factor, C-R | Speedup factor, R-C |
| Forward LeGall 5/3 lifting | 720×576 | 7.55 | 272.48 | 190.11 | 36.09 | 25.18 |
| Forward LeGall 5/3 lifting | 1280×720 | 3.42 | 111.36 | 82.17 | 32.56 | 24.03 |
| Forward LeGall 5/3 lifting | 1920×1080 | 1.50 | 39.95 | 31.21 | 26.63 | 20.81 |
| Inverse LeGall 5/3 lifting | 720×576 | 7.56 | 264.55 | 190.48 | 34.99 | 25.20 |
| Inverse LeGall 5/3 lifting | 1280×720 | 3.42 | 109.29 | 82.24 | 31.96 | 24.05 |
| Inverse LeGall 5/3 lifting | 1920×1080 | 1.51 | 39.54 | 31.21 | 26.19 | 20.67 |
| Forward CDF 9/7 lifting | 720×576 | 6.08 | 183.82 | 141.24 | 30.23 | 23.23 |
| Forward CDF 9/7 lifting | 1280×720 | 2.74 | 91.66 | 68.97 | 33.45 | 25.17 |
| Forward CDF 9/7 lifting | 1920×1080 | 1.20 | 37.24 | 27.85 | 31.03 | 23.21 |
| Inverse CDF 9/7 lifting | 720×576 | 6.09 | 186.57 | 143.68 | 30.64 | 23.59 |
| Inverse CDF 9/7 lifting | 1280×720 | 2.74 | 92.59 | 69.88 | 33.79 | 25.50 |
| Inverse CDF 9/7 lifting | 1920×1080 | 1.20 | 37.47 | 28.05 | 31.23 | 23.38 |

various image resolutions as well. It can be seen from the graphs of Fig. 14 that the frame rate performance begins to level off after three levels of decomposition. This is because the LL (Low-Low) sub-band blocks obtained after three levels of decomposition are sufficiently small that the execution time for further 2-D DWT processing on these blocks becomes negligible compared to the time required to execute the first two levels of decomposition during 2-D DWT processing of the image. It should also be noted that for images using 16-bit pixel word sizes, CDF 9/7 lifting is limited to five levels of decomposition due to possible overflow. From the graphs shown in Fig. 14, it is clear that the frame rate performance is considerably higher using the C-R ordering scheme than using the R-C ordering scheme.
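The levelling-off of the frame rate can be made plausible with a simple work estimate. The sketch below assumes, as a simplification that is not taken from the measurements, that the processing time of each decomposition level is proportional to the number of pixels in the block it operates on, and it ignores fixed per-level overheads and memory effects. Since each level operates on the LL sub-band of the previous level, the block shrinks by a factor of four per level. The function name cumulative_work is hypothetical.

```python
def cumulative_work(levels):
    """Pixels processed after each level, relative to one full-image pass.

    Level 1 processes the full image; every further level processes the
    LL sub-band of the previous level, which holds a quarter of its pixels.
    """
    total = 0.0
    out = []
    for k in range(levels):
        total += 0.25 ** k
        out.append(total)
    return out


work = cumulative_work(5)
# Fraction of the five-level workload already done after each level.
share = [w / work[-1] for w in work]
for level, (w, s) in enumerate(zip(work, share), start=1):
    print(f"level {level}: cumulative work {w:.4f}, share {s:.1%}")
```

Under this assumption, three levels already account for more than 98 % of the work of a five-level decomposition, and the total can never exceed 4/3 of a single full-image pass, which is consistent with the flattening curves in Fig. 14.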

Table 5 Software (SW) vs. Hardware (HW) performance of DWT lifting with 5 levels of decomposition for grayscale images using 32-bit pixel word sizes.

| Filter mode | Resolution (32-bit) | SW frame rate (fps) | HW frame rate (fps), C-R | HW frame rate (fps), R-C | Speedup factor, C-R | Speedup factor, R-C |
| Forward LeGall 5/3 lifting | 720×576 | 6.69 | 210.97 | 158.23 | 31.54 | 23.65 |
| Forward LeGall 5/3 lifting | 1280×720 | 2.97 | 73.53 | 59.52 | 24.76 | 20.04 |
| Forward LeGall 5/3 lifting | 1920×1080 | 1.29 | 25.10 | 21.36 | 19.46 | 16.56 |
| Inverse LeGall 5/3 lifting | 720×576 | 6.66 | 206.61 | 158.48 | 31.02 | 23.80 |
| Inverse LeGall 5/3 lifting | 1280×720 | 2.96 | 72.62 | 59.56 | 24.53 | 20.12 |
| Inverse LeGall 5/3 lifting | 1920×1080 | 1.29 | 24.94 | 21.36 | 19.33 | 16.56 |
| Forward CDF 9/7 lifting | 720×576 | 5.51 | 171.82 | 134.59 | 31.18 | 24.43 |
| Forward CDF 9/7 lifting | 1280×720 | 2.39 | 66.18 | 53.56 | 27.69 | 22.41 |
| Forward CDF 9/7 lifting | 1920×1080 | 1.03 | 24.04 | 19.76 | 23.34 | 19.18 |
| Inverse CDF 9/7 lifting | 720×576 | 5.49 | 174.52 | 136.80 | 31.79 | 24.92 |
| Inverse CDF 9/7 lifting | 1280×720 | 2.38 | 66.71 | 54.08 | 28.03 | 22.72 |
| Inverse CDF 9/7 lifting | 1920×1080 | 1.03 | 24.14 | 19.86 | 23.44 | 19.28 |

5.4 Image Reconstruction Quality Results

Figure 15 provides a Peak Signal-to-Noise Ratio (PSNR) analysis indicating the quality of image reconstruction after hardware-accelerated 2-D DWT processing of a 512×512 “lenna” image with increasing levels of decomposition. The image was reconstructed only through hardware-accelerated forward and inverse DWT modes using different combinations of R-C/C-R ordering schemes. This analysis is only valid for the CDF 9/7 lifting scheme with approximated integer scaling in the proposed system architecture, since the LeGall 5/3 lifting scheme is mainly used for lossless compression in JPEG2000. For lossless compression with the LeGall 5/3 lifting scheme, the reverse of the ordering scheme used initially for forward 2-D DWT should be



Figure 14 Frame rate analysis for 1280×720 images with increasing levels of decomposition; a using 16-bit pixel word sizes, b using 32-bit pixel word sizes.

applied for the inverse 2-D DWT [25]. The plots shown in the graph of Fig. 15 demonstrate a consistently high quality of reconstruction with PSNR results of over 43 dB up to five levels of decomposition, no matter what combinations of ordering schemes are used for both forward and inverse modes.
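For reference, the PSNR figure of merit used in this analysis can be computed as in the following minimal sketch. It assumes 8-bit grayscale data, so the peak value of 255 is an assumption rather than a parameter stated in the text, and it represents images as nested lists. The function name psnr is hypothetical.

```python
import math


def psnr(original, reconstructed, peak=255):
    """Peak Signal-to-Noise Ratio in dB between two equally sized images.

    peak is the maximum possible pixel value (255 assumes 8-bit pixels).
    """
    squared_errors = [
        (a - b) ** 2
        for row_o, row_r in zip(original, reconstructed)
        for a, b in zip(row_o, row_r)
    ]
    mse = sum(squared_errors) / len(squared_errors)
    if mse == 0:
        return math.inf  # identical images, i.e. lossless reconstruction
    return 10.0 * math.log10(peak * peak / mse)


reference = [[10, 20], [30, 40]]
degraded = [[11, 19], [31, 39]]  # every pixel is off by one, so MSE = 1
print(round(psnr(reference, degraded), 2))
```

With this definition, a mean squared error of one grey level corresponds to roughly 48 dB, so the reported results of over 43 dB indicate average errors of only a few grey levels squared.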

5.5 Power Analysis

Table 6 demonstrates the estimated additional power dissipation for the five C2H DWT accelerators augmented to the base Nios II system. The estimated power dissipation results in Table 6 were obtained by using a vector-less estimation model with Altera’s PowerPlay Analysis tool [30]. It can be seen in Table 6 that the dynamic power dissipation for the augmented C2H DWT modules is an additional 157.43 mW over the base Nios II system, which is about 29 % more dynamic power than what a typical base Nios II system would consume.

5.6 Comparison with Other Works in Published Literature

Table 7 provides a relative comparison of the features and performance of 2-D DWT hardware implementations in published literature with the proposed 2-D DWT system



Figure 15 PSNR analysis of a reconstructed 512×512 “lenna” image with increasing levels of decomposition.

architecture. It should be noted that the equivalent ASIC gate count for the Xilinx FPGA implementations of the previously proposed 2-D DWT architectures [7, 21] is not available in Table 7, since these architectures do not have a single, unified implementation for the various filter modes of JPEG2000. Also, the performance figures for the architectures of [11, 13] were obtained from their hardware implementations benchmarked in [7]. Looking at the features offered by the architectures [7, 14, 21] as listed in Table 7, it is clear that only the RISC-based wavelet processing architecture [14] and the 2-D DWT system architecture proposed in this paper are completely programmable in terms of DWT functionality. The hardware implementations listed in Table 7 run at different operating frequencies, therefore a cycles-per-pixel metric was employed instead of frame rate for a fair comparison of the throughput of the 2-D DWT hardware implementations [7, 14, 21]. Based on the cycles-per-pixel results given in Table 7, the proposed 2-D DWT system architecture provides the lowest cycle count (highest throughput) to process each pixel, completing five levels of DWT decomposition for each pixel in less than two clock cycles. This performance was made possible by the data and memory throughput optimizations detailed in Section 4.2. It can also be seen from the ASIC gate count given in Table 7 that the hardware implementation of the proposed 2-D DWT system architecture requires almost three times fewer logic resources than the RISC-based wavelet processing architecture [14], while providing higher throughput and similar DWT functionality. The approximate ASIC gate count equivalent of the Altera FPGA implementation of the five C2H DWT modules was calculated as specified for Altera Hardcopy® ASIC migration [31]. Since the five C2H DWT modules were designed for integration with the Nios II System-on-a-Chip (SOC) platform, we only considered the additional FPGA logic resource utilization of the five C2H DWT modules required to provide the target SOC platform with complete, hardware-accelerated 2-D DWT functionality for JPEG2000. Given that the large majority of modern SOC designs integrate a general-purpose microprocessor, the calculated ASIC logic costs of the proposed architecture given in Table 7 remain valid for generalized CPU-centered SOC designs. Hence, the comparison given in Table 7 demonstrates the superior performance and resource efficiency of the proposed 2-D DWT system architecture while offering complete DWT functionality for JPEG2000 compression, in spite of the fact that the proposed system architecture was completely described and generated through high-level synthesis.

Table 6 Estimated power dissipation results for the proposed 2-D DWT system architecture.

| Power Dissipation (Altera EP3SL150F1152C3 FPGA) | Nios II CPU only | Nios II CPU+5 C2H DWT accelerators |
| Total Thermal Power Dissipation | 1976.70 mW | 2138.50 mW (+161.8 mW) |
| Core Dynamic Thermal Power Dissipation | 544.13 mW | 701.56 mW (+157.43 mW) |
| Core Static Thermal Power Dissipation | 612.42 mW | 616.79 mW (+4.37 mW) |
| I/O Thermal Power Dissipation | 820.15 mW | 820.15 mW |
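The cycles-per-pixel metric used in Table 7 can be related to the measured frame rates with a short calculation. The sketch below assumes that the metric is the number of system clock cycles spent per frame divided by the number of pixels in the frame; this definition is an assumption made for illustration. The inputs are the 133.333 MHz system frequency from Section 5.1 and the hardware frame rate reported in Table 4 for forward CDF 9/7 lifting on 1920×1080 images with C-R ordering. The function name cycles_per_pixel is hypothetical.

```python
def cycles_per_pixel(clock_hz, fps, width, height):
    """Clock cycles spent per pixel for a full multi-level 2-D DWT."""
    cycles_per_frame = clock_hz / fps
    return cycles_per_frame / (width * height)


# Forward CDF 9/7 lifting, five levels, 16-bit pixels, C-R ordering (Table 4).
cpp_1080p = cycles_per_pixel(133.333e6, 37.24, 1920, 1080)
print(f"{cpp_1080p:.2f} cycles per pixel")
```

The result stays below two cycles per pixel, in line with the statement that five levels of decomposition complete in less than two clock cycles per pixel. It is not directly comparable to the 1024×1024 entries of Table 7 because the image size differs.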

– M x M frame resolutions up to 1024×1024

LeGall 5/3 lifting implementation only

Up to 6 levels

Forward transform only

172.4 MHz 5.30d

ASIC gate count Frame size support

DWT filter support

Levels of DWT decomposition

Transform modes

Operating frequency Cycles-per-pixel (Lower is better)

117.6 MHz 5.61d


Up to 6 levels


– M x M frame resolutions up to 1024×1024

FPGA (Xilinx)

Block-Based architecture [13]

200 MHz 3.76e (using 4 wavelet PEs)

Both forward and inverse transforms

Programmable at any specified level

259,870b Programmable up to 1024×1024 resolutions Programmable (CDF 9/7 and 5/3 lifting)

ASIC

RISC-based architecturea [14]

138 MHz 2.15f

Both forward and inverse transforms

Up to 5 levels

CDF 9/7 or LeGall 5/3 lifting implementations

– Frame resolutions up to 1024×1024

FPGA (Xilinx)

Hierarchical Pipelining [21]

a. DRAM memory controller parameters: 266 MHz DDR data clock frequency, 64-bit data bus, 8 memory banks
b. ASIC gate count for four wavelet processing engines (PEs) and dedicated DDR memory controller
c. Approximate ASIC gate count equivalent of the five C2H DWT modules (5,138 ALUTs+2 18×18-bit multipliers) using Altera Hardcopy® III technology
d. Cycles-per-pixel count for 1024×1024 grayscale images with 16-bit pixel word sizes using LeGall 5/3 filter for 3 levels of decomposition
e. Cycles-per-pixel count for 640×480 grayscale images with 16-bit pixel word sizes using LeGall 5/3 filter for 3 levels of decomposition
f. Cycles-per-pixel count for 1024×1024 grayscale images with 16-bit pixel word sizes using CDF 9/7 filter for 5 levels of decomposition

113.6 MHz 2.26d


Up to 6 levels


FPGA (Xilinx) – M x M frame resolutions up to 1024×1024

FPGA (Xilinx)

Technology

Line-Based architecture [11]

Row-Column architecture [7]

Features

Table 7 Relative feature and performance comparison of existing 2-D DWT hardware implementations with the proposed 2-D DWT system architecture.

1.55f (C-R)

87,070c Programmable up to 2044×2044 resolutions Programmable (CDF 9/7 and 5/3 lifting with scaling) Programmable at any specified level Both forward and inverse transforms 133.3 MHz 1.21d (C-R) 1.18e (C-R)

FPGA (Altera)

Proposed architecturea




6 Conclusion

This paper proposed a novel unified and programmable 2-D DWT system architecture for JPEG2000. Previously proposed DWT architectures in published literature [4–21] did not consider the current trends in embedded systems design, where a high level of performance, functionality and integration is desired in a modern SOC design. Thus, the proposed 2-D DWT system architecture was designed to address the following key aspects of a modern SOC design:

• Architecture programmability
• Data throughput
• Resource efficiency

The novelty of the proposed system architecture lies in the concept of using multiple programmable hardware accelerators coordinated by a master processor for architecture programmability during 2-D DWT processing, as illustrated in Fig. 7. Each augmented accelerator present in the system architecture is designed to accelerate a key stage of 2-D DWT processing. Overall, the implementation of the proposed 2-D DWT system architecture required five programmable DWT hardware modules in conjunction with the master processor. The main hardware accelerator in the proposed system architecture is a programmable basic lifting unit, detailed in Section 3, which is repeatedly invoked by the master processor to perform the Predict, Update and Scale lifting steps for the JPEG2000 lifting filters. The implementation of the proposed 2-D DWT system architecture was realized using a high-level synthesis approach based on the Altera C2H design methodology. The throughput results obtained after benchmarking the final FPGA implementation demonstrate the real-time capability of the proposed system architecture to process images at resolutions of up to 1080p, while maintaining high image reconstruction quality. The comparative benchmarks shown in Table 7 also show that the proposed system architecture provides superior throughput while providing full architecture programmability for 2-D DWT processing in JPEG2000. Hence, the proposed 2-D DWT system architecture can be used as a viable solution for practical real-time 2-D DWT processing in image and video processing applications.

References

1. Mallat, S. (1989). A theory for multiresolution signal decomposition: The wavelet representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 11(7), 674–693.
2. Daubechies, I., & Sweldens, W. (1998). Factoring wavelet transforms into lifting steps. Journal of Fourier Analysis and Applications, 4(3), 247–269.
3. Reichel, J. (2001). On the arithmetic and bandwidth complexity of the lifting scheme (pp. 198–201). Greece: International Conference on Image Processing.
4. Acharya, T. (1997). A high speed reconfigurable integrated architecture for DWT. IEEE Global Telecommunications Conference, Phoenix, 2, 669–673.
5. Acharya, T., & Chakrabarti, C. (2006). A survey on lifting-based discrete wavelet transform architectures. The Journal of VLSI Signal Processing, 42(3), 321–339.
6. Andra, K., Chakrabarti, C., & Acharya, T. (2002). A VLSI architecture for lifting-based forward and inverse wavelet transform. IEEE Transactions on Signal Processing, 50(4), 966–977.
7. Angelopoulou, M. E., Masselos, K., Andreopoulos, Y., & Cheung, P. (2008). Implementation and comparison of the 5/3 lifting 2D discrete wavelet transform computation schedules on FPGAs. The Journal of Signal Processing Systems, 51(1), 3–21.
8. Chakrabarti, C., Vishwanath, M., & Owens, R. (1996). Architectures for wavelet transforms: a survey. Journal of VLSI Signal Processing, 14(2), 171–192.
9. Chakrabarti, C., & Vishwanath, M. (1995). Efficient realizations of the discrete and continuous wavelet transforms: from single chip implementations to mappings on SIMD array computer. IEEE Transactions on Signal Processing, 43(3), 759–771.
10. Chang, Y. N., & Li, Y. S. (2001). Design of highly efficient VLSI architectures for 2-D DWT and 2-D IDWT (pp. 133–140). Belgium: IEEE Workshop on Signal Processing Systems.
11. Chrysafis, C., & Ortega, A. (2000). Line-based, reduced memory, wavelet image compression. IEEE Transactions on Image Processing, 9(3), 378–389.
12. Dai, Q., Chen, X., & Lin, C. (2004). A novel VLSI architecture for multidimensional discrete wavelet transform. IEEE Transactions on Circuits and Systems for Video Technology, 14(8), 1105–1110.
13. Lafruit, G., Nachtergaele, L., Vanhoof, B., & Catthoor, F. (2000). The local wavelet transform: a memory-efficient, high-speed architecture optimized to a region-oriented zero-tree coder. Integrated Computer-Aided Engineering, 7(2), 89–103.
14. Lee, S. W., & Lim, S. C. (2006). VLSI design of a wavelet processing core. IEEE Transactions on Circuits and Systems for Video Technology, 16(11), 1350–1361.
15. Lian, C. J., Chen, K. F., Chen, H. H., & Chen, L. G. (2001). Lifting based discrete wavelet transform architecture for JPEG2000 (pp. 445–448). Sydney: IEEE International Symposium on Circuits and Systems.
16. Liu, C. C., Shiau, Y. H., & Jou, J. M. (2000). Design and implementation of a progressive image coding chip based on the lifted wavelet transform. Proceedings of the 11th VLSI Design/CAD Symposium, Taiwan.
17. Marino, F. (2001). Two fast architectures for the direct 2-D discrete wavelet transform. IEEE Transactions on Signal Processing, 49(6), 1248–1259.
18. Vishwanath, M., Owens, R., & Irwin, M. J. (1995). VLSI architectures for the discrete wavelet transform. IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing, 42(5), 305–316.
19. Wu, P. C., & Chen, L. G. (2001). An efficient architecture for two-dimensional discrete wavelet transform. IEEE Transactions on Circuits and Systems for Video Technology, 11(4), 536–545.
20. Yu, C., & Chen, S. J. (1999). Design of an efficient VLSI architecture for 2-D discrete wavelet transforms. IEEE Transactions on Consumer Electronics, 45(1), 135–140.
21. Zhang, C., Long, Y., & Kurdahi, F. (2007). A hierarchical pipelining architecture and FPGA implementation for lifting-based 2-D DWT. Journal of Real-Time Image Processing, 2(4), 281–291.
22. Taubman, D. (2000). High performance scalable image compression with EBCOT. IEEE Transactions on Image Processing, 9(7), 1158–1170.
23. Altera Corporation. (2010). Avalon Interface Specifications [Online]. Available: http://www.altera.com/literature/lit-fs.jsp [2011, March 6].
24. Altera Corporation. (2009). Nios II C2H Compiler User Guide [Online]. Available: http://www.altera.com/literature/lit-nio2.jsp [2011, March 6].
25. Adams, M. D., & Kossentini, F. (2000). Reversible integer-to-integer wavelet transforms for image compression: performance evaluation and analysis. IEEE Transactions on Image Processing, 9(6), 1010–1024.
26. Lau, D., Pritchard, O., & Molson, P. (2006). Automated Generation of Hardware Accelerators with Direct Memory Access from ANSI/ISO C Functions, 14th Annual IEEE Symposium on Field-Programmable Custom Computing Machines, Napa, California, 45–56.
27. Altera Corporation. (2008). Optimizing Nios II C2H Compiler Results, Embedded Design Handbook [Online]. Available: http://www.altera.com/literature/lit-nio2.jsp [2011, March 6].
28. Altera Corporation. (2010). DDR and DDR2 SDRAM Controllers with ALTMEMPHY IP User Guide, External Memory Interface Handbook [Online], 3, Available: http://www.altera.com/literature/lit-ip.jsp [2011, March 6].
29. Altera Corporation. (2010). Performance Counter Core, Embedded Peripherals IP User Guide [Online]. Available: http://www.altera.com/literature/lit-ip.jsp [2011, March 6].
30. Altera Corporation. (2010). PowerPlay Power Analysis, Quartus II Handbook v10.1.0 [Online], 3, Available: http://www.altera.com/literature/lit-qts.jsp [2011, March 6].
31. Altera Corporation. (2011). Hardcopy III Device Family Overview, Hardcopy III Device Handbook [Online], 1, Available: http://www.altera.com/literature/lit-hardcopy-iii.jsp [2011, March 6].

Ishmael Sameen received the B.Eng. (First Class Honors) degree majoring in Computer in June 2008 and the M.Eng.Sc. degree in December 2011, both from Multimedia University, Cyberjaya, Malaysia. Since June 2011, he has been with Intel Malaysia working on automated graphics driver validation frameworks utilizing sophisticated image and video processing algorithms. His research interests and works include digital signal processing, image and video processing as well as high-performance parallel architectures such as FPGAs and GPU-based architectures. He has also been involved in consultancy courses for nVidia CUDA as well as being a part-time contracted programmer for implementing optimized DSP algorithms on custom-designed computer architectures. Currently, his research interest is in high-performance GPU computing on the cloud.

Chang Yoong Choon obtained his B.Eng. (First Class Honors) in Electrical & Electronic Engineering from University of Northumbria at Newcastle (Northumbria University), UK, M.Eng.Sc. and Ph.D. (Engineering) degrees from Multimedia University, Malaysia respectively. He currently holds two posts at Multimedia University, which are the Deputy Dean of Institute for Postgraduate Studies and senior lecturer in the Faculty of Engineering. He has been teaching multimedia & telecommunication engineering subjects at Multimedia University since 2003. Besides academic teaching, he is an active researcher where he is currently the project leader of digital home project funded by Malaysian Communications and Multimedia Commission. He is currently supervising five postgraduate research students in the area of multimedia communications and four postgraduate students have graduated under his supervision. His main research interests are video transmission over broadband networks, video compression and digital home.

Ng Mow Song received his B.Eng. (Hons) in Electrical Engineering from University of Malaya, Malaysia, and his M.Eng.Sc. from Multimedia University, Malaysia. He was a lecturer at the Universiti Tunku Abdul Rahman before leaving the university in February 2012 to work with his mentor on Network-on-Chip designs. His current research interests are VLSI design and signal processing.

Goi Bok-Min received his B.Eng. degree from University of Malaya (UM) in 1998, and the M.Eng.Sc. and Ph.D. degrees from Multimedia University (MMU), Malaysia in 2002 and 2006, respectively. He is now the Deputy Dean (R&D and Postgraduate Programmes) and an associate professor in the Faculty of Engineering and Science, Universiti Tunku Abdul Rahman (UTAR), Malaysia. Dr. Goi is also the chairperson of the Centre for Healthcare Science and Technology (CHST) of UTAR. He was the General Chair for ProvSec 2010 and CANS 2010, and a PC member for many crypto / security conferences. His research interests include cryptology, security protocols, information security, digital watermarking, computer networking and embedded systems design.

Ooi Chee-Pun is a senior lecturer in the Faculty of Engineering, Multimedia University. He received his M.Sc. degree from Queen’s University of Belfast, UK prior to his Ph.D. degree from the University of Malaya, Malaysia in 2010. His current research interest is in FPGA-based controllers for visual-based analytical functions.
