Optimization technology of 9/7 wavelet lifting scheme on DSP∗

Zhengzhang Chen^a, Xiaoyuan Yang, Rui Yang
Digital Media Laboratory, School of Computer Science and Engineering, Beihang University, Beijing 100083, China

ABSTRACT

The wavelet transform has become one of the most effective transform tools in image processing. In particular, the biorthogonal 9/7 wavelet filters proposed by Daubechies perform well in image compression. This paper studies in depth the implementation and optimization of the 9/7 wavelet lifting scheme on a DSP platform. The techniques include performing the lifting steps in fixed-point arithmetic instead of time-consuming floating-point operations, applying a pipelining technique to the iteration procedure, reducing the number of multiplications by simplifying the normalization of the two-dimensional wavelet transform, and improving the storage format and order of the wavelet coefficients to reduce memory consumption. Experimental results show that these techniques speed up the wavelet lifting algorithm by more than a factor of 30. This lays a technical foundation for a future real-time remote sensing image compression system.

Keywords: wavelet transform, lifting, remote sensing image compression, DSP, algorithm optimization, software pipeline

1. INTRODUCTION

The wavelet transform, a rapidly developing branch of applied mathematics that emerged in the late 1980s, is a milestone in the development of Fourier analysis. It has also become a powerful tool for digital image compression. Its main advantage over the discrete cosine transform (DCT) is that it is localized in both time and frequency, which results in better compression performance. Researchers have therefore paid much attention to wavelet construction and proposed a number of well-known wavelet bases. In particular, the CDF (Cohen, Daubechies and Feauveau) 9/7 biorthogonal wavelet has outstanding transform properties and has been widely used in many areas, including the new-generation still-image compression standard JPEG2000. The wavelet lifting scheme [1][2], proposed by Wim Sweldens, is a simple construction of second-generation wavelets. Although it operates entirely in the spatial domain, it accomplishes the same frequency-domain analysis of the signal. Compared with the traditional convolution-based Mallat algorithm, the lifting algorithm has a simpler operation procedure and a lower storage demand. Its forward and inverse transforms also share the same structure, so it allows a faster, in-place computation of the wavelet transform. However, the lifting structure still requires many time-consuming floating-point operations to compute the wavelet coefficients. This makes the hardware realization of the wavelet transform harder and lowers the efficiency of the whole compression algorithm. In addition, remote sensing techniques keep developing and optical sensor resolution keeps improving, so general-purpose PC platforms cannot meet the requirements of real-time operation, low power consumption and large data volumes. There is therefore an urgent need for an efficient implementation of wavelet compression.
With limited hardware resources, the key to this problem lies in improving the algorithm's efficiency on the hardware platform. Motivated by the specific needs of remote sensing image compression, in this paper we port the 9/7 biorthogonal lifting wavelet algorithm to a DSP hardware platform. We then study in depth concrete implementation and optimization techniques that take the characteristics of both software and hardware into account.

∗ Supported by the NSFC (60573150), the Program for New Century Excellent Talents in University and the 863 Program (Hi-Tech Research and Development Program of China). The research was carried out in the State Key Laboratory of Virtual Reality Technologies.
a Email:[email protected]

MIPPR 2007: Medical Imaging, Parallel Processing of Images, and Optimization Techniques, edited by Jianguo Liu, Kunio Doi, Patrick S. P. Wang, Qiang Li, Proc. of SPIE Vol. 6789, 67892B, (2007) · 1605-7422/07/$18 · doi: 10.1117/12.749914
Proc. of SPIE Vol. 6789 67892B-1 Downloaded from SPIE Digital Library on 07 Aug 2012 to 129.105.215.146. Terms of Use: http://spiedl.org/terms

The paper is organized as follows. Section 2 describes the hardware platform, and Section 3 explains the principle of the wavelet lifting algorithm. Sections 4 to 7 study the optimization measures for the wavelet transform on DSP: fixed-point arithmetic, effective pipelining, a simplified normalization, and a suitable coefficient storage scheme. Section 8 compares the experimental results of the optimized and original algorithms. Finally, Section 9 concludes and outlines future research.

2. HARDWARE PLATFORM

A remote sensing image compression system must handle large data volumes in real time, so it needs a highly scalable multiprocessor architecture [3]. In recent years, DSP technology has developed rapidly. DSPs now offer higher processing speed, richer functionality and better reliability, and all of these favour real-time image compression. Our research group has developed a dedicated image compression platform based on a 4-DSP modular unit (Fig. 1). In the remote sensing image compression prototype, this modular unit performs the parallel and pipelined processing of the algorithm, covering both the image transform and the coding stages. During compression, input image data are distributed over the PCI bus interface to the four DSPs, and each DSP performs compression. DSP #1 also reads the image data from dual-port memory #1 and distributes them. DSP #4 also writes the compressed result to dual-port memory #2 and notifies the network transmission module to send the data over Ethernet.

Fig. 1. Structure and interconnections of the 4-DSP modular unit

The modular unit uses four Analog Devices ADSP-TS201 processors as its core. Each processor runs at 600 MHz and has 24 Mbit of on-chip memory. Each processor also has four link ports, and each link port consists of a transmitter and a receiver with four pairs of differential data lines. Every link port has its own DMA channels, through which data are transferred, and the input/output rate of a link port can reach 500 MB/s. Communication between the DSP chips uses a program-controlled circuit-switching mode based on DMA, which has low overhead and high efficiency.

3. THE PRINCIPLE OF THE WAVELET LIFTING ALGORITHM

The lifting wavelet [1][2] is a simple and practical wavelet decomposition technique. It uses three basic steps, split, predict and update, to perform wavelet decomposition and reconstruction (as shown in Fig. 2). In the split step, the signal is divided into its even-indexed and odd-indexed samples. In the predict step, the even samples are used to predict and correct the odd ones, which produces the high-frequency signal. Finally, in the update step, the high-frequency signal is used to update the even samples, which yields the low-frequency signal.

Fig. 2. Decomposition and reconstruction procedure of the lifting scheme

According to the basic principle of the lifting scheme, the decomposition procedure of the CDF 9/7 lifting wavelet can be expressed by the following matrix:

$$
\begin{bmatrix} X_l(z) \\ X_h(z) \end{bmatrix}
=
\begin{bmatrix} H_e(z^{-1}) & H_o(z^{-1}) \\ G_e(z^{-1}) & G_o(z^{-1}) \end{bmatrix}
\begin{bmatrix} X_e(z) \\ X_o(z) \end{bmatrix}
\tag{1}
$$

Using the Z-transform and the Euclidean factorization method, this matrix can be factored further into:

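$$
\begin{bmatrix} X_l(z) \\ X_h(z) \end{bmatrix}
=
\begin{bmatrix} K & 0 \\ 0 & 1/K \end{bmatrix}
\begin{bmatrix} 1 & \delta(1+z^{-1}) \\ 0 & 1 \end{bmatrix}
\begin{bmatrix} 1 & 0 \\ \gamma(z+1) & 1 \end{bmatrix}
\begin{bmatrix} 1 & \beta(1+z^{-1}) \\ 0 & 1 \end{bmatrix}
\begin{bmatrix} 1 & 0 \\ \alpha(z+1) & 1 \end{bmatrix}
\begin{bmatrix} X_e(z) \\ X_o(z) \end{bmatrix}
\tag{2}
$$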

In the two formulas above, X_e(z) and X_o(z) denote the even and odd input signals, and X_l(z) and X_h(z) denote the low-pass and high-pass outputs, respectively. From the factorization, the two-round lifting iteration of CDF 9/7 can be derived, where x_0[l] = x[2l] and x_1[l] = x[2l+1] are the even and odd input samples:

$$
\begin{aligned}
&(3.1)\quad s_l^{(0)} = x_0[l], \qquad d_l^{(0)} = x_1[l] \\
&(3.2)\quad d_l^{(1)} = d_l^{(0)} + \alpha\left(s_l^{(0)} + s_{l+1}^{(0)}\right) \\
&(3.3)\quad s_l^{(1)} = s_l^{(0)} + \beta\left(d_l^{(1)} + d_{l-1}^{(1)}\right) \\
&(3.4)\quad d_l^{(2)} = d_l^{(1)} + \gamma\left(s_l^{(1)} + s_{l+1}^{(1)}\right) \\
&(3.5)\quad s_l^{(2)} = s_l^{(1)} + \delta\left(d_l^{(2)} + d_{l-1}^{(2)}\right) \\
&(3.6)\quad s_l = K\, s_l^{(2)} \\
&(3.7)\quad d_l = \frac{1}{K}\, d_l^{(2)}
\end{aligned}
\tag{3}
$$

In this formula group, s_l and d_l are the low-frequency and high-frequency coefficients, respectively. The terms s_l^{(i)} and d_l^{(i)} (i = 1, 2) are intermediate results of the lifting process. The lifting coefficients α, β, γ, δ and the scaling coefficient K take the following values:

$$
\begin{cases}
\alpha = -1.586134342 \\
\beta = -0.052980118 \\
\gamma = 0.882911075 \\
\delta = 0.443506852 \\
K = 1.149604398
\end{cases}
\tag{4}
$$
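As a quick sanity check on the constants in (4), the sketch below runs steps (3.2) to (3.7) on a constant signal. This check is our own illustration, not part of the authors' method. For a constant input every neighbour sum is simply twice the value, so the high-pass output should vanish. The low-pass gain should come out close to √2, assuming the scaling s_l = K s_l^{(2)} and d_l = d_l^{(2)}/K implied by the factorization (2). The function name dc_response is hypothetical.

```python
import math

# Lifting and scaling constants from formula (4)
ALPHA = -1.586134342
BETA = -0.052980118
GAMMA = 0.882911075
DELTA = 0.443506852
K = 1.149604398


def dc_response(c=1.0):
    """Apply steps (3.2)-(3.7) to a constant signal x[n] = c.

    For a constant input each neighbour sum equals twice one neighbour,
    so every lifting step reduces to a scalar update.
    """
    s0, d0 = c, c                   # (3.1): even and odd samples are both c
    d1 = d0 + ALPHA * (2 * s0)      # (3.2) predict
    s1 = s0 + BETA * (2 * d1)       # (3.3) update
    d2 = d1 + GAMMA * (2 * s1)      # (3.4) predict
    s2 = s1 + DELTA * (2 * d2)      # (3.5) update
    return K * s2, d2 / K           # (3.6), (3.7) scaling


low, high = dc_response()
print(f"low-pass DC gain  = {low:.7f} (sqrt(2) = {math.sqrt(2):.7f})")
print(f"high-pass DC gain = {high:.1e}")
```

The near-zero high-pass response reflects the vanishing moments of the analysis high-pass filter. The √2 low-pass gain corresponds to the normalization in which the low-pass filter taps sum to √2. Any residual comes only from the nine-digit rounding of the constants.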

Boundary extension is an important issue in the wavelet transform. There are generally four approaches: (1) padding the boundary with zeros; (2) repeating the boundary value; (3) periodic extension; (4) symmetric (mirror) extension, with either odd or even symmetry. The CDF 9/7 wavelet transform must use the fourth method. Its filter lengths are odd, which calls for odd-symmetric mirror extension, and this extension also guarantees perfect reconstruction of the signal.
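The sketch below implements the lifting steps (3.1) to (3.7) in floating-point Python, together with the mirror extension just described. It is an illustrative reference, not the authors' DSP implementation. The names forward_97 and inverse_97 are hypothetical, and the input is assumed to have even length. The whole-sample mirror is applied at every step by reusing the nearest coefficient at each edge (d_{-1} = d_0 and s_n = s_{n-1}). For even-length signals, this matches mirroring the input about its first and last samples.

```python
# Constants from formula (4)
ALPHA, BETA = -1.586134342, -0.052980118
GAMMA, DELTA = 0.882911075, 0.443506852
K = 1.149604398


def _right(a, l):
    """a[l + 1], mirrored at the right edge (s_n -> s_{n-1})."""
    return a[l + 1] if l + 1 < len(a) else a[l]


def _left(a, l):
    """a[l - 1], mirrored at the left edge (d_{-1} -> d_0)."""
    return a[l - 1] if l > 0 else a[l]


def forward_97(x):
    """One level of the CDF 9/7 lifting transform following formula (3).

    Returns (s, d): the low-pass and high-pass halves.
    """
    if len(x) % 2:
        raise ValueError("even-length input assumed")
    s = [float(v) for v in x[0::2]]              # (3.1) split: even samples
    d = [float(v) for v in x[1::2]]              # (3.1) split: odd samples
    n = len(s)
    for l in range(n):
        d[l] += ALPHA * (s[l] + _right(s, l))    # (3.2) predict
    for l in range(n):
        s[l] += BETA * (d[l] + _left(d, l))      # (3.3) update
    for l in range(n):
        d[l] += GAMMA * (s[l] + _right(s, l))    # (3.4) predict
    for l in range(n):
        s[l] += DELTA * (d[l] + _left(d, l))     # (3.5) update
    return [K * v for v in s], [v / K for v in d]  # (3.6), (3.7)


def inverse_97(s, d):
    """Undo forward_97: same steps in reverse order with opposite signs."""
    s = [v / K for v in s]
    d = [v * K for v in d]
    n = len(s)
    for l in range(n):
        s[l] -= DELTA * (d[l] + _left(d, l))
    for l in range(n):
        d[l] -= GAMMA * (s[l] + _right(s, l))
    for l in range(n):
        s[l] -= BETA * (d[l] + _left(d, l))
    for l in range(n):
        d[l] -= ALPHA * (s[l] + _right(s, l))
    x = [0.0] * (2 * n)
    x[0::2], x[1::2] = s, d                      # merge even and odd samples
    return x
```

inverse_97 replays the same steps in reverse order with the signs flipped. Reconstruction is therefore exact up to floating-point rounding, whatever the coefficient values. This is the structural property of lifting noted in the Introduction.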

4. IMPLEMENTATION OF FIXED-POINT WAVELET LIFTING ALGORITHM [4]

The BPP algorithm we adopt is a Bit Plane Predicting image compression algorithm based on the CDF 9/7 Mallat wavelet transform. To speed up the wavelet transform, we ported the algorithm to the DSP platform described above and implemented the lifting scheme instead of the Mallat algorithm. On a hardware processor, floating-point logic is more complex than fixed-point logic and requires more hardware resources and energy. Fixed-point arithmetic is therefore the first choice for reaching the desired efficiency in complex, time-consuming programs such as the lifting wavelet algorithm. As formula (4) shows, the CDF 9/7 lifting and scaling coefficients are all irrational numbers, so a fixed-point lifting algorithm requires converting the coefficients from floating-point format. Following the principles of fixed-point processing, we first express all coefficients in binary. We then realize the floating-point multiplications of the lifting algorithm with binary shift-and-add operations. The procedure is as follows:

(1) First, the lifting coefficients α, β, γ, δ and the scaling coefficients K and 1/K are expressed in binary (16 fractional bits):
α ≈ -1.586134342 → |α| = 1.1001011000001100
β ≈ -0.052980118 → |β| = 0.0000110110010000
γ ≈ 0.882911075 → |γ| = 0.1110001000000110
δ ≈ 0.443506852 → |δ| = 0.0111000110001001
K ≈ 1.149604398 → |K| = 1.0010011001001100
1/K ≈ 0.869864452 → |1/K| = 0.1101111010101111
(2) Then, the multiplications by these coefficients are implemented with shift-and-add operations. For instance, the iterative formula (3.2) can be converted into:

$$
\begin{aligned}
t_1 &= -\left(s_l^{(0)} + s_{l+1}^{(0)}\right) \\
t_2 &= t_1 + (t_1 \gg 1) + (t_1 \gg 4) + (t_1 \gg 6) + (t_1 \gg 7) + (t_1 \gg 13) \\
d_l^{(1)} &= d_l^{(0)} + t_2
\end{aligned}
\tag{5}
$$

In the formula above, t_1 and t_2 are temporary variables. Since α < 0, the neighbour sum is negated and then multiplied by |α| ≈ 1 + 2^{-1} + 2^{-4} + 2^{-6} + 2^{-7} + 2^{-13}. The other iterative steps of formula group (3), from (3.3) to (3.7), can be implemented in the same way, which yields a complete fixed-point lifting algorithm. To balance accuracy and cost, this paper retains five significant binary digits. Keeping more digits increases the computational complexity, whereas keeping too few greatly reduces the precision of the result. Experimental statistics show that with at least five significant digits retained, the Peak Signal to Noise Ratio (PSNR) of the image reconstructed with fixed-point arithmetic is nearly equal to that obtained with floating-point arithmetic, with only a slight PSNR loss.
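The binary expansions in step (1) and the shift-and-add form of (5) can be reproduced with a short Python sketch. It is an illustration, not the authors' DSP code. The names binary_digits, shift_terms and predict_alpha_fixed are hypothetical, and Python integers stand in for the DSP's fixed-point samples. Python's >> on int is assumed to behave like the DSP's arithmetic (sign-preserving) right shift.

```python
def binary_digits(c, frac_bits=16):
    """Truncated binary expansion of |c|, e.g. '1.1001011000001100'."""
    mag = abs(c)
    int_part = int(mag)
    frac = mag - int_part
    bits = []
    for _ in range(frac_bits):
        frac *= 2              # doubling is exact in binary floating point
        bit = int(frac)
        bits.append(str(bit))
        frac -= bit
    return f"{int_part}." + "".join(bits)


def shift_terms(c, frac_bits=16, max_terms=5):
    """Integer part of |c| and the shift amounts of its first nonzero fractional bits."""
    int_part, frac = binary_digits(c, frac_bits).split(".")
    shifts = [k + 1 for k, b in enumerate(frac) if b == "1"][:max_terms]
    return int(int_part), shifts


# |alpha| = 1.1001011000001100b; keeping five fractional terms gives the shifts of (5)
ALPHA_INT, ALPHA_SHIFTS = shift_terms(-1.586134342)   # (1, [1, 4, 6, 7, 13])


def predict_alpha_fixed(d0, s0, s1):
    """Fixed-point step (3.2) following formula (5).

    alpha is negative, so the neighbour sum is negated first and then
    multiplied by |alpha| using only shifts and adds (integer part of |alpha| is 1).
    """
    t1 = -(s0 + s1)
    t2 = t1 + sum(t1 >> k for k in ALPHA_SHIFTS)
    return d0 + t2
```

For s0 = s1 = 1000 the fixed-point step returns -3174, against α·2000 ≈ -3172.27 in floating point. The gap has two sources: truncating |α| to five shift terms, and each shift rounding toward negative infinity. This is the precision-versus-cost trade-off described above.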


5. PIPELINE IMPROVEMENT OF WAVELET LIFTING ALGORITHM

Analyzing the wavelet lifting algorithm, we find that its two-round iteration procedure is poorly suited to pipelining. As Fig. 3 shows, steps 1 to 4 make up the two-round iteration, and the sequence X(n) represents the image pixels. In the original lifting algorithm, one complete two-round iteration requires nine input pixels, X(0) to X(8). The pair of wavelet coefficients Y(3) and Y(4) is obtained only after a pyramid-shaped computation. This is a many-input, single-output process in which each input group overlaps the next one. Consequently, the original algorithm cannot be pipelined smoothly, and it also performs a great deal of redundant computation.

Fig. 3. Pipelining procedure of the lifting algorithm (inputs X(0) to X(8); lifting steps 1 to 4 with nodes labelled a to d; scaling steps 5 and 6 with k and 1/k; outputs Y(3) and Y(4))

To improve the pipelining of the wavelet lifting algorithm, we add a data preprocessing buffer and an intermediate-results buffer to the implementation. Data preprocessing means computing the iteration results of the first few input pixels before the actual iterative calculation starts. These pre-computed results are stored in the buffer as intermediate results and fed in at the start of the pipeline. The intermediate-results buffer is needed because intermediate results serve as input data for the next iteration. In Fig. 3 it stores the results of the dashed-circle part. Without these buffers, the pipeline would stall for lack of intermediate data. With the preprocessing and intermediate-results buffers, only X(7) and X(8) need to be input to obtain Y(3) and Y(4), because all other required data are already stored in the buffers. The iteration thus becomes a one-to-one process in which each input pair produces one output pair. As subsequent pixels keep arriving, the pipeline of the four calculation modules (the four larger solid circles in Fig. 3) no longer breaks. The continually updated intermediate-results buffer ensures that each unit obtains the input data it needs. As a result of these adjustments, the efficiency of the lifting process increases considerably.

One important remaining problem in improving the pipeline is boundary extension. For the CDF 9/7 wavelet, the filter length requires 3 to 4 samples to be extended at each boundary, which adds at least eight memory units to the full line buffer. Moreover, the buffers and the signal extension would interfere with the four-stage pipeline, so scheduling the extension sensibly is crucial.
The solution adopted in this paper is to perform the signal extension before the iteration rather than during it. Although this adds a small number of clock cycles, it largely avoids the negative impact of boundary extension on pipelining and improves the overall efficiency of the algorithm.
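The buffering idea can be illustrated with a small Python sketch under stated assumptions; it is our own construction, not the authors' DSP code. The signal is mirror-extended once before the loop, as proposed above. A hypothetical StreamingLifting97 object then keeps only the previous input pair and three intermediate values (d1, s1, d2) as its intermediate-results buffer. A four-pair warm-up plays the role of the preprocessing step. After it, each push of one (even, odd) pair returns one (low, high) pair, with a latency of two pairs. The names extend_pairs and stream_transform are also hypothetical. Floating-point arithmetic is used for clarity, and the fixed-point shifts of Section 4 could be substituted in each step.

```python
# Constants from formula (4)
ALPHA, BETA = -1.586134342, -0.052980118
GAMMA, DELTA = 0.882911075, 0.443506852
K = 1.149604398


def extend_pairs(x):
    """Mirror-extend x by 4 samples per side, then split into (even, odd) pairs.

    Whole-sample symmetric extension done once, before the iteration.
    Assumes len(x) is even and at least 6.
    """
    n = len(x)
    ext = [x[4], x[3], x[2], x[1]] + list(x) + [x[n - 2], x[n - 3], x[n - 4], x[n - 5]]
    return [(ext[i], ext[i + 1]) for i in range(0, len(ext), 2)]


class StreamingLifting97:
    """One-pair-in / one-pair-out form of formula (3) (illustrative)."""

    WARM_UP = 4  # pairs consumed before the buffered intermediates are valid

    def __init__(self):
        self.e = self.o = 0.0     # previous input pair (e[m-1], o[m-1])
        self.d1 = self.s1 = 0.0   # intermediate results d1[m-2], s1[m-2]
        self.d2 = 0.0             # intermediate result d2[m-3]
        self.count = 0

    def push(self, e, o):
        """Consume input pair m; return output pair m-2, or None during warm-up."""
        d1 = self.o + ALPHA * (self.e + e)       # step 1: d1[m-1]
        s1 = self.e + BETA * (d1 + self.d1)      # step 2: s1[m-1]
        d2 = self.d1 + GAMMA * (self.s1 + s1)    # step 3: d2[m-2]
        s2 = self.s1 + DELTA * (d2 + self.d2)    # step 4: s2[m-2]
        # refresh the intermediate-results buffer for the next pair
        self.e, self.o, self.d1, self.s1, self.d2 = e, o, d1, s1, d2
        self.count += 1
        if self.count <= self.WARM_UP:
            return None
        return K * s2, d2 / K                    # scaling, (3.6) and (3.7)


def stream_transform(x):
    """Run the pipeline over the pre-extended signal; return (low, high) lists."""
    stage = StreamingLifting97()
    outputs = [stage.push(e, o) for e, o in extend_pairs(x)]
    pairs = [p for p in outputs if p is not None]
    return [p[0] for p in pairs], [p[1] for p in pairs]
```

Because the extension is done up front, the loop body contains no boundary tests. In this sketch, the cost of the extension is eight extra samples per line.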


6. SIMPLIFYING THE NORMALIZATION PROCESS [5]

Normalization is the step of the lifting algorithm that follows the iterative calculations; it ensures that the transform satisfies the normalization condition. In the CDF 9/7 lifting process, the final step, formulas (3.6) and (3.7), is the normalization. In the original algorithm, every coefficient on each horizontal and vertical line must be multiplied by the scaling coefficient. Although the step looks simple, these multiplications cost considerable running time.

Fig. 4. The distribution of frequency bands (LL, HL, LH, HH)

In this paper we simplify this normalization process. In the original algorithm, the two-dimensional wavelet transform first performs the lifting steps and the normalization step in the horizontal direction, and then the same steps in the vertical direction. The improved method combines the horizontal and vertical normalization steps. The lifting steps are linear, and the horizontal scaling factor is constant along each column. The horizontal scaling can therefore be postponed until after the vertical lifting without changing the result. In the LH and HL bands, the horizontal factor is then cancelled by the vertical one (K · 1/K = 1). In the LL and HH bands, the two factors combine into a single coefficient (K² and 1/K², respectively). The processing order therefore becomes: first the horizontal lifting steps, then the vertical lifting steps, and finally a single normalization for both directions according to formula (6). This avoids many multiplications and reduces the complexity of the algorithm.

$$
\begin{cases}
X_{LL}(z) = K^{2} \cdot X_e^{1}(z) \\
X_{LH}(z) = X_e^{1}(z) \\
X_{HL}(z) = X_o^{1}(z) \\
X_{HH}(z) = \dfrac{1}{K^{2}} \cdot X_o^{1}(z)
\end{cases}
\tag{6}
$$
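A small pure-Python check compares the two orders on an 8×8 test block. It also counts the scaling multiplications for one decomposition level. This is an illustration under our own assumptions, not the authors' implementation. The helpers lift_unscaled and dwt2_one_level are hypothetical, a row-then-column layout is assumed with LL top-left and HH bottom-right, and divisions by K are counted as multiplications.

```python
# Constants from formula (4)
ALPHA, BETA = -1.586134342, -0.052980118
GAMMA, DELTA = 0.882911075, 0.443506852
K = 1.149604398
K2, INV_K2 = K * K, 1.0 / (K * K)   # precomputed combined factors of formula (6)


def lift_unscaled(x):
    """Steps (3.1)-(3.5) on an even-length list, without the scaling of (3.6)-(3.7).

    Boundaries use whole-sample symmetric extension (d[-1] = d[0], s[n] = s[n-1]).
    """
    s, d = list(x[0::2]), list(x[1::2])
    n = len(s)
    steps = ((ALPHA, d, s, 1), (BETA, s, d, -1), (GAMMA, d, s, 1), (DELTA, s, d, -1))
    for coef, target, source, offset in steps:
        for l in range(n):
            nb = min(max(l + offset, 0), n - 1)   # mirrored neighbour index
            target[l] += coef * (source[l] + source[nb])
    return s, d


def dwt2_one_level(img, merged):
    """One 2-D level (rows, then columns); returns (coefficients, scaling multiplications)."""
    h, w = len(img), len(img[0])
    mults = 0
    rows = []
    for r in img:                                 # horizontal lifting
        s, d = lift_unscaled(r)
        if not merged:                            # separate horizontal normalization
            s, d = [K * v for v in s], [v / K for v in d]
            mults += w
        rows.append(s + d)
    out = [[0.0] * w for _ in range(h)]
    for c in range(w):                            # vertical lifting
        s, d = lift_unscaled([rows[r][c] for r in range(h)])
        if not merged:                            # separate vertical normalization
            s, d = [K * v for v in s], [v / K for v in d]
            mults += h
        for r, v in enumerate(s + d):
            out[r][c] = v
    if merged:                                    # single combined normalization, formula (6)
        for r in range(h // 2):
            for c in range(w // 2):
                out[r][c] *= K2                   # LL band: K^2
                out[r + h // 2][c + w // 2] *= INV_K2   # HH band: 1/K^2
                mults += 2
    return out, mults
```

For one level of an h×w image, the separate scheme performs 2hw scaling operations and the merged scheme hw/2. For the 8×8 block this drops the count from 128 to 32, while both orders give the same coefficients up to rounding.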

7. IMPROVEMENT OF THE WAVELET COEFFICIENT STORAGE

When processing large images, especially remote sensing images, the memory system becomes the performance bottleneck of the algorithm. It is therefore important, at the storage level, to make memory access more efficient and less costly. To reduce system memory usage, this paper improves the storage format and order of the wavelet coefficients produced by the transform.

Fig. 5. Storage format of a wavelet coefficient (sign bit and absolute value)


In the original BPP [4] algorithm, the absolute values and signs of the wavelet coefficients were stored in two separate image-size arrays, and both arrays had to be accessed during bit-plane coding. This storage format takes up a great deal of memory and causes a large number of memory accesses. The improvement merges the two: the most significant bit holds the sign (1 for negative, 0 otherwise), and the lower 31 bits hold the absolute value of the coefficient (Fig. 5). The new format reduces both storage cost and the number of memory accesses, using four bytes per coefficient instead of the eight bytes of the original algorithm. A single memory access now yields both the sign and the magnitude of a coefficient. The format also lends itself to bitwise AND and OR operations, which markedly speeds up coding. In addition, this paper stores coefficients in frequency-and-direction order instead of line order. In the new order, the decomposition level takes priority: coefficients are stored from the lowest-frequency level to the highest. Within the same level, bands follow the direction order diagonal, horizontal, vertical, and within a band, coefficients follow raster-scan order. Moreover, a descriptor object is created for every frequency band. It holds self-maintenance information such as the start address, top bit-plane number, band size, parent-band address and sibling-band address. Consequently, the new storage order matches the order in which the coder accesses the coefficients, and the information in the band objects can guide the coding effectively.
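The packed sign-magnitude word can be sketched in Python as below. The sketch assumes integer-quantized coefficients whose magnitudes fit in 31 bits. The names pack, unpack, bit_plane, top_bit_plane and BandInfo are hypothetical. BandInfo only mirrors the fields listed in the text (start address, top bit-plane number, band size, parent and sibling bands), and its field names are our own.

```python
from dataclasses import dataclass

SIGN_BIT = 1 << 31
MAG_MASK = SIGN_BIT - 1      # lower 31 bits hold the absolute value


def pack(coef):
    """Sign-magnitude word: bit 31 = sign (1 = negative), bits 30..0 = |coef|."""
    mag = abs(coef)
    if mag > MAG_MASK:
        raise OverflowError("magnitude does not fit in 31 bits")
    return (SIGN_BIT if coef < 0 else 0) | mag


def unpack(word):
    """Recover the signed coefficient from one packed word."""
    mag = word & MAG_MASK
    return -mag if word & SIGN_BIT else mag


def bit_plane(word, p):
    """Magnitude bit p (p < 31) of a packed word: one shift and one AND."""
    return (word >> p) & 1


def top_bit_plane(words):
    """Highest magnitude bit set anywhere in a band (-1 if all coefficients are zero)."""
    acc = 0
    for w in words:
        acc |= w & MAG_MASK  # OR all magnitudes together
    return acc.bit_length() - 1


@dataclass
class BandInfo:
    """Per-band descriptor mirroring the fields listed in the text (illustrative)."""
    head: int        # offset of the band's first coefficient in the buffer
    top_plane: int   # top bit-plane number
    width: int
    height: int
    parent: int      # offset of the parent band (-1 if none)
    sibling: int     # offset of the next band at the same level (-1 if none)
```

Because sign and magnitude share one word, a bit-plane coder touches each coefficient once per pass. The band's top bit plane can be computed with a single OR reduction before coding starts.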

8. EXPERIMENT RESULTS

After implementing and optimizing the wavelet lifting algorithm on the DSP platform, we ran comparative experiments on a number of typical remote sensing images to validate the optimization techniques discussed above. Table 1 gives the optimization results for 256×256 images at 0.4 bpp.

Table 1. The optimization results of the wavelet lifting algorithm (unit: million clock cycles)

| Image    | Original algorithm | Optimized algorithm | Speedup (×) |
|----------|--------------------|---------------------|-------------|
| City     | 119.69             | 3.44                | 34.79       |
| Factory  | 119.49             | 3.39                | 35.23       |
| Beijing  | 119.59             | 3.41                | 35.12       |
| HongKong | 119.52             | 3.43                | 34.86       |
| Barbara  | 119.87             | 3.53                | 34.01       |

Table 1 shows that the original algorithm needed more than 100 million clock cycles (about 120 million) to complete a five-level wavelet transform, while the optimized algorithm required only about 3.5 million. The efficiency of the wavelet lifting algorithm thus increased by more than a factor of 30. In addition, the optimization techniques do not noticeably affect the quality of the reconstructed image.
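The figures in Table 1 can be cross-checked with a few lines of arithmetic. This is an illustrative check that assumes two things: the cycle counts cover the full five-level transform, and the DSP runs at the 600 MHz quoted in Section 2. The names TABLE_1, cycles_to_ms and summarize are hypothetical.

```python
CLOCK_HZ = 600e6  # ADSP-TS201 clock rate quoted in Section 2

# (image, original, optimized) cycle counts in millions, copied from Table 1
TABLE_1 = [
    ("City", 119.69, 3.44),
    ("Factory", 119.49, 3.39),
    ("Beijing", 119.59, 3.41),
    ("HongKong", 119.52, 3.43),
    ("Barbara", 119.87, 3.53),
]


def cycles_to_ms(mcycles, clock_hz=CLOCK_HZ):
    """Convert a count in millions of clock cycles to milliseconds."""
    return mcycles * 1e6 / clock_hz * 1e3


def summarize(rows, clock_hz=CLOCK_HZ):
    """Recompute speedup and run time (ms) for each image."""
    return [(name, orig / opt, cycles_to_ms(orig, clock_hz), cycles_to_ms(opt, clock_hz))
            for name, orig, opt in rows]


for name, speedup, orig_ms, opt_ms in summarize(TABLE_1):
    print(f"{name:9s} speedup {speedup:5.2f}x   {orig_ms:6.1f} ms -> {opt_ms:4.2f} ms")
```

For some images, the ratios recomputed from the two-decimal cycle counts differ from the last column in the second decimal place. For example, Beijing gives 35.07 rather than 35.12, presumably because the published speedups were computed from unrounded counts. At 600 MHz the optimized counts correspond to roughly 5.7 to 5.9 ms per image, consistent with the under-7 ms figure in Section 9. The original counts correspond to about 200 ms.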

9. CONCLUSION

Based on the principle of the wavelet lifting algorithm, this paper proposes a series of optimization techniques for the 9/7 lifting scheme on a DSP platform. To better integrate the characteristics of the hardware and software, we use fixed-point arithmetic instead of time-consuming floating-point operations. We improve pipeline efficiency by adding a preprocessing buffer and an intermediate-results buffer. In addition, we merge the normalization operations of the horizontal and vertical directions, which effectively reduces the number of multiplications. Moreover, the storage format and order of the wavelet coefficients are modified to make memory access more efficient. All these techniques are applied to the BPP compression algorithm and achieve real-time processing of high-resolution remote sensing images. Experimental results show that a five-level wavelet transform of a 256×256 image takes less than 7 ms. Furthermore, the optimization techniques are independent of the coding scheme, so they can be incorporated into other wavelet image coders to improve their computational efficiency.

Our work is only a preliminary exploration of optimizing the wavelet lifting scheme for image compression on DSP, and further improvements remain. For example, the 9/7 biorthogonal wavelet is only one of the outstanding wavelet bases in image processing, and other wavelet bases with different filter lengths are a natural next research topic. Another direction is to study deeper optimizations, including constructing wavelet bases suited to DSP and optimizing the algorithm at the instruction level.

REFERENCES

1. I. Daubechies, W. Sweldens, "Factoring wavelet transforms into lifting steps," J. Fourier Anal. Appl. 4(3), 245-267 (1998).
2. W. Sweldens, "The lifting scheme: a construction of second generation wavelets," SIAM J. Math. Anal. 29(2), 511-546 (1997).
3. J. A. Sgro, "A highly scalable multiprocessor based on the ADI SHARC DSP," Proceedings of International Conference on Signal Processing Application & Technology, 1396-1400 (1997).
4. Li Bo, Wang Hai, "Bit plane predicting image compression algorithm based on wavelet packet transform," Chinese J. Computers 22(7), 686-691 (July 1999) (in Chinese).
5. Z. Xiong, "Representation and Coding of Images Using Wavelets," Ph.D. dissertation, University of Illinois at Urbana-Champaign (1996).
6. Analog Devices, Inc., ADSP Hardware Reference for TigerSHARC TS-201S DSP Processors, USA (2004).
7. Analog Devices, Inc., ADSP Programming Reference for TigerSHARC TS-201S DSP Processors, USA (2004).
