
An efficient adaptive binary range coder and its VLSI architecture

Evgeny Belyaev, Member, IEEE, Kai Liu, Moncef Gabbouj, Fellow, IEEE, and YunSong Li

Abstract—In this paper we propose a new hardware-efficient adaptive binary range coder (ABRC) and its VLSI architecture. To achieve this we follow an approach which reduces the bit capacity of the multiplication needed in the interval division part, and we show how to avoid a loop in the renormalization part of the ABRC. The probability estimation in the ABRC is based on a look-up-table-free virtual sliding window. To obtain higher compression performance we propose a new adaptive window size selection algorithm. In comparison with an ABRC with a single window, the proposed system provides faster probability adaptation at the initial encoding/decoding stage and more precise probability estimation for very low entropy binary sources. We show that the VLSI architecture of the proposed ABRC attains a throughput of 105.92 MSymbols/sec on an FPGA platform and consumes 18.15 mW of dynamic power. In comparison with the state-of-the-art MQ-coder from the JPEG2000 standard and the M-coder from the H.264/AVC and H.265/HEVC standards, the proposed ABRC architecture provides comparable throughput with less memory and power consumption. Experimental results obtained for a wavelet video codec with a JPEG2000-like bit-plane entropy coder show that the proposed ABRC reduces the bit rate by 0.8–8% in comparison with the MQ-coder and by 1.0–24.2% in comparison with the M-coder.

(ABRC) can be efficiently used for video coding [10], [11] as well as for universal data compression [12]. For software implementations it was shown that the ABRC has up to 40% less computational complexity than the M-coder [13]. However, from a hardware implementation point of view, the main drawback of the ABRC is the use of a multiplication in the interval division part. On the other hand, in modern architectures the cost of a multiplication can be comparable with the cost of a look-up table access. Moreover, byte renormalization is much less complex than bit renormalization. Therefore, it is very important to answer the question of which adaptive binary coding is more attractive for future high-performance hardware image and video compression systems: the state-of-the-art ABACs, such as the MQ-coder and the M-coder, or the ABRC? To the best of the authors' knowledge, this work is the first one related to this issue.

Index Terms—Arithmetic coding, range coder, image/video compression, VLSI architecture.

1) We propose a modification of adaptive binary range coding which is suitable for hardware implementation. We show how to reduce the bit capacity of the multiplication needed in the interval division part and how to avoid a loop in the renormalization part of the ABRC.
2) The proposed ABRC uses a virtual sliding window (VSW) [14], [15] for probability estimation, which does not require look-up tables. To obtain higher compression performance we propose a new adaptive window size selection algorithm. In comparison with an ABRC with a single window, the proposed algorithm provides faster probability adaptation at the initial encoding/decoding stage and more precise probability estimation for very low entropy binary sources.
3) We introduce a VLSI architecture of the proposed ABRC and show that it provides throughput comparable to the state-of-the-art MQ-coder (from JPEG2000) and M-coder (from H.264/AVC and H.265/HEVC), and higher throughput than the ABRC of [12]. At the same time it does not use additional memory such as look-up tables and has lower power consumption than the other coders.
4) Experimental results obtained for a wavelet video codec with a JPEG2000-like bit-plane entropy coder [11] show that the proposed ABRC reduces the bit rate by 0.8–8% in comparison with the MQ-coder and by 1.0–24.2% in comparison with the M-coder, depending on the video sequence and compression ratio.

I. INTRODUCTION

Adaptive binary arithmetic coding (ABAC) plays an important role in various state-of-the-art image and video compression standards, such as JPEG [1], JPEG2000 [2], H.264/AVC [3] and H.265/HEVC [4]. Taking into account that its computational complexity is a key bottleneck in image and video compression systems, great importance has been devoted to the development of an algorithmic design that allows efficient hardware and software implementations. In earlier architectures the cost of a multiplication was much higher than that of a look-up table access. Therefore, the first fast ABAC implementation, called the Q-coder [5], and its followers (the QM-coder in JPEG, the MQ-coder in JPEG2000 and the M-coder in H.264/AVC and H.265/HEVC) use look-up tables for probability estimation and for approximating the multiplication in the interval division part, as well as bit renormalization, which makes this approximation possible. As an alternative to arithmetic coders, range coders use bytes as output bit stream elements and perform byte renormalization one byte at a time [6], [7], [8], [9]. Adaptive binary range coder

Evgeny Belyaev and Moncef Gabbouj are with the Department of Signal Processing, Tampere University of Technology, Finland, e-mail: [email protected]. Kai Liu and YunSong Li are with the School of Computer Science, Xidian University, Xi'an, China, e-mail: kailiu@mail.xidian.edu.cn.

In this paper we present a hardware-efficient adaptive binary range coder and compare it with the current state-of-the-art ABACs. The main contributions of this paper are the following:


The rest of the paper is organized as follows. Section II reviews the integer implementation of arithmetic coding, probability estimation based on a virtual sliding window, and byte renormalization in range coders. Sections III and IV introduce the proposed hardware-efficient ABRC and its VLSI architecture, respectively. Experimental results showing the advantages of the proposed ABRC are given in Section V. Conclusions are drawn in Section VI.

II. BASICS OF ADAPTIVE BINARY ARITHMETIC AND RANGE CODING

A. Integer implementation of binary arithmetic coding

Let us consider a stationary discrete memoryless binary source with p denoting the probability of ones. In binary arithmetic encoding, a codeword for a binary sequence x^N = {x_1, x_2, ..., x_N}, x_t ∈ {0, 1}, is represented as the first ⌈− log2 P(x^N) + 1⌉ bits of the number

Q(x^N) + P(x^N)/2,   (1)

where P(x^N) and Q(x^N) are the probability and the cumulative probability of the sequence x^N, respectively, which can be calculated by the recurrence relations: if x_t = 0, then

Q(x^t) ← Q(x^{t−1}),
P(x^t) ← P(x^{t−1})(1 − p),   (2)

if x_t = 1, then

Q(x^t) ← Q(x^{t−1}) + P(x^{t−1})(1 − p),
P(x^t) ← P(x^{t−1}) p.   (3)

In this paper we use "←" for the assignment operation, "≪" and "≫" for the left and right arithmetic shifts, "⊕" for the bitwise XOR operation, and "!" for the bitwise NOT operation. An integer implementation of an arithmetic encoder is based on two registers L and R of size b bits (see Algorithm 1). Register L corresponds to Q(x^N) and register R corresponds to P(x^N). The precision required to represent registers L and R grows with the increase of N. In order to decrease the coding latency and avoid register underflow, a renormalization procedure is used for each output symbol (see Algorithm 2).

Algorithm 1: Binary symbol x_t encoding procedure
1: T ← R × p
2: T ← max(1, T)
3: R ← R − T
4: if x_t = 1 then
5:   L ← L + R
6:   R ← T
7: end if
8: call Renormalization procedure

B. Probability estimation using Virtual Sliding Window

In real applications the probability of ones is unknown. In this case, for an input binary symbol x_t the probability estimate of ones p̂_t is calculated and used in line 1 of

Algorithm 2: Bit renormalization procedure
1: while R < 2^{b−2} do
2:   if L ≥ 2^{b−1} then
3:     bits_plus_follow(1)
4:     L ← L − 2^{b−1}
5:   else if L < 2^{b−2} then
6:     bits_plus_follow(0)
7:   else
8:     bits_to_follow ← bits_to_follow + 1
9:     L ← L − 2^{b−2}
10:  end if
11:  L ← L ≪ 1
12:  R ← R ≪ 1
13: end while
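For illustration, Algorithms 1 and 2 can be sketched together in Python. This is a minimal sketch: the register width b = 16, the list-based bit output and the mutable `pending` counter are our own illustrative choices, not fixed by the paper.

```python
B = 16                                  # register width b (assumed for illustration)
QTR, HALF = 1 << (B - 2), 1 << (B - 1)  # 2**(b-2) and 2**(b-1)

def bit_plus_follow(bit, out, pending):
    """Emit `bit` followed by the pending opposite bits (bits_to_follow)."""
    out.append(bit)
    out.extend([1 - bit] * pending[0])
    pending[0] = 0

def encode_symbol(x, p, L, R, out, pending):
    """One step of Algorithm 1 followed by the bit renormalization of Algorithm 2."""
    T = max(1, int(R * p))   # sub-range assigned to symbol 1
    R -= T                   # tentatively take the 0-branch
    if x == 1:
        L += R
        R = T
    while R < QTR:           # Algorithm 2
        if L >= HALF:
            bit_plus_follow(1, out, pending)
            L -= HALF
        elif L < QTR:
            bit_plus_follow(0, out, pending)
        else:
            pending[0] += 1
            L -= QTR
        L <<= 1
        R <<= 1
    return L, R
```

Starting from L = 0 and R = 2^b − 1, repeated calls keep R ≥ 2^{b−2}, which is exactly the invariant the renormalization loop maintains.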

Algorithm 1 instead of p. One of the well-known probability estimation algorithms is based on a sliding window concept. The probability of a source symbol is estimated by analyzing the content of a special buffer. This buffer keeps the W previously encoded symbols, where W is the length of the buffer. After the encoding of each symbol the buffer's content is shifted by one position, the new symbol is written to the freed cell and the earliest symbol in the buffer is erased. According to the Krichevsky-Trofimov formula, the probability p̂_{t+1} for the binary symbol with index t + 1 can be estimated as

p̂_{t+1} = (s_t + 0.5) / (W + 1),   (4)

where s_t is the number of ones in the sliding window before coding the symbol with index t + 1. The sliding window estimation can achieve a more accurate evaluation of the source statistics and a faster adaptation to them by increasing the window length W. However, the sliding window must be kept at both the encoder and the decoder side, which can cause memory problems as W grows. To avoid this memory usage, the following two-step approach for recalculating the number of ones after encoding each symbol x_t was proposed in [14].

Step 1. Delete the average number of ones from the window:

s_{t+1} ← s_t − s_t / W.   (5)

Step 2. Add the new symbol from the source:

s_{t+1} ← s_{t+1} + x_t.   (6)

By combining (5) and (6), the rule for recalculating the number of ones can be given as follows:

s_{t+1} = (1 − 1/W) · s_t + x_t.   (7)

Based on (7), probability estimation using a Virtual Sliding Window was proposed in [14]. Let us multiply both sides of (7) by W:

s'_{t+1} = (1 − 1/W) · s'_t + W · x_t,   (8)


where s'_t = W s_t. Let us define W = 2^w, where w is a positive integer. After integer rounding of equation (8), we obtain

s'_{t+1} = s'_t + ⌊(2^{2w} − s'_t + 2^{w−1}) / 2^w⌋, if x_t = 1,
s'_{t+1} = s'_t − ⌊(s'_t + 2^{w−1}) / 2^w⌋, if x_t = 0,   (9)

and the probability of ones is calculated as

p̂_t = s'_t / 2^{2w}.   (10)

Thus, the adaptive binary arithmetic coding algorithm based on the virtual sliding window is presented in Algorithm 3. Here, s is the state value of the virtual sliding window of length 2^w and MPS stands for the Most Probable Symbol.
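The integer update (9) and the estimate (10) are straightforward to transcribe. The following Python sketch mirrors them directly; the function names are ours.

```python
def vsw_update(s, x, w):
    """Integer virtual-sliding-window update, eq. (9).
    s is the scaled ones-counter s'_t for window W = 2**w,
    x is the new binary symbol."""
    if x == 1:
        s += (2**(2 * w) - s + 2**(w - 1)) >> w
    else:
        s -= (s + 2**(w - 1)) >> w
    return s

def p_hat(s, w):
    """Probability estimate of ones, eq. (10)."""
    return s / 2**(2 * w)
```

For example, with w = 5 the state s = 512 corresponds to p̂ = 0.5; one more MPS or LPS moves the state by 1/32 of the distance towards the respective end of the interval.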

Algorithm 3: Binary symbol x_t encoding procedure
1: T ← (R × s) ≫ (2w)
2: T ← max(1, T)
3: R ← R − T
4: if x_t = MPS then
5:   L ← L + R
6:   R ← T
7: end if
8: call Probability update procedure
9: call Renormalization procedure

Algorithm 4: Probability update procedure
1: if x_t = MPS then
2:   s ← s + ((2^{2w} − s + 2^{w−1}) ≫ w)
3:   if s > 2^{2w−1} then
4:     MPS ← 1 − MPS
5:     s ← 2^{2w} − s
6:   end if
7: else
8:   s ← s − ((s + 2^{w−1}) ≫ w)
9: end if

C. Byte renormalization

As an alternative to arithmetic coders, range coders use bytes as output bit stream elements and perform byte renormalization one byte at a time [6], [7], [8], [9]. The byte renormalization from [7] is presented in Algorithm 5. Here, L and R are 32-bit registers, TOP = 2^24 and BOTTOM = 2^16.

III. PROPOSED ABRC MODIFICATION

Algorithm 5: Byte renormalization
1: while (L ⊕ (L + R)) < TOP or R < BOTTOM do
2:   if R < BOTTOM and not ((L ⊕ (L + R)) < TOP) then
3:     R ← (!L + 1) ∧ (BOTTOM − 1)
4:   end if
5:   PUTBYTE(L ≫ 24)
6:   L ← L ≪ 8
7:   R ← R ≪ 8
8: end while

If w > 6, then T2 can be equal to zero. This can happen if the probability estimation p̂ = s / 2^{2w} < 2^{−8}. Therefore, for all states s which correspond to the probability interval [0, 2^{−8}), the multiplication result in line 18 of Algorithm 6 would be zero. To overcome this problem we propose to uniformly quantize the probability interval [0, 2^{−8}) into several subintervals and use a single probability estimation value for each subinterval, without using T2. From the compression performance point of view, we need to use as large a number of subintervals as possible, while from a hardware architecture point of view it is better to minimize this number. During our experiments we found that a good balance is achieved when this interval is quantized into eight subintervals (see Figure 1).

A. Reduction of the bit capacity of the multiplication

As mentioned in the Introduction, the main drawback of the range coder is the necessity to use a multiplication in the interval division part. For a more detailed illustration, Table I shows the bit capacity of the R × s multiplication in Algorithm 3 for different values of w. One can see that the bit capacity of the multiplication grows quickly with the window length. To reduce it, instead of T = (R × s) ≫ 2w we compute

T = T1 × T2, where T1 = R ≫ 8 and T2 = s ≫ (2w − 8),   (11)

which requires only a 24 × 8 multiplication.

Fig. 1: Probability estimation in interval [0, 2^{−8})


TABLE I: Parameters for the proposed reduction of the bit capacity of the multiplication

w                                                   | 3    | 4    | 5     | 6     | 7        | 8        | 9        | 10
Initial multiplication R × s bit capacity (Alg. 3)  | 32×6 | 32×8 | 32×10 | 32×12 | 32×14    | 32×16    | 32×18    | 32×20
Proposed R register shift, bits                     | 8    | 8    | 8     | 8     | 8        | 8        | 8        | 8
Proposed s register shift 2w − 8, bits              | −2   | 0    | 2     | 4     | 6        | 8        | 10       | 12
min{T2} = s_min ≫ (2w − 8)                          | 12   | 7    | 3     | 1     | 0        | 0        | 0        | 0
Probability subinterval index ∆(w, s)               | 8    | 8    | 8     | 8     | s · 2^{−3} | s · 2^{−5} | s · 2^{−7} | s · 2^{−9}
Resulting multiplication T1 × T2 bit cap. (Alg. 6)  | 24×8 | 24×8 | 24×8  | 24×8  | 24×8     | 24×8     | 24×8     | 24×8

The subinterval index ∆(w, s) calculation for a given window 2^w and state s is shown in Table I. Then we use the following probability estimation values depending on the subinterval index (see the black circles in Figure 1): for ∆(w, s) = 0 the probability estimation is set to p̂ = 2^{−13}, for ∆(w, s) = 1 to p̂ = 2^{−11}, for ∆(w, s) = 2 to p̂ = 2^{−10}, for ∆(w, s) = 3 to p̂ = 2^{−10} + 2^{−11}, and for ∆(w, s) = 4, ..., 7 to p̂ = 2^{−9} + 2^{−10}. In all of these cases the multiplication R × p̂ can be implemented by arithmetic shifts (see lines 2–13 of Algorithm 6). As a result, all probability estimation values in the interval [0, 2^{−8}) are approximated as described above, while the remaining values in the interval [2^{−8}, 0.5] are calculated using the multiplication (see lines 16–18 of Algorithm 6).
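The shift-only computation of T for the quantized subintervals (lines 2–14 of Algorithm 6) can be sketched as follows; the helper name and calling convention are illustrative, not the paper's.

```python
def t_low_prob(R, delta):
    """Shift-only T = R * p_hat for the eight quantized subintervals
    of [0, 2**-8), per the subinterval index delta = Delta(w, s)."""
    if delta == 0:
        T = R >> 13                    # p_hat = 2**-13
    elif delta == 1:
        T = R >> 11                    # p_hat = 2**-11
    elif delta == 2:
        T = R >> 10                    # p_hat = 2**-10
    elif delta == 3:
        T = (R >> 10) + (R >> 11)      # p_hat = 2**-10 + 2**-11
    else:                              # delta = 4, ..., 7
        T = (R >> 9) + (R >> 10)       # p_hat = 2**-9 + 2**-10
    return max(1, T)                   # line 14: keep T strictly positive
```

Note that sums of two shifted copies of R realize the two-term estimates 2^{−10} + 2^{−11} and 2^{−9} + 2^{−10} without any multiplier.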

B. Simplification of byte renormalization

Algorithm 6: Proposed binary symbol x_t encoding procedure
1: Calculate ∆(w, s) as in Table I
2: if ∆(w, s) < 8 then
3:   if ∆(w, s) = 0 then
4:     T ← R ≫ 13
5:   else if ∆(w, s) = 1 then
6:     T ← R ≫ 11
7:   else if ∆(w, s) = 2 then
8:     T ← R ≫ 10
9:   else if ∆(w, s) = 3 then
10:    T ← R ≫ 10 + R ≫ 11
11:  else
12:    T ← R ≫ 9 + R ≫ 10
13:  end if
14:  T ← max(1, T)
15: else
16:  T1 ← R ≫ 8
17:  T2 ← s ≫ (2w − 8)
18:  T ← T1 × T2
19: end if
20: if x_t = MPS then
21:  H ← L + R
22:  L ← H − T
23:  R ← T
24: else
25:  R ← R − T
26:  H ← L + R
27: end if
28: call Probability update procedure
29: call Renormalization procedure

As mentioned above, if after renormalization R ≥ 2^8, then the resulting value of T after the multiplication in line 18 of Algorithm 6 is always positive. Therefore, we can output no more than one byte and apply one arithmetic shift of R by 8 per input symbol, and keep R positive as well. On the other hand, the resulting value of T after lines 3–13 of Algorithm 6 can be equal to zero. To prevent this we set the minimum value T = 1 (see line 14 of Algorithm 6), which corresponds to the probability estimation p̂ = 1/R. As a result, the value of R is positive in both cases. Therefore, we can replace the while loop of the byte renormalization by a single if statement, which is enough to keep R positive during the encoding/decoding process. Taking this into account we propose the following simplification of the byte renormalization (see Algorithm 7):

Algorithm 7: Proposed byte renormalization
1: if (L ⊕ H) ≫ 24 = 0 then
2:   PUTBYTE(L ≫ 24)
3:   R ← R ≪ 8
4:   L ← L ≪ 8
5: else if R ≫ 16 = 0 then
6:   R ← (!L + 1) ∧ (2^16 − 1)
7:   PUTBYTE(L ≫ 24)
8:   R ← R ≪ 8
9:   L ← L ≪ 8
10: end if
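Algorithm 7 can be transcribed into Python as follows. This is a sketch: Python integers are unbounded, so the 32-bit register behaviour is emulated with explicit masks, and PUTBYTE is modelled by appending to a list.

```python
M32 = 0xFFFFFFFF   # 32-bit register mask

def byte_renorm(L, R, H, out):
    """Proposed byte renormalization (Algorithm 7): at most one output
    byte and one 8-bit shift per input symbol."""
    if (L ^ H) >> 24 == 0:          # highest bytes of L and H agree
        out.append((L >> 24) & 0xFF)
        L = (L << 8) & M32
        R = (R << 8) & M32
    elif R >> 16 == 0:              # range too small: clamp R below the carry boundary
        R = (~L + 1) & 0xFFFF       # (!L + 1) AND (2**16 - 1)
        out.append((L >> 24) & 0xFF)
        L = (L << 8) & M32
        R = (R << 8) & M32
    return L, R
```

The first branch fires when the top bytes of L and H = L + R already coincide, so that byte can never change again and is safe to emit; the second handles the rare case where R has shrunk below 2^16.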

Let us notice that in comparison with lines 3–7 of Algorithm 3, we use an additional register H in lines 20–27 of Algorithm 6. In this case the complexity of updating L and R does not change if they are computed in parallel, but it helps to simplify line 1 of Algorithm 7, where only the highest bytes of registers L and H are needed.

C. Adaptive window size selection

The virtual sliding window makes it easy to control the trade-off between the speed of probability adaptation and the precision of probability estimation: small window lengths provide fast adaptation to new source statistics but low estimation precision, while longer windows provide better estimation precision but slower adaptation. As a result, the best coding performance is achieved by a window which balances these two features. As an example, Table II shows the compression performance expressed in bits per pixel (bpp) for a video codec


based on the three-dimensional discrete wavelet transform (3-D DWT) [11] with a JPEG2000-like bit-plane entropy encoder, in the case when the same window length w is used for all binary contexts. One can see that the window w = 6 provides the best compression performance in comparison with the other windows.

TABLE II: Compression performance for "Ballroom", bpp

Y-PSNR, dB | w = 3  | w = 5  | w = 6  | w = 8  | w = 10
29.34      | 0.0638 | 0.0414 | 0.0385 | 0.0421 | 0.0520
35.18      | 0.2511 | 0.1615 | 0.1486 | 0.1579 | 0.1892
40.48      | 0.8205 | 0.6141 | 0.5847 | 0.6075 | 0.6980
45.42      | 2.5770 | 2.2608 | 2.2195 | 2.2618 | 2.4697
lossless   | 6.3023 | 5.8270 | 5.7661 | 5.8126 | 6.1198

Window W = 2^3 is used at the initial stage of coding. Taking into account that 16 LPS (Least Probable Symbol) symbols are enough for window W = 2^3 to switch from the probability estimation value p̂ = 0.5 to the minimum possible probability estimation value, we use window W = 2^3 for the first 16 symbols. Then the algorithm switches to window W = 2^5 (see lines 19–21 of Algorithm 8), which is used as the main window length. Windows W = 2^8 and W = 2^10 are used for probability estimation in the interval [0, 1/64). Lines 11–12 of Algorithm 8 perform the transition from window W = 2^5 to W = 2^8, while lines 13–14 perform the transition from window W = 2^8 to W = 2^10. Lines 6–7 of Algorithm 8 perform the switching back to window W = 2^5.

Algorithm 8: Proposed probability update procedure

To achieve even better compression performance we introduce the following adaptive window size selection algorithm. First, similarly to our previous work [15], we propose to use a high speed of probability adaptation at the initial stage of coding using small window lengths, and then to switch to a longer window length, which provides a lower adaptation speed but a better precision of the probability estimation. This approach is also adopted in the MQ-coder, utilizing three adaptation mechanisms which are embedded into one state machine implemented by look-up tables. Let us define 2^{w_1} as the initial window length and s_1 as the corresponding state value. We encode the first n_1 symbols with window 2^{w_1}, and starting from the symbol with index n_1 + 1 we use window 2^{w_2} > 2^{w_1} for the next n_2 symbols, and so on, until the main window length 2^{w_m} is reached. Let us notice that, following (10), each time after encoding n_j symbols we need to scale the state of window 2^{w_i} to the state of window 2^{w_j} as s_j = s_i · 2^{2(w_j − w_i)} to keep the same probability estimation value. The main window length 2^{w_m} is used for the vast majority of encoded symbols. From (10) and (12) it follows that using this window the probabilities in the interval [2^{w_m − 1}/2^{2 w_m}, 0.5] can be estimated. However, in some cases the statistics of the binary source are such that the required probability estimation value belongs to the interval [0, 2^{w_m − 1}/2^{2 w_m}), which cannot be reached by window 2^{w_m}. To overcome this problem we propose to use even longer window lengths for this interval of probabilities. If the current symbol is LPS and ⌊(s_m + 2^{w_m − 1})/2^{w_m}⌋ = 0 (the minimum possible probability estimation value for window 2^{w_m} is reached), we switch to window 2^{w_{m+1}} > 2^{w_m}. If the current symbol is LPS and ⌊(s_{m+1} + 2^{w_{m+1} − 1})/2^{w_{m+1}}⌋ = 0, we switch to window 2^{w_{m+2}} > 2^{w_{m+1}}, and so on, until the maximum window length is reached.
Finally, if the current probability estimation is more than 2^{w_m − 1}/2^{2 w_m}, then we switch back to the main window 2^{w_m}. On the one hand, the overall number of window lengths used in the algorithm should be selected to provide the best compression performance. On the other hand, this number is restricted by the hardware area and power consumption. During our experiments we found that a good balance between compression performance and area is achieved when four window lengths are used: W ∈ {2^3, 2^5, 2^8, 2^10} (see Algorithm 8).

1: if x_t = MPS then
2:   s ← s + ((2^{2w} − s + 2^{w−1}) ≫ w)
3:   if s > 2^{2w−1} then
4:     MPS ← 1 − MPS
5:     s ← 2^{2w} − s
6:   else if w > 5 and s > (15 ≪ 2(w − 5)) then
7:     s ← s ≫ 2(w − 5), w ← 5
8:   end if
9: else
10:  ∆s ← ((s + 2^{w−1}) ≫ w)
11:  if ∆s = 0 and w = 5 then
12:    w ← 8, s ← s ≪ 6
13:  else if ∆s = 0 and w = 8 then
14:    w ← 10, s ← s ≪ 4
15:  else
16:    s ← s − ∆s
17:  end if
18: end if
19: if n = 16 then
20:  w ← 5, s ← s ≪ 4
21: end if
22: n ← n + 1
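A direct Python transcription of Algorithm 8 can look as follows; the tuple-based interface is our own, and a hardware implementation would of course keep these quantities in registers.

```python
def prob_update(x, mps, s, w, n):
    """Adaptive-window probability update (Algorithm 8).
    s is kept scaled for the current window W = 2**w; n counts symbols."""
    if x == mps:
        s += (2**(2 * w) - s + 2**(w - 1)) >> w
        if s > 2**(2 * w - 1):                    # estimate crossed 0.5: swap MPS
            mps, s = 1 - mps, 2**(2 * w) - s
        elif w > 5 and s > (15 << (2 * (w - 5))):
            s >>= 2 * (w - 5)                     # probability recovered:
            w = 5                                 # switch back to the main window
    else:
        ds = (s + 2**(w - 1)) >> w
        if ds == 0 and w == 5:
            w, s = 8, s << 6                      # floor reached: widen 2**5 -> 2**8
        elif ds == 0 and w == 8:
            w, s = 10, s << 4                     # widen 2**8 -> 2**10
        else:
            s -= ds
    if n == 16:
        w, s = 5, s << 4                          # leave start-up window 2**3
    return mps, s, w, n + 1
```

Note how each window transition rescales s by 2^{2(w_j − w_i)} so the probability estimate (10) is preserved across the switch.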

To illustrate the advantages of the proposed adaptive window selection algorithm, let us assume that a stationary binary memoryless source with probability of ones p is to be compressed. The coding redundancy r(p) caused by the probability estimation precision can be calculated as [15]:

r(p) = lim_{N→∞} R/N − h(p),   (13)

where N is the number of input binary symbols, R is the size of the compressed bit stream and h(p) = −p · log2 p − (1 − p) · log2 (1 − p) is the entropy of the binary source. Table III shows the coding redundancy (13) for the proposed and state-of-the-art coders for N = 10^8. One can see that for high probability values (p ∈ [0.1, 0.5]) the M-coder has a much better performance than the MQ-coder, because the approximation of the multiplication in the interval division part of the M-coder is based on a quantization of the range interval into four subintervals, while the MQ-coder uses


only one. At the same time the proposed ABRC outperforms both the M-coder and the MQ-coder because it uses a full multiplication instead of an approximation. On the other hand, the MQ-coder and the proposed ABRC with adaptive window size selection provide a much better performance for low probability values p < 0.01, because their minimum possible probability estimation values are significantly smaller than in the M-coder and in the ABRC with a single window W = 2^5. As a result, the proposed adaptive window length selection algorithm provides good performance for high as well as for low probability values.

TABLE III: Coding redundancy r(p) comparison

p      | MQ-coder | M-coder | Proposed coder, w = 5 | Proposed coder, adaptive
0      | 0.000033 | 0.029   | 0.017                 | 0.000177
10^−5  | 0.000015 | 0.0287  | 0.0167                | 0.0000977
10^−4  | 0.000093 | 0.0281  | 0.0161                | 0.0000759
10^−3  | 0.000493 | 0.0239  | 0.0128                | 0.000712
10^−2  | 0.003314 | 0.0102  | 0.00511               | 0.00187
0.02   | 0.006146 | 0.008   | 0.00657               | 0.00663
0.03   | 0.008672 | 0.009   | 0.00929               | 0.0151
0.04   | 0.010904 | 0.012   | 0.0114                | 0.0233
0.08   | 0.019194 | 0.02    | 0.0133                | 0.0178
0.1    | 0.023482 | 0.021   | 0.0131                | 0.0142
0.3    | 0.030931 | 0.022   | 0.0124                | 0.0124
0.5    | 0.045450 | 0.02    | 0.0127                | 0.0127
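A simple way to see why the minimum representable estimate matters is the idealized mismatch redundancy: coding a Bernoulli(p) source with a fixed estimate q costs −p·log2 q − (1 − p)·log2(1 − q) − h(p) bits per symbol above the entropy. The sketch below computes this model quantity only; it is an idealization (no renormalization or rounding losses) and not the procedure used to produce Table III.

```python
from math import log2

def h(p):
    """Binary entropy, as used in (13)."""
    return -p * log2(p) - (1 - p) * log2(1 - p) if 0 < p < 1 else 0.0

def mismatch_redundancy(p, q):
    """Bits/symbol lost by coding a Bernoulli(p) source with a fixed
    mismatched estimate q (ideal-coder model)."""
    return -p * log2(q) - (1 - p) * log2(1 - q) - h(p)
```

For instance, a coder whose smallest representable probability estimate is far above the true p of a very low entropy source pays a mismatch penalty on every symbol, which is the effect the longer windows 2^8 and 2^10 are designed to avoid.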

IV. VLSI ARCHITECTURE FOR THE PROPOSED ABRC

A. General scheme of the ABRC architecture

Figure 2 illustrates the general scheme of the proposed ABRC hardware architecture, which consists of four processing stages. After the reset signal is stable, the ABRC is ready to receive an input binary symbol, which is driven by the clock signal. At the first stage, the ABRC calculates new probability estimation values using four probability updating cores operating in parallel for windows W = 2^3, 2^5, 2^8 and 2^10. Each core calculates the values of window size 2^w, state s, MPS, T2 and ∆(w, s) for the next binary symbol according to Algorithm 8. At the same time, the T calculation circuit generates six different T values according to lines 2–19 of Algorithm 6. At the second stage, one of the six values of T is selected depending on the ∆(w, s) value, and the new window size 2^w with the corresponding variables is selected for the next binary symbol. At the third stage, the L and R registers as well as the temporary register H are updated according to lines 20–27 of Algorithm 6. At the fourth stage, the renormalization part is triggered by the upper eight bits of the L and H vectors according to Algorithm 7. The pair (cv, cs) forms the output bit stream interface: cv is a bit signal indicating that the code is valid, while cs is the code stream with byte width. Finally, a control module is configured for the whole processing of the encoding. Let us describe each stage of the proposed architecture in more detail. Figure 3a shows the probability update circuit structure for window 2^5. Here, state_w5, MPS_w5 and w5 are the updated variables corresponding to the s, MPS and w variables in Algorithm 8 for the initial w = 5. Variables ∆_w5 and

T2_w5 correspond to the variables ∆(w, s) and T2 in lines 1 and 17 of Algorithm 6. The probability updating cores for windows 2^3, 2^8 and 2^10 are implemented in a similar way. Figure 3c illustrates the T calculation structure, which consists of six parallel branches for the candidate values T_w0, T_w1, T_w2, T_w3, T_w4 and T_w8. The calculation of T_w8 needs a 24 × 8 multiplier (line 18 of Algorithm 6), while the remaining values use five adders and multiplexers (lines 2–15 of Algorithm 6). According to the synthesis report, the multiplier lies on the critical path of the whole design. Figure 3b demonstrates the internal logic for the selection of ∆(w, s), w, s, T2 and MPS depending on the current w value. Five registers are used for the current ∆(w, s), s, T2 and MPS, respectively. Figure 3d illustrates the logic for the update of the R, L and H registers. The structure needs three adders, multiplexers and registers. The Q ports of the registers are used by the renormalization part to emit the output bit stream. For the renormalization, we use a brute-force approach and unroll the internal logic for the byte output. By providing a shifting network after each calculation of the internal variables used in the ABAC, the renormalization can be performed "on the fly". Figure 3e illustrates the internal structure used in the brute-force renormalization for the internal variables L and R. Figure 3f gives the logic for the output byte stream. After the internal calculation by combinational logic gates, cs and cv are registered in D-type registers. The actual byte stream is emitted through the cs port when the condition cv holds.

V. EXPERIMENTAL RESULTS

A. Coding performance

Practical results were obtained for the test video sequences [23], [24] "Akiyo" (352×288), "Mother Daughter" (352×288), "Vassar" (640×480), "Ballroom" (640×480), "Pedestrian Area" (1920×1080) and "Rush Hour" (1920×1080).
Bit rate values for the 3-D DWT codec [11] using the MQ-coder, the M-coder and the proposed coder for different Y-PSNR values are presented in Tables IV–IX. For a fair comparison, in all cases a fixed number of bit-planes was truncated; therefore, the reconstructed video is identical for all coders. One can see that the M-coder is worse than the MQ-coder at low bit rates and better at high bit rates. These results are consistent with the coding redundancies presented in Table III. The low performance of the M-coder can be explained by its high coding redundancy for low entropy binary sources, which are typical for wavelet-based image/video context modeling. For high entropy sources the M-coder has less redundancy than the MQ-coder; therefore, its performance is better for lossless video compression. At the same time, the proposed coder is better than both the MQ-coder and the M-coder, because it combines good performance for low as well as high entropy sources. The proposed ABRC outperforms the MQ-coder by 0.8–8% and the M-coder by 1.0–24.2% in coding efficiency, depending on the video sequence and the compression ratio.

B. Throughput and memory consumption

Table X shows the implementation results for the proposed ABRC design on the Xilinx Virtex platform, as reported



Fig. 2: General scheme of the proposed ABRC hardware architecture

TABLE X: Synthesis results for the proposed ABRC

Design                          | FPGA type | Slice Flip-Flops | Slice LUTs | Clock frequency, MHz | Critical path, ns | Throughput, MSymbols/sec
Adaptive window, w = 3, 5, 8, 10 | XC4VLX80  | 131              | 1314       | 105.92               | 9.441             | 105.92
Fixed window, w = 5              | XC4VLX80  | 104              | 470        | 109.69               | 9.117             | 109.69

TABLE XI: Performance comparison for different coders

Design           | AC type  | Technology             | Clock freq., MHz | Throughput, MSymbols/sec | Symbols per clock | Memory, bits | Slices/LEs | Power, mW/MHz
Dyer [20]        | MQ-coder | Altera Stratix         | 57.05  | 57.05  | 1.00    | 2675       | 761  | N/A
Dyer [20]        | MQ-coder | Altera Stratix         | 48.85  | 97.70  | 2.00    | 8192       | 1596 | N/A
Kai [21]         | MQ-coder | Xilinx XC4VLX80 FPGA   | 48.30  | 96.60  | 2.00    | 4269       | 6974 | 14.38
Osorio [22]      | M-coder  | Xilinx Virtex II 2000  | 91.74  | 183.48 | 1.9–2.3 | >5734      | 490  | N/A
Shcherbakov [12] | ABRC     | Xilinx ML507           | 100.34 | 51.60  | 0.51    | 15×36×1024 | 1544 | N/A
Proposed         | ABRC     | Xilinx XC4VLX80 FPGA   | 105.92 | 105.92 | 1.00    | 0          | 750  | 6.05
Proposed         | ABRC     | Altera Stratix         | 54.25  | 54.25  | 1.00    | 0          | 1296 | 3.46

TABLE IV: Coding performance for "Akiyo"

Y-PSNR, dB | Proposed, bpp | MQ-coder, bpp   | M-coder, bpp
32.38      | 0.0251        | 0.0273 (+8.0%)  | 0.0267 (+6.1%)
38.60      | 0.0863        | 0.0926 (+6.8%)  | 0.1002 (+13.9%)
44.27      | 0.2858        | 0.2996 (+4.6%)  | 0.3279 (+12.8%)
48.12      | 0.9500        | 0.9796 (+3.0%)  | 1.0179 (+6.7%)
lossless   | 2.7884        | 2.8600 (+2.5%)  | 2.8733 (+3.0%)

by the ISE 14.3 synthesis tool. The critical path of the design is the R register update, which includes the multiplier and adders. Therefore, due to the parallel implementation of the


TABLE V: Coding performance for "Mother Daughter"

Y-PSNR, dB | Proposed, bpp | MQ-coder, bpp   | M-coder, bpp
31.93      | 0.0227        | 0.0241 (+5.8%)  | 0.0232 (+2.1%)
37.15      | 0.1000        | 0.1048 (+4.5%)  | 0.1046 (+4.3%)
42.11      | 0.3918        | 0.4050 (+3.2%)  | 0.4089 (+4.2%)
45.90      | 1.5648        | 1.6057 (+2.5%)  | 1.5930 (+1.8%)
lossless   | 4.9324        | 5.0666 (+2.6%)  | 4.9831 (+1.0%)

probability updates (see Stage 1 in Figure 2), the number of windows used in the architecture does not significantly affect the throughput: the design with the four windows 2^3, 2^5, 2^8 and

