An ASIC Implementation of Adaptive Arithmetic Coding

Giuseppe Acunto, Miquel Sans, Andreas Burg, Wolfgang Fichtner
Integrated Systems Laboratory, ETH-Zurich
Abstract

In this work, we present an improved version of an ASIC implementation of the adaptive arithmetic coding algorithm which uses a two-level memory hierarchy. We propose algorithmic modifications and a special hardware structure to speed up the design without degrading the compression ratio obtained using this memory hierarchy. Moreover, several new features which increase the compression efficiency are introduced. Finally, a VLSI implementation based on the results of our work is presented.
1 Introduction

Arithmetic coding (AC) [2][3] is a source coding method which provides a major advantage: the source model is independent of the coding process. Hence, it allows the encoding of a symbol with a fractional number of bits, which is very useful for binary alphabets or for information sources that have a distribution with a sharp peak. Using an alphabet of $N$ symbols, AC assigns a unique codeword to a sequence of $L$ symbols, as if this sequence were a symbol of an alphabet with $N^L$ symbols. However, it does not need to allocate codewords for the other possible sequences, because the model is restricted to the original $N$ symbols. Moreover, AC can easily be made adaptive, since the model can be updated without modifying the coding process. In contrast, in Huffman coding (HC) [1] the source model is not independent of the coding process. HC cannot encode a symbol with a fractional number of bits. For example, using HC for coding a binary alphabet is inefficient [2][3]. To achieve an efficiency comparable to the one obtained with AC, i.e. to be able to encode a symbol with a fractional number of bits, HC must be used with an extended alphabet of size $N^L$ to model the source. This quickly becomes infeasible, because $L$ has to be large to increase the compression ratio [3]. The problem becomes worse for adaptive HC, because at each adaptation step the codewords must be recomputed.
The encoding procedure of AC can be expressed with the following recursive equations. For a more detailed analysis, refer to [3][4][5].
! " $#&%'"(
" ) ! " +* -,/.0 1
*2 ! "+#3% ' 1)4 ! " 7
(1) (2) (3)
5 !*6 "
(4)
is the coding interval length at cycle . is where the state of the codeword. is the rescaling factor. is the next unscaled coding interval length and the next unscaled state of the codeword. is the occurrence probability of the encoded symbol and is its cumulative probability, with
)
.8 9
. :9; >= < - B ? ?@0A
(5)
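As a reading aid, the following C fragment sketches one encoding iteration of Eq. 1-4 in floating point; the chip itself works in fixed point. The function name and the renormalization bound $A_{i+1} \in (0.5, 1]$ are our assumptions for illustration, not the paper's stated implementation.

    /* One AC encoding iteration per Eq. 1-4 (floating-point sketch; the
     * chip uses fixed-point arithmetic). A is the coding interval length,
     * C the codeword state; P and S are the occurrence and cumulative
     * probabilities of the encoded symbol (P assumed nonzero). Returns
     * the exponent k of the rescaling factor 2^k. */
    static int ac_encode_step(double *A, double *C, double P, double S)
    {
        double A_next = P;         /* Eq. 1: next unscaled interval    */
        double C_next = *C + S;    /* Eq. 3: next unscaled state       */
        int k = 0;
        while (A_next <= 0.5) {    /* renormalize A into (0.5, 1]      */
            A_next *= 2.0;         /* Eq. 2                            */
            C_next *= 2.0;         /* Eq. 4: in hardware, the bits     */
            k++;                   /* shifted out of C are emitted     */
        }
        *A = A_next;
        *C = C_next;
        return k;
    }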
The decoding process is almost the inverse of the encoding one. The equations are the same, except for Eqs. 3 and 4, which become
) ! " +* 0. 1
*C
"DE# %' 1)1
" "F 5HG JIK
(6) (7)
Equations 1, 2, 6, 7 are used to recursively update the codeword during decoding. Equation 8 is used to decode the codeword, i.e. to find the encoded symbol:
    S_i(m) \le C_i < S_i(m+1)                   (8)

where $m$ is the decoded symbol; $m$ belongs to the alphabet $\{k_0, k_1, \ldots, k_{N-1}\}$.
Storing and updating the model are the main difficulties for adaptive AC, particularly when the size of the alphabet is large. A new strategy for managing the model, based on a cache-RAM system and a virtual table, is proposed in [5]. The cache and the virtual table represent the model. The RAM stores each $P_i(s)$ that is not currently used. All the coding operations related to the model are restricted to the cache, eliminating the expensive task of managing the entire set of data. This architecture greatly improves timing and area, while maintaining a compression ratio that is close to the one obtained without cache.
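A minimal sketch of the symbol search of Eq. 8, assuming the cumulative probabilities are available in an array; the hardware performs all comparisons in parallel (COMPARE), whereas this illustrative function scans linearly.

    /* Symbol search of Eq. 8: find m with S[m] <= C < S[m+1], where
     * S[0..n] are the cumulative probabilities of Eq. 5 (S[n] is the
     * total mass). Shown as a linear scan; the decoder hardware performs
     * all the comparisons in parallel (COMPARE). */
    static int ac_decode_symbol(const double *S, int n, double C)
    {
        int m = 0;
        while (m + 1 < n && S[m + 1] <= C)   /* advance while C is above */
            m++;
        return m;
    }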
2 Architectural basis
The scheme proposed in [5] consists in storing each $P_i(s)$ in a RAM of $N$ lines and introducing a direct-mapped cache of $n$ lines and a virtual table of $N$ lines. All the coding and adaptation operations are restricted to the cache, which provides parallel accesses. When the symbol $k_l$ that must be coded is in the cache (hit), at line $m = l \bmod n$, the content of the cache provides the required $P_i(k_l)$ and is also used to compute $S_i(k_l)$, according to Eq. 9, 10 and Fig. 1:

    P_i(k_l) = P_{line}(m)  (hit at line m = l mod n);   P_i(k_l) = P_{miss}  (miss)               (9)

    S_i(k_l) = N \cdot P_{miss} + \sum_{j=0}^{m-1} P_{line}(j)  (hit);   S_i(k_l) = l \cdot P_{miss}  (miss)   (10)

When the symbol is not in the cache (miss), the symbol is not coded with its $P_i(k_l)$ in the RAM, but with the constant escape probability $P_{miss}$ of the virtual table, see Eq. 9 and Fig. 1. Consequently, the coding process must not wait for the output of the RAM, which is only available at the next cycle. The RAM access is only required for the replacement procedure (adaptation) and the coding is done in only one cycle. The virtual table does not need to be stored, since it contains $N$ times the constant $P_{miss}$.

Figure 1: Memory hierarchy with cache and virtual table (alphabet of N symbols, cache of n lines).

For every hit, the corresponding $P_i$ in the cache is incremented^2. Note that in the case of a hit, when the model saturates, i.e. when the coding interval $A_i$ (the cache content plus the virtual table, Eq. 11) is equal to 1, no incrementation is performed:

    A_i = N \cdot P_{miss} + \sum_{j=0}^{n-1} P_{line}(j)               (11)

Since the RAM is not bound to the model anymore, a cache/RAM exchange can cause $A_i$ to become larger than 1. As this is not allowed in AC, a rescale procedure is needed.
In the following, we propose some algorithmic and design modifications of the implementation proposed in [5], which can further improve both compression ratio and cycle time. Moreover, we construct a special addition tree, which we call the Maturin^1 tree, to speed up the computation of all cumulative probabilities needed by the decoding procedure (Eq. 8), without increasing the silicon area for this operation.

^1 In memory of our friend L.O. Maturin.
^2 An occurrence probability is in fact an occurrence counter.
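The following C sketch illustrates one possible reading of the lookup of Eq. 9-11; the structure layout, the alphabet size, and the value of $P_{miss}$ are assumptions for illustration only.

    #include <stdbool.h>

    #define ALPHABET_N  256            /* alphabet size N, assumption    */
    #define CACHE_LINES 16             /* cache size n                   */

    typedef struct {
        int    index;                  /* index l of the symbol held here */
        double P;                      /* its occurrence counter          */
    } cache_line;

    static cache_line   cache[CACHE_LINES];
    static const double P_miss = 1.0 / 4096.0;  /* escape prob., assumed  */

    /* Model lookup following Eq. 9-10: a hit is served from the cache,
     * with the whole virtual table as cumulative offset; a miss is coded
     * with the constant escape probability of the virtual table. */
    static bool model_lookup(int l, double *P, double *S)
    {
        int m = l % CACHE_LINES;                /* direct-mapped line m   */
        if (cache[m].index == l) {              /* hit at line m          */
            *P = cache[m].P;                    /* Eq. 9                  */
            *S = ALPHABET_N * P_miss;           /* Eq. 10, virtual table  */
            for (int j = 0; j < m; j++)
                *S += cache[j].P;               /* plus the lines below m */
            return true;
        }
        *P = P_miss;                            /* Eq. 9, miss            */
        *S = l * P_miss;                        /* Eq. 10, miss           */
        return false;
    }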
3 Optimization

3.1 Two-level memory hierarchy
The chain of operations for encoding is the following (Fig. 2a): the incoming symbol $s_i$ requires the reading of the source model (READ) to find the corresponding $P_i(s_i)$ and to compute $S_i(s_i)$. With these results, the encoding iteration (ENCODE) is performed according to Eq. 1, 2, 3, 4. In parallel, the adaptation of the model (UPDATE) can begin. Then, the updated model is written back (WRITE).
Figure 2: Theoretical chain of operations: a) encoding, b) decoding.
The chain of operations for decoding is more complex (Fig. 2b): the model is read to provide all $P_i(j)$ (READ). Using these results, all $S_i(j)$ are computed (COMPUTE) as in Eq. 5. In a second step, they are compared with the encoded bits (COMPARE), according to Eq. 8. The decoded symbol $m$, its $P_i(m)$ and its $S_i(m)$ are output. Then, the decoding iteration (DECODE) is performed according to Eq. 1, 2, 6, 7. In parallel, the model is adapted (UPDATE) and written back (WRITE).
Fig. 3 shows the chain of operations for the architecture proposed in [5]. The reading of the model is split into two parts: the first one is the RAM address generation and setup time (READ ACCESS), the second is the time to provide a valid RAM output (READ OUTPUT). With the use of the cache and the virtual table, the coding is completed in one cycle. Indeed, the probabilities in the cache are available in the current cycle. A probability that is not in the cache but only in the RAM is replaced by $P_{miss}$ for the coding, since the content of the RAM is not available in the current cycle.
Normally, the encoding iteration in the encoder (ENCODE) and the computation of all $S_i$ (COMPUTE) in the decoder must wait for the rescale process (UPDATE RESCALE), which is the first part of the adaptation of the model^3. [5] proposes the following solution: if at the current cycle $i$ the output of the RAM (READ OUTPUT) requires the cache content to be scaled, the incoming symbol at cycle $i+1$ is processed with $P_{miss}$ in all cases, even if it is a hit, and the cache content is scaled at that cycle (UPDATE RESCALE). Therefore, the computation of all $S_i$ (COMPUTE) at cycle $i$ depends on the rescale procedure (UPDATE RESCALE) of cycle $i$, which itself depends on the output of the RAM (READ OUTPUT) of cycle $i-1$.

^3 The second part is the incrementation (UPDATE INCREMENT).
Figure 3: Chain of operations for the two-level memory hierarchy: a) encoding, b) decoding.

With this implementation, the cycle time in the decoder is defined by the output of the RAM (READ OUTPUT), the computation of all $S_i$ (COMPUTE), the comparisons (COMPARE) and the decoding iteration (DECODE). These are all time-consuming operations. Moreover, the cycle time depends on the type of RAM used.
3.2 Algorithmic modifications

To further shorten the cycle time, we propose to change the chain of operations in the following way (Fig. 4): if there is a miss at cycle $i$, the $P_i(s)$ output by the RAM at cycle $i+1$ does not enter the cache immediately, but only at cycle $i+2$. Hence, the cache contains a line that is set to 0 during cycles $i+1$ and $i+2$, to avoid the creation of a dead subinterval in the coding interval. The rescale process (UPDATE RESCALE) is postponed to the beginning of the next cycle. Therefore, the rescale process at cycle $i$ does not depend on the output of the RAM (READ OUTPUT) at cycle $i$, but on the one of cycle $i-1$. In the same way, the computation of all $S_i$ (COMPUTE) does not wait for the result of the rescale procedure.
Figure 4: Chain of operations for decoding, after algorithmic modifications.
." 5 5,
The rescale process uses the output of the RAM with 1 cycle delay, while the computation of all $S_i$ uses it with 2 cycles delay. Consequently, at cycles $i+1$ and $i+2$, the encoding and decoding procedures must work with a cache content that is not completely up to date. This requires algorithmic modifications. During two cycles, one or two lines of the cache are set to 0 for the computation of all $S_i$, if one or two misses occurred at cycles $i-1$ or/and $i-2$. However, the symbol that has produced a miss at cycle $i-1$ or $i-2$ is supposed to be in the cache. If, at cycle $i$, the same symbol must be en/decoded again (hit), an escape probability must be used, according to Tab. 1. Where two possibilities are available, the choice is made according to Eq. 12:

    P = \min(P_i(x), P_{miss})                  (12)

Table 1: Selection of the adequate probability. The variable in brackets is the line-index where the hit/miss occurs.

    at cycle i-2   at cycle i-1   at cycle i   probability
    hit            hit            hit          from cache
    miss(x)        hit            hit(y)       from cache
    miss(x)        hit            hit(x)       from cache or P_miss (Eq. 12)
    hit            miss(x)        hit(y)       from cache
    hit            miss(x)        hit(x)       P_miss
    miss(x)        miss(y)        hit(z)       from cache
    miss(x)        miss(y)        hit(x)       from cache or P_miss (Eq. 12)
    miss(x)        miss(y)        hit(y)       P_miss
    hit/miss       hit/miss       miss         P_miss
The main advantage of this solution is that the parallelism is increased and thus the cycle time is improved. A further advantage is that when the cache content must be scaled, the incoming symbol is coded according to Tab. 1 and not always with $P_{miss}$, as in [5].
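A sketch of how the escape selection of Tab. 1 can be realized, assuming the line indices of the last two misses are tracked in a small register file; the Eq. 12 choice between the two candidates is omitted for brevity, and all names are illustrative.

    /* Escape selection per Tab. 1 (simplified): a renewed hit on a line
     * whose content is still zeroed (miss at cycle i-1 or i-2) must fall
     * back to an escape probability instead of the cache value. pend1 and
     * pend2 hold the line indices of the last two misses (-1: no miss). */
    static double select_probability(int hit_line, double line_P,
                                     int pend1, int pend2, double P_miss)
    {
        if (hit_line == pend1 || hit_line == pend2)
            return P_miss;          /* zeroed line: escape probability  */
        return line_P;              /* normal hit: value from the cache */
    }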
3.3 Maturin tree
In the decoder, the computation of all $S_i$ (COMPUTE) is a significant part of the longest path. For a first optimization, each $S_i(m)$ of the symbols in the first half of the cache ($0 \le m < n/2$) is computed with

    S_i(m) = \sum_{j=0}^{m-1} P_i(j)            (13)

and each $S_i(m)$ of the second half ($n/2 \le m < n$) with

    S_i(m) = A_i - \sum_{j=m}^{n-1} P_i(j)      (14)

since the coding interval $A_i$ is available in a register, because it is already used by the rescale procedure.
The second optimization focuses on the addition tree. A simple binary tree cannot be used, since all $S_i(m)$ are required as partial results of Eq. 10. The binary tree must therefore be complemented as in Fig. 5 to provide the partial results. However, as shown in Tab. 2, this solution is very area consuming when the number of inputs, i.e. the number of lines in the cache, increases. The proposed Maturin tree is a compromise between the linear adder and the complemented binary tree.
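The arithmetic of Eq. 13, 14 can be sketched as follows, assuming $A$ here denotes the part of the coding interval covered by the cache content (the constant virtual-table offset of Eq. 10 is omitted for clarity); the hardware Maturin tree evaluates both halves in parallel, whereas this sequential loop only mirrors the computed values.

    /* All cumulative probabilities S(0..n-1) per Eq. 13-14: the first
     * half as prefix sums (addition tree), the second half by subtracting
     * suffix sums from A, which is already held in a register by the
     * rescale procedure. The Maturin tree computes both halves in
     * parallel out of 4+1-input basis trees. */
    static void compute_all_S(const double *P, int n, double A, double *S)
    {
        S[0] = 0.0;
        for (int m = 1; m < n / 2; m++)        /* Eq. 13 */
            S[m] = S[m - 1] + P[m - 1];

        double suffix = 0.0;
        for (int m = n - 1; m >= n / 2; m--) { /* Eq. 14 */
            suffix += P[m];
            S[m] = A - suffix;
        }
    }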
Table 2: The complexity (number of adders used) and the timing (number of adders in the longest path (lp)) vary with the size of the cache ($n$ = 16, 32, 64). The basis trees of the Maturin have 4+1 inputs.

                    # add.          # add. in lp
    linear adder    15, 31, 63      15, 31, 63
    compl. binary   32, 80, 192     4, 5, 6
    Maturin         16, 32, 64      6, 12, 24
Figure 5: Linear adder (left); complemented binary tree (right).

Figure 6: Complemented basis binary trees with 4+1 inputs used to construct the Maturin tree.

As the computation of all $S_i(m)$ is performed according to Eq. 13, 14, the Maturin tree is split into two parts: an addition tree and a subtraction tree. Both trees are composed of several basis trees. These basis trees are complemented binary trees of adders/subtracters (Fig. 6) expanded with an additional adder/subtracter to allow their concatenation. Figure 7 shows an example.
Figure 7: Maturin tree for a cache of 16 lines with basis trees of 4+1 inputs.

Figure 8 shows the cycle time of the optimized decoder. With the Maturin tree, the timing is improved at a moderate cost in area (Tab. 2). In [6] we propose more configurations of the Maturin tree that can be useful for other applications.
Figure 8: Cycle time of the decoder using the Maturin tree. The length of the block of each procedure is proportional to the duration of the procedure for a typical VLSI implementation.
4 Performance improvements

4.1 Incrementation/decrementation

If a hit occurs at line $m$, the corresponding occurrence counter $P_i(m)$ in the cache is incremented. If the model is saturating, the occurrence counter of another line in the cache is decremented. As opposed to the implementation proposed in [5], this allows the adaptation of the model to continue even when the model saturates. If the model is not saturating, no decrementation is required to compensate for the incrementation. A system with two pointers is implemented to efficiently find a line in the cache that can be decremented if needed.
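A sketch of this saturation handling, assuming a single round-robin pointer instead of the two-pointer system of the actual design; field and constant names are illustrative.

    #define LINES 16                 /* cache size n, assumption */

    static int dec_ptr = 0;          /* round-robin decrement pointer */

    /* Counter update of Sec. 4.1: on a hit the counter of the line is
     * incremented; when the model saturates, another line is decremented
     * so that the total mass stays bounded and the model keeps adapting. */
    static void update_on_hit(double *count, int hit, int saturated)
    {
        if (!saturated) {
            count[hit] += 1.0;               /* plain incrementation */
            return;
        }
        for (int k = 0; k < LINES; k++) {    /* find a decrementable line */
            int c = (dec_ptr + k) % LINES;
            if (c != hit && count[c] > 1.0) {
                count[c]   -= 1.0;           /* compensate the increment  */
                count[hit] += 1.0;
                dec_ptr = (c + 1) % LINES;
                return;
            }
        }                                    /* nothing to decrement: the
                                                adaptation is skipped     */
    }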
4.2 Rescale procedure
The cache/RAM exchanges due to the misses can imply that the coding interval $A_i$ becomes larger than 1. In this case, the content of the cache must be scaled to assure that $A_i$ remains smaller than or equal to 1. The simplest and fastest rescale procedure consists of multiplying each $P_i(j)$ in the cache by a rescale factor 0.5. This solution is used in [5]. However, the systematic multiplication of $A_i$ by 0.5 when it becomes larger than 1 is not optimal for the compression, particularly when a cache/RAM exchange makes $A_i$ only slightly greater than 1. Indeed, almost half of the interval is then left unused by the next coding interval $A_{i+1}$. In the implementation proposed here, the rescale procedure is not on the longest path anymore, due to the algorithmic modifications presented in Section 3. Hence, a more complex and slower rescale procedure can be used without lengthening the cycle time. When the rescale factor has a resolution of 1 bit, the content of the cache can only be multiplied by 0.5, as in [5]. If it provides a resolution of 2 bits, there are two possibilities to rescale the cache: the content of the cache can be multiplied by 0.5 or 0.75. With a resolution of 3 bits, 0.5, 0.625, 0.75 or 0.875 can be used. A rescale factor with a high resolution allows the cache to be scaled in such a way that the coding interval, after the rescale procedure, is always as close to 1 as possible, which improves the compression. In our implementation, we propose the use of a rescale factor with a resolution of 3 bits.
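A sketch of the 3-bit rescale-factor selection: among the factors 0.5, 0.625, 0.75 and 0.875, the largest one that brings the interval back to at most 1 is chosen, so that the interval stays as close to 1 as possible. The selection rule is our assumption of the intended behavior.

    /* Rescale-factor selection of Sec. 4.2 with 3-bit resolution. */
    static double choose_rescale_factor(double A)
    {
        static const double f[4] = { 0.875, 0.75, 0.625, 0.5 };
        for (int k = 0; k < 4; k++)
            if (A * f[k] <= 1.0)     /* largest factor keeping A <= 1 */
                return f[k];
        return 0.5;                  /* fallback; the rescale can be
                                        repeated if A is still above 1 */
    }

With a 1-bit factor this degenerates to the fixed multiplication by 0.5 of [5].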
4.3 Hit counter

The replacement algorithm is implicitly defined by the use of a direct-mapped cache. This can be unsuitable for some types of information sources, e.g. sources that have a distribution with a sharp peak or that are known to have very slow context changes. Indeed, symbols with high hit frequencies, and thus with a high $P_i$, may leave the cache for a very short period: when a very infrequent symbol occurs and produces a miss, it replaces a frequent symbol in the cache. This frequent symbol is very likely to reenter the cache very soon. These undesirable cache/RAM exchanges degrade the compression ratio. To overcome this problem, a hit counter is added in each line of the cache. With each hit, the hit counter of the line is incremented as long as it is smaller than a defined maximal value. If a miss occurs, there are two possibilities: if the hit counter of the miss line is equal to zero, the cache/RAM exchange takes place; if the hit counter is larger than zero, it is decremented and no exchange takes place. Thus the hit counter increases the inertia of the model and its stability, at the cost of its ability to react to context changes.
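A sketch of the replacement decision with a hit counter, assuming a saturating counter per line; names and the maximum value are illustrative.

    #define HITC_MAX 3   /* configurable maximum, assumption */

    typedef struct { int index; int hitc; } hline_t;

    /* Miss handling of Sec. 4.3: the resident symbol is only evicted
     * once its hit counter has been counted down to zero, which gives
     * frequent symbols inertia against eviction by rare ones.
     * Returns 1 if the cache/RAM exchange takes place. */
    static int on_miss(hline_t *line, int new_index)
    {
        if (line->hitc > 0) {        /* keep the resident symbol */
            line->hitc--;
            return 0;
        }
        line->index = new_index;     /* exchange with the RAM */
        return 1;
    }

    static void on_hit(hline_t *line)
    {
        if (line->hitc < HITC_MAX)
            line->hitc++;            /* saturating increment per hit */
    }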
5 Results

Tab. 3 compares the entropy limit^4 H with the compression ratios obtained with the implementation proposed in [5] and with our implementation. The comparisons are made using standard test images. The first comparison is made with the original format (without preprocessing). The second comparison is made after applying a preprocessing step which encodes the difference between two subsequent symbols. Based on simulations done in [5][6], a cache of 16 lines is used for the original images and a cache of 32 lines for the preprocessed ones. Both use a rescale factor with a resolution of 3 bits. As expected, for the original images, the adaptive AC mostly outperforms the entropy limit. This is because the adaptation process allows the model to follow the context changes of the data stream. Obviously, for the preprocessed images, the compression ratio becomes better. However, adaptive AC does not outperform the entropy limit anymore, because the data streams became more stationary due to the preprocessing. Our results outperform those presented in [5], because of the use of the additional features.

^4 H is the entropy divided by the original number of bits per symbol. It is computed assuming a stationary process.
Table 3: Compression ratio given in %, without and with preprocessing. Less is better. H is the entropy limit. OI is our implementation.

                without preprocessing     with preprocessing
    images      H     [5]    OI           H     [5]    OI
    barbara     95    92     88           76    87     84
    boat        90    82     85           70    72     75
    lena        92    84     80           63    79     70
    airplane    84    77     78           58    67     79
    baboon      92    104    97           91    --     --
    pool        67    54     --           35    44     --
6 Conclusion

We have focused on the optimization of the slowest procedure of the implementation of [5], which introduced a new method to model the source that is well suited for an ASIC implementation. Efficient solutions have been proposed. Indeed, the algorithmic modifications clearly speed up the design without degrading the compression ratio. Moreover, we have proposed a new tree, the Maturin tree, to compute the cumulative probabilities. It is an optimized trade-off between speed and area. Finally, several new features allow the tuning of the design for different applications to improve the compression ratio. Tab. 4 summarizes the characteristics of our VLSI implementation applied in a particular application: a chip that performs adaptive AC encoding and decoding in one mode, or lossless video compression through adaptive AC in the other mode. Fig. 9 shows the layout of this chip.
Table 4: ASIC Data Overview

Functional data:
- Adaptive AC mode (2 encoders/decoders)
- Lossless video compression mode PAL/NTSC (diff. prediction, configurable quantization, zoom fct.)
- Configurable parameters: alphabet size $N$, $P_{miss}$, rescale factor, hit counter maximum
- Configurable initial cache-RAM content
- Scan mode (read terminal cache-RAM content)

Physical data (chip):
- Process: UMC, 5-metal CMOS technology
- Total area: 25 mm^2; core area: 16 mm^2
- Transistors: 1,200,000
- On-chip memory: 29 kbit

Physical data (per block):          Encoder    Decoder
    Maximal system clock [MHz]:     104        71
    Area [mm^2]:                    2.71       3.91
    Transistors:                    210,000    300,000
    Memory:                         3 kbit     3 kbit
Figure 9: Layout plot.
Acknowledgments

The authors would like to thank Frank Gurkaynak, Norbert Felber, and the Integrated Systems Laboratory of ETH Zurich for supporting this work.
References

[1] D. Huffman: "A method for the construction of minimum redundancy codes", in Proc. IRE, Sep. 1952, pp. 1098-1101.

[2] T.M. Cover, J.A. Thomas: "Elements of Information Theory", Wiley Series in Telecommunications, Wiley-Interscience, ISBN 0471062596, August 1991.

[3] Khalid Sayood: "Introduction to Data Compression, Second Edition", Morgan Kaufmann Publishers, ISBN 1558605584, 2nd edition, March 2000.

[4] Giuseppe Acunto, Miquel Sans: "Arithmetic Coding and its efficient Implementation", Semester Thesis, Summer Semester 2001, Integrated Systems Laboratory, ETH Zurich, Switzerland, 2001.

[5] Roberto R. Osorio: "Algoritmos y arquitecturas para la codificacion aritmetica: explotacion de la localidad utilizando memorias cache", PhD thesis, Dept. Electronica y Computacion, Universidad de Santiago de Compostela, October 1999.

[6] Giuseppe Acunto, Miquel Sans: "Adaptive Arithmetic Coding and its efficient Implementation", Diploma Thesis, Winter Semester 2001/02, Integrated Systems Laboratory, ETH Zurich, Switzerland, 2002.

[7] Roberto R. Osorio, Javier Bruguera: "Architectures for arithmetic coding in image compression", Proc. X European Signal Processing Conf. (EUSIPCO 2000), Tampere, Finland, 2000, pp. 2121-2123.

[8] Roberto R. Osorio, Javier D. Bruguera: "New Model for Arithmetic Coding/Decoding of Multilevel Images based on a Cache Memory", Proc. Int. Conf. on Electronics, Circuits and Systems (ICECS'99), Cyprus, 1999, pp. 697-700.

[9] Roberto R. Osorio, Montserrat Boo, Javier D. Bruguera: "Arithmetic image coding/decoding architecture based on a cache memory", Proc. Int. Conf. Euromicro'98, Vasteras, Sweden, 1998, pp. 139-146.

[10] M. Boo, J. D. Bruguera, T. Lang: "A VLSI Architecture for Arithmetic Coding of Multilevel Images", IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing, vol. 45, no. 1, January 1998.

[11] Mercedes Peon, Roberto R. Osorio, Javier D. Bruguera: "A VLSI Implementation of an Arithmetic Coder for Image Compression", in Euromicro, pp. 591-598, 1997.