Journal of VLSI Signal Processing 26, 369–381, 2000. © 2000 Kluwer Academic Publishers. Manufactured in The Netherlands.

A Hardware Architecture for the LZW Compression and Decompression Algorithms Based on Parallel Dictionaries*

MING-BO LIN
Department of Electronic Engineering, National Taiwan University of Science and Technology, 43, Keelung Road Section 4, Taipei, Taiwan

Received July 28, 1998; Revised October 22, 1999

Abstract. In this paper, a parallel-dictionary-based LZW algorithm, called the PDLZW algorithm, and a hardware architecture for its compression and decompression processors are proposed. In this architecture, instead of a unique fixed-word-width dictionary, a hierarchical variable-word-width dictionary set containing several dictionaries of small address space and increasing word widths is used for both the compression and decompression algorithms. The results show that the new architecture not only can be easily implemented in VLSI technology, because of its high regularity, but also has a faster compression and decompression rate, since it no longer needs to search the dictionary recursively as conventional implementations do.

Keywords: lossless data compression, lossless data decompression, lossy data compression, lossy data decompression, LZW algorithm, parallel dictionary, PDLZW algorithm

1. Introduction

Data compression is a method of encoding data that allows a substantial reduction in the total number of bits needed to store or transmit a file. Two basic classes of data compression are currently applied in different areas [2, 3]. One is lossy data compression, which is widely used to compress image data files for communication or archival purposes. The other is lossless data compression, which is commonly used to transmit or archive text or binary files that must keep their information intact. In this paper, we consider only the latter. To date, the most commonly used lossless data compression methods are tree-based codes and dictionary-based codes. The most famous representatives of the former are Huffman codes [4, 5], Shannon-Fano codes [6], the universal codes of Elias [7], and the Fibonacci codes [8, 9]. The latter mainly includes the LZ77 code [10], the LZ78 code [11], and the LZW code [12]. In this paper, we consider only the VLSI implementation of LZW codes.

* This manuscript is an extended version of a paper that appeared in [1].

The LZ77 algorithm is based on the concept of encoding a string of source symbols whose length is less than a prescribed integer F via a fixed-length codeword ⟨i, j, a⟩. The pointer i and the matching length j indicate the position and the length, respectively, of the longest matching substring. The last symbol a is the symbol that follows the longest matching substring. A recent study [13, 14] shows that this algorithm performs well for small volumes of arbitrary data. The difficulties of implementing the LZ77 algorithm in VLSI are discussed in [1, 14, 15]. Another LZ-based algorithm, the LZ78 algorithm [11], outperforms the LZ77 algorithm for large volumes of arbitrary data [13]. The LZ78 algorithm is organized around a translation table, called the string table and sometimes the dictionary, that maps input substrings into fixed-length codewords ⟨i, a⟩, where i is the index of the matched entry in the string table for the current input substring and a is the character immediately following string i. The essential property of LZ78 coding is that the string table has a prefix property: for every substring in the table, its prefix string is also in the table. The compression and decompression speed of the LZ78 algorithm depends upon finding the maximum matching substring in the dictionary, as well as updating and downdating the dictionary. For software implementation, several approaches, such as hash data structures [16], tries [15, 16, 17], and dynamic tables [12, 16], have been proposed for the dictionary structure. As for updating and downdating the dictionary, the following schemes have been proposed.

• Freeze policy [16]: No new incoming words are inserted into the dictionary once the dictionary is full, so that the hash function can be kept simple.
• Flush policy [12, 16]: Used by the UNIX utility compress in combination with a dynamic table structure. The table size is dynamically adjusted according to the compression ratio. The dictionary is flushed and returned to its original size when the compression ratio falls below a given threshold; hence the name flush policy.
• Swap method [18]: In this scheme, two dictionaries are maintained. One of them is used to look up the incoming characters and the other is used for insertions. Once the dictionary used for inserting new entries is full, the roles of the two dictionaries are exchanged.
• LRU-based scheme [16, 17]: An LRU implementation uses a doubly linked list of pointers to trie nodes to order nodes by recency of use. It requires an extra 4D log2 D bits of memory for these pointers. Another implementation uses the TAG scheme [16], in which a special trie satisfying a cyclic heap property is maintained during the course of compression and decompression. Although it reduces the hardware required, it costs too much time to maintain the cyclic heap property.

However, these dictionary structures and their maintenance are not trivial to implement in hardware. A current state-of-the-art compression/decompression program, PKZIP, achieves an average data rate of 1 Mbyte/s on a Pentium system [19]. However, its dictionary updating and downdating are too complex to realize in hardware.

To improve the performance and reduce the large memory required in hardware realizations of the LZ78 algorithm, a modified version called the LZW algorithm was proposed [12]. The major difference between it and the LZ78 algorithm is that it outputs only the pointer and eliminates the redundant character a. The reason is that character a is just the prefix character of the next substring, and if we concatenate the previous substring and the prefix character of the current output substring, the resulting substring is the new entry of the decompression table. In addition, the dictionary is initialized with the underlying alphabet set and is built up from top to bottom according to the lengths of matching substrings. The major shortcoming of the above LZW implementation is the overhead of adjusting the dictionary.

Alternative algorithms to LZW are DLZW (dynamic LZW) and WDLZW (word-based DLZW) [20]. Both improve the LZW algorithm in the following ways. First, they initialize the dictionary with different combinations of characters instead of only the single characters of the underlying character set. Second, they use a hierarchy of dictionaries with successively increasing word widths. Third, each entry in the dictionary is associated with a frequency counter in order to use an LRU policy. It was shown that both algorithms outperform the LZW algorithm [20]. However, they are suitable for software implementation but not for hardware realization, due to their inherent complexity.

In this paper, we propose a simplified DLZW architecture suited for VLSI realization, called the PDLZW (parallel dictionary LZW) algorithm, together with its corresponding hardware architecture. The PDLZW algorithm modifies the features of both the LZW and DLZW algorithms in the following ways. First, a virtual dictionary with the initial |Σ| address space is reserved. This dictionary only takes up address space but does not actually occupy any dictionary hardware. Second, a hierarchical parallel dictionary set with successively increasing word widths is used. Third, the simplest dictionary update policy, FIFO (first-in first-out), is used to simplify the hardware implementation. The resulting architecture outperforms the Huffman algorithm in all cases and is on average only about 5% below UNIX compress, which uses a 64K-entry dictionary [15], while in some cases it even outperforms the compress utility.
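To make the hierarchical dictionary-set idea concrete, the following Python sketch shows one straightforward way a (dictionary number, local address) pair can be mapped to a single global address when the dictionaries are laid out back to back. The sizes are those of the 5th partition of the 1K address space discussed later in the paper; the base-address packing itself is our illustrative assumption, not a circuit described by the paper.

```python
# Illustrative sketch (an assumption, not from the paper): if the
# dictionaries of a hierarchical dictionary set are laid out back to
# back in one address space, a (dictionary number, local address) pair
# maps to a single global address.  Dictionary 0 is the 256-entry
# virtual dictionary that holds the single-character alphabet.

sizes = [256, 256, 128, 128, 64, 64, 64, 64]         # 8 dictionaries, 1024 addresses
bases = [sum(sizes[:i]) for i in range(len(sizes))]  # [0, 256, 512, 640, ...]

def pack(dict_no, local_addr):
    """Combine dictionary number and local address into one 10-bit codeword."""
    assert 0 <= local_addr < sizes[dict_no]
    return bases[dict_no] + local_addr

def unpack(codeword):
    """Recover (dictionary number, local address) from a global address."""
    for no in reversed(range(len(sizes))):
        if codeword >= bases[no]:
            return no, codeword - bases[no]
```

This layout is consistent with the worked example in Section 2, where the virtual dictionary holds {a, b, c, d} at addresses 0–3 and address 4 denotes the first entry of the next dictionary.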
The rest of the paper is organized as follows. Section 2 describes the proposed PDLZW compression and decompression algorithms. The various design considerations of the hierarchical parallel dictionary set are discussed in Section 3. Section 4 presents the hardware architectures of both the PDLZW compression and decompression processors. A prototype VLSI implementation of the architectures for the PDLZW compression and decompression algorithms is discussed in Section 5. The paper is concluded in Section 6.

2. Compression and Decompression Algorithms

In this section, a modified LZW and DLZW algorithm, called PDLZW, is presented.

2.1. Compression Algorithm

The basic concept of the LZW algorithm is to find the longest sequence of characters not encountered previously. To implement this, a lookup table, or dictionary, with a fixed-size address space is usually used to keep the substring sequences already encountered. In general, the total number of entries of the dictionary is 4096 (i.e., 12-bit codes), since beyond this size the performance improvement is of little significance [1, 12]. Hence, in this paper, this size is considered the upper bound of the dictionary. However, in the next section, we will show that a 1K-entry dictionary is good enough for most applications if the PDLZW algorithm is used.

Two basic types of dictionary may be used to realize the LZW algorithm: fixed-word-width dictionaries and variable-word-width dictionaries. The former has the disadvantage that it takes a longer time to decompress the compressed codewords than the latter, since it must recursively decode the compressed codewords into their corresponding uncompressed substrings, although it has better dictionary utilization. One way to avoid the recursive operations required in the decoding process is to allocate a large fixed word width (for example, 20 bytes) for each entry of the dictionary. However, this implies that a large dictionary will be used. Consequently, in this paper, we adopt the variable-word-width dictionary to balance compression and decompression time.

The major problems of unique variable-word-width dictionary-based implementations of the LZW compression algorithm are as follows. First, not all entries of the dictionary have the same size. Second, for every character a search through the entire dictionary is required. To overcome both shortcomings, the dictionary is partitioned into several dictionaries with different address spaces and sizes, and the resulting LZW algorithm is called the PDLZW algorithm. The search time is reduced significantly since these dictionaries can operate independently and thus can carry out their search operations in parallel. In the next section, we will discuss how to partition the dictionary.

Since the inclusion of an explicit character in the output of LZ78 after each matched substring is often wasteful, PDLZW, like LZW, eliminates these characters altogether so that the output contains pointers only. To achieve this, a list of characters containing the input alphabet is initialized, which corresponds to the first 256 entries of the dictionary set. Having done this, the operation of the PDLZW compression algorithm can be described as follows. The input substrings are mapped into fixed-length codewords ⟨i⟩ via the dictionary set. The concatenation of the current substring and the next input character is inserted into the dictionary set as a new entry. This operation is usually called the "one character of look-ahead" rule.

The detailed operation of the proposed PDLZW compression algorithm is shown below. In the algorithm, two variables and one constant are used. The constant max_dict_no denotes the maximum number of dictionaries, excluding the first single-character dictionary (i.e., dictionary 0), in the dictionary set. The variable max_matched_dict_no is the largest dictionary number among all matched dictionaries, and the variable matched_addr registers the matched address within dictionary max_matched_dict_no. Each compressed codeword is a concatenation of max_matched_dict_no and matched_addr.

Algorithm: PDLZW Compression Algorithm
Input: The string to be compressed.
Output: The compressed codewords, each a log2(k)-bit codeword consisting of max_matched_dict_no and matched_addr, where k is the total number of entries of the dictionary set.
begin
1: Initialization.
  1.1: string-1 ← null.
2: while (the input buffer is not empty) do
  2.1: Prepare the next max_dict_no + 1 characters for searching.
       {max_matched_dict_no is reset to max_dict_no initially, and the dictionary numbers of the dictionary set count from 0 up to the constant max_dict_no.}
    2.1.1: string-2 ← read next (max_matched_dict_no + 1) characters from the input buffer.
    2.1.2: string ← string-1 || string-2. {|| is the concatenation operator.}
  2.2: Search string in all dictionaries in parallel and set max_matched_dict_no and matched_addr.
  2.3: Output the compressed codeword max_matched_dict_no || matched_addr.
  2.4: if (max_matched_dict_no < max_dict_no) then
         add the first max_matched_dict_no + 2 characters of string to dictionary(max_matched_dict_no + 1).
  2.5: if (the dictionary is full) then set the next address to be inserted to 0. {FIFO update rule.}
  2.6: string-1 ← shift the first (max_matched_dict_no + 1) bytes out of string.
end {End of PDLZW Compression Algorithm.}
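The steps above can be sketched behaviorally in Python (our paraphrase for illustration, not the hardware). Here dicts[i] models dictionary i + 1, which holds strings of i + 2 bytes; dictionary 0 is the virtual single-character dictionary, and the alphabet and dictionary sizes in the example are illustrative.

```python
# Behavioral sketch of PDLZW compression.  The hardware probes all
# dictionaries at once; the inner loop below models that parallel search.

def pdlzw_compress(data, alphabet, sizes):
    # sizes[i] is the capacity of dictionary i + 1.
    dicts = [dict() for _ in sizes]           # string -> local address
    keys = [[] for _ in sizes]                # insertion order, for FIFO overwrite
    ups = [0] * len(sizes)                    # per-dictionary update pointers
    char_addr = {c: i for i, c in enumerate(alphabet)}
    out, pos = [], 0
    while pos < len(data):
        # Find the longest previously seen prefix of the input window.
        match_no, match_addr = 0, char_addr[data[pos]]
        for i in range(len(sizes)):
            s = data[pos:pos + i + 2]
            if len(s) == i + 2 and s in dicts[i]:
                match_no, match_addr = i + 1, dicts[i][s]
        out.append((match_no, match_addr))
        # "One character of look-ahead": insert the matched string plus
        # the next character into the next-larger dictionary (FIFO wrap).
        if match_no < len(sizes):
            s = data[pos:pos + match_no + 2]
            if len(s) == match_no + 2 and s not in dicts[match_no]:
                d = match_no
                if len(keys[d]) == sizes[d]:       # full: overwrite oldest
                    del dicts[d][keys[d][ups[d]]]
                    keys[d][ups[d]] = s
                else:
                    keys[d].append(s)
                dicts[d][s] = ups[d]
                ups[d] = (ups[d] + 1) % sizes[d]
        pos += match_no + 1                        # consume the matched bytes
    return out
```

For instance, compressing "abababab" over the alphabet {a, b} with two 4-entry dictionaries yields the codewords [(0, 0), (0, 1), (1, 0), (2, 0), (0, 1)].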

An example illustrating the operation of the PDLZW compression algorithm is shown in Fig. 1. Assume that the alphabet set Σ is {a, b, c, d} and the input string is aaabbccbbccaaa. The dictionary address space is 16. The initial dictionary contains all single characters: a, b, c, and d. Figure 1(a) illustrates the operation of the PDLZW compression algorithm. The input string is grouped by characters, and these groups are denoted by a number with or without brackets "( )". A number without brackets denotes the order in which the group is searched in the dictionary set, and a number with brackets denotes the order in which the group is inserted into the dictionary set. After the algorithm exhausts the input string, the contents of the dictionary set and the compressed output codewords will be {a, b, c, d, aa, bb, bc, cc, cb, . . . , aab, bbc, cca, aaa} and {0, 4, 1, 1, 2, 2, 5, 7, 4, 0}, respectively.

Figure 1. An example illustrating the operation of the PDLZW compression algorithm.

2.2. Decompression Algorithm

Recovering the original string from the compressed one is done by performing the reverse operation of the PDLZW compression algorithm. This operation is called the PDLZW decompression algorithm and is described as follows. To decompress the original substrings from the input compressed codewords, each input compressed codeword is used to read out the original substring from the dictionary set. To do this without losing any information, it is necessary to ensure that the dictionary sets used in both algorithms have the same contents.

The update operation of the dictionary set is carried out by adding, as a new entry, the concatenation of the last output substring and the first character of the current output substring. However, because of the look-ahead update operation used in the PDLZW compression algorithm, a special case may arise and has to be resolved: it is possible to read an undefined entry of the dictionary set. To compensate for this, the concatenation of the last output substring and its first character is used as the current output substring and as the next entry of the dictionary set.

The detailed operation of the PDLZW decompression algorithm is described below. In the algorithm, three variables and one constant are used. The constant max_dict_no denotes the maximum number of dictionaries in the dictionary set; it is the same as that of the PDLZW compression algorithm. The variable last_dict_no registers the dictionary address part of the previous codeword. The variable last_output keeps the decompressed substring of the previous codeword, while the variable new_output records the current decompressed substring. The output substring is always taken from last_output, which is updated by new_output in turn.

Algorithm: PDLZW Decompression Algorithm
Input: The compressed codewords, each containing log2(k) bits, where k is the total number of entries of the dictionary set.
Output: The original string.
begin
1: Initialization.
  1.1: if (the input buffer is not empty) then
         new_output ← empty; last_output ← empty;
         addr ← read next log2(k)-bit codeword from the input buffer.
         {codeword = dict_no || dict_addr, where || is the concatenation operator.}
  1.2: if (dictionary(addr) is defined) then
         new_output ← dictionary(addr); last_output ← new_output;
         output ← last_output; last_dict_no ← dict_no.
2: while (the input buffer is not empty) do
  2.1: addr ← read next log2(k)-bit codeword from the input buffer.
  2.2: if (dictionary(addr) is undefined) then {special case}
    2.2.1: if (last_dict_no < max_dict_no) then
             add (last_output || the first character of new_output) to dictionary(last_dict_no + 1).
    2.2.2: if (the dictionary is full) then set the next address to be inserted to 0.
    2.2.3: new_output ← dictionary(addr); last_output ← new_output;
           last_dict_no ← dict_no; output ← last_output.
  2.3: else {output and insert}
    2.3.1: new_output ← dictionary(addr); last_dict_no ← dict_no.
    2.3.2: if (last_dict_no < max_dict_no) then
             add (last_output || the first character of new_output) to dictionary(last_dict_no + 1).
    2.3.3: if (the dictionary is full) then set the next address to be inserted to 0.
    2.3.4: last_output ← new_output; output ← last_output; last_dict_no ← dict_no.
end {End of PDLZW Decompression Algorithm.}
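The decompression steps above can likewise be sketched in Python (our paraphrase, not the hardware). It mirrors the compressor's dictionary layout: dicts[i] models dictionary i + 1, holding strings of i + 2 bytes, with dictionary 0 the virtual single-character one.

```python
# Behavioral sketch of PDLZW decompression, including the special case
# in which a codeword references an entry not yet written.

def pdlzw_decompress(codewords, alphabet, sizes):
    dicts = [dict() for _ in sizes]      # local address -> string
    ups = [0] * len(sizes)               # per-dictionary update pointers

    def lookup(no, addr):
        return alphabet[addr] if no == 0 else dicts[no - 1].get(addr)

    def insert(s):
        d = len(s) - 2                   # dictionary holding len(s)-byte words
        if 0 <= d < len(sizes) and s not in dicts[d].values():
            dicts[d][ups[d]] = s         # FIFO update, mirroring the compressor
            ups[d] = (ups[d] + 1) % sizes[d]

    out, last = [], None
    for no, addr in codewords:
        cur = lookup(no, addr)
        if cur is None:                  # special case: undefined entry, so it
            cur = last + last[0]         # must be last_output || its first char
        if last is not None:
            insert(last + cur[0])        # last_output || first char of current
        out.append(cur)
        last = cur
    return "".join(out)
```

Decompressing the codewords [(0, 0), (0, 1), (1, 0), (2, 0), (0, 1)] over the alphabet {a, b} with two 4-entry dictionaries returns "abababab"; the codeword (2, 0) exercises the special case, since it references an entry inserted by the compressor only one step earlier.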


Table 1. Some possible ways of partitioning a 1K-address space dictionary.

Index  Total size (bytes)  Partitions of dictionary
  1        2304            256* 256 256 256
  2        2432            256* 256 256 128 128
  3        2816            256* 256 128 128 128 128
  4        2880            256* 256 128 128 128 64 64
  5        3072            256* 256 128 128 64 64 64 64
  6        3712            256* 128 128 128 128 64 64 64 64
  7        4032            256* 128 128 128 64 64 64 64 64 64
  8        4480            256* 128 128 64 64 64 64 64 64 64 64
  9        5056            256* 128 64 64 64 64 64 64 64 64 64 64
 10        5760            256* 64 64 64 64 64 64 64 64 64 64 64 64

*: Virtual dictionary
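The Total size column of Table 1 can be reproduced from the partition entries, since the first 256-entry dictionary (dictionary 0) is virtual and occupies no storage, while dictionary i stores words of i + 1 bytes. A quick check in Python:

```python
# Reproduce the "Total size" column of Table 1: skip the virtual
# dictionary (partition[0]); dictionary i holds words of i + 1 bytes.

def dict_set_bytes(partition):
    return sum(n * (i + 1) for i, n in enumerate(partition) if i > 0)

# index 1: dict_set_bytes([256, 256, 256, 256]) == 2304
# index 5: dict_set_bytes([256, 256, 128, 128, 64, 64, 64, 64]) == 3072
```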

The operation of the PDLZW decompression algorithm can be illustrated by the following example. Assume that the alphabet set Σ is {a, b, c, d} and the input compressed codewords are {0, 4, 1, 1, 2, 2, 5, 7, 4, 0}. Initially, dictionaries 1 and 2, as shown in Fig. 1(b), are empty. The input compressed codeword 0 reads out "a" from dictionary 0, but the next compressed codeword 4 cannot read out a valid substring from the dictionary set, since at this time that entry is undefined. As described above, the output substring is then the concatenation of the last output substring "a" and its first character "a", that is, "a"||"a". This substring also updates entry 4 of the dictionary set. Applying the entire sequence of input compressed codewords to the algorithm generates the same contents as shown in Fig. 1(b) and outputs the decompressed substrings {a, aa, b, b, c, c, bb, cc, aa, a}.

3. Dictionary Design Considerations

The variable-word-width dictionary used in the PDLZW compression and decompression algorithms is partitioned into m smaller variable-word-width dictionaries, numbered from 0 to m−1, each of which increases its word width by one byte. That is, dictionary 0 has a one-byte word width, dictionary 1 two bytes, and so on. These dictionaries constitute a dictionary set. In general, different address space distributions of the dictionary set yield significantly different performance of the PDLZW compression algorithm. However, the optimal distribution depends strongly on the actual input data files; different data profiles have their own optimal address space distributions. Therefore, in order to find a more general distribution, several different kinds of data samples are run with various partitions of a given address space. Each partition corresponds to a dictionary set. For instance, the 1K address space is partitioned into ten different combinations and hence ten dictionary sets, as shown in Table 1. The 2K address space is partitioned into fifteen different dictionary sets, as shown in Table 2, and the 4K address space is partitioned into twenty different dictionary sets, as shown in Table 3. Ten data files with different attributes and sizes are then used as profiles. The resulting average compression ratios for the 1K, 2K, and 4K address spaces are depicted in Figs. 2–4, respectively. Note that every partition has an address space of 2^ki words, where ki is an integer, and the sum of the 2^ki is equal to 1K, 2K, or 4K, depending on which address space is used. This partition rule avoids the hardware overhead of the address decoders needed in multiple CAM or RAM configurations and hence simplifies the hardware design.

As seen from the figures, the compression ratio generally improves as the address space of the dictionary increases. Thus, the algorithm with the 4-K address space has the best average compression ratio in all cases that we have simulated. However, as we examine the partitions in the different address spaces, each exhibits some optimal partitions. For instance, in the case of the 1-K address space, the partitions with indices 4 and 5 are optimal and have a compression ratio of 54%. Comparing the three different address spaces, the best compression ratio is 54% and appears in both the 1-K and

Table 2. Some possible ways of partitioning a 2K-address space dictionary.

Index  Total size (bytes)  Partitions of dictionary
  1        4608            256* 1024 512 256
  2        5888            256* 512 512 512 256
  3        6400            256* 512 512 256 256 256
  4        6528            256* 512 512 256 256 128 128
  5        7424            256* 512 256 256 256 256 256
  6        6592            256* 512 512 256 256 128 64 64
  7        7552            256* 512 256 256 256 256 128 128
  8        8960            256* 256 256 256 256 256 256 256
  9        7936            256* 512 256 256 256 128 128 128 128
 10       11776            256* 256 256 256 256 256 256 128 128 128 128
 11       12416            256* 256 256 256 256 256 128 128 128 128 128 128
 12       13312            256* 256 256 256 256 128 128 128 128 128 128 128 128
 13       14464            256* 256 256 256 128 128 128 128 128 128 128 128 128 128
 14       15872            256* 256 256 128 128 128 128 128 128 128 128 128 128 128 128
 15       19456            256* 128 128 128 128 128 128 128 128 128 128 128 128 128 128 128 128

*: Virtual dictionary

Table 3. Some possible ways of partitioning a 4K-address space dictionary.

Index  Capacity (bytes)  Partitions of dictionary
  1       13312           256* 1024 1024 1024 512 256
  2       14592           256* 1024 1024 512 512 512 256
  3       16896           256* 1024 512 512 512 512 512 256
  4       20224           256* 512 512 512 512 512 512 512
  5       15104           256* 1024 1024 512 512 256 256 256
  6       17408           256* 1024 512 512 512 512 256 256 256
  7       20736           256* 512 512 512 512 512 512 256 256 256
  8       15296           256* 1024 1024 512 512 256 256 128 64 64
  9       17536           256* 1024 512 512 512 512 256 256 128 128
 10       20864           256* 512 512 512 512 512 512 256 256 128 128
 11       15232           256* 1024 1024 512 512 256 256 128 128
 12       18560           256* 1024 512 512 512 256 256 256 256 128 128
 13       21888           256* 512 512 512 512 512 256 256 256 256 128 128
 14       20096           256* 1024 512 512 256 256 256 256 256 256 128 128
 15       20480           256* 1024 512 512 256 256 256 256 256 128 128 128 128
 16       23424           256* 512 512 512 512 256 256 256 256 256 256 128 128
 17       24192           256* 512 512 512 256 256 256 256 256 256 256 256 128 128
 18       28032           256* 512 512 256 256 256 256 256 256 256 256 256 256 128 128
 19       31104           256* 512 256 256 256 256 256 256 256 256 256 256 256 256 128 128
 20       34688           256* 256 256 256 256 256 256 256 256 256 256 256 256 256 256 128 128

*: Virtual dictionary


Figure 2. Compression ratios of ten test data profiles on the 1-K address space dictionary sets.

Figure 3. Compression ratios of ten test data profiles on the 2-K address space dictionary sets.

Figure 4. Compression ratios of ten test data profiles on the 4-K address space dictionary sets.

2-K address spaces with partition indices 4 and 5, and 4, 6, and 9, respectively. As a consequence, the compression ratio is not only determined by the correlation property of underlying data files to be compressed but also depends on an appropriate partition. An important consideration for hardware implementation is the required dictionary address space that

dominates the chip cost for achieving an acceptable compression ratio. From this point of view, the optimal choice of dictionary address space is the 4th or 5th partition of the 1-K address space, since it has not only the best compression ratio but also the smallest dictionary address space. Therefore, in the following we will use the 5th partition of the 1-K address space as the foundation for constructing our hardware architectures for both the PDLZW compression and decompression processors.

Table 4. Performance comparison of compress, compact, and PDLZW (with 1-K address space and the index of partition = 5) algorithms.

                           Compression ratio
File           Size     compress  compact  PDLZW   Comment
comp          65536        54       68      49     PDLZW compression (binary)
decom         61440        58       71      51     PDLZW decompression (binary)
06model       62657        34       56      42     Cadence model file
chap3.doc     23397        55       74      54     Chinese document
dftt-t8.fot    1316        26       31      36     Chinese font of table
dftt-t8.ttf 1430501        69       79      78     Chinese fonts
mbox          14877        53       71      62     Unix mail box
tc2ps       1176380        55       59      57     PostScript file
vi           245760        70       83      75     Unix vi command
whoami        12288        33       44      38     Unix whoami command
Average                   48.7     63.6    54.2

To demonstrate the performance of our design, a comparison with the UNIX standard commands compress and compact on the test files is listed in Table 4. On average, our design performs between the UNIX commands compress and compact. However, in some cases the design even achieves a better compression ratio than compress, although the dictionary set has only 3072 bytes instead of the 64K words used by compress. This reveals the attractive feature of our design: it is simple but efficient, which makes the architecture very suitable for VLSI realization.

4. Hardware Architecture

In this section, architectures for the PDLZW compression and decompression processors are presented. These designs can also be applied to other address spaces and partitions, although they are designed on the basis of the 5th partition of the 1-K address space shown in Table 1.

4.1. Compression Processor Architecture

In the conventional dictionary implementation of the LZW algorithm, a unique dictionary with a large address space (say, 4096 entries) is used, so that the search time of the dictionary is quite long even with a CAM (content addressable memory). In our design, the unique dictionary is replaced with a dictionary set composed of several smaller dictionaries of different address spaces and word widths. Each dictionary can be a RAM or a CAM and can operate independently of the others. In general, a CAM is used to gain search speed at the cost of hardware; a RAM is used to save hardware at the expense of search speed. Since the dictionary set in the compression processor contains only 3072 bytes, we use CAMs to construct it. As a consequence, the maximal search time for the dictionary set is the time to search the largest CAM, since all dictionaries in the set operate in parallel. Compared with conventional implementations of LZW, the search time required in our architecture is only that of searching a 256-word CAM rather than a 4096-word one.

The architecture of the PDLZW compression processor is depicted in Fig. 5. It consists of a CAM dictionary set, an 8-byte shift register, a shift and update control, and a codeword output circuit. The dictionary set consists of 7 CAMs. The largest one contains only 256 2-byte words and the smallest one has 64 8-byte words. The word widths of the CAMs increase gradually from 2 bytes up to 8 bytes, with three different address spaces, as specified in the 5th row of Table 1. The input string is shifted into the 8-byte shift register, so 8 bytes can be searched in all CAMs simultaneously. In general, several dictionaries in the dictionary set may match the incoming substring with different substring lengths at the same time. The matched address within a dictionary, together with the number of the dictionary with the largest number of matched bytes, is output as the output codeword; these are detected and combined by the priority encoder and the 8-to-1 multiplexor, respectively, as shown in the figure.

Figure 5. The architecture of the proposed PDLZW compression processor.

The maximum-length matched substring concatenated with the next input character comprises the new entry of the dictionary set and is written into the next entry pointed to by the update pointer (UP) of the next dictionary (CAM), enabled by the shift and dictionary update control circuit. Each dictionary has its own UP, which always points to the word to be inserted next. Each update pointer counts from 0 up to its maximum value and then back to 0; hence, the FIFO update policy is realized. The update operation is inhibited if the next dictionary number is greater than or equal to the maximum dictionary number set by max_dict_no. As described above, the compression rate is at least one byte per memory cycle. In general, it is much higher than this, depending on how many bytes of the input substring are matched within the dictionary set.

4.2. Decompression Processor Architecture

The PDLZW decompression algorithm performs the reverse operations of PDLZW compression algorithm. Thus, it requires the same lookup table, i.e., the same

content and structure of the dictionary set used in the compression processor. However, the operations related to the dictionary set of PDLZW decompression algorithm are only readout and update but no interrogation. As a consequence, the lookup table may be constructed with the most common RAM instead of CAM used in the PDLZW compression processor to achieve faster readout and update speed. To facilitate the required operations of PDLZW decompression algorithm, the processor is composed of an input buffer, a RAM-based dictionary set, two registers: new out put and last out put, eight 2-to-1 multiplexors, and a shift and dictionary update control unit. The last out put register also requires the shift operation to output the decompressed substring to the outside world. The input buffer actually consists of two 10-bit registers to speed up the decompression rate. One of them is used for inputting serial data and the other for using internally to latch the input data during the entire decompressing process. The input codeword has two parts: dictionary number and dictionary address. The former is routed into a decoder for determining which dictionary in the dictionary set will be searched while the latter is used to read data from the selected dictionary. Once the specific dictionary is determined, two

A Hardware Architecture

Figure 6.

379

The architecture of proposed PDLZW decompression processor.

possibilities arise: either the required entry is already in the dictionary (i.e., it is defined) or it is not (i.e., it is undefined). The two cases can be discriminated by comparing the dictionary address part of the input codeword with the update pointer (UP) of the selected dictionary: the entry is undefined if they are equal and defined otherwise. To handle a single-character match, i.e., when the dictionary number of the incoming codeword is 0, the dictionary address of the incoming codeword is placed directly into new output, then into last output, and finally output. This corresponds to the case in which dictionary 0 is used. The other cases operate according to step 2 of the PDLZW decompression algorithm. Here, the variables last dict no and max dict no are kept in the shift and dictionary update control circuit.

Like the PDLZW compression processor, the PDLZW decompression processor must update its dictionary. As mentioned earlier, this is done by writing the last matched substring concatenated with the first character of the current matched substring. To facilitate this operation, eight 2-to-1 multiplexors are used to route the first character in the new output register into its proper position, and all outputs of the last output register are routed into the RAMs. By properly controlling the write enable signal of the specific dictionary and enabling the appropriate 2-to-1 multiplexor, the update operation can be carried out as expected. The output can be taken from the last output register in a bit-by-bit or byte-by-byte manner, depending on the actual requirement. As with the PDLZW compression processor, the decompression rate of this processor is at least one byte per memory cycle, and in general it is much higher, depending on how many bytes are decompressed from the given input codeword. Please note that in an actual implementation both the compression and decompression processors can be combined into one codec chip, since both share the same dictionary set structure, which takes up the major part of the chip area and can be implemented using CAMs or RAMs.
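The decompression flow described above, including the defined/undefined discrimination and the dictionary update, can be sketched as a small software model. This is only an illustrative sketch, not the hardware itself: the class and method names are invented here, a four-dictionary set is assumed, and dictionary k is assumed to hold strings of length k + 1 (dictionary 0 being preloaded with the 256 single characters).

```python
class PDLZWDecompressor:
    """Illustrative software model of one PDLZW decompression step."""

    def __init__(self, num_dicts=4, dict_size=256):
        # Dictionary 0 holds the 256 single characters; the other
        # dictionaries start empty and fill up as decompression proceeds.
        self.dicts = [[bytes([i]) for i in range(256)]]
        self.dicts += [[] for _ in range(num_dicts - 1)]
        self.dict_size = dict_size
        self.last_output = b""  # models the last output register

    def step(self, dict_no, addr):
        """Decode one (dictionary number, dictionary address) codeword."""
        table = self.dicts[dict_no]
        if addr < len(table):
            # Defined entry: the address is below the update pointer,
            # so the substring is simply read out of the dictionary.
            current = table[addr]
        else:
            # Undefined entry: the address equals the update pointer,
            # so the substring is the last output extended by its own
            # first character (the classic LZW corner case).
            current = self.last_output + self.last_output[:1]
        # Dictionary update: last matched substring concatenated with
        # the first character of the current matched substring goes into
        # the dictionary whose word width is one byte larger.
        if self.last_output:
            target = len(self.last_output)
            if target < len(self.dicts) and len(self.dicts[target]) < self.dict_size:
                self.dicts[target].append(self.last_output + current[:1])
        self.last_output = current
        return current
```

For example, feeding the codewords (0, 65), (0, 66), (1, 0), (2, 0) into this model reconstructs the string "ABABABA", with the last codeword exercising the undefined-entry path.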

Figure 7. The chip layout of the PDLZW processor.

5. Prototype VLSI Implementation

To verify the proposed architecture for the PDLZW compression and decompression algorithms, a prototype VLSI chip was designed in a 0.6 µm CMOS SPDM technology from TSMC. Seven CAMs of different sizes are used in the chip. Since they are not standard RAMs, they were designed and implemented with full-custom techniques. In contrast, the control circuit of the chip was synthesized from HDL for simplicity. Based on the results of TimeMill and PowerMill post-layout simulations, the maximum operating clock rate is 62.5 MHz in the worst case, which is actually limited by the clock rate of the I/O pads used, and the chip consumes 295.3 mW. The chip occupies a 6.743 × 10.174 mm² area in the 0.6 µm CMOS SPDM technology. The complete chip layout is shown in Fig. 7.

A compression operation requires 7 clocks, and each shift operation moves one byte per clock. Hence, the minimum compression rate occurs when only one byte is matched per comparison:

compression rate = 62.5 MHz / (7 + 1) × 1 × 8 bits = 62.5 Mbits/sec.

Conversely, the maximum compression rate occurs when all 8 bytes in the shift register are matched by the widest CAM in each comparison:

compression rate = 62.5 MHz / (7 + 8) × 8 × 8 bits ≈ 266.67 Mbits/sec.

The decompression rate lies between 133 Mbits/sec and 364 Mbits/sec, since a decompression operation requires only 3 clocks and each shift operation moves one byte per clock. Please note that the number of clocks required by both processors can be reduced by a more careful design.

6. Concluding Remarks

In this paper, a parallel-dictionary-based LZW algorithm, called the PDLZW algorithm, and its hardware architecture for compression and decompression processors have been proposed. The main features of the PDLZW algorithm are as follows.

1. A hierarchical parallel dictionary set with successively increasing word widths is used. Compared with conventional implementations of the LZW algorithm [20], our design uses only 1K different-length words, 3072 bytes in total, instead of 10880 entries of 20 bytes each. Although this degrades the compression ratio slightly, it is very simple to realize in VLSI technology. The resulting chip also achieves a faster compression and decompression rate, since in our architecture both the search time and the read time are shorter, owing to the smaller dictionary set with less address space and narrower word widths.

2. Since the compression ratio is a strong function of the correlation property of the underlying data files, it is hard to find a general scheme that suits all kinds of data files and achieves a good compression ratio, even in software realizations with large dictionaries such as compress or PKZIP. In this paper, however, we have proposed an alternative dictionary scheme that provides parallel search capability for the LZW algorithm. Based on this scheme, the dictionary size is dramatically reduced without sacrificing too much performance.

3. Both the compression and decompression processors are structured around a similar dictionary set, which can be built from CAMs or RAMs. Therefore, they can be combined into a single codec chip in practical implementations.

4. Although a CAM-based design is used as an example for the compression processor, the architecture is also suitable for a RAM-based one.

Acknowledgment

The author would like to thank the anonymous reviewers and the communicating editor for useful comments that improved the presentation of the paper. In addition, the author would like to thank Mr. Yung-Sen Chen for his assistance in implementing the design layout and performing the post-layout simulation of the chip.


References

1. Ming-Bo Lin, "A parallel VLSI architecture for the LZW data compression algorithm," International Symposium on VLSI Technology, Systems, and Applications, June 3–5, 1997, Taiwan, pp. 98–101.
2. T.C. Bell, J.G. Cleary, and I.H. Witten, Text Compression, Englewood Cliffs, NJ: Prentice-Hall, 1990.
3. Rafael C. Gonzalez and Richard E. Woods, Digital Image Processing, Reading, MA: Addison-Wesley, 1992.
4. D. Huffman, "A method for the construction of minimum redundancy codes," Proceedings of the IRE, vol. 40, 1952, pp. 1098–1101.
5. J.S. Vitter, "Design and analysis of dynamic Huffman codes," J. Association for Computing Machinery, vol. 34, no. 4, 1987, pp. 825–845.
6. C.E. Shannon and W. Weaver, The Mathematical Theory of Communication, Urbana, IL: Univ. Illinois Press, 1949.
7. P. Elias, "Universal codeword sets and representations of the integers," IEEE Trans. Information Theory, vol. 21, 1975, pp. 194–203.
8. A. Mukherjee, N. Ranganathan, and M. Bassiouni, "Efficient VLSI designs for data transformation of tree-based codes," IEEE Trans. Circuits Syst., vol. 38, 1991, pp. 306–314.
9. A. Mukherjee, N. Ranganathan, and J.W. Flieder, "MARVLE: A VLSI chip for data compression using tree-based codes," IEEE Trans. VLSI Syst., vol. 1, no. 2, 1993, pp. 203–214.
10. J. Ziv and A. Lempel, "A universal algorithm for sequential data compression," IEEE Trans. Information Theory, vol. IT-23, no. 3, 1977, pp. 337–343.
11. J. Ziv and A. Lempel, "Compression of individual sequences via variable-rate coding," IEEE Trans. Information Theory, vol. IT-24, no. 5, 1978, pp. 530–536.
12. Terry A. Welch, "A technique for high-performance data compression," IEEE Computer, vol. 17, no. 6, 1984, pp. 8–19.
13. D.J. Craft, "ADLC and a pre-processor extension, BDLC, provide ultra fast compression for general-purpose and bit-mapped image data," Proc. Data Compression Conf., 1995, p. 440.
14. Bongjin Jung and Wayne P. Burleson, "Efficient VLSI for Lempel-Ziv compression in wireless data communication networks," IEEE Trans. VLSI Syst., vol. 6, no. 3, 1998, pp. 475–483.
15. Gilbert Held, Data and Image Compression: Tools and Techniques, 4th edn., New York: John Wiley & Sons, 1996.
16. S. Bunton and G. Borriello, "Practical dictionary management for hardware data compression," Communications of the ACM, vol. 35, no. 1, 1992, pp. 95–104.
17. E. Fiala and D. Greene, "Data compression with finite windows," Communications of the ACM, vol. 32, no. 4, 1989, pp. 490–505.
18. J. Storer, Data Compression: Methods and Theory, Rockville, MD: Computer Science Press, 1988.
19. T. Halfhill, "How safe is data compression?," BYTE, 1994, pp. 56–74.
20. J. Jiang and S. Jones, "Word-based dynamic algorithms for data compression," IEE Proceedings-I, vol. 139, no. 6, 1992, pp. 582–586.

Ming-Bo Lin received the B.Sc. degree in electronic engineering from the National Taiwan University of Science and Technology, Taipei, the M.Sc. degree in electrical engineering from the National Taiwan University, Taipei, and the Ph.D. degree in electrical engineering from the University of Maryland, College Park. Since August 1992, he has been an associate professor with the Department of Electronic Engineering at the National Taiwan University of Science and Technology, Taipei. His research interests include VLSI systems design, parallel architectures and algorithms, computer arithmetic, and fault-tolerant computing. [email protected]