International Journal of Modern Computer Science and Applications (IJMCSA), Volume 1, Issue 2, May 2013, ISSN: 2321-2632 (Online)

Sunzip User Tool for Data Reduction Using Huffman Algorithm

Ramesh Jangid
M.Tech (Computer Science), Jagannath University, Jaipur, India
E-mail: [email protected]

Sandeep Kumar
Asst. Prof., Computer Science, Jagannath University, Jaipur, India
E-mail: [email protected]

Abstract: Smart Huffman Compression is a software appliance designed to compress a file in a better way. Implemented as a JSP, it provides a high-level abstraction over Java Servlets. Smart Huffman Compression encodes digital information using fewer bits and reduces the size of a file without loss of data, in a single, easy-to-manage software appliance form factor. It also provides a decompression facility. Smart Huffman Compression offers an organization effective solutions for reducing file size through lossless compression of data, and the encoding functionality also improves the security of the data. It is necessary to analyze the relationship between the different methods and to put them into a framework, in order to better understand and better exploit the possibilities that compression provides: image compression, data compression, audio compression, video compression, etc. [1]

Keywords: Data Reduction, Java Servlet, Compression, Encoding, JSP

I. INTRODUCTION

Smart Huffman Compression/Decompression is a software application designed to simplify the compression of a file and to make more efficient use of disk space. It also allows better utilization of bandwidth when data is transferred. The forms of data that are easy to manage through this application are:

Data Compression: Simplifies text compression in digital form. The text is encoded using fewer bits and the original text is replaced by those bits.

Image Compression: Includes segmentation, filtering of pixels and altering of colours to reduce the size of a digital image.

Audio Compression: Helps to reduce the size of digital audio streams and files. It has the potential to reduce the transmission bandwidth and storage requirements of audio data.

Video Compression: Reduces the size of digital video streams and files. It combines spatial image compression with temporal motion compensation and is a practical implementation of source coding in information theory. Video compression typically operates on square-shaped groups of neighboring pixels, often called macroblocks.

The concept behind the Huffman algorithm is that it uses a variable-length code for each of the elements within the information: it analyzes the information to determine the probability of each element, then codes the most probable elements with a few bits and the least probable elements with a greater number of bits.

II. HUFFMAN ALGORITHM

The Huffman algorithm is a compression technique with variable-length codes. Given the data symbols and their frequencies of occurrence (their probabilities), it constructs a set of variable-length codewords with the shortest average length and assigns them to the symbols. It generally produces better codes than Shannon-Fano, and like the Shannon-Fano method it produces the best variable-length codes when the probabilities of the symbols are negative powers of 2. The main difference between the two methods is that Shannon-Fano constructs its codes from the top down (and the bits of each codeword from left to right), while Huffman constructs a code tree from the bottom up (and the bits of each codeword from right to left).
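As a brief illustration of the last point (standard coding theory, not specific to this paper): when every symbol probability is a negative power of 2, the Huffman codeword lengths equal the self-information of the symbols, so the average code length meets the entropy bound exactly.

```latex
% Example: probabilities that are negative powers of 2
% p = (1/2, 1/4, 1/8, 1/8)  ->  Huffman lengths l_i = -\log_2 p_i = (1, 2, 3, 3),
% e.g. the codes 0, 10, 110, 111.
\bar{L} = \sum_i p_i\, l_i
        = \tfrac{1}{2}(1) + \tfrac{1}{4}(2) + \tfrac{1}{8}(3) + \tfrac{1}{8}(3)
        = 1.75 \ \text{bits/symbol}
        = -\sum_i p_i \log_2 p_i = H(P).
```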

Huffman Encoding Algorithm

Step 1: Find the frequency of occurrence (or probability) of each symbol in the given text.
Step 2: List all the source symbols in order of decreasing probability in a tabular format.
Step 3: Combine the probabilities of the two symbols having the lowest probabilities, and reorder the resulting probabilities in decreasing order. This step is called reduction 1.
Step 4: Repeat step 3 until only two ordered probabilities remain.
Step 5: Now go back and assign 0 and 1 to the probabilities that were combined in the previous reduction step, retaining all assignments already made.
Step 6: Keep regressing in this way until the first column is reached.

Example

Let the given text be: SIDVICIIISIDIDVI

There are five distinct symbols in this text. Their probabilities are:

Symbol   Probability
'C'      1/16 = 0.0625
'D'      3/16 = 0.1875
'I'      8/16 = 0.5
'S'      2/16 = 0.125
'V'      2/16 = 0.125

Now, according to step 2, the symbols are listed in order of decreasing probability:

Symbol   Probability
'I'      0.5
'D'      0.1875
'S'      0.125
'V'      0.125
'C'      0.0625

According to steps 3, 4 and 5, the probabilities are repeatedly combined and the bits assigned as shown in Figure 1.

Figure 1: Procedure of Huffman Encoding

We can now write the code for each symbol:

Symbol   Code   Code Length
'C'      1001   4
'D'      11     2
'I'      0      1
'S'      101    3
'V'      1000   4

With these code lengths we can calculate the average code length for this text:

L = Σ Pi × Ni, for i = 1 to m

where Pi is the probability of the i-th symbol and Ni is its code length. So

L = (0.0625 × 4) + (0.1875 × 2) + (0.5 × 1) + (0.125 × 3) + (0.125 × 4) = 2 bits per symbol.

The encoded message becomes:

SIDVICIIISIDIDVI = 101 0 11 1000 0 1001 0 0 0 101 0 11 0 11 1000 0

(The spaces are only to make reading easier.) The compressed output therefore takes 32 bits, and we need at least 10 further bits to transfer the Huffman tree by sending the code lengths. The message originally took 48 bits; now it takes at least 42 bits. The codes correspond to the Huffman tree shown in Figure 2.

Figure 2: Huffman Tree

Huffman Decoding

The codes of the symbols are based on the probabilities or frequencies of occurrence of the symbols. These frequencies have to be written, as side information, on the output, so that any Huffman decompressor (decoder) will be able to decompress the data. This is easy, because the frequencies are integers and the probabilities can be written as scaled integers; it normally adds just a few hundred bytes to the output. It is also possible to write the variable-length codes themselves on the output, but this may be awkward, because the codes have different sizes. The algorithm for decoding is simple. Start at the root and read the first bit off the input, i.e. the compressed file. If it is zero, follow the bottom edge of the tree; if it is one, follow the top edge. Read the next bit and move another edge toward the leaves of the tree. When the decoder arrives at a leaf, it finds there the original, uncompressed symbol (normally its ASCII code), and that code is emitted by the decoder. The process then starts again at the root with the next bit.
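The whole cycle can be sketched compactly in Java, the language the tool is built around. The following is a minimal, self-contained illustration of the steps above, not the Sunzip source code; all class and method names are chosen for illustration. It counts frequencies, merges the two least probable nodes using a priority queue, derives the codewords, and then decodes by walking the tree bit by bit.

```java
import java.util.*;

// Minimal Huffman sketch (illustration only, not the Sunzip implementation).
public class HuffmanSketch {

    // A tree node: leaves carry a symbol, internal nodes carry two children.
    static class Node {
        final char symbol;      // meaningful only for leaves
        final int freq;
        final Node left, right; // null for leaves
        Node(char s, int f)  { symbol = s; freq = f; left = null; right = null; }
        Node(Node l, Node r) { symbol = '\0'; freq = l.freq + r.freq; left = l; right = r; }
        boolean isLeaf()     { return left == null; }
    }

    // Steps 1-4: count frequencies, then repeatedly merge the two least probable nodes.
    static Node buildTree(String text) {
        Map<Character, Integer> freq = new HashMap<>();
        for (char c : text.toCharArray()) freq.merge(c, 1, Integer::sum);

        PriorityQueue<Node> pq = new PriorityQueue<>(Comparator.comparingInt((Node n) -> n.freq));
        for (Map.Entry<Character, Integer> e : freq.entrySet())
            pq.add(new Node(e.getKey(), e.getValue()));
        while (pq.size() > 1)
            pq.add(new Node(pq.poll(), pq.poll()));   // merge the two lowest probabilities
        return pq.poll();
    }

    // Steps 5-6: walk back down the tree, assigning 0 to one branch and 1 to the other.
    static void assignCodes(Node n, String prefix, Map<Character, String> codes) {
        if (n.isLeaf()) { codes.put(n.symbol, prefix.isEmpty() ? "0" : prefix); return; }
        assignCodes(n.left,  prefix + "0", codes);
        assignCodes(n.right, prefix + "1", codes);
    }

    // Decoding: start at the root, follow one edge per input bit, emit the symbol at each leaf.
    static String decode(Node root, String bits) {
        StringBuilder out = new StringBuilder();
        Node n = root;
        for (char bit : bits.toCharArray()) {
            n = (bit == '0') ? n.left : n.right;
            if (n.isLeaf()) { out.append(n.symbol); n = root; }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String text = "SIDVICIIISIDIDVI";             // the example from the paper
        Node root = buildTree(text);
        Map<Character, String> codes = new TreeMap<>();
        assignCodes(root, "", codes);
        System.out.println("Codes: " + codes);

        StringBuilder encoded = new StringBuilder();
        for (char c : text.toCharArray()) encoded.append(codes.get(c));
        System.out.println("Encoded length in bits: " + encoded.length());   // 32 for this text
        System.out.println("Decoded: " + decode(root, encoded.toString()));  // SIDVICIIISIDIDVI
    }
}
```

Because ties between equal probabilities can be broken either way, the individual codewords (and, for the equally probable symbols, the way the lengths are distributed) may differ from the table above, but the 32-bit total for this text is the same.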

III. HUFFMAN PERFORMANCE

Huffman coding is the subject of intensive research in data compression, and algebraic approaches to constructing Huffman codes are known. Robert Gallager showed that the redundancy of Huffman coding is at most p1 + 0.086, where p1 is the probability of the most common symbol in the alphabet. The redundancy is the difference between the average Huffman codeword length and the entropy. Given a large alphabet, such as the set of letters, digits and punctuation marks used by a natural language, the largest symbol probability is typically around 15-20%, bringing the value of the quantity p1 + 0.086 to roughly 0.24-0.29. This means that Huffman codes are at most about 0.3 bit longer per symbol than an ideal entropy encoder, such as arithmetic coding.
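In symbols (with L-bar the average codeword length, H(P) the entropy and p1 the largest symbol probability), Gallager's bound and the arithmetic behind the estimate above are:

```latex
R \;=\; \bar{L} - H(P) \;\le\; p_1 + 0.086 .
% For natural-language text with p_1 \approx 0.15\text{--}0.20:
% 0.15 + 0.086 = 0.236, \qquad 0.20 + 0.086 = 0.286,
% i.e. at most roughly 0.24--0.29 extra bit per symbol.
```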

The Huffman method assumes that the frequencies of occurrence of all the symbols of the alphabet are known to the compressor. In practice, the frequencies are seldom, if ever, known in advance. One approach to this problem is for the compressor to read the original data twice: the first time it only counts the frequencies, and the second time it compresses the data. Between the two passes, the compressor constructs the Huffman tree. Such a two-pass method is sometimes called semi-adaptive and is normally too slow to be practical. The method that is used in practice is called adaptive (or dynamic) Huffman coding. This method is the basis of the UNIX compact program. It was originally developed by Faller and Gallager, with substantial improvements by Knuth. The main idea is for the compressor and the decompressor to start with an empty Huffman tree and to modify it as symbols are being read and processed (in the case of the compressor, "processed" means compressed; in the case of the decompressor, it means decompressed). The compressor and decompressor should modify the tree in the same way, so at any point in the process they use the same codes, although those codes may change from step to step. We say that the compressor and decompressor are synchronized, or that they work in lockstep, although they don't necessarily work together; compression and decompression normally take place at different times. The term mirroring is perhaps a better choice: the decoder mirrors the operations of the encoder.

Initially, the compressor starts with an empty Huffman tree; no symbols have been assigned codes yet. The first symbol being input is simply written on the output in its uncompressed form. The symbol is then added to the tree and a code is assigned to it. The next time this symbol is encountered, its current code is written on the output and its frequency is incremented by 1. Since this modifies the tree, the tree is examined to see whether it is still a Huffman tree (i.e. whether it still gives the best codes). If not, it is rearranged, an operation that results in modified codes.

The decompressor mirrors the same steps. When it reads the uncompressed form of a symbol, it adds it to the tree and assigns it a code. When it reads a compressed variable-length code, it scans the current tree to determine what symbol the code belongs to, then increments the symbol's frequency and rearranges the tree in the same way as the compressor. It is immediately clear that the decompressor needs to know whether the item it has just input is an uncompressed symbol (normally an 8-bit ASCII code) or a variable-length code. To remove any ambiguity, each uncompressed symbol is preceded by a special, variable-size escape code. When the decompressor reads this code, it knows that the next eight bits are the ASCII code of a symbol that appears in the compressed file for the first time. The trouble is that the escape code must not be any of the variable-length codes used for the symbols. These codes, however, are modified every time the tree is rearranged, which is why the escape code should also be modified. A natural way to do this is to add an empty leaf to the tree, a leaf with a zero frequency of occurrence, that is always assigned to the 0-branch of the tree. Since the leaf is in the tree, it is assigned a variable-length code; this code is the escape code that precedes every uncompressed symbol. As the tree is being rearranged, the position of the empty leaf, and thus its code, changes, but this escape code is always used to identify uncompressed symbols in the compressed file.
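The following Java sketch illustrates the escape-code protocol just described. It is deliberately simplified and is neither the FGK/Vitter algorithm nor Sunzip code: instead of incrementally rearranging the tree, it rebuilds the code table from the current frequency counts after every symbol, which keeps an encoder and a matching decoder in lockstep but is much slower. All names are illustrative.

```java
import java.util.*;

// Simplified adaptive-Huffman illustration (NOT the FGK/Vitter algorithm):
// the code table is rebuilt from scratch after every symbol instead of being
// updated incrementally, but encoder and decoder would still stay in lockstep.
public class AdaptiveHuffmanSketch {

    static final char ESC = '\u0000';   // the "empty leaf" / escape pseudo-symbol

    static class Node {
        final char sym; final int freq; final Node left, right;
        Node(char s, int f)  { sym = s; freq = f; left = null; right = null; }
        Node(Node l, Node r) { sym = ESC; freq = l.freq + r.freq; left = l; right = r; }
    }

    // Rebuild the prefix codes from the current frequencies (ESC is always present, freq 0).
    static Map<Character, String> rebuildCodes(Map<Character, Integer> freq) {
        PriorityQueue<Node> pq = new PriorityQueue<>(Comparator.comparingInt((Node n) -> n.freq));
        pq.add(new Node(ESC, 0));
        for (Map.Entry<Character, Integer> e : freq.entrySet())
            pq.add(new Node(e.getKey(), e.getValue()));
        while (pq.size() > 1) pq.add(new Node(pq.poll(), pq.poll()));
        Map<Character, String> codes = new HashMap<>();
        collect(pq.poll(), "", codes);
        return codes;
    }

    static void collect(Node n, String prefix, Map<Character, String> codes) {
        if (n.left == null) { codes.put(n.sym, prefix.isEmpty() ? "0" : prefix); return; }
        collect(n.left,  prefix + "0", codes);
        collect(n.right, prefix + "1", codes);
    }

    // Encoder: first occurrence -> escape code + raw 8 bits; later occurrences -> current code.
    static String encode(String text) {
        Map<Character, Integer> freq = new LinkedHashMap<>();
        Map<Character, String> codes = rebuildCodes(freq);
        StringBuilder out = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (!freq.containsKey(c)) {
                out.append(codes.get(ESC));                       // current escape code
                out.append(String.format("%8s", Integer.toBinaryString(c)).replace(' ', '0'));
                freq.put(c, 1);
            } else {
                out.append(codes.get(c));
                freq.merge(c, 1, Integer::sum);
            }
            codes = rebuildCodes(freq);   // the decoder performs exactly the same update
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String bits = encode("SIDVICIIISIDIDVI");
        System.out.println(bits + " (" + bits.length() + " bits)");
    }
}
```

A matching decoder would perform the same update after every decoded symbol: it reads bits until they match either the current escape code (and then takes the next eight bits as a raw symbol) or the code of a known symbol, updates the same frequency table, and rebuilds the same code table.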

IV. ADVANCEMENTS IN HUFFMAN

Huffman coding is a process that replaces fixed-length symbols (8-bit bytes) with variable-length codes. GNU zip, also known as GZIP, is a compression tool originally intended to replace the compress program used in early Unix systems, and it can be seen as an advancement of the Huffman algorithm. It is based on an algorithm known as DEFLATE, which is also a lossless data compression algorithm and uses both the LZ77 algorithm and Huffman coding. Essentially, GZIP refers to the file format of the same name. This format consists of a 10-byte header containing a magic number (a fixed numerical or text value that never changes and is used to signify the file format), optional extra headers that may or may not be present (the original file name, for example), a body that contains the DEFLATE-compressed payload, and an 8-byte footer which contains a CRC-32 checksum as well as the length of the original uncompressed data. GZIP is used when a huge file has to be compressed; it is very beneficial when we need to save both space and time, since it compresses a file into very little space. Because GZIP compresses one large file instead of multiple smaller ones, it can take advantage of the redundancy across the files to reduce the file size even further. GZIP is purely a compression tool; it relies on another tool, tar, to archive files. Compression is a technique used to reduce the size of a file, while archiving is a technique used to combine multiple files into a single one. In practice, tar first combines all the files into a single tarball, which GZIP then compresses. GZIP is used on UNIX-like operating systems such as the Linux distributions.
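For reference, the JDK already exposes GZIP through the standard java.util.zip package, so a file can be compressed and decompressed in a few lines. This is a minimal sketch; the file names are placeholders and error handling is omitted.

```java
import java.io.*;
import java.nio.file.*;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Compress and decompress a single file with the JDK's built-in GZIP streams
// (DEFLATE = LZ77 + Huffman coding). Requires Java 9+ for InputStream.transferTo.
public class GzipExample {
    public static void main(String[] args) throws IOException {
        Path original   = Paths.get("report.txt");       // placeholder input file
        Path compressed = Paths.get("report.txt.gz");
        Path restored   = Paths.get("report.restored.txt");

        // Compression: anything written through GZIPOutputStream is DEFLATE-compressed.
        try (InputStream in = Files.newInputStream(original);
             OutputStream out = new GZIPOutputStream(Files.newOutputStream(compressed))) {
            in.transferTo(out);
        }

        // Decompression: GZIPInputStream parses the header and verifies the CRC-32 footer.
        try (InputStream in = new GZIPInputStream(Files.newInputStream(compressed));
             OutputStream out = Files.newOutputStream(restored)) {
            in.transferTo(out);
        }

        System.out.println("Original:   " + Files.size(original) + " bytes");
        System.out.println("Compressed: " + Files.size(compressed) + " bytes");
    }
}
```

Archiving multiple files first (for example with tar) and then gzipping the single archive follows exactly the division of labour described above.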

V. BENEFITS OF HUFFMAN ENCODING

Huffman encoding is one of the best compression techniques: it is fast, simple and easy to implement. It starts with a set of symbols whose probabilities are known and constructs a code tree; when the tree is complete, it determines the variable-length prefix codewords for the individual symbols in the text. Huffman compression is applied in three main ways.

In the first, the implementer of the Huffman compressor/decompressor selects a set of documents that are judged typical, analyses them and counts the occurrences of each symbol. Based on these occurrences, a fixed Huffman code tree is constructed. Its codes may not match the symbol probabilities of any particular input file being compressed, but the approach is simple and fast, which is why it is used in FAX machines.

The second is a two-pass compression job that produces the ideal codewords for the input file. Because the input file is read twice, this approach is slow. In the first pass, the encoder counts the symbol occurrences and determines the probability of each symbol; it uses this information to construct the Huffman codewords for the file being compressed. In the second pass, the encoder actually compresses the data by replacing each symbol with its respective codeword.

The third, adaptive Huffman compression, starts with an empty Huffman code tree and updates the tree as the input symbols are read and processed. When a symbol is input, the tree is searched for it. If the symbol is in the tree, its codeword is used; otherwise it is added to the tree and a new codeword is assigned to it. In that case the tree is examined and rearranged to keep it a Huffman code tree. This process has to be done carefully to make sure that the decoder can perform it in the same way as the encoder, in lockstep, which makes it difficult to implement.
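As a toy illustration of the first (pre-trained, FAX-style) approach, encoder and decoder can simply ship with the same hard-coded table, so no tree or frequency information has to be transmitted at all. The table below reuses the codes from the example in Section II purely for illustration; it is not an actual FAX (ITU-T T.4) code table.

```java
import java.util.Map;

// "Pre-trained table" approach: both sides agree on a fixed code table in advance,
// so the compressed stream carries no side information about the tree.
// The table here is a toy example, not a real standardized code table.
public class StaticTableSketch {
    static final Map<Character, String> CODES =
            Map.of('I', "0", 'D', "11", 'S', "101", 'V', "1000", 'C', "1001");

    static String encode(String text) {
        StringBuilder bits = new StringBuilder();
        for (char c : text.toCharArray()) bits.append(CODES.get(c));
        return bits.toString();
    }

    public static void main(String[] args) {
        // 32 bits for the sample text, and no tree has to be sent alongside it.
        System.out.println(encode("SIDVICIIISIDIDVI"));
    }
}
```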


VI. SAMPLE SCREENSHOTS

Step-wise screenshots of the tool are shown below.

Figure 1: Screenshot of the Sunzip tool

Figure 2: Screenshot of the Sunzip tool showing the original file size, the number of distinct characters, the compressed file size and the compression ratio

VII. COMPARISON TABLE

Full details for the text and MP3 file formats are shown in the tables below. Table 1 takes text files and compares the data-reduction techniques in terms of time, compression ratio and space required; it indicates that GZIP and LZW are the better techniques for compressing text files. Table 2 repeats the comparison for MP3 files, where Huffman and GZIP are the better techniques, so overall Huffman provides good data reduction.

Type of File: TXT

Algorithm          S.No   Original Size (bytes)   Compressed Size (bytes)   Compression Ratio   Distinct Characters   Better Algorithm
HUFFMAN            1      1702                    1081                      63.51%              50
                   2      334                     321                       96.11%              45
                   3      48890                   32249                     65.96%              93
SHANNON FANO       1      1702                    1114                      65.45%              50
                   2      334                     331                       99.10%              45
                   3      48890                   33666                     68.86%              93
GZIP               1      1702                    812                       47.71%              50                    yes
                   2      334                     183                       54.79%              45
                   3      48890                   10734                     21.96%              93
COSMO              1      1702                    1335                      78.44%              50
                   2      334                     304                       91.02%              45
                   3      48890                   42880                     87.71%              93
JUNK CODE BINARY   1      1702                    1205                      70.80%              50
                   2      334                     276                       82.63%              45
                   3      48890                   37950                     77.62%              93
LZW                1      1702                    1273                      74.79%              50                    yes
                   2      334                     333                       99.70%              45
                   3      48890                   23058                     47.16%              93

Table 1: Comparison of compression techniques on text files

Remark (Table 1):
Time: GZIP, HUFFMAN, SHANNON FANO, JUNK CODE BINARY, LZW
Compression order: GZIP, LZW, HUFFMAN, JUNK CODE BINARY, SHANNON FANO, COSMO
Space required: GZIP, LZW, HUFFMAN, JUNK CODE BINARY, SHANNON FANO, COSMO
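As the values in the table indicate, the compression ratio reported here is the compressed size expressed as a percentage of the original size (so a value above 100% means the "compressed" file actually grew). For example, for the first text file under Huffman:

```latex
\text{compression ratio} \;=\; \frac{\text{compressed size}}{\text{original size}} \times 100\%,
\qquad
\frac{1081}{1702} \times 100\% \;\approx\; 63.51\% .
```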


Type of File: MP3

Algorithm          S.No   Original Size (bytes)   Compressed Size (bytes)   Compression Ratio   Distinct Characters   Better Algorithm
HUFFMAN            1      7323106                 7294912                   99.62%              256                   yes
                   2      4789888                 4781161                   99.82%              256
                   3      5933509                 5904888                   99.52%              256
SHANNON FANO       1      7323106                 7404275                   101.11%             256
                   2      4789888                 4865326                   101.57%             256
                   3      5933509                 5986265                   100.89%             256
GZIP               1      7323106                 7223973                   98.65%              256                   yes
                   2      4789888                 4733205                   98.82%              256
                   3      5933509                 5846223                   98.53%              256
JUNK CODE BINARY   1      7323106                 7854595                   107.26%             256
                   2      4789888                 5153463                   107.59%             256
                   3      5933509                 6363777                   107.25%             256
RLE                1      7323106                 7411408                   101.21%             256
                   2      4789888                 4834808                   100.94%             256
                   3      5933509                 5994496                   101.03%             256
LZW                1      7323106                 10186483                  139.10%             256
                   2      4789888                 6820672                   142.40%             256
                   3      5933509                 8217964                   138.50%             256

Table 2: Comparison of compression techniques on MP3 files

Remark (Table 2):
Time: RLE, GZIP, HUFFMAN, SHANNON FANO, JUNK CODE BINARY, LZW
Compression order: GZIP, HUFFMAN, RLE, SHANNON FANO, JUNK CODE BINARY, LZW
Space required: GZIP, HUFFMAN, RLE, SHANNON FANO, JUNK CODE BINARY, LZW

VIII. CONCLUSION

The Huffman algorithm is a lossless compression technique. Huffman is the most efficient, but it requires two passes over the data. The amount of compression, of course, depends on the type of file being compressed. Random data, such as executable programs or object-code files, typically compresses poorly, resulting in a file which is 50 to 95% of the original file size. Still images and animation files tend to have high compression and typically result in a file which is only 2 to 20% of the original file size. It should be noted that once a file has been compressed there is virtually no gain in compressing it again; thus storing or transmitting compressed files over a system which applies further compression will not increase the compression ratio. In DEFLATE-based schemes such as GZIP, Huffman codes are also used to differentiate between kinds of data, i.e. literal values and back references.

IX. REFERENCES

[1] D. W. Gillman, M. Mohtashemi, and R. L. Rivest, "On breaking a Huffman code," IEEE Transactions on Information Theory, vol. 42, no. 3, pp. 972-976, May 1996.
[2] J. Ziv and A. Lempel, "A Universal Algorithm for Sequential Data Compression," IEEE Transactions on Information Theory, May 1977.
[3] Mridul K. M., "Lossless Huffman coding technique for image compression and reconstruction using binary trees," IJCTA, vol. 3, no. 1, pp. 76-79, Feb. 2012.
[4] A. B. Watson, "Image Compression Using the DCT," Mathematica Journal, 1995, pp. 81-88.
[5] J. Ziv and A. Lempel, "A Universal Algorithm for Sequential Data Compression," IEEE Transactions on Information Theory, vol. 23, pp. 337-342, 1977.
[6] D. E. Knuth, "Dynamic Huffman Coding," Journal of Algorithms, vol. 6, pp. 163-180, 1983.
[7] Dzung Tien Hoang and Jeffrey Scott Vitter, "Fast and Efficient Algorithms for Video Compression and Rate Control," June 20, 1998.

AUTHOR'S BIOGRAPHIES

First Author: Ramesh Jangid is an M.Tech (Computer Science) student at Jagannath University, Jaipur, and a member of IACSIT and IAENG. He completed his B.E. in Computer Science Engineering from Rajasthan University, Jaipur, in 2008. His areas of specialization are data structures, computer networking, Red Hat Linux, real-time systems and cloud computing.

Second Author: Mr. Sandeep Kumar is an Assistant Professor in the Computer Science Department at Jagannath University, Jaipur. He holds an M.Tech degree, is pursuing a Ph.D., and has published papers in various journals and international conferences. He is a member of IACSIT and IAENG. His areas of specialization are data structures, computer networks, artificial intelligence and database management systems.
