Compression of Big-Data in Hadoop: How a Big Elephant Turns Into a Baby Elephant

Abstract:- It is very difficult to handle a large set of data, and working with Big Data requires a large amount of storage. Compressing the data reduces the space needed. File compression brings two major benefits in Hadoop: first, it reduces the storage requirement; second, it speeds up data transfer across the network and to or from disk. Both savings play an important role in Hadoop, because under heavy workloads jobs take considerable time to complete and the internal MapReduce shuffle is under huge I/O pressure, so disk I/O and network bandwidth are the most precious resources in a Hadoop cluster. Compressing files in Hadoop therefore gives higher throughput by saving disk space and transfer time.

The Compression Trade-off:- Compression is a situation in which one quality or aspect is given up in return for gaining another: if one thing improves, something else must give. After compression, disk usage and network bandwidth decrease, but CPU utilization increases; the stronger the compression, the more CPU is required, because a file must be decompressed before it can be processed, and decompression generally adds to the running time of a job. In most cases, however, overall performance improves when compression is enabled in multiple phases of the job configuration.

The Formats of Compression Supported by Hadoop:- Hadoop supports multiple compression algorithms, commonly known as codecs (short for coder-decoder). Each codec encapsulates an implementation of one algorithm for compression and decompression. Because each method of compression is based on a different algorithm (reducing white space, hashing characters, etc.), each codec also has different characteristics. Some compression formats are splittable, which means a compressed file can be split (i.e. in the map phase) and each split decompressed independently on individual DataNodes. Splittable formats are very efficient because decompression is performed in parallel by the MapReduce tasks, so the time required is less. The following table lists, for each format, its codec class, file extension, whether it is splittable, and Hadoop support:

Format  | Codec                                      | Extension | Splittable | Hadoop
DEFLATE | org.apache.hadoop.io.compress.DefaultCodec | .deflate  | N          | Y
Gzip    | org.apache.hadoop.io.compress.GzipCodec    | .gz       | N          | Y
Bzip2   | org.apache.hadoop.io.compress.BZip2Codec   | .bz2      | Y          | Y
LZO     | com.hadoop.compression.lzo.LzopCodec       | .lzo      | N          | Y
LZ4     | org.apache.hadoop.io.compress.Lz4Codec     | .lz4      | N          | Y
Snappy  | org.apache.hadoop.io.compress.SnappyCodec  | .snappy   | N          | Y
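As a minimal sketch of how these codec classes are used directly, the following Java program compresses standard input to standard output with whichever codec class name from the table is passed on the command line. The class name StreamCompressor and the argument handling are illustrative, not part of any Hadoop API.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.util.ReflectionUtils;

public class StreamCompressor {
    public static void main(String[] args) throws Exception {
        // Fully qualified codec class name from the table,
        // e.g. org.apache.hadoop.io.compress.GzipCodec
        String codecClassName = args[0];
        Class<?> codecClass = Class.forName(codecClassName);
        Configuration conf = new Configuration();
        CompressionCodec codec =
                (CompressionCodec) ReflectionUtils.newInstance(codecClass, conf);
        // Wrap standard output in a compressing stream supplied by the codec
        // and copy standard input through it.
        CompressionOutputStream out = codec.createOutputStream(System.out);
        IOUtils.copyBytes(System.in, out, 4096, false);
        out.finish();
    }
}

Run with the GzipCodec, for example, the program would write gzip-compressed data that can be saved with the .gz extension from the table.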

How It Is Done:-

Input Compression: Suppose we have a large file (>500 GB) and we want to run MapReduce jobs repeatedly against the same input data. When a MapReduce job is submitted, Hadoop first checks whether the input file is compressed by examining its extension; if the file name has an appropriate extension, Hadoop decompresses it automatically using the matching codec. Therefore, users do not need to specify a codec explicitly in the MapReduce job. If the file extension does not match, the file will not be decompressed. To enable this self-detection and decompression, you must make sure that the file name extension matches one of the extensions supported by the codecs. It is recommended to use a splittable compression format because of its parallel computation.
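A small sketch of this extension-based detection, using Hadoop's CompressionCodecFactory (the same mapping of extensions to codecs that MapReduce relies on for its input files); the class name CodecDetection and the command-line handling are illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class CodecDetection {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        CompressionCodecFactory factory = new CompressionCodecFactory(conf);
        // getCodec() maps the file name extension (.gz, .bz2, .lzo, ...) to a codec,
        // mirroring the check Hadoop performs on job input files.
        CompressionCodec codec = factory.getCodec(new Path(args[0]));
        if (codec == null) {
            System.out.println("No matching extension: file will be read as-is.");
        } else {
            System.out.println("Detected codec: " + codec.getClass().getName());
        }
    }
}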

Output Compression: In general, enabling compression of the map output significantly reduces the internal "shuffle" time, and it is particularly recommended when the task log shows that data has been written to a large number of intermediate partitions on the local hard drive, a process called spilling. The shuffle becomes very inefficient when a lot of data is spilled. To reduce the amount of data spilled, the following properties can be set (Hadoop 0.20 and earlier):

mapred.compress.map.output=true;
mapred.map.output.compression.codec={Codec}
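The same settings can be applied from a job driver, as in the sketch below. SnappyCodec is only an example choice for {Codec}; note that newer Hadoop releases rename these properties to mapreduce.map.output.compress and mapreduce.map.output.compress.codec.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.SnappyCodec;

public class MapOutputCompression {
    public static Configuration configure() {
        Configuration conf = new Configuration();
        // Same effect as setting the mapred.* properties above;
        // SnappyCodec stands in for the {Codec} placeholder.
        conf.setBoolean("mapred.compress.map.output", true);
        conf.setClass("mapred.map.output.compression.codec",
                SnappyCodec.class, CompressionCodec.class);
        return conf;
    }
}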

Final Job Compression: Compressing the final job output reduces data storage and also makes it faster to write the output file. When compression is enabled on the output of the first job in a chain, you not only reduce the amount of storage required but also improve the performance of the second job, because its input has already been compressed by the first job. Moreover, compressing output data is useful for archiving. The following properties are used (Hadoop 0.20):

mapred.output.compress=true;
mapred.output.compression.codec={Codec}
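A minimal sketch of enabling final output compression from the driver, using FileOutputFormat's helper methods from the new MapReduce API; GzipCodec is only an example choice of codec.

import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class FinalOutputCompression {
    public static void enable(Job job) {
        // Equivalent to mapred.output.compress=true with a gzip codec,
        // expressed through FileOutputFormat's helper methods.
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
    }
}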

Summary:-

Reasons to compress:
a) The data is stored but not frequently processed. In this case the space saving can be much more significant than the processing overhead.
b) The compression factor is very high, so a lot of I/O is saved.
c) Decompression is very fast, so there is some gain at little cost.
d) The data already arrives compressed.

Reasons not to compress:
a) The compressed data is not splittable. Note that many modern formats are built with block-level compression to enable splitting and other partial processing of the files.
b) The data is created in the cluster and compression takes significant time. Note that compression is usually much more CPU intensive than decompression.
c) The data has little redundancy, so compression gives little gain.
