Image Data-Deduplication Using Block Truncation Coding Technique


1Anum Javeed Zargar, 2Ninni Singh, 3Geetanjali Rathee, 4Amit Kumar Singh

1Department of Computer Science and Engineering, Jaypee University of Information Technology, Waknaghat, Distt. Solan 173234, India; [email protected]
2,3,4Department of Computer Science and Engineering, Jaypee University of Information Technology, Waknaghat, Distt. Solan 173215, India; [email protected], [email protected], [email protected]

Abstract— Content-Based Image Retrieval (CBIR) is the application of computer vision techniques to the problem of retrieving images from large databases. In this paper, de-duplication of electricity bills is performed through CBIR. The technique used for comparing images is Block Truncation Coding (BTC), a lossy compression technique for grayscale images. The photograph attached to an electricity bill is divided into blocks of different sizes using BTC, and images compressed with the same block size are placed in the same cluster. Because the blocks can be large, they are further divided into chunks of data so that a single instance of each image can be stored in these chunks. Through this experiment we try to remove duplicates within the same cluster, so de-duplication is improved with the help of the compression technique discussed in detail.

Keywords— Data de-duplication; K-means clustering; Block truncation coding

I. Introduction

In today's world there is a great demand for a billing system that is carried out in a proper way, yet duplicate data still occurs due to mistakes in the system. The Public Works Department (PWD) is a government department or ministry responsible for public works; it is a mechanism for ensuring access to and availability of proper electricity at subsidized prices to households. Identifying a proper billing system and ensuring delivery of electricity effectively and efficiently is the main challenge for the PWD. As part of this, one department of Civil Supplies in India has issued around 22 million electricity meters covering around 80 million citizens, and this process was decentralized. The department noticed that there were some bogus electricity bills and decided to execute a de-duplication process over the entire data. De-duplication is carried out in two different ways: one is biometric-based and the other is photo-based. The reason to go for photo-based de-duplication is that some electricity bills have no biometrics attached. In this paper, an attempt is made to explain the de-duplication process for photographic images. Photo-based de-duplication means finding duplicate electricity bills based on the family photograph in the large-scale database.


The operators generated some duplicate electricity bills by reusing the family photographs of already existing electricity bill cards. The methods described in [1-3] use a color histogram refinement technique with color coherence vectors and color- and texture-based CBIR. Data de-duplication is a special technique for compression of data [4]; it aims to provide better storage space and utilization. In some processes, de-duplication is analyzed using different biometric technologies such as fingerprints: if a fingerprint impression matches one already present in the database, we conclude that a duplicate exists. Such data is identified as a redundant data block; a redundant data block is replaced by a reference, and only the data with a unique identifier, of which there is no duplicate in the database, is stored. Researchers working on data archiving systems have shown that duplication can be removed by compressing the data. In this paper, we propose an algorithm to de-duplicate photographs using histogram refinement for CBIR (Content-Based Image Retrieval). Histogram refinement splits the pixels in a given bucket into several classes based on some local property; within a given bucket, only pixels in the same class are compared. The entire photographic data set was divided into clusters. Clustering takes place at two levels: district-level clustering and k-means clustering. District-level clustering means dividing the data into different clusters based on district names, since a family may hold two or more electricity bills in different districts; 33 clusters are formed based on district name. The fixed-size chunking method splits the original file into blocks of the same size. The space reduction ratio and the percentage of de-duplication are calculated using the following equations:

Space reduction ratio = size of data before de-duplication / size of data after de-duplication        (1)

Percentage of de-duplication = (1 − size of data after de-duplication / size of data before de-duplication) × 100        (2)
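As an illustration of how fixed-size chunking and equations (1)-(2) can be evaluated, the following minimal Python sketch stores each distinct chunk once and reports the space reduction ratio and de-duplication percentage. The 4 KB chunk size, the SHA-1 digests and the function names are illustrative assumptions, not details taken from the paper.

```python
import hashlib

CHUNK_SIZE = 4096  # illustrative fixed chunk size in bytes

def fixed_size_chunks(data, size=CHUNK_SIZE):
    """Split a byte string into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def deduplicate(files):
    """Store each distinct chunk once, keyed by its SHA-1 digest."""
    store = {}            # digest -> chunk, i.e. the single-instance store
    total_bytes = 0
    for data in files:
        for chunk in fixed_size_chunks(data):
            total_bytes += len(chunk)
            store.setdefault(hashlib.sha1(chunk).hexdigest(), chunk)
    stored_bytes = sum(len(c) for c in store.values())
    space_reduction_ratio = total_bytes / stored_bytes            # equation (1)
    dedup_percentage = (1 - stored_bytes / total_bytes) * 100     # equation (2)
    return space_reduction_ratio, dedup_percentage

# Three "photographs", two of them identical byte for byte
ratio, percent = deduplicate([b"A" * 8192, b"A" * 8192, b"B" * 8192])
print(f"space reduction ratio = {ratio:.2f}, de-duplication = {percent:.1f}%")
```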

II. Related Work

To detect an edge in our running environment, Amir and Lindenbaum [6] propose grouping cues to separate the k-nearest edges into two sets: one set of edges that are likely to be in the same group, and another set of edges that are, with high statistical probability, in different groups. If the ratio of grouped edges to non-grouped edges is above a specified threshold, the edge labelled l is chosen as a background element. This procedure leads to an asymmetric labelling process, in which figure-from-background discrimination is separated from the other process [7]. Edges with highly salient features are considered main elements and are separated from background elements; if similarity between them occurs, they are not unique with respect to each other. Clustering approaches related to our clustering method include histogram clustering [9], a machine-learning method that replaces the means and deviations of clusters with a histogram and dynamically updates the histogram during the clustering process; it can be interpreted as a k-means method in histogram space. Pairwise clustering [10] minimizes a normalized sum of all intra-cluster dissimilarities, where the data are clustered in the form of pairs, whereas Normalized Cut [11], [12] combines both intra-cluster and inter-cluster similarities.

The rest of the paper is organized as follows. Section III describes the technique used, Block Truncation Coding. Section IV presents the proposed technique. The performance analysis, the comparison with previous methods and the conclusions are given in Sections V, VI and VII respectively.

III. Technique Used

The technique used in this paper is a compression technique: image compression reduces irrelevant and redundant data so that the image can be stored and transmitted in an efficient way [13]. There are two types of compression:

1) Lossy compression: used at low bit rates and suitable for natural images such as photographs. The word lossy means that something is lost; lossy compression that produces only imperceptible differences is called visually lossless. BTC is an example of lossy compression.

2) Lossless compression: a class of data compression that allows the original image to be reconstructed perfectly from the compressed data. It is used in many applications.

Block Truncation Coding (BTC) is a simple and fast lossy compression technique for images [13]. It has played an important role in the compression of images, and many methods have been inspired by BTC. Another variation of BTC is Absolute Moment Block Truncation Coding (AMBTC), which is considered simpler than BTC: instead of the standard deviation, the first absolute moment is preserved along with the mean. AMBTC was proposed by Lema and Mitchell [14].

Using sub-blocks of 8 pixels gives a compression ratio of 8:1, assuming 16-bit integer values are used during transmission or storage. Larger blocks allow better compression (the "c" and "d" values are spread over more pixels); however, quality reduces as the block size increases, due to the nature of the algorithm. The BTC algorithm was also used for compressing the Mars Pathfinder rover images [6].

In this paper a figure-wise description is given. Figure 1 shows the photograph of a family that has registered an account in the electricity database. The same family photograph may be registered in different departments of the PWD, so there will be replications of their electricity bills, which at times creates confusion for the department as well as for the customers; to avoid this confusion we use Block Truncation Coding. BTC is one among many digital image coding techniques, such as the Fast Fourier Transform (FFT), Huffman coding and arithmetic coding. The BTC technique divides the image into blocks of pixels. The mean m and the standard deviation σ of each block are computed as

m = (1/y) Σ_{i=1..y} Ci        (5)

m2 = (1/y) Σ_{i=1..y} Ci²      (6)

σ = [m2 − m²]^(1/2)            (7)

where y is the total number of pixels in the block and Ci is the value of the ith pixel. For color images, equations (5) and (6) are evaluated for each RGB color component of each pixel value Ci. Each pixel in the block is then coded according to its value compared with the block mean: every pixel whose value is greater than or equal to the mean is set to 1, and all other pixels are set to 0, reducing the image to a single bit per pixel. This bit plane, along with the mean and standard deviation values, is transmitted. The block entropy H, given by equation (8), tells us how many repeated blocks there are, where n is the total number of pixels, X is the height of the block and Y is the width of the block.
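A minimal NumPy sketch of the per-block computation described by equations (5)-(7) and the bit-plane coding above is given below. The reconstruction levels follow the standard BTC formulation, and the 4 × 4 example block is purely illustrative.

```python
import numpy as np

def btc_encode_block(block):
    """Encode one grayscale block: keep its mean, standard deviation and bit plane."""
    mean = block.mean()                        # equation (5)
    std = block.std()                          # equations (6)-(7)
    bitmap = (block >= mean).astype(np.uint8)  # 1 if pixel >= block mean, else 0
    return mean, std, bitmap

def btc_decode_block(mean, std, bitmap):
    """Rebuild the block from (mean, std, bitmap) with the standard BTC output levels."""
    n = bitmap.size
    q = int(bitmap.sum())                      # number of pixels coded as 1
    if q == 0 or q == n:                       # flat block: every pixel gets the mean
        return np.full(bitmap.shape, mean)
    low = mean - std * np.sqrt(q / (n - q))    # value assigned to the 0 pixels
    high = mean + std * np.sqrt((n - q) / q)   # value assigned to the 1 pixels
    return np.where(bitmap == 1, high, low)

# One 4 x 4 block of an 8-bit grayscale image (values are illustrative)
block = np.array([[120, 130, 125, 200],
                  [118, 140, 210, 205],
                  [115, 135, 208, 202],
                  [117, 138, 206, 201]], dtype=float)
mean, std, bitmap = btc_encode_block(block)
print(np.round(btc_decode_block(mean, std, bitmap)))
```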

Fig. 1. Photograph of one family with an electricity bill

Chunk B

Fig. 2. A photograph of another family with an electricity bill

In this way we will have different families whose electricity bills are found in a number of databases. To avoid this we use the BTC technique.

IV. Proposed Technique

The following steps are applied in this algorithm to prevent de-duplication being performed again and again on images of the same size (a minimal sketch of the pipeline is given after the list).

Step 1: Take a number of input family photographs of size 256 × 256 and store them in various files named A, B, C.
Step 2: Apply Block Truncation Coding, a lossy compression technique used for grayscale images.
Step 3: First the mean and standard deviation of the image are calculated in terms of bits per pixel.
Step 4: Since BTC preserves the mean and the standard deviation, compression with different parameters is then applied.
Step 5: Apply BTC to an image, so that an image of block size 2 goes to one chunk file.
Step 6: Since there will be large blocks of data, the block data is first divided into different chunks; an image of a given block size can then be placed in that particular chunk.
Step 7: Division of the blocks is also important, and is done according to size.
Step 8: Apply BTC with different block sizes to chunks of the original image. The image with block size 2 is put in one data set A, so that an image of block size 2 is not de-duplicated again and again.
Step 9: Similarly, the images of block sizes 4, 6 and 8 are put in chunks B, C and D.
Step 10: Due to this, a lot of bandwidth and space is preserved.
Step 11: Calculate the energy of the various bands; the one with the highest energy is processed first.
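As a concrete illustration of these steps, the following Python sketch compresses each grayscale photograph with BTC at a given block size and keeps a single instance of each distinct compressed representation in the chunk assigned to that block size. The chunk mapping, function names and use of SHA-1 digests are our own illustrative assumptions, not details of the original system.

```python
import hashlib
import numpy as np

# Illustrative mapping from BTC block size to chunk (data set) name, as in steps 5-9
CHUNKS = {2: "A", 4: "B", 6: "C", 8: "D"}

def btc_compress(image, block_size):
    """Compress a grayscale image with BTC: per block keep the mean, std and bit plane."""
    h, w = image.shape
    h, w = h - h % block_size, w - w % block_size   # crop to a multiple of the block size
    parts = []
    for i in range(0, h, block_size):
        for j in range(0, w, block_size):
            blk = image[i:i + block_size, j:j + block_size]
            mean, std = blk.mean(), blk.std()
            bits = np.packbits((blk >= mean).astype(np.uint8))
            parts.append(np.uint8(round(mean)).tobytes() +
                         np.uint8(round(std)).tobytes() + bits.tobytes())
    return b"".join(parts)

def deduplicate_by_block_size(images, block_size=2):
    """Keep one instance per distinct BTC code in the chunk assigned to this block size."""
    chunk_store = {}                                # digest -> compressed image
    for img in images:
        code = btc_compress(img, block_size)
        chunk_store.setdefault(hashlib.sha1(code).hexdigest(), code)
    print(f"chunk {CHUNKS[block_size]}: {len(images)} images -> "
          f"{len(chunk_store)} unique instance(s)")
    return chunk_store

rng = np.random.default_rng(0)
photo = rng.integers(0, 256, (256, 256)).astype(float)
other = rng.integers(0, 256, (256, 256)).astype(float)
deduplicate_by_block_size([photo, photo.copy(), other], block_size=2)
```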

Fig. 2. This is the image of block size 2. We check against the data store of the original image: images with block size 2 are stored in this chunk of data B. When BTC is applied to a duplicate image, the image with block size 2 also goes to chunk B, so the family picture of that block size is saved once in chunk B rather than the same image being stored in different chunks.

Chunk C

Fig. 3. This is the image of block size 4, the original image compressed to block size 4; all images with block size 4 are stored in chunk C.

Chunk D

Fig. 4. This is the image of block size 6. We check against the data store of the original image: all images to which BTC with block size 6 is applied are stored in this chunk of data D.


Chunk E


Fig. 5. This is the image of block size 8 in compressed form. We check against the data store of the original image: all images to which BTC with block size 8 is applied are stored in this chunk of data E.

In this way, data of different block sizes is put in different chunks using the BTC method, rather than in the same chunk A. This preserves bandwidth, utilizes less space and gives better results. This is our technique for removing duplicated data: a compression technique is applied to each image, which is then compared with the original image present in the database to check whether the image is redundant or uniquely present. With the help of the BTC compression technique, we can easily check whether the data is unique.


Fig. 8. This shows how single instances of an image occur once image repetition is removed. The graph shows single instances of images, one at a time, each represented uniquely in our database.

V. Performance Analysis

With the help of graphs, we have shown the different data de-duplication results.

Fig. 9. I/O performance increases as the data size increases; the I/O performance of our data grows with the amount of data in kilobytes.


VI. Comparison with Previous Methods



Fig. 6. This shows the repetition of data blocks as well as images in our database.

To compare the proposed method with the existing one, we implemented the Block Truncation Coding technique and, in addition, used the experimental results of the de-duplication scheme using histogram refinement proposed by N. Pattabhi Ramaiah [15]. The methods can be compared from different aspects: the proposed method gives better space utilization, and the normalized correlation coefficient and PSNR can be calculated for different block sizes, which was not the case in the previous experiments.

PSNR is the peak signal-to-noise ratio. In our proposed approach it is used to give the difference between the original image and the image to be checked in the database for its uniqueness:

PSNR = 10 log10(255² / MSE)        (9)

where

MSE = (1 / (M · N)) Σ_{i=1..M} Σ_{j=1..N} [I(i, j) − K(i, j)]²        (10)

is the mean square error between the original image I and the test image K of size M × N.
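For reference, a short Python sketch of equations (9) and (10) follows; the function names and the peak value of 255 for 8-bit images are our assumptions.

```python
import numpy as np

def mse(original, test):
    """Mean square error between two same-size grayscale images, equation (10)."""
    return float(np.mean((original.astype(float) - test.astype(float)) ** 2))

def psnr(original, test, peak=255.0):
    """Peak signal-to-noise ratio in dB for 8-bit images, equation (9)."""
    err = mse(original, test)
    return float("inf") if err == 0 else 10.0 * np.log10(peak ** 2 / err)
```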


Fig. 7. When duplication of data is removed to some extent in our database, the result appears in this waveform; it shows that repetition of data has been removed.

Normalized Correlation: NC gives a measure of how much uniqueness exists in our database containing different family photographs. It is represented as

NC = Σ_{i} Σ_{j} I(i, j) · K(i, j) / Σ_{i} Σ_{j} I(i, j)²        (11)
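A corresponding sketch of equation (11), under the same assumptions as the PSNR example above:

```python
import numpy as np

def normalized_correlation(original, test):
    """Normalized correlation between a stored image and a query image, equation (11)."""
    a, b = original.astype(float), test.astype(float)
    return float(np.sum(a * b) / np.sum(a * a))
```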

The scaling factor is varied to check the de-duplicated data on the basis of PSNR and NC.

TABLE I: Results of photographs using histogram refinement [15]

Scaling factor    PSNR     NC
0.1               28       0.96
0.2               29.80    1
0.3               30       1
0.6               32       0.98
0.8               34.47    1

TABLE II: Results using BTC technique

Scaling factor    Block size    PSNR     NC
0.1               2 × 2         29       0.98
0.2               4 × 4         31       0.97
0.3               6 × 6         31.89    1
0.6               8 × 8         36       1
0.8               16 × 16       41       1

VII. Conclusion

In this paper, a de-duplication process was implemented on a large-scale database of photographs. In the proposed method, an attempt is made to eliminate duplicate electricity cards from the database using the Block Truncation Coding technique. To speed up the de-duplication process, the entire data set is compressed at different block-size levels. After BTC is applied to the images, we see single instances of the images in the database, which avoids further chaos and confusion. The proposed method eliminated approximately 0.35 million duplicate electricity cards.

References

[1] B. Zhu, K. Li, and H. Patterson, "Avoiding the disk bottleneck in the Data Domain deduplication file system," in Proc. 6th USENIX Conf. FAST, Berkeley, CA, USA, 2008, pp. 269-282.
[2] M. H. Lin and C. C. Chang, "A novel information hiding scheme based on BTC," in Proc. Int. Conf. Computer and Information Technology, 2004, vol. 14-16, pp. 66-71.
[3] K. Jin and E. L. Miller, "The effectiveness of deduplication on virtual machine disk images," in Proc. SYSTOR, Israeli Exp. Syst. Conf., New York, NY, USA, 2009, pp. 1-12.
[4] Q. Kanafani, A. Beghdadi, and C. Fookes, "Segmentation-based image compression using the BTC-VQ technique," in Proc. IEEE Int. Conf. Information Science, Signal Processing and their Applications, Paris, Jul. 1-4, 2003, vol. 1, pp. 113-116.
[5] A. Amir and M. Lindenbaum, "Ground from figure discrimination," Computer Vision and Image Understanding, vol. 76, no. 1, pp. 7-18, 1999.
[6] A. Amir and M. Lindenbaum, "A generic grouping algorithm and its quantitative analysis," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 20, no. 2, pp. 186-192, 1998.
[7] L. Hérault and R. Horaud, "Figure-ground discrimination: a combinatorial optimization approach," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 15, no. 9, pp. 899-914, Sept. 1993.
[8] J. Puzicha, T. Hofmann, and J. M. Buhmann, "Histogram clustering for unsupervised segmentation and image retrieval," Pattern Recognition Letters, vol. 20, pp. 899-909, 1999.
[9] T. Hofmann and J. M. Buhmann, "Pairwise data clustering by deterministic annealing," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 19, no. 1, pp. 1-14, 1997.
[10] J. Shi and J. Malik, "Normalized cuts and image segmentation," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 22, no. 8, pp. 888-905, 2000.
[11] J. Malik, S. Belongie, T. Leung, and J. Shi, "Contour and texture analysis for image segmentation," International Journal of Computer Vision, vol. 43, no. 1, pp. 7-27, 2001.
[12] B. Fischer and J. M. Buhmann, "Path based clustering for grouping of smooth curves and texture segmentation," Institute of Informatics III, Rheinische Friedrich-Wilhelms-Universität, Bonn, Germany, Tech. Rep. TR-IAI-20027, 2001.
[13] M. Swain and D. Ballard, "Color indexing," International Journal of Computer Vision, vol. 7, no. 1, pp. 11-32, 1991.
[14] J. Smith and S. Chang, "Tools and techniques for color image retrieval," Proc. SPIE, vol. 2670, pp. 1630-1639, 1996.
[15] N. Pattabhi Ramaiah and C. Krishna Mohan, "De-duplication of photograph images using histogram refinement."
[16] M. H. Lin and C. C. Chang, "A novel information hiding scheme based on BTC," in Proc. Int. Conf. Computer and Information Technology, 2004, vol. 14-16, pp. 66-71.

Anum Javeed Zargar is a Teaching Assistant (TA) in the Department of Computer Science Engineering (CSE) & Information Communication Technology (ICT) at Jaypee University of Information Technology (JUIT), Waknaghat, Solan 173234, Himachal Pradesh, India. She received her B.Tech. degree in Computer Science and Engineering (CSE) from Islamic University of Science and Technology, Awantipora, Jammu and Kashmir 192122, in 2012. She is now pursuing her Master's degree under the supervision of Mr. Amit Kumar Singh at Jaypee University of Information Technology (JUIT), Waknaghat, Solan 173234. Her research interests include digital watermarking, digital halftoning and network security.

Ninni Singh is a Teaching Assistant (TA) in the Department of Computer Science Engineering (CSE) & Information Communication Technology (ICT) at Jaypee University of Information Technology (JUIT), Waknaghat, Solan 173234, Himachal Pradesh, India. She received her B.E. degree in Computer Science and Engineering (CSE) from Hitkarini College of Engineering and Technology, Jabalpur, Madhya Pradesh, in 2009. She is now pursuing her Master's degree under the supervision of Dr. Hemraj Saini at Jaypee University of Information Technology (JUIT), Waknaghat, Solan 173234. Her research interests include cryptography and network security, distributed systems, and wireless sensor and mesh networks.

Geetanjali Rathee is a Teaching Assistant (TA) in the Department of Computer Science Engineering (CSE) & Information Communication Technology (ICT) at Jaypee University of Information Technology (JUIT), Waknaghat, Solan 173234, Himachal Pradesh, India. She received her B.Tech. degree in Computer Science and Engineering (CSE) from Bhagwan Mahavir Institute of Engineering and Technology (BMIET), Haryana, in 2011. She completed her Master's degree in June 2014 under the supervision of Dr. Nitin Rakesh at Jaypee University of Information Technology (JUIT), Waknaghat, and is currently a PhD scholar at Jaypee University of Information Technology, Waknaghat. Her research interests include resiliency in wireless mesh networking, routing protocols, networking, and security in wireless mesh networks.
