2014 IEEE 28th International Conference on Advanced Information Networking and Applications
A Scalable Approach to Source Camera Identification over Hadoop

Umberto Ferraro Petrillo
Dipartimento di Scienze Statistiche, Università di Roma "Sapienza"
P.le Aldo Moro 5, I-00185 Rome, Italy
Email: [email protected]

Giuseppe Cattaneo, Gianluca Roscigno
Dipartimento di Informatica, Università degli Studi di Salerno
Via Giovanni Paolo II, 132, I-84084 Fisciano (SA), Italy
Email: {cattaneo,giroscigno}@unisa.it
The project is a joint work with the CNCPO (Centro Nazionale per il Contrasto della Pedopornografia Online), part of the "Dipartimento della Pubblica Sicurezza" within the Italian Ministry of Interior.

Abstract—In this paper, we explore the possibility of solving a well-known digital image forensics problem, the Source Camera Identification (SCI) problem, using a distributed approach. The SCI problem requires recognizing the camera used to acquire a given digital image, distinguishing even among cameras of the same brand and model. The solution we present is based on the algorithm by Lukáš et al. [1], as it is recognized by many as the reference solution for this problem, and is formulated according to the MapReduce paradigm, as implemented by the Hadoop framework. The first implementation we coded was straightforward to obtain, as we leveraged the ability of the Hadoop framework to turn a stand-alone Java application into a distributed one with very few interventions on its original source code. However, our first experimental results with this code were not encouraging. Thus, we conducted a careful profiling activity that allowed us to pinpoint some serious performance issues arising with this vanilla porting of the algorithm. We then developed several optimizations to improve the performance of the Lukáš et al. algorithm by taking better advantage of the Hadoop framework. The resulting implementations have been subject to a thorough experimental analysis, conducted using a cluster of 33 commodity PCs and a data set of 5,160 images. The experimental results show that the performance of our optimized implementations scales well with the number of computing nodes, while remaining, at most, two times slower than the maximum speedup theoretically achievable.

Keywords-Digital Image Forensics; Source Camera Identification; Distributed Computing; Hadoop;

I. INTRODUCTION

Current technologies provide decision-makers with the ability to collect a huge amount of data, making it possible to deal with problems that were out of their grasp up to a few years ago. This trend is witnessed by the spreading of words like Petabytes and Zettabytes, which are quickly replacing terms like Gigabytes and Terabytes to denote big amounts of data, i.e., Big Data. This wealth of data raises the problem of developing tools and methodologies with a high degree of scalability, able to process such a virtually unbounded amount of data.

One of the application fields where this problem is arising is Digital Image Forensics. This field concerns the acquisition and analysis of digital media in order to find clues while investigating a crime. For example, we cite the need to establish whether a digital image has been altered after it was captured (i.e., Digital Image Integrity), whether it contains hidden data (i.e., Digital Image Steganography) or which specific camera has been used to capture it (i.e., Source Camera Identification). Most of these problems have already been discussed in the scientific literature and one or more solutions have been proposed. However, with the explosive growth of digital photography and of online social networks (e.g., 350 million photos per day were uploaded to Facebook in the fourth quarter of 2012 [2]), there is a need to assess how these solutions scale when dealing with Big Data and/or how they can be reformulated in order to optimize their performance.

This work focuses on the Source Camera Identification (SCI) problem. This topic has received a lot of attention in the recent past, as witnessed by the several publications available in the scientific literature, such as [1], [3], [4], [5]. One of the most noteworthy contributions in this area is the algorithm proposed by Lukáš et al. in [1]. This algorithm works by extracting a digital fingerprint from an image under scrutiny and by correlating it with a set of previously known camera fingerprints in order to identify the camera that originated that image. These operations are very expensive on their own, as they involve the processing of images containing millions of pixels. Moreover, they are expensive as a whole, as this cost has to be multiplied by the number of images (which may be thousands or millions) used to build, train, test and, finally, use the recognition system.

Starting from this premise, we have reformulated the original algorithm by Lukáš et al. as a distributed algorithm, according to the MapReduce paradigm as implemented by the Hadoop framework. The resulting implementation has been evaluated on a cluster of 33 nodes under several different experimental settings. Our results show that the distributed implementation succeeds in scaling almost linearly with the number of computing nodes while achieving performance that is much better than that of its original sequential counterpart. Moreover, we developed and tried
several tweaks that take advantage of specific Hadoop features to optimize the activity of the computing nodes and reduce the overall execution time.
II. THE MAPREDUCE PARADIGM

In recent years, several different architectural and technological solutions have been proposed for processing big amounts of data. One computing paradigm that is becoming popular is MapReduce [6], which allows large data sets to be processed on a cluster using two steps: map and reduce. According to this paradigm, a job to be processed is split into several smaller jobs that are distributed to a large number of nodes, the slave nodes, for processing (map step). When the slave nodes finish processing these jobs, all results are collected and then summarized using an aggregation function (reduce step). The MapReduce paradigm is particularly appropriate for embarrassingly parallel, data-intensive problems that require fault tolerance. Differently from traditional paradigms, such as those based on message passing, it allows for implicit parallelism. Namely, all the operations related to the exchange of data between the nodes involved in a computation are modeled according to a file-based approach and are in charge of the underlying middleware. This paradigm was first adopted by Google, but it has now gained a wider audience and is used in several application fields like satellite data processing, bioinformatics and machine learning (see, e.g., [7] or [8]).

There exist several frameworks implementing the MapReduce paradigm. Disco [9] is a lightweight, open-source framework for distributed computing adopting this paradigm. Dryad [10] implements an extension of the MapReduce paradigm, providing an infrastructure that allows a programmer to use the resources of a cluster of Microsoft Windows servers for running data-parallel programs. Finally, Apache Hadoop [11] is a very popular MapReduce-based distributed processing framework. It is a Java-based, open-source grid computing environment useful for reliable, scalable and distributed computing.

III. APACHE HADOOP

We chose to use the Apache Hadoop framework in our work, mainly because it is a very mature project (with respect to its alternatives) and allows a quick and easy setup of a cluster of commodity PCs. In addition, we were very interested in the possibility of reformulating our existing SCI implementation as a distributed algorithm, with very little intervention on its original code. From an architectural viewpoint, Hadoop is mainly composed of a data processing framework and of the Hadoop Distributed File System (HDFS). The data processing framework organizes a computation as a sequence of user-defined MapReduce operations on data sets of <key, value> pairs. These operations are executed as tasks across the nodes of a cluster. The HDFS is a distributed file system optimized to run on commodity hardware and able to provide fault tolerance through replication of data.

Generally, a Hadoop cluster consists of a single master node and multiple slave nodes: the master node runs the JobTracker and the NameNode services, while slave nodes run the DataNode and the TaskTracker services. The JobTracker service manages the assignment of map and reduce tasks to the slave nodes, where they are received and run by the TaskTracker service. The DataNode service manages the local storage available to a node running the TaskTracker service. Finally, the NameNode service manages the HDFS namespace, by keeping the directory tree of all files in the file system and tracking where, across the cluster, the file data is kept.

To save network bandwidth and local disk I/O, Hadoop implements a mechanism to minimize the data transferred between map and reduce tasks, i.e., the Combiner. It is a user-provided function that is invoked before a map task sends its outputs. It can be used, e.g., to batch and aggregate part of the output data of a map task using just in-memory operations, before sending the result to the destination reduce task.

A. Hadoop Data Management Features

Usually, when developers write code to be run on Hadoop, they do not have to care about the way data is spread and maintained over the different nodes of a cluster. However, there are some cases where an explicit control over data locality is required to achieve better performance. In order to cope with these needs, Hadoop offers several facilities. In the following, we briefly describe the Hadoop data-management functionalities that have been used during the development of our solution:

• Distributed Cache. It is a facility, available through the DistributedCache object, useful to cache large read-only files on a slave node before any task is executed on that node.

• Sequence Files. These files, implemented through the SequenceFile object, allow sequences of binary key-value pairs of arbitrary length to be stored on HDFS. In addition, they can store arbitrary types of data, even compressed. Moreover, they support a variety of serialization frameworks and can also be used as containers of smaller files. Finally, they provide a split primitive that allows their content to be broken into several parts to be distributed to different slave nodes, so as to put data on the same nodes where it will be processed, thus avoiding any further data transmission.

• Data Replication. HDFS is designed to reliably store very large files across the nodes of a cluster. This is done by storing each file as a sequence of blocks, where the blocks belonging to the same file can be replicated several times over different nodes for fault tolerance.
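As a concrete illustration of how the map and reduce steps of Section II, and the Combiner facility described above, translate into code, the following is a minimal sketch of the classic word count job written against the org.apache.hadoop.mapreduce API of Hadoop 1.x. It is a standard textbook example, not part of our SCI implementation, and is reported here only to fix ideas.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map step: emit a (word, 1) pair for every word found in the input split.
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce step: aggregate all counts emitted for the same word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // the Combiner facility described above
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}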
IV. THE SOURCE CAMERA IDENTIFICATION PROBLEM

The Source Camera Identification (SCI) problem concerns the identification of the particular digital camera that has been used to capture an input digital image. A very common approach consists in analyzing the noise existing in a digital image to find clues about the digital sensor that originated it. In a digital image, the noise can be defined as a distortion in the color of a pixel with respect to the original view being portrayed. These distortions may be due to the Shot Noise, a random component, and/or to the Pattern Noise, a deterministic component. The Pattern Noise can, in turn, be split into two main components: the Fixed Pattern noise (FP) and the Photo-Response Non-Uniformity noise (PRNU). The FP noise is caused by dark currents, i.e., the information returned by the pixel detectors of a digital sensor when they are not exposed to light. The PRNU noise is caused mainly by the Pixel Non-Uniformity noise (PNU), which is due to the different sensitivity of the pixel detectors of a digital sensor to light. This difference is caused by the inhomogeneity of the silicon wafers and by the imperfections derived from the manufacturing process of the sensor.

Thanks to their deterministic and systematic nature, the PNU noise and the FP noise are ideal candidates for providing a sort of fingerprint of digital cameras. For example, in [3] the authors use the dark current noise to identify a camcorder from videotaped images. Instead, the idea of using the PNU noise as a fingerprint for digital cameras has been explored by Lukáš et al. in [1]. The authors observed that this method was successful in identifying the source camera used to take a picture, even distinguishing between cameras of the same brand and model. The same method was also successful when used with images that had been subject to post-processing operations such as JPEG compression, gamma correction, and a combination of JPEG compression and in-camera resampling.

The effectiveness of this method has been further confirmed by a large scale experimental evaluation whose results are available in [12]. The authors downloaded from the Flickr image database a set of pictures taken by 6,896 individual cameras (covering 150 camera models), for an overall number of more than one million pictures. According to their results, their algorithm was able to exhibit, in this setting, a False Rejection Rate (FRR, i.e., the rate of images that have not been attributed to their originating camera) smaller than 0.0238, with the False Acceptance Rate (FAR, i.e., the rate of images that have been attributed to the wrong camera) set to a very small value (i.e., 2.4 × 10^-5).

A. The algorithm by Lukáš et al.

In this section we describe the original version of the SCI algorithm by Lukáš et al. [1], which is the basis of our reference implementation. All the operations described hereafter have to be repeated either three times, if we choose to work on the red, green and blue color channels (RGB), or just one time, if we consider the grayscale representation of the input images. In our case, we have chosen to work in the RGB space, so as to improve the quality of the identification.

Let I be the image under scrutiny and DevSet = {C_1, C_2, ..., C_n} the set of candidate origin cameras for I. The algorithm operates in four steps. The first step is the calculation of the Reference Pattern (i.e., the fingerprint) RP_C for each camera C belonging to DevSet. The approach proposed by Lukáš et al. consists in estimating RP_C by extracting the residual noise (RN) existing in a set of pictures taken using C and, then, combining these noises together, as an approximation of the PNU noise. The residual noise RN_I of an image I can be defined as

RN_I = I - F(I)   (1)

where F(I) is a filter function that returns the noise-free variant of I. To this end, the authors propose a special filter that simulates the behavior of the Wiener filter in the wavelet domain, following an approach suggested in [13]. The operation described above is iterated over a group of several images of the same spatial resolution, here called enrollment images, taken using C. This returns a group of m residual noises, including both a random noise component and the PNU noise estimation of C. Then, the residual noises are averaged to obtain a tight approximation of the fingerprint of camera C, as follows:

RP_C = \frac{1}{m} \sum_{k=1}^{m} RN_k   (2)

The average operation reduces the contribution of the random noise components while highlighting the contribution of the deterministic noise components.

The second step is propaedeutic to the training of the identification system, carried out during the third step of the algorithm. We first introduce a set of training images taken using each of the cameras belonging to DevSet. Then, we calculate the correlation between the fingerprint of each camera C belonging to DevSet and each image T taken from the training set. The calculation is accomplished using the Pearson correlation index, as follows:

corr(RN_T, RP_C) = \frac{(RN_T - \overline{RN_T}) \cdot (RP_C - \overline{RP_C})}{\| RN_T - \overline{RN_T} \| \, \| RP_C - \overline{RP_C} \|}   (3)

This index returns a value in the interval [-1, +1], where higher values imply a higher probability that image T has been taken using camera C. Notice that, if the spatial resolution of T does not match the resolution of the images used for determining RP_C, then the resolution of T is adapted using a cropping or resizing operation.
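The following minimal sketch shows how Equations (2) and (3) translate into code for a single color channel, representing residual noises and Reference Patterns as plain float arrays (one entry per pixel). The class and method names are illustrative and do not correspond to our actual implementation.

// Minimal sketch of Equations (2) and (3) for a single color channel.
// All names are illustrative; arrays are assumed to have identical length.
public final class PnuMath {

  // Equation (2): the Reference Pattern is the pixel-wise average
  // of the residual noises extracted from the enrollment images.
  public static float[] referencePattern(float[][] residualNoises) {
    int len = residualNoises[0].length;
    float[] rp = new float[len];
    for (float[] rn : residualNoises) {
      for (int i = 0; i < len; i++) {
        rp[i] += rn[i];
      }
    }
    for (int i = 0; i < len; i++) {
      rp[i] /= residualNoises.length;
    }
    return rp;
  }

  // Equation (3): Pearson correlation between a residual noise and a
  // Reference Pattern, computed on mean-subtracted vectors.
  public static double correlation(float[] rn, float[] rp) {
    double meanRn = mean(rn), meanRp = mean(rp);
    double dot = 0.0, normRn = 0.0, normRp = 0.0;
    for (int i = 0; i < rn.length; i++) {
      double a = rn[i] - meanRn;
      double b = rp[i] - meanRp;
      dot += a * b;
      normRn += a * a;
      normRp += b * b;
    }
    return dot / (Math.sqrt(normRn) * Math.sqrt(normRp));
  }

  private static double mean(float[] v) {
    double sum = 0.0;
    for (float x : v) {
      sum += x;
    }
    return sum / v.length;
  }
}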
The third step is about the training of a classifier able to recognize the source camera of a given image. The identification is based on the definition of a set of three acceptance thresholds (one for each color channel) to be associated with each of the cameras under scrutiny. If the correlation between the residual noise of I and the Reference Pattern of a camera C exceeds, on each color channel, the corresponding acceptance threshold, then C is assumed to be the camera that originated I. The classifier is trained so as to minimize the False Rejection Rate (FRR) for images taken using C, given an upper bound on the False Acceptance Rate (FAR) for images taken using a camera different from C (Neyman-Pearson approach).

The last step concerns the identification of the camera that captured I. Here the algorithm first extracts the residual noise RN_I from I, then correlates it with the Reference Patterns of all cameras under scrutiny using the classifier trained in Step 3. If the correlation exceeds the decision threshold of a certain camera on each of the three color channels, a match is found.

B. Reference Implementation

Our reference implementation of the Lukáš et al. algorithm has been coded in Java¹ in order to make it runnable over different architectures at no additional setup or development cost. It closely follows the original algorithm, except for the decision phase (Step 3 and Step 4). In our case, we have chosen to use a multi-class Support Vector Machine (SVM) [14] classifier instead of the original Neyman-Pearson approach, because of the better performance it exhibited in our experiments with respect to the original one. A SVM classifier belongs to the class of supervised learning classifiers. These classifiers are able to estimate a function from labeled training data, with the purpose of using it for mapping unknown (not labeled) instances. In the classification problem, the training data consist of a set of instances, where each instance is a pair consisting of a vector of features and the desired group (class). In our case, the features are the indices obtained by correlating each image under scrutiny with each Reference Pattern.

C. Large-Scale Source Camera Identification

The problem of how to perform Source Camera Identification on large data sets has not received much attention in the scientific literature. One of the few contributions in this area, [12], presents a large scale test of Source Camera Identification from sensor fingerprints. The authors tested over one million images spanning 6,896 individual cameras covering 150 models, using an improved version of the Lukáš et al. algorithm. The only information available about the experimental setting they have chosen concerns the usage of a cluster of 40 2-core AMD Opteron processors, where 50 cores were devoted to this application. Nothing is said about the way the algorithm has been reformulated as a distributed algorithm or about its performance.

Another contribution describing a large scale experimentation of a Source Camera Identification algorithm is presented in [15]. The authors describe a fast searching algorithm (based on the usage of a collection of fingerprint digests) for finding whether a given image was taken by a camera whose fingerprint is in the database. The authors performed their experimentation with the help of the Matlab software and of a database of 2,000 iPhones, proving the feasibility of the proposed approach. Even in this case, no details are provided about the way the experimentation has been conducted.

V. RUNNING THE LUKÁŠ et al. ALGORITHM ON HADOOP

In order to experiment with the Lukáš et al. algorithm on Hadoop, we started from the implementation discussed in Section IV-B. The source code of the distributed version of the algorithm has been organized in four different modules, corresponding to the four processing steps of the Lukáš et al. algorithm, plus a fifth module related to the preliminary image loading activity (see Table I for an overview of the implementation). A first Hadoop version of the code has been quite straightforward to obtain, as we just had to write some very short wrapper functions to embed the original algorithm modules in the Hadoop framework. However, this first vanilla implementation exhibited very bad performance in our preliminary tests. After analyzing this behavior in detail, we realized that it was mainly due to one reason: the overhead paid by Hadoop for managing a large number of small files (the image files, in our case). Starting from this consideration, we developed some optimized Hadoop versions of the algorithm. On one side, these new versions succeed in delivering very good performance. On the other side, they required much more coding work on the Hadoop side, in order to solve or circumvent the performance bottlenecks found in our first implementation. In the following, we describe in detail the modules of our Hadoop-based implementation of the Lukáš et al. algorithm, with an emphasis on the code optimizations we implemented.
A. Loading Images

During this preliminary step, the images used during the algorithm execution are loaded on the HDFS storage. In the first implementation we developed, this task was accomplished trivially, by copying and keeping the images as separate files. This solution had a very bad impact on the performance of HDFS, probably because of the problems related to the management of a very large number of small files, as documented in [16]. For this reason, we had to rethink the whole management of the image files used by the algorithm. The solution we found consisted in maintaining only two very large files containing all the image files. These two files have been coded as Hadoop SequenceFile objects and are: EnrSeq, used for storing the list of enrollment images, and TTSeq, used for storing the list of training and testing images. In both files, the images are ordered according to their originating camera id. Then, we used the input split capability available with the SequenceFile object to partition these two files among the different computing nodes, with the aim of promoting data-local execution.
1 A copy of the source code of our implementation is available upon request.
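A possible way to implement the packing of many small image files into a single SequenceFile, as done for EnrSeq and TTSeq, is sketched below using the Hadoop 1.x SequenceFile writer API. The key layout ("<cameraId>/<fileName>", one local sub-directory per camera) and the path names are assumptions made for illustration only, not the format of our actual implementation.

import java.io.DataInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Packs a local directory of JPEG files into a single SequenceFile on HDFS,
// one <key, value> record per image, to avoid the HDFS small-files problem.
public class ImagePacker {

  public static void pack(File localDir, String hdfsTarget) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, new Path(hdfsTarget), Text.class, BytesWritable.class);
    try {
      File[] images = localDir.listFiles();
      if (images == null) {
        return;
      }
      for (File image : images) {
        byte[] content = readFully(image);
        // Assumed key layout: "<cameraId>/<fileName>", one directory per camera.
        Text key = new Text(localDir.getName() + "/" + image.getName());
        writer.append(key, new BytesWritable(content));
      }
    } finally {
      writer.close();
    }
  }

  private static byte[] readFully(File f) throws IOException {
    byte[] buf = new byte[(int) f.length()];
    DataInputStream in = new DataInputStream(new FileInputStream(f));
    try {
      in.readFully(buf);
    } finally {
      in.close();
    }
    return buf;
  }

  public static void main(String[] args) throws IOException {
    pack(new File(args[0]), args[1]); // e.g., local enrollment directory -> "EnrSeq"
  }
}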
TABLE I. Overview of our Hadoop-based implementation of the Lukáš et al. algorithm.

Step                                   | Hadoop Role            | Input Processed          | Output Generated
(Setup) Loading Images                 | HDFS                   | List of Images           | EnrSeq, TTSeq
(I) Calculating Reference Patterns     | MapReduce Job          | EnrSeq                   | RPSeqs
(II) Calculating Correlation Indices   | Map-only Job           | TTSeq, RPSeqs            | Plain text CORRs file
(III) Classifier Training and Testing  | Sequential (no Hadoop) | Plain text CORRs file    | SVM, RR
(IV) Source Camera Identification      | Map-only Job           | Image file, RPSeqs, SVM  | Classification Results
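Putting the pieces of Table I together, the driver of one of these jobs (Step I) could be configured roughly as follows. This is a hypothetical sketch: the identity Mapper and Reducer used here are placeholders for the classes that wrap the original sequential modules, and all path names are assumptions.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

// Hypothetical driver for Step I (Calculating Reference Patterns): the job
// reads the enrollment images from the EnrSeq sequence file and writes one
// Reference Pattern per camera to a sequence file.
public class StepOneDriver {
  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "Step I - Reference Patterns");
    job.setJarByClass(StepOneDriver.class);

    job.setMapperClass(Mapper.class);    // placeholder: residual noise extraction
    job.setReducerClass(Reducer.class);  // placeholder: per-camera averaging

    job.setInputFormatClass(SequenceFileInputFormat.class);
    job.setOutputFormatClass(SequenceFileOutputFormat.class);
    job.setOutputKeyClass(Text.class);            // camera id
    job.setOutputValueClass(BytesWritable.class); // serialized Reference Pattern

    SequenceFileInputFormat.addInputPath(job, new Path("EnrSeq"));
    SequenceFileOutputFormat.setOutputPath(job, new Path("RPSeqs"));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}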
B. Calculating Reference Patterns

The aim of this step is to calculate the Reference Pattern of a camera C, by analyzing a set of enrollment images having the same spatial resolution and taken using C. In the map phase, each processing node receives a set of images, extracts their corresponding residual noises and outputs them. In the reduce phase, each reduce function (one for each camera C) takes as input the set of residual noises of C produced by the previous tasks and combines them, thus returning RP_C. This operation is repeated for each camera under scrutiny.

As said in Section III, in Hadoop each input record is structured as a <key, value> pair, where value stores a copy of the image and key is derived from the image meta-data: image id, camera id, image type (i.e., enrollment, training or testing) and image hash. Input records are initially read from the EnrSeq file. When the map function is invoked, it receives this record, loads the corresponding image in memory and, finally, extracts the residual noise (RN) from the image. As output, the function produces a new <key, value> pair, where key is the camera id and value is RN.

We noticed, in our preliminary experimentations, that the map phase was very expensive because of the time spent transmitting every single RN at the end of the computation. In order to save network bandwidth and optimize the overall execution time, we required each map task to aggregate all the RNs extracted from images captured with the same device, before transmitting them to the node running the reduce task. The aggregation is done by summing all the RN images, thus allowing the residual noise of a group of images to be transmitted at the same cost of transmitting just one of them. To facilitate this operation, the enrollment images are ordered by camera id and the partial sum of the RN files is kept in memory by the node executing the map task, without involving any I/O operation. From a technical viewpoint, this aggregation would not have been possible using a standard Hadoop combiner (see Section III), as this facility requires all the objects to aggregate (i.e., the RN images) to be stored in memory. Such a strategy is not adequate in our case, as the sum of the sizes of the RN images exceeds the physical memory of the computing nodes. As a workaround, we implemented an ad-hoc solution, by means of code run during the map task, that does not require all the temporary RN images to be stored in memory (a minimal sketch of this map-side aggregation is given below).

During the reduce phase, a reduce function receives a tuple in the format <key, values>, where key is the id of one of the cameras, e.g. C, and values is a set of sums of the RNs for that device, as calculated during the map tasks. All the sums of the RNs of the same device are summed and then averaged together to form the Reference Pattern for C. As output, the function writes a new pair <key, value>, where key is the id of C and value is RP_C.
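The map-side aggregation mentioned above can be sketched as follows. Since the enrollment images of a split are ordered by camera id, only the running sum of the residual noises of the current camera has to be kept in memory; the sum is emitted when the camera id changes and once more in cleanup(). The key layout and the helper methods standing in for the original sequential modules are illustrative assumptions.

import java.io.IOException;
import java.util.Arrays;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch of the in-map aggregation used in Step I (one color channel).
public class AggregatingRNMapper
    extends Mapper<Text, BytesWritable, Text, BytesWritable> {

  private String currentCamera = null;
  private float[] runningSum = null;

  @Override
  protected void map(Text key, BytesWritable image, Context context)
      throws IOException, InterruptedException {
    String cameraId = cameraIdOf(key.toString());
    byte[] jpeg = Arrays.copyOf(image.getBytes(), image.getLength());
    float[] rn = extractResidualNoise(jpeg);

    if (currentCamera != null && !currentCamera.equals(cameraId)) {
      emit(context); // camera changed: flush the previous partial sum
    }
    if (runningSum == null || !cameraId.equals(currentCamera)) {
      currentCamera = cameraId;
      runningSum = new float[rn.length];
    }
    for (int i = 0; i < rn.length; i++) {
      runningSum[i] += rn[i];
    }
  }

  @Override
  protected void cleanup(Context context)
      throws IOException, InterruptedException {
    emit(context); // flush the sum of the last camera of the split
  }

  private void emit(Context context) throws IOException, InterruptedException {
    if (runningSum != null) {
      context.write(new Text(currentCamera), new BytesWritable(serialize(runningSum)));
    }
  }

  // ---- placeholders standing in for the original sequential code ----
  private String cameraIdOf(String key) { return key.split("/")[0]; }
  private float[] extractResidualNoise(byte[] jpeg) { return new float[1]; }
  private byte[] serialize(float[] v) { return new byte[4 * v.length]; }
}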
C. Calculating Correlation Indices

During this step, the algorithm extracts the RN of each training image and then correlates it with the RPs of all the input devices. The same operation is repeated for all the testing images. In the map phase, each processing node receives a list of input images to be correlated, in the form of <key, value> records, where key is derived from the image meta-data and value stores a copy of the image. The input images are read from the TTSeq file. For each image, the corresponding RN is extracted and then correlated with the RPs of all the input devices calculated in the previous step. If the resolution of the input image does not match the resolution of the RP of a given camera, cropping and/or scaling techniques are used in order to correct the mismatch. For each correlation, the map function generates a <key, value> pair, where key is the keyword "Correlation" and value consists of: the image id, the camera id, the RP id, a value indicating the correlation type (equal, cropped or resized), plus the three correlation indices (one for each color channel). The output of all map tasks is then collected in a file, CORRs, which will be read by the master node. Since each processing node has to load the RPs of all the input devices, we used the Hadoop DistributedCache mechanism to have each node transfer to its local file system a copy of these files,
before starting the Hadoop job. In this step, a reduce task is not required.
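The following sketch shows how the Reference Patterns can be shipped to every slave node through the Hadoop 1.x DistributedCache before the map-only Step II job starts, and how a map task can locate the locally cached copies in its setup() method. Class names and paths are assumptions used for illustration only; the actual correlation logic is omitted.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

public class CorrelationJobSetup {

  // Driver side: register each RP file so that Hadoop copies it to the
  // local file system of every node that will run a map task.
  public static void addReferencePatterns(Job job, String... rpPaths)
      throws IOException {
    Configuration conf = job.getConfiguration();
    for (String rp : rpPaths) {
      DistributedCache.addCacheFile(new Path(rp).toUri(), conf);
    }
  }

  // Mapper side: locate the locally cached copies once, in setup(), so that
  // each map() call can correlate its image against all Reference Patterns.
  public static class CorrelationMapper
      extends Mapper<Text, BytesWritable, Text, Text> {
    private Path[] localRpFiles;

    @Override
    protected void setup(Context context) throws IOException {
      localRpFiles = DistributedCache.getLocalCacheFiles(context.getConfiguration());
    }

    @Override
    protected void map(Text key, BytesWritable image, Context context)
        throws IOException, InterruptedException {
      // For each cached RP file: load it, correlate it with the residual
      // noise of 'image', and emit a "Correlation" record (omitted here).
    }
  }
}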
D. Classifier Training and Testing
During this step, the SVM-based classifier is trained and tested using the correlation indices calculated in the previous step and available in the CORRs file. The whole process takes place on the master node. The SVM implementation we have used is the one available with the Java Machine Learning Library (Java-ML) [17], loaded with the LIBSVM module. At the end of this step, the classifier has been trained and is ready to be used for the identification step. Moreover, an estimation of the accuracy of the test is returned to the user, expressed as the number of successful matches (recognition rate, RR) between the testing source images and their corresponding reference cameras.
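Since the CORRs file is plain text, reading it back on the master node amounts to a simple parsing step, sketched below. The assumed line layout follows the description in Section V-C ("Correlation" keyword, image id, camera id, RP id, correlation type, and the three per-channel indices); the actual separator and field order used by our implementation may differ.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Sketch of reading back the plain-text CORRs records on the master node.
public class CorrsReader {

  public static class CorrRecord {
    public String imageId, cameraId, rpId, type;
    public double corrR, corrG, corrB;
  }

  public static List<CorrRecord> read(String corrsFile) throws IOException {
    List<CorrRecord> records = new ArrayList<CorrRecord>();
    BufferedReader in = new BufferedReader(new FileReader(corrsFile));
    try {
      String line;
      while ((line = in.readLine()) != null) {
        String[] f = line.trim().split("\\s+");
        if (f.length < 8) {
          continue; // skip malformed lines
        }
        CorrRecord r = new CorrRecord();
        r.imageId = f[1];
        r.cameraId = f[2];
        r.rpId = f[3];
        r.type = f[4];
        r.corrR = Double.parseDouble(f[5]);
        r.corrG = Double.parseDouble(f[6]);
        r.corrB = Double.parseDouble(f[7]);
        records.add(r);
      }
    } finally {
      in.close();
    }
    // The records of a same image are then assembled into one feature vector
    // (its correlations against every RP), labeled with the image's true
    // camera, and fed to the SVM.
    return records;
  }
}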
E. Source Camera Identification

The aim of this step is to establish which camera has been used to capture an image I. The input of the Hadoop job is the directory where the RPs have been stored. The output of this phase is the predicted device used to acquire the input image. For each input RP, a new map function is called. This function first uses a copy of I to extract the residual noise RN and then calculates the correlation between this RN and the input RP. Finally, the job returns to the master node a file containing the list of the correlation values, used to perform the recognition phase by means of the SVM trained in the previous step.

VI. EXPERIMENTAL RESULTS

In this section we discuss the results of the experimental analysis we have conducted, comparing the performance of our Hadoop-based implementation of the Lukáš et al. algorithm with its original sequential counterpart. The discussion is accompanied by a description of the experimental settings and of the data set we have used in our analysis.

A. Experimental Settings

All the experiments have been conducted on a cluster of 33 PCs, each equipped with 4 GB of RAM, an Intel Celeron G530 @ 2.40 GHz x2 processor, the Windows 7 host operating system and a 100 Mbps Ethernet card. In this environment, we have installed on each computer a virtual machine running the Ubuntu 12.10 64-bit (Kernel 3.5 x86_64) guest operating system, equipped with 3,100 MB of RAM, 2 CPUs and 117 GB of virtual disk storage (file system type ext4). The Hadoop cluster configuration includes 32 slave nodes and a master node. The Hadoop version is 1.0.4. To save memory, the virtual machines hosting the slave nodes of the cluster were configured not to run a graphical user interface.
B. Dataset

Our dataset is made of images shot using 20 Nikon D90 digital cameras. This model has a CMOS image sensor (23.6 × 15.8 mm) and a maximum image size of 4288 × 2848 pixels. 258 JPEG images were taken with each device at the maximum possible resolution and with a very low JPEG compression. These images were organized in 130 enrollment images, 64 training images and 64 testing images. Enrollment images were taken from an ISO Noise Chart 15739 [18]. Instead, training and testing images portray different types of subjects. The same shots were taken using each of the 20 cameras used for our tests. The overall dataset is about 20 GB large. In all our tests, the classifier was able to correctly identify the source camera used to shoot 1,277 images, thus achieving a recognition rate of approximately 99.7%.
C. Preliminary Tests

We performed a first, preliminary and coarse comparison of the performance of the Hadoop-based implementation of the Lukáš et al. algorithm described in Section V, here named LukHdp, versus the one running as a stand-alone application, here named Luk, by measuring the overall execution time of the different steps of the algorithm in both settings. The results, available in Table II, show that the LukHdp implementation exhibits approximately a 16x speedup with respect to Luk, thus providing a performance that is about two times slower than the theoretical speedup achievable using a cluster of 32 nodes. Notice that Luk does not have to pay a Setup cost, as we assume that the images to be used during the experiment are already loaded on the machine running the algorithm. Notice also that these results do not include our first Hadoop-based implementation of the Lukáš et al. algorithm, as its execution time was out of scale. These preliminary results seem to confirm, on one side, that it is possible to drastically reduce the execution time of the Lukáš et al. algorithm by using the MapReduce paradigm. On the other side, it is clear that the LukHdp implementation of the algorithm still offers much room for improvement.

TABLE II. Execution times, in minutes, of the different variants of the Lukáš et al. algorithm on a Hadoop cluster of 32 slave nodes, compared against the sequential counterpart run on a single node. Step III is omitted because its execution times are negligible.

Type        | Setup | Step I | Step II | Total Time
Luk         | n/a   | 903    | 5,059   | 5,962
LukHdp      | 58    | 52     | 278     | 388
LukHdp_Zip  | 58    | 52     | 462     | 572
LukHdp_PC   | 58    | 52     | 240     | 350
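As a quick check of the figures reported in Table II (our own back-of-the-envelope computation, using the Total Time column), the measured speedup of LukHdp over Luk and its fraction of the ideal 32x speedup on 32 slave nodes are:

\text{speedup} = \frac{T_{Luk}}{T_{LukHdp}} = \frac{5962}{388} \approx 15.4, \qquad \frac{15.4}{32} \approx 0.48

i.e., roughly half of the maximum speedup theoretically achievable, consistent with the observation above.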
D. Tuning Phase

We expected the most expensive operation of the Lukáš et al. algorithm to be the usage of the denoising filter to extract the residual noise from an image. This has been confirmed in our preliminary experimentation, where the average time required for calculating the residual noise of an input image was approximately 20 seconds. Such a long time is justified both by the complexity of the denoising filter and by the need of processing the file not in its original JPEG encoding, but in an uncompressed format (i.e., approximately 140 megabytes of data per image). A similar problem also affects the calculation of the correlation between a residual noise image and a Reference Pattern. This operation implies the comparison of two files having an overall size of about 300 megabytes, thus heavily affecting the operation execution time. Finally, the need of managing such large files also puts a heavy burden on the I/O activity, as the transmission of a residual noise file over the network may take a (relatively) long time.

A first optimization we tried, here denoted as LukHdp_Zip, consisted in compressing the objects containing the residual noises and the Reference Patterns before sending them over the network. The compression algorithm we have used is the Lempel-Ziv coding. The expectation was that the time spent by each node to compress and decompress a file would be repaid by the smaller transmission times required to exchange the compressed files over the network. In addition, the transmission of smaller files would also reduce the probability of a network congestion due to several nodes exchanging large files at the same time.

Another potential performance issue concerns the calculation of the correlation indices. In its original formulation, the algorithm loads a camera Reference Pattern from the local file system and then correlates it with an input residual noise file. While carrying out the first activity, the CPU is almost unused, as it is essentially an I/O-intensive operation. Instead, the second activity is CPU-intensive and makes no use of the file system. Consequently, the second optimization we tried, labeled as LukHdp_PC, consisted in modeling these two activities according to the producer-consumer paradigm. Here, a first thread is in charge of loading RP files from the local file system and adding them to a shared queue. In the meanwhile, the second thread takes RP files from the shared queue and uses them to calculate the correlation with an input RN. The two threads are executed concurrently, so that, while one thread is calculating the correlation between the input RN file and the RP of a given camera, the other thread is loading in memory the RP of the next camera. Notice that it is not possible to keep the RPs of all the cameras in memory due to their large size.
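A minimal sketch of this producer-consumer scheme, based on a bounded java.util.concurrent.BlockingQueue, is reported below. Only a small, bounded number of RPs is ever kept in memory; the loadRp() and correlate() calls are placeholders standing in for the original sequential modules, and all names are illustrative.

import java.io.File;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Sketch of the producer-consumer scheme adopted in LukHdp_PC: one thread
// loads Reference Pattern files from the local file system while the other
// correlates the residual noise of the current image with the RP loaded at
// the previous round.
public class PipelinedCorrelator {

  private static final int QUEUE_CAPACITY = 1; // at most one RP pre-loaded

  public double[] correlateAll(final File[] rpFiles, final float[] residualNoise)
      throws InterruptedException {
    final BlockingQueue<float[]> queue =
        new ArrayBlockingQueue<float[]>(QUEUE_CAPACITY);

    // Producer: I/O-bound thread that reads the RP files one at a time.
    Thread producer = new Thread(new Runnable() {
      public void run() {
        try {
          for (File rpFile : rpFiles) {
            queue.put(loadRp(rpFile)); // blocks while the queue is full
          }
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
        }
      }
    });
    producer.start();

    // Consumer: CPU-bound thread (here, the caller) that correlates the
    // residual noise with each RP as soon as it becomes available.
    double[] correlations = new double[rpFiles.length];
    for (int i = 0; i < rpFiles.length; i++) {
      float[] rp = queue.take(); // blocks until the next RP has been loaded
      correlations[i] = correlate(residualNoise, rp);
    }
    producer.join();
    return correlations;
  }

  // ---- placeholders standing in for the original sequential code ----
  private float[] loadRp(File f) { return new float[1]; }
  private double correlate(float[] rn, float[] rp) { return 0.0; }
}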
Despite our expectations, the usage of compression to reduce the exchange times of the RP files did not produce any advantage on the overall execution time, as shown in Table II. On the contrary, we observed a noticeable increase of the Step II execution time, likely due to the time spent uncompressing these files. Instead, the adoption of the producer-consumer pattern in the computation of the correlation indices improved the execution time of Step II by about 14% over the performance of LukHdp.

E. Scalability Test

In this experiment we investigated the scalability of LukHdp_PC with respect to its sequential counterpart. To this end, we focused our attention on the two most computationally intensive steps of the Lukáš et al. algorithm: the calculation of the RPs (i.e., Step I) and the calculation of the correlation indices (i.e., Step II). From an operative viewpoint, we increased the size of the cluster from 4 up to 32 slave nodes, while measuring the efficiency of LukHdp_PC compared to Luk according to the following formula:

E(n) = \frac{T_{Luk}}{n \cdot T_{LukHdp\_PC}(n)}   (4)
In Equation (4), n is the number of slave nodes of the cluster, while T_Luk and T_LukHdp_PC(n) are, respectively, the execution time of Luk and the execution time of LukHdp_PC when run on a cluster of size n. The results, available in Figure 1, are somewhat contrasting. When running on a cluster of 4 slave nodes, LukHdp_PC exhibits very good performance, with an efficiency that is not far from 1. As the size of the cluster increases, the efficiency of the algorithm decreases, probably because of the overhead due to the spreading and collection of data files among an increasing number of nodes.

Figure 1. Efficiency of LukHdp_PC with respect to Luk when running on a cluster of increasing size.

In addition, we note a different performance between the two steps: the speedup of Step I is smaller and grows more slowly than that of Step II. This difference can be explained by the different strategies adopted to distribute the computation. Step I fully exploits the MapReduce paradigm. In fact, the map tasks produce a huge amount of data to be elaborated by the reduce tasks and, thus, the overall execution time is heavily influenced by the network activity. On the contrary, Step II is less dependent on network activity, because it produces a smaller amount of data and it does not include any reduce task. Finally, the producer-consumer paradigm used in the map task to asynchronously load RP files in memory allows for a better usage of the CPU and, consequently, implies shorter execution times.

VII. CONCLUSIONS AND FUTURE WORKS

Our goal was to investigate the possibility of solving a well-known digital image forensics problem, the Source Camera Identification (SCI) problem, using a distributed approach based on the MapReduce paradigm. Namely, we have chosen the reference algorithm for this problem, the algorithm by Lukáš et al., we have reformulated it as a MapReduce algorithm, we have implemented it using the Hadoop framework and we have compared its performance against the original version in a variety of settings.
The results we obtained are quite contrasting. Initially, we have been able to code the distributed version of the algorithm with very little effort, thanks to the facilities offered by the Hadoop framework. However, this vanilla implementation exhibited very poor performance. Thus, we had to write a different, optimized implementation, which required a lot of engineering, in order to obtain performance in line with our expectations. According to our tests, this optimized version easily outperforms the original one when running on a cluster of at least 4 computing nodes. However, there is still a 2x gap with respect to the maximum speedup achievable in a distributed setting. This performance penalty is mostly due to the overhead required for transmitting image and data files, which may be very big, among different nodes.

There are several directions worth being explored. Firstly, there is a need for a deeper profiling activity, to better explain the performance of the distributed implementation we developed and, possibly, design new optimizations. Secondly, the possibility of developing an alternative approach to the exchange of files and images among nodes should be investigated in order to achieve better performance (e.g., introducing a custom Hadoop module for explicitly managing data transmission according to our needs). Thirdly, it would be interesting to experiment with the adaptation of alternative SCI algorithms to the MapReduce paradigm, as well as with the experimentation of these algorithms on much larger data sets.

REFERENCES

[1] J. Lukáš, J. Fridrich, and M. Goljan, "Digital camera identification from sensor pattern noise," IEEE Transactions on Information Forensics and Security, vol. 1, pp. 205–214, November 2006.

[2] "Facebook Annual Report 2012," http://investor.fb.com/secfiling.cfm?filingID=1326801-13-3, August 2013.

[3] K. Kurosawa, K. Kuroki, and N. Saitoh, "CCD fingerprint method-identification of a video camera from videotaped images," in International Conference on Image Processing (ICIP), 1999, vol. 3, 1999, pp. 537–540.

[4] G. Cattaneo, P. Faruolo, and U. Ferraro Petrillo, "Experiments on improving sensor pattern noise extraction for source camera identification," in Sixth International Conference on Innovative Mobile and Internet Services in Ubiquitous Computing (IMIS), 2012. IEEE, 2012, pp. 609–616.

[5] A. Castiglione, G. Cattaneo, M. Cembalo, and U. Ferraro Petrillo, "Source camera identification in real practice: A preliminary experimentation," in International Conference on Broadband, Wireless Computing, Communication and Applications (BWCCA), 2010. IEEE, 2010, pp. 417–422.

[6] J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters," Commun. ACM, vol. 51, no. 1, pp. 107–113, January 2008.

[7] J. Choi, C. Choi, B. Ko, D. Choi, and P. Kim, "Detecting web based DDoS attack using mapreduce operations in cloud computing environment," Journal of Internet Services and Information Security (JISIS), vol. 3, no. 3/4, pp. 28–37, November 2013.

[8] A. McKenna, M. Hanna, E. Banks, A. Sivachenko, K. Cibulskis, A. Kernytsky, K. Garimella, D. Altshuler, S. Gabriel, M. Daly et al., "The genome analysis toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data," Genome Research, vol. 20, no. 9, pp. 1297–1303, 2010.

[9] "Disco," http://discoproject.org/, August 2013.

[10] "Microsoft Dryad," http://research.microsoft.com/en-us/projects/dryad/, August 2013.

[11] "Apache Hadoop," http://hadoop.apache.org/, July 2013.

[12] M. Goljan, J. Fridrich, and T. Filler, "Large scale test of sensor fingerprint camera identification," in Proc. SPIE, Electronic Imaging, Security and Forensics of Multimedia Contents XI, 2009, pp. 18–22.

[13] M. K. Mihcak, I. Kozintsev, and K. Ramchandran, "Spatially adaptive statistical modeling of wavelet image coefficients and application to denoising," in Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing, Phoenix, AZ, vol. 6, pp. 3254–3256, Mar 1999.

[14] V. N. Vapnik, The Nature of Statistical Learning Theory. Springer-Verlag New York, Inc., New York, NY, USA, 1995.

[15] M. Goljan, J. Fridrich, and T. Filler, "Managing a large database of camera fingerprints," in SPIE Conference on Media Forensics and Security, 2010.

[16] T. White, "The small files problem," http://www.cloudera.com/blog/2009/02/the-small-files-problem/, 2009.

[17] T. Abeel, Y. V. de Peer, and Y. Saeys, "Java-ML: A machine learning library," Journal of Machine Learning Research, vol. 10, pp. 931–934, 2009.

[18] "ISO Noise Chart 15739," http://www.precisionopticalimaging.com/products/products.asp?type=15739, August 2013.