Medical Content Based Image Retrieval by Using the HADOOP Framework

Said Jai-Andaloussi, Abdeljalil Elabdouli, Abdelmajid Chaffai, Nabil Madrane, Abderrahim Sekkaki

Abstract— Most medical images are now digitized and stored in large image databases, and retrieving the desired images has become a challenge. In this paper, we address this challenge by building a content-based image retrieval (CBIR) system on the MapReduce distributed computing model and the HDFS storage model. Two methods are used to characterize the content of images: the BEMD-GGD method (Bidimensional Empirical Mode Decomposition with Generalized Gaussian Density functions) and the BEMD-HHT method (Bidimensional Empirical Mode Decomposition with the Huang-Hilbert Transform, HHT). To measure similarity between images we compute the distance between their signatures, using the Kullback-Leibler Divergence (KLD) to compare BEMD-GGD signatures and the Euclidean distance to compare BEMD-HHT signatures. Experiments on the DDSM mammography image database show promising results and confirm the feasibility and efficiency of applying CBIR to large medical image databases.

I. INTRODUCTION

Nowadays, medical imaging systems produce more and more digitized images in all medical fields. Most of these images are stored in image databases, and there is great interest in using them for diagnostic and clinical decision support such as case-based reasoning [1]. The purpose is to retrieve desired images from a large image database using only the numerical content of the images. A CBIR (Content-Based Image Retrieval) system is one of the possible solutions to effectively manage image databases [2]. Furthermore, fast access to such a huge database requires an efficient computing model. The Hadoop framework is one such solution, built on the MapReduce [3] distributed computing model. Lately, MapReduce has emerged as one of the most widely used parallel computing platforms for processing data on terabyte and petabyte scales. Google, Amazon, and Facebook are among the biggest users of the MapReduce programming model, and it has recently been adopted by several universities as well. It allows distributed processing of data-intensive computations over many machines. In CBIR systems, requests (the system inputs) are images and answers (outputs/results) are all the similar images in the database. A typical CBIR system can be decomposed into three steps: first, the characteristic features of each image in the database are extracted and used to index the images; second, the feature vector of a query image is computed; and third, the feature vector of the query

image is compared to that of each image in the database. For the definition and extraction of image characteristic features, many methods have been proposed, including image segmentation and image characterization using the wavelet transform and Gabor filter banks [4, 5]. In this work, we use the MapReduce computing model to extract image features with the BEMD-GGD and BEMD-HHT methods [2], and we write the feature files into HBase [6] (an open-source, distributed, versioned, column-oriented store modeled after Google's Bigtable). The Kullback-Leibler divergence (KLD) and the Euclidean distance are used to compute the similarity between image features. The paper is organized as follows: Section II-A describes the database we used for evaluation. Section II-B presents the components of the Hadoop framework. Section II-C describes the BEMD, BEMD-GGD and BEMD-HHT methods. Section II-D presents the similarity measures. Section III describes the architecture of the CBIR system based on the Hadoop framework, and results are given in Section IV. We end with a discussion and conclusion in Section V.

II. MATERIAL AND METHODS

A. DDSM Mammography database

The DDSM project [7] is a collaborative effort involving the Massachusetts General Hospital, the University of South Florida and Sandia National Laboratories. The database contains approximately 2,500 patient files. Each patient file includes two images of each breast (4 images per patient, 10,000 images in total), along with associated patient information (age at time of study, ACR breast density rating) and image information (scanner, spatial resolution). Images have a definition of 2000 by 5000 pixels. The database is classified into 3 levels of diagnosis ('normal', 'benign' or 'cancer'). An example of an image series is given in Figure 1.

Said Jai-Andaloussi, Nabil Madrane and Abderrahim Sekkaki are with LIAD Lab, Casablanca, Kingdom of Morocco ([email protected]).

Said Jai-Andaloussi, Abdeljalil Elabdouli, Abdelmajid Chaffai, Nabil Madrane and Abderrahim Sekkaki are with the Faculty of Science Ain-Chok, Casablanca, Kingdom of Morocco.

Fig. 1. Image series from a mammography study

B. Hadoop Framework

Hadoop is a distributed master-slave architecture (a model of communication in which one process, the master, has control over one or more other processes, the slaves) that consists of the Hadoop Distributed File System (HDFS) for storage and MapReduce for computational capabilities. Traits intrinsic to Hadoop are data partitioning and parallel computation of large datasets. Its storage and computational capabilities scale with the addition of hosts to a Hadoop cluster, and can reach volume sizes in the petabytes on clusters with thousands of hosts [8].

1) MapReduce: MapReduce is a batch-based, distributed computing framework modeled after Google's paper on MapReduce (see "MapReduce: Simplified Data Processing on Large Clusters", http://research.google.com/archive/mapreduce.html). It allows work over a large amount of raw data to be parallelized. MapReduce decomposes the work submitted by a client into small parallelized map and reduce tasks, as shown in Figure 2 (Figure 2 is taken from [8]). The map and reduce constructs used in MapReduce are borrowed from those found in the Lisp functional programming language, and use a shared-nothing model (a distributed computing concept in which each node is independent and self-sufficient) to remove any parallel-execution interdependencies that could add unwanted synchronization points.

2) HDFS: HDFS is the storage component of Hadoop. It is a distributed filesystem modeled after the Google File System (GFS) (see http://research.google.com/archive/gfs.html). HDFS is optimized for high throughput and works best when reading and writing large files (gigabytes and larger). To support this throughput, HDFS leverages unusually large (for a filesystem) block sizes and data-locality optimizations to reduce network input/output (I/O). Scalability and availability are also key traits of HDFS, achieved in part through data replication and fault tolerance. HDFS can create, move, delete or rename files like traditional file systems, but it differs in its method of storage, which involves two actors: the NameNode and the DataNode. A DataNode stores the data of the Hadoop File System, while the NameNode is the centerpiece of an HDFS file system: it keeps the directory tree of all files in the file system and tracks where across the cluster the file data is kept.

C. Numerical image characterization: signatures

The BEMD [9, 10] is an adaptive decomposition that decomposes any image into a set of functions denoted BIMFs and a residue; these BIMFs are obtained by means of an algorithm called the sifting process [11]. This decomposition allows local features (phase, frequency) of the input image to be extracted. In this work, we describe the image by generating a numerical signature based on the BIMF contents [12, 13]. The usual approach used in CBIR systems to characterize an image in a generic way is to define a global representation of the whole image, or to compute statistical parameters such as the co-occurrence matrix and Gabor filter banks [5].

Fig. 2. A client submitting a job to MapReduce
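As a concrete illustration of the job-submission flow in Figure 2, the sketch below shows how an offline feature-extraction step could be expressed as a map-only Hadoop job. This is a minimal, hypothetical example, not the authors' implementation: the input format (one "imageId<TAB>hdfsPathToImage" line per image) and the computeSignature() helper are assumptions standing in for the BEMD-GGD/BEMD-HHT extraction described in Sections II-C and III-A.

// Minimal sketch (illustrative, not the authors' code).
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SignatureExtractionJob {

  public static class SignatureMapper
      extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      String[] fields = line.toString().split("\t");
      if (fields.length < 2) return;                 // skip malformed lines
      String imageId = fields[0];
      String imagePath = fields[1];
      String signature = computeSignature(imagePath); // hypothetical BEMD step
      context.write(new Text(imageId), new Text(signature));
    }

    private String computeSignature(String imagePath) {
      // Placeholder: decode the image, run BEMD, estimate (alpha, beta) or
      // HHT statistics per BIMF, and serialize the resulting feature vector.
      return "";
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "bemd-signature-extraction");
    job.setJarByClass(SignatureExtractionJob.class);
    job.setMapperClass(SignatureMapper.class);
    job.setNumReduceTasks(0);                        // map-only job
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Such a job would be launched with "hadoop jar", taking the image list as input and an HDFS output directory for the signatures; the reduce phase is unnecessary here because each image is processed independently.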

Fig. 3. Extraction process of the image signature using BEMD-GGD and BEMD-HHT

These parameters represent the signature, or index, of the image. The most widely used technique is to build image signatures based on the information content of color histograms. In this work, we use the Bidimensional Empirical Mode Decomposition (BEMD), the Generalized Gaussian Density function (GGD) and the Huang-Hilbert transform to generate the image signature.

1) BEMD-GGD signature: The generalized Gaussian law is derived from the normal law and is parameterized by:
• α: a scale factor, corresponding to the standard deviation of the classical Gaussian law;
• β: a shape parameter.
Its density is defined as

p(x;\alpha,\beta) = \frac{\beta}{2\alpha\,\Gamma(1/\beta)}\, e^{-(|x|/\alpha)^{\beta}},     (1)

where \Gamma(\cdot) is the gamma function, \Gamma(z) = \int_{0}^{\infty} e^{-t}\, t^{z-1}\, dt, \; z > 0.
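As a quick sanity check on equation (1) (a standard property of the GGD, not stated in the paper): for \beta = 2 the density reduces to a zero-mean Gaussian (with standard deviation \alpha/\sqrt{2}), and for \beta = 1 it reduces to the Laplacian with scale \alpha:

p(x;\alpha,2) = \frac{1}{\alpha\sqrt{\pi}}\, e^{-x^{2}/\alpha^{2}}, \qquad p(x;\alpha,1) = \frac{1}{2\alpha}\, e^{-|x|/\alpha}.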

We propose to characterize images by couples (α̂, β̂), determined by a maximum likelihood estimator of the distribution law of the coefficients of each BIMF in the BEMD decomposition [12]. The image signature vector is formed by the set of couples (α̂, β̂) derived from each BIMF and by the histogram of the residue.

2) BEMD-HHT signature: In the second method we apply the Huang-Hilbert transform [11] to each BIMF and extract information from the transformed BIMFs. Consider the analytic signal z(t) (equation (2)) of a real signal s(t); the imaginary part of z(t) is the Hilbert transform of its real part (equation (3)):

z(t) = s(t) + i\,y(t),     (2)

y(t) = H(s(t)) = \mathrm{v.p.} \int_{-\infty}^{+\infty} \frac{s(\tau)}{\pi(t-\tau)}\, d\tau,     (3)

where v.p. denotes the Cauchy principal value.
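For reference (standard Huang-Hilbert transform definitions, not spelled out in the paper), the instantaneous amplitude, phase and frequency used to build the matrices A, \theta and W below are obtained from the analytic signal as

A(t) = |z(t)| = \sqrt{s(t)^{2} + y(t)^{2}}, \qquad \theta(t) = \arctan\frac{y(t)}{s(t)}, \qquad W(t) = \frac{d\theta(t)}{dt}.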

We propose to characterize the image using statistical features (mean, standard deviation) extracted from the amplitude matrix A, the phase matrix θ and the instantaneous frequency matrix W of each BIMF [13]. Figure 3 illustrates the extraction process of the image signature using BEMD-GGD and BEMD-HHT.

D. Distance

1) BEMD-GGD similarity: To compute the similarity distance between two BIMFs modeled as generalized Gaussians, the Kullback-Leibler distance is used, following [14] (see equation (4)):

KLD\big(p(X;\theta_q)\,\|\,p(X;\theta_i)\big) = \int p(X;\theta_q)\,\log\frac{p(X;\theta_q)}{p(X;\theta_i)}\, dX.     (4)

The distance between two images I and J is the sum of the weighted distances between their BIMFs (see equation (5)):

D(I, J) = \sum_{k=1}^{K} \lambda_k\, KLD\big(P(X;\alpha_k^{I},\beta_k^{I})\,\|\,P(X;\alpha_k^{J},\beta_k^{J})\big).     (5)
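Equation (4) does not need to be evaluated numerically for generalized Gaussians: substituting density (1) for both distributions and integrating gives a closed form (a standard result in the GGD/KLD retrieval literature, not written out in the paper):

KLD\big(p(\cdot;\alpha_1,\beta_1)\,\|\,p(\cdot;\alpha_2,\beta_2)\big) = \log\frac{\beta_1\,\alpha_2\,\Gamma(1/\beta_2)}{\beta_2\,\alpha_1\,\Gamma(1/\beta_1)} + \left(\frac{\alpha_1}{\alpha_2}\right)^{\beta_2}\frac{\Gamma\big((\beta_2+1)/\beta_1\big)}{\Gamma(1/\beta_1)} - \frac{1}{\beta_1},

so each term of equation (5) can be computed directly from the stored couples (α̂, β̂), without revisiting the BIMF coefficients.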

2) BEMD-Hilbert similarity: The distance between two images I and J is the sum of the weighted distances between their BIMFs (see equation (6)):

D(I, J) = \sum_{k=1}^{K} \lambda_k\, d(BIMF_k^{I}, BIMF_k^{J}),     (6)

where the λ_k are adjustment weights and BIMF_k represents the feature vector extracted from the k-th BIMF of an image. The distance between two BIMFs at the same level of decomposition is defined as

D(BIMF_I, BIMF_J) = \frac{|\mu_A^{I}-\mu_A^{J}|}{\alpha(\mu_A)} + \frac{|\mu_\theta^{I}-\mu_\theta^{J}|}{\alpha(\mu_\theta)} + \frac{|\mu_W^{I}-\mu_W^{J}|}{\alpha(\mu_W)} + \frac{|\sigma_A^{I}-\sigma_A^{J}|}{\alpha(\sigma_A)} + \frac{|\sigma_\theta^{I}-\sigma_\theta^{J}|}{\alpha(\sigma_\theta)} + \frac{|\sigma_W^{I}-\sigma_W^{J}|}{\alpha(\sigma_W)},     (7)

where α(μ) and α(σ) are the standard deviations of the respective features over the entire database and are used to normalize the individual feature components.

III. ARCHITECTURE OF CBIR SYSTEM BASED ON HADOOP FRAMEWORK

Content-based image retrieval (CBIR) is composed of two phases: 1) an offline phase and 2) an online phase. In the offline phase, the signature vector is computed for each image in the database and stored. In the online phase, the query is constructed by computing the signature vector of the input image; the query signature is then compared with the signatures of the images in the database.

A. Offline phase: applying MapReduce to the extraction of the image signature

MapReduce is known for its ability to handle large amounts of data. In this work, we use the open-source distributed cloud computing framework Hadoop and its implementation of the MapReduce model to extract the feature vectors of images. The implementation of distributed feature extraction and image storage is given in Figure 4. Storage is the foundation of a CBIR system: given the amount of image data produced daily by medical services, retrieving and processing these images requires significant computation time, so parallel processing is necessary. For this reason, we adopt the MapReduce computing model to extract the visual features of images and then write the features and image files into HBase. HBase partitions the key space; each partition is called a table, each table declares one or more column families, and column families define the storage properties for an arbitrary set of columns [6]. Figure 5 shows the structure of our HBase table: the row key is the ID of the image, and the column families are "file" and "features". The labels "source" and "class" are added under the family "file", representing the source image and the class of the image (the DDSM database is classified into 3 levels of diagnosis: 'normal', 'benign' or 'cancer'), respectively. Under the family "features", the labels "feature BEMD-GGD Alpha", "feature BEMD-GGD Beta", "feature BEMD-HHT mean", "feature BEMD-HHT standard deviation", "feature BEMD-HHT phase" and "feature BEMD-residue histogram" are added, representing the features extracted using the BEMD-GGD and BEMD-HHT methods.
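To make the storage layout concrete, the sketch below shows how one row of the table in Figure 5 could be written with the standard HBase client API (HBase 1.x style). The table name, row key, column qualifiers and feature values are illustrative placeholders, not taken from the authors' implementation.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class StoreSignature {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection connection = ConnectionFactory.createConnection(conf);
         Table table = connection.getTable(TableName.valueOf("images"))) {
      // Row key = image ID; families "file" and "features" as in Figure 5.
      // All concrete names and values below are hypothetical examples.
      Put put = new Put(Bytes.toBytes("A_1580_1.LEFT_CC"));
      put.addColumn(Bytes.toBytes("file"), Bytes.toBytes("source"),
                    Bytes.toBytes("/ddsm/A_1580_1.LEFT_CC.png"));
      put.addColumn(Bytes.toBytes("file"), Bytes.toBytes("class"),
                    Bytes.toBytes("cancer"));
      put.addColumn(Bytes.toBytes("features"), Bytes.toBytes("BEMD-GGD_alpha"),
                    Bytes.toBytes("12.4,8.1,5.7,3.2"));
      put.addColumn(Bytes.toBytes("features"), Bytes.toBytes("BEMD-GGD_beta"),
                    Bytes.toBytes("0.91,1.12,1.35,1.60"));
      table.put(put);
    }
  }
}

Keeping the raw file reference and the extracted features in separate column families lets the online phase scan only the "features" family when computing distances.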

Fig. 4. Offline phase: applying MapReduce to the extraction of the image signature

Fig. 6. Online phase: applying MapReduce in image retrieval

Fig. 5. Structure of the HBase table for image feature storage (Table 1)

B. Online phase: applying MapReduce in image retrieval

Figure 6 describes the online retrieval phase, which is divided into 7 steps:
1) The user sends a query image to the SCL; the image is then stored temporarily in HDFS.
2) A MapReduce job is run to extract the features of the query image.
3) The query image features are stored in HDFS.
4) The similarity/distance between the feature vector of the query image in HDFS and those of the target images in HBase is computed.
5) A reduce step collects and combines the results from all the map functions.
6) The reducer stores the result in HDFS.
7) The result is sent to the user.

IV. RESULT

The method is tested on the DDSM database (see Section II-A). We measured the mean precision at 20, which is the ratio between the number of pertinent images retrieved and the total number of images retrieved. The principle of our retrieval method is as follows (see also the sketch after this list):
• Each image in the database is used as a query image.
• The algorithm finds the twenty images of the database closest to the query image, and precision is computed for this query.
• Finally, we compute the mean precision over all queries.
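The sketch below illustrates the ranking and the precision-at-20 computation in plain Java, outside MapReduce, to keep it short. The data structures, the use of the Euclidean distance on BEMD-HHT signature vectors, and the class-based notion of pertinence are assumptions made for illustration, not the authors' code.

import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Illustrative only: ranks database images by Euclidean distance between
// BEMD-HHT signature vectors and computes precision at 20 against the
// diagnosis class ('normal', 'benign', 'cancer').
public class RetrievalSketch {

  static double euclidean(double[] a, double[] b) {
    double sum = 0.0;
    for (int i = 0; i < a.length; i++) {
      double d = a[i] - b[i];
      sum += d * d;
    }
    return Math.sqrt(sum);
  }

  /** Returns the IDs of the 20 images whose signatures are closest to the query. */
  static List<String> top20(double[] querySignature, Map<String, double[]> database) {
    return database.entrySet().stream()
        .sorted(Comparator.comparingDouble(
            (Map.Entry<String, double[]> e) -> euclidean(querySignature, e.getValue())))
        .limit(20)
        .map(Map.Entry::getKey)
        .collect(Collectors.toList());
  }

  /** Precision at 20: fraction of retrieved images sharing the query's class. */
  static double precisionAt20(List<String> retrieved, String queryClass,
                              Map<String, String> classes) {
    long pertinent = retrieved.stream()
        .filter(id -> queryClass.equals(classes.get(id)))
        .count();
    return (double) pertinent / retrieved.size();
  }
}

In the distributed version described above, the distance computation of top20() is what the map functions perform against HBase, and the top-20 selection corresponds to the reduce step.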

In performance tests of image retrieval, we compared the local method with the parallel method based on the Hadoop framework. A diagram of the time consumed to retrieve images in parallel and locally is given in Figure 7. The horizontal axis represents the size of the image database and the vertical axis represents the retrieval time (in milliseconds). We can see that when the size of the image data is small (100
