A Distributed Algorithm for Content-Based Image Retrievals Using Image Classes
S.R. Subramanya (1), S. Srakaew (2), N. Alexandridis (2), A. Youssef (2), P. Piamsa-nga (2)
(1) Department of Computer Science, University of Missouri-Rolla, Rolla, MO 65409
(2) Department of EE and CS, The George Washington University, Washington, DC 20052
Abstract
The huge data sizes of images and the enormous number of images in a typical image database, coupled with the inexact nature and subjective interpretation of images, have called for content-based retrievals, which place enormous demands on computation. Networks of workstations (NOWs) are a cost-effective way of providing the much-needed computational power in such applications. This paper presents a distributed scheme for image retrievals in a NOW system, using an initial classification of the images in the database. It uses a heuristic for determining the centroids and thresholds of the image classes, which are used in the search process. Retrieval results using the proposed scheme on a NOW system are presented.
Keywords: Image databases, image classes, content-based retrievals, distributed algorithm, network of workstations.
1 Introduction
There has recently been a phenomenal increase in the use of images, along with text and other multimedia data such as audio and video, in a variety of computer applications, and this trend is expected to grow remarkably in the future. This has necessitated the development of image databases. Example applications are in digital libraries, radiological image archives, satellite imagery for earth resources, law enforcement, photojournalism, art, and several others. Fast and accurate image retrievals for user queries are crucial for such systems to be useful. The inexact and subjective nature of image data has rendered keyword-based systems for indexing and retrieval ineffective for image data, and has necessitated content-based indexing and searching schemes. Several schemes have been proposed for image retrievals for content-based queries [1, 2, 3, 4, 5, 6].

Content-based queries are unstructured and unanticipated. This usually requires a search over all images for matches to a given query. The huge data sizes of images, together with the large number of image files in a typical image database and the requirement for fast (real-time) retrievals for content-based queries, call for efficient search schemes which are both fast and accurate. This places enormous demands on computation and disk access. One cost-effective solution is the use of a network of workstations (NOW) for such applications. Essentially, a NOW is a set of machines (any combination of PCs and workstations) which are connected as a network and can communicate among themselves. Software such as PVM (Parallel Virtual Machine) and MPI (Message Passing Interface) provides the user with a unified view of a single machine. Data and process migration is generally needed to distribute the load on the machines in a manner transparent to the user.

In this paper, we propose a heuristic for determining centroid images and thresholds, which are used in the search process, and a distributed algorithm for carrying out the actual search and image retrievals using a NOW system. Color histograms at different resolutions, organized as a K-tree, are used as indices in the searches. Instead of arbitrarily partitioning the set of images among the machines, the images are classified into classes of similar images using the scheme described in [8], and the image classes are distributed to the machines. The proposed scheme is shown to provide faster retrievals than a scheme [9] which arbitrarily partitions and distributes the images (and later does load balancing).

The next section gives a brief description of the image data model, the NOW system, and some assumptions used in our image indexing and retrieval system. Section 3 presents the proposed algorithm. Experimental results are given in Section 4, followed by conclusions and future directions.
2 Image Data Model and Assumptions
The unified model for multimedia retrieval by content based on K-trees, which was used in [7], is also used in this paper. The color histograms of the images at various image levels (resolutions) are used as indices during the search process. The (R, G, B) values of the pixels are converted to (H, S, V) (hue, saturation, and value components) and then mapped to a pre-selected domain of the color set [3]. The color space is quantized into 166 levels. Histograms are then derived and organized in a hierarchical manner in a K-tree, which facilitates multi-resolution processing. There is one tree for each image: the root holds the histogram of average color values for the whole image, each intermediate node holds the histogram of an image quadrant, computed as the average of the histograms of its four sub-quadrants, and the leaves hold the histograms at the pixel level. The search for images matching a given query can be made at any level of the tree, depending on the required search speed and search accuracy.

The architecture of the system for which the image retrieval algorithm is developed and implemented is a network of workstations (NOW) consisting of five machines: one Sun UltraSparc workstation, two Pentium 166 MHz PCs, and two Sun Sparc20 workstations, connected by a 100 Mb/s Ethernet and running MPI (Message Passing Interface). MPI is a portable standard for message passing which allows programs to be run in parallel on a NOW system using a set of message-passing directives. The MPI code in our system uses the master-slave model of computation. In the subsequent discussion, the terms console and machines refer to the master and the slaves respectively.
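The histogram hierarchy described in this section can be illustrated with a small sketch. It assumes a square image whose side is a power of two and whose pixels have already been mapped to color-set indices in [0, bins); the function name `build_histogram_tree` and the nested-list layout are illustrative choices, not part of the system described here.

```python
from typing import List

Histogram = List[float]
Grid = List[List[Histogram]]  # grid[r][c] is the histogram of one quadrant


def build_histogram_tree(color_idx: List[List[int]], bins: int) -> List[Grid]:
    """Return the levels of the tree, root first; the last level is per-pixel."""
    side = len(color_idx)
    # Leaves: each pixel contributes a one-hot histogram of its color index.
    finest: Grid = [
        [[1.0 if b == color_idx[r][c] else 0.0 for b in range(bins)]
         for c in range(side)]
        for r in range(side)
    ]
    levels = [finest]
    while side > 1:
        side //= 2
        prev = levels[-1]
        # Each parent is the average of the histograms of its four sub-quadrants.
        parent: Grid = [
            [[(prev[2 * r][2 * c][b] + prev[2 * r][2 * c + 1][b]
               + prev[2 * r + 1][2 * c][b] + prev[2 * r + 1][2 * c + 1][b]) / 4.0
              for b in range(bins)]
             for c in range(side)]
            for r in range(side)
        ]
        levels.append(parent)
    levels.reverse()  # root first, so levels[k - 1] holds 4**(k - 1) nodes
    return levels


# A 2x2 image with two colors: one pixel of color 0, three of color 1.
tree = build_histogram_tree([[0, 1], [1, 1]], bins=2)
root = tree[0][0][0]
```

Here `root` is the global average histogram of the image, and `tree[1]` is the 2x2 grid of per-pixel histograms. Searching at a coarser level touches fewer histograms per image, which is the speed versus accuracy trade-off mentioned in the text.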
Assumptions: One of the machines in the NOW, which has all the image data files, is designated as the console. The data is analyzed and the K-tree index of color histograms of all the images is built before the search starts. The images are partitioned into classes, where each class contains similar images. The classes (the indices of the images in the classes) are then distributed to the machines before the search starts. Access of the disk of machine $i$ by machine $j$ ($j \neq i$), if necessary, will not generally conflict with the accesses of machine $i$. There is a shared array named remaining of size $P$; remaining[i] contains the number of images (image indices) which remain to be compared to the query at machine $i$.
3 The Proposed Algorithm
We first give a brief description of the algorithm, followed by the pseudocode in a top-down fashion. Before the search starts, the images (and their histograms) are classified, and the classes are distributed to all machines. The query is then broadcast to all machines. The centroid images and the thresholds (radii) at different levels are computed for the image class(es) at each machine. Then, based on a preliminary match of the query with the centroid of each class at a machine, (1) the level in the histogram tree at which the search is to be made is determined, and (2) the classes whose images are to be considered for further matches are selected. These are explained in Sections 3.1 and 3.2. All the machines then carry out the search process in the selected classes at the determined level, and send the results to the console. The console keeps track of the progress of the machines and redistributes the loads if necessary. The flow of actions at the console and the other machines in the NOW is shown in Figure 1.
3.1 Determination of threshold and selection of classes
For each of the image classes at each machine, a centroid image is determined. This is a synthetic image which is essentially `equidistant' from all the images in the class, and is treated as the representative of the class. The centroids are determined at the different resolution levels of the histograms, $1, \ldots, L$. The maximum distance of an image from the centroid is taken as the `radius' or the threshold of the partition (at the particular level). The concept of centroid and threshold (radius) is illustrated in Figure 2. Note that the physical separation of the images in the figure is to be treated as their perceptual separation (distance). Only those classes for which the distance between their representative (centroid) and the query is below the threshold are selected for further consideration in the search process. The lowest level at which the distance between the query and the class centroid is below the threshold is taken as the level at which the query is to be matched with the images in that class.
Figure 1: Flow of actions at the console and the other machines. [Flow chart with two columns, Console and Machines, with time running downward. Console: partition the image index data into "classes", distribute the classes among the machines, and broadcast the query to all machines. Machines: determine the "centroids" and "thresholds" at all levels for all the classes at the machine, determine the level (resolution) of search, do similarity matches for the given query in the class(es) at the machine, and send a message to the console as and when a match is found, with the details of the match. Console: receive the message about the match, append the match to the result list, and sort it. Machines: send a message to the console when all data has been used in matching. Console: determine if `considerable' data is yet to be matched (in any machine); if so, determine the partition and send a message to the machine. Transfer of index data occurs if necessary, as determined by the console, and matching continues. Console: determine if the matching (searching) is over, and present the results.]
Figure 2: Centroid image and threshold. [Diagram labels: collection of images in a machine; centroid image; threshold (radius).]

3.2 Distributed search
After the classes to be used in the search, and their centroids, thresholds, and levels of search, have been determined, the machines carry out the search process. Each image in a class is matched with the query at the determined level, say $k$, and if the distance between them is less than or equal to the threshold $R_k$, the image is selected as a possible match and a message ⟨Match, Filename, dist.⟩ is sent to the console. The console receives the message about the matched data and inserts it into a list of results. The console periodically sorts the result list.

Let $n_1$ and $n_2$ be the numbers of images in the classes (selected for matching) in machines $i$ and $j$, at levels $l_1$ and $l_2$ respectively. When a machine, say $i$, has finished searching its data partition, it sends a message ⟨Done, i⟩ to the console. The console then checks the remaining array and finds the maximum among the entries. Say the maximum is that of machine $j$, which has processed $a$ ($< n_2$) images. If the maximum is greater than a certain value $v$, the console sends a message ⟨Read, j, x⟩ to machine $i$, which then reads a suitable portion of $x$ items from the remaining data on machine $j$ ($j$'s disk) and works on that data. A message ⟨Update, $n_2 - a - x$⟩ is sent to machine $j$, which updates the value in remaining[j] to the given value. The value $v$ is determined such that the time for the data transfer plus the time for completing the search by the two machines $i$ and $j$ does not exceed the search time of letting machine $j$ alone finish with its data set. The portion of data to be transferred from $j$ to $i$, namely $x$, is determined based on the relative speeds of $i$ and $j$ observed at that time. In the above case, $x$ is given by

$$x = \frac{n_1 (a - x)}{a \, 4^{\,l_2 - l_1}}.$$

We further adjust the value of $x$, with a view to balancing the load, by setting $x = x - \delta(x)$, where $\delta(x)$ is proportional to the time of transferring $x$ amount of data from $j$ to $i$. This is because, in this time, machine $j$ could make further searches. In the above example, only two machines with one class in each machine are considered for simplicity. The argument is easily extended to multiple machines, with any number of classes in each machine. The different messages and the actions are tabulated below:
| From → to | Message format | Action |
|---|---|---|
| machine → console | ⟨Match, Filename, dist.⟩ | Console appends the Filename and distance to its result list. |
| machine(i) → console | ⟨Done, i⟩ | The console locks and checks the remaining array, finds the maximum, and determines data movements, if necessary. |
| console → machine(i) | ⟨Read, j, x⟩ | Machine i reads x data items from the disk of j (the last x data items of j). remaining[i] is updated to x. |
| console → machine(j) | ⟨Update, $n_2 - a - x$⟩ | Machine j updates remaining[j] to the given value. |

The notations used in the following algorithms are explained below:
$P$: Number of machines in the NOW system.
$\mathcal{I}$: Set of all images; $|\mathcal{I}| = N$.
$Q$: Query image.
$M_p$: Number of image classes at machine $p$.
$H_{n,k}$: Histogram of quadrant $n$ at level $k$.
$B$: Number of histogram bins (= number of color quantization levels).
$L$: Number of resolution levels of image histograms.
$C_k$: Centroid image at level $k$.
$R_k$: Threshold (radius) at level $k$.
$\mathcal{R}$: List of retrieval results.
Note that $C_k$ is represented by $H^{c}_{n,k}$, the set of all histograms of the quadrants of the centroid image at level $k$. The pseudocode of the proposed algorithm is given below in a top-down fashion:
Algorithm 3.1 DistrImageRetrieval (in: I, Q; out: R)
{I: set of images, Q: query, R: retrieval results.}
1. begin
2.   Setup. {Basic system initialization.}
3.   ImageClassify. {Cluster all the images into classes.}
4.   DistributeClasses. {Distribute the classes to all the machines.}
5.   BroadcastQuery. {Broadcast the query to all machines.}
6.   Do Concurrently {Each machine p does:}
7.     DetermineCentroids(I_p; C).
8.     DetermineThresholds(I_p; C; R).
9.     SelectClasses(C; R; Q).
10.  EndDo
11.  Do Concurrently
       Console:
12.      Respond to messages from other machines.
13.      if remaining[i] = 0, ∀i then
14.        R ← best results. exit.
15.      endif
       Machines:
16.      ImageSearch.
17.  Until all machines are done
18. end
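The console's response to a ⟨Done, i⟩ message, as described in Section 3.2, might be sketched as follows. This is a simplified illustration under stated assumptions: the function `on_done`, the `speed` list of observed per-machine rates, and the speed-proportional split used to pick x are all hypothetical stand-ins, and the split does not reproduce the expression for x or the transfer-time adjustment given in the text. Messages are returned as (recipient, payload) pairs instead of being sent over MPI.

```python
from typing import List, Tuple


def on_done(i: int, remaining: List[int], v: int,
            speed: List[float]) -> List[Tuple[int, tuple]]:
    """Decide whether idle machine i should take over work from the busiest machine."""
    # Find the machine with the most images still to be compared.
    j = max(range(len(remaining)), key=lambda p: remaining[p])
    if j == i or remaining[j] <= v:
        return []  # too little work left to justify a transfer
    # ASSUMPTION: share j's backlog in proportion to the observed speeds.
    x = int(remaining[j] * speed[i] / (speed[i] + speed[j]))
    if x == 0:
        return []
    return [
        (i, ("Read", j, x)),                 # machine i reads x items from j's disk
        (j, ("Update", remaining[j] - x)),   # machine j keeps the rest
    ]


# Machine 0 is idle and three times as fast as machine 1, which has 40 images left.
messages = on_done(0, remaining=[0, 40, 10], v=8, speed=[3.0, 1.0, 1.0])
no_transfer = on_done(0, remaining=[0, 5, 3], v=8, speed=[3.0, 1.0, 1.0])
```

In the first call the busiest machine exceeds the cutoff v, so a Read and an Update message are produced; in the second call the backlog is below v and no data moves.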
Algorithm 3.2 DetermineCentroids (in: I_p; out: C)
{I_p: set of all images at machine p, C = {C_1 ... C_l}: centroid images at levels 1 ... l.}
1. begin
2.   for each level k, 1 ≤ k ≤ l do
3.     for each node n at level k, 1 ≤ n ≤ 4^(k−1) do
4.       for each bin i of histogram H_{n,k} at node n, 1 ≤ i ≤ B do
5.         Find (min, max) of H_{n,k}[i] over all images in I_p.
6.         C_{n,k}[i] ← (min + max)/2.
7.       endfor
8.     endfor
9.   endfor
10. end
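The centroid computation of Algorithm 3.2 at a single level can be sketched as below. The data layout (each image at one level given as a list of node histograms) and the name `centroid_at_level` are assumptions made for illustration.

```python
from typing import List

Histogram = List[float]
LevelIndex = List[Histogram]  # node histograms of one image at one level


def centroid_at_level(images: List[LevelIndex]) -> LevelIndex:
    """Per node and per bin, take the midpoint of the min and max over all images."""
    n_nodes = len(images[0])
    n_bins = len(images[0][0])
    centroid: LevelIndex = []
    for n in range(n_nodes):
        node: Histogram = []
        for i in range(n_bins):
            values = [img[n][i] for img in images]
            node.append((min(values) + max(values)) / 2.0)
        centroid.append(node)
    return centroid


# Three images at the root level (one node, two bins each).
class_images = [[[0.0, 1.0]], [[0.5, 0.5]], [[1.0, 0.0]]]
centroid = centroid_at_level(class_images)
```

Because each bin depends only on the extreme values over the class, the centroid is a synthetic histogram and need not coincide with any image in the class.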
Algorithm 3.3 DetermineThresholds (in: I_p, C; out: R)
{I_p: set of all images at machine p, C = {C_1 ... C_l}: centroid images at levels 1 ... l, R = {R_1 ... R_l}: thresholds at levels 1 ... l.}
1. begin
2.   for each level k, 1 ≤ k ≤ l do
3.     for each image I_i ∈ I_p, 1 ≤ i ≤ N_p do
4.       d_i ← HistogramDist(I_i, C_k, k).
5.     endfor
6.     R_k ← MAX(d_i), 1 ≤ i ≤ N_p.
7.   endfor
8. end
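A minimal sketch of the threshold step of Algorithm 3.3 at one level follows. Two choices in it are assumptions rather than details taken from the text: the per-node distance is taken as one minus the normalized histogram intersection so that smaller values mean closer histograms, and the distance between two images at a level is taken as the mean of their per-node distances. All function names are illustrative.

```python
from typing import List

Histogram = List[float]
LevelIndex = List[Histogram]  # node histograms of one image at one level


def node_distance(h: Histogram, g: Histogram) -> float:
    """ASSUMPTION: 1 - normalized intersection, so identical histograms give 0."""
    denom = min(sum(h), sum(g))
    if denom == 0:
        return 1.0  # treat an empty histogram as maximally distant
    return 1.0 - sum(min(a, b) for a, b in zip(h, g)) / denom


def level_distance(a: LevelIndex, b: LevelIndex) -> float:
    """ASSUMPTION: average the per-node distances across the level."""
    return sum(node_distance(h, g) for h, g in zip(a, b)) / len(a)


def threshold_at_level(images: List[LevelIndex], centroid: LevelIndex) -> float:
    """Radius of the class: the largest distance from any image to the centroid."""
    return max(level_distance(img, centroid) for img in images)


images = [[[1.0, 0.0]], [[0.0, 1.0]]]
centroid = [[0.5, 0.5]]
radius = threshold_at_level(images, centroid)
```

With these two images the centroid sits halfway between them, so both lie at distance 0.5 and the radius is 0.5.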
The function HistogramDist(h, g, k) determines the distance between histograms $h$ and $g$, which are at level $k$. We use the histogram intersection distance, defined by:

$$d_I(h, g) = \frac{\sum_{i=1}^{B} \mathrm{MIN}(h[i], g[i])}{\mathrm{MIN}\left(\sum_{i=1}^{B} h[i], \; \sum_{i=1}^{B} g[i]\right)}$$
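The expression above can be evaluated directly, as in this short sketch. The function name `histogram_intersection` and the handling of an all-zero histogram are assumptions. Note that the value as written is 1 for histograms with the same color distribution and 0 for histograms with no bins in common; a caller that needs smaller-is-closer semantics might use one minus this value, which is an assumption here and not something the text states.

```python
from typing import List


def histogram_intersection(h: List[float], g: List[float]) -> float:
    """Sum of bin-wise minima, normalized by the smaller of the two histogram totals."""
    denom = min(sum(h), sum(g))
    if denom == 0:
        return 0.0  # ASSUMPTION: an empty histogram overlaps with nothing
    return sum(min(a, b) for a, b in zip(h, g)) / denom


h = [2.0, 0.0, 2.0]
g = [1.0, 1.0, 2.0]
overlap = histogram_intersection(h, g)   # (1 + 0 + 2) / min(4, 4)
same = histogram_intersection(h, h)
```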
Algorithm SelectClasses determines the classes to be selected for matching, and the level at which the query is to be matched with the images in the classes, and is performed by all machines.
Algorithm 3.4 SelectClasses (in: C; R; Q)
{C = {C_1 ... C_l}: centroid images at levels 1 ... l, R = {R_1 ... R_l}: thresholds at levels 1 ... l.}
1. begin {each machine p does (steps 2–12):}
2.   for each class C_i, 1 ≤ i ≤ M_p do
3.     k ← 1;
4.     while k ≤ L do
5.       if HistogramDist(C_k, Q_k, k) ≤ R_k then
6.         k ← k + 1;
7.       else
8.         l_i ← k − 1; break
9.       endif
10.    endwhile
11.  endfor
12.  Select only those classes C_i for which l_i ≠ 0, 1 ≤ i ≤ M_p, for the search process.
13. end
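The level and class selection of Algorithm 3.4 can be sketched as follows. Several details are assumptions: the distance is one minus the normalized histogram intersection, averaged over the nodes of a level; a class that passes the test at every level is assigned the deepest level L, a case the pseudocode leaves implicit; and the names and the dictionary layout are illustrative.

```python
from typing import Dict, List, Tuple

Histogram = List[float]
LevelIndex = List[Histogram]  # node histograms at one level


def level_distance(a: LevelIndex, b: LevelIndex) -> float:
    """ASSUMPTION: mean over nodes of (1 - normalized histogram intersection)."""
    total = 0.0
    for h, g in zip(a, b):
        denom = min(sum(h), sum(g))
        overlap = sum(min(x, y) for x, y in zip(h, g)) / denom if denom else 0.0
        total += 1.0 - overlap
    return total / len(a)


def select_level(centroids: List[LevelIndex], thresholds: List[float],
                 query: List[LevelIndex]) -> int:
    """Walk down from level 1 while the query stays within the class radius."""
    k = 1
    while k <= len(thresholds):
        if level_distance(centroids[k - 1], query[k - 1]) <= thresholds[k - 1]:
            k += 1
        else:
            break
    return k - 1  # 0 means the class is rejected at the root level


def select_classes(classes: Dict[str, Tuple[List[LevelIndex], List[float]]],
                   query: List[LevelIndex]) -> Dict[str, int]:
    """Map each selected class to its search level; rejected classes are dropped."""
    selected = {}
    for name, (centroids, thresholds) in classes.items():
        level = select_level(centroids, thresholds, query)
        if level != 0:
            selected[name] = level
    return selected


query = [[[1.0, 0.0]]]                      # one level, one node, two bins
classes = {
    "A": ([[[0.5, 0.5]]], [0.5]),           # query lies on the class radius
    "B": ([[[0.0, 1.0]]], [0.1]),           # query is far outside the radius
}
selected = select_classes(classes, query)
```

Only class A survives, so the per-image comparisons for class B are skipped entirely; this pruning is where the classification is expected to save search time.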
Algorithm 3.5 ImageSearch (in: I_p, Q, C, R)
1. begin {Each machine p, 1 ≤ p ≤ P − 1, does the following concurrently, for each of the M_p image classes at the machine.}
2.   for each image I in class C_i, 1 ≤ i ≤ M_p do
3.     if (HistogramDist(Q, I, l_i) ≤ R_{l_i}) then
4.       Decrement remaining[i].
5.       Send ⟨Match, Filename, dist.⟩ to console.
6.     endif
7.     Look for ⟨Update, n_2 − a − x⟩ message from console and take action as described in the table of messages.
8.   endfor
9.   Send ⟨Done, i⟩ to console.
10.  Wait for a message from console.
11.  if ⟨Read, j, x⟩ is received then
12.    Get and work on the data (as described earlier in this section).
13.  endif
14. end
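The per-machine loop of Algorithm 3.5 might look like the sketch below for a single selected class. It is a single-process illustration: messages are collected in a list instead of being sent over MPI, the handling of Update and Read messages is omitted, and the distance is assumed to be one minus the normalized histogram intersection averaged over the nodes of the level. The sketch also decrements the remaining count once per compared image, following the definition of remaining in Section 2; that reading, like the function names, is an assumption.

```python
from typing import Dict, List

Histogram = List[float]
LevelIndex = List[Histogram]  # node histograms at one level


def level_distance(a: LevelIndex, b: LevelIndex) -> float:
    """ASSUMPTION: mean over nodes of (1 - normalized histogram intersection)."""
    total = 0.0
    for h, g in zip(a, b):
        denom = min(sum(h), sum(g))
        overlap = sum(min(x, y) for x, y in zip(h, g)) / denom if denom else 0.0
        total += 1.0 - overlap
    return total / len(a)


def image_search(machine_id: int, images: Dict[str, List[LevelIndex]],
                 query: List[LevelIndex], level: int, threshold: float,
                 remaining: List[int]) -> List[tuple]:
    """Compare every image of a class with the query at the chosen level."""
    outbox = []  # stand-in for messages sent to the console
    for filename, levels in images.items():
        dist = level_distance(query[level - 1], levels[level - 1])
        remaining[machine_id] -= 1  # one fewer image left to compare
        if dist <= threshold:
            outbox.append(("Match", filename, dist))
    outbox.append(("Done", machine_id))
    return outbox


remaining = [0, 2]
images = {"a.png": [[[1.0, 0.0]]], "b.png": [[[0.0, 1.0]]]}
query = [[[1.0, 0.0]]]
outbox = image_search(1, images, query, level=1, threshold=0.5, remaining=remaining)
```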
4 Experimental Results
The proposed algorithm has been implemented and used to retrieve images from a database of about 5,000 images. The database was built using images from [10] and [11]. The NOW system used for the experiments consisted of 5 machines: 2 Pentium (166 MHz) machines, 1 Sun UltraSparc, and 2 Sun SparcStation machines. The search time does not include the time required for the initial classification and distribution. The matches reported for a query by the scheme which partitions the data among the machines without classification are almost the same as those reported by the scheme which classifies the images and then distributes the classes to the machines. A sample query and retrieval results are shown in Figure 3. Figure 4 shows the search times and speedups of the schemes with and without classes, for different numbers of active machines. The search time decreases as more machines become active, and the search scheme using image classification and subsequent partitioning of the classes among the machines performs better than the scheme which does not use classification. The proposed scheme achieves a speedup which is linear in the number of active machines.
Figure 3: Retrieval results.

Figure 4: Search times and speedups. [Four plots against the number of machines used in search (1 to 5): computation time (sec.), without and with classes; I/O time (sec.), without and with classes; speedup with respect to total time, without and with classes; and speedup of the scheme with classes over the scheme without classes, for computation time, I/O time, and total time.]

5 Conclusions and Future Directions
The enormous amounts of image data being generated and used in various computer applications have necessitated the development of image database systems providing content-based retrievals. The huge data sizes of images and the requirement of fast retrievals place enormous demands on storage and computation. Networks of workstations (NOWs) are a cost-effective way of providing the much-needed computational power in such applications. This paper presented a distributed scheme for image retrievals in a NOW system, using an initial classification of the images in the database. The scheme used a heuristic for determining the centroids and thresholds of the image classes, which are used in the search process. The proposed scheme achieves a linear speedup. Some future directions include (1) using more machines in the system, (2) load monitoring by the console and dynamic redistribution of the load among the machines, (3) determination of better thresholds (possibly different thresholds for different image classes), and (4) minimization of disk access conflicts and bus conflicts.
References
[1] Gudivada, V. and Raghavan, V. (Eds.) `Special Issue on Content-Based Image Retrieval Systems', IEEE Computer, Vol. 28, No. 9, September 1995.
[2] Kemp, Z. `Multimedia and Spatial Information Systems', IEEE Multimedia, Vol. 2, No. 4, 1995.
[3] Smith, J.R. and Chang, S-F. `SaFe: A General Framework for Integrated Spatial and Feature Image Search', IEEE Workshop on Multimedia Signal Processing, 1997.
[4] SPIE: Image Storage and Retrieval Systems, SPIE, San Jose, CA, Feb. 1992.
[5] Tamura, H. and Yokoya, N. `Image Database Systems: A Survey', Pattern Recognition, Vol. 17, No. 1, 1984, pp. 29-43.
[6] Chang, S-K. `Image Information Systems', Proc. IEEE, Vol. 73, No. 4, April 1995, pp. 754-764.
[7] Piamsa-nga, P., et al. `A Unified Model for Multimedia Retrieval by Content', Int'l. Conf. on Computers and Their Applications, Hawaii, March 1998.
[8] Subramanya, S.R., et al. `A Scheme for Automated Classification of Images', Int'l. Conf. on Computer Applications in Industry and Engineering (CAINE-98), Las Vegas, November 1998.
[9] Subramanya, S.R., et al. `A Distributed Algorithm for Content-Based Retrievals in Image Databases', Int'l. Conf. on Advanced Computing (ADCOMP-98), Pune, India, December 1998.
[10] Eastman Kodak homepage. URL: http://www.kodak.com.
[11] Smithsonian Institute `Online Collection of Pictures', URL: ftp://photo1.si.edu
[12] Lynch, N.A. Distributed Algorithms, Morgan Kaufmann, 1996.