CBIR on Grids

Oscar D. Robles1, José Luis Bosque1, Luis Pastor1, and Ángel Rodríguez2

1 Dpto. de Arquitectura y Tecnología de Computadores, Ciencias de la Computación e Inteligencia Artificial, U. Rey Juan Carlos, C/ Tulipán, s/n. 28933 Móstoles. Madrid. Spain
{oscardavid.robles, joseluis.bosque, luis.pastor}@urjc.es

2 Dpto. de Tecnología Fotónica, U. Politécnica de Madrid, Campus de Montegancedo s/n, 28660 Boadilla del Monte, Madrid, Spain
[email protected]

Abstract. From the computational point of view, Content-based Image Retrieval (CBIR) systems are potentially expensive, and their user response times grow with the ever-increasing sizes of the databases associated with them. This paper presents a grid implementation of a Content-based Image Retrieval system that offers a good cost/performance ratio to solve this problem thanks to the flexibility, scalability and fault tolerance of grids. This approach allows a dynamic management of specific databases, which can be incorporated into or removed from the CBIR system depending on the desired user query. Experimental performance results are collected in order to show the feasibility of this solution.

1 Introduction

Grid computing is becoming a feasible solution for computer applications with high computational power demands. This is due to the good price/performance ratio offered by the networks that compose this type of system and to the high flexibility and availability offered by this computation paradigm, which eases cooperation and resource sharing among institutions, named "Virtual Organizations" (VO) [1].

One application area where the concept of grid computing fits in a natural way is Content-Based Image Retrieval (CBIR). The main purpose of CBIR systems is to help users perform automatic information retrieval over image databases considering relevant visual features extracted from the image data in a preprocessing stage [2,3]. The complexity of this task depends heavily on the number of items stored in the system. Image and video databases usually involve large volumes of data, so alternatives to the conventional centralized server must be sought to manage the storage and processing of data in an efficient and flexible way.

The tremendous improvements experienced by computers in aspects such as price, processing power and mass storage capabilities have resulted in an explosion of the amount of information available to people. But this same wealth makes finding the "best" information a very hard task. CBIR systems try to solve this problem by offering mechanisms for selecting the data items which most resemble a specific query among all the available information.

From a computational point of view, CBIR systems are potentially expensive and their user response times grow with the ever-increasing sizes of the databases associated with them. One of the most common approaches followed to reach acceptable price/performance ratios has been to exploit the algorithms' inherent parallelism at implementation time [4,5,6]. However, the novelty of CBIR systems makes it hard to find references dealing with this aspect. Some contributions that can be cited are Zaki's compilation [7], and the contributions of Srakaew et al. [8] and Bosque et al. [9]. A cluster-based CBIR implementation was first introduced in [9], comparing its performance with a shared-memory solution. The results showed a definite advantage of the cluster in terms of scalability and price/performance ratio.

In this paper, we extend our previous work by sharing the available resources of different VOs. The paper presents a grid implementation of a CBIR system that offers a good cost/performance ratio thanks to the flexibility, scalability and fault tolerance of grids. The flexibility of the system presented here allows the dynamic addition or removal of grid nodes between two user queries, achieving reconfigurability, scalability and an appreciable degree of fault tolerance. This approach allows a dynamic management of specific databases, which can be incorporated into or removed from the CBIR system according to the users' queries. Experimental performance results are collected in order to show the feasibility of this solution, taking into account several database setups.

The rest of this article is organized as follows. Section 2 presents a brief description of the operations involved in a CBIR system, and the grid implementation developed is discussed in Section 3. Section 4 shows some performance results achieved by this implementation and, finally, Section 5 presents some conclusions and ongoing work.

R. Meersman, Z. Tari et al. (Eds.): OTM 2006, LNCS 4276, pp. 1412–1421, 2006. © Springer-Verlag Berlin Heidelberg 2006

2 CBIR System Operation

Smeulders et al. [3] reveal the importance of CBIR techniques. An example of this is the management of the huge quantity of multimedia information that anyone can access on the Internet. This information must be efficiently accessed in a user-friendly way, and CBIR systems can provide proper solutions. In our case, we are dealing with databases of 128x128 two-dimensional colour images that range from four hundred thousand to thirty-two million images. The search for images contained in a CBIR system can be broken down into the following stages:

1. Input/query image introduction. The user first selects an image to be used as a search reference and the system computes its signature1. Details about the retrieval techniques involved in the CBIR system can be found in [10,11]. The whole process can be implemented using an O(k) algorithm, k being the image's size, that performs in a very efficient way [12]. This stage does not require substantial computational resources since the system only deals with one image.

2. Query and DB image signature comparison and sorting. The signature obtained in the previous stage is compared with all the DB images' signatures using a metric based on the Euclidean distance. The identifiers of the p most similar images are extracted. Although a single comparison is not costly, the volume of computations to be performed is very high, since the signature of the input image must be compared with each of the image signatures stored in the system. Should it become necessary to incorporate a new signature into the group of the best p, the one with the worst ranking within the group is discarded and the set is sorted again. A bubble sorting algorithm with O(np log(p)) order has been used for this purpose, n being the number of images.

3. Results display and query image update. The system provides the user with a data set containing the p images considered most similar to the query one. If the result does not satisfy the user, he/she can choose one of the selected images or enter a new one that presents some kind of similarity with the required image, returning to stage 1.

1 The signature is a vector of some features that represents the content of the images.

Looking at the operations involved, the comparison and sorting stage involves a much larger computational load than the others. Fortunately, since there are no dependencies, data parallelism can be exploited just by dividing the workload among n independent nodes. This can be accomplished by distributing the CBIR image signatures off-line across the processing nodes, each of which can then compare the query image's signature with every signature it stores locally.

Storage capacity also becomes a problem whenever the size of the CBIR system grows beyond a certain limit. The most efficient approach to solve this point, which additionally relaxes the per-node storage demands, is to distribute signatures, images and computations over an ensemble of n processing and storage nodes.
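The comparison and sorting stage, and the way it splits across nodes, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`euclidean`, `best_p`, `partition`) and the in-memory list of `(image_id, signature)` pairs are assumptions made for illustration, and a binary insertion into a sorted list is used in place of the bubble sort mentioned in the text.

```python
import bisect
import math
from typing import List, Sequence, Tuple

Signature = Sequence[float]


def euclidean(a: Signature, b: Signature) -> float:
    """Euclidean distance between two signatures of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def best_p(query: Signature,
           signatures: List[Tuple[int, Signature]],
           p: int) -> List[Tuple[float, int]]:
    """Return the p closest images as (distance, image_id), best first."""
    best: List[Tuple[float, int]] = []  # kept sorted by ascending distance
    for image_id, sig in signatures:
        d = euclidean(query, sig)
        if len(best) < p:
            bisect.insort(best, (d, image_id))
        elif d < best[-1][0]:
            best.pop()                      # discard the worst-ranked entry
            bisect.insort(best, (d, image_id))
    return best


def partition(items: list, n: int) -> List[list]:
    """Split the signature list into n disjoint shares, one per node."""
    return [items[i::n] for i in range(n)]
```

Because each share is searched independently, merging the per-node lists and keeping the p smallest distances gives the same answer as a search over the whole database, which is what makes the off-line distribution of signatures possible.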

3 Grid Implementation

3.1 Software Architecture

The grid implementation of a CBIR system corresponds to a distributed architecture of grid nodes with different complexity levels, ranging from standalone PCs or workstations to parallel systems like shared memory multiprocessors or clusters. Therefore, each of the grid nodes may run sequential or parallel versions of the software components of the whole application, resulting in a very heterogeneous system. The CBIR grid application can be decomposed into the following modules: User interface, CBIR grid management and Local search per grid node. All these modules have been programmed using the Globus Toolkit version 4 [13]. The next sections describe each of these elements.

3.2 User Interface

From a functional point of view, the main features of a grid system are flexibility and versatility. The software application must let the user specify the whole environment on every single system execution. This makes it possible to update the working configuration, adding or deleting specific nodes or databases according to the type and contents of the queries at a given moment. In this sense, the system presents the user with a simple interface where he/she can define parameters with two clearly different orientations:

– Grid features.
– CBIR features.

Both groups of features are described below in some detail.

Grid Features. The grid parameters cover aspects such as:

– Number of grid nodes where the query will be run.
– Selection of specific nodes where the execution will be performed.
– Computational power of each one of the system nodes. This is an optional parameter. If the user has some a priori knowledge about the performance of the processing nodes that belong to the grid, these data can be introduced to the application. In any case, the application collects some statistics about each node's performance in order to make a better assignment of the workload among all available resources in the grid in the future.

CBIR Features. Among the programmable CBIR parameters, we can mention:

– Name of the process that will perform first the search and then the sort of the signatures in each node. This is an optional parameter, and if it is not specified by the user, a default option is chosen.
– Search criterion: available criteria include colour, shape and a combination of both features. For each criterion, several possibilities exist: energies, colour histograms, multiresolution colour primitives, Hu and Zernike invariants, etc. [11,14].
– Metric used in the similarity computation stage for the computed input image signature and the precomputed signatures stored in the database.
– File name of the query image.
– Name of the selected database where the query is performed. This is an optional parameter and if this entry is left blank, the default value is selected.

3.3 CBIR_GRID Process

This process is in charge of collecting all the parameter specifications defined by the users and starting the execution in the distributed environment. It is also in charge of providing the fault tolerance of the whole system: if a grid node cannot finish the search in its own local database, a report of this fact is sent to the user, but the rest of the answers are not lost. This process can be decomposed into the following stages:

– Read the system information provided by the users. At this moment, the setup information about the query performed by the users is received and all the local data structures involved during the search process execution are initialized, as well as those sent to the remote nodes.
– Set up the testbed or grid system. At this point, the number of nodes that compose the testbed is fixed, then their availability is checked and finally, the permissions for executing remote jobs on them are verified.
– Select the remote jobs that have to be executed in each node of the grid.
– Compute the signature of the input image considering its energies or histograms, as described in [10,11,14].
– Send the previously computed input image signature to all the nodes that compose the grid. The GridFTP service provided by the Globus Toolkit version 4.0 is used for this purpose, specifically the secure command globus-url-copy.
– Once the signature is distributed, launch the execution of the jobs in each one of the grid nodes, searching over their own local databases. A script that receives as input the list of nodes that compose the grid, each with its corresponding class of node, has been programmed to perform this task. The remote execution of the jobs is based on the command or service globus-job-submit available in the Globus Toolkit version 4.0. This command is used instead of globus-job-run since it provides a non-blocking service, while globus-job-run performs a blocking execution of the submitted job. globus-job-submit returns an identifier for each job, so the state of a job can be checked with the command globus-job-status, passing the obtained job identifier as an argument. Monoprocessor nodes receive a sequential job, while cluster nodes receive a distributed job; in the latter case, the search is performed over a distributed database.
– Then, the CBIR_GRID process performs a loop that controls the state of each one of the launched jobs. The control interval of the jobs is also an application parameter, with a default value of two seconds. When a job ends, the process collects through the GridFTP service the partial results generated by the corresponding grid node. This way, a set of files with partial results is generated, one for each node of the grid.
– The last step in this process is to gather all the partial results and to select the best N. Then, the process picks up the N images from the corresponding local nodes and presents them to the user in a sorted mosaic view inside a user window.

3.4 Local_Retrieval Process

This process is in charge of performing the local search in each one of the grid nodes. It has two main functions:

– The comparison of the input signature with all the signatures stored in the database, considering the search criteria and metric distance specified by the user. These comparisons produce a set of similarity values that are stored in an output file.
– Once the results of all the comparisons are available, the output file is sorted, yielding an output ordered by similarity value.
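The control loop of the CBIR_GRID process and the final merge of the sorted per-node outputs can be sketched as follows. This is a hedged illustration, not the paper's code: `get_status` is a hypothetical callable standing in for a call to globus-job-status, the state names `"DONE"` and `"FAILED"` are assumed, and partial results are modelled as in-memory lists of `(distance, image_id)` pairs rather than files fetched through GridFTP. Smaller distances are assumed to mean more similar images.

```python
import heapq
import time
from typing import Callable, Dict, Iterable, List, Tuple

Result = Tuple[float, str]  # (distance to the query, image identifier)


def wait_for_jobs(job_ids: Iterable[str],
                  get_status: Callable[[str], str],
                  interval: float = 2.0,
                  sleep: Callable[[float], None] = time.sleep) -> Dict[str, str]:
    """Poll every pending job until it reports DONE or FAILED.

    `interval` mirrors the two-second default control interval in the text.
    """
    pending = set(job_ids)
    final: Dict[str, str] = {}
    while pending:
        for job in sorted(pending):
            state = get_status(job)  # hypothetical status query
            if state in ("DONE", "FAILED"):
                final[job] = state
        pending.difference_update(final)
        if pending:
            sleep(interval)
    return final


def merge_partial_results(partials: Iterable[List[Result]],
                          n_best: int) -> List[Result]:
    """Gather the per-node result lists and keep the n_best closest images."""
    return heapq.nsmallest(n_best, (r for part in partials for r in part))
```

A node whose job ends as `FAILED` can simply be left out of the merge and reported to the user, so the answers from the remaining nodes are still returned.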

4 Experimental Results

4.1 Experimental Setup

A set of experiments has been executed to test the behaviour of the CBIR grid implementation presented here. The tests have several objectives:

– to verify the feasibility of applying a distributed solution based on a grid,
– to estimate the overhead introduced by the Globus software,
– to analyze the grid response in order to optimize the distribution of the CBIR data among the grid nodes to achieve better performance.

The testbed used in the experiments presented next is composed of the aggregated resources of two VOs: the Department of Arquitectura de Computadores (DAC) of the Rey Juan Carlos University (URJC) and the Department of Arquitectura y Tecnología de Sistemas Informáticos (DATSI) of the Universidad Politécnica de Madrid (UPM). Each one of these VOs contributes the following resources2:

– DAC-URJC:
  Africa: CPU Intel(R) Pentium(R) IV, 2.80GHz
    • cache size L1: 512 KB
    • main memory: 1 GB DDR RAM
    • hard disk: ST380011A PCI, 20 GB
    • operating system: Linux version 2.6.8-2-686, Debian 1:3.3.5-12
    • network card bandwidth: 1 Gbps
  Artico: CPU Pentium III (Katmai) at 450 MHz
    • cache size: 512 KB
    • main memory: 128 MB DDR RAM
    • hard disk: Maxtor 91080D5 PCI, 10 GB
– DATSI-UPM:
  Baobab: One biprocessor node with the following features:
    • 2 CPUs, Intel(R) Xeon(TM) CPU 2.40GHz
    • cache size: 512 KB (each CPU)
    • main memory: DDR RAM 1 GB
    • hard disk:
      ∗ local: 4.6 GB
      ∗ NFS: NAS Intel Pentium 4 CPU 2.8GHz. Raid 0 over 4 disks with 160 GB.
    • message passing libraries: LAM/MPI 7.1.1
    • linux kernel vs 2.6.13 (Debian): shared by all the nodes
    • internal network with 2 network interfaces per node with a bandwidth of 1 Gbps per interface
  Brea: One biprocessor node with the following features:
    • 2 CPUs Intel(R) Xeon(TM) CPU 3.00GHz
    • cache size: 1024 KB (each CPU)
    • hard disk:
      ∗ local: 9.2 GB
      ∗ NFS: Intel(R) Pentium(R) 4 CPU 2.80GHz. NAS. Raid 0 over 4 disks with 160 GB.

2 Items not mentioned have the same setup as the previous resource.

Figure 1 shows the grid described above. Dashed lines group the computational resources available in each VO.

[Figure 1: network diagram showing the DATSI-UPM nodes (Baobab, Brea) and the DAC-URJC nodes (Africa, Artico), each group on its own LAN, linked through a WAN.]

Fig. 1. Scheme of the grid used in the experiments

4.2 Performance Results

Table 1 collects the user response time per grid node considering different sizes of the local database stored in each one of the grid nodes. The values are measured in seconds. Within a given row of the table, every node of the grid stores the same number of images, and each value gives the total time spent by that node to give an answer to the user. The wide range of values obtained reflects the heterogeneous nature of the available processing nodes.

Table 1. User response time (in seconds) of grid nodes for a query considering different local database sizes

Database size
per grid node    Africa    Artico    Baobab     Brea
       100000      1.63      6.81      0.98     0.87
      1000000     17.30     84.36     11.58     9.96
      2000000     33.89    173.94     23.05    20.55
      4000000     68.45    390.63     56.73    44.39
      8000000    136.75    816.54    125.50    88.48

Table 2 shows the user response time of the grid (in seconds) considering different database sizes.

Table 2. User response time (in seconds) for a query considering different database sizes with fair distribution of the images over the available grid nodes

 DB size      Grid    Globus overhead    Efficiency
  400000     29.25              22.44          0.23
 4000000    106.00              21.64          0.80
 8000000    202.89              28.95          0.86
16000000    420.21              29.58          0.93
32000000    846.16              29.62          0.96

Table 2 also includes the overhead introduced by the grid implementation, which remains almost constant in all cases. Finally, the efficiency reached by the system is also included, computed as:

$$E_n = S_n / n \qquad (1)$$

where

$$S_n = T_1 / T_n \qquad (2)$$

and $T_1$ is the execution time of the algorithm running on the slowest node of the grid and $T_n$ is the execution time of the algorithm carried out over n nodes of the grid.

As can be seen, the efficiency of the grid increases as the size of the database grows. This is explained by the almost constant overhead introduced by the grid implementation and by the lack of data dependencies in the most demanding stage of the CBIR application, which allows a full parallelization of the signature comparison and sorting. For small database sizes (400000 images), efficiency values are quite low in comparison with the values achieved for larger databases. The reason for this behaviour is that the query response times of the nodes are smaller than the time overhead introduced by the grid implementation, which lowers the efficiency. The grid overhead is due to several factors:

– Management of the auxiliary temporary files needed by the implemented algorithm.
– Globus overhead, which includes:
  • Security control mechanisms.
  • System state management.
– Network traffic.

The grid presented here allows a dynamic management of the CBIR system, since Globus provides a set of service implementations focused on infrastructure management. Specifically, GRAM (Grid Resource Allocation Management) supports the control of the resources available in the grid at a given moment.

The values in Table 1 suggest redistributing the workload among the grid nodes according to the response times achieved in the experiments, instead of using the same image database size for all the grid nodes. Considering a static assignment of the workload, the redistribution should be made as a function of the response time of each node relative to the slowest node of the grid:

$$DBs_i = \frac{T_1}{T_i} \cdot DBs_1 \qquad (3)$$

where $T_1$ is the execution time of the algorithm running on the slowest node of the grid, $T_i$ is the execution time of the algorithm on node i of the grid and $DBs_1$ is the database reference size in the slowest grid node.
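As a worked example, the short Python sketch below recomputes the overhead and efficiency columns of Table 2 from the measured times and applies Eq. (3) to one row of Table 1. Two points are inferences from the tabulated numbers rather than statements made in the text: the overhead column matches the grid time minus the slowest node's local time (Artico), and the efficiency column matches the ratio of those two times. The helper names (`table2_row`, `redistribute`) are hypothetical.

```python
# Per-node response times from Table 1 (seconds), keyed by images per node.
TABLE1 = {
    100_000:   {"Africa": 1.63,   "Artico": 6.81,   "Baobab": 0.98,   "Brea": 0.87},
    1_000_000: {"Africa": 17.30,  "Artico": 84.36,  "Baobab": 11.58,  "Brea": 9.96},
    2_000_000: {"Africa": 33.89,  "Artico": 173.94, "Baobab": 23.05,  "Brea": 20.55},
    4_000_000: {"Africa": 68.45,  "Artico": 390.63, "Baobab": 56.73,  "Brea": 44.39},
    8_000_000: {"Africa": 136.75, "Artico": 816.54, "Baobab": 125.50, "Brea": 88.48},
}

# Grid response times from Table 2 (seconds), keyed by total database size.
TABLE2_GRID = {
    400_000: 29.25,
    4_000_000: 106.00,
    8_000_000: 202.89,
    16_000_000: 420.21,
    32_000_000: 846.16,
}


def table2_row(total_size: int, n_nodes: int = 4) -> tuple:
    """Return (overhead, efficiency) under the inferred reading of Table 2."""
    per_node = total_size // n_nodes              # fair distribution
    slowest = max(TABLE1[per_node].values())      # slowest node's local time
    grid = TABLE2_GRID[total_size]
    return round(grid - slowest, 2), round(slowest / grid, 2)


def redistribute(times: dict, reference_size: int) -> dict:
    """Eq. (3): DBs_i = (T_1 / T_i) * DBs_1, with T_1 the slowest node's time."""
    t1 = max(times.values())
    return {node: round(t1 / t * reference_size) for node, t in times.items()}


if __name__ == "__main__":
    for size in TABLE2_GRID:
        print(size, table2_row(size))
    print(redistribute(TABLE1[1_000_000], 1_000_000))
```

With the 1000000-image row as reference, the slowest node (Artico) keeps its 1000000 images while the faster nodes receive proportionally larger shares. The resulting sizes can be rescaled afterwards if the total database size must stay fixed.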

5 Conclusions and Future Work

This paper focuses on analysing the feasibility of applying a grid solution to CBIR systems. In this work we have measured the efficiency reached by the grid in a real environment with very heterogeneous nodes. The implementation has been very satisfactory, achieving small user response times. This is due to the lack of data or algorithmic dependencies and to the small communication overhead. Thanks to the heterogeneity of the system, the communication overhead overlaps with the execution time in other grid nodes. These features also result in very good performance figures for the largest databases, where efficiency values higher than 90% have been achieved and the efficiency curve tends towards one. The experiments presented here show that the amount of overhead introduced by this implementation is almost constant, so the system is scalable with respect to the database size. In fact, this overhead is hidden by the improvements achieved by taking grid heterogeneity into account. To conclude, definite advantages of the grid implementation are its good price/performance ratio and system scalability.

Finally, further work will be devoted to analysing the response of the system after distributing the database of the CBIR system over different Virtual Organizations that include more complex grid nodes, like shared-memory multiprocessors or clusters of PCs or workstations. We also plan to incorporate load balancing mechanisms to dynamically redistribute the workload corresponding to the sorting stage and increase the global performance of the grid.


Acknowledgments

This work has been partially funded by the Spanish Ministry of Education and Science (grant TIC2003-08933-C02) and the Government of the Community of Madrid (grants GR/SAL/0940/2004 and S-0505/DPI/0235).

References

1. Berman, F., Fox, G., Hey, A.J., eds.: Grid Computing: Making the Global Infrastructure a Reality. John Wiley & Sons (2003) ISBN 0-470-85319-0.
2. del Bimbo, A.: Visual Information Retrieval. Morgan Kaufmann Publishers, San Francisco, California (1999) ISBN 1-55860-624-6.
3. Smeulders, A.W.M., Worring, M., Santini, S., Gupta, A., Jain, R.: Content-based image retrieval at the end of the early years. IEEE Transactions on PAMI 22(12) (2000) 1349–1380
4. Pitas, I., ed.: Parallel Algorithms for Digital Image Processing, Computer Vision and Neural Networks. John Wiley & Sons (1993)
5. Maresca, M., et al.: Special issue on parallel architecture for image processing. Proceedings of the IEEE 84(7) (1996) 913–1056
6. Muntz, R.R., Golubchik, L.: Parallel data servers and applications. Parallel Computing 24 (1998) 1–4
7. Zaki, M.J.: Parallel and distributed association mining: A survey. IEEE Concurrency 7(4) (1999) 14–25
8. Srakaew, S., Alexandridis, N., Nga, P.P., Blankenship, G.: Content-based multimedia data retrieval on cluster system environment. In Sloot, P., Bubak, M., Hoekstra, A., Hertzberger, B., eds.: High-Performance Computing and Networking. 7th International Conference, HPCN Europe 1999, Springer Verlag (1999) 1235–1241
9. Bosque, J.L., Robles, O.D., Rodríguez, A., Pastor, L.: Study of a parallel CBIR implementation using MPI. In Cantoni, V., Guerra, C., eds.: Proceedings on International Workshop on Computer Architectures for Machine Perception, IEEE CAMP 2000, Padova, Italy (2000) 195–204 ISBN 0-7695-0740-9.
10. Rodríguez, A., Robles, O.D., Pastor, L.: New features for Content-Based Image Retrieval using wavelets. In Muge, F., Pinto, R.C., Piedade, M., eds.: V Iberoamerican Simposium on Pattern Recognition, SIARP 2000, Lisbon, Portugal (2000) 517–528 ISBN 972-97711-1-1.
11. Robles, O.D., Rodríguez, A., Córdoba, M.L.: A study about multiresolution primitives for content-based image retrieval using wavelets. In Hamza, M.H., ed.: IASTED International Conference On Visualization, Imaging, and Image Processing (VIIP 2001), Marbella, Spain, IASTED, ACTA Press (2001) 506–511 ISBN 0-88986-309-1.
12. Stollnitz, E.J., DeRose, T.D., Salesin, D.H.: Wavelets for Computer Graphics. Morgan Kaufmann Publishers, San Francisco (1996)
13. Foster, I.: Globus Toolkit Version 4: Software for Service-Oriented Systems. In: IFIP International Conference on Network and Parallel Computing. Volume 3779 of Lecture Notes in Computer Science. Springer Verlag (2005) 2–13
14. Robles, O.D., Toharia, P., Rodríguez, A., Pastor, L.: Towards a content-based video retrieval system using wavelet-based signatures. In Hamza, M.H., ed.: 7th IASTED International Conference on Computer Graphics and Imaging - CGIM 2004, Kauai, Hawaii, USA, IASTED, ACTA Press (2004) 344–349 ISBN: 0-88986418-7, ISSN:1482-7905.