Keyword Based Indexing and Searching over Storage Resource Broker

Adnan Abid1,2, Asif Jan1, Laurent Francioli1, Konstantinos Sfyrakis1, and Felix Schuermann1

1 Ecole Polytechnique Federale De Lausanne (EPFL), Lausanne, Switzerland
2 NUST Institute of Information Technology (NIIT), Rawalpindi, Pakistan
{adnan.abid, asif.jan, laurent.francioli, konstantinos.sfyrakis, felix.schuermann}@epfl.ch
Abstract. This paper presents a keyword-based metadata indexing and searching facility for the Storage Resource Broker (SRB). SRB is a popular data grid storage system that provides means to store data and to associate metadata with the stored data. The metadata storage system in SRB is modeled on an attribute-value pair representation. This data structure enables SRB to be used as a general-purpose data management platform for a variety of application domains. However, the generic representation also proves to be a limitation for applications that depend on extensive use of the associated metadata to provide customized search and query operations. The presented work addresses this limitation by providing a keyword-based indexing system over the metadata stored in the SRB system. The system is tightly coupled with the SRB metadata catalog, thereby ensuring that the keyword indexes are always kept up to date with changes in the host SRB system.

Keywords: Data Management, Data Grid, Indexing, Storage Resource Broker.
R. Meersman and Z. Tari et al. (Eds.): OTM 2007, Part II, LNCS 4804, pp. 1233–1243, 2007. © Springer-Verlag Berlin Heidelberg 2007

1 Introduction

Storage Resource Broker (SRB) is a popular scientific data management system used in many projects in areas such as astronomy, high energy physics, and biology [1,2]. SRB is grid middleware that manages heterogeneous distributed storage resources, including file systems, database systems, and archival storage systems, and provides APIs to access and use these resources [3]. It also provides a sophisticated access control mechanism that allows fine-grained access management at the level of individual files and collections. Furthermore, SRB servers can be federated with one another [4] and may be configured to replicate data and metadata [5] in order to provide fault tolerance and increase system availability. Statistics from the SRB website show that SRB brokers more than 1.5 petabytes of data worldwide [6].

The SRB system consists of a server component and a metadata catalog. The server component acts much like an application server, mediating client access to the data stored in the SRB system. The files and directories (called collections in SRB jargon) stored in the SRB system are collectively known as SRB objects. The metadata catalog, on the other hand, may be seen as the brain of the SRB system. Abbreviated as MCAT, it stores the logical and physical locations of all SRB objects. It also stores the metadata values associated with each SRB object, object access rights and permissions, and object replication information. The MCAT comes in two formats: (1) a widely used attribute-value pair schema, and (2) a semantically richer EMCAT (Extended MCAT), which is in an early experimental state and has not so far been used at any SRB production site. While files and collections are both represented as SRB objects, their metadata is stored separately. The metadata attributes are further classified into system-defined attributes such as size, date, and owner; user-defined attributes, i.e. any combination of user-specified attribute-value pairs; and extended attributes, which are used in conjunction with EMCAT. The presented work focuses on the attribute-value pair MCAT only, but it can be extended to index EMCAT-based catalogs as well.

The rest of this paper is organized as follows. Section 2 explains the motivation for the presented work, Section 3 reviews existing research relevant to it, Section 4 describes the design and implementation of the presented system, Section 5 discusses the performance results, and Section 6 concludes the paper and highlights some areas of future work.
2 Motivation

The work presented here was carried out as part of the Neobase project [7], which aims to provide an optimal platform for storing neocortical microcircuit data. The current version of the system handles data resulting from electrophysiological recordings and morphological reconstruction experiments. In electrophysiological recordings, the cells (single cells and networks) are stimulated using predefined protocols, and the response of the cells is recorded. A single experiment, depending on the level of detail with which the protocol was carried out, may result in hundreds of traces. The other major type of data stored in the Neobase system is the morphological reconstructions of the cells. These reconstructions contain information about the cell geometry in three dimensions. Each cell may be reconstructed a number of times, resulting in more than one morphology file per cell.

To enable efficient storage and retrieval, the stored data needs to be augmented with metadata describing the experimental conditions (e.g. the nature of the protocol used and the duration of the experiment); information about the animal used for the experiments (e.g. age, gender, weight, and whether the subject was exposed to any special drug treatment); and various other properties of the cells, such as the type and the microcircuit layer from which the cell was recorded. Each of these metadata attributes may be used as a keyword for searching the data store. For example, users might be interested in experiments performed on animal subjects of a specific age, in experiments where a particular drug was used, or in cell-level metadata such as type and layer.

Search operations in the SRB system are executed on the metadata stored in the MCAT. The sought-for values of the metadata attributes are provided as search conditions, and the SRB system retrieves and presents the SRB objects that fulfill the search criteria.
The search can be performed via any of the SRB client libraries, e.g. Scommands (a command-line SRB interface), JARGON (a Java API for SRB), Matrix (a WSDL interface to SRB), MySRB (a web-based interface to SRB), and so on. The following are a few of the limitations that one encounters while searching the data stored in the SRB system:

1. The attribute names as well as their category (system defined, user defined, or extended) have to be known before issuing any search command. Search based on system-defined metadata is well supported by all SRB client APIs; support for user-defined attributes is nontrivial and often has a very verbose format; and support for EMCAT attributes is minimal. In short, the user has to know beforehand both the name of the attribute to search for and the category to which it belongs in the metadata catalog. For example, to find all Cells of Layer1 the user has to construct a query like “Type=Cell AND Layer=Layer1”, and also indicate that the metadata attributes belong to the user-defined category.

2. The search performance decreases as the data volume increases. This is a natural consequence of using an attribute-value pair database schema. The metadata for collections and files is stored in two different tables in the MCAT, so as the number of collections and files increases, the query turnaround time of the system also increases. This is especially troublesome when the user wants to search on file attributes because, as in a traditional file system, the number of files stored in SRB is far greater than the number of collections. Consequently, the table storing file metadata holds many more records than the one storing collection metadata.

3. One of the major limitations in querying the MCAT is that the SRB system supports OR based queries only. For example, suppose we want to search for files whose size is more than 500 MB and that are owned by the user ‘experimenter’.
The only way to perform this AND query is to decompose it into two parts: first find the files whose size is greater than 500 MB, then find the files owned by the user ‘experimenter’, and finally perform an application-level intersection of the two result sets. This significantly increases the processing time for client-side applications and results in a bad user experience. It also assumes that the result sets are small enough to be processed simultaneously, an assumption that will not hold for any realistic data store. For a search based on more than two parameters, the resulting application-level processing, memory requirements, and program complexity are more than what one would like to handle in SRB client programs.

4. It is very difficult to get an aggregated view of the semantics associated with the data stored in the SRB system. This kind of information is useful for getting an overview of the types of data stored in the SRB system, as well as for tracking frequently used metadata attributes. Generating such a meta-index, for instance a tag-cloud-like view of the metadata attributes of the stored SRB objects, requires retrieving the metadata associated with all SRB objects and constructing the meta-index at application level. This sort of logic results in increased processing time, high memory requirements, and poor response time for end users. Furthermore, since the view is generated on the client side with no server-side support whatsoever, applications have to repeat the process each time they want to generate such a view, or else rely on client-side caching.
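The client-side workaround described in limitation 3 can be sketched as follows. This is a minimal illustration only, assuming a hypothetical in-memory catalog (`CATALOG`) and a hypothetical helper `single_condition_query` standing in for one round trip to the metadata catalog; neither is part of any SRB client API.

```python
# Hypothetical in-memory stand-in for MCAT file metadata: object id -> attributes.
CATALOG = {
    "file_a": {"size_mb": 750, "owner": "experimenter"},
    "file_b": {"size_mb": 900, "owner": "analyst"},
    "file_c": {"size_mb": 120, "owner": "experimenter"},
}


def single_condition_query(predicate):
    """Return the ids of objects satisfying one condition (one query per condition)."""
    return {oid for oid, meta in CATALOG.items() if predicate(meta)}


def and_query(*predicates):
    """Emulate an AND query by intersecting per-condition result sets on the client."""
    result_sets = [single_condition_query(p) for p in predicates]
    return set.intersection(*result_sets) if result_sets else set()


def larger_than_500_mb(meta):
    return meta["size_mb"] > 500


def owned_by_experimenter(meta):
    return meta["owner"] == "experimenter"


# Both full result sets must be held in memory before they can be intersected.
matches = and_query(larger_than_500_mb, owned_by_experimenter)
print(sorted(matches))
```

Every additional condition adds one more full result set that has to be fetched and held in memory, which is the cost the text attributes to this workaround.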
To address the issues outlined above, the current research focused on improving the search capabilities of the SRB system through keyword-based indexing over the metadata catalog. The system builds an index over the metadata attributes that provides a flexible and scalable search interface for application programs, and it improves the user experience by reducing query turnaround time and by presenting a familiar, web-search-like interface. The next section provides an overview of related research and places the presented work in the context of existing efforts.
3 Related Work

The following paragraphs highlight existing research areas relevant to the presented work. First, we describe efforts to augment the SRB server using semantic and relational technologies. Second, we present an overview of keyword-based search over relational databases. Last, we contextualize our work with reference to these efforts.

3.1 Semantic Augmentation of the SRB Server

To remove the need to provide attribute names and values in SRB search criteria, Jeffrey and Hunter [8] developed a system that uses semantic information associated with the stored metadata. Their system extracts semantic information about the data stored in SRB from the metadata held in the MCAT. An ontology engine then applies rules to the extracted data. The system provides a semantic layer over the data retrieval process, thereby facilitating search operations. The issue with this system is that it is loosely coupled with the MCAT: it either requires synchronization each time the contents of the MCAT change or requires runtime loading of the MCAT contents, which may be very time consuming for metadata catalogs containing large amounts of data. Nevertheless, their work has been monumental in bringing the power of semantic technologies to data grid management.

3.2 Relational Augmentation of the SRB Server

To overcome the limitations posed by the generic attribute-value pair structure of the SRB system, some research projects have augmented SRB servers with relational database systems. In this setting, a relational database is used to model the entities in the problem domain. All metadata is stored in the customized relational database, and the raw data is stored in the SRB system. Search queries are executed on the relational database, and the results contain pointers to the SRB objects of interest to users.
Examples of this approach can be seen in [9, 10, 11]. A major deficiency of this approach is that it uses SRB as a mere file system and cannot capitalize on the rich set of data grid features of the system. The approach is also prone to synchronization issues between the relational database and the SRB-based backend. It may be noted that one of the objectives of EMCAT is to enable embedding a rich relational model inside the SRB environment, but the current implementation of EMCAT has not yet achieved that objective.
3.3 Keyword Search over Relational Databases

Keyword-based search interfaces to relational databases have been an actively pursued research subject [12, 13, 14, 15]. In [12], a symbol table is created to store the keywords relating to the database schema and the contained record sets. The query is then formulated by processing the keywords against the symbol table and the schema. Others provide a schema browsing facility by modeling the database as a graph [13]. The DataSpot Publisher takes one or more possibly heterogeneous databases, predefined knowledge banks such as a thesaurus, and user-defined associations, and creates a hyperbase; the Search Server then performs searches and navigation against the hyperbase [14]. In [15], information retrieval is based on interactive querying: the database is viewed as a graph, with data in vertices (objects) and relationships indicated by edges, and proximity is calculated as the shortest path. A recent article looked at the issues of keyword search in heterogeneous databases [16].

All of these approaches have focused on relational database technology only, but almost all of the concepts can be extended or applied to data stored in SRB and other data grid systems as well. The presented work complements the above efforts by bringing keyword-based querying concepts to search operations on the SRB-based data grid platform. The following section describes the design and implementation details of the system.
4 Design and Implementation Details

The schematic layout of the system is shown in Fig. 1. The keyword indexes are built on the data stored in the MCAT. The system maintains two indexes: (i) an index of all attribute names used in the local metadata catalog, and (ii) an index of the values of these attributes as specified in the metadata catalog. Each index also contains references to the relevant SRB objects. The indexing system has been designed and implemented as an extension to the standard metadata catalog, thereby ensuring that the indexes are kept up to date with changes in the underlying MCAT. The keyword index is accessible via the SRB client interface and a standard SQL interface. To search for a specific SRB object, all we need to know is a keyword (an attribute name or value) that might have been used to annotate the object in the metadata catalog. Users and applications are not required to know the underlying data structures in advance. The indexing also reduces the query turnaround time and facilitates the construction of aggregated metadata views such as tag clouds. The indexes are kept compact by minimizing data redundancy, thereby reducing the search space for query operations. These indexes also ensure that search performance is not affected by increased data volumes, especially in the case of files.

Fig. 1. Schematic Layout of the Indexing System

Indexing of the local SRB server is useful in many settings, e.g. where the federation and zone capabilities of SRB servers are not being used. However, there is also a need to construct meta-indexes over SRB zones and even over the whole SRB data grid. Such a meta-index can form the basis of a grid-wide search engine allowing users to discover the data available in various data grid settings. To demonstrate the usability of such a meta-index, a proof-of-concept index was created over the individual keyword indexes. This meta-index, depicted in Fig. 2, enables keyword searches in a multi-zone SRB environment. However, the meta-index is maintained outside the SRB environment, which incurs additional management and synchronization effort. The idea is being explored further in ongoing research, and what is presented here may be seen as a rudimentary illustration of the concept.
Fig. 2. Schematic Layout of a multi zone index
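The two local indexes described in this section can be illustrated with a small in-memory sketch. It assumes a hypothetical list of (object id, attribute name, attribute value) rows in place of the MCAT tables, and plain Python dictionaries in place of the index tables; the names `ROWS`, `build_indexes`, `keyword_search`, and `tag_cloud` are illustrative and do not come from the implemented system.

```python
from collections import Counter, defaultdict

# Hypothetical attribute-value metadata rows: (object_id, attribute_name, attribute_value).
ROWS = [
    ("cell_001", "Type", "Cell"),
    ("cell_001", "Layer", "Layer1"),
    ("cell_002", "Type", "Cell"),
    ("cell_002", "Layer", "Layer2"),
    ("trace_17", "Protocol", "StepCurrent"),
]


def build_indexes(rows):
    """Build the attribute-name index and the attribute-value index in one pass."""
    name_index = defaultdict(set)
    value_index = defaultdict(set)
    for object_id, name, value in rows:
        # Each index entry keeps references to the objects annotated with it.
        name_index[name.lower()].add(object_id)
        value_index[value.lower()].add(object_id)
    return name_index, value_index


def keyword_search(keyword, name_index, value_index):
    """Look the keyword up in both indexes; the caller need not know which one holds it."""
    key = keyword.lower()
    return name_index.get(key, set()) | value_index.get(key, set())


def tag_cloud(name_index):
    """Aggregated view: the number of objects annotated with each attribute name."""
    return Counter({name: len(objects) for name, objects in name_index.items()})


NAME_INDEX, VALUE_INDEX = build_indexes(ROWS)
print(sorted(keyword_search("Layer1", NAME_INDEX, VALUE_INDEX)))
print(tag_cloud(NAME_INDEX).most_common())
```

Because each distinct keyword is stored once with a set of object references, the lookup cost depends on the number of distinct keywords rather than on the number of metadata rows, which is the compactness argument made above.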
4.1 Thesaurus Support for Facilitating Search Operations

An ontology designed along the lines of the Gene Ontology [17] provides thesaurus or dictionary support to facilitate search operations by end users. The Gene Ontology consists of terms and their relationships. Using these constructs, one can design a controlled vocabulary for the problem domain. We used these concepts to define a custom ontology describing the cells, their connections, and the properties with which they might be annotated. For each of these properties, the possible values and known variations were also recorded. The resulting ontology is essentially a superset that lists the known keywords and their values for a given domain. For example, metadata for cellular-level data may contain an attribute “Layer” whose value is given as a number, e.g. 1, or as a string variation such as “L1” or “Layer1”. The search string provided by the user is tokenized and compared with the terms and relationships in the ontology before the search is executed on the index. This enabled us to present results that might not have been easy to extract using schema-based or structure-based queries on the metadata catalog. Another advantage of this approach is that with each new object deposited in SRB, we are able to supplement the known terminologies (and data dictionaries). Because SRB allows annotation with any combination of metadata attributes, this also grows the knowledge base of attribute names and values used at a particular site or collaboration.
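The tokenize-then-compare step can be sketched as below. This is an assumption-laden illustration: the ontology is reduced to a flat, hypothetical synonym table (`THESAURUS`), term relationships are ignored, and the tokenizer is a simple regular expression rather than whatever the implemented system uses.

```python
import re

# Hypothetical controlled vocabulary: canonical term -> known variations.
THESAURUS = {
    "layer1": {"1", "l1", "layer1"},
    "cell": {"cell", "cells"},
}

# Reverse lookup built once: variation -> canonical term.
VARIANT_TO_CANONICAL = {
    variant: canonical
    for canonical, variants in THESAURUS.items()
    for variant in variants
}


def tokenize(search_string):
    """Split a free-text search string into lower-case alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", search_string.lower())


def normalize_query(search_string):
    """Map each token to its canonical term; unknown tokens pass through unchanged."""
    return [VARIANT_TO_CANONICAL.get(token, token) for token in tokenize(search_string)]


print(normalize_query("Cells L1"))
```

The normalized tokens would then be looked up in the keyword index, so that a query using “L1” also reaches objects annotated with “Layer1”.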
5 Results and Discussion

The following paragraphs provide an overview of the performance analysis of the keyword indexing approach and compare it with other database optimization techniques that may be used to improve the performance of the MCAT database. The experiments were conducted on a dual-processor machine with 2 gigabytes of memory, using an Oracle 10g based metadata catalog. The metadata tables contained 426349 entries. The following five approaches were used in the study:

1. Creating additional indexes on the MCAT metadata tables.
2. Creating views of distinct values on the metadata attribute name and attribute value columns of the file and collection metadata tables in the MCAT.
3. Creating a materialized view of the columns containing the object id, attribute name, and attribute value in the metadata tables, and indexing all three columns.
4. Performing searches on the default MCAT installation using JARGON.
5. Creating keyword-based indexes using additional tables created in the metadata catalog via the extended MCAT mechanism.

In the first approach, indexes were created on the tables containing the metadata attribute names and values, on the columns containing objectIds and attributeValues or keywords. In the second approach, a view was created for the distinct keywords of the tables containing file and collection metadata. The difference between these two approaches is that in the first approach there were numerous entries that contained no metadata attributes; these empty entries increased the size of the table and hence severely affected query performance. In the second approach these extra rows were eliminated in the created view, so the resulting view was smaller than the table used in the first approach. In the third approach, the view created in approach 2 was materialized, and the query was executed on this materialized view instead of the actual table. The fourth approach used the default MCAT provided as part of SRB and the JARGON API to query the metadata catalog. The fifth approach created keyword and value indexes supplementing the default MCAT; the metadata was pre-processed and each keyword and value was associated with the corresponding SRB object. The following graph shows the averages of the performance metrics.

[Figure: bar chart "Approach vs Time (ms)". x-axis: Tables with Indexes, Table with Views, Table with Materialized Views, JARGON, Extra Tables. y-axis: time in ms, 0 to 8000. Bar labels: 7434, 5724, 5435, 5282, 3627.]
Fig. 3. Query Turn around times using different strategies
The y-axis of the graph shows the time in milliseconds taken to perform a set of search operations, and the x-axis shows the MCAT indexing or optimization method. It is evident from the graph that using additional tables for indexing keywords and their values gives the best performance: it is almost twice as fast as searches performed on the default MCAT installation. The other approaches (extra indexes, additional views, and materialized views) also improve on the default MCAT implementation. However, unlike keyword-based indexing, their performance remains sensitive to growth in the volume of data stored in the SRB system. These methods also do not provide keyword-based search support and suffer from the same limitations as described in Section 2.

5.1 Advantages

The following are some of the benefits of using a keyword-based indexing scheme:
1. The system extends the existing search facilities offered by the SRB system with an additional feature: keyword-based search over the SRB data store. The familiar keyword-based mechanism makes the system friendly to end users. A very simple user interface is provided, consisting of a text box in which the user enters search keywords and executes the search. An advanced search interface is also available, in which the user can build a complex query with conditions such as “and” and “or”.

2. The system performance does not suffer from increases in data volume. The benefits are especially evident when searching catalogs that contain large amounts of metadata. The recorded experiments show that, on average, a search using the proposed indexing system takes almost half the time of a search using the JARGON API for SRB.

3. The system provides an aggregated view of the metadata attributes used at a particular site and provides means to quickly generate summary reports for the stored data, thereby helping efforts to build a shared ontology or data dictionary.

4. The system enables users to specify the full range of SQL operators in search operations, thereby removing the single-connective restriction of the default catalog implementation shipped with the SRB system (see Section 2).

5. The system, once installed, is kept synchronized with changes made to the host MCAT system.

The only drawback of this system is that it requires additional space for storing the indexing information, but the potential benefits in terms of improved search operations outweigh this drawback. The average size of a record in the table storing the global keywords is 58 bytes, and in the table storing the zone information for a keyword in the global view it is 13 bytes. The same tables used in the local index setting have average record sizes of 24 and 35 bytes, respectively.

This demonstrates that the keyword-based indexes do not take a lot of space in the database, while they support a user-friendly query interface, improve the performance of SRB query processing, and present a global view of the metadata attributes used for annotating the data.
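One way to keep an additional keyword table synchronized with the metadata table, and to run AND and OR keyword queries against it, is sketched below. This is not the Oracle-based implementation described in this paper: it uses SQLite from the Python standard library, the table and trigger names are hypothetical, and the use of triggers for synchronization is an assumption made for the sake of the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical stand-in for the MCAT attribute-value metadata table.
CREATE TABLE metadata (object_id TEXT, attr_name TEXT, attr_value TEXT);
-- Extra keyword table: one row per distinct (keyword, object) pair.
CREATE TABLE keyword_index (keyword TEXT, object_id TEXT, UNIQUE (keyword, object_id));

-- Assumed synchronization mechanism: index names and values on every insert.
CREATE TRIGGER metadata_insert AFTER INSERT ON metadata BEGIN
    INSERT OR IGNORE INTO keyword_index VALUES (lower(NEW.attr_name), NEW.object_id);
    INSERT OR IGNORE INTO keyword_index VALUES (lower(NEW.attr_value), NEW.object_id);
END;

-- On delete, rebuild the index entries of the affected object from its remaining rows.
CREATE TRIGGER metadata_delete AFTER DELETE ON metadata BEGIN
    DELETE FROM keyword_index WHERE object_id = OLD.object_id;
    INSERT OR IGNORE INTO keyword_index
        SELECT lower(attr_name), object_id FROM metadata WHERE object_id = OLD.object_id;
    INSERT OR IGNORE INTO keyword_index
        SELECT lower(attr_value), object_id FROM metadata WHERE object_id = OLD.object_id;
END;
""")

conn.executemany(
    "INSERT INTO metadata VALUES (?, ?, ?)",
    [
        ("cell_001", "Type", "Cell"),
        ("cell_001", "Layer", "Layer1"),
        ("cell_002", "Type", "Cell"),
        ("cell_002", "Layer", "Layer2"),
    ],
)


def search_any(keywords):
    """OR search: objects annotated with at least one of the keywords."""
    keys = sorted({k.lower() for k in keywords})
    if not keys:
        return []
    marks = ",".join("?" * len(keys))
    sql = (f"SELECT DISTINCT object_id FROM keyword_index "
           f"WHERE keyword IN ({marks}) ORDER BY object_id")
    return [row[0] for row in conn.execute(sql, keys)]


def search_all(keywords):
    """AND search: objects annotated with every keyword, resolved inside the database."""
    keys = sorted({k.lower() for k in keywords})
    if not keys:
        return []
    marks = ",".join("?" * len(keys))
    sql = (f"SELECT object_id FROM keyword_index WHERE keyword IN ({marks}) "
           "GROUP BY object_id HAVING COUNT(DISTINCT keyword) = ? ORDER BY object_id")
    return [row[0] for row in conn.execute(sql, (*keys, len(keys)))]


print(search_all(["cell", "layer1"]))
print(search_any(["layer1", "layer2"]))
```

In this sketch the AND query is answered by the database itself, so the client never has to fetch and intersect full result sets, and the index follows inserts and deletes on the metadata table without a separate synchronization step.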
6 Conclusion and Future Directions

The presented work demonstrates that keyword-based searching can be very useful for discovering the data stored in data grids, and can help to overcome some of the performance and usability issues encountered with the current generation of grid data management tools. Much of what is presented here has very practical relevance to projects and teams using the SRB system. SRB, while offering a very rich set of functionality, does suffer from hard-to-use query interfaces. Providing a keyword-based interface for searching the metadata helps to lower that barrier and adds value to the core system. It should also be noted that much work still needs to be undertaken in order to provide a robust, scalable, and widely adaptable keyword-based search infrastructure. The current study nevertheless demonstrates the feasibility as well as the applicability of such efforts. There are two areas where additional research efforts need to be directed:
1. Formalizing the keyword-based indexing of the local MCAT structure and improving the performance as well as the relevance of the constructed indexes. Of particular interest is the work on building a zone-wide meta-index, and even a global index providing a universal search interface to SRB-based data grids. This can also form the basis of a meta-index that is not limited to content stored in SRB-based data grids but also indexes content made available through other data grid infrastructures.

2. Extending the work on semantically augmenting the SRB metadata catalog to allow complex reasoning on the stored data. Used in conjunction with the global meta-index, this can further help to uncover useful data and facts stored in SRB-based data grids.
Acknowledgements

The work presented here was supported by the Blue Brain Project at Ecole Polytechnique Federale De Lausanne (EPFL). We thank the SRB team for helping us during various phases of the project. We also thank Fabio Porto at the EPFL database laboratory for his useful discussion on the subject.
References

1. Moore, R., Chen, S.-Y., Schroeder, W., Rajasekar, A., Wan, M., Jagatheesan, A.: Production Storage Resource Broker Data Grids. In: e-Science 2006. Second IEEE International Conference on e-Science and Grid Computing, p. 147 (2006)
2. Current Projects Using SRB, http://www.sdsc.edu/srb/Projects/main.html
3. Baru, C., Moore, R., Rajasekar, A., Wan, M.: The SDSC Storage Resource Broker. In: Proc. CASCON 1998 Conference, Toronto, Canada (1998)
4. Rajasekar, A., Wan, M., Moore, R.W., Schroeder, W.: Data Grid Federation, San Diego Supercomputer Center, npaci.edu (2004)
5. Rajasekar, A., Wan, M.: SRB & SRB Rack-Components of a Virtual Data Grid Architecture. In: ASTC 2002. Advanced Simulation Technologies Conference (2002)
6. What is SRB, http://www.sdsc.edu/srb/index.php/What_is_the_SRB
7. Muhammad, A.J., Markram, H.: NEOBASE: Databasing the Neocortical Microcircuit. Stud. Health Technology Inform. 112, 167–177 (2005)
8. Jeffrey, S.J., Hunter, J.: Semantic Augmentation of SRB. In: eScience 2005 (2005)
9. Martone, M.E., Gupta, A., Wong, M., Qian, X., Sosinsky, G., Ludaescher, B., Ellisman, M.H.: A cell centered database for electron tomographic data. J. Struct. Biol. 138, 145–155 (2002)
10. National Optical Astronomy Observatory, http://www.noao.edu
11. Ongoing work on relational extensions to the Neobase project at the EPFL
12. Agrawal, S., Chaudhuri, S., Das, G.: DBXplorer: A System for Keyword Based Search over Relational Databases. In: ICDE 2002. 18th International Conference on Data Engineering (2002)
13. Hulgeri, A., Nakhe, C.: Keyword Searching and Browsing in Databases using BANKS. In: Proceedings of the 18th International Conference on Data Engineering
14. Dar, S., Entin, G., Geva, S., Palmon, E.: DTL’s DataSpot: Database exploration using plain language. In: Proc. of the Int’l Conf. on VLDB, pp. 645–649 (1998)
15. Goldman, R., Shivakumar, N., Venkatasubramanian, S., Garcia-Molina, H.: Proximity search in databases. In: Proc. of the Int’l Conf. on VLDB, pp. 26–37 (1998)
16. Sayyadian, M., LeKhac, H., Doan, A.H., Gravano, L.: Efficient Keyword Search Across Heterogeneous Relational Databases. In: ICDE 2007. IEEE 23rd International Conference on Data Engineering (2007)
17. Gene Ontology Project, http://www.geneontology.org