
Merging File Systems and Data Bases to Fit the Grid

A. Hameurlain, F. Morvan, and A. Min Tjoa (Eds.): Globe 2010, LNCS 6265, pp. 13–25, 2010.
© Springer-Verlag Berlin Heidelberg 2010
Yves Denneulin¹, Cyril Labbé¹,³, Laurent d'Orazio², and Claudia Roncancio¹

¹ Grenoble University, LIG, France
  {First.Last}@imag.fr, http://www.liglab.fr
² Blaise Pascal University, LIMOS, France
  [email protected], http://www.isima.fr/limos/
³ Monash University, DSSE, Melbourne, Australia

Abstract. Grids are widely used by CPU intensive applications that need to access data with high level queries as well as in a file based manner. Their requirements include accessing data through metadata of different kinds, system or application ones. In addition, grids provide large storage capabilities and support cooperation between sites. However, these solutions are relevant only if they supply good performance. This paper presents Gedeon, a middleware that proposes a hybrid approach to scientific data management for grid infrastructures. This hybrid approach merges distributed file system and distributed database functionalities, thus offering semantically enriched data management while preserving ease of use and deployment. Taking advantage of this hybrid approach, advanced cache strategies are deployed at different levels to provide efficiency. Gedeon has been implemented, tested and used in the bioinformatics field.

1 Introduction

Large architectures of the grid or cloud categories play a crucial role in distributed computing. Numerous efforts have been made to make them operational and well suited to a large variety of users. Data organization poorly suited to users' needs, together with data management at a large scale of distribution, are crucial issues and can be a major setback for potential users. For example, genome sequencing projects now number more than a thousand, and hundreds of complete sequences have been published. The different results are classified, analyzed, published and referenced on various geographic and logical sources, in the form of banks of raw or annotated data. Most applications used by researchers in biology use flat files as inputs and outputs. Such files are composed of sequences of entries which are read sequentially during processing. Efficient access to these files, coupled with the use of caches to save data transfers and reduce the number of I/Os due to such processing, is mandatory in this domain. The most common data organization abstraction


used by this population is the traditional file, hence most existing software is able to exploit it. Unfortunately, databases produce results in formats that make integration with legacy applications tedious, and their use in large scale distributed environments difficult. The goal of the Gedeon project is to propose a hybrid data management middleware providing easy, efficient and semantically rich data management: access times close to those of file systems, queries on large scale distributed architectures, and easy deployment without any heavy administration task. Metadata are structured into records of (attribute, value) pairs; queries are based on metadata properties combined with users' own annotations. Storage can be distributed over various sites with flexible configuration management and fault tolerance. The hybrid part consists in merging the functionalities of databases with the ease of use and efficiency of file systems, by giving the user a centralized file system view of the distributed data sources. The Gedeon middleware provides flexible caching solutions to improve response time according to the characteristics of the infrastructure and the users' needs. This paper presents the main choices of this hybrid data management middleware and reports on experiments carried out on a nation-wide grid (Grid'5000). Section 2 presents the data model and the querying facilities, Section 3 describes the main components of the Gedeon middleware, Section 4 explains how this middleware can be deployed to fit user requirements, Section 5 gives an account of a real use case in the field of bioinformatics, Section 6 reviews related work, and finally conclusions and future work are presented in Section 7.

2 Data Model, Querying Facilities and Interfaces

This section presents the main features provided to represent and access data with the Gedeon middleware.

2.1 Data Model

Much widely used scientific raw data is found in so-called flat files; as a matter of fact, most scientific applications need to be fed with this raw data. These data sets can be seen as large collections of records containing both data and metadata. The data model and data structures used are rather rough, and the metadata associated with data can often be compared to a collection of (attribute, value) pairs. This fundamental model of data structure is widely used for its simplicity and its openness to future extensions. Taking this fact into account, we aim at enhancing the management of such data by providing second order query capabilities (conditions on attribute values and names). No heavy data remodeling process is needed and the approach preserves the widely used file interface. Three levels of abstraction for data querying are provided. A data source is a set of files that can be accessed either locally or remotely across the grid. These files are composed of records; a record is a list of (attribute, value) pairs. The lowest level (file level) can be accessed with basic file interfaces, whereas the highest level (attribute-value pair) can be accessed through semantic queries. Hence, at the lowest level, a Gedeon data source is made of standard OS files. Since files are entities that operating systems handle efficiently, little additional overhead is incurred by the storage layer.

2.2 Querying Facilities

In Gedeon, queries can be used to retrieve records from distributed and/or replicated data sources at a grid scale. In a query, data sources are parameters and can be either files (local or remote) or aliases defining the composition of data sources (similar to views in standard DBMS). Compositions of sources (detailed in 3.3) are of different types: union, round-robin and join. The result of a query is a set of records retrieved as files. Navigation through these sets of records can be achieved in the same way a user navigates through files, thus providing a grid scale hybrid file system enriched with semantic queries (see Figure 1). These queries, mainly selection queries, enhance navigation by allowing the use of predicates to select records in a given source. This leads to the creation of a virtual set that can be considered as part of the file system tree. Predicates are regular expressions on attribute names or attribute values. Figure 1 shows a navigation enhanced with queries. For example, BigFlatDataFile> cd $Date==/1991/ designates a virtual node that should contain the result of the selection on date 1991 evaluated in BigFlatDataFile. To visualize the data in such a virtual node, an explicit extra request has to be expressed; this is done with the ls command. This choice allows a lazy evaluation of the queries in the whole path, which is particularly important for queries leading to a distributed execution. All the metadata can be accessed using a simple file system interface: the selection query is expressed as a path in the virtual file system. A simple ls command in the directory will display data of the records: the values of a default attribute or of an attribute specified by the user will be selected. A cat command on a file will give the whole record matching the query (the current directory) with the specified attribute value. Hence all the records can be accessed as files, thus preserving compatibility with legacy applications and providing a seamless integration of our middleware in an existing workflow.

>ls                                   List files in the current directory.
BigFlatDataFile
>cd BigFlatDataFile                   Go into the file.
BigFlatDataFile> ls                   List records in the current file.
Data1 Data2 Data3 ...
>cat Data1                            Display the whole Data1 record.
...
BigFlatDataFile> cd $Date==/1991/     Select records whose Date attribute
                                      value contains the string 1991.
BigFlatDataFile/$Date==/1991/> ls     List the current set of selected records.
Data2 Data21 Data322 ...
BigFlatDataFile/$Date==/1991/> cd /size/==123
                                      Within the current set, select records
                                      with an attribute name containing the
                                      string size and having 123 as value.

Fig. 1. The Gedeon semantic navigation through records in a large flat file

Modification of files is also a supported operation. Despite the fact that modifications of raw scientific data/metadata are quite unlikely¹, this operation is needed. As a matter of fact, Gedeon's join composition operation is very useful to enrich original raw data/metadata with information provided by the user, which is more subject to change. For efficiency purposes, Gedeon makes extensive use of advanced cache and replication techniques. Coherency between data stored in original data sources and replicated data in caches solely depends on the policy defined at cache level. We propose a generic cache service framework to support any coherency policy in the middleware. It is also possible to propose different policies according to each site's requirements. In the context of scientific data exploitation, high frequency modification of data/metadata barely exists, so we tested with a lazy policy.
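The selection predicates above can be modeled as regular expressions applied to attribute names and values. The following sketch (hypothetical Python, not Gedeon's actual C implementation; the record layout and function names are our own) shows how queries such as $Date==/1991/ or /size/==123 could be evaluated over a set of records:

```python
import re

# A record is a list of (attribute, value) pairs, as in the Gedeon data model.

def select(records, attr_pattern, value_pattern):
    """Keep records having at least one pair whose attribute name matches
    attr_pattern and whose value matches value_pattern (a second order query:
    both the name and the value are constrained by regular expressions)."""
    a_re, v_re = re.compile(attr_pattern), re.compile(value_pattern)
    return [r for r in records
            if any(a_re.search(a) and v_re.search(v) for a, v in r)]

records = [
    [("AC", "P21215"), ("DT", "01-AUG-1991")],
    [("AC", "Q00001"), ("DT", "15-MAR-2003")],
]
# Analogue of "cd $Date==/1991/" (here the date attribute is named DT):
hits = select(records, r"^DT$", r"1991")
```

The second order capability comes from matching the attribute name itself with a pattern, as the /size/==123 query in Figure 1 does.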

3 The Gedeon Middleware

This section presents the main components of the middleware. Section 3.1 gives the global picture and Sections 3.2, 3.3 and 3.4 detail them.

3.1 Main Components

The Gedeon middleware is composed of four elements:

– fuple is in charge of low level I/O with the underlying file system used for storage of the metadata,
– lowerG is the layer in charge of exploiting the metadata to give a unified view of them regardless of their real storage location,
– caches store query results locally and can be used for both local and remote requests,
– the interface (VSGF in the figure) used locally to access a data source can be either the lowerG API or a file system.

A query sent from an application is evaluated by lowerG. If the result (or some of its elements) is cached, the relevant data is retrieved automatically from the cache. If the query evaluation requires accessing a remote part of the data source, lowerG contacts the corresponding lowerG component on the remote nodes. If data are present locally, the fuple library is called to access them. If the result is a combination (see Section 3.3) of local and remote data, lowerG is in charge of aggregating the results.

¹ As they are composed of experiments' results.
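The evaluation flow just described can be summarized in a small sketch (hypothetical Python; the real lowerG is written in C and its API is not shown in the paper): check the cache, query the local fuple layer, forward to remote lowerG components, then aggregate.

```python
# Hypothetical sketch of lowerG's query evaluation flow. Names and
# signatures are illustrative, not Gedeon's actual API.

class LowerG:
    def __init__(self, local_records, remote_nodes=()):
        self.local = local_records         # records served by the local fuple layer
        self.remotes = list(remote_nodes)  # peer lowerG components
        self.cache = {}                    # query key -> result

    def evaluate(self, query_key, predicate):
        if query_key in self.cache:                      # 1. cached result?
            return self.cache[query_key]
        parts = [r for r in self.local if predicate(r)]  # 2. local fuple access
        for node in self.remotes:                        # 3. forward to remote lowerG
            parts.extend(node.evaluate(query_key, predicate))
        self.cache[query_key] = parts                    # 4. aggregate (union here)
        return parts

remote = LowerG([{"ID": "id1"}])
local = LowerG([{"ID": "id0"}], remote_nodes=[remote])
result = local.evaluate("all", lambda r: True)
```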

3.2 The Fuple I/O and Request Library

The goal of the fuple library is to provide access to and modification of the underlying files, and to query the records inside a file or a set of files. It was designed with three goals in mind: simplicity, robustness and performance. The operations offered by this library are low level functionalities: read/write records, read/write an (attribute, value) pair in a record. It also provides functions for record selection according to the expressions involved in the queries. As the query language is of the second order, queries may combine regular expressions on attribute values or attribute names. These operations are done using a file descriptor which can map to a local file or to a socket, hence handling remote and local access in a similar way. Access methods are optimized in ways that depend on the underlying storage of the records; for example, memory mapping is used for local access. Pre-compiled queries are also supported, both to skip the syntactic analysis step and to obtain queries in a format appropriate for sending over the network to other nodes. Query results may be stored in a cache component in order to avoid another evaluation of the same query, or of one that is a subset of an already evaluated one (see Section 3.4).
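As an illustration of the memory-mapped local access path, here is a hypothetical Python sketch (the real fuple library is C; the record layout below, 'attr value' lines separated by blank lines, is an assumption for the example):

```python
import mmap
import os
import tempfile

def scan_records(path):
    """Yield records from a flat file through a memory mapping. A record is
    modeled here as a blank-line-separated block of 'attr value' lines."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            for chunk in bytes(mm).split(b"\n\n"):
                if chunk.strip():
                    yield [tuple(line.split(" ", 1))
                           for line in chunk.decode().strip().split("\n")]

with tempfile.NamedTemporaryFile("wb", delete=False) as tmp:
    tmp.write(b"AC P21215\nDT 01-AUG-1991\n\nAC Q00001\n")
recs = list(scan_records(tmp.name))
os.unlink(tmp.name)
```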

3.3 The lowerG Distribution Component

lowerG is the component in charge of the distribution aspect: nodes running fuple and lowerG can be used to evaluate queries on data scattered over various sites. The choice of deployment for the data sources (the files handled by fuple) defines the evaluation plan for a query. Each query contains a data source handle that describes which set of data the query should be evaluated on. A local correspondence associates this handle with a set of local and/or remote data.

Source baseP:              Source baseS:
ID id0                     ID id0
A  value1                  X  value10
B  value2                  Y  value11
C  value3
D  value4                  ID id1
                           X  value12
ID id1                     Y  value13
A  value5
B  value6
D  value7

join key: ID, primary source: baseP, secondary source: baseS, select(true)

Result:
ID id0                     ID id1
A  value1                  A  value5
B  value2                  B  value6
C  value3                  D  value7
D  value4                  X  value12
X  value10                 Y  value13
Y  value11

Fig. 2. Example of join operation in Gedeon

Sources can be exploited in various ways depending on their local description. For example, if several sources are identical (they contain the same data), they can be used as a way to provide fault tolerance. In case they are disjoint, the response to a request will be the union of the responses from the various sites. And finally, in case of complementary sources, the result has to be built by joining them. The possible descriptions of a source in Gedeon are:

local: the source is a locally stored file. Any deployment branch ultimately ends in a local source.
remote: the source is remote, hence the request will simply be forwarded and the results exploited locally.
union: executes requests on the data sources and aggregates the results. This is the most common way to distribute a base among various sites, by splitting its list of records over different nodes to take advantage of parallelism.
round robin: executes the request on the sources in a round-robin way, allowing load balancing.
join: quite similar to the classical equi-join operation well known from relational databases. Figure 2 shows an example where the join attribute, designated by KEY, is ID. It is worth noting that records in a source, for example baseP, may have different attributes; they do not have to conform to a strict schema.
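The join composition of Figure 2 can be sketched as follows (hypothetical Python; the actual implementation is the C lowerG component). Note that records need not share a schema, only the join key:

```python
def equi_join(primary, secondary, key="ID"):
    """Merge records of the secondary source into matching records of the
    primary source, using `key` as the join attribute (cf. Fig. 2)."""
    sec_by_key = {r[key]: r for r in secondary}
    out = []
    for rec in primary:
        match = sec_by_key.get(rec[key], {})
        merged = dict(rec)
        # Records may have heterogeneous attributes; merge whatever is there.
        merged.update({a: v for a, v in match.items() if a != key})
        out.append(merged)
    return out

baseP = [{"ID": "id0", "A": "value1", "B": "value2", "C": "value3", "D": "value4"},
         {"ID": "id1", "A": "value5", "B": "value6", "D": "value7"}]
baseS = [{"ID": "id0", "X": "value10", "Y": "value11"},
         {"ID": "id1", "X": "value12", "Y": "value13"}]
joined = equi_join(baseP, baseS)
```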

3.4 Cache

In the context of large scale systems managing high volumes of data, our objective is to optimize the evaluation of queries processed by lowerG in order to reduce the response time. This section presents our caching solution.

Semantic caching. Gedeon uses a new cache approach, called dual cache [13,11], for managing query results. Dual cache is based on the flexible cooperation between a query cache and a record cache: the query cache associates queries with sets of record identifiers, while the record cache enables retrieving records by their identifiers. When a query is submitted to a dual cache, it is first forwarded to the query cache, which may result in a hit or a miss. There is a query hit if entries of the query cache can be used to answer the query; a set of identifiers is then retrieved and used to load the corresponding records via the record cache. Dual cache optimizes the resources used by the cache, avoiding replication of records shared by several queries and allowing more cache entries to be stored. It can store query results in the query cache without keeping the corresponding records in the record cache. As a consequence, the load on servers may be lower and access may be more efficient. In addition, transfers at record granularity help reduce the amount of data transferred, by avoiding the retrieval of already stored records. Flexible configuration of query caches and record caches is particularly relevant to establish fine cooperation, allowing different cooperation schemes for query caches and record caches.

Cooperative caching. In order to manage cooperation in large scale environments, we proposed to adopt a generic notion of proximity [12]. This makes it possible to manage relevant networks of cooperative caches, where the proximity between caches is computed according to physical (load, bandwidth, etc.) and/or semantic (data, interest, etc.) parameters. This generic notion of proximity is very flexible. Besides making it possible to reproduce existing approaches such as peer-to-peer cooperation (where proximity can be measured by the number of hops), proximity facilitates the setup of dynamically adaptable networks of caches. Proximity is particularly useful in dual cache, since different proximities can be used for query caches and record caches. Using a physical proximity based on the characteristics of the infrastructure for record caches enables fine load balancing: we have established networks of record caches located in the same cluster, taking advantage of the high communication capabilities available inside clusters. Using a semantic proximity to build cooperative semantic caches reduces the load on data sources: cooperation between query caches makes it possible to avoid evaluating some queries, the corresponding records being accessed via their identifiers. Semantic proximity can be based on communities of interest.
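A minimal model of the query-cache/record-cache cooperation in dual cache might look as follows (hypothetical Python sketch of the principle; the actual cache framework is written in Java and its API differs):

```python
class DualCache:
    """Query cache maps a query to record identifiers; record cache maps an
    identifier to the record itself. Records shared by several queries are
    therefore stored only once."""
    def __init__(self):
        self.query_cache = {}    # query -> set of record ids
        self.record_cache = {}   # record id -> record

    def put(self, query, records):
        self.query_cache[query] = {r["ID"] for r in records}
        for r in records:
            self.record_cache[r["ID"]] = r   # no duplication across queries

    def get(self, query, fetch_record):
        ids = self.query_cache.get(query)
        if ids is None:
            return None                      # query miss
        # Query hit: load records via the record cache, fetching any misses
        # (a query may be kept even if its records were evicted).
        return [self.record_cache.get(i) or fetch_record(i) for i in ids]

cache = DualCache()
cache.put("Date==1991", [{"ID": "id0"}, {"ID": "id1"}])
cache.put("Date==1991 && size==123", [{"ID": "id0"}])   # id0 stored once
```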

4 Flexibility of the Deployment

In this section we explain how the deployment of our middleware can suit application needs, by presenting first the characteristics of a deployment and then a typical example for a bioinformatics application. We have seen in Section 3.3 that many combinations of data sources are possible, to aggregate resources or knowledge, or to provide fault tolerance, load balancing, etc. Any node, typically a computer or a cluster, can have its own view of the data stored on the sources, composing them in any way it wants. In the current prototype the deployment is defined in a file, a parameter of the lowerG component. The middleware makes it possible to dynamically change this composition to suit a given application's needs. A major concern in this case is the performance penalty incurred by this flexibility, namely the amount of communication that remote requests generate. As presented in Section 3.4, we address this problem by using heterogeneous caches deployed on all the sites and at various levels, in order to minimize the amount of data exchange needed to evaluate a query that refines a preceding one. A typical use case of our middleware is the sharing of a set of records between researchers worldwide. Each researcher has a different view of the records according to the set of sources he wants to use. Typically a researcher will want to use a reference base (Bref) enriched with his own annotations (Banot), plus the ones of the community he belongs to (Bcom). Given that three sources of data exist (the reference one, his own, the community one), the base the user wants to deal with will be a composition of the three. Since the reference one (Bref) is supposedly remote and heavily used, it will be replicated; queries will be sent to one of the replicas using the round-robin description of a source. Bcom adds new attributes to the data, which are relevant for this particular community. This file can be either local or remote according to its status (more or less formal). Enriching Bref with Bcom is done by defining an alias Benrich1 using the join operation. Considering that Banot contains only attribute names already present in Bcom, Banot contains only additional information not yet available at the community level. Enriching Benrich1 with Banot is done by creating a new alias Benrich2 using the union operation. Figure 3 depicts the resulting deployment allowed by alias creation.

Bref Replica 1    Bref Replica 2    Bref Replica 3
        \                |                /
                   Round Robin
                        |
                   Alias Bref          Bcom
                         \              /
                              Join
                               |
                        Alias Benrich1       Banot
                               \              /
                                   Union
                                     |
                              Alias Benrich2
                                     |
                               Query Results

Fig. 3. Flexibility and composition of data sources in Gedeon

Such situations are commonly found in the field of bioinformatics, which is the field Gedeon has been developed for. We have presented in this section how deployment is done and a typical use case of our middleware. The prototype has been written in C for the fuple and lowerG components and in Java for the cache framework. Its performance is presented in the next section.
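The composition of Figure 3 can be sketched as follows (hypothetical Python; the lowerG deployment-file syntax is not reproduced in the paper, so sources are modeled here as record lists or callables, and all names and sample values are illustrative):

```python
import itertools

def make_round_robin(replicas):
    """Pick one replica per request, in round-robin order."""
    cycle = itertools.cycle(replicas)
    return lambda: next(cycle)()

def resolve(source):
    return source() if callable(source) else source

def join(primary, secondary, key="ID"):     # join composition, as in Sect. 3.3
    sec = {r[key]: r for r in resolve(secondary)}
    return [{**r, **sec.get(r[key], {})} for r in resolve(primary)]

def union(a, b):                            # union composition
    return resolve(a) + resolve(b)

# Three identical replicas of the reference base Bref, accessed round-robin.
bref_replica = lambda: [{"ID": "id0", "SQ": "MIFDGKVAII"}]
bref = make_round_robin([bref_replica, bref_replica, bref_replica])
bcom = [{"ID": "id0", "community_tag": "clostridia-wg"}]   # community base
banot = [{"ID": "id0", "note": "checked"}]                 # personal annotations

benrich1 = join(bref, bcom)       # Alias Benrich1: Bref enriched with Bcom
benrich2 = union(benrich1, banot) # Alias Benrich2: plus personal annotations
```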

5 Bioinformatics Experimentation

Files of the field: This section presents experiments with the Gedeon middleware on bioinformatics data. Files are composed of sequences of entries which are read sequentially during processing. The main goal of these experiments is to show how the performance of the fuple/lowerG I/O level, and the use of caches to save data transfers and reduce the number of I/Os, provide efficient access to these files. The SwissProt bank [6], used in these experiments, is a biological database of protein sequences of different species. It supplies a high level of annotation on the proteins. The bank used consists of a large flat ASCII file (750 MB) composed of sequences of entries, each structured as several lines. A line starts with a two-character code giving the type of data contained in that line. For example, a line AC P21215; contains the identification P21215 of the sequence, a line DT 01-AUG-1991; gives information about the date of creation (and modification) of a sequence, a line OC Bacteria; Firmicutes; Clostridia; Clostridiales; Clostridiaceae; corresponds to the classification of the organism, and SQ MIFDGKVAIITGGGKAKSIGYGIAVAYAK defines the sequence itself. Conversion of this kind of file into a Gedeon file, based on an attribute/value model, is quite obvious: one record per sequence, where a record is a list of attribute-value pairs whose attribute name is given by the first two characters of a line and whose value is given by the rest of the line. Using Gedeon, it is possible to easily build subsets of sequences to feed the everyday tools of the field, without any additional complex treatment. For example, building a file containing all entries with an OC line including the strings "Bacteria" and "Clostridia" can easily be expressed by the following query: "$OC==/Bacteria/ && $OC==/Clostridia/".
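The conversion described above can be sketched as follows (hypothetical Python; Gedeon's actual conversion code is not shown in the paper). Real SwissProt entries are terminated by a '//' line, which is used here as the record separator:

```python
def swissprot_to_records(lines):
    """One record per entry; each line 'XX value' becomes the pair (XX, value),
    the attribute name being the two-character line code."""
    record, records = [], []
    for line in lines:
        if line.startswith("//"):          # end of the current entry
            if record:
                records.append(record)
            record = []
        elif len(line) > 2:
            record.append((line[:2], line[2:].strip()))
    if record:
        records.append(record)
    return records

entry = ["AC   P21215;", "DT   01-AUG-1991;",
         "OC   Bacteria; Firmicutes; Clostridia;", "//"]
recs = swissprot_to_records(entry)
# Analogue of the query "$OC==/Bacteria/ && $OC==/Clostridia/":
hits = [r for r in recs
        if any(a == "OC" and "Bacteria" in v for a, v in r)
        and any(a == "OC" and "Clostridia" in v for a, v in r)]
```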


Generating the benchmark: Generally speaking, there are two main ways to generate workloads. The first one is to use real traces. This approach seems to give a good approximation of real use cases, but in the end a trace is just a particular case and often does not represent reality in its whole complexity. Furthermore, if the main purpose is to understand why a solution is adapted to a given context, the use of traces will not highlight the mechanisms in action. The second approach is to use a synthetic workload. Its main drawback is to be synthetic, but this type of workload can be tuned easily, and if traces are available they can be used for the choice and calibration of the model. Our purpose here is to illustrate the benefit of dual cache and to understand how it works, so a synthetic workload has been chosen: 1) a uniform workload is composed of queries not related to each other; if a semantic cache is efficient in this context, this ensures that the cache is interesting for the system (this kind of workload has been previously used in [9]); and 2) an Rx semantic workload [15]. With such a workload, queries correspond to progressive refinements: the first query is general and the following ones are more and more precise, thus reducing the set of matching elements. x stands for the ratio of subsumed queries (in Rx, x% of queries are issued by constraining former queries).

Experiment settings: Servers and caches have been deployed on Grid'5000 [7]. The database has been partitioned into three files of equivalent size, managed by three clusters². One node on each cluster is allocated to query evaluation. When a query is submitted, it is forwarded to the three clusters for a parallel evaluation (Gedeon union composition). One client with a 20 Mb cache is placed at Sophia Antipolis.
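An Rx workload of progressive refinements can be generated along the following lines (hypothetical sketch; the paper does not give the generator, so queries are modeled as tuples of predicates, where appending a predicate subsumes the previous query):

```python
import random

def rx_workload(n, x, seed=0):
    """Generate n queries, a fraction x of which refine the previous query
    (Rx workload: x% of queries are issued by constraining former queries)."""
    rng = random.Random(seed)
    queries, current = [], []
    for i in range(n):
        if current and rng.random() < x:
            current = current + [f"p{i}"]   # refinement: add a constraint
        else:
            current = [f"p{i}"]             # fresh, unrelated query
        queries.append(tuple(current))
    return queries

w = rx_workload(100, 0.5)
# Count consecutive pairs where the second query subsumes the first.
refined = sum(1 for a, b in zip(w, w[1:])
              if b[:len(a)] == a and len(b) > len(a))
```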

Fig. 4. Experiment results: (a) mean response time according to the ratio of subsumed queries; (b) access rate on servers according to the ratio of subsumed queries

Figure 4(a) shows that using a dual cache globally improves the response time. The figure presents the mean response time (in seconds) according to the ratio of subsumed queries. Results show that the more frequent refinements are, the more efficient a dual cache is. As illustrated by Figure 4(b), which presents the impact of the ratio of subsumed queries (given in percent) on the load on the servers, the higher this ratio, the less the servers are contacted. In fact, when the number of refinements increases, the number of extended hits increases. An extended hit occurs when a query is included in a region present in the cache; as a consequence, the associated answer is processed in the cache. In addition, dual cache makes it possible to keep a large number of queries, and can put a query in the cache even if the corresponding objects cannot be stored.

² Sophia Antipolis: bi-Opteron 2.2 GHz, 4 GB memory, SATA disk; Bordeaux: bi-Opteron 2.2 GHz, 2 GB memory, IDE UDMA disk; Grenoble: bi-Xeon 2.4 GHz, 2 GB memory, IDE UDMA disk.

6 Related Work

This paper touches on different domains related to large scale data management. In this section we present some of the main works related to grid and cloud computing. Gedeon is not the only middleware for scientific data management in data grids. The Globus Alliance [3] has proposed the Globus Toolkit [14], a middleware that is adaptable and tunable according to the considered application. It is composed of a set of service libraries and software, developed by the community, tackling large scale distribution issues. Different categories of services are proposed: execution environments, service discovery and monitoring, data management and security. Globus is a service oriented architecture, and the proposed services can be extended to fulfill the requirements of specific data sources or applications. Like Globus, Gedeon follows a modular approach enabling extension of the base core, but at record granularity. Gedeon aims to be used in light grids, which require easy deployment, transparent data access through files and high performance. The Globus architecture is complex, hard to deploy and to tune, making it too heavy a solution to consider in such grids. gLite [2], a proposal from the European project EGEE [1], is a service oriented middleware to build applications for data and computing grids. Available services include job and data managers as well as security and deployment services. The architecture is quite hard to deploy and, as a consequence, does not seem suitable for light data grids. From a data management point of view, gLite only considers file granularity. The consideration of metadata, integrated in Gedeon for data querying and exploitation, is limited there to a metadata catalogue for file search and a virtual file system. It has to be noted that this virtual file system is a common aspect of many approaches, but that file granularity often limits the associated functionalities. SRB [5] is a middleware massively used by different scientific communities. Its main goal is to supply transparent and uniform access to distributed data in grids. SRB supplies different interfaces to manage data, going from command line facilities enabling navigation via a virtual file system, to a web portal. The main service is the metadata catalog (MCAT), which makes it possible to transparently query the available collection of sources. Sources can consist of files or databases. Metadata are represented in files and do not enable fine-grained access inside a


file. SRB permits the construction of complex architectures, federating zones via meta-catalogs. Gedeon aims to tackle the same aspects, but eases information extraction from files and integrates more sophisticated caching mechanisms to improve performance. More generally, SRB is a well suited solution for indexing images, whereas Gedeon exploits files with complex contents, which are numerous in various domains (banks of genomes, proteins, etc.). Mobius [4] proposes an architecture quite similar to SRB. The GME (Global Model Exchange) model represents available metadata and data via an XML exchange schema; querying and discovery of resources and services are based on an XPath-like language. Such a solution is not especially advanced, but has the specificity of being XML-based to exploit distributed data and services on grids. The researcher is in charge of producing the XML schema representing the data to be shared and must follow a difficult process to make the architecture usable. As a consequence, this solution is not widely used by scientists. Gedeon aims to propose an easier deployment. Cloud computing, driven by companies like Google, Microsoft and Amazon, has attracted much interest. Like grid computing, cloud computing supplies high computing and storage resources, but considers new problems, in particular the pay-as-you-go model. Different tools have been proposed to manage large data sets. In particular, one can mention the MapReduce [10] paradigm, a massively parallel execution environment, and MapReduce-based high level query languages such as Microsoft's SCOPE [8], Yahoo's Pig Latin [16] and Facebook's Hive [17]. Like these cloud languages, Gedeon aims at supplying a language for the analysis of large amounts of data. Unlike these solutions, Gedeon provides flexible parallel management configurations via its source composition capabilities, and it considers sophisticated cache mechanisms that enhance performance even without massive parallelism. Gedeon thus presents several attractive aspects: it consists of a light but modular architecture; it is based on a simple model for representing and storing metadata; data management is done at record level; and performance is tackled at every level.

7 Conclusion and Future Research

This paper presented the Gedeon middleware, a hybrid system for data management in grids. It merges functionalities and properties of both file systems and databases, thus providing the best of both worlds at a grid scale. It provides an enriched file system view of distributed data, making it possible to take advantage of grids while coping with legacy applications. These two dimensions are particularly important for scientific communities. The development of this middleware has been driven by three main goals: to offer semantically enhanced access to flat data files, to preserve efficiency at every level, and to provide easy deployment adapted to middle sized environments. This has been achieved by taking advantage of the natural organization already present in files, offering semantic operations for source composition, and using semantic caching and cache cooperation. Thus, Gedeon enables an efficient fine-grained exploitation of metadata in comparison with many solutions, by considering the record rather than the file as the unit of annotation. The prototype is operational; it has been experimented with (results have been reported in this paper) and used in the field of bioinformatics on real banks of (genome) sequences. Future work includes experiments in other application domains (e.g. cellular microscopy medical images). Gedeon, at this stage, provides a solid basis for the future work required to achieve a more complete system. Further experiments have to be done in order to compare Gedeon with existing approaches like Globus, gLite or MapReduce. Several issues need to be considered more closely, such as the choice of consistency policies in the cache network and the study of security policies in this context.

Acknowledgements

Thanks to O. Valentin for his contribution, the LIG and LIMOS teams for fruitful discussions, and the French Ministry of Research and Institut National Polytechnique de Grenoble for financial support. Experiments presented in this paper were carried out using the Grid'5000 experimental testbed, an initiative of the French Ministry of Research through the ACI GRID incentive action, INRIA, CNRS, RENATER and other contributing partners.

References

1. EGEE: Enabling grids for e-science, http://public.eu-egee.org/
2. gLite: middleware for grid computing, http://glite.web.cern.ch/glite/
3. Globus, http://www.globus.org/
4. The Mobius project, http://projectmobius.osu.edu/
5. SRB: the SDSC Storage Resource Broker, http://www.sdsc.edu/srb
6. Boeckmann, B., Bairoch, A., Apweiler, R., Blatter, M.-C., Estreicher, A., Gasteiger, E., Martin, M.J., Michoud, K., O'Donovan, C., Phan, I., Pilbout, S., Schneider, M.: The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res. 31(1), 365–370 (2003)
7. Cappello, F., Caron, E., Dayde, M., Desprez, F., Jegou, Y., Primet, P., Jeannot, E., Lanteri, S., Leduc, J., Melab, N., Mornet, G., Namyst, R., Quetier, B., Richard, O.: Grid'5000: A large scale and highly reconfigurable grid experimental testbed. In: Proceedings of the IEEE/ACM International Workshop on Grid Computing, Seattle, USA, pp. 99–106 (2005)
8. Chaiken, R., Jenkins, B., Larson, P.-Å., Ramsey, B., Shakib, D., Weaver, S., Zhou, J.: SCOPE: easy and efficient parallel processing of massive data sets. PVLDB 1(2), 1265–1276 (2008)
9. Chidlovskii, B., Borghoff, U.M.: Semantic caching of web queries. The Very Large Data Bases Journal 9(1), 2–17 (2000)
10. Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters. Communications of the ACM 51(1), 107–113 (2008)
11. d'Orazio, L.: Caches adaptables et applications aux systèmes de gestion de données répartis à grande échelle. PhD thesis, Institut National Polytechnique de Grenoble (December 2007)
12. d'Orazio, L., Jouanot, F., Denneulin, Y., Labbé, C., Roncancio, C., Valentin, O.: Distributed semantic caching in grid middleware. In: Proceedings of the International Conference on Database and Expert Systems Applications, Regensburg, Germany, pp. 162–171 (2007)
13. d'Orazio, L., Roncancio, C., Labbé, C., Jouanot, F.: Semantic caching in large scale querying systems. Revista Colombiana de Computación 9(1) (2008)
14. Foster, I.T.: Globus Toolkit version 4: Software for service-oriented systems. Journal of Computer Science and Technology 21(4), 513–520 (2006)
15. Luo, Q., Naughton, J.F., Krishnamurthy, R., Cao, P., Li, Y.: Active query caching for database web servers. In: Proceedings of the International Workshop on The World Wide Web and Databases, Dallas, USA, pp. 92–104 (2001)
16. Olston, C., Reed, B., Srivastava, U., Kumar, R., Tomkins, A.: Pig Latin: a not-so-foreign language for data processing. In: Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 1099–1110 (2008)
17. Thusoo, A., Sarma, J.S., Jain, N., Shao, Z., Chakka, P., Anthony, S., Liu, H., Wyckoff, P., Murthy, R.: Hive: a warehousing solution over a map-reduce framework. PVLDB 2(2), 1626–1629 (2009)