Data Storage in Clouds Bibliography Report

Radu Marius Tudoran [email protected]

Supervisors: Gabriel Antoniu, Luc Bougé, Alexandra Carpen-Amarie {Gabriel.Antoniu,Luc.Bouge,Alexandra.Carpen-Amarie}@irisa.fr

ENS de Cachan, IFSIC, IRISA, KerData-Team January, 2011

Keywords: Cloud Computing, Cloud Storage, MapReduce, IaaS, PaaS, Data-Intensive Applications

Contents

1 Introduction
2 Why cloud computing?
   2.1 Motivation
   2.2 Grid vs. Cloud
   2.3 Cloud Computing Landscape
       2.3.1 Cloud Architectures and classification
       2.3.2 Main Challenges
   2.4 Sky Computing
3 Cloud Data Intensive Applications
   3.1 Scientific applications
       3.1.1 MapReduce
       3.1.2 Scientific workflows
   3.2 Business possible applications
4 Cloud storage systems: State of the art
   4.1 IaaS Storage
       4.1.1 A few examples
       4.1.2 Conclusions
   4.2 PaaS Storage
       4.2.1 Description and problems
       4.2.2 Conclusions
   4.3 Conclusions
5 Discussion
6 Conclusions

1 Introduction

The amount of data processed today is extremely large, and more and more applications can be classified as data-intensive. The spectrum of such applications is very wide, ranging from governmental and commercial statistics, climate modeling, cosmology, genetics, bio-informatics and high-energy physics [30] to commercial ones. With the emergence of recent infrastructures like cloud computing platforms, achieving highly efficient data management is a critical challenge, since overall application performance depends heavily on the properties of the data management service. The purpose of this work is to analyze the characteristics of this new infrastructure, called the cloud, and to focus on its storage capabilities. We aim to identify the main requirements of data-intensive applications and contrast them with existing solutions, in order to highlight the main obstacles that must still be overcome in the area of cloud storage.

2 Why cloud computing?

Cloud computing appeared as a business necessity, driven by the idea of using an infrastructure without managing it. Although initially this idea was present only in academia, it was recently transposed into industry by companies like Microsoft, Amazon, Google and Yahoo!. This makes it easier for new startups to enter the market, since the cost of the infrastructure is greatly diminished, and allows developers to concentrate on the business value rather than on the starting budget. The clients of commercial clouds rent computing power (virtual machines) or storage space (virtual space) dynamically, according to the needs of their business. Clouds are the latest step in the evolution of distributed systems, the predecessor of the cloud being the grid. An analysis of this evolution and of the specifics of each of these technologies is given below, the purpose being to identify as clearly as possible what the cloud is.

2.1 Motivation

Before looking at the motivations that led to the cloud computing paradigm, it makes sense to define the concept. One accepted definition is the following: clouds are "a large pool of easily usable and accessible virtualized resources ... dynamically reconfigured to adjust to a variable load (scale), allowing ... an optimum resource utilization. This pool of resources is typically exploited by a pay-per-use model ... by means of customized SLAs" [38]. According to this definition, a user of the cloud pays only for the resources needed at a given moment; the model is therefore driven by business rules. An independent software vendor (ISV) does not need to buy and administer a large set of resources that would only be needed at peak times. It can increase its revenues by paying only for what it uses [16], based on a service-level agreement (SLA) [28, 13]. Another motivation for this new paradigm was to offer new startups the possibility to create powerful applications, in terms of computation power or storage capacity. The infrastructure costs would otherwise be prohibitive for newcomers, which cannot afford to buy and manage their own data centers. Hence cloud computing appeared as an application-driven paradigm governed by business rules. Cloud computing can be described as "converting capital expenses to operating expenses" (CapEx to OpEx) or, to capture the economic benefits, by the phrase "pay as you go" [13].
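The CapEx-to-OpEx argument above can be made concrete with a small back-of-the-envelope calculation. All prices and workload figures below are hypothetical, chosen only to illustrate why renting for average load can beat buying for peak load:

```python
# Illustrative comparison of owning servers (CapEx) vs. renting on demand
# (OpEx). All prices and workload figures are hypothetical.

def owned_cost(n_servers, price_per_server, ops_per_server_year, years):
    """Upfront purchase plus yearly operation, sized for peak load."""
    return n_servers * price_per_server + n_servers * ops_per_server_year * years

def cloud_cost(avg_servers_used, price_per_server_hour, years):
    """Pay only for the average number of instances actually running."""
    hours = years * 365 * 24
    return avg_servers_used * price_per_server_hour * hours

# Peak load needs 100 servers, but on average only 15 are busy.
capex = owned_cost(n_servers=100, price_per_server=2000,
                   ops_per_server_year=300, years=3)
opex = cloud_cost(avg_servers_used=15, price_per_server_hour=0.10, years=3)

print(f"own: ${capex:,}  rent: ${opex:,.0f}")
```

The gap widens as the ratio between peak and average load grows, which is exactly the situation of an ISV provisioning for rare traffic spikes.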


2.2 Grid vs. Cloud

As previously mentioned, a comparison between grid and cloud characteristics is made here for a better distinction between them. Since a definition of the cloud was given above, one for the grid is useful for this analysis. Ian Foster [22] defined it as a system that "coordinates resources which are not subject to centralized control, using standard, open, general-purpose protocols and interfaces, to deliver nontrivial qualities of service". Beyond simply comparing the two definitions, we can look at the features of each technology. Starting with the similarities, we observe that both rely on virtualization and on the coordination of heterogeneous, geographically distributed resources. A summary of the characteristics is given in Table 1. Grids encourage fair sharing of resources across organizations, while a cloud belongs to one company and resources are offered on demand [38]. Applications for the two are developed differently: clouds accept a wider range of applications, in some cases allowing a simple deployment of the desktop version, while for grids applications need to be "gridified" [38]. From the point of view of usability, clouds are much easier to use since they hide most of the deployment details. According to Travor [20], cloud computing is the user-friendly version of grid computing. On the other hand, when it comes to standardization, there are substantial efforts for grids, while clouds still lag behind in this respect. Regarding security, each site in a grid can have its own policy for accessing resources [14], due to multiple administrators. A cloud does not have this complexity, since a single interface allows a user to access the whole infrastructure. Another major difference is the usage of these platforms: grids are closer to the scientific area and to grand-challenge applications [14], while clouds are clearly driven by a business model. An exhaustive list of differences is hard to produce, especially because features migrate today from one platform to another; this comparison can therefore serve only as a starting point for how the two platforms can be thought of.

Feature                  Grid                                        Cloud
Resource sharing         Collaboration (virtual organizations,       Assigned resources are not shared
                         fair share)
Resource heterogeneity   Aggregation of heterogeneous resources      Aggregation of heterogeneous resources
Virtualization           Virtualization of data and                  Virtualization of hardware and
                         computing resources                         software platforms
Security                 Security through credential delegation      Security through isolation
Platform awareness       Client software must be grid-enabled        Software runs in a customized environment
Centralization degree    Decentralized control                       Centralized control
Standardization          Standardization and interoperability        Lack of standards for cloud interoperability
Usability                Hard to manage                              User friendliness

Table 1: Grid vs. Cloud characteristics [38]


2.3 Cloud Computing Landscape

According to Buyya et al. [15], cloud computing promises "reliable services delivered through next-generation data centers that are built on compute and storage virtualization technologies". Consumers will have access to their data from any point, on demand, since clouds have a single point of access for all computing requests. From the business point of view, cloud infrastructures aim to provide robustness and availability at any time, in order to appear as reliable "partners".

2.3.1 Cloud Architectures and classification

With respect to the ways clouds can be used, the de facto consensus led to the definition of three major exploitation levels: Infrastructure as a Service (IaaS), Platform as a Service (PaaS) and Software as a Service (SaaS) [38, 31, 13]. They can be viewed conceptually in Figure 1. The particularities of each will be highlighted and exemplified below.

Figure 1: Cloud layers [31]

Infrastructure as a Service offers a large set of computing resources (computation power or storage capacity) [38, 31]. These are exploited by means of virtualization, which is highly dynamic and allows the creation of ad-hoc systems on demand. Instead of buying servers, disks or networking equipment, cloud consumers rent and customize virtual machine images. Fees are generally charged on a utility basis that reflects the amount of raw resources used: storage space per hour, bandwidth, aggregated CPU cycles consumed, etc. [31]. The most successful cloud systems at the IaaS level are Amazon EC2 [1], Nimbus [8], OpenNebula [10, 35] and Eucalyptus [2, 33]. All these systems offer a simple on-line interface through which the infrastructure can be used: users have an account for logging in to the front end and launching multiple VM instances on the cloud. The generic architecture is shown in Figure 2. One important component is the hypervisor, a low-level software layer that presents the guest operating systems with a virtual operating platform and monitors their execution [28, 35]. It is present in each compute node, with the role of supervising the multiple virtual machines (VMs) that run on the cloud nodes.
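The front-end/compute-node split of this generic architecture can be sketched as follows. This is a deliberately simplified illustration, not the API of any real IaaS system: the class names, the slot-based capacity model and the first-fit placement policy are all assumptions made for the sketch.

```python
# Simplified sketch of the generic IaaS architecture: a front end receives
# VM launch requests and a first-fit policy assigns each VM to a node whose
# hypervisor still has free slots. All names are illustrative.

class CloudNode:
    def __init__(self, name, vm_slots):
        self.name, self.vm_slots, self.vms = name, vm_slots, []

    def has_capacity(self):
        return len(self.vms) < self.vm_slots

class FrontEnd:
    def __init__(self, nodes):
        self.nodes = nodes

    def launch(self, vm_image):
        for node in self.nodes:              # first-fit placement
            if node.has_capacity():
                node.vms.append(vm_image)    # the node's hypervisor would
                return node.name             # boot the image here
        raise RuntimeError("no capacity left")

cloud = FrontEnd([CloudNode("node-1", 2), CloudNode("node-2", 2)])
placements = [cloud.launch(f"vm-{i}") for i in range(3)]
print(placements)   # ['node-1', 'node-1', 'node-2']
```

Real systems replace the first-fit loop with far more elaborate schedulers, but the division of roles (front end decides, per-node hypervisor supervises) is the same.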

Figure 2: Generic IaaS architecture
Platform as a Service offers the possibility to exploit clouds in a different manner than using the virtualized infrastructure directly. Users can use a software platform, with the requested hardware provided transparently [38]. As one might expect, it is constructed on top of IaaS, providing a higher level of programming and freeing the customer from configuring VMs. A common way to develop applications at this level is to comply with specific roles, as in the case of MapReduce or Azure. Microsoft Azure [6, 17] is one of the most representative PaaS systems offered for commercial use; its generic architecture can be seen in Figure 3. Developers program at a higher level, concentrating only on the code of the applications that will run inside the cloud, possibly conforming to a specific architecture (e.g., Web Role and Worker Role for Azure) and/or on the data, which is stored through simple methods (e.g., HTTP requests). Google also offers commercial access to its PaaS infrastructure through Google App Engine [3], providing customers with fast development and deployment, simple administration with no need to worry about hardware, patches or backups, and effortless scalability. MapReduce [18] has emerged recently as a new programming paradigm used at the PaaS level. Its open-source implementation, called Hadoop [5] and supported by Yahoo!, has gained much popularity recently. The MapReduce model consists of providing only two functions: Map and Reduce. The platform is responsible for creating the required number of workers that run the code of each function, and for the data flow to and from workers. In addition to Google and Yahoo!, Microsoft offers Dryad [26], which follows the same programming principles as MapReduce but is more general: the additional features provided allow more complex data flows and compositions between the workers.
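The "only two functions" contract of the MapReduce model can be illustrated with the classic word-count example. The user supplies map and reduce; the tiny sequential driver below stands in for the platform, which in a real deployment distributes the work and moves intermediate data between workers:

```python
# Minimal illustration of the MapReduce programming model: the user writes
# only map_fn() and reduce_fn(); run_mapreduce() plays the role of the
# platform (worker creation, shuffle/group, data flow) in-process.
from itertools import groupby

def map_fn(document):
    # Emit an intermediate (key, value) pair per word.
    for word in document.split():
        yield (word, 1)

def reduce_fn(word, counts):
    # Merge all intermediate values sharing the same key.
    return (word, sum(counts))

def run_mapreduce(documents):
    intermediate = [pair for doc in documents for pair in map_fn(doc)]
    intermediate.sort(key=lambda kv: kv[0])          # shuffle/group phase
    return dict(reduce_fn(k, [v for _, v in group])
                for k, group in groupby(intermediate, key=lambda kv: kv[0]))

print(run_mapreduce(["a b a", "b c"]))   # {'a': 2, 'b': 2, 'c': 1}
```

The sort-and-group step is precisely the map-to-reduce communication phase that the report later identifies as the model's most data-sensitive part.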

Figure 3: Azure architecture [17]

Software as a Service is the highest level at which clouds can be used by customers. Notable examples of such services are given next. Google offers Google Docs [4], where users can store their documents and access them from any place. Microsoft Live Services [7], with Live.com, is a set of personal Internet services and software designed to bring together in one place all of the relationships, information and interests people care about most: mail, accounts, messenger, office tools, etc. Other players in the market, like Amazon Web Services [1] or salesforce.com, concentrate mostly on e-commerce. These services become more and more popular, being addressed to all types of users and relieving them from installing software, updates or patches [31]. In general a simple web browser is enough to access them, as they can be reached from any location based on an ID.

Others. As the popularity of the cloud grows, new types of exploitation appear besides the three mentioned above. Microsoft Azure [6] has successfully deployed its SQL database into the Azure cloud. The major advantage is that it can be used exactly like a normal database, while having all the benefits of the cloud. This is referred to as DataBase as a Service (DaaS) [17, 25], though the term Data as a Service is more common. The SaaS concept can even be extended to the notion of Models as a Service (MaaS), where semantic annotations and ontologies are used to compose computational models and execute them as a conceptual whole [29]. If the current trend holds, new such concepts will continue to appear, but, as can be expected, they can be integrated into one of the three main categories: DaaS could be considered part of PaaS, and MaaS part of SaaS. Let us note that there are no strict borders between the layers. Since they are built on top of each other, the functionality can easily be enriched to reach an upper level. For example, an IaaS system with MPI configured could easily be proposed as a PaaS infrastructure.

2.3.2 Main Challenges

Being a relatively new technology, cloud computing still has some issues that must be overcome. They can be grouped as follows:

Data management still needs a lot of work. Currently there is no concurrency control at the IaaS level, and only a very simple mechanism at the PaaS level [25]. Complex applications with high concurrency can suffer, or even fail to benefit from cloud technology, until better concurrency schemes are delivered. There are limitations on the size of the objects that can be stored [6, 1], which can create complications in the development process. Fine-grained access is another issue: IaaS, for example, provides just simple mechanisms like get and put for managing data, and these operations cannot access just small parts of an object.

Computational. The cloud ecosystem is very heterogeneous [35], and this is reflected at several levels. The diverse experience of various cloud customers starts with the network connection that the cloud has, which can be either regular or high-performance. Along with the problem of migrating applications between clouds, at the IaaS level the problem of hypervisor compatibility also arises. There is a real need for moving from private clouds to hybrid or public clouds [35, 29], but there are currently no general standards regarding the deployment of VMs. There are, however, efforts on this issue [29]: for instance, the Open Geospatial Consortium [9] has created an annual process between the major stakeholders for developing such standards.

Security issues. In general, there are simple password mechanisms for identification, but more secure methods for authentication have been developed [11]. Recent studies have shown that mechanisms limiting the potential damage in case of an attack, like fine-grained delegation or limits on the traffic [34], would be needed. Another issue concerns the total trust that clients must place in the cloud owner regarding their data. A security measure can be the encryption of data stored inside the clouds; encryption can also be the solution for legal constraints, like data confidentiality.

Programming models imposed when using cloud technology can also create drawbacks. Not all applications can comply with imposed architectures like the Web Role and Worker Role in Azure [17]. Moreover, the stateless behavior imposed by a load balancer in charge of distributing requests makes it difficult for existing REST (representational state transfer) applications, which are mostly stateful, to migrate into clouds. Issues regarding MapReduce programs concern data location awareness, since efficiency depends on placing the mappers close to the data.

2.4 Sky Computing

Clouds can be classified as follows [35, 21]:

Private clouds offer the ability to host applications or VMs on a company's own set of hosts.

Hybrid clouds. A standard definition has not yet emerged, but the term has been used both for two separate clouds joined together and for a combination of virtualized cloud server instances used together with real physical hardware.

Public clouds. This describes cloud computing in the traditional mainstream sense, whereby resources are dynamically provisioned on a fine-grained, self-service basis over the Internet from an off-site third-party provider who bills on a fine-grained utility computing basis.

The normal migration direction is from private clouds to public ones, motivated mostly by the computation power needed. Driven by this migration towards more computation power and also by economic reasons (the ability to move according to pricing schemes), a new direction has emerged, called sky computing [28]. The major problems are related to compatibility of VMs, migration of the configuration, connectivity and trust. Keahey et al. [28] made significant progress in opening the way for sky computing, meaning the creation of ad-hoc clouds on demand from heterogeneous resources. From the wide spectrum of cloud computing domains, the rest of this work will concentrate on storage in clouds, aiming to identify the applications' needs in terms of data management, the existing solutions and their issues.

3 Cloud Data Intensive Applications

In general, data-intensive applications are found at the IaaS and PaaS levels. This makes sense, since from the point of view of the user the SaaS level delivers a full product, everything being transparent for the customer.

At the IaaS level, clients typically run a distributed application using a set of VMs encapsulating it, running under certain restrictions according to the SLA. Direct access to local storage space on the physical machine is usually denied: clients are instead provided with a specialized storage service that they can access through a specific API. Amazon S3 (Simple Storage Service) [1] has emerged as a de facto standard for manipulating data, and most storage systems at this level implement its interface, along with others. The API is rather basic and simply allows the user to put and get data at a specified storage location, with no adequate support for storing very large data (the object size is limited to 5 GB for S3), nor for global data sharing under heavy concurrency. As cloud services typically enforce isolation for security reasons, the state-of-the-art mechanisms proposed for data sharing are rudimentary. For instance, Amazon's Simple Queue Service implements a message-based abstraction where data messages can be put to or retrieved from a queue, with a size limit of 256 KB per message; processes accessing the queue are synchronized using locking. These approaches are representative of the state-of-the-art data management systems on IaaS clouds.

At the higher level of PaaS services, specialized file systems have been designed, such as HDFS, the default storage layer of Hadoop's MapReduce framework [5]. The need for specialized file systems was motivated by the requirements of data-intensive distributed applications in terms of performance and scalability. In HDFS, data is typically stored in massive files that are concurrently accessed by the Hadoop processes that run the map and reduce functions. HDFS has, however, some difficulties in sustaining a high throughput in the case of concurrent accesses to the same file [30, 36]. In addition, HDFS does not support versioning and implies a write-once-read-many schema, which can represent an issue for some applications.
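The IaaS access pattern described above (whole-object put/get over a bucket/object namespace, with a hard size cap) can be mimicked with a toy in-memory store. This is an illustration of the interface shape, not a client for the real S3 service; the class and method names are assumptions of the sketch:

```python
# Toy in-memory store mimicking the S3-style put/get interface described
# above: a two-level namespace (bucket/object), whole-object operations
# only, and a per-object size cap (5 GB for S3 at the time of the report).

MAX_OBJECT_SIZE = 5 * 1024**3   # 5 GB

class ObjectStore:
    def __init__(self):
        self.buckets = {}

    def create_bucket(self, bucket):
        self.buckets.setdefault(bucket, {})

    def put(self, bucket, key, data: bytes):
        if len(data) > MAX_OBJECT_SIZE:
            raise ValueError("object exceeds 5 GB limit")
        # The whole object is replaced atomically; there is no partial or
        # fine-grained write, which is exactly the limitation noted above.
        self.buckets[bucket][key] = data

    def get(self, bucket, key) -> bytes:
        return self.buckets[bucket][key]

s3 = ObjectStore()
s3.create_bucket("results")
s3.put("results", "run-1/output.dat", b"some raw data")
print(s3.get("results", "run-1/output.dat"))
```

Everything a client can do reduces to these whole-object operations, which is why fine-grained access and concurrent sharing have to be built on top by the application.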


3.1 Scientific applications

Although clouds are by definition governed by business rules, not only business applications are meant to run inside them. As will be seen next, besides commercial use, clouds are also very interesting for scientific applications. Powerful scientific workflows have started to be run on different cloud platforms and, in addition, the MapReduce paradigm has also migrated into the clouds. As an overview of what must be provided for scientific applications, we can mention [34]:

- Data durability: raw data that is lost can add additional costs, since a new computation will be needed to recompute it, so it should be durable.
- Data availability: since applications can imply co-allocation of expensive resources, data should be present when needed.
- Data usability: customers should not have to concentrate on how to get their data, but rather on the logic of their applications.
- Data security: science applications often share data between multiple parties or institutions, so providers must both protect data and allow complex sharing mechanisms.

3.1.1 MapReduce

MapReduce is "a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key" [18]. The model was proposed by Google; Yahoo! uses an open-source implementation of it, Hadoop, and recently Microsoft has developed the Dryad [26] system, which can be considered the next step of the MapReduce model. Various applications can comply with this simple model, from powerful index engines to pharmaceutical tools. From the point of view of the data, the model's most sensitive part is the communication between map workers and reduce workers. Depending on the application, a large amount of data can be generated after the map phase, which must be stored and delivered to the reducers [32, 36]. This sensitivity motivates ongoing work on improving the throughput, concurrency and efficiency of the underlying storage backend.

3.1.2 Scientific workflows

Clouds have also generated interest for complex and highly demanding scientific workflows, and several studies aim to analyze the advantages of these new technologies [27]. Juve et al. [27] carried out such an analysis with three different types of workflows, from astronomy, seismology and bioinformatics, showing encouraging results. Studies like this one share some common views about the requirements that future cloud generations should provide. Depending on the access pattern, they highlight the utility of caching mechanisms that can decrease costs [34]. Other issues concern fine-grained access [34] and concurrency [25], since support for such attributes is lacking. In conclusion, for scientific workflows in clouds the road is open, and as clouds mature we will see more and more such applications running in clouds, lured by the mirage of infinite power and storage, and by the elasticity [27].

3.2 Business possible applications

The storage requirements of business applications share common points with the scientific ones, like security, durability, availability, etc., but also have their own needs. Before going into details, we should look at some examples of data-intensive applications that could eventually be created in clouds.

Chappell [16] gives some examples to illustrate the potential value of this platform for the independent software vendor (ISV): the next Facebook, the next YouTube, etc. Hence even small startups could now compete on the market with their projects against the powerful vendors. In order to make such data-intensive applications possible, cloud providers need to offer a user-friendly API to manage the data. Moreover, the SLA must be guaranteed in terms of delivery time, availability and access performance, since user experience is the most important aspect in the business landscape. Another example of a commercial application would be the creation of new features that were previously prohibited by the lack of computation power. Imagine a search engine for movies based on images, or a service that extracts in real time a compilation on a certain subject of interest. Such an application would have a burst in both computation and storage requirements on each request. The cloud would be a good platform for it if a very flexible pricing scheme were used: an auction for the resources makes sense, since when the application needs a lot of computational power for short periods of time it should be able to get it, and eventually release the resources shortly after. We can see that data access performance and concurrent access would be at the top of the list of attributes needed for such an application. Therefore there are no strict borders between the necessities of business and scientific applications, since both kinds could benefit from each improvement.

4 Cloud storage systems: State of the art

We have seen so far the general landscape of clouds; we now focus on an analysis of the storage mechanisms that exist today. Storage does not cover just user data, since the VMs that are deployed must also be stored. In addition, there also exist storage mechanisms meant for internal applications, designed according to specific requirements. The IaaS level is poorer in such systems, which are more present at the PaaS level, while for SaaS we can consider them totally absent: even if a SaaS application offered clients the possibility to store their data (hence storage at the SaaS level), it would fall, in our classification, under storage mechanisms for internal applications at the PaaS level, since SaaS is built on the facilities present at the platform level. Next, these types are detailed and analyzed.

4.1 IaaS Storage

IaaS storage keeps raw data and the VM images. What cloud providers offer clients in terms of storage is an unlimited capacity in which data, mostly unstructured, can be placed. The same mechanisms (APIs) are used by the cloud for reading and storing the VMs, which can be the default ones or customized by users.

4.1.1 A few examples

As previously mentioned, Amazon S3 [1] emerged as a standard at this level, and all storage systems at the IaaS level tend to implement its interface, sharing most of its characteristics. Access to data is done using standards-based REST (Representational State Transfer) and SOAP (Simple Object Access Protocol) interfaces designed to work with any Internet-development toolkit, while the default download protocol is HTTP.

Amazon Simple Storage Service [1] provides simple methods to manipulate data (read, write and delete); data can also be accessed via the BitTorrent protocol in addition to the interfaces mentioned. Data is stored as objects with sizes from 1 byte to 5 GB, each object being stored in a bucket, hence a 2-level namespace is provided. The purpose of the buckets is to allow users to organize their data. The maximum number of buckets is 100, but the number of objects inside one is unlimited [34]. The sites are distributed across the globe, and users have the possibility to select the location of their data for lower latency. Since the storage service does not ensure consistency, applications have to use locking mechanisms when performing writes. Besides the basic CRUD (create, read, update and delete) operations, no other features are provided, leaving it to user ingenuity to find solutions for their needs. For example, if a simple rename or a copy to another bucket needs to be done, the user must download the entire object locally and then upload it again [34]. This is costly, since in addition to storage, bandwidth and operations are also charged.

Walrus [2] is the storage system offered by Eucalyptus; as stated, it is compatible with S3, but implements only the REST (via HTTP) and SOAP interfaces. It is used both for storing user data and the VM images [33]. The general schema can be seen in Figure 4. Unlike S3, Walrus ensures consistency in case of multiple writes by validating only the last one. VM images are compressed and encrypted, Walrus being entrusted with the task of verification and decryption [33]. To improve performance, a caching mechanism is used to store the decrypted images.
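The rename-by-download-and-reupload workaround mentioned for S3 can be sketched as follows. The helper names and the per-GB prices are hypothetical; the point is only that a "rename" composed of get + put (+ delete) is billed for the object's full size in both directions:

```python
# The CRUD-only rename workaround: moving an object to a new name means
# downloading it and uploading it again, paying bandwidth both ways.
# Prices below are hypothetical, for illustration only.

def rename_cost_gb(object_size_gb, price_down_per_gb, price_up_per_gb):
    """Bandwidth billed for a 'rename' done as get + put (+ delete)."""
    return object_size_gb * (price_down_per_gb + price_up_per_gb)

def rename(store, bucket, old_key, new_key):
    data = store[bucket].pop(old_key)   # 'download' then delete
    store[bucket][new_key] = data       # 'upload' under the new name

store = {"b": {"old": b"payload"}}
rename(store, "b", "old", "new")
print(store["b"])                        # {'new': b'payload'}
print(rename_cost_gb(5, 0.15, 0.10))     # 1.25 (a 5 GB object billed both ways)
```

For a maximum-size 5 GB object this doubles the transferred volume for what is logically a metadata operation, which is the cost the report points out.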

Figure 4: Walrus general architecture [33]

Cumulus is the storage system used by Nimbus [8]. It is also compatible with S3, but implements only the REST API. In addition to the common inherited features, it provides quotas, set by the administrator to impose limits on the capacity used by clients [12]. Among the advantages of Cumulus are its customizable, POSIX-compatible backends, adapted to each data center, and its ease of use. It provides an abstraction of the storage system, allowing the creation of a customized set of drivers.

4.1.2 Conclusions

As we have seen, IaaS offers rudimentary solutions for storing data, which can represent both user data and VM images. The operations provided at this level are the basic CRUD ones; more complex operations and specialized concurrency mechanisms represent the current open problems in the state of the art for storage at the IaaS level. All the systems studied are compatible with the Amazon S3 interface and access data through HTTP. Research on this topic concentrates mostly on optimizing VM management, leaving to PaaS storage systems the task of developing specialized and more complex mechanisms for client data manipulation.

4.2 PaaS Storage

PaaS storage systems are more numerous than IaaS ones. They are specialized for different purposes, ranging from storage backends for the platform, to services for clients, to storage for the applications that provide business value to customers. Each such type has some strong points based on the requirements it must fulfill, as well as some weaknesses. No storage system that satisfactorily encapsulates the needs of all applications has been built so far. Most of the systems discussed below could be classified as distributed file systems, but, as we will see, parallel file systems can also be used as storage mechanisms at the PaaS level.

4.2.1 Description and problems

The storage systems at the PaaS level tend to offer more features in order to satisfy customers' needs. Availability and durability mostly mean being able to retrieve data even in case of failures in the data center; replication is very common [37] among these systems, being in some cases a user-defined parameter [30]. Usability is refined, data being manipulated in an easy manner, mostly through HTTP or remote method calls. Security is enhanced: some systems allow the creation of customized access policies and increase system protection by using specific architectures like the Gatekeeper design pattern [11]. In order to obtain a better view of these systems, they are classified below according to their main purpose.

Backend for PaaS. As previously detailed, new programming models, like MapReduce, have been successfully applied in cloud platforms. However, such programming environments require an efficient distributed file system (DFS). Examples of such systems are GoogleFS [23], the Hadoop Distributed File System (HDFS) [5] and BlobSeer [30]. The first two, GFS and HDFS, have a central metadata server, a write-once-read-many data policy, server-side replication (usually 3 replicas), caching mechanisms and no dedicated security mechanisms [23, 5, 37]. Motivated by their drawbacks and by the desire to enrich the capabilities of the storage system, BlobSeer, a newer system, addresses most of these problems: it does not prohibit multiple writes, it offers versioning capabilities and it provides better throughput. By implementing a shim layer it can act as a storage backend for MapReduce, providing the caching mechanisms, replication and data-locality awareness needed by Hadoop [30]. A very interesting result is provided by Tantisiriroj et al. [36], who have shown that parallel file systems can also act as backends for MapReduce by implementing a shim layer with the previously mentioned properties.
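The server-side replication used by these file systems can be illustrated with a toy block-placement routine in the spirit of HDFS. This is a simplified sketch, not the real placement algorithm, which also weighs rack topology and free space; the node names and functions are assumptions for illustration:

```python
import random

def place_replicas(block_id, datanodes, replication=3):
    """Pick `replication` distinct datanodes to host copies of one block.

    A minimal model of the idea: real DFSs additionally consider rack
    locality and available capacity when choosing the nodes.
    """
    if replication > len(datanodes):
        raise ValueError("not enough datanodes for the replication factor")
    return random.sample(datanodes, replication)

def build_block_map(blocks, datanodes, replication=3):
    # The central metadata server keeps this block -> locations map
    # and re-replicates blocks whose replica count drops below target.
    return {b: place_replicas(b, datanodes, replication) for b in blocks}
```

The default of 3 replicas mirrors the common setting mentioned above; the metadata server's block map is what makes the central-server designs of GFS and HDFS both simple and a potential bottleneck.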
The efficiency of the storage system is important, since it can greatly enhance the overall performance of, for example, a scientific workflow. Hence the ongoing research on this topic is justified, being able to provide a significant and useful contribution.

Service for Client Applications. Storage systems like the ones discussed above could also be used to serve clients, but it makes sense to look at the commercial side and analyze what is currently offered. The most important player at this level is Microsoft Azure [6], which offers two storage mechanisms, Blobs and Tables, shown in Figure 5. Blobs are the main storage mechanism, meant for unstructured data, with sizes up to 50 GB (10 times more than the size offered at the IaaS level by S3). They are organized in a two-level namespace: the first level is the container, of which there can be any number, and each container can hold any number of blobs. Data is managed through HTTP requests (PUT, GET etc.), but with more functionality than the simple set of commands available at the IaaS level (e.g., atomic commit of multiple parts of a blob, fine-grained access) [17, 25]. The major drawback is the way concurrency is handled, since only a basic mechanism ensures consistency: the most recent commit wins. Tables, despite their name, have nothing to do with database tables. They store structured data and offer efficient fine-grained access. They are composed of entities, each consisting of multiple properties, with no predefined schema. Users can define two keys, similar to indexes, which allow extremely efficient queries [25]. These storage mechanisms have a clear commercial purpose, providing a wide spectrum of features with business utility.
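The two-key addressing of Tables can be modeled as a nested dictionary keyed first by a partition key and then by a row key. This is a simplified in-memory model of the idea, not the actual Azure Table implementation or client API:

```python
class ToyTable:
    """Schema-less entities addressed by (partition_key, row_key)."""

    def __init__(self):
        self._partitions = {}  # partition_key -> {row_key -> entity}

    def insert(self, partition_key, row_key, **properties):
        # Entities are free-form property bags: no schema is enforced,
        # so two entities in the same table may have different properties.
        self._partitions.setdefault(partition_key, {})[row_key] = properties

    def get(self, partition_key, row_key):
        # Point lookup on both keys: two hash probes, no scan.
        return self._partitions[partition_key][row_key]

    def query_partition(self, partition_key):
        # Queries restricted to one partition never touch the others,
        # which is what makes such queries efficient and scalable.
        return list(self._partitions.get(partition_key, {}).values())
```

The design choice shown here is the important one: by forcing every entity under exactly two keys, the system can answer point and single-partition queries without global indexes.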


Figure 5: Azure Storage [17]

Storage for the running applications. In addition to the two types of systems detailed above, there are some extremely specialized systems created to store data generated by applications, which may or may not be aware of them. Azure Queues [17], also shown in Figure 5, are known by the applications: the developer uses them to send messages (job descriptions) between web roles and workers. A particular aspect is that job execution is guaranteed even if the worker responsible for it crashes: dequeued jobs are not deleted, but only hidden for some time, and if the job has not been finished in the meantime, another worker will perform it.

Dynamo [19] is a highly available internal storage system from Amazon. It is not known by the applications, being used through a web service bound by SLAs. Its key design principle is to be always available for writes, a critical property for the other web services that use it. It uses a 0-hop DHT and optimistic replication. Nectar [24], provided by Microsoft, is another example of a storage system that the applications are not aware of. Its task is to optimize the storage space used, by caching the most frequently used data and replacing old unused data with the application code that generated it. The system is able to optimize the used space while still providing good throughput. As we can see, these types of systems are extremely specialized for specific issues, making significant tradeoffs to achieve them.

4.2.2 Conclusions

PaaS storage systems address several needs, from those of simple client users, to storage for programming models like MapReduce, and up to specific internal applications. Among the existing limitations (for some of the systems described above) we can enumerate concurrency handling, size limits, the lack of some basic operations and centralized components. The features required by each possible usage are numerous, so until now the trend has been to provide specialized solutions for each of them, which is reflected in the large number of systems present at the moment. An interesting question is how such a system could be enriched to cover multiple types of needs efficiently, or perhaps all of them!
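The hide-then-reappear behavior of Azure Queues described above rests on a visibility timeout. The mechanism can be simulated with an explicit clock; the class below is a simplified model under assumed names, not the real Azure Queue API:

```python
class ToyQueue:
    """Queue whose dequeued messages are hidden, not deleted.

    A message only disappears when the worker explicitly deletes it
    after finishing the job; if the worker crashes, the message becomes
    visible again after `timeout` ticks and another worker picks it up.
    """

    def __init__(self, timeout=30):
        self.timeout = timeout
        self._messages = {}  # msg_id -> (visible_at, body)
        self._next_id = 0

    def put(self, body, now=0):
        self._messages[self._next_id] = (now, body)
        self._next_id += 1

    def dequeue(self, now):
        for msg_id, (visible_at, body) in sorted(self._messages.items()):
            if visible_at <= now:
                # Hide the message instead of removing it.
                self._messages[msg_id] = (now + self.timeout, body)
                return msg_id, body
        return None

    def delete(self, msg_id):
        # Called by the worker only after the job has completed.
        self._messages.pop(msg_id, None)
```

The tradeoff this design makes is at-least-once execution: a slow (but alive) worker whose job outlives the timeout will see its job executed twice, which is why such queues suit idempotent tasks.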


4.3 Conclusions

We have seen the most notable existing solutions for cloud storage. Those at the IaaS level differ greatly from those at the PaaS level, since the problems they address are different, and hence the possible improvements also vary. IaaS storage needs work on its basic operations, possibly implementing new ones or eliminating some existing limits, and on improving concurrency control. PaaS storage also has open issues regarding concurrency, but existing DFSs, like BlobSeer, could offer the desired solutions. On the other hand, the large number of requirements coming from all stakeholders interested in exploiting the cloud as a platform still represents a big challenge, and there is still work to be done to satisfy everyone.

5 Discussion

Now that we have seen the cloud ecosystem, a discussion of the open questions and possible answers makes sense. Based on the requirements of data-intensive applications, some of the needed enhancements relate to replication and concurrency. Replication increases data availability, but cannot guarantee 100%. Most systems use a replication factor of 3, and 5 for more sensitive data; monitoring mechanisms are implemented to maintain this number, but improvements can still be made here and in how system consistency is ensured.

Another open question relates to access patterns: write-once-read-many vs. multiple writes/reads. We have seen that most systems in use today have adopted the first policy (write once), but the second one should not be neglected, as it has the potential to benefit many possible applications. Solutions to questions like this one can come from new DFSs, like BlobSeer. Such a system, ported to commercial clouds, could bring many new features (appends, versioning, multiple reads/writes etc.) and solve some of the current issues: limits on object sizes, commits lost due to concurrent writes, and so on. Cloud computing, being a relatively new technology, still requires improvements and additional features. Today, due to its wide popularity, more and more business and scientific problems look to the cloud for answers, and the cloud must have answers prepared for them.
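The lost-commit problem under last-writer-wins semantics, and how versioning of the kind BlobSeer provides avoids it, can be sketched in a few lines. This is a toy model under assumed names; BlobSeer's actual interfaces are considerably richer:

```python
class VersionedBlob:
    """Every write creates a new immutable version instead of
    overwriting in place, so no concurrent commit is ever lost."""

    def __init__(self):
        self._versions = []  # version number == list index

    def write(self, data):
        # Concurrent writers each obtain their own version; nothing
        # is overwritten, so "most recent commit wins" never discards data.
        self._versions.append(data)
        return len(self._versions) - 1  # version id of this commit

    def read(self, version=None):
        # Readers may pin a specific version; the default is the
        # latest snapshot, giving cheap point-in-time reads.
        if version is None:
            version = len(self._versions) - 1
        return self._versions[version]
```

The contrast with the Azure Blob policy described earlier is the key point: under last-writer-wins, the overwritten commit is simply gone, whereas here every commit remains readable by version number.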

6 Conclusions

This work has analyzed the current status of the cloud landscape. Among all possible directions, the focus was on cloud storage, mostly from the point of view of data-intensive applications. Such applications have specific requirements that must be taken into account by cloud providers. Current storage systems still have milestones to overcome, both at the IaaS and PaaS levels: concurrency, versioning, file size limitations, fine-grained access, and basic operations like multiple writes or appends. In addition, we have observed that the current trend is to develop specialized solutions for the various usages of storage systems. An open question is how we could improve one such solution, or design a new one, so as to answer most of the requirements of data-intensive applications, or perhaps all of them.


References

[1] Amazon Web Services: http://aws.amazon.com/.
[2] Eucalyptus: http://www.eucalyptus.com/.
[3] Google App Engine: http://code.google.com/appengine/.
[4] Google Docs: https://docs.google.com/.
[5] Hadoop: http://hadoop.apache.org/.
[6] Microsoft Azure: http://www.microsoft.com/windowsazure/.
[7] Microsoft Live Services: http://www.microsoft.com/presspass/misc/11-01LiveSoftwareFS.mspx.
[8] Nimbus: http://www.nimbusproject.org/.
[9] Open Geospacial Consortium: www.opengeospacial.org.
[10] OpenNebula: http://www.opennebula.org/.
[11] Microsoft: Security Best Practices For Developing Windows Azure Applications. http://www.microsoft.com/windowsazure/whitepapers/default.aspx, 2008.
[12] Kate Keahey, John Bresnahan, Tim Freeman, and David LaBissoniere. Cumulus: Open Source Storage Cloud for Science. SC10 Poster, New Orleans, LA, November 2010.
[13] Michael Armbrust, Armando Fox, Rean Griffith, Anthony D. Joseph, Randy Katz, Andy Konwinski, Gunho Lee, David Patterson, Ariel Rabkin, Ion Stoica, and Matei Zaharia. A view of cloud computing. Communications of the ACM, April 2010.
[14] Miguel L. Bote-Lorenzo, Yannis A. Dimitriadis, and Eduardo Gomez-Sanchez. Grid Characteristics and Uses: A Grid Definition.
[15] Rajkumar Buyya, Chee Shin Yeo, and Srikumar Venugopal. Market-Oriented Cloud Computing: Vision, Hype, and Reality for Delivering IT Services as Computing Utilities. HPCC, 2008.
[16] David Chappell. Windows Azure and ISVs. Technical report, Microsoft. http://www.microsoft.com/windowsazure/whitepapers/.
[17] David Chappell. Introducing the Windows Azure Platform. Technical report, Microsoft. http://www.microsoft.com/windowsazure/whitepapers/.
[18] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. Communications of the ACM, January 2008.
[19] G. DeCandia, D. Hastorun, M. Jampani, G. Kakulapati, A. Lakshman, A. Pilchin, S. Sivasubramanian, P. Vosshall, and W. Vogels. Dynamo: Amazon's highly available key-value store. Symposium on Operating Systems Principles, 2007.
[20] Trevor Doerksen. Cloud Computing - The User-Friendly Version of Grid Computing. SYS-CON, 2008.
[21] John Foley. Private Clouds Take Shape. InformationWeek, August 9, 2008.
[22] Ian Foster. What is the Grid? A Three Point Checklist. Technical report, Argonne National Laboratory and University of Chicago, July 2002.
[23] Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. The Google File System. 19th ACM Symposium on Operating Systems Principles, October 2003.
[24] Pradeep Kumar Gunda, Lenin Ravindranath, Chandramohan A. Thekkath, Yuan Yu, and Li Zhuang. Nectar: Automatic Management of Data and Computation in Datacenters. October 2010. http://research.microsoft.com/apps/pubs/default.aspx?id=138691.
[25] Zach Hill, Jie Li, Ming Mao, Arkaitz Ruiz-Alvarez, and Marty Humphrey. Early Observations on the Performance of Windows Azure. High Performance Distributed Computing, 2010.
[26] Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, and Dennis Fetterly. Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks. EuroSys, 2007.
[27] Gideon Juve, Ewa Deelman, Karan Vahi, Gaurang Mehta, Benjamin P. Berman, Bruce Berriman, and Phil Maechling. Scientific Workflow Applications on Amazon EC2. IEEE International Conference on e-Science, 2009.
[28] Katarzyna Keahey, Mauricio Tsugawa, Andrea Matsunaga, and Jose A. B. Fortes. Sky Computing. Cloud Computing, 2009.
[29] Craig A. Lee. A Perspective on Scientific Cloud Computing. High Performance Distributed Computing, 2010.
[30] B. Nicolae, G. Antoniu, L. Bougé, D. Moise, and A. Carpen-Amarie. BlobSeer: Next Generation Data Management for Large Scale Infrastructures. Journal of Parallel and Distributed Computing, 2010.
[31] Bogdan Nicolae. BlobSeer: Towards efficient data storage management for large-scale, distributed systems. PhD thesis, University of Rennes 1, 2010.
[32] Bogdan Nicolae, Diana Moise, Gabriel Antoniu, Luc Bougé, and Matthieu Dorier. BlobSeer: Bringing High Throughput under Heavy Concurrency to Hadoop Map-Reduce Applications. IPDPS, 2010.
[33] Daniel Nurmi, Rich Wolski, Chris Grzegorczyk, Graziano Obertelli, Sunil Soman, Lamia Youseff, and Dmitrii Zagorodnov. The Eucalyptus Open-source Cloud-computing System. IEEE/ACM International Symposium on Cluster Computing and the Grid.
[34] Mayur Palankar, Adriana Iamnitchi, Matei Ripeanu, and Simson Garfinkel. Amazon S3 for Science Grids: a Viable Solution? Data-Aware Distributed Computing, 2008.
[35] Borja Sotomayor, Ruben S. Montero, Ignacio M. Llorente, and Ian Foster. Virtual Infrastructure Management in Private and Hybrid Clouds. Internet Computing, September 2009.
[36] Wittawat Tantisiriroj, Swapnil Patil, and Garth Gibson. Data-intensive file systems for Internet services: A rose by any other name... Technical report CMU-PDL-08-114, October 2008.
[37] Tran Doan Thanh, Subaji Mohan, Eunmi Choi, SangBum Kim, and Pilsung Kim. A Taxonomy and Survey on Distributed File Systems. Fourth International Conference on Networked Computing and Advanced Information Management, September 2008.
[38] Luis M. Vaquero, Luis Rodero-Merino, Juan Caceres, and Maik Lindner. A Break in the Clouds: Towards a Cloud Definition. Technical report, Telefonica Investigacion y Desarrollo and SAP Research, Madrid, Spain and Belfast, UK, 2008.
