
2014 International Conference on Parallel, Distributed and Grid Computing

Enhancing Cloud Computing Reliability Using Efficient Scheduling by Providing Reliability as a Service

Abishi Chowdhury

Priyanka Tripathi

Dept. of Computer Engineering and Application National Institute of Technical Teachers’ Training and Research Bhopal, India [email protected]

Dept. of Computer Engineering and Application National Institute of Technical Teachers’ Training and Research Bhopal, India [email protected]

cost, increase productivity, improve collaboration, gain better access to analytics and speed up the development cycle; IBM has listed several important points on this [2]. According to a survey by RightScale (a company that provides software as a service for managing the equipment behind cloud services), 94% of organizations run their applications on Infrastructure as a Service (IaaS), and 87% of organizations are adopting the public cloud. Respondents reported greater benefits in 2014 than in previous years in categories such as cost savings, geographic reach, higher availability and business continuity [3]. By 2015, end-user spending on cloud services could exceed $180 billion. It is reported that 82% of corporate organizations saved money by moving their business to the cloud, and that 80% of organizations saw remarkable improvements within just 6 months of moving to the cloud [4]. Cloud service providers serve user requests in the form of virtual machines. Using server virtualization, providers generate multiple virtual machines to serve multiple requests at a time. Virtual machines are allocated to user requests and deallocated once the task completes. Among enterprises that have implemented server virtualization, reliability has been the main concern, even ahead of performance, security and other issues. Faults are inherent in any technology, and because cloud computing combines several technologies, it faces failures frequently.
It is reported that about 165,000 websites hosted by NaviSite went offline for an entire week in 2007, and Twitter went down frequently in 2008. Gmail users were badly affected in 2009, when the service was unavailable for 4 hours. The central reservation system Amadeus suffered 2 major failures within 3 months in 2010, and this inconvenience forced some airlines to check in within 1 hour. Another major disruption occurred in 2011, when Yahoo! Mail went down for 6 hours. A Facebook blackout disrupted its users worldwide for 3 hours in 2012 [5]. It was Jan. 10, 2013, when

Abstract—Cloud computing is one of the prime needs of today’s IT world. Organizations are shifting to the cloud rapidly in order to increase their overall benefits. Cloud computing has virtualization as its backbone. Cloud resources are provided on demand over the Internet, and the cloud service provider migrates resources continuously. Failure is inherent in hardware and in software alike. Cloud resources can therefore fail at any time and affect the performance of other resources, which is why cloud computing needs to incorporate a monitoring system. In a dynamic computing environment such as the cloud, it is hard to maintain and analyse the reliability of the resources. A cloud is a combination of different resources acting together to provide services to end users, and virtual machines are the key to providing Infrastructure as a Service (IaaS) to these end users. In this paper, a novel attempt is made to propose a technique for calculating the reliability of a cloud data centre. Further, we propose a mechanism for continuously updating the reliability of cloud resources and for scheduling those resources reliably to cloud users.

Index Terms—Cloud Computing, Datacenter, Virtual Machine, Reliability, Failure_Counter, Cloudlet

I. INTRODUCTION

Cloud computing is a prevailing technology that builds on virtualization to use computing resources efficiently while delivering three basic services, Software as a Service (SaaS), Platform as a Service (PaaS) and Infrastructure as a Service (IaaS), to its users over the Internet [1]. While delivering these services, the cloud always preserves its abstract nature. Users request different computing resources, and cloud service providers pick the required resources from the cloud resource pool and provide them according to each user's requirements. To fulfil as many requests as possible, cloud resources must be used effectively and wisely. Nowadays, the usage of cloud computing is rising rapidly. It is very convenient to choose cloud computing in order to reduce overall business

978-1-4799-7683-6/14/$31.00©2014 IEEE


Dropbox (one of the most significant selling points of cloud services), which users trust as if it were their own local hard drive, was unavailable for almost 16 hours. Again on 30 May 2013, Dropbox was unavailable for 90 hours, and during this period its users could not access their files or upload any new information [6]. On 18 June 2012, the International Working Group on Cloud Computing Resiliency (IWGCR) reported that a total of 568 hours of downtime at 13 well-known cloud services had cost more than US$71.7 million since 2007. On average, the cloud services were unavailable for about 7.5 hours per year, which corresponds to 99.9% availability. This is far from the reliability (99.999%) desired for mission-critical systems. It is therefore essential for cloud service providers to offer reliable services, especially when the systems are mission critical [7]. In present-day public clouds, reliability is offered as a fixed parameter; for example, Amazon announced that its Elastic Compute Cloud (EC2) users can now expect 99.95% uptime, which corresponds to a failure occurring once a week [8]. It is therefore up to the users to harden the tasks running within their virtual machines in order to obtain better reliability. So, while providing the desired reliability to users, cloud service providers must also use computing resources effectively. In this paper we propose a concept of reliable scheduling. For this, we use statistics from Microsoft Research [8], which characterize datacenter reliability and also provide information about other cloud resources.
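The availability figures quoted above follow from simple arithmetic on annual downtime. A minimal Python sketch, assuming a 365-day year (8,760 hours); the function names are illustrative and do not come from the paper:

```python
HOURS_PER_YEAR = 365 * 24  # assumption: non-leap year, 8,760 hours


def availability(downtime_hours_per_year: float) -> float:
    """Fraction of the year a service is up, given its annual downtime."""
    return 1.0 - downtime_hours_per_year / HOURS_PER_YEAR


def allowed_downtime_minutes(target_availability: float) -> float:
    """Annual downtime budget, in minutes, implied by an availability target."""
    return (1.0 - target_availability) * HOURS_PER_YEAR * 60


observed = availability(7.5)                # 7.5 hours of outage per year
budget = allowed_downtime_minutes(0.99999)  # "five nines" target

print(f"observed availability: {observed:.4%}")
print(f"five-nines budget: {budget:.2f} minutes per year")
```

Under these assumptions, 7.5 hours of downtime gives roughly 99.91% availability, whereas a 99.999% target leaves a budget of only about 5.3 minutes per year.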

several faults in cloud environment. To establish consistent support, it is very important for the service layers to incorporate suitable reliability mechanisms. Fault Tolerance Manager (FTM) is a new scheme designed to produce solutions in case of repeated system failure [13]. Proper collaboration between cloud service providers and customers is necessary, because a lack of collaboration between the two often leads to partial or faulty solutions in the cloud environment. The three main types of failure, i.e. application failure, virtual machine failure and hardware failure, are handled by either an exclusive or a collaborative fault tolerance solution. These solutions incorporate several methods, i.e. detection and recovery, replacement, or monitoring using sensors [14]. The adaptive fault tolerance technique in real time cloud computing (AFTRC) handles faults in a real-time cloud computing environment on the basis of the reliability of the virtual machines; to do this efficiently, it uses an acceptance test, a time checker, a reliability assessor and a decision mechanism [15]. Another fault tolerance mechanism uses a success rate parameter: a performance table stores the performance of each virtual machine, and if any virtual machine is found to be faulty, the table is updated accordingly [16].

III. PROPOSED WORK

For reliable scheduling of cloud resources, we propose a technique that consists of six steps. Following the statistics provided by Microsoft Research, we assign reliability values to our proposed datacenter as well as to the virtual machines. In step one, the reliability of the datacenter is assigned. A failure counter is needed to count the number of times the datacenter has failed. The default value of the datacenter failure counter is zero. The default reliability value of each datacenter is 0.92 [8]. If there is any failure in the datacenter, the counter’s value is increased by one.

II. RELATED WORK

Nowadays, the most important concern in cloud computing is reliability. Several kinds of failure may occur in cloud computing: overflow, timeout, missing computing resource, missing data resource, database failure, network failure, hardware failure, software failure, etc. After analyzing these failures, the reliability of a cloud service can be divided into two parts, request-stage reliability and execution-stage reliability; multiplying the two values gives the final reliability of the cloud service [9]. To perform scientific computations efficiently in a cloud environment, it is important to choose the optimal reliable cloud [10]. Two types of faults can occur in a cloud environment, crash faults and Byzantine faults, and popular methods for resolving them are checking and monitoring, checkpoint and restart, and replication [11]. The Low Latency Fault Tolerance (LLFT) middleware provides fault-tolerant, reliable services to different datacenters in a cloud environment. It mainly protects against two types of faults: crash faults and timing faults [12]. Fault tolerance can also be provided on demand. By inserting a new dedicated service layer between the computing framework and the application, it is possible to provide reliable support against

Step 1: Algorithm to assign reliability to the Datacenters:
Cloud_Reliability Assignment (Datacenter)
1. set initial failure_counter = 0
2. set initial reliability of each datacenter = 0.92
3. repeat for each datacenter i (i = 1 to n)
4.   set datacenteri.reliability = reliability
5.   set datacenteri.failure_counter = failure_counter
6. exit

The next step is to set the reliability of each virtual machine. As already discussed, a data center can contain a number of virtual machines, and each virtual machine comprises several resources; the reliability of a virtual machine therefore depends on the reliability of each of its resources. First, we need to set a default reliability for the virtual machine resources. For this we have used Microsoft’s statistics


Step 3: Evaluation (continued):
7. reliability of vmi.reliability = (disk_vmi.reliability * ram_vmi.reliability)
8. if Datacenter failure occurs then
9.   datacenteri.reliability = datacenteri.reliability / 2
10.  datacenteri.failure_counter = datacenteri.failure_counter + 1
11. exit
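The datacenter bookkeeping (a default reliability of 0.92, a failure counter starting at zero, and halving on each failure) can be sketched as follows. This is a minimal Python illustration; the `Datacenter` class and the `record_datacenter_failure` function are hypothetical names, not part of the paper or of CloudSim.

```python
from dataclasses import dataclass

DEFAULT_DATACENTER_RELIABILITY = 0.92  # default value cited in the paper [8]


@dataclass
class Datacenter:
    """Hypothetical record holding the two fields the algorithm maintains."""
    reliability: float = DEFAULT_DATACENTER_RELIABILITY
    failure_counter: int = 0


def record_datacenter_failure(dc: Datacenter) -> None:
    """Halve the reliability and count the failure, as in lines 8-10."""
    dc.reliability /= 2
    dc.failure_counter += 1


datacenters = [Datacenter() for _ in range(3)]  # n = 3 chosen for illustration
record_datacenter_failure(datacenters[0])
record_datacenter_failure(datacenters[0])

print(datacenters[0])  # two failures recorded, reliability quartered
print(datacenters[1])  # untouched datacenter keeps its defaults
```

Keeping the counter alongside the reliability lets the provider see both how unreliable a datacenter has become and how many failures led to that value.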

report. In step 2, we set the default reliability of the cloud resources. The algorithm takes the virtual machine as a parameter; each virtual machine must have a unique ID for identification. The algorithm is repeated for each virtual machine and assumes that there are m virtual machines, each consisting of resources such as disks and RAM. A counter is kept for each cloud resource to monitor the number of failures of that resource; each failure increments the counter by one. The final reliability of the virtual machine is calculated by multiplying the reliability values of all its components, because a virtual machine behaves like a series system: all the components depend on one another, and if any one of them fails, the entire system fails [17].
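The series-system rule described above can be sketched in a few lines of Python. The `VirtualMachine` class and its field names are hypothetical; only the default values 0.93 (disk) and 0.99 (RAM) are taken from Step 2 of the paper.

```python
from dataclasses import dataclass
from math import prod

# Default component reliabilities used in Step 2 of the paper.
DEFAULT_DISK_RELIABILITY = 0.93
DEFAULT_RAM_RELIABILITY = 0.99


@dataclass
class VirtualMachine:
    """Hypothetical VM record: per-component reliability and failure counters."""
    vm_id: int
    disk_reliability: float = DEFAULT_DISK_RELIABILITY
    ram_reliability: float = DEFAULT_RAM_RELIABILITY
    counter_disk: int = 0
    counter_ram: int = 0

    @property
    def reliability(self) -> float:
        # Series system: the VM works only if every component works,
        # so the component reliabilities are multiplied.
        return prod([self.disk_reliability, self.ram_reliability])


vms = [VirtualMachine(vm_id=i) for i in range(1, 4)]  # m = 3 for illustration
print(round(vms[0].reliability, 4))  # 0.9207
```

Computing the VM reliability as a derived property, rather than storing it, means it stays consistent whenever a component value changes; this is a design choice of the sketch, not something the paper prescribes.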

A component needs to be replaced when it fails regularly. The fourth step removes a component when its number of failures reaches a certain limit. When the value of the failure counter reaches a threshold, the component is replaced and the corresponding virtual machine is destroyed. The threshold can be set by the cloud service provider to ensure quality of service. This algorithm thus also supports continuous auditing of cloud components and can help generate a report of the components due for replacement.

Step 2: Algorithm to assign reliability to virtual machines:
Cloud_Reliability Assignment (Virtual Machine)
1. for each vmi (i = 1 to m)
2.   set disk_vmi.reliability = 0.93
3.   set ram_vmi.reliability = 0.99
4.   set counter_diski = 0
5.   set counter_rami = 0
6.   reliability of vmi.reliability = (disk_vmi.reliability * ram_vmi.reliability)
7. exit

Step 4: Removal:
1. if (counter_diski >= threshold && counter_rami >= threshold)
2.   replace both the diski and rami
3.   Destroy the vmi
4. else if (counter_diski >= threshold)
5.   replace the diski
6.   Destroy the vmi
7. else if (counter_rami >= threshold)
8.   replace the rami
9.   Destroy the vmi
10. exit
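The removal step can be sketched in Python as below. The sketch checks each counter independently, which covers the disk-only, RAM-only and both-components cases in one pass. The `VmCounters` class, the `components_to_replace` function and the threshold value of 3 are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class VmCounters:
    """Hypothetical per-VM failure counters for the disk and the RAM."""
    vm_id: int
    counter_disk: int = 0
    counter_ram: int = 0


def components_to_replace(vm: VmCounters, threshold: int) -> List[str]:
    """Return the components whose failure counter has reached the threshold.

    A non-empty result means the components are replaced and the VM destroyed.
    """
    worn_out = []
    if vm.counter_disk >= threshold:
        worn_out.append("disk")
    if vm.counter_ram >= threshold:
        worn_out.append("ram")
    return worn_out


THRESHOLD = 3  # assumption: in practice chosen by the cloud service provider
fleet = [VmCounters(1, 3, 0), VmCounters(2, 1, 1), VmCounters(3, 4, 5)]

# Build the replacement report and keep only the VMs that survive.
report = {vm.vm_id: components_to_replace(vm, THRESHOLD) for vm in fleet}
surviving = [vm for vm in fleet if not report[vm.vm_id]]

print(report)
print([vm.vm_id for vm in surviving])
```

The `report` dictionary doubles as the replacement report mentioned in the text: it lists, per virtual machine, which components are due for replacement.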

If a disk fails, the reliability of that disk and of the virtual machine using it must be updated. This is done using the disk's failure counter: whenever a failure occurs on the disk, its counter is increased by one. The same process is repeated for the RAM. Resources behave such that, once a resource has failed, the probability that it fails again is higher than its failure probability before the failure. Equivalently, the reliability of a component decreases after its first failure. Taking this into account, the algorithm halves the current reliability of a component each time it fails. In this way a discrete decrease in the component's reliability can be simulated. The reliability of the virtual machine is re-evaluated from time to time. This process is shown in Step 3 below. Datacenter reliability and failure calculation can be done in the same way.
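The halving rule implies a geometric decay of reliability. As a worked illustration, assuming the default component reliabilities from Step 2 (0.93 for the disk, 0.99 for the RAM) and writing $R^{(k)}$ for the reliability after $k$ failures (a notation introduced here, not used in the paper):

```latex
\begin{align*}
R^{(k)}_{\text{disk}} &= \frac{R^{(0)}_{\text{disk}}}{2^{k}},
  & R^{(0)}_{\text{disk}} &= 0.93, \quad
    R^{(1)}_{\text{disk}} = 0.465, \quad
    R^{(2)}_{\text{disk}} = 0.2325 \\
R_{\text{vm}} &= R_{\text{disk}} \cdot R_{\text{ram}},
  & 0.93 \times 0.99 &= 0.9207
    \;\longrightarrow\;
    0.465 \times 0.99 = 0.46035 \quad \text{(after one disk failure)}
\end{align*}
```

Because the virtual machine is a series system, a single disk failure also halves the reliability of the whole virtual machine.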


User requests take the form of cloudlets, if we consider requests as modelled in CloudSim. Several virtual machines are created to fulfil the users’ requests, and each virtual machine draws its resources from the datacenter. The resources demanded by a user need not be the same size as the virtual machine's resources, so some extra care is required when allocating virtual machines to user requests. Since best-fit, worst-fit and first-fit strategies are already available, these can be used to allocate virtual machines to the user request cloudlets. Step 5 takes these points into account for effective allocation of resources. This step produces a two-dimensional array in which one dimension represents the cloudlets and the other the virtual machines. The output is a matrix of 1 and 0 values indicating whether or not a particular cloudlet can be allocated to the corresponding virtual machine. The decision to enter 1 or 0 is made on the basis of a threshold value: a resource can be allocated to a cloudlet only if the difference between the demanded resource and the available resource is less than or equal to the threshold.
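The construction of the 0/1 matrix can be sketched as follows. This is a hedged Python illustration, not the paper's exact condition: it assumes that a virtual machine qualifies only when it has at least the demanded amount of each resource and the surplus does not exceed the threshold. The names `Resources` and `allocation_matrix`, the units and the sample sizes are all hypothetical.

```python
from typing import List, NamedTuple


class Resources(NamedTuple):
    """Disk and RAM sizes; the units (e.g. MB) are an assumption."""
    disk: int
    ram: int


def allocation_matrix(cloudlets: List[Resources], vms: List[Resources],
                      threshold: int) -> List[List[int]]:
    """Return matrix[i][j] = 1 if cloudlet i may be placed on VM j, else 0.

    Assumption: a VM qualifies when it has at least the demanded amount of
    each resource and the surplus is no larger than `threshold`.
    """
    matrix = []
    for cloudlet in cloudlets:
        row = []
        for vm in vms:
            d1 = vm.disk - cloudlet.disk
            d2 = vm.ram - cloudlet.ram
            fits = 0 <= d1 <= threshold and 0 <= d2 <= threshold
            row.append(1 if fits else 0)
        matrix.append(row)
    return matrix


cloudlets = [Resources(disk=500, ram=256), Resources(disk=900, ram=512)]
vms = [Resources(disk=1000, ram=512), Resources(disk=600, ram=256)]

for row in allocation_matrix(cloudlets, vms, threshold=300):
    print(row)
```

A small threshold behaves like a best-fit policy, because only closely matching virtual machines are marked 1; a large threshold admits more candidates at the cost of wasted capacity.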

Step 3: Evaluation:
1. if disk failure occurs then
2.   disk_vmi.reliability = disk_vmi.reliability / 2
3.   counter_diski = counter_diski + 1
4. if ram failure occurs then
5.   ram_vmi.reliability = ram_vmi.reliability / 2
6.   counter_rami = counter_rami + 1


Steps 12 to 15 select the virtual machine with the maximum reliability; this most reliable machine is the kth virtual machine. If this virtual machine is free, cloudletr is assigned to it; otherwise, the next most reliable virtual machine is selected. These steps are repeated until all the cloudlets are assigned to virtual machines.
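The selection described above can be sketched as a greedy pass over the cloudlets. This Python example is an illustration under stated assumptions rather than the paper's algorithm: it assumes a virtual machine serves one cloudlet at a time and leaves a cloudlet unassigned when no free compatible virtual machine remains. The function name `schedule` and the sample values are hypothetical.

```python
from typing import Dict, List


def schedule(allowed: List[List[int]], vm_reliability: List[float]) -> Dict[int, int]:
    """Greedy sketch: give each cloudlet the most reliable free, compatible VM.

    allowed[r][k] is 1 when cloudlet r may run on VM k (the 0/1 matrix).
    Assumptions: a VM serves one cloudlet at a time, and a cloudlet with no
    free compatible VM is left unassigned in this pass.
    """
    # VM indices ordered from most to least reliable.
    by_reliability = sorted(range(len(vm_reliability)),
                            key=lambda k: vm_reliability[k], reverse=True)
    busy = set()
    assignment = {}
    for r, row in enumerate(allowed):
        for k in by_reliability:
            if row[k] == 1 and k not in busy:
                assignment[r] = k
                busy.add(k)
                break
    return assignment


allowed = [[1, 1, 0],
           [1, 1, 1],
           [0, 1, 1]]
vm_reliability = [0.9207, 0.46035, 0.9207 / 4]

print(schedule(allowed, vm_reliability))  # {0: 0, 1: 1, 2: 2}
```

In this example the first cloudlet takes the most reliable virtual machine, so the second falls back to the next most reliable one, which mirrors the fallback described in the text.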

Step 5: A[] is an array of virtual machines
for j = 1 to m
  for i = 1 to n
    d1 = vmj.disk – cloudleti.disk
    d2 = vmj.ram – cloudleti.ram
    if (d1
