The 7th International Conference for Internet Technology and Secured Transactions (ICITST-2012)

Reliability Aware Scheduling in Cloud Computing

Sheheryar Malik, Research Team OASIS, INRIA - Sophia Antipolis, Sophia Antipolis, France

Fabrice Huet, Research Team OASIS, INRIA - Sophia Antipolis, Sophia Antipolis, France

Abstract—Cloud computing infrastructure poses many design challenges. Dealing with unreliability is one of the most important, because cloud platforms offer a variety of services to a variety of clients. In this paper, we present a model for assessing the reliability of cloud infrastructure (computing nodes, mostly virtual machines). The reliability values acquired during assessment are used both to schedule work on the cloud infrastructure and to perform fault tolerance. In our model, every compute instance (a virtual machine in PaaS or a physical processing node in IaaS) has reliability values associated with it. The system assesses reliability separately for different types of applications: we use different mechanisms for general applications and for real time applications, and for real time applications the assessment algorithms are time based. All the algorithms converge faster towards failure than towards recovery. If a compute instance passes, its reliability may increase or remain constant, depending on the algorithm. If it fails, its reliability decreases at a faster rate.

Keywords- reliability; reputation; cloud computing; scheduling

I. INTRODUCTION

Cloud computing offers a variety of services through different means: application/software as a service (SaaS), platform as a service (PaaS), or infrastructure as a service (IaaS) [1]. Although cloud services are considered more reliable than conventional systems, many reliability issues remain that make them vulnerable to failures. Assuring a high level of reliability is therefore one of the important design issues in planning a cloud computing infrastructure. Unreliability can arise at various levels of cloud services: the infrastructure level, the platform level, or the application level. In this paper, we deal with reliability at the platform and infrastructure levels. There are many ways to deal with unreliability, and the appropriate one also depends on the type of application. In this paper, we present a reliability assessment model for the different types of applications executing on cloud computing infrastructure. This reliability model assists the cloud scheduler in scheduling tasks on the cloud infrastructure and helps in performing fault tolerance. The overall goal is to provide the user with a set of nodes with a high level of reliability. The model focuses on reliability assessment and scheduling for the platform as a service and infrastructure as a service models. In the case of a platform as a service cloud, we assess the reliabilities for the platform

978-1-908320-08/7/$25.00©2012 IEEE

Denis Caromel, Department of Computer Science, University of Nice Sophia Antipolis, Sophia Antipolis, France

resources (i.e. virtual machines in most cases). In this case, reliability is affected by failures caused by both the cloud platform and the infrastructure. A problem in the user application logic is not considered a failure of the cloud platform/infrastructure. In the case of infrastructure as a service, in contrast, we assess the reliabilities of the physical infrastructure offered as a service by the cloud provider. A failure caused by the application or the platform is not considered a fault of the cloud provider; only a fault at the infrastructure level is. Throughout this paper we use the term compute instance, which can refer either to an infrastructure resource or to a platform resource, depending on the cloud computing service model. In the platform as a service model, a compute instance refers to a processing node's platform (a virtual machine in most cases). In the infrastructure as a service model, a compute instance refers to the physical processing node. Note also that this paper mainly focuses on the reliability assessment of compute instances. Other issues such as fault monitoring (distinguishing the fault source), fault tolerance, and scheduling methodology are outside the scope of this paper. In the model, we present different algorithms for different types of tasks. A compute instance has three different reliability values: one each for general, soft real time, and hard real time applications. Depending on the type of application, the reliability assessment mechanism triggers the associated algorithm and performs the reliability assessment. For real time applications we introduce time based reliability assessment, because the correctness of a real time application depends not only on the logical result but also on the time at which the result is produced [2].
In general, a real time system is any information processing system that has to respond to externally generated input stimuli within a finite and specified period of time [3]. In some cases, failure to respond is as bad as a wrong result, or even worse [4]. A mistimed response can be fatal in a safety critical system. Real time cloud computing is more prone to errors, because we do not know where and how the application is going to be executed [5]. We therefore present in this paper a reliability assessment mechanism for scheduling and fault tolerance. Because the use of cloud infrastructure can increase the error probability, it is important to provide fault tolerance for safety critical real time systems [6].


The rest of the paper is structured as follows. Section 2 reviews related work in the area. Section 3 presents our proposed model. Section 4 describes the results of the experimental evaluation of our proposed algorithms. Section 5 concludes and discusses future research directions.

II. RELATED WORK

Most of the related work is in the domain of reputation assessment and reputation based scheduling. Most of these models/algorithms rely on binary trust values, whereas in our approach reliability is represented as a continuous fractional number. Real time applications can be strongly affected by timeliness, so general algorithms should not be applied to them; there should be different algorithms depending on the type of application. We therefore need a reliability assessment model that assesses reliability on the basis of the application type, whereas most existing work proposes a single algorithm with no choice of algorithm for different application types. Most reliability assessment algorithms for real time systems check timeliness as part of the correctness of the result, but timeliness makes no direct contribution to the assessed reliability. Most of the work targets peer-to-peer systems or volunteer computing systems. These works mainly focus on reputation assessment, which is more applicable in P2P: reputation is assessed using feedback from the peers and is normally affected by malicious behavior, virus attacks, or computational incapacity of the nodes. Most of these works cannot be employed in cloud computing, because in a cloud we normally have no feedback about processing nodes from their peer processing nodes, whereas most reliability assessment models for P2P networks assess reliability from peer feedback using different mechanisms.
Cloud computing is quite different from the P2P model, and these techniques are not appropriate for it. In cloud computing, the vendor offers high quality services with higher throughput and reliability, free from viruses and malicious behavior. After analyzing the existing work on reliability assessment, we formulated a problem statement. None of the existing models fulfills all of its requirements, so we need a model that does. The requirements are:
(1) Reliability values should not be binary. They should be continuous numbers.
(2) The reliability of a node (compute instance) should be evaluated by a cloud management module (scheduler or resource manager). It must be evaluated directly from the application performance by the cloud management module, not from the feedback of peer nodes.
(3) Because real time applications can be strongly affected by timeliness, there should be different algorithms depending on the type of application, with separate algorithms for real time and non-real time applications.
(4) Timeliness should play a role in the reliability assessment.
(5) The model should be able to perform both scheduling and fault tolerance on the basis of the reliability values.

The existing work on reliability/reputation assessment is as follows. Sonnek et al. presented an adaptive reputation-based scheduling model [7] for large scale donation-based distributed infrastructures. The scheme is primarily proposed for volunteer computing systems like BOINC. It verifies results through redundant computation and voting: each task is redundantly assigned to several computational nodes, and after the results are received, a vote is held to find the most agreed-upon result. Achim et al. presented a model for reputation based service selection in cloud environments [8]. They address the issue of selecting the right service in a distributed/cloud environment for any application that needs access to the best service endpoint in terms of performance. The service selection mechanism is based on a reputation function that considers execution cost as a performance criterion. Opera (OPEn ReputAtion), proposed by Nguyen and Shi [9], is a reputation based resource selection mechanism that aims to improve resource efficiency in data centers by reducing the number of failures. Opera uses a vector to represent the reputation of a resource, capturing its heterogeneity from different points of view. Hou et al. introduced a reputation based grid resource selection model [10]. This model evaluates the reputation of an application at run time by accumulating the raw score. It adapts dynamically to the run-time load, availability, and performance of the grid resources, and was evaluated with applications on the Open Science Grid. Wang et al. presented a reliability driven, reputation based scheduler for public resource computing [11]. They used genetic algorithms to compute the reputation values.
They proposed a knowledge based genetic algorithm that optimizes reliability and time for workflow applications, and they used the application run time in calculating the reputation. Rahman et al. proposed a reputation based scheduling mechanism for workflow applications in peer-to-peer grids [12]. It employs structured P2P indexing and networking techniques to create a grid overlay. Reliability is computed in this grid overlay on the basis of dynamic feedback or reliability scores assigned by individual service clients. Damiani et al. also proposed a reputation based mechanism for reliable resource selection in P2P networks [13]. In their approach, reliability is assessed through distributed polling. The EigenTrust algorithm, proposed by Kamvar et al., performs reputation management in peer-to-peer file sharing networks [14]. The algorithm decreases the number of downloads of unauthenticated files in a P2P file sharing network. It assigns each peer a unique reputation value based on its history. The system calculates the reputation score of a peer using an eigenvector.


PowerTrust is a reputation system for P2P systems [15]. It uses a distributed ranking system to select the most reputable nodes and employs a look-ahead random walk strategy to improve reputation accuracy.

III. PROPOSED MODEL - RELIABILITY ASSESSMENT FOR GENERAL & REAL TIME CLOUD COMPUTING

In this paper, we present a model for assessing the reliability of compute instances and for scheduling on the basis of those reliability values. In our model, each compute instance has three reliability values (one each for general, soft real time, and hard real time applications). At any instant, the model calculates the reliability value for one type, depending on the type of application executing on the compute instance. It helps the scheduler to schedule on the compute instance(s) with the highest reliability for the specified application type. Scheduling itself is the duty of the cloud scheduler and does not fall within the scope of our research. However, our RACS scheduling module assists the cloud scheduler (the ProActive scheduler) in scheduling on the basis of reliability. This model is part of our resource aware cloud computing framework. In the framework, we have proposed a cloud scheduler module, the Resource Aware Cloud Scheduling (RACS) module [16], which helps the cloud scheduler make scheduling decisions on the basis of different resource characteristics/criteria. Reliability is one of those criteria. The resulting implementation of reliability based scheduling and fault tolerance is a sub-module of the RACS module. We have plugged our RACS module into the ProActive scheduler [17].

A. System Model

Our proposed model revolves around reliability assessment and reliability based scheduling and fault tolerance on cloud infrastructure. The system model consists of different modules, as shown in Fig. 1. Some of them are integral parts of our proposed model and some are external entities.
The description of each module is as follows.

Input Buffer (IB) provides the input to the application logic. It holds the data values from the data store, sensors, and user configurations.

Application Logic (AL) is a program executing on cloud infrastructure. It can be a general application or a real time application. The result of the application logic is forwarded to the fault monitor for verification.

Fault Monitor (FM) module is responsible for finding the failures and faults that occur during execution. It tries to locate the source of a failure, i.e. whether the failure is caused by the application logic or by the underlying platform. It also triggers the acceptance test to verify the correctness of the result produced by the application logic running on a node (virtual machine). In the case of a general application, it directly informs the reliability assessment module about the correctness of the result. In the case of a real time application, it first passes the result to the time checker


Fig. 1. Proposed System Model

(TC) module, which verifies the timeliness of execution and then forwards the result to the reliability assessor module along with the timing information.

Time checker (TC) module is a time checking sensor for real time applications. It has a watchdog timer (WDT), which monitors the time at which the result is produced while executing a real time application. The TC module passes results to the reliability assessor (RA) module. It passes on a correct result only if the result was produced in time, and it raises a time-overrun signal if the result is produced after the maximum deadline.

Reliability assessor (RA) module assesses the reliability of each compute instance. It is the core module of the proposed model. It has different algorithms for the reliability assessment of the different types of applications. The reliability of a compute instance is adaptive and changes after every computing cycle. In the beginning, the reliability of each compute instance is 1.0. If the result after a computing cycle is correct (and also within the time limit for real time applications), the reliability of the compute instance increases or remains the same. If the result is incorrect (or outside the time limit for real time applications), its reliability decreases. The reliability assessment algorithm converges faster towards failure conditions, meaning that a decrease in reliability is larger than an increase.

Scheduler is responsible for scheduling the user applications/tasks on the compute instances. It makes the scheduling decision on the basis of the reliability of the compute instances. Typically, it selects the compute instance with the highest reliability within the required category (i.e. general, soft real time, or hard real time).

Decision mechanism (DM) for fault tolerance helps the system to tolerate faults and makes its decision on the basis of the reliability of the compute instances. The decision mechanism selects a final output result for a computing cycle from the multiple results acquired from the invariant compute instances. To provide forward recovery, we need to run multiple invariants of


a compute instance in parallel. The decision mechanism selects the output of the compute instance that has the highest reliability among the competing compute instances. Fault tolerance is not within the scope of this paper. To perform fault tolerance, checkpointing is also done with the help of communication induced checkpointing (CIC) [18], [17]. The CIC performs checkpointing at the end of each cycle to maintain a global state.

B. Reliability Assessment Mechanism

In our model, a compute instance has reliability values associated with it. The system assesses the reliability for different types of applications. We divide applications into two main categories: general applications and real time applications. Real time applications are further divided into two categories: soft real time applications and hard real time applications. Thus each compute instance has three reliability values, one for each application type. We also have different reliability assessment algorithms for the three application categories. For general applications, we use a basic reliability assessment mechanism. For soft real time and hard real time applications, we use a time based reliability assessment mechanism. In time based reliability assessment, not only the logical result of the application but also the time of delivery affects the reliability rating of a compute instance. The assessment mechanisms for soft and hard real time differ from each other, because the time requirements in hard real time are stricter than in soft real time applications.

1) Reliability Assessment for General Computing: We propose an algorithm for reliability assessment for general cloud computing, which is given in Algorithm 1. This algorithm is applicable to any sort of cloud computing application, including real time.
However, for real time applications we recommend the other two algorithms, which are specifically designed for the reliability assessment of real time applications. This algorithm is adaptive and converges faster towards failure: in most cases, a decrease in reliability is larger than an increase. We apply this reliability assessment algorithm to each compute instance, one by one. Initially, the reliability of a compute instance is set to 1.0. To control the adaptability of the reliability assessment, there is an adaptability factor N, which is a natural number (always greater than 0). The inputs to the algorithm are the factors RF, minReliability, maxReliability and recentPassRequired, read from the configuration file. RF is a reliability factor, which increases or decreases the reliability of the compute instance. In conjunction with the adaptability factor N, it decreases the reliability of the node more quickly than it increases it. minReliability is the minimum reliability level. If a compute instance reaches this level, it is stopped from performing further operations. maxReliability is the maximum reliability level. A compute instance's reliability cannot exceed this level. This cap matters when a compute instance initially produces correct results in consecutive cycles but then fails again and again: its reliability should not have grown so high that it is difficult to decrease


and converge towards lower reliability. recentPassCount is a counter that records the number of most recent consecutive successes. It is incremented by 1 after every pass and reset to 0 after any failure. When it reaches recentPassRequired, the adaptability factor N is reset to 1 to slow down the rate of reliability decrement in case of a future failure. The values of all the above variables depend on the configuration set by the cloud administrator. The algorithm is designed to weigh failures in the near past more heavily. So if two compute instances each have 10 passes and 10 failures in a total of 20 cycles, the one with more failures in the near past is more likely to have the lower reliability. This is similar to latency issues, where a compute instance's latency was initially good but then became high; such a compute instance is expected to be more prone to timeliness failures.

Algorithm 1: Reliability Assessment Algorithm for General Applications - Failure Convergent
Data: RF, maxReliability, minReliability, nodeStatus, recentPassRequired
Result: General reliability value for a node
initialization: reliability = 1, N = 1, recentPassCount = 0
if nodeStatus = Pass then
    reliability ← reliability + (reliability * RF)
    if N > 1 then
        N ← N − 1
    end
    recentPassCount ← recentPassCount + 1
    if recentPassCount ≥ recentPassRequired then
        N ← 1
    end
else if nodeStatus = Fail then
    reliability ← reliability − (reliability * RF * N)
    N ← N + 1
    recentPassCount ← 0
end
if reliability ≥ maxReliability then
    reliability ← maxReliability
end

2) Reliability Assessment for Soft Real Time Computing: Soft real time computing has timeliness requirements that are not very strict. We propose an algorithm for the reliability assessment of soft real time applications, which is given in Algorithm 2. This algorithm converges faster towards failures.
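As an aside on Algorithm 1 (general applications), the update rule can be sketched in Python as follows. This is a minimal illustration and not the authors' implementation: the class name `GeneralReliability`, the method names, and the numeric values in the demo are hypothetical. The `is_stopped` check is an assumption drawn from the prose description of minReliability, since the printed pseudocode does not use that input.

```python
from dataclasses import dataclass


@dataclass
class GeneralReliability:
    """Per-instance state for the general-application assessment (Algorithm 1)."""
    rf: float                   # RF: reliability factor from the configuration
    max_reliability: float      # maxReliability: upper cap on reliability
    min_reliability: float      # minReliability: level at which the instance is stopped
    recent_pass_required: int   # recentPassRequired
    reliability: float = 1.0    # every instance starts at 1.0
    n: int = 1                  # adaptability factor N
    recent_pass_count: int = 0  # consecutive passes since the last failure

    def update(self, passed: bool) -> float:
        """Apply one computing cycle's pass/fail outcome; return the new reliability."""
        if passed:
            self.reliability += self.reliability * self.rf
            if self.n > 1:
                self.n -= 1
            self.recent_pass_count += 1
            if self.recent_pass_count >= self.recent_pass_required:
                self.n = 1
        else:
            # The penalty scales with N, so consecutive failures cost progressively more.
            self.reliability -= self.reliability * self.rf * self.n
            self.n += 1
            self.recent_pass_count = 0
        if self.reliability >= self.max_reliability:
            self.reliability = self.max_reliability
        return self.reliability

    def is_stopped(self) -> bool:
        # Assumption: "reaches this level" is read as reliability <= minReliability.
        return self.reliability <= self.min_reliability


if __name__ == "__main__":
    # Illustrative values only; the paper leaves them to the cloud administrator.
    vm = GeneralReliability(rf=0.02, max_reliability=1.2,
                            min_reliability=0.5, recent_pass_required=3)
    for outcome in (False, False, True):
        print(round(vm.update(outcome), 6), vm.n)
```

With these illustrative settings, two consecutive failures cost 2% and then 4% of the current reliability, while the following pass recovers only 2%, which is the asymmetry the text calls convergence towards failure.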
The soft real time algorithm decreases the reliability of a compute instance in every cycle in which there is a failure. In case of a pass, it does not increase the reliability in every cycle. Instead, it tracks how many times the compute instance has passed within a specific number of cycles. If the pass ratio is good, it resets the reliability to 1. During the course of execution of a soft real time application,


Algorithm 2: Reliability Assessment Algorithm for Soft Real Time Applications
Data: TT, DT, maxTT, minTT, PMR, RPR, LSNC
Result: Soft Real-Time Reliability value for a node
initialization: reliability = 1, RF = 0.0, RPC = 0
if minTT ≤ DT ≤ maxTT then
    timeResult ← pass
    if acceptanceTestResult = pass then
        nodeResult ← pass
        RPC ← RPC + 1
        if (PC / LSNC) ≥ PMR & RPC ≥ RPR then
            nodeReliablityReset ← true
            reliability ← 1.0
            RPC ← 1
        end
    else
        nodeResult ← fail
        RF ← FR
    end
else
    if acceptanceTestResult = pass then
        if DT > maxTT then
            TD ← DT - maxTT
        else if DT < minTT then
            TD ← minTT - DT
        end
        RF ← TD / TT
        if RF > FR then
            RF ← FR
        end
    else
        RF ← FR
    end
    nodeResult ← fail
end
if nodeResult = fail then
    RPC ← 0
    reliability ← reliability - (reliability * RF)
end
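A minimal Python sketch of Algorithm 2 is given here for illustration only. The class name `SoftRealTimeReliability` and the numeric values used to exercise it are hypothetical. Two points are assumptions, because the pseudocode does not specify them: the pass count PC is taken over a sliding window of the last LSNC cycle outcomes, and the current cycle's outcome is recorded before the reset condition is checked.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class SoftRealTimeReliability:
    tt: float       # TT: threshold (ideal delivery) time
    min_tt: float   # minTT: minimum tolerance time
    max_tt: float   # maxTT: maximum tolerance time
    fr: float       # FR: failure reliability, set by the administrator
    pmr: float      # PMR: pass marks required for a reset
    rpr: int        # RPR: recent consecutive passes required for a reset
    lsnc: int       # LSNC: number of recent cycles considered
    reliability: float = 1.0
    rpc: int = 0    # RPC: recent (consecutive) pass count
    recent: deque = field(default_factory=deque, init=False)

    def __post_init__(self):
        # Assumption: PC is the number of passes among the last LSNC outcomes.
        self.recent = deque(maxlen=self.lsnc)

    def update(self, dt: float, acceptance_passed: bool) -> float:
        """Apply one cycle with delivery time dt (DT); return the new reliability."""
        on_time = self.min_tt <= dt <= self.max_tt
        if on_time and acceptance_passed:
            # Pass: reliability is unchanged unless the reset condition holds.
            self.recent.append(True)
            self.rpc += 1
            pc = sum(self.recent)
            if pc / self.lsnc >= self.pmr and self.rpc >= self.rpr:
                self.reliability = 1.0
                self.rpc = 1
            return self.reliability
        if acceptance_passed:
            # Correct result delivered outside the tolerance window:
            # RF is the delay relative to TT, capped at FR.
            td = dt - self.max_tt if dt > self.max_tt else self.min_tt - dt
            rf = min(td / self.tt, self.fr)
        else:
            # Acceptance test failed (whether on time or not): RF is FR.
            rf = self.fr
        self.recent.append(False)
        self.rpc = 0
        self.reliability -= self.reliability * rf
        return self.reliability
```

For example, with TT = 100, a tolerance window of [90, 110] and FR = 0.1 (all illustrative), a correct result delivered at time 115 gives TD = 5 and RF = 0.05, so the reliability drops from 1.0 to 0.95.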

the result should ideally be delivered at the threshold time. However, a little tolerance in time is allowed. In the algorithm, we change the reliability of a compute instance by comparing the actual delivery time of the result with the threshold time. The algorithm checks whether the delivery time lies between the minimum tolerance time and the maximum tolerance time. If so, it checks whether the application has passed the acceptance test. If it has passed, the reliability is left unchanged; otherwise the algorithm decreases the reliability by the reliability factor (set equal to the failure reliability). If, on the other hand, the delivery time is outside the tolerance time, we also check the acceptance test result. If the test failed, we decrease the reliability by the reliability factor (set equal to the failure reliability). If the acceptance test passed, we calculate the reliability factor to decrease


the reliability. To calculate it, we compare the delivery time with the maximum tolerance time (if the delivery time is more than the maximum tolerance time) or the minimum tolerance time (if the delivery time is less than the minimum tolerance time). The difference between the two values is the time delay. We then divide the time delay by the threshold time to get the reliability factor. If the reliability factor is greater than 0.0, we decrease the virtual machine's reliability by the product of the reliability factor and the current reliability. If the virtual machine passes in the next cycle, we increase its reliability with the same reliability factor. If it passes again in the subsequent cycle (meaning it also passed in the last cycle), we do not increase the reliability further. Thus reliability can never reach the value of 1.0 again once the virtual machine has failed. The algorithm uses the following concepts:

Threshold time (TT) is the time at which the result should ideally be delivered.

Delivery time (DT) is the time at which the result was actually delivered.

Minimum tolerance time (minTT) is the earliest time at which the result may be delivered. It is less than the threshold. The application performance is supposed to be 100% accurate within this time.

Maximum tolerance time (maxTT) is the latest time at which the result may be delivered. It is more than the threshold. The application performance is supposed to be 100% accurate within this time.

Time delay (TD) is the difference between the threshold time and the delivery time.

Reliability factor (RF) is the fraction by which reliability is decreased in case of a failure.

Pass count (PC) is a counter of how many times a node has passed in the last specified number of cycles.

Pass marks required (PMR) is the minimum level required to reset the reliability to 1. Its value is set by the administrator. Ideally it should be more than 0.9 (90%).

Failure Reliability (FR) is a factor used to decrease the reliability in case of a failure caused by failing the acceptance test. It is set by the cloud administrator. When the result is failed by the AT and also delivered outside the tolerance time, we check whether the failure reliability is less than the reliability factor. If it is less, we determine the reliability factor by the ratio of the delay time to the threshold time; otherwise we set the reliability factor to the failure reliability.

Recent pass count (RPC) is a counter that records the number of most recent consecutive successes. It is incremented by 1 after every pass and reset to 0 after any failure. It works with the pass count to reset the reliability to 1.0 if the criterion for recent pass required is met.

Last Specific Number of Cycles (LSNC) is the number of recent cycles used to calculate the ratio of passes to failures.

recentPassRequired (RPR) is the number of recent consecutive passes required to reset the reliability value.

3) Reliability Assessment for Hard Real Time Computing: In hard real time computing, the timeliness requirements are very strict. Failure to respond exactly on time can be very dangerous. A response outside the tolerance time is a complete failure. Even a response within the tolerance time but not at the threshold time results in a decrease in reliability. We have given our


proposed algorithm in Algorithm 3. We start by checking whether the delivery time is within the tolerance time (between the minimum tolerance time and the maximum tolerance time). If it is not, we reduce the reliability by a factor of the failure reliability. If it is within the tolerance time, we check whether the result has passed the acceptance test. If it did not pass the acceptance test, we decrease the reliability by a factor of the failure reliability. If the acceptance test passed, we calculate the reliability factor value and then decrease the reliability by the reliability factor. Our proposed algorithm uses many concepts already introduced for the soft real time algorithm. We do not repeat them here, except for the minimum tolerance time and the maximum tolerance time. In this algorithm, the minimum tolerance time is the earliest time at which the result may be delivered. It is less than the threshold. The application performance is supposed to be degraded, but still acceptable. The maximum tolerance time is the latest time at which the result may be delivered. It is more than the threshold. The application performance is supposed to be degraded, but still acceptable. Failure Reliability is a factor used to decrease the reliability in case of a failure. In case of a partial failure, it is used to determine the reliability factor. A partial failure means that the result could not be delivered exactly at the threshold time, but the delivery time is within the maximum and minimum tolerance times and the result has passed the acceptance test. In case of a complete failure, we decrease the reliability by this factor. A complete failure means that the delivery time is less than the minimum tolerance time or more than the maximum tolerance time, or that the result did not pass the acceptance test.

IV. EXPERIMENTS AND RESULTS

We conducted our experiments using the ProActive cloud/grid interface to the Amazon EC2 cloud.
We created a total of six virtual machines to run our experiments, two for each type of application. Each pair of virtual machines (compute instances) for a given application is an invariant. However, the application logic for one type of application is the same. Each application logic has a series of tasks, and each task runs in one computing cycle. The general application requires 10 computing cycles to finish its job, whereas the soft real time and hard real time applications require 15 computing cycles. VirtualMachine-1 and VirtualMachine-2 run the general application. VirtualMachine-3 and VirtualMachine-4 run the soft real time application (a real time financial analysis application). VirtualMachine-5 and VirtualMachine-6 run the hard real time application (a real time multimedia streaming application). The reliability assessment algorithm runs on the cloud scheduler node. This node also hosts other components, such as the time checker and the decision mechanism. The whole implementation is based on the active object model of the ProActive Parallel Suite [18], where each virtual machine behaves like an active object. Checkpointing is performed for the sake of fault tolerance


Algorithm 3: Reliability Assessment Algorithm for Hard Real Time Applications
Data: TT, DT, maxTT, minTT, PMR, RPR, LSNC
Result: Hard Real-Time Reliability value for a node
initialization; reliability = 1, RF = 0.0, RPC = 0
if minTT ≤ DT ≤ maxTT then
    timeResult ← pass
    if acceptanceTestResult = pass then
        nodeResult ← pass
        if DT > TT then
            TD ← DT - TT
            RF ← (TD * FR) / (maxTT - TT)
        else if DT < TT then
            TD ← TT - DT
            RF ← (TD * FR) / (TT - minTT)
        else if DT = TT then
            RF ← 0
        end
        reliability ← reliability - (reliability * RF)
        RPC ← RPC + 1
        if (PC / LSNC) ≥ PMR & RPC ≥ RPR then
            nodeReliablityReset ← true
            reliability ← 1.0
            RPC ← 1
        end
    else
        nodeResult ← fail
        RPC ← 0
        reliability ← reliability - (reliability * FR)
    end
else
    nodeResult ← fail
    RPC ← 0
    reliability ← reliability - (reliability * FR)
end
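As an illustration, the steps of Algorithm 3 can be sketched in Python. This is a minimal sketch under stated assumptions, not the implementation used in our experiments: the class name HardRealTimeNode and the method assess are hypothetical, the default parameter values are borrowed from the footnote of Table II purely for illustration, and PC is interpreted as the number of passes recorded in the last LSNC cycles.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class HardRealTimeNode:
    """Hypothetical container for the per-node state of Algorithm 3."""
    fr: float = 0.10      # failure reliability (FR); illustrative default
    lsnc: int = 10        # last specific number of cycles (LSNC)
    pmr: float = 0.9      # pass marks required (PMR)
    rpr: int = 3          # recent passes required (RPR)
    reliability: float = 1.0
    rpc: int = 0          # recent pass count (RPC)
    history: deque = field(default_factory=deque)

    def __post_init__(self):
        # Keep only the outcomes of the last LSNC cycles (assumed meaning of PC).
        self.history = deque(self.history, maxlen=self.lsnc)

    def assess(self, tt, dt, min_tt, max_tt, acceptance_passed):
        """Update the reliability after one cycle; return True on a pass."""
        passed = (min_tt <= dt <= max_tt) and acceptance_passed
        self.history.append(passed)
        if not passed:
            # Complete failure: outside tolerance or acceptance test failed.
            self.rpc = 0
            self.reliability -= self.reliability * self.fr
            return False
        # Partial failure: RF grows linearly with the distance from TT.
        if dt > tt:
            rf = (dt - tt) * self.fr / (max_tt - tt)
        elif dt < tt:
            rf = (tt - dt) * self.fr / (tt - min_tt)
        else:
            rf = 0.0
        self.reliability -= self.reliability * rf
        self.rpc += 1
        pc = sum(self.history)  # passes in the last LSNC cycles
        if pc / self.lsnc >= self.pmr and self.rpc >= self.rpr:
            self.reliability = 1.0  # reliability reset
            self.rpc = 1
        return True
```

For example, with TT = 1000, minTT = 900 and maxTT = 1100, a result delivered at DT = 1050 that passes the acceptance test gives RF = (50 * 0.10) / 100 = 0.05, so a reliability of 1.0 drops to 0.95, whereas a result delivered at DT = 1200 is a complete failure and costs the full 10%.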

and is done using the CIC protocol [17]. Our experiments have the following configuration. Virtual machines 1 and 2 run application logic 'AL1' and have acceptance test 'AT1'. Virtual machines 3 and 4 run application logic 'AL2' and have acceptance test 'AT2'. Virtual machines 5 and 6 run application logic 'AL3' and have acceptance test 'AT3'.

In the first experiment, we tested a general application on two different virtual machines. The results of the experiment are shown in Table I. In this experiment, VM-2 failed in the first cycle and VM-1 passed, so the scheduling preference goes to VM-1. VM-1 then fails in cycles 2 and 3, and thus its reliability is decreased by the combination of the reliability factor (RF) and the adaptability factor (N).

In the second experiment, we tested a soft real time application on two different virtual machines, as shown in Table II. Here, in the first four cycles, both machines failed only once. Thus at the end of 4 cycles they have the same


TABLE I
EXPERIMENT RESULTS FOR THE RELIABILITY ASSESSMENT OF GENERAL APPLICATION

       | Virtual Machine-1                 | Virtual Machine-2                 |
Cycle  | FM    Nbef  RF    Rel.   RPC Naft | FM    Nbef  RF    Rel.   RPC Naft | Preference
Start  | -     1     0.03  1.000  0   -    | -     1     0.03  1.000  0   -    | -
1      | Pass  1     0.03  1.030  1   1    | Fail  1     0.03  0.970  0   2    | VM-1
2      | Fail  1     0.03  0.999  0   2    | Pass  2     0.03  0.999  1   1    | Tie
3      | Fail  2     0.06  0.939  0   3    | Pass  1     0.03  1.029  2   1    | VM-2
4      | Pass  3     0.03  0.967  1   2    | Pass  1     0.03  1.060  3   1    | VM-2
5      | Pass  2     0.03  0.996  2   1    | Pass  1     0.03  1.092  4   1    | VM-2
6      | Pass  1     0.03  1.026  3   1    | Fail  1     0.03  1.060  0   2    | VM-2
7      | Pass  1     0.03  1.057  4   1    | Fail  2     0.06  0.996  0   3    | VM-1
8      | Pass  1     0.03  1.089  5   1    | Fail  3     0.09  0.906  0   4    | VM-1
9      | Pass  1     0.03  1.122  6   1    | Pass  4     0.03  0.933  1   3    | VM-1
10     | Fail  1     0.03  1.088  0   2    | Pass  3     0.03  0.961  2   2    | VM-1

Reliability Factor = 0.03, Maximum Reliability = 1.20, Minimum Reliability = 0.50, Recent Pass Required = 5
Abbreviations: FM → Fault Monitor result, Nbef → adaptability factor N before cycle, Naft → adaptability factor N after cycle, RPC → recent pass count, RF → reliability factor, Rel. → reliability
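As an illustration, the update behind the Rel. column of Table I can be sketched in Python. This is a minimal sketch and not the implementation used in the experiments: the update rule is inferred from the tabulated values (a pass increases the reliability by RF and lowers N towards 1, a fail decreases it by RF * N and raises N), the function names update_reliability and trace are hypothetical, rounding to three decimals after every cycle is an assumption made to match the printed precision, and the handling of a node that falls below the minimum reliability is not modelled.

```python
def update_reliability(reliability, n, passed, rf=0.03, max_rel=1.20):
    """One cycle of the adaptive update inferred from Table I (an assumption)."""
    if passed:
        reliability += reliability * rf
        n = max(1, n - 1)            # N decays back towards 1 on a pass
    else:
        reliability -= reliability * rf * n
        n += 1                       # consecutive failures are penalised more
    reliability = min(reliability, max_rel)  # cap at the maximum reliability
    return round(reliability, 3), n


def trace(results, rf=0.03):
    """Return the reliability after each cycle, starting from 1.0 and N = 1."""
    reliability, n = 1.0, 1
    values = []
    for passed in results:
        reliability, n = update_reliability(reliability, n, passed, rf)
        values.append(reliability)
    return values


# Fault Monitor results of VM-1 in Table I, cycles 1 to 10.
vm1 = [True, False, False, True, True, True, True, True, True, False]
print(trace(vm1))
```

Under these assumptions the trace for VM-1 agrees with the VM-1 Rel. column of Table I (ending at 1.088 after cycle 10). The same sketch differs from the VM-2 column by 0.001 in places, so it should be read as an approximation of the update rather than an exact reconstruction.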

TABLE II
EXPERIMENT RESULTS FOR THE RELIABILITY ASSESSMENT OF SOFT REAL TIME APPLICATION

       | Real Time Limit       | Virtual Machine-1                        | Virtual Machine-2                        |
Cycle  | TT     minTT  maxTT   | FM    DT     RF     Rel.   Node Status   | FM    DT     RF     Rel.   Node Status   | Preference
Start  | -      -      -       | -     -      -      1.000  -             | -     -      -      1.000  -             | -
1      | 9300   8700   9800    | Pass  9330   0.000  1.000  Pass          | Pass  10120  0.000  1.000  Pass          | Tie
2      | 4450   4200   4550    | Pass  4370   0.000  1.000  Pass          | Fail  4360   0.100  0.900  Fail          | VM-1
3      | 5700   4600   5900    | Fail  5400   0.100  0.900  Fail          | Pass  5880   0.000  0.900  Pass          | Tie
4      | 1700   1450   2050    | Pass  1710   0.000  0.900  Pass          | Pass  1640   0.000  0.900  Pass          | Tie
5      | 7100   6000   7700    | Pass  6980   0.000  0.900  Pass          | Fail  6200   0.100  0.810  Fail          | VM-1
6      | 6750   6150   8100    | Pass  8730   0.093  0.816  Fail          | Pass  6410   0.000  0.810  Pass          | VM-1
7      | 14350  13900  15000   | Pass  14020  0.000  0.816  Pass          | Pass  14250  0.000  0.810  Pass          | VM-1
8      | 4800   4150   5300    | Fail  4760   0.100  0.734  Fail          | Pass  5590   0.060  0.761  Fail          | VM-2
9      | 7350   7000   7800    | Fail  5410   0.100  0.661  Fail          | Pass  7440   0.000  0.761  Pass          | VM-2
10     | 4900   4400   5250    | Pass  4690   0.000  0.661  Pass          | Pass  4570   0.000  0.761  Pass          | VM-2
11     | 5600   5200   6000    | Pass  5580   0.000  0.661  Pass          | Pass  5710   0.000  0.761  Pass          | VM-2
12     | 3450   3000   3800    | Pass  3320   0.000  0.661  Pass          | Pass  3360   0.000  0.761  Pass          | VM-2
13     | 4500   3500   5000    | Pass  4570   0.000  0.661  Pass          | Pass  4390   0.000  0.761  Pass          | VM-2
14     | 4150   3800   4550    | Pass  1260   0.100  0.595  Fail          | Pass  4180   0.000  0.761  Pass          | VM-2
15     | 6150   5600   6800    | Fail  4910   0.100  0.536  Fail          | Pass  6430   0.000  1.000  Pass          | VM-2

Failure Reliability = 0.10, Last Specific Number of Cycles = 10, Pass marks required = 0.9, Recent pass required = 3
Abbreviations: TT → threshold time, minTT → minimum tolerance time, maxTT → maximum tolerance time, FM → Fault Monitor result, DT → delivery time, RF → reliability factor, Rel. → reliability

scheduling preference. Because this is a real time system, its failure reliability is 10%, so the reliability factor can be at most this value. A node has to pass at least 90% of the time in the last 10 cycles for its reliability to be reset. We can see that VM-2 has passed nine times in the last 10 cycles of the execution (from cycle 6 to cycle 15), so the reliability of VM-2 is reset to 1.0.

In the third experiment, we tested a hard real time application on two different virtual machines, and the results are shown in Table III. In this experiment, VM-2 initially performed better, with fewer failures and smaller deviations from the threshold time. Later, however, it failed on many occasions because it was unable to deliver within the tolerance time. VM-1, in contrast, performed better from the fourth cycle onwards and had eleven consecutive passes. Thus we reset its reliability to 1.0 after cycle 12 and also set recentPassCount to 1. In most of the cycles the scheduling preference went to VM-1. In the 14th cycle we again reset its reliability to 1, as the VM was successful in the past 10 cycles and the requirement for recentPassRequired was also met.

V. CONCLUSION AND FUTURE WORK

Cloud computing infrastructure needs to be highly reliable. In this paper, we have presented a model to perform reliability assessment of cloud computing infrastructure (compute instances, which include platform and infrastructure). On the basis of these reliability values, we perform scheduling on the cloud and carry out fault tolerance. There are separate reliability assessment algorithms for general applications and real time applications. The algorithm for general applications is adaptive and more convergent towards failures. The algorithms for real time computing assess reliability on the basis of the timeliness of the result and are also more convergent towards failures. The scheduler uses these reliability values to schedule a task on the cloud infrastructure.
