Adaptability in Transaction Oriented Grid Service - IEEE Xplore

2014 International Conference on Parallel, Distributed and Grid Computing

Adaptability in Transaction Oriented Grid Service

Dharmendra Prasad Mahato∗, Lokendra Singh Umrao∗ and Ravi Shankar Singh∗
∗Department of Computer Science and Engineering, Indian Institute of Technology (BHU), Varanasi, India 221–005.
Email: (dpmahato.rs.cse13, lokendra.rs.cse12, ravi.cse)@iitbhu.ac.in

Abstract—Adaptability is the ability of a system to adapt itself efficiently to changed circumstances. Adaptability in transaction oriented grid service is a challenge because of the extreme complexity of the grid system and the failures that occur in it. This paper presents the design, implementation, and evaluation of an adaptive model, ATOGS, which uses adaptive fault tolerance (i.e., checkpointing and replication) during the execution of services. We evaluate our adaptive model experimentally against the Dynasa framework. The experimental results demonstrate that ATOGS enables the application itself to handle failures efficiently and that it achieves better performance in terms of execution time, network bandwidth, and load, resulting in lower overhead. The results indicate that transaction oriented grid service performs better than general grid service when the replication technique is used. The model is built with a modeling and simulation tool, Coloured Petri nets (CPNs).

I. INTRODUCTION

Adaptability is to be understood here as the ability of a grid system to adapt itself efficiently and quickly to changed circumstances. An adaptive system is therefore able to fit its behavior to changes in its environment or in parts of the system itself. With increasing size and complexity, adaptability is among the most needed properties in today’s grid systems [1], [2]. By transaction oriented grid service, we mean the execution of grid services with transaction management. In the inter-operation of these services, the system faces different types of faults, such as hardware faults, communication faults, software faults, Byzantine faults, and service expiry faults, which cause corresponding failures that degrade the performance of the system. Hence a selection procedure for transactional grid services becomes necessary for reliable execution of the service [3], [4]. However, coordinating transactions in grid service is not an easy task, because:
• Transaction coordination in grid service is always time consuming owing to interaction amongst users and latency,
• The autonomous character of grid services does not allow locking of the needed resources,
• Transactions always suffer from missing messages due to unreliable communication,
• Services in a grid environment are loosely coupled in nature,
• If transactions are implemented, reliability is ensured but execution still suffers from some faults.

978-1-4799-7683-6/14/$31.00©2014 IEEE

Large-scale grids are complex systems composed of thousands of components from separate domains. Hence it becomes necessary to execute the services with both transaction management and the QoS of the system in mind. The occurrence of failures in such situations is always a problem, leading to inconsistency and time-consuming computation of tasks [22], [23]. It is impossible to assure the successful execution of all the services in the grid system without any failure. During subtask execution, any failure on a node or on a communication link results in the rollback or abort of the subtask. If a subtask needs a long execution time, the probability of a failure is high, and it may often happen that the subtask is terminated by a failure after a long time has already been spent executing it. This wastes a great deal of time and resources. That is why adaptability is important in these cases.

In this paper, our focus is on adaptability when transaction processing is performed in grid computing. We propose the ATOGS (Adaptability in Transaction Oriented Grid Service) model for the analysis of adaptability in transaction oriented grid services. Our work differs from earlier work in that we evaluate adaptive performance in transaction oriented grid service and compare the results with the Dynasa framework described in [1]. We describe and implement two types of recovery of failed processes in our model: local level recovery and replicated level recovery. In local level recovery, the failed processes are recovered on the same local node where they failed. In replicated level recovery, the failed processes are recovered on nodes other than the local node. CPN Tools, a tool for Coloured Petri nets (CPNs), is used for modeling and simulation in this paper.
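The remark above that a long-running subtask is more likely to be interrupted can be made concrete under one extra assumption that is ours, not a result of the paper: failures on a node arrive with a constant rate $\lambda$, so the time to the next failure is exponentially distributed (the model description later states that failures are generated from an exponential distribution). The rate $\lambda = 0.01$ per second is a hypothetical value chosen only for illustration.

```latex
% Probability that a subtask of length t is hit by at least one failure,
% assuming exponentially distributed time to failure with rate \lambda.
P(\text{failure before } t) = 1 - e^{-\lambda t}

% Illustration with the hypothetical rate \lambda = 0.01 \,\text{s}^{-1}:
%   t = 10  s :  1 - e^{-0.1} \approx 0.095
%   t = 100 s :  1 - e^{-1}   \approx 0.632
```

Under this assumption a subtask ten times longer is far more likely to be interrupted, which is why restarting from a checkpoint rather than from the beginning matters most for long-lived subtasks.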
ATOGS enables transaction oriented services to execute in an adaptive way in order to handle failure problems using fault-tolerance methods. ATOGS makes adaptive decisions according to the occurrence of failures at run time, and the adaptive actions are based on fault-tolerance techniques (i.e., checkpointing and replication).

The remainder of this paper is organized as follows. Section II reviews the related research and discusses the existing approaches to modeling and analysis of adaptability in grid service and transaction oriented grid service. Section III presents the proposed ATOGS model in detail. Section IV presents the simulation set up of our work. Section V presents the results in different scenarios. Section VI concludes the paper.

II. RELATED WORK

Adaptability in transaction oriented grid service is very important and has been studied by many researchers. Shi et al. [1] presented the Dynasa framework for adaptability in grid environments. Transaction processing in grid environments has been extensively studied. For users to judge a transaction processing system as functioning correctly, not only must their transactions be executed correctly, but most of them must also be completed within an acceptable time limit. Tang et al. studied transaction management for reliable grid applications [3], and Tang and Guo presented transaction management in highly reliable grid platforms [4], where fault-tolerance techniques were used to prevent the effects of faults. Bernstein and Newcomer studied the principles of transaction processing [10], covering the basic principles required for execution in distributed environments. Helland et al. studied transaction processing of distributed objects with declarative transactional attributes [13]. Raz presented distributed multi-version commitment ordering protocols for guaranteeing serializability during transaction processing [14]. Mohan et al. proposed the ARIES algorithm, a transaction recovery method supporting fine-granularity locking and partial rollbacks using write-ahead logging [15]. Gray et al. studied the transaction concept with its virtues and limitations [16], [17]. Jajodia and Kerschberg presented advanced transaction models and architectures [19].

To tolerate faults in an adaptive way, a robust fault tolerance mechanism with a better recovery technique is required in this scenario. The paper [1] presented the Dynasa framework, which makes adaptive decisions according to changes of security levels and whose adaptive actions are based on fault-tolerance techniques (i.e., checkpointing and replication).

It is impossible to assure the successful execution of all the services in the grid system without any failure. During subtask execution, any failure on a node or on a communication link results in the rollback or abort of the subtask. If a subtask needs a long execution time, the probability of a failure is high, and it may often happen that the subtask is terminated by a failure after a long time has been spent executing it. This wastes a great deal of time and resources. In this case we use one of the fault tolerance techniques, fault recovery, which gives failed nodes an opportunity to continue processing through recovery actions and could be a good solution to the aforementioned problem. Guimaraes et al. presented a framework for adaptive fault-tolerant execution of workflows in the grid with empirical and theoretical analysis [2]. Cao and Singhal presented a coordinated checkpointing mechanism for fault tolerance in distributed systems that eliminates domino effects and missing messages [20]. Gupta et al. discussed how domino effects affect performance in distributed systems and presented domino-effect free crash recovery for concurrent failures in cluster federation [21].

III. PROPOSED MODEL: ATOGS

In this section, a CPN Tools based adaptive model named ATOGS is proposed for modeling transaction oriented grid service execution in an adaptive way. Before we explain the model in detail, CPN Tools is explained. The aim of this paper is not only to model the task scheduling, transaction processing, and workflow of task execution, but also to evaluate adaptability in grid environments. Hence we used an extended version of Petri nets so that the properties required for adaptability evaluation could be modeled.

A. Coloured Petri Nets

Coloured Petri nets (CPNs or CP-nets) are a class of high-level nets that extend ordinary Petri nets. In CPNs, tokens can carry arbitrarily complex data, and arcs can be annotated with input inscriptions, which influence the enabling of a transition, or output inscriptions, which state the production rule of tokens when a transition fires. Input and output inscriptions can be functions or variables. The definition of CPNs is taken from papers [5]–[7].

Definition III-A0.1: A coloured Petri net is a 9-tuple $CPN = (\Sigma, P, T, A, N, C, G, E, I)$, where
• $\Sigma$ is a finite set of non-empty types, also called colour sets,
• $P$ is a finite set of places,
• $T$ is a finite set of transitions,
• $A$ is a finite set of arcs such that $P \cap T = P \cap A = T \cap A = \emptyset$,
• $N$ is a node function, defined from $A$ into $P \times T \cup T \times P$,
• $C$ is a colour function, defined from $P$ into $\Sigma$,
• $G$ is a guard function, defined from $T$ into expressions such that $\forall t \in T: [Type(G(t)) = \mathbb{B} \wedge Type(Var(G(t))) \subseteq \Sigma]$, where $\mathbb{B}$ denotes the Boolean type,
• $E$ is an arc expression function, defined from $A$ into expressions such that $\forall a \in A: [Type(E(a)) = C(p)_{MS} \wedge Type(Var(E(a))) \subseteq \Sigma]$, where $p$ is the place of $N(a)$ and $C(p)_{MS}$ denotes the set of multisets over $C(p)$,
• $I$ is an initialization function, defined from $P$ into closed expressions such that $\forall p \in P: [Type(I(p)) = C(p)_{MS}]$.
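To make the definition above more tangible, the following is a minimal sketch of the firing rule in Python. It is not CPN Tools and not the ATOGS model: the class `Transition`, the functions `enabled_bindings` and `fire`, the place names, and the guard threshold of 30 are all hypothetical names and values chosen for illustration. Markings are multisets of tokens per place, a guard plays the role of $G(t)$, and the output functions play the role of the arc expressions $E(a)$. The sketch assumes each transition takes exactly one token from each of its (distinct) input places.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Dict, Hashable, List

Token = Hashable  # a coloured token is any hashable value


@dataclass
class Transition:
    name: str
    inputs: List[str]                          # distinct input places, one token taken from each
    guard: Callable[..., bool]                 # G(t): Boolean expression over the bound tokens
    outputs: Dict[str, Callable[..., Token]]   # output place -> arc expression E(a)


def enabled_bindings(t: Transition, marking: Dict[str, Counter]):
    """Yield every choice of one token per input place that satisfies the guard."""
    def expand(i, chosen):
        if i == len(t.inputs):
            if t.guard(*chosen):
                yield tuple(chosen)
            return
        for tok in list(marking[t.inputs[i]]):
            yield from expand(i + 1, chosen + [tok])
    yield from expand(0, [])


def fire(t: Transition, marking: Dict[str, Counter], binding) -> None:
    """Remove the bound tokens from the input places and add the produced tokens."""
    for place, tok in zip(t.inputs, binding):
        marking[place][tok] -= 1
        if marking[place][tok] == 0:
            del marking[place][tok]
    for place, expr in t.outputs.items():
        marking[place][expr(*binding)] += 1


# Hypothetical net: subtasks carry (id, expected execution time); resources carry a name.
marking = {
    "subtasks": Counter({("s1", 5): 1, ("s2", 40): 1}),
    "resources": Counter({"r1": 1}),
    "running": Counter(),
}
dispatch = Transition(
    name="dispatch",
    inputs=["subtasks", "resources"],
    guard=lambda sub, res: sub[1] <= 30,            # only short subtasks may be dispatched
    outputs={"running": lambda sub, res: (sub[0], res)},
)
bindings = list(enabled_bindings(dispatch, marking))
fire(dispatch, marking, bindings[0])
```

Only the subtask `("s1", 5)` satisfies the guard, so `dispatch` has a single enabled binding; firing it moves the subtask and the resource into the `running` place as one combined token.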
CPN Tools [5]–[7], a well-known tool for modeling, verifying, and analyzing CPNs, has become an industrial-strength tool for constructing and analyzing CPN models. Using CPN Tools, it is possible to investigate the behavior of the modeled system by simulation, to verify properties by means of state space methods and model checking, and to conduct simulation-based reliability analysis.

B. ATOGS

Before describing the ATOGS model in detail, we discuss some assumptions in this section.


1) Assumptions: In the proposed model, we use the following assumptions (most of them are taken from [24] and [25]):
• Assumption 1: The grid service equipped with transaction management uses the star topology, and the RMS is connected to all of the resources. There is a single communication link between the RMS and each of the resources.
• Assumption 2: When the grid users submit their tasks to the RMS, the RMS divides the submitted tasks into subtasks and sends them to the grid resources.
• Assumption 3: The subtasks are both short-lived and long-lived.
• Assumption 4: The subtasks have no dependency on one another, and each resource starts executing its assigned subtask immediately after it gets the subtask data from the RMS.
• Assumption 5: The transaction oriented grid service faces three basic kinds of failure: hardware, communication link, and program or software failures.
• Assumption 6: The replication technique is used for increasing the performance of task execution, so the number of subtasks must be less than the number of available resources. Therefore, one subtask can be assigned to more than one resource, but a resource must execute only one subtask.
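Assumption 6 can be illustrated with a small sketch of how an RMS might spread resources over subtasks so that every resource runs exactly one subtask and a subtask may run on several resources. The paper does not state the assignment policy; the round-robin rule and the function name `assign_replicas` below are our own assumptions, used only to show the constraint.

```python
from typing import Dict, List


def assign_replicas(subtasks: List[str], resources: List[str]) -> Dict[str, List[str]]:
    """Give every resource exactly one subtask; subtasks may receive several resources."""
    if not subtasks:
        return {}
    if len(subtasks) >= len(resources):
        # Assumption 6: there must be fewer subtasks than available resources.
        raise ValueError("need fewer subtasks than resources for replication")
    plan: Dict[str, List[str]] = {s: [] for s in subtasks}
    for i, res in enumerate(resources):
        # Round-robin is a hypothetical policy; any rule that respects
        # "one subtask per resource" would satisfy the assumption.
        plan[subtasks[i % len(subtasks)]].append(res)
    return plan


plan = assign_replicas(["s1", "s2"], ["r1", "r2", "r3", "r4", "r5"])
# plan == {"s1": ["r1", "r3", "r5"], "s2": ["r2", "r4"]}
```

With two subtasks and five resources, each subtask ends up with at least two replicas, so the failure of a single resource does not force the subtask to restart elsewhere.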

2) Adaptability: To make the transaction oriented grid service adaptive, we use an adaptive controller in our model ATOGS (as in the Dynasa framework of [1]). Our work differs in that we incorporate transactions in grid service execution and use CPN Tools for the simulation. The adaptive controller uses the replication technique and coordinated checkpointing.

3) Model Description: The ATOGS model is a hierarchical CPN model (as shown in Figure 1). The model consists of several hierarchical nets: DATAGEN, CLIENT, SCHEDULER, RMS, RESOURCES, TRANSACTION MANAGER, and MONITOR. First, the users' requests for applications or services are generated in DATAGEN. The requests are submitted to CLIENT. In the CLIENT net, the authentication and payment status of the users are checked before the requests are forwarded to the SCHEDULER net. The SCHEDULER net decides the order of arrival of the requests. The requests are then submitted to RMS, where the tasks are divided into multiple subtasks, and those subtasks are sent to RESOURCES for execution. Here the replication technique is used after failures occur. After that, transaction management is carried out in the TRANSACTION MANAGER net, where coordinated checkpointing is used for fault tolerance. All the subtasks, whether executed successfully or unsuccessfully, are sent to the RMS net. Thereafter, these tasks are sent to the MONITOR net, which decides whether to send the received tasks to the users or back to the SCHEDULER. Subtasks that executed unsuccessfully are rescheduled either on the local node (where the subtask failed) or on replicated nodes (any nodes other than the local one). In the case of replicated nodes, the replication method is used. Failure generation, using an exponential distribution, is done in the Fault Generation net.

Fig. 1. ATOGS (hierarchical CPN model with the nets DATAGEN, CLIENT, SCHEDULER, RMS, RESOURCES, TRANSACTION MANAGER, and MONITOR)

C. Algorithms

The working procedure of transaction management and checkpointing in our model is based on three algorithms. Algorithm 1 is for atomic transaction coordination and Algorithm 2 is for atomic transaction participants. These two algorithms are used for creating transactions in grid environments. Algorithm 3 is used for fault tolerance with the coordinated checkpoint technique so that adaptability can be maintained.

1) Atomic Transaction Coordination: In Algorithm 1 the transaction coordination process is initiated, and the Resource Manager creates the coordinator. When transaction processing is to be started, the coordinator communicates with the participants, which are created by the scheduler agent as described in Algorithm 2. The Resource Manager divides a service into sub services and assigns them to sub processors. Here time refers to the time taken to execute the n sub services, which run in parallel at different nodes, and timeout is the timeout limit for each service. If all sub services have been executed within the timeout limit, the transaction is committed and a message that the transaction has been committed is sent to all participants.

2) Atomic Transaction Participation: In Algorithm 2 the transaction participant processes are created by message passing to acquire the execution cycle for each sub service. When the subtasks are executed successfully, the results are sent to the coordinators. If any failure occurs during execution, messages reporting the failure are sent to the coordinators, and thereafter the checkpointing mechanism is used.

3) Coordinated Checkpointing: Algorithm 3 presents the proposed coordinated checkpointing mechanism.
If a failure occurs at any node executing a sub service, the remaining job is sent to another node on the basis of the checkpointing status, to re-execute the job from the last checkpoint.

Algorithm 1: Atomic Transaction Coordination
Data: sub task status
Result: sub task status
  Atomic Transaction is initiated;
  Resource Manager creates Coordinator;
  Resource Manager sends task-message to schedulers;
  Prepare for transaction commit;
  Coordinator sends prepare-message to all Participants;
  Resource Manager assigns tasks to sub processors dividing them into sub-tasks;
  while (time < timeout & number of subtasks < total subtasks) do
    Wait for incoming messages;
    Record those messages;
  end
  Commit the transaction;
  if number of subtasks = [...]

Algorithm 2: Atomic Transaction Participation
  [...] timeout then
    exit;
    release the resources;
  end
  if success then
    send prepare-message to the coordinator;
  end
  commit sub-transaction;
  while (time < timeout & transaction != commit) do
    Wait for incoming messages;
    if transaction-type = flat & message-type = commit then
      allocate reserved resources;
      record commit in log;
      commit sub-transaction;
    end
    if transaction-type = nested then
      call Atomic Transaction Coordination;
      send commit to coordinator;
      call Coordinated Checkpointing;
    end
  end

IV. SIMULATION SET UP

Using CPN Tools, it is possible to investigate the behaviour of the modeled system by simulation, to verify properties by means of state space methods and model checking, and to conduct simulation-based performance analysis.

A. Simulation Process

Before simulating our model, we assume that the number of users is 100 and that the arrival rate of the users follows an exponential distribution, i.e., any random number of users from 1 to 100 arrive, or send their requests for a grid application or service, in a unit of time. We assume that there are 100 nodes and that each node consists of 10 resources. For failure modeling, we assume that a random number of the 100 subtasks running on the executing nodes or processors will face a failure in a unit of time. The failure types are communication, hardware, and software.

B. Simulation set up

In the simulation set up, we compared four categories in the ATOGS model.
• Category 1 (TM Versus Non TM with Local Recovery): In this category, grid service with transaction management and grid service without transaction management are compared. Local level recovery is used to recover failed processes.
• Category 2 (TM Versus Non TM with Replicated Recovery): Category 2 compares the transactional grid service with grid service without transaction management, with replicated recovery.
• Category 3 (Local Versus Replicated Recovery with TM): Here, transaction oriented grid service with local level recovery and with replicated level recovery are compared.
• Category 4 (Local Versus Replicated Recovery with Non TM): In category 4, the general grid service with local level recovery and with replicated level recovery are compared.

V. RESULTS

In this section we present the performance results of the implementation of our model using CPN Tools.

1) TM Versus Non TM with Local Recovery: Here we compared the performance of transaction oriented grid service with that of general grid service when local recovery was performed. Performance was measured as execution time in seconds, which varied with the number of checkpoint servers. We compared the results for checkpoint intervals of 10, 20, and 30 seconds (see Figure 2). The transaction oriented grid service performs comparatively better than the general grid service when the checkpoint interval is 20 seconds.

Fig. 2. Local Recovery

2) TM Versus Non TM with Replicated Recovery: Here the performance of transaction oriented grid service was measured and compared with that of general grid service when replicated recovery was performed. Figure 3 shows that the general grid service takes a long time to execute when the checkpoint interval is 30 seconds, and that transaction oriented grid service takes the least time when the checkpoint interval is 10 seconds. The performance of transaction oriented grid service is therefore much better when replicated recovery is performed and the checkpoint interval is 10 seconds.

Fig. 3. Replicated Recovery

3) Local Versus Replicated Recovery with TM: When local level recovery and replicated level recovery in transaction oriented grid service were compared, we found that replicated level recovery is much better than local level recovery. We conclude that the replication technique is much needed and effective for transaction management in grid environments (see Figure 4).

Fig. 4. Transaction Management

4) Local Versus Replicated Recovery with Non TM: In the case of general grid service, by contrast, local level recovery is better than replicated level recovery (see Figure 5).

Fig. 5. Non Transaction Management

VI. CONCLUSIONS

To make the execution of transaction oriented grid service adaptive, we proposed the ATOGS model, which is based on the Dynasa framework. Our work differs in that we evaluate adaptive performance when transaction management is used in grid service. The results show that transaction oriented grid service gives better results when replicated recovery is used, while general grid service gives better results when local level recovery is used.


Algorithm 3: Coordinated Checkpointing
Data: sub task status
Result: sub task
  if checkpoint manager finds the complete task result from resource then
    send information of job to the scheduler;
  end
  if checkpoint manager receives the task failure from resource then
    if failure is not recoverable then
      abort the job;
      exit;
      send message to a checkpoint server;
    end
    if failure is recoverable and checkpoint result of the job exists in checkpoint server then
      submit the remaining part of the job after last checkpoint received to the scheduler for rescheduling;
    end
    if failure is recoverable and checkpoint result of the job does not exist in the checkpoint server then
      submit the job from start to the scheduler for rescheduling;
    end
  end
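The decision logic of Algorithm 3 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' CPN implementation: the `TaskReport` record, the `decide` function, the action names, and the dictionary `checkpoints` (standing in for the checkpoint server and mapping a job id to its last checkpointed step) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class TaskReport:
    job_id: str
    completed: bool
    recoverable: bool = True  # only meaningful when completed is False


def decide(report: TaskReport, checkpoints: Dict[str, int]) -> Tuple[str, str, Optional[int]]:
    """Return (action, job_id, restart_step) following the branches of Algorithm 3."""
    if report.completed:
        # Complete result received: pass the job information to the scheduler.
        return ("report_to_scheduler", report.job_id, None)
    if not report.recoverable:
        # Unrecoverable failure: the job is aborted.
        return ("abort", report.job_id, None)
    last = checkpoints.get(report.job_id)
    if last is not None:
        # Recoverable and a checkpoint exists: resume after the last checkpoint.
        return ("reschedule", report.job_id, last)
    # Recoverable but no checkpoint stored: the job restarts from the beginning.
    return ("reschedule", report.job_id, 0)


checkpoints = {"job-7": 42}  # hypothetical checkpoint server contents
done = decide(TaskReport("job-1", completed=True), checkpoints)
fatal = decide(TaskReport("job-2", completed=False, recoverable=False), checkpoints)
resumed = decide(TaskReport("job-7", completed=False), checkpoints)
restarted = decide(TaskReport("job-9", completed=False), checkpoints)
```

Whether the rescheduled job lands on the local node or on a replicated node is a separate choice made by the scheduler; the sketch only covers where execution resumes.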

We also find that when the number of checkpoint servers is increased, the execution time decreases and then the curve becomes comparatively flat. This indicates that the number of checkpoint servers cannot be increased indefinitely, because checkpoint overhead can degrade the performance. Our future work will focus on the reduction of checkpoint overhead in transaction oriented grid service.

REFERENCES

[1] X. Shi, J.-L. Pazat, E. Rodriguez, H. Jin, and H. Jiang, Adapting grid applications to safety using fault-tolerant methods: Design, implementation and evaluations, Future Generation Computer Systems, vol. 26, no. 2, pp. 236–244, 2010.
[2] F. P. Guimaraes, P. Célestin, D. M. Batista, G. N. Rodrigues, and A. C. M. A. de Melo, A framework for adaptive fault-tolerant execution of workflows in the grid: Empirical and theoretical analysis, Journal of Grid Computing, pp. 125, 2013.
[3] F. Tang, M. Guo, M. Li, and L. Li, Transaction management for reliable grid applications, Advanced Information Networking and Applications, International Conference on, pp. 427–434, 2009.
[4] F. Tang and M. Guo, Grid transaction management and highly reliable grid platform, pp. 421–441, 2010.
[5] K. Jensen, Coloured Petri nets: basic concepts, analysis methods and practical use, vol. 1. Springer, 1996.
[6] A. V. Ratzer, L. Wells, H. M. Lassen, M. Laursen, J. F. Qvortrup, M. S. Stissing, M. Westergaard, S. Christensen, and K. Jensen, CPN Tools for editing, simulating, and analysing coloured Petri nets, in Applications and Theory of Petri Nets 2003, pp. 450–462, Springer, 2003.
[7] K. Jensen, L. M. Kristensen, and L. Wells, Coloured Petri Nets and CPN Tools for modelling and validation of concurrent systems, International Journal on Software Tools for Technology Transfer, vol. 9, no. 3–4, pp. 213–254, 2007.
[8] J. Wu, D. Manivannan, and B. Thuraisingham, Transaction-consistent global checkpoints in a distributed database system, in Proceedings of the World Congress on Engineering, vol. 1, 2008.
[9] R. Baldoni, F. Quaglia, and M. Raynal, Consistent checkpointing for transaction systems, The Computer Journal, vol. 44, no. 2, pp. 92–100, 2001.
[10] P. A. Bernstein and E. Newcomer, Principles of transaction processing. Morgan Kaufmann, 2009.
[11] G. Weikum and G. Vossen, Transactional information systems: theory, algorithms, and the practice of concurrency control and recovery. Elsevier, 2001.
[12] S. Tai and I. Rouvellou, Strategies for integrating messaging and distributed object transactions, in IFIP/ACM International Conference on Distributed systems platforms, pp. 308–330, Springer-Verlag New York, Inc., 2000.
[13] P. Helland, R. Limprecht, M. Al-Ghosein, and W. Russell, Transaction processing of distributed objects with declarative transactional attributes, Jan. 13 2004. US Patent 6,678,696.
[14] Y. Raz, Distributed multi-version commitment ordering protocols for guaranteeing serializability during transaction processing, Dec. 23 1997. US Patent 5,701,480.
[15] C. Mohan, D. Haderle, B. Lindsay, H. Pirahesh, and P. Schwarz, ARIES: a transaction recovery method supporting fine-granularity locking and partial rollbacks using write-ahead logging, ACM Transactions on Database Systems (TODS), vol. 17, no. 1, pp. 94–162, 1992.
[16] J. Gray et al., The transaction concept: Virtues and limitations, in VLDB, vol. 81, pp. 144–154, 1981.
[17] J. Gray and A. Reuter, Transaction processing, Kaufmann, 1993.
[18] S. Pilarski and T. Kameda, A novel checkpointing scheme for distributed database systems, in Proceedings of the Ninth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, pp. 368–378, ACM, 1990.
[19] S. Jajodia and L. Kerschberg, Advanced transaction models and architectures, Springer, 1997.
[20] G. Cao and M. Singhal, On coordinated checkpointing in distributed systems, Parallel and Distributed Systems, IEEE Transactions on, vol. 9, no. 12, pp. 1213–1225, 1998.
[21] B. Gupta, S. Rahimi, V. Allam, and V. Jupally, Domino-effect free crash recovery for concurrent failures in cluster federation, pp. 417, 2008.
[22] D. P. Mahato, L. S. Umrao, and R. S. Singh, Recovery of failures in transaction oriented composite grid service, IJCA Proceedings on Computing Communication and Sensor Network 2013, vol. CCSN 2013, pp. 38–42, December 2013. Published by Foundation of Computer Science, New York, USA.
[23] L. S. Umrao, D. P. Mahato, and R. S. Singh, Recent trends in parallel computing, in Encyclopedia of Information Science and Technology, Third Edition, ed. M. Khosrow-Pour, pp. 3580–3589, 2015, accessed August 29, 2014. doi:10.4018/978-1-4666-5888-2.ch350.
[24] S. Guo, H.-Z. Huang, and Y. Liu, Modeling and analysis of grid service reliability considering fault recovery, New Generation Computing, vol. 29, no. 4, pp. 345–364, 2011.
[25] M. A. Azgomi and R. Entezari-Maleki, Task scheduling modelling and reliability evaluation of grid services using coloured Petri nets, Future Generation Computer Systems, vol. 26, no. 8, pp. 1141–1150, 2010.