TREX: Workflow TRansactions by Means of EXceptions

Remco van Stiphout (1, 2), Theo Dirk Meijler (1), Ad Aerts (2), Dieter Hammer (2), Riné le Comte (1)
(1) Baan Labs, (2) Eindhoven University of Technology
E-mail: [email protected]

Abstract
Workflow Management Systems are becoming more and more important to organizations for the execution of business processes. However, most current systems lack a consistent way of dealing with failures and exceptions. This paper proposes to use an exception-handling mechanism to provide ‘transactional’ features for workflow management systems. The modeler specifies exception handlers for activities. These exception handlers change the workflow that is being executed to deal with exceptions as they arise. Based on this idea we discuss which exceptions and exception handlers should be supported and how they interact to provide transactional capabilities.

1. Introduction
Workflow is a very powerful concept that provides configuration and control of business processes [1]. However, one of the problems with most current workflow products is that they do not provide transactional capabilities [2]. In contrast, database transactions are guaranteed to have the well-known ACID properties. These specify how concurrency issues and failure issues should be managed. These issues are managed automatically by a database transaction manager, so the database transaction programmer can focus on ‘what the transaction does’ and does not need to describe reactions to failures [3].

For workflows there is no well-defined set of properties that specifies the reactions to failures. The desired reaction to a failure may vary from activity to activity, so the workflow programmer (modeler) should be able to specify the reaction to a failure for each activity. These reactions to failures should not be part of the normal workflow, as this would obscure the normal process. As the desired reaction may depend on the type of failure, the modeler should also be able to distinguish between different types of failures. Finally, the modeler may wish to abort a workflow in certain cases.
Aborting a workflow should not only result in a consistent state, but should also attempt to preserve work that has been done.

This paper proposes a transaction model called TREX (workflow TRansactions by means of EXceptions), which takes the following approach:
1. Treat failures as exceptions. This simplifies the workflow, as all failure handling is modeled explicitly outside the standard workflow.
2. Allow modelers to specify the way these exceptions are handled. For a given activity and exception the modeler chooses from a number of handlers, which change the workflow to deal with the exception.
3. Use compensating activities and exit activities to provide different ways to abort workflows in a way that maintains consistency and, if possible, preserves work that has been done.

Other transaction models ([6], [7], [8], [9], [10], [11], [12]) do not provide the flexibility of specifying different reactions to different failures. They also support only one way to abort workflows.

Note that isolation is not mentioned here. This is because the long-lasting nature of workflows makes isolation inappropriate, as it would compromise performance and local database autonomy too much. Isolation may also be undesirable, as the rest of the world may wish to see intermediate results.

2. Description of environment
This section describes the environment in which TREX will operate. The environment has three important characteristics:
• The structure of the workflows.
• The meta-level the workflow engine supports.
• The distributed execution of workflows.

2.1 Workflow structure
Workflows can be described using high-level Petri nets [4]. Workflows consist of activities, which are linked by places. Places are typed and can contain tokens of that type. These tokens represent information that is passed from one activity to the other. Activities have input places and output places. An activity can execute if all its input places have received a token. When an activity finishes it outputs tokens to its output places.

Activities can be atomic or composite. Composite activities in turn consist of a workflow. A composite activity executes by starting up the workflow it consists of and waiting for that workflow to finish. The composite activity is called the ‘parent’ of the activities in the workflow; the activities in the workflow are called the ‘children’ of the composite activity.
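The firing rule described above (an activity can execute only when every input place holds a token, and it outputs tokens when it finishes) can be sketched in Python. This is a minimal illustration, not the engine used by the authors: the class names `Place` and `Activity`, the `fire` method and the ‘cpu’ input place are assumptions introduced here.

```python
from dataclasses import dataclass, field


@dataclass
class Place:
    """A typed place that holds tokens of one type (hypothetical class)."""
    name: str
    token_type: type
    tokens: list = field(default_factory=list)

    def put(self, token):
        # Places are typed: reject tokens of the wrong type.
        if not isinstance(token, self.token_type):
            raise TypeError(f"place {self.name!r} expects {self.token_type.__name__}")
        self.tokens.append(token)


@dataclass
class Activity:
    """An activity with input and output places (hypothetical class)."""
    name: str
    inputs: list
    outputs: list

    def enabled(self):
        # An activity can execute if all its input places have received a token.
        return all(place.tokens for place in self.inputs)

    def fire(self, produce):
        # Consume one token per input place and output one token per output place.
        if not self.enabled():
            raise RuntimeError(f"{self.name!r} is not enabled")
        consumed = [place.tokens.pop(0) for place in self.inputs]
        for place, token in zip(self.outputs, produce(*consumed)):
            place.put(token)


# Illustrative places; the 'cpu' place is an assumption, not taken from the paper.
order = Place("order", str)
cpu = Place("cpu", str)
pc = Place("PC with CPU", str)
insert_cpu = Activity("Insert CPU", [order, cpu], [pc])

order.put("order-1")
enabled_with_one_input = insert_cpu.enabled()
cpu.put("cpu-A")
enabled_with_both_inputs = insert_cpu.enabled()
insert_cpu.fire(lambda o, c: [f"PC for {o} with {c}"])
```

In this sketch the activity becomes enabled only after the second input token arrives, which mirrors the behavior of an AND-join on its input places.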
Activities can be linked to each other in three ways:
• Through simple links, with one predecessor and one successor.
• Through splits, which have one predecessor and multiple successors. There are XOR-splits, which start only one successor, and AND-splits, which start all successors.
• Through joins, which have multiple predecessors and one successor. There are XOR-joins, which can start when one predecessor finishes, and AND-joins, which cannot start until all predecessors finish.

A link is represented by a connection between an output place of the predecessor and an input place of the successor. An input place that is not linked to any other place gets its input from a so-called resource manager. An output place that is not linked to any other place sends its output to a resource manager.

A workflow may contain corresponding AND-splits and AND-joins. Between these lie so-called parallel branches. A branch is a set of activities that are linked together and lie between an AND-split and the corresponding AND-join (see footnote 1). In Figure 1, for example, A2, A3, A4 and A5 form a branch and so do A6 and A7. A3 forms a branch by itself and so does A4. The branch that contains A2 and the branch that contains A6 are said to be in parallel.

[Figure 1: Example of parallel branches (activities A1 to A8)]

2.2 Meta-level
The workflow engine that is used has a so-called ‘reflective’ capability. It supports meta-activities that may adapt the normal part of the workflow definition itself while it is being executed. The execution of the workflow is temporarily frozen while the meta-activities are executing. Then normal execution resumes, possibly at a different point in the workflow. Exception handlers use meta-activities to change the workflow definition to deal with an exception. When discussing exceptions and meta-activities, the word origin will be used to denote the activity in which the exception occurred.
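The freeze, adapt and resume cycle of a meta-activity can be pictured with a small Python sketch. It assumes a purely sequential workflow stored as a list of activity names. The `Engine` class, its `run_meta` method and the `insert_before` meta-activity are hypothetical names chosen for illustration and do not describe the authors' engine.

```python
class Engine:
    """Toy sequential workflow engine with a reflective hook (hypothetical)."""

    def __init__(self, definition):
        self.definition = list(definition)  # ordered activity names
        self.position = 0                   # index of the next activity to run
        self.frozen = False
        self.log = []                       # activities executed so far

    def step(self):
        # Normal execution is not allowed while a meta-activity is running.
        if self.frozen:
            raise RuntimeError("workflow is frozen")
        self.log.append(self.definition[self.position])
        self.position += 1

    def run_meta(self, meta_activity, origin):
        # Freeze, let the meta-activity change the definition, then resume
        # at the point the meta-activity returns.
        self.frozen = True
        try:
            self.position = meta_activity(self.definition, origin)
        finally:
            self.frozen = False


def insert_before(new_activity):
    """Build a meta-activity that inserts new_activity in front of the origin."""
    def meta(definition, origin):
        index = definition.index(origin)
        definition.insert(index, new_activity)
        return index  # resume at the inserted activity
    return meta


engine = Engine(["Bill customer", "Register payment"])
engine.step()  # 'Bill customer' runs normally
# Assume an exception occurs in 'Register payment' (the origin).
engine.run_meta(insert_before("Contact customer"), origin="Register payment")
engine.step()
engine.step()
```

After the meta-activity has run, execution resumes at the inserted activity and then reaches the origin again, so the origin is effectively retried.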
2.3 Distributed execution
The activities of the workflow are executed at different locations. To support the local autonomy that is needed for mobile computing and virtual enterprises, a mobile agent-based architecture has been chosen for the execution of workflows. A mobile agent is an object that can migrate across a network. All activities execute within mobile agents. No more than one activity is ever executing in the same agent, so agents are cloned and killed when needed. For example, in Figure 1 A2 will execute in one agent, A6 in another and the parent activity of both in yet another. When A6 is finished, A7 will execute in the same agent, but when A5 and A7 finish, the two agents will merge into one agent, in which A8 will execute. In general one agent is responsible for the execution of a branch until that branch splits into subbranches, which causes the agent to be cloned, or joins with its parallel branches, which causes the agent to be merged with the agents in which the parallel branches executed. Every agent carries with it a copy of the workflow definition that it is executing.

Footnote 1 (to Section 2.1): This definition assumes that for each AND-split there will be exactly one corresponding AND-join. This is not true for all possible workflows, but it suffices here.

3. Use case
To illustrate the application of the TREX-model we present a use case called the ‘PC order process’. It fulfills orders from customers for configured PCs. There is a top-level activity, called ‘PC order process’, which consists of the following workflow:
1. An order for a PC is entered into the system. The order specifies what CPU, memory and hard disk the customer wants.
2. After the order has been received the PC is assembled.
3. Parallel to the assembly the customer is sent a bill.
4. When the payment of the bill is received, this is registered.
5. After the PC has been assembled and payment has been received, the PC is shipped.

[Figure 2: The use case. ‘PC order process’ contains ‘Enter order’, ‘Assemble PC’, ‘Bill customer’, ‘Register payment’ and ‘Ship PC’; ‘Assemble PC’ contains ‘Insert CPU’, ‘Insert memory’ and ‘Insert HD’.]

This process can be seen in Figure 2. The activity ‘PC order process’ is the parent activity of the other activities. Note that the input and output places are not shown, to avoid cluttering the picture.
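The five steps of the top-level workflow can be written down as a table of predecessors, which makes the AND-split after ‘Enter order’ and the AND-join before ‘Ship PC’ explicit. The sketch below is only an illustration under the assumption that every join in this workflow is an AND-join; the names `PREDECESSORS` and `ready` are introduced here and do not come from the paper.

```python
# Predecessors of each activity in the top-level 'PC order process' workflow.
PREDECESSORS = {
    "Enter order": [],
    "Assemble PC": ["Enter order"],
    "Bill customer": ["Enter order"],
    "Register payment": ["Bill customer"],
    "Ship PC": ["Assemble PC", "Register payment"],
}


def ready(finished):
    """Return the activities that may start, given the set of finished ones.

    Every join is treated as an AND-join: an activity may start only when
    all of its predecessors have finished.
    """
    return sorted(
        activity
        for activity, predecessors in PREDECESSORS.items()
        if activity not in finished
        and all(p in finished for p in predecessors)
    )


after_order = ready({"Enter order"})
before_payment = ready({"Enter order", "Assemble PC", "Bill customer"})
after_both = ready({"Enter order", "Assemble PC", "Bill customer", "Register payment"})
```

Here `after_order` contains both ‘Assemble PC’ and ‘Bill customer’ (the two parallel branches), while ‘Ship PC’ becomes ready only once both the assembly and the payment registration have finished.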
If we look at the ‘assemble PC’ activity we see that it in turn consists of a workflow. This workflow contains the following activities:
1. Insert CPU.
2. Insert memory.
3. Insert hard disk.

4. Exceptions
While a workflow is executing, a number of exceptions may occur during the execution of an activity. At the moment the TREX-model can handle the following exceptions:
• Timeout. Activities can have a deadline. This deadline is monitored by the deadline service. The timeout exception occurs when the deadline service notifies the activity that its deadline has passed. In the example a timeout exception would occur when the payment is not received before a certain deadline.
• Resource unavailable. This exception occurs when the activity tries to reserve a resource but fails to do so. In the example this exception will occur in the ‘insert CPU’ activity if the desired CPU is not in stock.
• Access exception. An access exception occurs when an activity tries to execute but fails to do so because it cannot access a piece of data. Reasons for this may be the failure of a database transaction, a communication failure, a resource that has been reserved successfully but is no longer available, etc. In the example this exception could occur in the ‘insert CPU’ activity if the reserved CPU turns out to have been taken by a higher-priority activity.
• Fail. This exception occurs when a parent activity receives a ‘fail’ token from one of its children. The child uses this token to indicate that it can no longer continue execution and that its parent must fail. In the example this could happen during the execution of the ‘insert CPU’ activity if the order specifies a CPU that is no longer supported. The PC as specified by the customer could then never be assembled, and therefore the ‘insert CPU’ activity sends a ‘fail’ token to its parent. As a result a ‘fail’ exception occurs in its parent, the ‘assemble PC’ activity.
• Abort.
This exception occurs when an activity receives an ‘abort’ token. A composite activity can send ‘abort’ tokens to all of its children to indicate that they must abort. A user could also send an ‘abort’ token to a top-level activity to cancel a workflow. Suppose that in the example the customer cancels an order. This means the activity ‘PC order process’ receives an ‘abort’ token. The handler for the ‘abort’ exception then sends ‘abort’ tokens to the children. These cause the ‘abort’ exception to occur in all the active children of that activity.

5. Exception handlers
The TREX-model allows the modeler to specify how an exception is handled. The modeler specifies a mapping of type:
Activity x Exception -> Exception Handler
There are two types of exception handlers:
• Continuation exception handlers. These modify the workflow and then continue normal execution. Continuation exception handlers are not allowed as handlers for the ‘abort’ exception, because when an ‘abort’ exception occurs the workflow must be aborted.
• Abortion exception handlers. These modify the workflow in such a manner that it will abort. Abortion exception handlers are only allowed as handlers for the ‘abort’ exception. Because an ‘abort’ exception will always occur in all parallel executing activities of a workflow, this guarantees that either the whole workflow will be aborted or none of it will be aborted.

5.1 Continuation exception handlers
These are the continuation exception handlers that the TREX-model supports:
• Fail. This handler sends a ‘fail’ token to the parent of the origin. It then suspends execution, so the activity waits for a response from the parent. This handler would be used when an exception occurs that cannot be handled by changing the workflow, but only by aborting it. In the example the activity ‘insert CPU’ could use this handler to handle the ‘resource unavailable’ exception, to fail the entire assembly if the CPU type is not in stock.
• Insert activity. This handler changes the workflow so that an activity that is specified by the modeler is executed and then the origin is tried again. Suppose that in the example the deadline on ‘register payment’ passes. An activity ‘contact customer’ would be inserted in front of the activity in which the exception occurred (in this case ‘register payment’). After insertion normal execution would resume at the start of ‘contact customer’. If the origin is a composite activity, the children of the origin will need to be aborted first, by sending them an ‘abort’ token and then waiting for them to abort.
• Retry activity. The origin is retried. This handler is suited to exceptions such as ‘access exception’, which may not occur again if the activity is simply retried. As with ‘insert activity’, any children must first be aborted.
• Contingency activity. A contingency activity that is specified by the
modeler is executed instead of the origin. This handler can be used if there is a clear alternative to the origin. In the example, this handler could be used to handle the ‘resource unavailable’ exception in the ‘insert memory’ activity. If the specified memory is not in stock, the contingency activity could contact the customer about using a different type of memory and then use that type of memory. As with ‘insert activity’, any children must first be aborted.

5.2 Abortion exception handlers
Abortion handlers can only be used to handle the ‘abort’ exception. They must modify the workflow in such a way that the work that has been done so far will be ‘cleaned up’. To the parent it must be as if the workflow has terminated in a consistent state. The input tokens of the first activity must be available after the children have finished aborting.

[Figure 3: The ‘Assemble PC’ workflow with exit activities and compensating activities. Activities: ‘Insert CPU’, ‘Insert memory’, ‘Insert HD’; tokens: ‘order’, ‘PC with CPU’, ‘PC with CPU and memory’, ‘finished PC’; compensating activity: ‘Remove CPU’; exit activity: ‘Store PC’.]

[Figure 4: The ‘Assemble PC’ workflow after a backward exit]

In Figure 3 the structure of the ‘Assemble PC’ activity, including input and output places, can be seen. The tokens ‘PC with CPU’, ‘PC with CPU and memory’ and ‘finished PC’ represent an object to which value is added in each activity. Suppose that the workflow is aborted after the CPU has been inserted. The state that existed before the ‘assemble PC’ activity started must be recreated, i.e. the ‘PC with CPU’ token must be removed and the order token must be preserved.

The TREX-model uses so-called ‘exit activities’ to allow a fine granularity for the abortion of workflows. Exit activities can be defined for an activity anywhere in the workflow. These exit activities recreate the starting tokens and can also be used to specify a way to save the work that has been done so far. For example, it is possible to define an exit activity that stores the ‘PC with CPU and memory’ (the activity ‘store PC’ in Figure 3). Furthermore the TREX-model uses compensating activities, which ‘undo’ an activity. Compensating activities can be defined for any activity in the workflow. For example, in Figure 3 ‘Remove CPU’ is the compensating activity for ‘Insert CPU’.

When an ‘abort’ token is received, the workflow must be changed in such a way that a point at which exit activities are defined will be reached. After that those exit activities will be executed. The TREX-model supports two abortion exception handlers to accomplish this:
• Backward exit. This means executing compensating activities (such as ‘remove CPU’ in Figure 3) until exit activities are encountered. Then these exit activities are executed. Note that simply compensating back to the start of the workflow is a special case of this, in which no exit activities are encountered and none are needed, because compensating the first activity will restore the state that existed before the workflow started. An example of this is shown in Figure 3, where executing ‘Remove CPU’ will restore the original state. Figure 4 shows what the workflow looks like after the backward exit handler is used to handle the ‘abort’ that comes in after ‘insert CPU’.
• Forward exit. This means continuing normal execution until exit activities are encountered. Then these exit activities are executed. In the example this would mean executing ‘insert memory’ and then executing ‘store PC’ and ‘empty activity’ in parallel. Figure 5 shows what the workflow would look like if the forward exit handler were used to handle the ‘abort’ that comes in after ‘insert CPU’ has finished.

[Figure 5: The ‘Assemble PC’ workflow after a forward exit. Activities: ‘Insert CPU’, ‘Insert memory’, ‘Store PC’, ‘Empty activity’; tokens: ‘order’, ‘PC with CPU’, ‘PC with CPU and memory’.]
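The difference between the two abortion handlers can be made concrete with a small Python sketch over the sequential ‘Assemble PC’ workflow. It rests on several assumptions that are not stated in this form in the text: the workflow is a simple list, the exit activities ‘Store PC’ and ‘Empty activity’ are attached to ‘Insert memory’ (our reading of Figures 3 and 5), and parallel exit activities are flattened into a list. The function names and the `completed` counter are hypothetical.

```python
ACTIVITIES = ["Insert CPU", "Insert memory", "Insert HD"]
# Compensating activities 'undo' an activity.
COMPENSATIONS = {"Insert CPU": "Remove CPU"}
# Exit activities, assumed here to be defined at the point after 'Insert memory'.
EXITS = {"Insert memory": ["Store PC", "Empty activity"]}


def backward_exit(completed):
    """Plan an abort by compensating backwards until exit activities are found.

    `completed` is the number of activities that have already finished.
    """
    plan = []
    for activity in reversed(ACTIVITIES[:completed]):
        if activity in EXITS:
            return plan + EXITS[activity]
        if activity not in COMPENSATIONS:
            raise ValueError(f"cannot compensate {activity!r}")
        plan.append(COMPENSATIONS[activity])
    # Start of the workflow reached: the original state is restored.
    return plan


def forward_exit(completed):
    """Plan an abort by continuing normal execution until exit activities are found."""
    if completed > 0 and ACTIVITIES[completed - 1] in EXITS:
        return list(EXITS[ACTIVITIES[completed - 1]])
    plan = []
    for activity in ACTIVITIES[completed:]:
        plan.append(activity)
        if activity in EXITS:
            return plan + EXITS[activity]
    raise ValueError("no exit activities ahead; forward exit is not possible")


# Abort arrives after 'Insert CPU' has finished (one activity completed).
backward_plan = backward_exit(1)
forward_plan = forward_exit(1)
```

With an abort after ‘Insert CPU’, the backward plan consists of ‘Remove CPU’ only, while the forward plan runs ‘Insert memory’ and then the exit activities. In this sketch a missing compensating or exit activity raises an error, which corresponds to the case in which the workflow cannot be aborted with that handler.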
Note that the TREX-model does not require the modeler to define exit activities or compensating activities. If none are defined, that workflow simply cannot be aborted. If for example exit activities are only defined for the last activity, then only ‘forward exit’ is allowed as a handler for the ‘abort’ exception in all activities. Exit activities should be defined at points where the intermediate result can either be easily removed or easily stored for later use. If all customers want the same CPU, but a different memory, it would be appropriate to define an exit activity for ‘insert CPU’, and to use forward exit for ‘insert CPU’ and backward exit for ‘insert memory’ and ‘insert HD’.

5.3 Abortion exception handlers and parallel activities
In Figure 6 the use case is displayed with compensating activities and exit activities. All activities apart from ‘Ship PC’ use the backward exit handler.

[Figure 6: The ‘PC order process’ workflow with compensating activities. Compensating activities shown include ‘Remove order’, ‘Cancel bill’ and ‘Refund’.]

From Figure 6 we see that the abortion of the workflow would mean the abortion of two parallel branches. As mentioned in Section 2.3, these parallel branches are executed in two separate agents. In each of these agents, exception handlers will change the workflow independently. We must therefore ensure that these changes are consistent, for example to prevent the agents from disagreeing on a join activity.

Suppose that an abort comes in while ‘assemble PC’ is executing and before ‘register payment’ has executed. The ‘assemble PC’ activity will first abort its children. This will happen as discussed in Section 5.2. From the perspective of the mobile agent in which ‘assemble PC’ executes, the workflow definition will then look as shown in Figure 7.

[Figure 7: The ‘PC order process’ workflow after a backward exit from one perspective]

The ‘assemble PC’ activity has never been executed and the ‘remove order’ activity is due to be executed next. The mobile agent does not know what the other branch is doing, only that it will result in an order input token for ‘remove order’. From the perspective of the agent in which ‘register payment’ executes, the workflow will look like Figure 8.

[Figure 8: The ‘PC order process’ workflow after a backward exit from the other perspective]

This mobile agent also does not know what the other branch is doing, only that it will result in an order input token for ‘remove order’. Consistency is enforced by:
1) Using the same abortion handler in all the activities in parallel branches, in this case backward exit.
2) Allowing exception handlers to make changes only to their own branch (see footnote 2).
3) Making sure that all exception handlers agree on one join that will join the branches, in this case the ‘remove order’ activity.
4) Making sure that all exception handlers agree on any activities that might succeed that join. In this case there are no activities that succeed ‘remove order’, because there were none that preceded ‘enter order’.

Footnote 2: If exit activities are defined, this can be relaxed to include the possibility that all concurrent branches have their own exit activity and will reach it in case of an abort.

6. Related work
One major difference between the model proposed in this paper and most of the other models in the literature is that TREX has autonomy of components (mobile agents) as a requirement. This has resulted in a bottom-up model, in which the choice of the exception handlers for the activities determines the overall behavior, as opposed to a
global transaction manager that determines behavior.

The Saga model [6] uses compensating activities to abort workflows. This is a special case of backward exit. The DOM model [7] provides the modeler with compensating activities and contingency activities. However, it does not provide support for nesting or for exit activities. The ConTract model [8] also allows the modeler to specify contingency and compensating activities, but likewise lacks support for nesting. It does, however, provide support for the monitoring of Boolean conditions. The model proposed by F. Casati et al. in [9] allows the modeler to choose from a number of handlers to handle an exception in an activity. However, the exception handlers cannot change the workflow, which limits their power. The Flexible Transactions proposed in [10] are functionally equivalent to using only the contingency activity, backward exit and retry handlers in the TREX-model.

In the WAMO model [11], [12] a workflow activity description language (WADL) is used to describe workflow processes. The model is based on the concept of transactional workflows and requires the support of an advanced transaction management facility, which TREX does not need. In TREX the underlying workflow specification paradigm is the Petri net formalism. In WADL certain constructs (contingency activities, for example) can be expressed that cannot be expressed in Petri nets, but these can be expressed by specifying TREX exception handlers.

7. Conclusions and further work
TREX lets the workflow modeler choose from a number of exception handlers to handle the exceptions that may occur during the execution of an activity. This provides the modeler with more flexibility than other workflow transaction models. Furthermore the abortion exception handlers and exit activities provide the modeler with flexible ways to abort nested workflows while maintaining consistency, something for which other approaches provide little or no support. Lastly, the flexibility offered by the decoupling of exception and exception handler makes it easy to introduce new exceptions and exception handlers.

The TREX-model has been validated by testing it on three different business cases. At the moment work is being done on an implementation of TREX. Work also remains to be done on the formalization of consistency requirements that guarantee the proper abortion of workflows, in particular those that contain parallel activities. Furthermore a ‘workflow debugging’ tool will be needed that allows modelers to test workflows to see if they display the desired behavior in the case of exceptions.
References
[1] L. Fischer, Excellence in Practice: Innovation and Excellence in Workflow and Imaging, Future Strategies, March 1997.
[2] S. Jajodia and L. Kerschberg (eds.), Advanced Transaction Models and Architectures, Kluwer Academic, 1997.
[3] J. Gray and A. Reuter, Transaction Processing, Morgan Kaufmann, 1993.
[4] K. M. van Hee, Information Systems Engineering, Cambridge University Press, 1994.
[5] A. Elmagarmid (ed.), Database Transaction Models for Advanced Applications, Morgan Kaufmann, 1992.
[6] P. Chrysanthis and K. Ramamritham, ACTA: The Saga Continues, in [5].
[7] A. Buchmann, A Transaction Model for Active Distributed Object Systems, in [5].
[8] H. Wächter and A. Reuter, The ConTract Model, in [5].
[9] F. Casati, S. Ceri, B. Pernici, G. Pozzi, Workflow Evolution, available at http://xerox.elet.polimi.it/pub/papers/WF/
[10] G. Alonso, D. Agrawal, A. El Abbadi, M. Kamath, R. Günthör, and C. Mohan, Advanced Transaction Models in Workflow Contexts, in 12th International Conference on Data Engineering, New Orleans, Louisiana, Feb. 1996.
[11] J. Eder, W. Liebhart, "The Workflow Activity Model WAMO", Proc. of the 3rd Int. Conf. on Cooperative Information Systems (CoopIS95), Vienna, Austria, 1995.
[12] J. Eder, W. Liebhart, "Workflow Recovery", First IFCIS Int. Conf. on Cooperative Information Systems (CoopIS96), IEEE Comp. Soc. Press, Brussels, Belgium, 1996, pp. 124-134.