Automated Support for Recovery

Steven Reiss
Department of Computer Science, Brown University, Providence, RI 02912, USA
spr@cs.brown.edu

Abstract
Transactions have traditionally been applied to database systems to guarantee data consistency in the face of failures. We propose to expand the role of transactions into a general model for application recovery. To achieve this goal, we use dynamic metaprogramming to inject the transactional recovery code at runtime, thus ensuring the system’s portability through the use of a standard execution environment. Because this recovery method transforms the code automatically and requires no programmer intervention, we believe it can simplify the design and implementation of self-healing autonomic systems and reduce the potential for failure in large-scale distributed applications, thus realizing a central tenet of autonomic computing.

1. Introduction
It is a given that modern applications consist of large and complex systems of interacting components. Unfortunately, it is also a given today that such systems are prone to frequent failures. They are fragile because such applications introduce new modes and points of failure. Programmers need to be aware that a call that appears to be a standard (local) procedure call may actually be a remote call, subject to the vagaries of network traffic, other machines’ failing, and other people’s software crashing (to name just a few potential failure points). In this manner, applications that were once straightforward and relatively understandable become distributed. Moreover, these problems only get worse as web services become more common and include calls to other web services, such that the reified application is effectively spread out across many machines in multiple locations; these are machines of which the programmer has no knowledge and over which the programmer has no control.

Guy Eddon
Department of Computer Science, Brown University, Providence, RI 02912, USA

2. Motivation
One way to address the fragility of complex systems of interacting components is to apply a language-based transactional model for failure control. Transactions are widely used in database systems to guarantee the atomicity of a set of actions [3]. When applied to a programming language, transactions make the program act as though either all desired actions were completed, in the case of a successful execution, or none were ever attempted, in the case of failure. We believe that this approach can safeguard programs from unreliable external components and lead to better self-correcting systems, so that even when things do go wrong, problems can be corrected automatically and before users notice anything amiss, which is a central tenet of autonomic computing [1].

Proceedings of the International Conference on Autonomic Computing (ICAC’04) 0-7695-2114-2/04 $20.00 © 2004 IEEE

3. Automatic transactions
We use static analysis to enable the programmer to define transactions at a high level of abstraction that are then implemented by the system automatically. In the model we propose, programmers identify the transactional requirements of a given method. The system then analyzes the code to determine the objects that need to be preserved if the transaction fails, the locks needed to ensure that different transactions don’t interact, and any file operations that must be performed on temporary files.

To construct this automated recovery system, we use metaprogramming as a method for adding transaction rollback capabilities to existing systems in an unobtrusive manner. This approach fundamentally involves two passes over the input code: the first pass statically analyzes the code for read and write operations in order to determine the extent and type of rewriting necessary; the second transforms the code and produces the output, in which calls to acquire and release the necessary locks are inserted before and after each such access. These passes can be performed as a post-compilation step (static metaprogramming) or just prior to execution (dynamic metaprogramming).

Since transparency is a fundamental consideration in automatically applying transactions to improve recoverability, both the beginning of a transaction and the point at which it will end must be determined automatically by the system with little, if any, input from the programmer. A function, as one of the most basic units of computational abstraction, makes an appropriate transactional primitive. Because functions in conventional programming languages, such as Java, have the ability to mutate memory on the heap, it is possible to think of them as self-contained sub-programs that end in success or failure independently of the larger application. A transactional function, then, is characterized by a transaction that begins prior to the execution of any code in the function and ends immediately after the execution of the function is complete. The goal of a transactional function, therefore, is to prevent the code from modifying external data unless and until the transaction is committed. A function that throws an exception has failed; one that does not has succeeded.

We develop four main algorithms to support automated recovery through transactions: get (retrieves a value from a transaction); put (assigns a value within a transaction); commit (commits the changes made during a transaction); and abort (discards the changes made in the current transaction).
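The four operations can be illustrated with a minimal sketch, assuming a deferred-update design in which put records writes in a private log, get reads through that log, commit applies the log to the real fields, and abort drops it. The names below (TransactionSketch, Transaction, Student) are hypothetical stand-ins and not the system's actual classes; reaching a field by name through reflection is only one possible mechanism, and locking is omitted entirely.

```java
import java.lang.reflect.Field;
import java.util.HashMap;
import java.util.IdentityHashMap;
import java.util.Map;

public class TransactionSketch {

    /** Hypothetical transaction object; an illustration, not the paper's implementation. */
    static class Transaction {
        // Pending writes, keyed by object identity, then by field name.
        private final Map<Object, Map<String, Object>> log = new IdentityHashMap<>();

        /** get: return the pending value if this transaction wrote one, else the real field. */
        Object get(Object target, String fieldName) {
            Map<String, Object> pending = log.get(target);
            if (pending != null && pending.containsKey(fieldName)) {
                return pending.get(fieldName);
            }
            try {
                return field(target, fieldName).get(target);
            } catch (ReflectiveOperationException e) {
                throw new IllegalStateException(e);
            }
        }

        /** put: record the write in the log; the real field is left untouched. */
        void put(Object target, String fieldName, Object value) {
            log.computeIfAbsent(target, t -> new HashMap<>()).put(fieldName, value);
        }

        /** commit: apply every logged write to the real fields, then clear the log. */
        void commit() {
            try {
                for (Map.Entry<Object, Map<String, Object>> byObject : log.entrySet()) {
                    for (Map.Entry<String, Object> write : byObject.getValue().entrySet()) {
                        field(byObject.getKey(), write.getKey())
                                .set(byObject.getKey(), write.getValue());
                    }
                }
            } catch (ReflectiveOperationException e) {
                throw new IllegalStateException(e);
            }
            log.clear();
        }

        /** abort: discard the logged writes so the real fields never change. */
        void abort() {
            log.clear();
        }

        // Looks up a field declared directly on the target's class (superclasses are ignored).
        private static Field field(Object target, String name) throws NoSuchFieldException {
            Field f = target.getClass().getDeclaredField(name);
            f.setAccessible(true);
            return f;
        }
    }

    /** Hypothetical data class used only for the demonstration. */
    static class Student {
        String name = "Ada";
    }

    public static void main(String[] args) {
        Student s = new Student();

        Transaction failed = new Transaction();
        failed.put(s, "name", "Ada99");
        System.out.println(failed.get(s, "name")); // value as seen inside the transaction
        failed.abort();
        System.out.println(s.name);                // real field after abort

        Transaction ok = new Transaction();
        ok.put(s, "name", "Grace");
        ok.commit();
        System.out.println(s.name);                // real field after commit
    }
}
```

Buffering writes keeps abort trivial, since nothing has to be undone; an undo-log design that writes in place and restores old values on abort would be an equally plausible alternative.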

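4. Example
To provide a better understanding of the code rewriting done by the system at the metaprogramming stage, Figure 1 shows the automatic changes (highlighted in boldface in the original figure) made to a sample method.

    public class student {
        String m_name;

        public boolean changeName(String newName)
                throws InvalidDataException, LockFailException {
            // A new transaction starts before any other code in the method runs.
            recovery.transaction tx = new recovery.transaction();
            // Field write replaced by a call to put.
            tx.put(this, "m_name", newName);
            // Field read replaced by a call to get.
            if (helper.hasDigits((String) tx.get(this, "m_name"))) {
                // Discard the recorded changes before the exception propagates.
                tx.abort();
                throw new InvalidDataException();
            }
            // Apply the recorded changes to the fields.
            tx.commit();
            return true;
        }
    }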

The rewritten version of this method starts a new transaction before any other code in the function is executed. Any attempts to retrieve or change the value of fields are replaced by calls to the transaction object’s get and put methods, respectively. The helper object’s hasDigits method joins the transaction begun by the changeName method (the current transaction is retrieved by a static method from a thread-local variable). The abort method, invoked automatically in response to a thrown exception, simply discards the changes made in the transaction, while the commit method applies any changes to the fields that were recorded during the transaction. When the instrumented changeName method is called, the function and its descendants are guaranteed not to cause side effects unless and until the method completes successfully.
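How a callee might join its caller's transaction through a thread-local variable can be sketched as follows. This is an illustration under stated assumptions rather than the system's code: the names CurrentTransactionSketch, Tx, run, current, setName, and STORE are hypothetical, a shared map stands in for object fields, and locking between concurrent transactions is left out.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

public class CurrentTransactionSketch {

    /** Hypothetical transaction: just a private buffer of pending writes. */
    static final class Tx {
        final Map<String, String> pending = new HashMap<>();
    }

    // Shared state standing in for object fields (an assumption of this sketch).
    static final Map<String, String> STORE = new ConcurrentHashMap<>();

    // One current transaction per thread.
    private static final ThreadLocal<Tx> CURRENT = new ThreadLocal<>();

    /** Static accessor: the calling thread's transaction, or null if none is active. */
    static Tx current() {
        return CURRENT.get();
    }

    /** Runs body as a transactional function: commit on return, abort on exception. */
    static <T> T run(Supplier<T> body) {
        if (CURRENT.get() != null) {
            return body.get();            // callee joins the caller's transaction
        }
        Tx tx = new Tx();
        CURRENT.set(tx);
        try {
            T result = body.get();
            STORE.putAll(tx.pending);     // commit: publish the buffered writes
            return result;
        } catch (RuntimeException e) {
            tx.pending.clear();           // abort: drop the buffered writes
            throw e;
        } finally {
            CURRENT.remove();             // the outermost function ends the transaction
        }
    }

    /** Callee: buffers its write in whichever transaction is current. */
    static boolean setName(String newName) {
        return run(() -> {
            current().pending.put("name", newName);
            return true;
        });
    }

    /** Caller: owns the transaction; a thrown exception marks the function as failed. */
    static boolean changeName(String newName) {
        return run(() -> {
            setName(newName);
            if (newName.chars().anyMatch(Character::isDigit)) {
                throw new IllegalArgumentException("name contains digits");
            }
            return true;
        });
    }

    public static void main(String[] args) {
        changeName("Grace");
        System.out.println(STORE.get("name"));     // name after a successful call
        try {
            changeName("Ada99");
        } catch (IllegalArgumentException e) {
            System.out.println(STORE.get("name")); // name after a failed call
        }
    }
}
```

Because setName finds an active transaction, its write stays in the caller's buffer and is published or discarded only when changeName itself succeeds or fails.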

5. Discussion and future work
A number of areas remain to be explored in the development of a robust model for automated recovery. First, it is not yet clear whether functions are the ideal unit of transactability. We intend to explore the use of exception handling blocks and threads as potentially larger and perhaps more flexible constructs.

6. Related work
The work in this paper builds on research in three areas: checkpointing and recovery; metaprogramming as a strategy for automatic code rewriting; and transactional shared memory, a programmer-controlled system for transactional memory [2].

7. References
[1] Ganek, A., and Corbi, T. The dawning of the autonomic computing era. In IBM Systems Journal (2003), vol. 42 no. 1, pp. 5-18.
[2] Herlihy, M., Luchangco, V., Moir, M., and Scherer, W. Software transactional memory for dynamic-sized data structures. In Proceedings of the 22nd Annual ACM Symposium on Principles of Distributed Computing (July 2003), pp. 92-101.
[3] Traiger, I. L., Gray, J., Galtieri, C. A., and Lindsay, B. G. Transactions and consistency in distributed database systems. In ACM Transactions on Database Systems (Sept. 1982), vol. 7 no. 3.

Figure 1. Code rewritten for automated recovery
