
2010 Seventh International Conference on Information Technology

Issues and Challenges of an Inductive Learning Algorithm for Self-healing Applications

Mohammad Muztaba Fuad*
Department of Computer Science
Winston-Salem State University
Winston-Salem, NC 27110, USA
Phone: (336) 750-3325, Fax: (336) 750-2499
E-mail: [email protected]

Abstract

Day-to-day maintenance of software systems is a grand challenge because the runtime environment changes continuously and the application can behave completely differently as a result. Users of such systems want to run their applications and do not want to worry about the mundane task of system management in the face of a failure. If such management scenarios arise, the user wants the runtime environment to handle them autonomically. The user is more concerned with timely execution of the computation and production of the intended results. Without expert technical help, average users have extreme difficulty managing such failure scenarios. This paper investigates the usability of an explanation-based learning algorithm with inductive rules to provide adaptive management of user applications in the face of faults. A distributed algorithm is proposed that collects runtime program traces and signatures and combines all distributed copies to derive the domain knowledge for the learning algorithm.

Key Words: Autonomic computing, Explanation-based learning, Policy management, Self-adaptive application.

1. Introduction

Today’s information technology landscape is bristling with innovations and changes. New technologies are rapidly emerging and new versions of existing technologies continue to be released. The race to be at the cutting edge of technology makes complexity a major issue in all aspects of information technology, as new technologies are incorporated into existing systems and the overall behavior of those systems becomes unpredictable. Software developers have exploited the rapid upsurge in computational power in every possible way, producing ever more sophisticated software applications and environments, which results in enormous growth in the number and variety of systems and components. As systems become a complex mesh of technologies, software architects are less able to anticipate and design interactions among components, which adds complexity not only to those systems but also to the environments they operate within. Although there have been attempts to reduce such complexity by introducing better software engineering practices, the complexity remains as more and more new technologies and systems are incorporated together. Such an environment is a complex, heterogeneous tangle of hardware, middleware and software from multiple vendors that is becoming increasingly difficult to program, integrate, install, configure, tune, and maintain. This leads to the idea of autonomic computing [1, 2], where the complexity and the management of such systems are handled by the system itself.

Most of the time, the end users of today’s large and complex software are left with the task of managing the system when the computational task falters because of failures that they cannot fix for lack of computing knowledge. This paper addresses the shortfalls of manual management of large systems by users who are not technologically savvy, and examines how incorporating autonomic features into the system might lessen the impact of software failures by making the software self-adaptive to failures. Although it may be desirable to build such self-manageable systems from scratch, it is not always a feasible option, mostly because of the cost and time associated with such a major development, but also because it is not practical to abandon an existing application and re-program it from scratch to be self-manageable. For average programmers, this becomes a daunting task when they also have to incorporate autonomic primitives into the system. In real life, programmers want to concentrate on the problem at hand rather than spend time on incorporating autonomic behaviors into their system. It is helpful to programmers if such autonomic behaviors can be added automatically and transparently to existing systems. This paper looks into the self-healing property of self-management to provide users with uninterrupted execution of their programs in the face of system failures. We look into the applicability of an inductive learning technique for self-healing and the challenges associated with such a deployment.

* Supported by WSSU RIP grant no. 121225/2009.

978-0-7695-3984-3/10 $26.00 © 2010 IEEE DOI 10.1109/ITNG.2010.64

2. Overall Purpose

When a running software system requires management because of a failure or a user action, an effective fix (management action) needs to be identified and applied quickly. Failure can occur due to system faults or due to performance bottlenecks. In most cases, the root cause of such failure is human ignorance or mismanagement. It is therefore compelling to have systems that manage themselves and reduce the day-to-day involvement of humans in their operation. This research examines explanation-based machine learning as an alternative to static, reinforcement-based, or pattern-based approaches [3] to self-management. Explanation-based learning has large potential in self-management, mostly because it is easy to use and comprehend for people outside of computing or IT. Since our previous research on code transformation techniques [4, 5] has provided valuable insight into the process of incorporating autonomic primitives into existing user code, the development of a learning algorithm will open doors to supporting full-fledged self-management in the future.

The goal of explanation-based learning is to acquire an efficient concept definition from a single example by using existing background knowledge to explain the example and thereby focus on its important features [10]. Explanation-based learning rests on the hypothesis that an intelligent system can learn a general concept after observing only a small number of examples. By understanding why an example is a member of a concept, the algorithm can learn the essential properties of the concept. It uses prior knowledge to analyze and explain each training example in order to deduce which properties are relevant to the target function and which are irrelevant.

Although explanation-based learning has been used extensively in other problem domains, there is currently no research work (that we know of) investigating its application in providing self-management services. We believe that explanation-based learning is a good fit for the proposed problem domain (a user application with running instances on distributed machines) and will provide robust support for the self-management properties. We assume the users of the target domain to be people from a non-computing background. They are well suited to expressing domain knowledge and the goal for the system without much knowledge of logical predicates or calculus. The learning process is also easy to understand and gives users with more advanced computing knowledge greater control over the algorithm, while keeping it simple for naïve users. A reinforcement-based learning algorithm might provide higher accuracy in some domains; however, programming and expressing such a learning algorithm becomes daunting for average computer users. An explanation-based learning algorithm should perform better than static or pattern-based algorithms, mainly because it will be able to address unseen faults much more accurately than the other two techniques.

After a few interactions with a particular software application, users naturally express a great deal of their goals, preferences, and personality. Patterns of usage and failure scenarios can also be learned over time by monitoring such an application. Explanation-based learning can be used as a mechanism to recover this otherwise ignored information, and the algorithm can be incorporated into an existing system without end users doing any hard-core programming (by means of automated code transformation and injection). In adapting to user needs (easy to use and maintain), the transformed application should increase the effectiveness of the software application and improve the user’s attitude toward computing as a whole.
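To make the idea of learning from a single explained example concrete, the following is a minimal sketch and not the algorithm proposed in this paper. It assumes a propositional, acyclic domain theory; full explanation-based learning would also generalize variables. All rule and feature names (`needs_restart`, `heap_usage_high`, and so on) are hypothetical. The sketch proves the goal for one example using the domain theory and keeps only the observed features that the proof actually used, discarding the irrelevant ones.

```python
# Hypothetical propositional domain theory: each head holds if ALL literals
# of at least one of its bodies hold. Names are illustrative assumptions.
DOMAIN_THEORY = {
    "needs_restart": [["resource_exhausted", "service_unresponsive"]],
    "resource_exhausted": [["heap_usage_high"], ["disk_full"]],
    "service_unresponsive": [["heartbeat_missed"]],
}


def explain(goal, example, theory):
    """Return the observed features that prove `goal`, or None if unprovable.

    A literal holds if it is an observed feature of the example, or if some
    rule for it has a body whose literals can all be proved in turn.
    Assumes the theory has no cyclic rules.
    """
    if goal in example:
        return {goal}
    for body in theory.get(goal, []):
        used = set()
        for literal in body:
            support = explain(literal, example, theory)
            if support is None:
                break  # this body fails, try the next one
            used |= support
        else:
            return used  # every literal of this body was proved
    return None


def learn_rule(goal, example, theory):
    """Build one operational rule (relevant features -> goal) from one example."""
    support = explain(goal, example, theory)
    if support is None:
        return None
    return (frozenset(support), goal)


# One training example with two relevant and two irrelevant features.
example = {"heap_usage_high", "heartbeat_missed", "user_count_low", "tuesday"}
rule = learn_rule("needs_restart", example, DOMAIN_THEORY)
print(rule)
```

The learned rule mentions only `heap_usage_high` and `heartbeat_missed`; the features that played no part in the explanation are dropped, which is the sense in which prior knowledge separates relevant from irrelevant properties.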

3. Significance of this Study

The motivation for this research comes from the fact that it is a non-trivial task to automate the process of turning a regular user application into a self-managing application. To achieve this goal, proper code transformation techniques (to add the required functionality) and an appropriate machine learning algorithm (to provide the autonomicity) are needed. Our previous investigation [5] in this research area showed that it is viable to use code transformations to inject autonomic properties into existing applications. It also showed that, by relieving the application programmer of complex programming interfaces, the application becomes more accessible to users who do not have advanced knowledge of programming paradigms.

In the last couple of years, software fault healing (especially adaptive or self-*) was forced back into the spotlight because of the inherent complexity of current software, which requires more expert human intervention and man-hours to manage and maintain. There is a plethora of research work addressing this issue, and the work most closely related to this paper is presented here. The technique proposed in this paper differs from these related works and from traditional fault tolerance techniques by relieving the application programmer of the burden of complex programming interfaces and metaphors. Another major difference is that we try to establish inherent relationships among faults in a system to figure out the root cause of a fault, in addition to looking at the similarity or dissimilarity between faults as others do. We believe that building up these relationships between faults and root causes will allow us to deduce fault scenarios logically and faster, and will allow system administrators to provide input as inductive statements (which are closer to human language), making their job a bit easier.

Ding, et al. [6] propose a black-box approach that automatically diagnoses several classes of application faults using the application’s runtime behaviors. Their approach collects the application’s runtime signature as we do, but instead of concatenating previous execution traces to form signatures, we try to form relationships between traces and formulate signatures from them. Also, instead of invoking the diagnosis process manually, we incorporate it as part of the system, so that in an unstable state the added code tries to diagnose and solve the problem itself without any human intervention.

Yuan, et al. [7] try to correlate an unstable application state with a list of solved cases. Instead of using vague text descriptions to identify problem situations, the authors employ statistical techniques to match an unknown fault to a set of known fault situations. In their work, an already established list of the top 100 faults and their root causes (specific to a certain problem domain) is given; an unknown case is matched against this list, and the closest known case is presented to the user so that the user can take control and resolve the situation. An obvious difference from our approach is that we trace the target application during regular execution to establish known execution paths and signatures, and once an unknown case is identified the injected code handles all related healing procedures. Cook, et al. [3] present a similar approach that directly matches an unknown fault to a list of known faults. We think both of these works are a good foundation for further extension. We utilize both ideas in our approach and extend them by employing logical deduction of new fault situations, by creating relationships between faults and signatures, and by automating the healing of the software once the fault is diagnosed.

Binkley, et al. [8] use text similarity measures to predict faults from existing application log data. Although their approach is innovative, we cannot employ it in our problem domain because of the lack of fault data and logs. The Unity system [9] provides a platform designed to help autonomic elements interact with each other and their environment. It uses goal-driven self-assembly to configure itself. However, the utility function it uses for self-assembly assumes that one can quantify the utility of different choices. The Unity work does not address how complex it is for application programmers to use this prototype, and there is no discussion of programming in such an autonomic system. There is a plethora of machine learning algorithms implemented for numerous application areas. However, only a handful of such machine learning techniques [10, 11, 12] have been implemented for self-management.

4. Issues and Challenges

A self-healing application should be able to recover from potential faults and should continue to work smoothly in the presence of faults. In the past, self-healing applications were rare and mostly confined to domains such as spacecraft control software, where taking a system down to correct faults was not an option. However, more and more of today’s ever more complex and distributed software systems have the same requirements. A system administrator could certainly fix most faults manually by analyzing logs and error reports, but that requires the administrator to spend a large amount of time solving each individual fault. Furthermore, once the fault is identified and manually solved by the administrator, the task has to be restarted from the beginning (or from manually saved checkpoints), resulting in loss of useful computation and valuable time. With well-defined policies and pre-defined goals, however, the software should heal itself in such situations and continue running transparently without any loss of valuable computation. This will not only save time and money in the long term, but will also make the whole system more productive and responsive to environmental changes. Issues that need to be addressed are:

• Learning algorithm: Although explanation-based learning is well defined and a couple of algorithms have been developed for other domains, a fault-to-inductive-rule framework and a corresponding deduction mechanism have to be developed for self-healing. Another challenge is the representation of domain knowledge and ways to extend the domain knowledge itself over time. How do we determine the optimal size of the domain knowledge for a certain application area? Shortcomings of explanation-based learning also have to be addressed. One possible way is to weight rules by the success rate of applying a rule to fix the system. This will minimize the effect of the incomplete-theory problem of explanation-based learning. As more usage data are collected from the distributed machines, correlation can be used to build complete rules from several incomplete rules. We have to be careful about redundant rules, as they will slow down the algorithm, and therefore care has to be taken when a new rule is added to the domain knowledge.

• Policy management: Although there are many techniques available for fast recovery, there is a lack of proper policies and policy management mechanisms to invoke these techniques automatically and correctly after a failure. Without such policy management mechanisms and automated ways to learn and derive policies, humans always remain in the failure-recovery loop. This limits recovery to the slower human timescale rather than the machine timescale. To realize such a self-adaptive system, we need to develop a robust learning algorithm (to learn actions for previously unseen faults) along with code transformation techniques (to work with existing code) that incorporate such autonomic properties transparently into existing user applications. An important aspect of any self-managing system is that the user specifies the behavior of the system at a high level with broadly scoped directives. The benefit of policy-based (goal-based) management is that the behavior of the computing resources can be guided to follow certain rules and configured dynamically, so that the system can achieve specific goals and react more promptly to environment changes. Identifying robust policies that work in practice, and developing an explanation-based learning algorithm that helps learn those policies automatically, will be a great challenge.

• Failure identification: Not all faults can be healed automatically or even recovered from. This research is concerned with transient faults (network outage, memory overload, disk space outage, etc.) that occur after the program is deployed. Such faults could result from problems in the user code, in the underlying physical system or network connection, or in the run-time environment. Non-transient faults, caused by bugs in the user code (logical errors), user-generated custom exceptions, or faults arising from the functional aspect of the program, are outside the control of this approach and should be addressed by the system administrator or the developer of the user program. Once the transformed code self-heals a transient fault and resumes execution, the condition that caused the fault will have been healed and the fault will not reappear. Non-transient faults, however, are generally caused by some bug in the application code or by unhandled exceptions. Although some types of non-transient faults can be self-healed, doing so may change the original semantics of the program, or the fault will occur again because the condition that creates it is actually in the user code.

• Intractability: Even if the solution can be deduced from the rules available in the domain knowledge, how much time should the algorithm be allowed to spend on that deduction? There are different tradeoffs related to the answer to this question, and an adaptive policy to address such situations has to be devised.
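As an illustration of the rule-weighting and redundancy concerns raised under the learning algorithm item above, here is a minimal sketch under stated assumptions; it is not the implementation from this research. The `Rule` and `RuleBase` names, the fix names, and the use of Laplace smoothing for untried rules are all hypothetical choices made for the example. A rule is rejected on insertion if an identical one already exists, and matching rules are ranked by their observed success rate.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    signature: frozenset  # fault symptoms the rule matches (hypothetical)
    fix: str              # name of the management action (hypothetical)
    successes: int = 0
    attempts: int = 0

    @property
    def success_rate(self):
        # Laplace smoothing (an assumption): untried rules start at 0.5
        # instead of dividing by zero.
        return (self.successes + 1) / (self.attempts + 2)


class RuleBase:
    def __init__(self):
        self._rules = {}

    def add(self, signature, fix):
        """Add a rule unless an identical one exists; return True if added."""
        key = (frozenset(signature), fix)
        if key in self._rules:
            return False  # redundant rule, keep the domain knowledge small
        self._rules[key] = Rule(frozenset(signature), fix)
        return True

    def record(self, signature, fix, healed):
        """Update a rule's statistics after its fix has been applied."""
        rule = self._rules[(frozenset(signature), fix)]
        rule.attempts += 1
        if healed:
            rule.successes += 1

    def ranked(self, symptoms):
        """Rules whose signature is contained in the symptoms, best first."""
        symptoms = frozenset(symptoms)
        matching = [r for r in self._rules.values() if r.signature <= symptoms]
        return sorted(matching, key=lambda r: r.success_rate, reverse=True)
```

A caller would invoke `record` after each attempted fix, so that `ranked` gradually prefers the fixes that have actually stabilized the system. Detecting rules that are redundant without being identical (for example, one rule subsuming another) would need a more careful check than the exact-match test used here.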

5. Learning Algorithm

Figure 1. System View and Knowledge Transformation. (a) System view; (b) knowledge transformation.

As shown in Figure 1 (a), the target application runs on multiple machines, collects runtime parameters (the signature) and execution pathways (the traces), and stores them in an object called ST. An ST is created for every single run of the application, and at regular intervals STs are shared among the machines in the system. Each machine merges all gathered STs and generates a Distributed ST, or DST. Initially, the system treats DSTs as the sole source of domain knowledge. However, with enough runs of the application and enough DSTs generated, generalizations are made and a global domain knowledge (DK) database is generated, from which inductive rules (IR) can be deduced for different scenarios. The overall knowledge transformation is shown in Figure 1 (b). Note that once the DK is generated, each machine needs to update its DST with the more generalized version of any ST combinations from the DK.

In a traditional self-management architecture, domain experts specify rules that map symptoms to fixes in an if-then-else format. Previously defined static rules might work well for simple systems where all possible failures are known in advance and a universal fix can quickly solve most failures. To overcome the problems with such a static approach, the proposed explanation-based learning algorithm employs a feedback-driven loop to find the best solution scenario(s) for a previously unseen failure. The overall algorithm to find fixes is shown in Figure 2. Steps 1 and 2 run as concurrent processes for better performance of the algorithm. As noted earlier, after a couple of runs the DST is replaced by the DK, and therefore the algorithm mentions only the DST.

1. Collect the local application’s signature trace (ST).
   A. At regular intervals, share STi with the other machines in the domain running the same software.
   B. Merge STi..n to form the Distributed Signature Trace (DST).
   C. If a new DK is received, update the local DST with the more generalized scenarios from the DK.
2. If a failure, f, is detected:
   A. Save the current system state and treat the system as unstable.
   B. Run the inductive engine to analyze the detected fault and transform it so that it satisfies the operational criterion of the algorithm.
   C. The inductive engine deduces the failure signature trace (STc), matches it with all possible DSTs, and generates a list of fixes f1….fn with the existing success rates of those fixes.
   D. The possible fixes are applied to the application one (fx) at a time.
      i. If fx results in a stable state, increase the success rate of that ST and continue to step H.
      ii. If fx results in an unstable state and part of the DST exactly matches the STc, apply the corresponding fix for that part of the DST and continue as in step D.i. The learning algorithm has to be aware of the depth of this kind of recursive attempt and should have a threshold value.
      iii. If fx results in an unstable state and there is a DST that partially matches the STc, mark it as a candidate ST for future processing.
   E. Among all candidate STs, find the subset (of size n) with the highest success rate.
   F. Calculate the distance of STc from the members of the subset and find the ST with the lowest distance. Different distance formulas can be used to calculate this distance.
   G. Apply the newly found fix (from the closest ST) and
      i. if it results in a stable state, continue to step H; otherwise,
      ii. if the time limit (threshold) has not been exceeded, refresh the DST with the others in the system and continue to step B;
      iii. otherwise, continue to step I.
   H. Resume the application.
   I. If all of the above fails, save the current status of the application, let the administrator know about the fault and what the learning algorithm tried in order to fix it, and then restart the application.

Figure 2. Overall Algorithm to Find Fixes to a Fault.

The semantics of the predicate calculus (proposed to be used in this research) provides a basis for a formal theory of logical inference. However, deducing logical consequences by interpretation is very difficult. Therefore, we can use a combination of standard inference rules and an algorithm that implements those rules to generate logical consequences. The set of inference rules is well defined, and we plan to use available inference algorithms to deduce rules automatically. However, generalizing rules and forming the DK from DSTs will be a challenge, and a pattern matching algorithm (like [13]) can be used to generalize rules.
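The failure-handling branch (step 2) of the algorithm in Figure 2 could be sketched roughly as follows. This is a simplified illustration under explicit assumptions, not the authors' implementation: signature traces are modeled as plain sets of feature names, Jaccard distance stands in for the unspecified distance formula, the recursive retry of step D.ii and the DST refresh loop of step G.ii are omitted, and the function name `find_fix`, the dictionary layout, and all fix names are hypothetical.

```python
import time


def jaccard_distance(a, b):
    """One possible distance between two signature traces modeled as sets."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0


def find_fix(st_c, dst, apply_fix, top_n=3, budget_s=5.0, clock=time.monotonic):
    """Return the fix that stabilized the system, or None (escalate, step I).

    st_c      -- failure signature trace as a frozenset of features
    dst       -- list of dicts {"st": frozenset, "fix": str, "ok": int, "tried": int}
    apply_fix -- callback(fix) -> True if the system became stable
    """
    deadline = clock() + budget_s  # time threshold on the whole search

    def rate(entry):
        return entry["ok"] / entry["tried"] if entry["tried"] else 0.0

    def attempt(entry):
        entry["tried"] += 1
        if apply_fix(entry["fix"]):
            entry["ok"] += 1  # step D.i: raise the success rate
            return True
        return False

    # Steps C and D: exact matches first, ordered by existing success rate.
    exact = sorted((e for e in dst if e["st"] == st_c), key=rate, reverse=True)
    for entry in exact:
        if clock() > deadline:
            return None
        if attempt(entry):
            return entry["fix"]

    # Steps D.iii and E: partial matches are candidates; keep the top_n by rate.
    candidates = [e for e in dst if e["st"] != st_c and e["st"] & st_c]
    candidates = sorted(candidates, key=rate, reverse=True)[:top_n]

    # Steps F and G: try the candidate closest to the failure signature.
    if candidates and clock() <= deadline:
        closest = min(candidates, key=lambda e: jaccard_distance(e["st"], st_c))
        if attempt(closest):
            return closest["fix"]
    return None


# Hypothetical usage: no exact match exists, so the closest candidate is tried.
dst = [
    {"st": frozenset({"disk_full"}), "fix": "purge", "ok": 0, "tried": 0},
    {"st": frozenset({"heap_high", "gc_thrash"}), "fix": "restart_jvm", "ok": 3, "tried": 4},
    {"st": frozenset({"heap_high", "gc_thrash", "slow"}), "fix": "raise_heap", "ok": 1, "tried": 4},
]
chosen = find_fix(frozenset({"heap_high", "slow"}), dst, lambda fix: fix == "raise_heap")
print(chosen)
```

Passing `clock` as a parameter is a small design choice that lets the time threshold be tested with a fake clock; a fuller version would loop back to the inductive engine after refreshing the DST, as step G.ii describes.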

6. Current Status and Conclusions

This is ongoing research, and a pilot environment has been set up to collect execution traces and signatures. Java enterprise applications were selected as the problem domain, and the Java Platform Debugger Architecture [14] is being used to gather runtime application status to generate the program signatures and traces. Since Java allows only a limited number of environment and runtime parameters to be collected (which we need to form the ST), we are investigating ways to gather every possible runtime state and to work with a limited feature set for the rule deduction. We are investigating ways to inject different classes of faults into the runtime environment to observe the signature trace formation and the merging process across the network. Separately, a deduction and generalization algorithm for rules is under implementation. We are currently developing a directed acyclic graph based merging algorithm to form DSTs from STs.

To evaluate the effectiveness of the transformed self-managed system, a robust and quantitative benchmarking methodology is needed. However, developing such a methodology is a non-trivial task [15], given the many evaluation issues and environment criteria to be resolved. If such a benchmarking methodology existed, the effectiveness could be measured quantitatively by comparing the performance of the proposed technique against benchmarks associated with the current modes of operation (without the learning algorithm). Since that is not feasible with the current state of benchmarking technologies for self-managed systems, we can deploy techniques such as [16] to measure the effectiveness of the proposed algorithm. Eventually, the implementation of the proposed algorithm will shed light on the applicability of inductive learning techniques in the field of self-adaptive systems.

7. References

1. Horn P., “Autonomic Computing: IBM's Perspective on the State of Information Technology”, IBM Corporation, October 15, 2001.
2. Kephart J. O. and Chess D. M., “The Vision of Autonomic Computing”, IEEE Computer, Vol. 36, No. 1, pp. 41–52, 2003.
3. Cook, B., Babu, S., Candea, G. and Duan, S., “Toward Self-Healing Multitier Services”, Technical Report of the Duke University, 2005.
4. Fuad, M. M., “Code Transformation Techniques and Management Architecture for Self-manageable Distributed Applications”, The Twentieth International Conference on Software Engineering and Knowledge Engineering (SEKE '08), San Francisco, USA, July 2008, Pages 315-321.
5. Fuad, M. M. and Oudshoorn, M. J., “Transformation of Existing Programs into Autonomic and Self-healing Entities”, 14th IEEE International Conference on the Engineering of Computer Based Systems (IEEE/ECBS), Arizona, USA, March 26-29, 2007, Pages 133-144.
6. Ding, X., Huang, H., Ruan, Y., Shaikh, A., and Zhang, X. 2008. Automatic software fault diagnosis by exploiting application signatures. In Proceedings of the 22nd Conference on Large Installation System Administration Conference (San Diego, California, November 09 - 14, 2008). USENIX Association, Berkeley, CA, 23-39.
7. Yuan, C., Lao, N., Wen, J., Li, J., Zhang, Z., Wang, Y., and Ma, W. 2006. Automated known problem diagnosis with event traces. SIGOPS Oper. Syst. Rev. 40, 4 (Oct. 2006), 375-388.
8. Binkley D., Feild H., Lawrie D. and Pighin M., “Increasing diversity: Natural language measures for software fault prediction”, Journal of Systems and Software, Vol. 82, Issue 11, 2009, Pages 1793-1803.
9. Chess D. M., Segal A., Whalley I. and White S. R., “Unity: Experiences with a Prototype Autonomic Computing System”, First International Conference on Autonomic Computing, pp. 140-147, 2004.
10. DeJong G. F., editor, “Investigating Explanation-Based Learning”, Kluwer Academic Publishers, Norwell, MA, 1993.
11. Tesauro G., “Reinforcement Learning in Autonomic Computing: A Manifesto and Case Studies”, IEEE Internet Computing, Vol. 11, No. 1, pp. 22-30, 2007.
12. Kephart J. and Das R., “Achieving Self-Management via Utility Functions”, IEEE Internet Computing, Vol. 11, No. 1, pp. 40-48, 2007.
13. Angaye, C. O. and Fisher, P. S. 2007. Application of FI sequences for authentication in a network. In Proceedings of the 45th Annual Southeast Regional Conference (Winston-Salem, North Carolina, March 23 - 24, 2007). ACM-SE 45. ACM, New York, NY, 362-366.
14. Java Platform Debugger Architecture, JPDA, http://java.sun.com/javase/technologies/core/toolsapis/jpda/.
15. Brown A. et al., “Benchmarking autonomic capabilities: Promises and pitfalls”, Proceedings of the 1st International Conference on Autonomic Computing, IEEE NSF, May 2004.
16. Griffith R., “The 7U Evaluation Method: Evaluating Software Systems via Runtime Fault-Injection and Reliability, Availability and Serviceability (RAS) Metrics and Models”, Ph.D. Thesis, Columbia University, Aug 2008.
