Performance Problem Localization in Self-Healing, Service-Oriented Systems using Bayesian Networks∗

Rui Zhang, Steve Moyle and Steve McKeever
Oxford University Computing Laboratory
Wolfson Building, Parks Road, Oxford OX1 3QD, England
{ruiz,sam,swm}@comlab.ox.ac.uk

Alan Bivens
IBM T.J. Watson Research Center
19 Skyline Drive, Hawthorne, N.Y. 10532, USA
[email protected]

∗Rui Zhang and Steve Moyle are funded by the UK Department of Trade and Industry project: Heterogeneous Workload Management and Grid Integration (Grant THBB/C/008/00025). We are also grateful to David Power and Mark Slaymaker for their assistance in our experiments.

ABSTRACT

In distributed, service-oriented environments, performance problem localization is required to provide self-healing capabilities and deliver the desired quality of service (QoS). This paper presents an automated approach to identifying system elements causing performance problems. Applying probabilistic inference to collected response time and elapsed time data, the approach 1) infers elapsed time for services where data is missing, 2) estimates the response time degradation caused by different services using the duration, abnormality and response time correlation of their elapsed times, and 3) identifies the services that are the most important causes of slow response time and yield the most benefit if recovered. The approach has been used to localize a performance problem on the test bed of a real-world service-oriented Grid. Evaluation using simulations shows that the approach consistently achieves better accuracy than traditional techniques in various service-oriented settings.

Categories and Subject Descriptors

C.4 [Performance of Systems]: Reliability, availability, and serviceability

Keywords

End-to-end response time, Problem localization, Service-oriented computing, Bayesian networks, Missing data

1. INTRODUCTION

Systems following the Service-Oriented Architecture, such as those based on Web Services and more recently on the


Open Grid Service Architecture (OGSA) [7], are being widely used in both academia and industry. Meanwhile, performance problems in these environments have grown more difficult for human administrators to diagnose, owing to their scale and the dynamic interactions between (possibly geographically distributed) system components. As a crucial step towards realizing the autonomic vision [9] in service-oriented systems, automated performance problem localization is required to identify the fundamental system components (i.e. services) where corrective measures can be taken to restore the desired quality of service (QoS).

This paper focuses on abnormally slow services as causes of response time problems. These services may benefit from relatively obvious, lightweight solutions (e.g. replacing faulty hardware, restoring mistakenly dropped database tables [3] or simply restarting applications [14]) that return them to an earlier state in which their performance was consistent and normal. Localization of normally slow services (in contrast to abnormally slow ones), which may be assisted by autonomic resource actions, is a topic of future research. Focusing on abnormally slow services is valuable because services may not be equipped with active resource management capabilities such as CPU allocation and provisioning, or may find such actions prohibitive (e.g. other applications sharing the same resource could be affected). In addition, performance problems caused by mis-configurations or poorly written software may not be resolved by resource actions alone.

Some existing approaches in databases [17], business servers [10] and traditional distributed systems [1, 4, 8, 2, 16] have adopted a similar philosophy to attack performance problems, but they overlook two issues prominent in service-oriented systems. First, in practice, some services may become Unobservable (i.e. no performance data is available from these services), due to 1) a lack of instrumentation on these services (because of security/privacy constraints or insufficient provider support), 2) failures in the act of data reporting itself, or 3) the need to reduce overhead. Second, because components are commonly replicated for scalability and efficiency, the problem-causing components may be in many places within the environment. Performance problem localization must therefore identify the components that are the most important causes of the problem and that will considerably improve overall system performance if fixed. For instance, if service SA is being invoked in parallel with another service SB that has a significantly

longer elapsed time and thus dictates the overall response time, SB should be identified as the most significant cause even though SA is more abnormally slow than SB.

This paper presents an automated approach to performance problem localization that is robust against missing data, accurate as far as end-to-end performance goals are concerned, and lightweight. Building on a Bayesian network model that supports probabilistic inference from service elapsed times to end-to-end response time, the approach first uses the Bayesian network (in contrast to classic models like queueing networks) to derive the time elapsed on unobservable services from time measurements on observable ones, and then scores the difference between the current response time and the projected response time had a service performed normally, highlighting the "damage" a service has inflicted on end-to-end response time. This work makes the following technical contributions:

• An elapsed time deduction mechanism that optimizes the amount of information available for problem localization in hostile environments lacking performance data from unobservable services.

• A damage assessment mechanism that considers elapsed time duration, elapsed time abnormality and the correlation between elapsed time and end-to-end response time altogether, to pinpoint problem-causing services with the goal of response time improvement.

• A localization approach that has proved effective under a wide range of simulated settings, and a prototype implementation that has successfully located real-world performance problems in a service-oriented Grid.

• Self-healing suggestions for various platforms.

The remainder of the paper is structured as follows. Section 2 gives an introduction to the eDiaMoND service-oriented system. Section 3 presents this paper's automated approach to performance problem localization. The approach is extensively evaluated in Section 4, where it is shown to outperform traditional localization techniques. It is then tested on eDiaMoND in Section 5. Section 6 reviews related work. Section 7 concludes and discusses future research.

2. THE EDIAMOND SERVICE-ORIENTED SYSTEM

The eDiaMoND Grid, an OGSA-enabled federated database of annotated mammograms [5], will be used as a reference service-oriented system throughout this paper. Figure 1 shows a common eDiaMoND scenario, where a radiologist retrieves mammograms assigned to him/her for analysis and comparison. Having received a request from the radiologist client, the image list service calls the work list service asking for information about the images assigned to the radiologist. Suppose that the IDs and the locations of two images that need to be compared are returned. Since the two images are stored in a local hospital L and a remote hospital R respectively, the image list service simultaneously issues two requests to the image locator service on both sites. This leads to the invocation of the ogsa dai service (a database service) on both sites to obtain the corresponding images from the on-site databases. The retrieved images are returned as part of the service responses to the radiologist for viewing.

Figure 1: The eDiaMoND scenario.

As described in the research conducted by Zhang et al. [18], all eDiaMoND services and underlying components can be instrumented to trace requests, capturing the workflow and measuring the end-to-end response time and the time elapsed on each service. These data are the input to the problem localization approach described in the following section. Other eDiaMoND scenarios, such as one where the image locator remote service is dynamically called (as a backup) when the image locator local service fails, can also be handled by the approach described herein. They are, nevertheless, not discussed here due to space limitations.

3. AUTOMATED APPROACH TO PERFORMANCE PROBLEM LOCALIZATION

This section first introduces the mathematical foundation of the approach. The problem localization algorithm is then presented, along with explanations of how it compensates for missing data and considers multiple elapsed time factors to pinpoint the most important causes.

3.1 A Bayesian network linking service elapsed time to end-to-end response time

We use a Bayesian network [12] to model how end-to-end response time is composed from service elapsed times. Assuming the time elapsed on a service is independent of that on others, the response time model is given in Definition 1:

DEFINITION 1. A Response Time Bayesian network (RTBN) is a directed graph that contains a discrete service elapsed time variable set X = {X1, ..., Xn} and a discrete end-to-end response time variable D. These variables define the joint probability distribution p(D, X) as shown in Equation 1:

p(D, X) = p(D|X) ∏(i=1..n) p(Xi)    (1)

where p(Xi) is the Prior (Probability) Distribution for variable Xi, while p(D|X) is the Conditional (Probability) Distribution describing how X is "causing" D. The prior distributions p(Xi), i = 1, ..., n can be 1) trained from historical data, 2) supplied by the service provider, or 3) simply assumed to be uniform if not obtainable through the first two means. Unlike other BN-based approaches to response time modeling (e.g. [8]), the conditional distribution p(D|X) is not trained from data but defined using a deterministic function obtained directly from the workflow information.

Figure 2: The RTBN for the eDiaMoND scenario. (The first layer contains the service elapsed time variables, all feeding into D: X1 = image_list, X2 = work_list, X3 = image_locator_local, X4 = image_locator_remote, X5 = ogsa_dai_local, X6 = ogsa_dai_remote.)

Figure 2 illustrates the RTBN for the eDiaMoND scenario in Section 2. For this example, there are six first-layer variables and the deterministic function giving the conditional distribution is:

D = X1 + X2 + max(X3 + X5, X4 + X6)

because the image locator services (and then the ogsa dai services) on the two sites are invoked in parallel after the image list and work list services are called.
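To make the RTBN concrete, the following minimal Python sketch (our illustration; the paper's implementation was in Matlab, and the 10 ms bin width and service-name keys are assumptions) approximates the distribution of D by sampling the priors p(Xi) and pushing the samples through the deterministic workflow function:

```python
# A minimal, illustrative sketch (not the authors' code) of an RTBN forward
# pass for the eDiaMoND workflow. Each prior is a probability vector over the
# discrete elapsed-time states in `bins`; the distribution of D is
# approximated by Monte Carlo sampling.
import numpy as np

rng = np.random.default_rng(0)
bins = np.arange(0, 1000, 10)  # elapsed-time states in milliseconds (assumed)

def sample_prior(probs, n):
    """Draw n elapsed-time samples from a discrete prior p(Xi) over `bins`."""
    return rng.choice(bins, size=n, p=probs)

def response_time_samples(dists, n=100_000):
    """Push samples of X through D = X1 + X2 + max(X3 + X5, X4 + X6)."""
    x = {name: sample_prior(p, n) for name, p in dists.items()}
    return (x["image_list"] + x["work_list"]
            + np.maximum(x["image_locator_local"] + x["ogsa_dai_local"],
                         x["image_locator_remote"] + x["ogsa_dai_remote"]))

# Example: uniform priors (option 3 in the text) for all six services.
uniform = np.full(len(bins), 1.0 / len(bins))
priors = {s: uniform for s in ["image_list", "work_list",
                               "image_locator_local", "image_locator_remote",
                               "ogsa_dai_local", "ogsa_dai_remote"]}
print("E(D) under uniform priors:", response_time_samples(priors).mean())
```

Sampling is used here purely for brevity; exact inference over the discrete states, as in [12], would serve equally well.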

3.2 Localizing the most important causes

For any period in which performance problems occurred (e.g. users experienced abnormally long response times), localization of the most important cause can be launched and completed in three steps from two inputs: Observation (Probability) Distributions po(O), acquired in the problematic period from the observable service set O, and the prior distributions p(X) estimated from observation histories.

Firstly, Elapsed Time Deduction is performed for each unobservable service, in order to provide some insight into the states of these services using the data obtained. In other words, a Posterior (Probability) Distribution p(U|O = E(O)) (where E(O) gives the mean values of the observations) is computed using standard BN-based inference [12], based on Equation 1, for each service U where no elapsed time data is available (a rough code sketch of this step is given at the end of this subsection).

Secondly, Damage Assessment estimates "what the response time D would have been, had a service Z (observable or unobservable) performed normally," and scores the difference between this estimate and the actual response time, relative to the latter, as given in Equation 2:

ds(Z) = (E(D) − E(D_Z)) / E(D)    (2)

where E(D) is the mean response time for the given period, p(D_Z) is the Projected (Probability) Distribution for the response time had Z performed normally (i.e. consistently with its Baseline (Probability) Distribution pb(Z), which is established by probing the service when it is idle or under normal circumstances, before any localization process is launched), E(D_Z) is the corresponding mean, and the resulting damage score ds(Z) is a percentage reflecting by how much the service accounts for the total response time degradation.

The damage scoring mechanism enables us to consider three facets of service elapsed time together during problem localization:

• Duration – Only a service that accounts for a large portion of the response time is likely to be a major cause of the response time problem.

• Abnormality – It may be simpler to focus on significantly abnormal services if operating conditions can be easily turned back to a normal performing state.

• Correlation with response time – Only changes to service elapsed times with strong correlation to the response time may have triggered a significant increase in it.

Finally, the services are ranked based on their damage scores. The localization procedure is summarized in Table 1. An elapsed time deduction run and a damage assessment run both take O(n) time, with n = |X|. This computation is repeated for up to n services, yielding an algorithm complexity bounded by O(n²).
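As a rough illustration of the deduction step, the sketch below reuses `bins`, `sample_prior` and the workflow function from the Subsection 3.1 sketch. Rejection sampling stands in for the standard BN inference of [12]; the assumption that the end-to-end response time D is also conditioned on, and the tolerance value, are ours:

```python
# Approximate the posterior of one unobservable service U by rejection
# sampling: clamp observables to their means, keep prior samples whose
# implied D matches the observed response time.
import numpy as np

def deduce_posterior(u, priors, observed_means, d_observed, n=200_000, tol=20):
    x = {name: sample_prior(p, n) for name, p in priors.items()}
    for name, mean in observed_means.items():
        x[name] = np.full(n, mean)          # clamp O = E(O)
    d = (x["image_list"] + x["work_list"]
         + np.maximum(x["image_locator_local"] + x["ogsa_dai_local"],
                      x["image_locator_remote"] + x["ogsa_dai_remote"]))
    keep = np.abs(d - d_observed) < tol     # samples consistent with D
    accepted = x[u][keep]
    if accepted.size == 0:
        return priors[u]                    # fall back to the prior
    # Histogram the accepted samples back into a distribution over `bins`.
    return np.array([np.mean(accepted == b) for b in bins])
```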

INPUT: Observation distributions po(O) (where available) and prior distributions p(X)
OUTPUT: The set of services in the order of their damage scores X* and the scores themselves ds(X*)

FOR ALL U ∈ X − O BEGIN
    Obtain p(U | O = po(O));
END
FOR ALL Z ∈ X BEGIN
    Compute the projected distribution for D using pb(Z), p(U | O = po(O)) and po(V) as the respective prior distributions of Z, any U ∈ X − O and any V ∈ O in Equation 1;
    Compute ds(Z) using Equation 2;
END
Sort X and return X* such that ds(X*_1) ≥ ... ≥ ds(X*_n);

Table 1: Problem localization algorithm.
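Under the same assumptions as the earlier sketches, the damage-assessment loop of Table 1 might look as follows; `dists` holds the per-service distributions used as priors in Equation 1 (po(V) for observables, deduced posteriors for unobservables) and `baselines` holds pb(Z):

```python
# Damage assessment: for each service Z, project what the mean response time
# would have been had Z followed its baseline, then score and rank.
def damage_scores(dists, baselines):
    e_d = response_time_samples(dists).mean()    # actual mean response time
    scores = {}
    for z in dists:
        projected = dict(dists)
        projected[z] = baselines[z]              # "had Z performed normally"
        e_dz = response_time_samples(projected).mean()
        scores[z] = (e_d - e_dz) / e_d           # Equation 2
    # Rank services by how much of the degradation they account for.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```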

3.3 Problem remediation recommendations

Once the most significant problem-causing services are located, recovery measures can be taken to restore their normal behavior and eliminate the performance problem. These measures are usually specific to the applications, constituent services and underlying platforms:

1) Unimportant workload suppression, which stops unexpected and disruptive workloads (e.g. see Subsection 5.3).
2) Failed hardware (CPU, memory, disks, etc.) replacement, which restores system capacity.
3) Mis-configuration correction, which restores adequate system settings (e.g. reconstructing an accidentally dropped DB optimization table [3]).
4) Software/hardware reboot, which alleviates accumulated misbehavior (e.g. a memory leak) or an application failure. Restarting software to get back to a normal state has become so common that some middleware now ships with the ability to set health policies that restart applications at regular intervals [14].

Seamless and generic ways of integrating such recovery measures with the localization procedure to form autonomic solutions are a promising direction for future research.

4. EVALUATION THROUGH SIMULATIONS

In this section, the efficacy of the problem localization approach is evaluated in a comprehensive manner using simulations. The two key constituent techniques are evaluated in isolation, before the approach is assessed as a whole.

4.1 Simulations and evaluation methodology

Experiments in this section were conducted within a service-oriented system of four services simulated in Matlab [11]. The reasons for simulating only four services are two-fold: it is computationally more affordable to repeat the experiments many times, and a rich enough set of workflows can be formed whilst still being sufficiently covered by the experimental runs. The simulated services receive and send calls to each other and randomly generate a processing delay upon receiving calls. They are assembled by different workflows to constitute simulated applications. The simulated delays (and response times) are fed directly into a Matlab implementation of Table 1.

Each experiment in this section was run a large number of times. In each run, selected simulated services were slowed down to cause response time problems under a Randomly Generated Condition: 1) the workflow can range from mainly sequential to highly parallel; 2) the number of observable services and abnormally performing services varies; 3) the degree of abnormality can vary from minor to major; 4) the difference between the elapsed times on different services is gradually increased.

First, performance problem localization was launched to select the single most significant cause, as discussed in Section 3. Each slowed service was then "recovered" one at a time in the simulation, while the other slowed services remained unchanged. A corresponding response time was measured for each recovery. The service resulting in the best recovered response time, naturally, was the right service to isolate. The localization was considered correct if it isolated the same service. To summarize results from all runs, the following evaluation metric is used throughout this section:

Overall Accuracy = (number of correct localization runs) / (total number of runs)    (3)

In this sense, a basic method that randomly guesses the problem-causing service would be around 25% accurate.
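For illustration, a toy version of a single evaluation run could look like the following. It reuses the six-service eDiaMoND workflow and the `damage_scores` and `response_time_samples` functions from the Section 3 sketches rather than the paper's four-service Matlab simulator, and the crude "shift the prior" fault model is our assumption:

```python
# One toy evaluation run: slow some services, localize, then check whether
# recovering the top-ranked service yields the best response time.
import numpy as np

def slow_down(prior, shift_bins=20):
    shifted = np.roll(prior, shift_bins)  # push probability mass to larger bins
    shifted[:shift_bins] = 0.0            # drop mass that wrapped around
    return shifted / shifted.sum()

def one_run(baselines, slowed):
    observed = {s: (slow_down(p) if s in slowed else p)
                for s, p in baselines.items()}
    top = damage_scores(observed, baselines)[0][0]
    # The "right" answer: the slowed service whose recovery helps most.
    best = min(slowed, key=lambda s: response_time_samples(
        {**observed, s: baselines[s]}).mean())
    return top == best  # counts toward Overall Accuracy in Equation 3
```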

4.2 Evaluating individual techniques

In the first experiment, the efficacy of elapsed time deduction was evaluated by comparing the overall accuracy of the automated approach with elapsed time deduction enabled against its overall accuracy with deduction disabled. Overall accuracy was calculated over 5000 runs of this experiment, each with a randomly generated condition. A summary of the results is shown in Table 2, illustrating a 33% improvement in overall accuracy with the use of elapsed time deduction.

Damage assessment with deduction       0.84
Damage assessment without deduction    0.51
Random guesses                         0.25

Table 2: Evaluation of elapsed time deduction.

The second experiment contrasted the damage assessment component of our approach with two traditional approaches. Note that full service elapsed time data availability was assumed, so as to highlight the efficacy of damage assessment. The two traditional techniques considered are Abnormality Determination (e.g. as in [10]) and High Latency Determination (e.g. as in [2]). The implementation of abnormality determination calculates the ratio between the currently measured mean elapsed time of a service and the 95th percentile of its baseline distribution; the service with the highest ratio is chosen. High latency determination is implemented such that it simply searches for the service with the greatest mean current elapsed time. Both are sketched in code after Table 3.

Damage assessment            0.98
Abnormality determination    0.45
High latency determination   0.43
Random guesses               0.25

Table 3: Evaluation of damage assessment.

Table 3 summarizes a comparison between the overall accuracy of each approach after 4000 runs, each under a different condition. The results show that damage assessment largely outperformed the other two approaches.
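For reference, minimal sketches of the two baseline techniques, under the same assumptions as the earlier code (`current_means` holds each service's measured mean elapsed time; `baselines` holds the pb distributions over `bins`):

```python
# Illustrative implementations of the two traditional techniques compared
# against damage assessment.
import numpy as np

def abnormality_determination(current_means, baselines):
    def p95(probs):
        # Smallest bin whose cumulative baseline probability reaches 0.95.
        return bins[np.searchsorted(np.cumsum(probs), 0.95)]
    return max(current_means,
               key=lambda s: current_means[s] / p95(baselines[s]))

def high_latency_determination(current_means):
    # Simply pick the service with the greatest mean current elapsed time.
    return max(current_means, key=current_means.get)
```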

4.3 Overall evaluation

In order to evaluate the entire approach, the second experiment of the previous subsection was rerun, with elapsed time deduction enabled and some services made unobservable. The experimental outcomes were plotted against the simulated workflows ordered in ascending parallelism when only two services were observable (Figure 3), and then against the number of observable services when various workflows were used (Figure 4).

Figure 3: Evaluation of the approach as the parallelism of the workflow increases. (Overall accuracy of damage assessment, abnormality determination, high latency determination and random guesses over workflows wf 1 to wf 4, ordered from sequential to parallel.)

In Figure 3, the approach consistently outperformed the other two approaches for all four workflows considered, with two unobservable services. This advantage generally increased as the workflow became more parallel. The reason is that when the workflow is highly parallel, abnormal or slow services are more likely to be running in parallel with other (slower) services that dictate the end-to-end response time. Consequently, these services are unlikely to be major causes of the response time problem, but can still be wrongly identified as such by the traditional approaches. Figure 4 clearly shows that the approach in this work achieved greater accuracy than the other two approaches regardless of how many services were observable.

5. EXPERIMENTAL VALIDATION

This section validates that the approach in this paper is effective against performance problems found in real-world settings. A known problem in the eDiaMoND test-bed is chosen for this purpose.

Figure 4: Evaluation of the approach as the number of observable services increases. (Overall accuracy of damage assessment, abnormality determination, high latency determination and random guesses, plotted against 0 to 4 observable services.)

5.1 Prototype implementation for eDiaMoND

The eDiaMoND services shown in Figure 1, ogsa dai local, image locator local, ogsa dai remote, and image locator remote, were hosted by four AIX machines, each with two 3.0 GHz dual-core CPUs and 2 GB of memory. The image list and work list services were hosted by a single Red Hat Linux machine with two 3.0 GHz dual-core CPUs and 1 GB of memory. Since the entire test-bed is within the same sub-net, extra routing was imposed between image list and image locator remote through request forwarding, to simulate a connection to a remote site. The eDiaMoND services are instrumented to report statistics (i.e. elapsed/response times and workflow) to an IBM Cloudscape database server, where the data can be queried, retrieved and fed into a Matlab implementation of the localization algorithm.
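Purely as an illustration of this data path, a query of the statistics store might look like the following; the table and column names are invented, since the paper specifies only what is reported, and sqlite3 stands in for a JDBC connection to Cloudscape:

```python
# Hypothetical sketch of pulling one service's elapsed times from the
# statistics database. Schema (service_stats, elapsed_ms, ts) is invented.
import sqlite3  # local stand-in for a JDBC connection to Cloudscape

def fetch_elapsed_times(conn, service, t_start, t_end):
    cur = conn.execute(
        "SELECT elapsed_ms FROM service_stats "
        "WHERE service = ? AND ts BETWEEN ? AND ?",
        (service, t_start, t_end))
    return [row[0] for row in cur]
```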

5.2 Problem description

As new mammograms are scanned in eDiaMoND, they must be loaded into the database(s) from time to time. Such data loading can be lengthy and puts considerable load on the hosting systems. As a consequence, the service(s) running on those systems are affected, leading to increased response times for user requests served during the loading period.

Following one hour of system characterization (which builds the baseline and prior distributions) under normal workload, the experiment was carried out as two scheduled 1-hour image data loads were simultaneously performed on the databases of ogsa dai local and ogsa dai remote. Note that ogsa dai local is invoked in parallel with image locator remote, which has a much longer elapsed time due to the simulated remote connection. As a result, the slowdown of ogsa dai local should be overshadowed and have only a minor influence on end-to-end response time. In addition, monitoring on image list and work list was deliberately turned off to test how the proposed localization approach responds to missing data. In this setting, the approach was analyzed to validate 1) that the localization process can distinguish between the two slowdowns and correctly isolate ogsa dai remote as the most significant cause, and 2) that it is sufficiently lightweight.

img lst             -0.040
img loc local        0.000
img loc remote       0.018
ogsa dai local       0.000
ogsa dai remote      0.095
work lst            -0.015

Table 4: Damage scores for eDiaMoND services during image data loading.

5.3 Experimental results

The damage scores computed by the localization procedure outlined in Table 1 for each eDiaMoND service are shown in Table 4. The ogsa dai remote service has the highest score and is indeed identified as the most significant

cause. Data loading at the remote hospital may be stopped and postponed to restore the normal performance of this service, bringing a 9.5% reduction in overall response time. All other services either performed much closer to their respective baselines, or were overshadowed (even the slowed-down ogsa dai local) by parallel services with consistently longer elapsed times, and thus scored zero.

The on-line overhead (i.e. monitoring overhead) of the prototype was measured to be around 5.5 milliseconds per service. This amount can be deemed negligible in the eDiaMoND context, where image retrieval requests often take seconds to complete. The localization procedure (including training the RTBN prior distributions) consumed around 2 minutes. This time frame appears reasonable considering that it is an automated off-line procedure and that several hundred states are used to encode each RTBN variable.

6. RELATED WORK

There is an existing body of literature that tackles problem determination based on passive instrumentation, which broadly falls into two categories. The first category features traditional approaches founded on characterizing the normal performance behavior of individual components via instrumentation [10, 1, 17, 2]. These methods are effective in signaling problematic behaviors of individual (or groups of) components. However, unlike the approach proposed in this paper, they lack the correlation between individual behaviors and end-to-end metrics, and often fail to pinpoint the components that are the true causes.

In contrast, the second category of instrumentation-based work has taken the end-to-end perspective into account. This is achieved either by simply learning the correlation between component resource (e.g. CPU, disk) usage states and SLA violation states [8], or by further estimating the end-to-end impact of local resource allocation decisions using classic analytical models like queueing networks [16]. One common limitation of these approaches is that they may focus on components that are performing normally and are expensive to accelerate. In addition, the correlation acquired in [8] ignores the total provisionable amount of the metric (e.g. elapsed time duration), whereas analytical models can be extremely difficult to build, overly simplified, and confined to a particular type of system. The approach presented in this paper does not exhibit these shortcomings. Note also that the above works have not considered the issue that performance data may not be available from some components, which is addressed in this research by deriving the elapsed time distributions for these unobservable components from obtainable measurements.

Contrary to the attempts reviewed in the first two paragraphs of this section, some recent work has focused on probing systems externally to attack fault determination [6, 13, 15]. Not only does probing inevitably impose extra workload on the system by exercising it with synthetic requests, it may also be difficult to replicate the system conditions

under which the performance problems occurred [10].

7. CONCLUSIONS AND FUTURE WORK

This paper has presented an automated approach to performance problem localization in service-oriented systems. The approach focuses on isolating the problem-causing services that are the most important causes of (unusually) slow end-to-end response time. Having observed service performance through instrumentation, a lightweight Bayesian network model is adopted to assess the end-to-end response time degradation caused by services, even in the event of missing data. Multiple factors are considered to ensure the localization of services that can easily be fixed by recovering them to normal (e.g. restarting the services) and, if recovered, will most dramatically improve the response time.

The approach is general enough to be applied to any (partially) instrumented transaction-oriented system, and to work with models other than the RTBN (e.g. the tree-augmented Bayesian network built by Cohen et al. [8]) that support probabilistic inference. Once problem-causing services are isolated, the approach can be applied at a more fine-grained level, further isolating problem-causing constituent elements (e.g. software/hardware components). A decentralized, multi-tier architecture is being considered that will see data gathering and analysis for lower-level services pushed (from a central location) down to individual services (or beyond). This architecture can ensure the approach scales well even for extremely large systems.

The localization approach proposed here currently targets only those services that are abnormally slow. However, some autonomic strategies may be interested in accelerating consistently poor performers through various autonomic resource actions. The approach in this paper builds the framework for this type of capability by providing a flexible baseline distribution which can be augmented with predictions of provisioned elapsed times. The independence assumption made in the current RTBN model (Subsection 3.1) to facilitate a lightweight solution, although seemingly accurate (as illustrated in Sections 4 and 5), leaves room for improvement. Extensive evaluations of our approach, both in real-world applications other than eDiaMoND and through larger-scale simulations, are being considered; these will further refine the solution for use in real-world autonomic management processes.

8. REFERENCES

[1] M. K. Agarwal, K. Appleby, M. Gupta, G. Kar, A. Neogi, and A. Sailer. Problem determination using dependency graphs and run-time behavior models. In Proceedings of the 15th IFIP/IEEE Distributed Systems: Operations and Management, Davis, California, USA, November 2004.

[2] M. Aguilera, J. Mogul, J. Wiener, P. Reynolds, and A. Muthitacharoen. Performance debugging for distributed systems of black boxes. In Proceedings of the 19th ACM Symposium on Operating Systems Principles, pages 74–89, Bolton Landing, NY, USA, 2003.

[3] N. Alur, M. Goodwin, H. Kawada, R. Midgette, D. Shenoy, R. Warley, and A. Betawadkar-Norwood. DB2 II: Performance monitoring, tuning and capacity planning guide. Technical report, IBM Corporation, November 2004.

[4] P. Barham, A. Donnelly, R. Isaacs, and R. Mortier. Using Magpie for request extraction and workload modelling. In Proceedings of the 6th Symposium on Operating Systems Design and Implementation, San Francisco, CA, USA, December 2004.

[5] J. Brady, D. Gavaghan, A. Simpson, M. Mulet-Parada, and R. Highnam. eDiaMoND: A grid-enabled federated database of annotated mammograms. In Grid Computing: Making the Global Infrastructure a Reality, pages 923–943. Wiley Series, 2003.

[6] M. Chen, E. Kiciman, E. Brewer, and A. Fox. Pinpoint: Problem determination in large, dynamic internet services. In Proceedings of the IEEE International Conference on Dependable Systems and Networks, pages 595–604, Bethesda, MD, USA, 2002.

[7] I. Foster and C. Kesselman. The Grid 2: Blueprint for a New Computing Infrastructure. Morgan Kaufmann, 2004.

[8] I. Cohen, J. S. Chase, M. Goldszmidt, T. Kelly, and J. Symons. Correlating instrumentation data to system states: A building block for automated diagnosis and control. In Proceedings of the 6th Symposium on Operating Systems Design and Implementation, pages 231–244, San Francisco, California, USA, 2004.

[9] J. O. Kephart and D. M. Chess. The vision of autonomic computing. Computer, 36(1):41–50, 2003.

[10] G. McKnight and D. Watts. Help me find my IBM eServer xSeries performance problem. Technical report, IBM Corporation, 2004.

[11] C. Moler. Numerical Computing with MATLAB. Society for Industrial and Applied Mathematics, 2004.

[12] R. Neapolitan. Probabilistic Reasoning in Expert Systems. Wiley Interscience, 1989.

[13] I. Rish, M. Brodie, N. Odintsova, S. Ma, and G. Grabarnik. Real-time problem determination in distributed systems using active probing. In Proceedings of the 9th IEEE/IFIP Network Operations and Management Symposium, pages 133–146, Seoul, Korea, 2004.

[14] B. Roehm, T. Erker, C. Finneran, V. Mann, K.-M. Wan, and P. Wiedeking. Using WebSphere Extended Deployment V6.0 to build an on demand production environment. Technical report, IBM Corporation, August 2006.

[15] M. Steinder and A. Sethi. End-to-end service failure diagnosis using belief networks. In Proceedings of the 7th IEEE/IFIP Network Operations and Management Symposium, pages 375–390, Florence, Italy, 2002.

[16] B. Urgaonkar, P. Shenoy, A. Chandra, and P. Goyal. Dynamic provisioning of multi-tier internet applications. In Proceedings of the 2nd International Conference on Autonomic Computing, pages 217–228, Seattle, Washington, USA, 2005.

[17] G. Wood and K. Hailey. The self-managing database: Automatic performance diagnosis. Technical report, Oracle Corporation, November 2003.

[18] R. Zhang, S. Heisig, S. Moyle, and S. McKeever. OGSA-based grid workload monitoring. In Proceedings of the 5th IEEE International Symposium on Cluster Computing and the Grid, pages 668–675, Cardiff, UK, May 2005.
