Dependable Service-Oriented Computing
Building Accountability Middleware to Support Dependable SOA

The Intelligent Accountability Middleware Architecture (Llama) project supports monitoring, runtime diagnosis, and reconfiguration for dependable service-oriented architecture (SOA) systems. At its core, Llama implements an accountability service bus that users can install on existing service-deployment infrastructures. It collects and monitors service execution data from a key subset of services; lets Llama users incorporate others' advanced diagnosis models and algorithms into the framework; and provides enterprise service bus extensions for collecting service profiling data, thus making process problems transparent and easier to diagnose. Experimental results indicate that using Llama adds only a modest amount of system overhead.
Kwei-Jay Lin, Mark Panahi, Yue Zhang, Jing Zhang, and Soo-Ho Chang, University of California, Irvine
Service-oriented architecture (SOA) is the prevailing software paradigm for dynamically integrating loosely coupled services into one cohesive business process (BP) using a standards-based software component framework.1 SOA-based systems can integrate both legacy and new services, whether enterprises have created and hosted them internally or external service providers host them. When users invoke services in their BPs, they expect those services to produce good results that have both functionally correct output and acceptable performance levels in accordance with quality-of-service (QoS) constraints (such as those in service-level agreements [SLAs]). So, if a service produces incorrect results
or violates an SLA, an enterprise must hold the service provider responsible (also known as accountability). Identifying the source of a BP failure in a SOA system can be difficult, however. For one thing, BPs can be very complex, having many execution branches and invoking services from various providers. Moreover, a service's failure could result from some undesirable behavior by its predecessors in the workflow, its execution platform, or even its users. To identify a problem's source, an enterprise must continuously monitor, aggregate, and analyze BP services' behavior.1 Harnessing such a massive amount of information requires efficient support from that enterprise's service-deployment infrastructure. The infrastructure should also detect different types of faults and support corresponding management algorithms. So, a fault-management system for a SOA must be flexible enough to manage numerous QoS and fault types.

SOA makes diagnosing faults in distributed systems simultaneously easier and more difficult. On one hand, the Business Process Execution Language (BPEL) clearly defines execution paths for SOAs such that all interactions among services occur as service messages that we can easily log and inspect. On the other hand, external providers might own their services, hiding their states as black boxes from any diagnosis engine and making diagnosis more difficult. To address these issues, we developed the Intelligent Accountability Middleware Architecture (Llama). It includes components to help monitor services on a BP user's behalf, find the root cause of failures when they occur, and perform reconfigurations if necessary. Here, we look at Llama's implementation and performance. First, however, we examine the concept of accountability and challenges in building accountable SOAs.
Service Accountability
Many enterprise systems use business activity monitoring (BAM) tools to monitor BP performance and alert managers when problems occur. Current BAM tools report information via a dashboard or broadcast alerts to human managers, who then initiate corrective action. For SOA systems, BAM might become part of the enterprise service bus (ESB), which is a common service integration and deployment technology. Enterprises can extend ESBs to support monitoring and logging and to provide both data analysis and visualization for the various services deployed on them. We believe a dependable SOA should perform automated diagnosis and reconfiguration, so the Llama project studies accountability support for SOA.
Accountability Framework for SOA

Algirdas Avižienis and his colleagues define accountability as "the availability and integrity of the identity of the person who performed an operation."2 Both legal and financial communities use the notion of accountability to clarify who is responsible for causing problems in complex interactions among different parties. It's a comprehensive quality assessment to ensure that someone or something is held responsible for undesirable effects or results during an interaction.3 We believe accountability is also an important concept in SOA because all services in a BP should be effectively regulated to ensure their correct execution. The root cause of any execution failure should be clearly inspected, identified, and removed to control damage. If we impose accountability on all services, service consumers will get a clearer picture of what constitutes abnormal behavior in service collaborations and will expect fewer problems when subscribing to better services in the future.

To make SOA accountable, we must build a system infrastructure that can detect, diagnose, defuse, and disclose service faults (the "four D's"). Detection recognizes abnormal behavior in services: an infrastructure should have fault detectors that can recognize faults by monitoring services, comparing current states to acceptable service properties, and describing abnormal situations. Diagnosis analyzes service causality and identifies root service faults. Defusing recovers a problematic service from the identified fault; it should produce an effective recovery for each fault type and system-management goal. Disclosure keeps track of services responsible for failures to encourage them to avoid repeating mistakes.
Accountability Challenges in SOA

We identified the following accountability challenges that a SOA's inherent characteristics introduce:

• A SOA accountability mechanism must be able to deal with the causal relationships that exist in service interactions and find a BP problem's root cause.
• A SOA accountability mechanism should adopt probabilistic and statistical theory to model the uncertainty inherent in distributed workflows and server workloads.
• The problem diagnosis mechanism must scale well in large-scale distributed SOA systems.
• To prevent excessive overhead, a system should collect as little service data as possible but still enough to make a correct diagnosis.

Our current Llama implementation uses the Bayesian network diagnosis engine4 to model
causal relationships between services and conduct a probabilistic analysis of service failures. It could also use other diagnosis algorithms that can capture a BP's causal and probabilistic nature. Llama is designed for efficient monitoring to reduce runtime cost.5 Given that a BP might comprise many services, monitoring all of them can create significant overhead, severely impacting performance. We designed the Llama configuration algorithms to select a small set of agents and evidence channels and thus detect problems efficiently. After diagnosis, Llama performs automated reconfiguration based on either a preselected or dynamically calculated alternate service process path.6

Figure 1. Example of an accountable service-oriented architecture. All services are deployed on the Intelligent Accountability Middleware Architecture (Llama) ASB middleware. The middleware uses multiple agents to address monitoring requirements. Each agent can monitor a subset of services (shown as the circled areas). All agents report to the accountability authority (AA), which performs diagnosis. Users control and observe the AA via the accountability console (AC).
The Llama Middleware Architecture
The Llama middleware extends an ESB to provide transparent management capabilities for BPs. Llama supports monitoring and diagnosis mainly via service-oriented distributed agents. The Llama middleware can restructure its monitoring configuration dynamically. Situation-dependent BP policies and QoS requirements drive Llama's selection of diagnosis models and algorithms. The middleware then adopts and deploys a suitable diagnosis service. Finally, Llama accommodates service recovery by rerouting service messages to new service providers or versions on the service bus.

We implement Llama with three main components (see Figure 1): the accountability service bus (ASB) transparently and selectively monitors service, host (for example, CPU), and network behaviors; accountability agents (agents) observe and identify service failures in a BP; and the accountability authority (AA) diagnoses service faults in BPs and conducts reconfiguration operations. We further describe Llama's implementation later in the article. As part of the Llama project, we've also studied and developed algorithms to compose BPs, configure the middleware, and conduct diagnosis.5,7,8

The Llama project's main contribution is its powerful architecture and support for dependable SOA. Llama seamlessly integrates runtime monitoring, diagnosis, and recovery for SOA systems. Specifically, it provides

• Efficiency. The Llama framework collects and monitors service execution data from only a subset of services, thus reducing runtime overhead and data inspection overhead.
• Flexibility. The Llama framework can adopt many advanced diagnosis models and algorithms, examples of which are available elsewhere.9,10
• Transparency. The Llama framework allows service providers to install the ASB to collect service and host profiling data and make service behavior visible to the diagnosis engine.
Related Work in Diagnosis and Monitoring
Diagnosis of distributed systems and dependable systems is a well-known research area with extensive prior work.1–6 Our aim with Llama wasn't to design new diagnosis models or algorithms. Rather, we wanted to design a service-oriented architecture (SOA) middleware architecture with the flexibility to utilize advanced diagnosis models and algorithms others have developed. We adopted Bayesian network reasoning7,8 because it can provide probabilistic reasoning based on business processes' causal structure. Our plan for Llama is ultimately to integrate other diagnosis models and algorithms in different accountability authority (AA) versions so that enterprises can use them to diagnose different faults according to user requirements.

Abdelkarim Erradi and his colleagues present wsBus, a Web service middleware for reliable and fault-tolerant computing.9 To handle faults, wsBus performs runtime monitoring on functional and quality-of-service (QoS) aspects of services as well as runtime adaptation based on user-specified policies. However, wsBus's monitoring and adaptation components are designed and executed on an individual service basis, whereas the Llama framework uses the accountability model to monitor the end-to-end quality of processes and adapt at the process level with an alternate path. In addition, we designed the Llama enterprise service bus's configuration gateway and rerouting facility to be transparent and very efficient.

The Tiresias middleware collects application-level (response time) and system-level performance metrics (CPU usage, memory usage, and network traffic).10 The system uses these performance data to make black-box failure predictions through trend analysis. Llama differs from Tiresias in terms of data-dependency reasoning and analysis because our target applications are business processes.

Finally, Ying Li and his colleagues present a framework for autonomically reconfiguring service-based systems.11 They're concerned mostly with the problem of reconfiguring hosts by diagnosing problematic configurations, developing a reconfiguration plan, and executing the plan by migrating services from one machine to another, essentially redistributing load. In comparison, Llama performs diagnoses on individual services by checking their executions.

References
1. I. Cohen et al., "Correlating Instrumentation Data to System States: A Building Block for Automated Diagnosis and Control," Proc. 6th Conf. Operating Systems Design & Implementation (OSDI 04), Usenix Assoc., 2004, pp. 231–244.
2. M.Y. Chen et al., "Path-Based Failure and Evolution Management," Proc. 1st Conf. Networked Systems Design and Implementation (NSDI 04), Usenix Assoc., 2004, pp. 309–322.
3. P. Reynolds et al., "Pip: Detecting the Unexpected in Distributed Systems," Proc. 3rd Conf. Networked Systems Design and Implementation (NSDI 06), Usenix Assoc., 2006, pp. 115–128.
4. P. Barham et al., "Using Magpie for Request Extraction and Workload Modeling," Proc. 6th Conf. Operating Systems Design & Implementation (OSDI 04), Usenix Assoc., 2004, pp. 259–272.
5. M.K. Aguilera et al., "Performance Debugging for Distributed Systems of Black Boxes," Proc. 19th ACM Symp. Operating Systems Principles (SOSP 03), ACM Press, 2003, pp. 74–89.
6. J. Dunagan et al., "Fuse: Lightweight Guaranteed Distributed Failure Notification," Proc. 6th Conf. Operating Systems Design & Implementation (OSDI 04), Usenix Assoc., 2004, pp. 151–166.
7. K.B. Korb and A.E. Nicholson, Bayesian Artificial Intelligence, Chapman & Hall, 2004.
8. U. Lerner et al., "Bayesian Fault Detection and Diagnosis in Dynamic Systems," Proc. 17th Nat'l Conf. Artificial Intelligence, AAAI Press, 2000, pp. 531–537.
9. A. Erradi, P. Maheshwari, and V. Tosic, "Policy-Driven Middleware for Self-Adaptation of Web Services Compositions," Proc. 7th Int'l Middleware Conf., Springer, 2006, pp. 62–80.
10. A.W. Williams, S.M. Pertet, and P. Narasimhan, "Tiresias: Black-Box Failure Prediction in Distributed Systems," Proc. 15th Int'l Workshop on Parallel and Distributed Real-Time Systems, IEEE Press, 2007, pp. 1–8.
11. Y. Li et al., "Self-Reconfiguration of Service-Based Systems: A Case Study for Service Level Agreements and Resource Optimization," Proc. Int'l Conf. Web Services, IEEE Press, 2005, pp. 266–273.

When enterprises use SOA, the choice of which service to use at any given time can fluctuate continuously depending on current service performance, cost, and many other factors. For such a highly dynamic environment, few existing frameworks can automate the analysis and identification of BP problems or perform reconfigurations (see the "Related Work in Diagnosis and Monitoring" sidebar for an examination of other projects). Next, let's look at the Llama middleware components and features that make automated process monitoring, analysis, and reconfiguration possible.
Llama Components

As Figure 1 shows, Llama administrators deploy their services on the ASB. In addition to a service requester and any services deployed, the Llama architecture's two main components are the AA and the agents. These components collaborate to perform runtime process monitoring, root-cause diagnosis, service process recovery, and service network optimization. The AA deploys multiple agents to address scalability requirements. Each agent monitors a subset of services (the cylinders in Figure 1) during BP execution. Figure 2 shows the Llama accountability core's detailed
architecture.

Figure 2. Llama middleware components. The accountability authority (AA) performs intelligent management for the deployment, diagnosis, and recovery of a service process. Agents collect data from the accountability service bus (ASB) for problem detection and analysis. The ASB extends enterprise service bus (ESB) capabilities by providing a profiling facility to collect service execution and host performance data.

The AA also performs intelligent management to deploy, diagnose, and reconfigure service processes. The AA

• receives BP management requests from process administrators;
• deploys and configures the accountability framework once a process user submits a process for management;
• conducts root-cause diagnosis when agents (sometimes concurrently) report exceptions; and
• initiates a process reconfiguration to recover process execution.

Agents act as intermediaries between the ASB, where data is collected, and the AA, where it goes for analysis and diagnosis. They're responsible for

• configuring evidence channels on the ASB;
• performing runtime data analysis on the basis of information the ASB pushes to them;
• reporting exceptions to the AA; and
• initiating fault-origin investigation under the AA's direction.

The Llama ASB extends ESB capabilities by providing a distributed API and framework on which agents can collect service execution and host-performance data. Agents can do this either by using the ESB's existing service-monitoring API or through attached profiling interceptors that collect monitoring information such as service execution time (the current Llama ASB prototype, which we describe in more detail later, uses the latter method). Both services and agents can be invoked across administrative boundaries. Agents can push or pull data and can collect and send it at configurable intervals. Enterprises can install the Llama ASB on any existing ESB framework as long as that framework supports service-request interception or other means to collect service data.
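As a rough illustration of how such interception and interval-based pushing might look, here is a minimal, self-contained sketch. It is not the Mule-based ASB implementation; the service name, push interval, and report format are assumptions made for exposition.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Illustrative sketch only; not the Mule-based ASB code. An interceptor records each
// service invocation's execution time, and an evidence channel pushes the collected
// samples to an agent at a configurable interval.
public class ProfilingSketch {

    // Wraps a service call and logs its execution time (the "profiling interceptor").
    static <T> T profile(String serviceName, Supplier<T> call,
                         ConcurrentLinkedQueue<Long> log) {
        long start = System.nanoTime();
        try {
            return call.get();
        } finally {
            log.add((System.nanoTime() - start) / 1_000_000); // milliseconds
        }
    }

    public static void main(String[] args) throws Exception {
        ConcurrentLinkedQueue<Long> samples = new ConcurrentLinkedQueue<>();

        // Evidence channel: every 5 seconds, push whatever was collected to the agent.
        ScheduledExecutorService pusher = Executors.newSingleThreadScheduledExecutor();
        pusher.scheduleAtFixedRate(() -> {
            List<Long> batch = new ArrayList<>();
            for (Long sample; (sample = samples.poll()) != null; ) {
                batch.add(sample);
            }
            System.out.println("push to agent: " + batch + " ms"); // stand-in for a real report
        }, 5, 5, TimeUnit.SECONDS);

        // Simulated invocations of one service, routed through the interceptor.
        for (int i = 0; i < 3; i++) {
            String reply = profile("PrinterQuote", () -> {
                try { Thread.sleep(200); } catch (InterruptedException e) { }
                return "quote";
            }, samples);
            System.out.println("service returned: " + reply);
        }

        Thread.sleep(6_000); // let one push interval elapse, then stop
        pusher.shutdown();
    }
}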
Figure 3. Accountable services deployment and reconfiguration flow. Users submit requests for a business process and the end-to-end quality-of-service (QoS) requirements. The QoS broker then composes the service network for deployment. The Llama middleware, in turn, configures the diagnosis and recovery environment. During service process executions, fault detection, diagnosis, and recovery are continuously conducted to ensure process performance. Services' reputations are also recorded in a database for future reference.

In addition to these components, the Llama project also implements a QoS broker,7 which offers QoS-based service selection, to assist the service requester in fulfilling end-to-end QoS requirements during a BP composition. Furthermore, we're designing reputation network brokers that will help evaluate, aggregate, and manage services' reputations (that is, performance history). A service's reputation is a QoS parameter that affects the BP composition; users are more likely to select services with better reputations.
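To illustrate how reputation can enter QoS-based selection, the sketch below scores provider candidates with a weighted sum of normalized QoS attributes. It is not the Llama QoS broker's end-to-end selection algorithm (reference 7); the candidate names, numbers, and weights are assumptions chosen for exposition.

import java.util.Comparator;
import java.util.List;

// Illustrative sketch only; not the actual Llama QoS broker. It shows the basic idea
// of scoring provider candidates on QoS attributes that include reputation.
public class CandidateScoring {

    record Candidate(String name, double responseTimeMs, double cost, double reputation) {}

    public static void main(String[] args) {
        List<Candidate> printers = List.of(
                new Candidate("PrinterA", 400, 0.05, 0.92),
                new Candidate("PrinterB", 250, 0.09, 0.75),
                new Candidate("PrinterC", 320, 0.07, 0.98));

        // Lower response time and cost are better; higher reputation is better.
        double wTime = 0.4, wCost = 0.3, wRep = 0.3;
        double maxTime = printers.stream().mapToDouble(Candidate::responseTimeMs).max().orElse(1);
        double maxCost = printers.stream().mapToDouble(Candidate::cost).max().orElse(1);

        Candidate best = printers.stream()
                .max(Comparator.comparingDouble((Candidate c) ->
                        wTime * (1 - c.responseTimeMs() / maxTime)
                      + wCost * (1 - c.cost() / maxCost)
                      + wRep * c.reputation()))
                .orElseThrow();

        System.out.println("selected: " + best.name());
    }
}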
Deploying Accountable SOA and Fault Recovery

Figure 3 shows the steps of the accountable SOA configuration and deployment process. Users (with help from the QoS broker) first compose the BP they wish to execute, based on QoS requirements for the process. The QoS broker also automatically generates a backup service path for each selected service in the process for fault-tolerance reasons. The backup path can be as simple as another service that replaces the current one when it's no longer available, or as complex as a new subprocess going from the service's predecessor to the end of the complete service process. Algorithms for finding the backup path appear elsewhere.6
The current AA implementation produces a Bayesian network for the service process based on the process graph as well as both historical and expected service performance data.8 The AA then runs the evidence channel selection algorithm to yield the best locations for collecting execution status about the process. It also selects and deploys monitoring agents that can best manage the services in the BP.5 In addition, the AA configures the hosts of the selected evidence channels so that they're ready to send monitored data at regular intervals to responsible agents. Once the process starts to execute, the ASBs will collect runtime status about services and the process from the evidence channels and deliver it to agents. If an agent detects unexpected performance, it will inform the AA to trigger fault diagnosis. The AA's diagnosis engine should produce a list of likely faulty services. For each potentially faulty service, the AA asks its monitoring agent to check the service's execution data, located in the ASB's log (see Figure 2). Those data might confirm whether a service has a fault. When the AA finally identifies a faulty service, it will initiate service recovery by first deploying the backup path. In cases in which the predefined backup path isn't suitable for the detected problem (for example, there are multiple service faults), the AA will ask the QoS broker to produce a new backup path
or even a new BP for reconfiguration. The AA keeps the diagnosis result in a service-reputation database to disclose the likelihood of the service having a fault, along with the context information about it. Such information is valuable to the QoS broker because it can indicate that the service might be error-prone in some specific context.
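As a minimal sketch of the kind of probabilistic root-cause ranking described here, the following example ranks candidate services given one piece of agent-reported evidence. It is not Llama's Genie/Smile-based diagnosis engine; it uses a simple noisy-OR model with brute-force enumeration, and the service names, priors, and probabilities are assumptions chosen for illustration.

// Illustrative sketch only; not Llama's Genie/Smile-based diagnosis engine. A single
// piece of evidence ("the end-to-end response was slow") is explained by a noisy-OR
// over possible service faults; priors and probabilities are assumed values.
public class FaultRanking {

    static final String[] SERVICES = {"PrinterQuote", "AddressQuote", "BulkMailQuote"};
    static final double[] PRIOR   = {0.02, 0.05, 0.03};   // P(service is faulty)
    static final double CAUSE = 0.9;    // P(delay | that service is faulty)
    static final double LEAK  = 0.01;   // P(delay | no modeled fault)

    public static void main(String[] args) {
        boolean delayObserved = true;          // evidence reported by an agent
        int n = SERVICES.length;
        double[] posterior = new double[n];
        double normalizer = 0.0;

        // Enumerate every fault configuration and weight it by its prior
        // probability times the likelihood of the observed evidence.
        for (int mask = 0; mask < (1 << n); mask++) {
            double prior = 1.0, pNoDelay = 1.0 - LEAK;
            for (int i = 0; i < n; i++) {
                boolean faulty = (mask & (1 << i)) != 0;
                prior *= faulty ? PRIOR[i] : 1.0 - PRIOR[i];
                if (faulty) pNoDelay *= 1.0 - CAUSE;      // noisy-OR combination
            }
            double likelihood = delayObserved ? 1.0 - pNoDelay : pNoDelay;
            double weight = prior * likelihood;
            normalizer += weight;
            for (int i = 0; i < n; i++)
                if ((mask & (1 << i)) != 0) posterior[i] += weight;
        }

        // Report P(fault | evidence); the AA would check the top candidates first.
        for (int i = 0; i < n; i++)
            System.out.printf("%-13s P(fault | slow) = %.3f%n",
                              SERVICES[i], posterior[i] / normalizer);
    }
}

In Llama the equivalent posteriors come from the Bayesian network built over the whole process graph, but the principle is the same: rank candidate faults by how well they explain the evidence the agents reported.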
Transparency and Service Provider Participation

We designed the Llama framework to help enterprises pinpoint responsible parties when BP failures occur. To achieve this, service provider transparency is not only critical to the user but also provides important input for the agent. However, third-party service providers have a right to decide the trade-offs between transparency on one hand and privacy and security on the other. To participate in the accountability framework, external service providers might install the Llama ASB to keep an audit trail locally for their services. Optionally, they can let the ASB push performance data to agents in real time if they want to give users an added level of transparency. Agents are themselves standalone services that service clients, service providers, or other third-party providers can all deploy. In our design, the AA will select agents to efficiently and scalably report data about services that belong to a particular BP. Providers of "healthy" services will benefit because the reported performance data can clear them of any failure responsibility. Stated differently, we believe transparency is more valuable than privacy to most service providers. On the other hand, some service providers might not be willing to open up their execution status completely.

Llama makes cooperation from service providers easy by letting them choose among various levels of transparency. Simple auditing requires the service provider to install only the ASB layer for its services, thus activating data collection. However, this data is stored locally, and an authorized agent gives it to the AA only when requested in the diagnosis process. Dynamic monitoring requires ASB installation and also allows dynamic monitoring of services via deployed agents the service provider installs. Deployed agents need only conform to a standard interface, so service providers can use their own agent implementations to participate in diagnosis. Dynamic third-party monitoring is similar to the previous level except that third-party "collateral" agents collect and process the data. Given that external agents produce monitored data in the latter two levels, the diagnosis process must be able to reason about the likelihood of incorrect or incomplete data. Techniques for privacy-preserving data reporting might help overcome this potential problem, which is an interesting topic for future research.
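One way to picture these participation levels is as a simple policy check on the provider side. The sketch below is an assumption made for illustration, not part of Llama; it only encodes the three levels described above and how each might gate what leaves the provider's local ASB.

// Illustrative sketch only; these types are not part of Llama.
public class TransparencyPolicy {

    enum Level { SIMPLE_AUDITING, DYNAMIC_MONITORING, DYNAMIC_THIRD_PARTY_MONITORING }

    // May profiling data be pushed to agents in real time, or only kept as a local audit trail?
    static boolean allowsRealTimePush(Level level) {
        return level != Level.SIMPLE_AUDITING;
    }

    // May this kind of agent take part? Third-party ("collateral") agents need the highest level.
    static boolean allowsAgent(Level level, boolean agentIsThirdParty) {
        return !agentIsThirdParty || level == Level.DYNAMIC_THIRD_PARTY_MONITORING;
    }

    public static void main(String[] args) {
        Level level = Level.SIMPLE_AUDITING;
        System.out.println("real-time push allowed: " + allowsRealTimePush(level));
        System.out.println("third-party agent allowed: " + allowsAgent(level, true));
    }
}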
Implementation and Performance
We implemented a Llama prototype by building the ASB as an extension of the Mule ESB (http://mule.mulesource.org) and constructing the AA and agents. We use Genie/Smile (Structural Modeling, Inference, and Learning Engine; http://genie.sis.pitt.edu) as the Bayesian network reasoning engine and the Apache Orchestration Director Engine (ODE; http://ode.apache.org) as the BPEL engine.

We collect two types of data during these experiments. First, we use Mule's interception framework to create profiling interceptors to collect invocation timestamps and execution times for each service. Second, the ASB collects CPU utilization data at 5-second intervals (using the top Linux command) for the host on which it's deployed. The ASB log stores all collected data, which can be pushed to and queried by agents performing error investigation. These strategies provide a simple way to demonstrate how much overhead data collection might impose on the prototype and how quickly and accurately it responds to detected problems.

We used four different hosts for the experiments described here:

• Host 1: 1-GHz Pentium III/256-Kbyte L2 cache/512 Mbytes RAM.
• Host 2: 3.20-GHz Xeon/2-Mbyte L2 cache/3 Gbytes RAM.
• Host 3: 1.34-GHz UltraSPARC/1-Mbyte L2 cache/512 Mbytes RAM.
• Host 4: 1.73-GHz Pentium M/2-Mbyte L2 cache/1.24 Gbytes RAM.

For multihost experiments, the hosts are connected via a 100-Mbps LAN. We used an example BP to test the Llama prototype: a process for targeted mass-mail advertising referred to as the print and mail process (see Figure 4). With this BP, for example, a plumber can send fliers to all addresses in
a certain zip code in which the houses are older than 20 years. Each circle in the graph represents a service the BPEL engine invokes. Some nodes have multiple circles to represent services that competing vendors provide. This example has three each of printer vendors, address providers, and bulk mailers. We chose six services to serve as evidence channels.

Figure 4. The print and mail business process (BP). The BP shows the flow of a mass-mail advertising task. The BP has a total of 13 services. Each node is a service to be invoked in the BP. Nodes with multiple circles have several provider candidates that can be selected; parallel branches are services that can be invoked concurrently.

Table 1. Monitoring overhead (average of 30 runs for each test).

Monitoring node                          One host    Overhead   Two hosts   Overhead   Three hosts   Overhead
None                                     16,212 ms   –          16,537 ms   –          16,714 ms     –
At accountability service bus (ASB)      16,515 ms   303 ms     16,747 ms   210 ms     16,827 ms     113 ms
To agents                                16,716 ms   504 ms     16,913 ms   376 ms     17,130 ms     416 ms
Monitoring Overhead

We deployed all 13 services and two agents for this experiment. We set up three test cases: the first executed the process with all profiling interceptors turned off, ensuring that there would be no data collection overhead. The second executed the process with all profiling interceptors activated, ensuring that all data collection was taking place. The third used the same scenario as the second, except that we also configured the six evidence channels, which pushed data to agents at 5-second intervals. Moreover, we ran each case under three scenarios:

• all services and agents are located on the same host (host 1);
• five services (two evidence channels) are located on the client machine (host 1) and eight services (four evidence channels) are located on a remote server (host 2); or
• three services are located on the client machine (host 1), six services (four evidence channels) are located on one remote server (host 2), and four services (two evidence channels) are located on a second remote server (host 3).

Table 1 shows the results, including the execution time for the complete BP. In row 1, the monitoring is turned off; in row 2, it's collected only on the ASB; and in row 3, the ASB sends the data to agents. These results indicate that within each scenario, data collection can add roughly 100 to 300 milliseconds of overhead. Pushing data from evidence channels to agents, however, can create up to 300 milliseconds of additional overhead, depending on the degree of distribution.
Llama Diagnosis Accuracy

Let's examine the diagnosis overhead when a BP has an unacceptably long delay according to specified QoS requirements.

Scenario 1: Service delays in a single-host environment. To simulate service response time problems, we deployed all services in a single-host environment (on host 4). We set each service's acceptable response time to three seconds longer than its worst-case response time. We injected long busy-wait delays into selected services in the BP to simulate response time faults. We tested the Bayesian network diagnosis
for service performance problems, as Table 2 shows. We found that if we injected delays into only one or two services, the diagnosis engine found all root causes in four or five diagnosis rounds. We also tried to inject delays into four services, which is a very high error rate for a BP of only 13 services. In such a case, diagnosis can take as many as nine rounds.

Table 2. Bayesian network diagnosis performance in a single-host environment.

Test case             1    2    3    4    5    6    7    8    9    10
Diagnosis time (sec)  0.6  0.3  0.8  1    1.2  1.2  1.8  1.6  2.2  2
Diagnosis round       1    1    3    4    4    5    5    4    6    9
Fault injected        1    1    1    1    2    2    2    2    4    4

Scenario 2: Host problems in a three-host environment. We also simulated a scenario for detecting a problematic host for some services within a BP. To construct this scenario, we deployed the services on three separate hosts, with host 1 serving as the client machine executing the BPEL engine and a proportional number of services deployed on all three hosts (1 through 3). When executing the BP, we injected a large CPU load (90 percent utilization) into host 3. We expected that the services deployed on this host would then exhibit response time delays. The diagnosis engine should have first noticed these delays and then, when investigating their source, determined that the host (CPU) was at fault.

Table 3. Host diagnosis performance in a three-host environment.

Services on problematic host   1          2          3
Used as evidence channel?      No   Yes   No   Yes   No   Yes
Diagnosis time (sec)           0.3  0.2   0.7  0.5   1    0.1
Diagnosis round                2    1     3    1     2    1

Table 3 shows Llama's performance in identifying host errors based on the number of BP services located on that host and whether one of those services is an evidence channel (the "Used as evidence channel?" row). We wanted to determine Llama's performance in finding the first faulty service. From the results, we see that an evidence channel can cut down both the number of diagnosis rounds and the time required to find the first faulty service. However, the more services that exist on a faulty host, the more time and diagnosis rounds will be required to attribute all service problems to the host.
Our current Bayesian-network-based prototype of Llama produces encouraging results, but we plan to further develop Llama's configuration capabilities by leveraging existing diagnosis and monitoring models. We will investigate service composition algorithms that take diagnosis and accountability into consideration. Finally, we hope to make the Llama ASB compatible with other integration platforms and ESBs.

References
1. M.P. Papazoglou et al., "Service-Oriented Computing: State of the Art and Research Challenges," Computer, vol. 40, no. 11, 2007, pp. 38–45.
2. A. Avižienis et al., "Basic Concepts and Taxonomy of Dependable and Secure Computing," IEEE Trans. Dependable and Secure Computing, vol. 1, no. 1, 2004, pp. 11–33.
3. H. Nissenbaum, "Computing and Accountability," Comm. ACM, vol. 37, no. 1, 1994, pp. 72–80.
4. K.B. Korb and A.E. Nicholson, Bayesian Artificial Intelligence, Chapman & Hall, 2004.
5. Y. Zhang, M. Panahi, and K.J. Lin, "Service Process Composition with QoS and Monitoring Agent Cost Parameters," Proc. IEEE Joint Conf. E-Commerce Technology (CEC 08) and Enterprise Computing (EEE 08), IEEE Press, 2008, pp. 311–316.
6. T. Yu and K.J. Lin, "Adaptive Algorithms for Finding Replacement Services in Autonomic Distributed Business Processes," Proc. 7th Int'l Symp. Autonomous Decentralized Systems, IEEE Press, 2005, pp. 427–434.
7. T. Yu, Y. Zhang, and K.J. Lin, "Efficient Algorithms for Web Services Selection with End-to-End QoS Constraints," ACM Trans. Web, vol. 1, no. 1, 2007, article no. 6.
8. Y. Zhang, K.J. Lin, and J.Y. Hsu, "Accountability Monitoring and Reasoning in Service-Oriented Architectures," J. Service-Oriented Computing and Applications, vol. 1, no. 1, 2007, pp. 35–50.
9. I. Cohen et al., "Correlating Instrumentation Data to System States: A Building Block for Automated Diagnosis and Control," Proc. 6th Conf. Operating Systems Design & Implementation (OSDI 04), Usenix Assoc., 2004, pp. 231–244.
10. M.Y. Chen et al., "Path-Based Failure and Evolution Management," Proc. 1st Conf. Networked Systems Design and Implementation (NSDI 04), Usenix Assoc., 2004, pp. 309–322.

Kwei-Jay Lin is a professor in the Department of Electrical Engineering and Computer Science at the University of California, Irvine. His research interests include service-oriented architectures, e-commerce technology, and real-time systems. Lin has a PhD in computer science from the University of Maryland at College Park. He is a senior member of the IEEE. Contact him at [email protected].

Mark Panahi is a PhD candidate in the Department of Electrical Engineering and Computer Science at the University of California, Irvine. His research interests include real-time performance for service-oriented infrastructures. Contact him at [email protected].

Yue Zhang is a researcher and software development engineer on the Microsoft Windows Azure team. Her research interests include quality-of-service management and improvement in cloud computing environments and Internet-scale data storage systems. Contact her at [email protected].

Jing Zhang is a PhD student in the Department of Electrical Engineering and Computer Science at the University of California, Irvine. Her current research interests include service management and quality of service in service-oriented computing. Contact her at [email protected].

Soo-Ho Chang is a postdoctoral researcher in the Department of Electrical Engineering and Computer Science at the University of California, Irvine. Her research interests include service design and management for service-oriented architectures. Contact her at [email protected].