Verification of Non-functional Properties of Cloud-Based Distributed System Services
Kaliappa Ravindran and Arun Adiththan
Department of Computer Science City Univ. of New York (CUNY – City College) 160 Convent Avenue, New York, NY 10031, USA
[email protected];
[email protected]
ABSTRACT
For distributed system services implemented on a cloud, system verification assumes added importance because of third-party control of cloud resources and the attendant problems of faults, QoS degradations, and security violations. Our paper focuses on a model-based assessment to reason about the non-functional properties of a cloud-based distributed system using observational agents. Our approach is corroborated by measurements on system-level prototypes and simulation analysis of system models in the face of hostile environment conditions. A case study of a CDN realized on cloud infrastructures is also described.
Categories and Subject Descriptors C.2.1 [Computer Systems Organization]: Networked systems, Sensors; M.4 [Service-oriented Architecture]: Service quality, Reliability.
General Terms Model-based system diagnosis
Keywords System testing & certification, Model-aided system simulation, Service-level compliance checks, Probabilistic system guarantees, heuristics for scenario generation.
1. INTRODUCTION

The goal of our paper is to develop model-based engineering techniques to assess the capability of a cloud-based system S in terms of its non-functional properties. Analyzing the capability of S involves verifying that the safety requirements are met as S strives to meet its QoS objectives under various external environment conditions incident on the underlying cloud components.
We consider an application running on top of the computational and communication services realized over one or more cloud infrastructures. The system as a whole implements a core functionality, with augmentations from the service provider to support a variety of para-functional behaviors. For instance, data replication and content distribution may be offered as core services that are often associated with, say, performance, security, and timeliness attributes. Here, the problem of system verification (i.e., reasoning about whether a system behaves in the way it is supposed to) has become important because of the third-party control of cloud resources and the attendant issues of fault-handling, security, maintenance, availability, and the like.

The verification of services from a cloud-based distributed system S involves determining how well S meets its intended QoS (quality of service) objectives under uncontrolled external environment conditions incident on S. Say, for example, S is a content distribution system that advertises content delivery to clients within 5 sec of a request (e.g., news). Suppose S achieves the best QoS of a 5 sec guarantee only with a probability of 0.4, and achieves a latency distributed between 5 and 15 sec in other cases. Under simplifying assumptions, the capability of S in meeting the latency specs is estimated as 0.7 on a normalized scale [0, 1]. If the content storage/delivery backlog becomes severe, the 5 sec latency is less sustainable (assuming that the other parameters of the storage/delivery mechanism do not change) — and hence S is now much less than 70%-capable in meeting the latency specs. The system capability may however be enhanced by installing additional proxy server nodes along the content distribution topology. Concomitant with this notion of non-functional goals is a safety aspect, namely, that S should not traverse into unsafe situations while meeting its objectives: say, in this example, the client connectivity to a content getting disrupted.
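The 0.7 figure above can be reproduced under one simple scoring model, which we assume here purely for illustration (the paper does not prescribe it): a request served within the 5 sec spec earns full credit, while a latency uniformly distributed between 5 and 15 sec earns linearly decaying partial credit. A minimal Python sketch of that estimate:

```python
import random

def latency_credit(latency_sec: float, spec: float = 5.0, worst: float = 15.0) -> float:
    """Full credit within the 5 sec spec; credit decays linearly to 0 at 15 sec."""
    if latency_sec <= spec:
        return 1.0
    return max(0.0, (worst - latency_sec) / (worst - spec))

def estimate_capability(num_requests: int = 100_000) -> float:
    """Monte-Carlo estimate of the normalized capability of S on a [0, 1] scale."""
    total = 0.0
    for _ in range(num_requests):
        if random.random() < 0.4:              # best case: 5 sec guarantee met
            total += 1.0
        else:                                  # otherwise latency ~ Uniform(5, 15)
            total += latency_credit(random.uniform(5.0, 15.0))
    return total / num_requests

print(round(estimate_capability(), 2))          # ~0.7 under these assumptions
```

Under this assumed scoring rule the expected score is 0.4 + 0.6 × 0.5 = 0.7, matching the normalized capability quoted above.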
The QoS feature of S depicts an ability of S to control its performance in response to an underlying infrastructure resource allocation or a change in the external environment conditions. The QoS-to-resource mapping relationship must be established in a quantitative manner under specific environment conditions, in order to meet the performance objectives in a predictable way. An example is the determination of content delivery latency over a distribution network set up on a geographically spread-out cloud of content storage nodes, in the presence of node failures. Here, a para-functional goal is to reduce the latency jitter by resorting to content caching techniques, thereby assuring a stable system-level behavior. The mapping between the output of S and the platform resources should be known with reasonable accuracy: either as a closed-form model of S or through a series of incremental allocate-and-see invocations on S [1, 2].
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from
[email protected].
AST’14, May 31 – June 1, 2014, Hyderabad, India Copyright 2014 ACM 978-1-4503-2858-6/14/05...$15.00 http://dx.doi.org/10.1145/2593501.2593508
[Figure 1 (caption below): a cloud-based application stacked over service-oriented algorithms and cloud infrastructure components (VMs, network links, data storage, compute servers, ...), separated by a resource/component virtualization interface, and modeled as a black-box computational function g*(I, O*, s*, E*) with input I, output O*, internal state s*, and a parametric representation E* of the external environment conditions (e.g., errors, threats, outages) incident on the software/hardware system. The distributed system S exports a service through a service interface, with a mapping between internal-state and interface events; adaptation logic Ap′ maps the specs of desired QoS (supplied as system input Pref) to the final QoS actually achieved (observed as system output P′). An external management entity H treats the observed system as a composition of Ap′ and g*(..), and reasons about system dependability by observing the system behavior.]
S interacts with its (hidden) external environment through the core elements g*(···): e.g., responding to client queries on a web server, and delivering content over a network transport connection. Here, the meta-level signal flows between Ap′ and g*(···) are visible to H. The layered software structure of S intrinsic to cloud-based systems — viz., the infrastructure, service-level algorithms, and adaptive application, stacked in that hierarchy and separated across well-defined interfaces — lends itself well to dependability analysis by H. The dependability of S may be quantified, with suitable metrics (for certification and control purposes), by analyzing the external state-machine-level signal flows. Basically, H reasons about the output tracking error |Pref − P′| under various environment conditions, and maps it onto a measure of the capability of S. Say, H is a cloud management station, employing a dashboard-based supervisory controller. The paper embarks on a case study of cloud-based content distribution networks (CDN), with a focus on system-level capability assessment. The paper is organized as follows. Section 2 provides a model-based approach to verifying the non-functional properties (e.g., QoS) of a cloud-based system. Section 3 illustrates simulation-based testing methods to assess the system properties. Section 4 describes a study of CDN from the standpoint of the QoS testing issues in cloud-based realizations. Section 5 concludes the paper.
2. WHAT TO TEST IN CLOUD SYSTEMS?
Suppose G depicts the goal to be met by a cloud-based distributed system S. The goal G may include a prescription of one or more non-functional attributes associated with the QoS delivered by S to an application: such as resilience to external disturbances, stability against resource fluctuations, and responsiveness to user-triggered requirements. The capability of S is prescribed in terms of such non-functional attributes — and is hence distinct from the correctness goal of S, which is a functional attribute (yielding a YES/NO result). Here, the capability of S is a measure of how well the adaptation functions programmed into S adjust to the changing external environment conditions in meeting an application-level goal G specified for S.
Figure 1: Adaptation processes in a network system
The domain-specific core adaptation function in a cloud-based system S is viewed as a control-theoretic feedback loop acting on a reference input Pref: say, for cloud resource management. Figure 1 concretizes this view. The controller C generates its actions I based on a computational model of S, denoted as g(I, O, s, E) — where O is the plant output in response to the trigger I, s is the plant state prior to the incidence of I, and E depicts the environment condition. Since the true plant model g*(I, O*, s*, E*) is not completely known, C refines its input action in the next iteration based on the deviation of the observed output O* from the expected output O when action I occurs. Upon S reaching a steady state (over multiple control iterations) with output P′, the tracking error |Pref − P′| is analyzed to reason about the system capability. Our approach is guided by the concepts and taxonomy of dependable computing [3].
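A minimal sketch of this allocate-and-observe loop is given below, under assumed names and a toy plant response (none of which are from the paper): the controller holds an approximate gain model g, observes the achieved QoS, and refines both the model and the allocation in each iteration.

```python
# Hypothetical sketch of the controller C of Figure 1: an approximate model
# (QoS gain per unit resource) is used to pick a resource allocation I, the
# achieved QoS is observed, and both model and allocation are refined per round.

def true_plant(allocation: float, environment_load: float) -> float:
    """Stand-in for the unknown g*(I, O*, s*, E*): the real QoS response."""
    return 0.8 * allocation / (1.0 + environment_load)

def run_control_loop(p_ref: float, environment_load: float, iterations: int = 20):
    gain_estimate = 1.0                      # initial approximate model g: QoS ~ gain * allocation
    allocation = p_ref / gain_estimate
    for _ in range(iterations):
        observed = true_plant(allocation, environment_load)     # O*
        predicted = gain_estimate * allocation                  # O from model g
        gain_estimate *= observed / predicted                   # refine model from modeling error
        allocation = p_ref / gain_estimate                      # re-plan the input action I
    tracking_error = abs(p_ref - true_plant(allocation, environment_load))
    return allocation, tracking_error

alloc, err = run_control_loop(p_ref=0.9, environment_load=0.5)
print(f"allocation={alloc:.2f}, |Pref - P'|={err:.4f}")
```

The iterative refinement mirrors the text's point that, because the true plant model is unknown, the controller converges on the reference QoS over multiple control rounds rather than in a single step.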
2.1 Para-functional Attributes
The system S that exports a service is often realized on a distributed cloud infrastructure, with the cloud resources and components geographically spread out and often under different levels of administrative control. We view the system S as a software-oriented composition of two things: (1) a set of algorithm and software elements RPP that implement the core functionality of S; and (2) a set of software processes Ap that are wrapped around the RPP to infuse the adaptation needed in S when reacting to hostile environment conditions E* incident on the RPP.
An external management entity H views the system S as supporting adaptation processes Ap′ wrapped around a core system g*(···), i.e., Ap′ ⊗ g*(···) — where '⊗' denotes the composition in an object-oriented software view. Ap′ is embodied in a distributed agent-based software module that forms the building block to structure S (as architected in [4]).
A functional requirement FUNC(S) pertains to the localized effects of system S, i.e., FUNC(S) affects only the core part of S addressing the functionality defined by the requirement. Para-functional requirements PARA(S), in contrast, prescribe how well S operates in the environment where it is embedded, while meeting the correctness specs FUNC(S).
[Figure 2 (caption below): an application-level user supplies the input reference Pref to the cloud-based distributed system S, whose core-functional elements (algorithms, physical resources, ...) and adaptation processes Ap produce the actual output Pact under the incidence of the uncontrolled environment E* (attacks, failures, outages); a system verifier H monitors the signal flows, evaluates the functional behavior [FUNC(S), Pref, Pact] and the para-functional behavior [PARA(S), Pref, Pact, E], and grades S on a scale {S is 100%-good, S is 90%-good, ..., S is bad}. Axioms of system observation, where conf(H) denotes the confidence of the designer in the verifier H and V_out is the verdict reported by H: [quality(S)=good] AND [V_out=good] ⇒ [conf(H)=high]; [quality(S)=good] AND [V_out=bad] ⇒ [conf(H)=low]; [quality(S)=bad] AND [V_out=bad] ⇒ [conf(H)=high]; [quality(S)=bad] AND [V_out=good] ⇒ [conf(H)=low]; [quality(S)=80%-good] AND [V_out=90%-good] ⇒ [conf(H)=medium].]
Figure 2: Management view of system verification
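One way to operationalize the observation axioms listed with Figure 2 is to treat graded qualities as fractions in [0, 1] and derive the designer's confidence from the agreement between the true system quality and the verifier's verdict. The sketch below is illustrative only; the thresholds 0.95 and 0.80 are assumed values, not taken from the paper.

```python
# Illustrative encoding of Figure 2's axioms of system observation as a
# confidence rating for the verifier H.

def confidence_in_verifier(quality_s: float, v_out: float) -> str:
    """Designer's confidence in verifier H, given the true quality of S and the
    quality reported by H (both normalized to [0, 1])."""
    agreement = 1.0 - abs(quality_s - v_out)   # 1.0 = perfect agreement
    if agreement >= 0.95:
        return "high"      # e.g., quality good and verdict good, or bad and bad
    elif agreement >= 0.80:
        return "medium"    # e.g., an 80%-good system assessed as 90%-good
    else:
        return "low"       # verifier contradicts the actual system quality

if __name__ == "__main__":
    print(confidence_in_verifier(1.0, 1.0))   # high
    print(confidence_in_verifier(1.0, 0.0))   # low
    print(confidence_in_verifier(0.8, 0.9))   # medium
```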
Both FUNC(S) and PARA(S) are associated with the output behavior of S, as exported to an application through a service interface. See Figure 2. The application-level goal G is an instantiation of PARA(S) that specifies the constraints to be satisfied by S as non-functional objectives, say: 90% resilience against component failures, 95% stability against resource fluctuations, and 85% responsiveness to user-level performance specs. Such an assessment of the capability of S is distinct from a verification of the correct behavior of S, as captured in FUNC(S) — the latter is satisfied always.¹
[Figure 3 (caption below): the CDN distribution topology rooted at the content server R, which hosts pages p-a and p-b; proxy-capable nodes (u, v, x, y, z, q, ...) interconnect the server with clients 1, 2, and 3 over local access links, with nodes acting as content push/pull-capable proxy sites, content-forwarding proxy sites, or idle proxy sites; clients issue pull(p-a)/pull(p-b) requests through their serving proxies, U({x}) denotes an update (push) message for pages {x}, and the uncontrolled environment E* (failures, outages) is incident on the topology.]
Figure 3: A management view of CDN
As an example, consider a content distribution network (CDN) that allows the delivery of a content maintained at a server R (e.g., a satellite image) to various clients over a geographically spread-out network topology. See Figure 3. To meet the latency constraints and fault-tolerance needs of a content p, one or more nodes in the distribution topology function as proxies for R, maintaining a local copy of p so that clients pull p from a proxy node in their geographic vicinity. A service-layer algorithm places proxy nodes in the distribution tree to store contents, whereupon a combination of 'push' and 'pull' strategies is employed to deliver content to the clients [14]. Here, a functionality check of CDN, i.e., correctness verification of FUNC(CDN), involves asserting that an up-to-date p is accessible to every client despite failures of links/nodes in the distribution topology and content changes occurring at R. Such a verification exercise deals with the core functional elements of CDN: namely, the dynamic placement of proxies, link/node failure detection, and the push/pull algorithm to update local copies of p. An assessment of the para-functional attributes PARA(CDN), on the other hand, involves analyzing how often the latency in the delivery of p is within a prescribed limit, the latency jitter due to changes in p and node/link failures, and the per-pull overhead of delivering p to a client. The analysis involves reasoning about how the proxy placement strategy and the push/pull algorithm features (such as update-on-demand and content timestamping) impact the content delivery latency and overhead.
Given a distributed software system S realized on a cloud, we determine the ability of S to control its service behavior in response to an underlying infrastructure component allocation or a change in the external environment conditions. Our work is about assessing how well S meets its para-functional goals PARA(S) specified at the service-interface level, in relation to the underlying service-layer algorithm and infrastructure and the external environment parameters.
2.2 Current Methods for System Verification
Existing methods for the verification of a distributed system are in the limited space of communication protocols and algorithms, covering the technical areas of protocol validation, verification, and testing. Network protocols and algorithms are modeled as terminating finite-state machines (FSM) — and hence are amenable to an automated verification as to whether a candidate protocol or algorithm is correct or otherwise [6]. These methods pertain largely to the protocols and algorithms embodied as core functionalities in a distributed system, but do not work well when analyzing the non-functional objectives of a system that embodies both adaptation behaviors and functional behaviors [7, 8]. This is because system-of-systems types of communication networks (e.g., an airborne network for surveillance purposes) often have a non-FSM representation and Turing-machine-like behavior. Accordingly, a probabilistic approach is needed to reason about system-level compliance with non-functional goals. Here, a non-functional attribute deals with adjusting the system operations according to the environment conditions (e.g., increasing the number of server replicas to improve web service performance).²
¹ [5] describes a language to capture the non-functional goals PARA(S).
² Existing works on model-based diagnosis of hardware systems that employ AI approaches (such as [9, 10]) are more towards analyzing fault trees.
There have also been system-level tools developed to aid system verification and testing: such as controlled fault-injection [11] and plug-in-based model-solvers [12]. Chisel [13] separates the core functionality of a system from its non-functional behaviors, so that the latter can be tested in the face of unanticipated external events (e.g., how resilient is a CDN against the failure of a node in the distribution path). In the above light, our work is on a model-based verification of the behavior of complex network systems using probabilistic techniques.
2.3 Machine Intelligence in System Assessment

[Figure 4 (caption below): the model-based controller C computes an input action I (e.g., resource allocation, component configuration) from the QoS specs Pref; the observed cloud-based distributed system X — a service-level algorithm A scheduling task execution over nodes v1, v2, ..., vN — behaves as g*(I, O*, E*, s*) under the actual traffic of task requests and the external environment conditions E*; a model workload generator driven by a script of task-flow specs, together with a system model solver g(I, O″, E, s) (a closed-form description of X, with E a computational view of E*), produces the model-estimated output O″; C is driven by the modeling error [O″ − O*] and the tracking error [Pref − Pact], where Pact is the steady-state value of O*.]
Figure 4: Schema to infuse machine-intelligence
The structure of a distributed system S consists of an intelligent core module Ap that realizes the domain-specific adaptation functionality of S exercised on the physical environment processes (RPP). In a compositional view, S ≡ Ap ⊕ RPP — where ⊕ is the composition operator. Ap is embodied in a distributed agent-based software module that realizes a model-based controller C to compute the input actions I exercised on the RPP. Figure 4 shows the model-based structure of S.
The controller C implements a partially observable Markov decision process (PO-MDP) to compute the input action I, based on the observed modeling error (O* − O″) and the tracking error (Pref − P′). The interleaving of a closed-form analysis with the piece-wise linearity of control-theoretic error correction (using system observation-based feedback) over multiple steps allows us to solve the otherwise-intractable problem of verifying complex distributed systems.

Ap is more than a collection of network components RPP; rather, it consists of a software wrapper that controls the RPP in such a way as to infuse a self-contained and intelligent behavior. The behavior of Ap is observed by a management module H that reasons about the overall operations of Ap. For instance, H may be a network management station employing a dashboard-based supervisory controller. The domain-specific functions of the management entity H are embedded in the stub software interfaces to Ap, with the domain-neutral functions of H residing external to S (for compliance testing and verification purposes).
The controller C employs a computational model of the RPP, denoted as g(I, O, E, s), to determine the actions I needed under the current operating point [O, E, s, ε] — where (O, s) depicts the current (output, state) of the RPP, E ⊂ E* is the (partially) observable external environment, and ε is the output tracking error (i.e., the difference between the desired and actual outputs of the RPP). Since C does not know the true model of S — denoted as g*(I, O*, E*, s*) — C employs machine-learning techniques for model identification and refinement based on an initial model g(I, O, E, s). The latter is a closed-form mathematical formula or computational procedure that expresses the I/O relations of the RPP in an approximate way. The output tracking error (Pref − P′) is a measure of how well S meets its QoS-oriented goals.
H observes the output tracking error (Pref − Pact) to reason about the capability of Ap under the plant controller C. Aided by management signaling, H exercises Ap to realize the output monitoring and reasoning, and to assess the system-level capability therein.
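As one concrete, hedged illustration of how H might turn a series of observed tracking errors into a capability grade (the tolerance and scoring rule below are assumptions, not taken from the paper):

```python
# Illustrative sketch: the management entity H grades the capability of Ap from
# the steady-state tracking errors observed over test runs under different
# environment conditions E*.

def capability_score(runs, tolerance: float = 0.05) -> float:
    """runs: list of (p_ref, p_act) pairs; returns the fraction of runs whose
    relative tracking error |Pref - Pact| / Pref stays within the tolerance."""
    ok = sum(1 for p_ref, p_act in runs if abs(p_ref - p_act) / p_ref <= tolerance)
    return ok / len(runs)

observations = [(0.90, 0.88), (0.90, 0.86), (0.95, 0.94), (0.95, 0.70)]
print(capability_score(observations))   # 0.75 -> "S is 75%-capable" on this test set
```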
3. METHODS TO ASSESS QOS OF CLOUD
Assessing the quality of QoS enforcement allows certifying the internal processes and workflows of a cloud service provider (CSP). The assessment involves collecting a log of the various parameters and control variables exchanged between the CSP and its consumer(s), and then reasoning about how well the QoS support mechanisms are orchestrated in the system. These activities, which constitute a QoS audit [15], employ a signaling mechanism for parameter exchanges at the subsystem boundaries: say, at the SaaS-PaaS interface in a cloud setting. In this paper, we focus on the management functionality needed for QoS auditing: namely, the methods to test and certify a CSP.
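A minimal sketch of what such an audit log and a simple compliance filter might look like is given below; the record fields and the 10% tolerance rule are hypothetical choices made only for illustration.

```python
# Hypothetical sketch of a QoS audit log: each record captures the parameters
# exchanged between the CSP and its consumer at a subsystem boundary (e.g., the
# SaaS-PaaS interface), for later replay and SLA-compliance checking.

from dataclasses import dataclass

@dataclass
class AuditRecord:
    timestamp: float        # when the observation was logged
    qos_spec: float         # desired QoS prescribed by the consumer (e.g., Pref)
    qos_actual: float       # QoS actually delivered by the CSP
    environment: dict       # observable environment parameters (load, failures, ...)

def sla_violations(log, tolerance: float = 0.10):
    """Return the records where the delivered QoS falls short of the spec by more
    than the tolerated fraction (an assumed, simplistic SLA rule)."""
    return [r for r in log if r.qos_actual < (1.0 - tolerance) * r.qos_spec]
```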
3.1 Model-Aided Assessment Processes
We consider an application running on top of the computational and communication services realized over one or more cloud infrastructures. The system as a whole implements a core functionality, with augmentations from the service provider to support a variety of para-functional behaviors. For instance, data replication and content distribution may be offered as core services that are often associated with, say, performance, security, and timeliness attributes. Here, the problem of system certification (i.e., reasoning about whether a system behaves in the way it is supposed to) has become important because of the third-party control of cloud resources and the attendant issues of fault-handling, security, maintenance, and availability. Furthermore, a client application may possibly deviate from its QoS specs at run-time (maliciously or benignly): say, generating a higher-than-specified traffic load on the service. Thus, a cloud setting raises the issue of less-than-100% trust between the service provider and the consumer.
[Figure 5 (caption below): the actual system under audit — a remote application prescribing the desired QoS specs (input) to the cloud-based realization of the service-support system S, whose service-oriented distributed algorithm exercises infrastructure resources to provision QoS, delivering the actual QoS achieved (output) under the incidence of uncontrolled external environment conditions E* — feeds a QoS audit log. A simulated model of the system (a computational model of S: mathematical formulas and computational procedures for the QoS-to-resource mapping) reproduces the input QoS specs and reads the observed output trace from the log under a simulated input environment E ⊂ E*, generates the simulated output QoS in fast-forward mode, and determines the QoS tracking error and system modeling error; an intelligent QoS auditor H uses the resulting system trace and error data to reason about the SLA compliance of system S.]
Figure 5 shows the functional elements needed for testing in a cloud-based application setting. The service-support system S, realized on cloud infrastructure resources and components, is QoS-programmable and can be exercised by an application. The latter prescribes a desired QoS with the service provider, wrapping it up as part of an SLA negotiated therein. The actual QoS experienced by the application can then be compared with the desired specs for SLA compliance evaluation. At run-time, a management module logs the QoS and system parameters exchanged with S, which are then used in, say, a subsequent offline evaluation. Typically, a simulation model of S is employed (based on a reference implementation) to reproduce the I/O events from the log: namely, the QoS specs, the attained QoS, and the external environment conditions. A testing module H analyzes the simulation-generated QoS error events to reason about SLA violations, if any. Given the complex interactions of S with its external environment, H may employ machine-intelligence techniques for a probabilistic reasoning about the SLA compliance of S.
Figure 5: Functional modules for QoS testing
Here, a computational model of S is employed to provide the reference event flows by simulation, which are then compared with the actual event flows in S. The compliance analysis subsumes a verification that the safety requirements are met as S strives to meet its QoS objectives under the various external environment conditions E* incident on the underlying cloud components and resources. The QoS assessment process is carried out using the log of test data collected by H during the run-time execution of S, aided by machine-intelligence-based reasoning. A ranking of S relative to other systems, say, based on reputation analysis [16], can be objectively provided by our MBE-based approach.
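An illustrative sketch of this offline replay step follows; it reuses the hypothetical AuditRecord from the earlier audit-log sketch, and `model` stands in for whatever computational model of S is available.

```python
# Illustrative sketch: replay the logged inputs through a computational model of
# S in fast-forward mode, then compare model-estimated and actually observed QoS
# to obtain the modeling and tracking errors used in compliance reasoning.

def replay_and_score(log, model):
    """log: iterable of AuditRecord; model: callable (qos_spec, environment) -> O''."""
    results = []
    for rec in log:
        estimated = model(rec.qos_spec, rec.environment)       # O'' from the model
        modeling_error = estimated - rec.qos_actual            # O'' - O*
        tracking_error = rec.qos_spec - rec.qos_actual         # Pref - Pact
        results.append((rec.timestamp, modeling_error, tracking_error))
    return results
```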
3.2 Stress-Test to Assess System Capability
Our paper focuses on a model-based assessment of the QoS capability of a cloud-based system S. How well S adapts its internal operations to meet the environment conditions and QoS specs is the goal of our evaluation. We study the para-functional attributes PARA(S), such as stability, resilience, and responsiveness in the QoS behavior of S. A computational model of S is employed to generate a virtualized execution environment where hostile events are presented to S for stress-testing, to assess the ability of S to fight through (e.g., increasing the number of server replicas to improve the performance and availability of a web query service). With such an assessment, a management entity can reason as to whether a service provider functions optimally to meet its QoS obligations to the client applications. In this light, many existing works on system verification have focused primarily on functional correctness, as stipulated by the core requirements FUNC(S): namely, a guaranteed execution of certain actions and states by S even under an extreme environment condition e ∈ E* (e.g., a web service returning a correct query result even under server attacks) [6].

In an earlier work [17], we studied an assessment framework for Internet-type network systems. Here, multiple video sources share an end-to-end data path to a receiver R set up using UDP/IP-based packet transport. When congestion occurs along the path, say, due to cross-traffic and/or over-admission of sources, R sends a loss report to the sources to have the latter collectively reduce their video send rates for congestion relief. In the absence of knowledge about the cross-traffic interference and the source bandwidth demand, the loss observation and rate reduction proceed over multiple control steps until the congestion is relieved. In this setting, we assessed the system capability to achieve fairness among sources in the presence of a malicious source that does not obey the rate-reduction rule in order to enjoy a higher share of the available bandwidth at the expense of other sources.
We injected source-level faults of varying severity in a simulated model of the video transport system. Our studies show 85% resilience of a well-engineered system adaptation process, even with a source failure probability of 0.6 (i.e., the deviation in proportional fairness is no more than 15%).
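The toy simulation below illustrates the kind of experiment described above; the rate-adaptation dynamics, capacity, and cut factor are entirely assumed and are not the paper's model or results.

```python
# Illustrative simulation sketch: N video sources share a bottleneck of capacity C;
# on a loss report, compliant sources cut their rate multiplicatively while a
# malicious source ignores the rule. The deviation from an equal (proportionally
# fair) share measures the loss of fairness.

def simulate(num_sources=4, capacity=10.0, malicious=frozenset({0}), steps=50, cut=0.8):
    rates = [capacity / 2] * num_sources          # initially over-admitted sources
    for _ in range(steps):
        if sum(rates) > capacity:                 # receiver R reports loss
            rates = [r if i in malicious else r * cut for i, r in enumerate(rates)]
        else:                                     # additive probing when uncongested
            rates = [r + 0.1 for r in rates]
    fair_share = capacity / num_sources
    deviation = max(abs(r - fair_share) / fair_share for r in rates)
    return rates, deviation

rates, deviation = simulate()
print(rates, f"worst-case deviation from fair share: {deviation:.0%}")
```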
In general, stress-testing of a simulated model of the target system is a major element of our MBE approach. In the example of CDN, the stress parameter λs/λc is varied over a wide range under various system configurations, i.e., the geographic spread of proxy servers relative to the client sites. This causes the system to reconfigure its internal push and pull algorithmic processes CL/CL′/SR dynamically (if needed) to achieve an optimal content-delivery latency behavior; refer to Figure 3. Under these conditions, the ability of the latency control algorithm to quickly search for a near-optimal proxy placement using heuristic methods is assessed (relative to an exhaustive search). Such model-based testing is, however, subject to an assessment bias, i.e., the assessment process itself may not be accurate. For instance, a 90%-capable system getting assessed as 85%-capable indicates a bias of 5.6%. Such a bias, though it arises from the intrinsic complexity of the system being tested, can be kept low by employing machine-intelligence techniques in the assessment process.

4. STUDY OF CDN LATENCY CONTROL

Figure 3 shows our management-oriented view of a sample CDN realized on a cloud infrastructure. A traffic-related environment parameter is e ≡ λs(a)/λc(1,a): the ratio of the frequency of updates on page p-a at server R to the frequency of read requests on p-a from client C1 (λs(a), λc(1,a) > 0). We assume d units of system overhead (storage, processing, and communication) to move content over a single hop.

Without loss of generality, the occurrence of updates on p-a and the arrival of read requests on p-a from C1 and C2 are modeled as Poisson processes with the rates λs(a), λc(1,a), and λc(2,a) respectively. Each operation — a client read or a server update — spawns multiple sub-tasks at a proxy node for content indexing, retrieval/write, and forwarding, which involve the processing, storage, and network elements in that sequence (the task flows are additive, because of the M/M/G/1 property). Given the sharing of node v between clients C1 and C2, the probability that a request from C1 sent through the CL algorithm finds an out-of-date p-a at node v may be shown as:

u1a = (λc1² × λs) / [(λc1 + λs)² × (λc1 + λc2)].    (1)

How often a request from C1 traverses up the tree from v to pull the up-to-date content from R, instead of pulling from v, is determined by u1a. This in turn determines the message overhead and latency incurred for the content access. The synchronization of page p-a at v with that at R incurs a page-update overhead of 2(c + d), with the control-message overhead of 2c per update amortized across the various pulls by the factor λs/(λc1 + λc2). Thus, the average per-pull overhead and latency incurred by C1 are:

O(1a) = (1 − u1a) × (c + d) + u1a × 3(c + d) + (λs × 2c)/(λc1 + λc2),
L(min)(1a) = (1 − u1a) × (c + d)/B + u1a × 3(c + d)/B,    (2)

where L(min) depicts the minimum latency incurred for content access (i.e., with no queuing/processing delays at the proxy and server nodes v, x, q, and R) and B is the bandwidth available on the node interconnect segments.

For requests from client C3, on the other hand, the fact that node z exclusively serves C3 is factored into the overhead and latency estimates (unlike the sharing of p-a at node v by C1 and C2). Accordingly, the probability of a request from C3 seeing an out-of-date copy of p-b at node z is:

u3b = (λs/λc3) / (1 + λs/λc3).    (3)

In the simplistic case of no queuing/processing delays at z, q, and R, the per-pull overhead and latency incurred by C3 are:

O(3b) = u3b × (c + d) + (λs/λc3) × c,
L(min)(3b) = u3b × (c + d)/B.    (4)

In a case where the proxy for p-b is removed from node z, the per-pull overhead and latency are simply (c + d) and (c + d)/B respectively, because C3 always needs to pull p-b from R — as is the case with p-a.

The above operational analysis of CDN performance can be generalized to an arbitrary case of shared and non-shared proxy nodes in a distribution tree, with each client pulling various pages via its serving proxy node. Let G(V, E) be a connected graph representing the topology set up in the infrastructure, where the vertices V are the proxy-capable compute nodes and the edges E are the node interconnects. Given a set of clients {Ci} i=1,2,···,M, each client Ci is attached to a node vi ∈ V located in its geographic proximity for accessing the content X hosted by the remote server R. The service-layer algorithm creates an overlay tree T(V̂, Ê) on the topology G(V, E), where V̂ ⊆ V and Ê ⊆ E, with one or more nodes V″ ⊆ V̂ hosting a content-distributing proxy server. The QoS spec is γ ≡ [Oi, Li] i=1,2,···,M, where Oi and Li indicate the overhead and latency respectively tolerated by Ci. Due to the modeling complexity of a CDN system, the actual output QoS γ′ in a control round may differ from the model-estimated QoS γ″ = [OA,i, LA,i] i=1,2,···,M. Ci may execute a machine-intelligence procedure to reduce the modeling error (γ″ − γ′) over multiple rounds of proxy placement, with the observed output QoS γ′ in each round refining the proxy placement in the next round. The observe-and-adapt cycle continues over multiple control rounds until the desired QoS is achieved, i.e., the tracking error (γ − γ′) falls below an acceptable level.
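To make the per-pull formulas concrete, the sketch below evaluates equations (1)-(4), as reconstructed above, for purely illustrative made-up rates; the parameter values and the resulting numbers are not from the paper.

```python
# Worked example of equations (1)-(4) with made-up parameter values: update rate
# lambda_s, client read rates lambda_c1, lambda_c2, lambda_c3, per-hop overheads
# c (control) and d (data), and interconnect bandwidth B.

l_s, l_c1, l_c2, l_c3 = 0.2, 1.0, 0.5, 0.8      # request/update rates (per sec)
c, d, B = 1.0, 4.0, 10.0                         # per-hop overheads and bandwidth

# Eq (1): probability that C1 finds an out-of-date p-a at the shared node v
u_1a = (l_c1**2 * l_s) / ((l_c1 + l_s)**2 * (l_c1 + l_c2))

# Eq (2): average per-pull overhead and minimum latency for C1
O_1a = (1 - u_1a) * (c + d) + u_1a * 3 * (c + d) + (l_s * 2 * c) / (l_c1 + l_c2)
L_1a = (1 - u_1a) * (c + d) / B + u_1a * 3 * (c + d) / B

# Eq (3): probability that C3 finds an out-of-date p-b at its exclusive node z
u_3b = (l_s / l_c3) / (1 + l_s / l_c3)

# Eq (4): per-pull overhead and minimum latency for C3
O_3b = u_3b * (c + d) + (l_s / l_c3) * c
L_3b = u_3b * (c + d) / B

print(f"u_1a={u_1a:.3f}  O_1a={O_1a:.2f}  L_1a={L_1a:.3f}")
print(f"u_3b={u_3b:.3f}  O_3b={O_3b:.2f}  L_3b={L_3b:.3f}")
```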
The on-line control schema allows an exhaustive search of the solution space to determine an optimal distribution tree with an appropriate choice of proxy placement. The model-based determination of the optimal solution, however, may possibly not match the actual CDN output performance due to inaccuracies in the modeling process itself. The modeling error is factored into the algorithm for revising the optimal placements in subsequent iterations of the solution search. Given this cycle of computing a tree, actually instantiating the on-tree nodes, and observing the user-level QoS therein
during on-line control of the CDN, the quality of adaptation may itself suffer, in terms of output convergence and stability. Though the final output may be optimal, an algorithmic process of reaching the optimal point that makes the effect of search jitter visible to the clients is not desirable. This is because the user-level QoS may also be affected by how many placements are actually tried out on the CDN system during the process of determining the optimal proxy placement. Thus, a heuristics-aided search of the solution space that determines a reasonable placement in fewer cycles is desirable.
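A minimal sketch of one such heuristics-aided search (a greedy placement, as opposed to exhaustively enumerating all placements) is given below; `latency(placement)` is an assumed evaluation function, e.g., driven by the per-pull formulas of Section 4 on the overlay tree, and the function names are illustrative only.

```python
# Illustrative greedy proxy-placement heuristic: repeatedly add the proxy node
# that most reduces the aggregate client latency, stopping when no single
# addition helps or the proxy budget is exhausted.

def greedy_placement(candidate_nodes: set, latency, max_proxies: int):
    placement = set()
    best = latency(placement)
    while len(placement) < max_proxies:
        node, value = None, best
        for n in candidate_nodes - placement:          # try adding one more proxy
            trial = latency(placement | {n})
            if trial < value:
                node, value = n, trial
        if node is None:                               # no improvement possible
            break
        placement.add(node)
        best = value
    return placement, best
```

Compared with an exhaustive search over all subsets, such a heuristic tries far fewer placements per control cycle, which is exactly the property argued for above.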
For model-based CDN assessment by H, a specific choice of search strategy — greedy versus genetic algorithms — is less important. The latency and overhead difference between these algorithms manifests as a quantifiable difference in the robustness of the corresponding CDN instances against fluctuating client workloads.
5. CONCLUSIONS
We treat the verification of system capability to meet application QoS at a meta-level (regardless of the system complexity). The capability assessment of a cloud-based system involves three aspects: measurement, prediction, and adaptation of system behavior — which are enabled by the programmability of cloud-based infrastructures and services. Guided by the concepts provided in [3], the paper studied the management intelligence principles for verifying the QoS capability of complex distributed systems. In our management view, the core domain-specific functions of S are embodied in an adaptation module Ap′ that interacts with the computational processes g*(···) realized on the system infrastructure with the aid of service-layer algorithms. A management entity H incorporates the observation logic to reason about system capability by quantification of control errors and probabilistic QoS assessment. A case study of a latency-adaptive CDN was undertaken. The focus of the study was on assessing how well the adaptation processes perform in the face of imperfect knowledge about the system operating conditions (such as the available cloud resources and traffic demands), and on the system resilience and robustness against unpredictably changing environments. Overall, our management structure for system assessment reduces the development cost of distributed control software, by model reuse and service-level programming.
6. REFERENCES
[1] B. Li and K. Nahrstedt. A Control-based Middleware Framework for Quality of Service Adaptations. IEEE Journal on Selected Areas in Communications, 17(9), 1999.
[2] C. Lu, Y. Lu, T. F. Abdelzaher, J. A. Stankovic, and S. H. Son. Feedback Control Architecture and Design Methodology for Service Delay Guarantees in Web Servers. IEEE Transactions on Parallel and Distributed Systems, 2006.
[3] A. Avizienis, J. C. Laprie, B. Randell, and C. Landwehr. Basic Concepts and Taxonomy of Dependable and Secure Computing. IEEE Transactions on Dependable and Secure Computing, 1(1), Jan. 2004.
[4] P. G. Bridges, M. Hiltunen, and R. D. Schlichting. Cholla: A Framework for Composing and Coordinating Adaptations in Networked Systems. IEEE Transactions on Computers, 58(11), Nov. 2009.
[5] N. S. Rosa, P. R. F. Cunha, and G. R. R. Justo. ProcessNFL: A Language for Describing Non-Functional Properties. In Proc. 35th Hawaii International Conference on System Sciences (HICSS), IEEE, 2002.
[6] D. Lee, A. Lopes, and A. P. Heffter (editors). Formal Techniques for Distributed Systems. IFIP Intl. Conf. FMOODS/FORTE 2009, Springer-Verlag, LNCS 5522, 2009.
[7] I. Schaefer and A. P. Heffter. Slicing for Model Reduction in Adaptive Embedded Systems Development. In Workshop on Software Engineering for Adaptive and Self-Managing Systems (SEAMS), May 2008.
[8] J. Yi, H. Woo, J. C. Browne, A. K. Mok, F. Xie, E. Atkins, and C. G. Lee. Incorporating Resource Safety Verification to Executable Model-based Development for Embedded Systems. In IEEE Real-Time and Embedded Technology and Applications Symposium, pp. 137-146, 2008.
[9] M. Cordier, P. Dague, M. Dumas, F. Levy, A. Montmain, M. Staroswiecki, and L. Trave-Massuyes. A comparative analysis of AI and control theory approaches to model-based diagnosis. In Proc. 14th European Conference on Artificial Intelligence, pp. 136-140, 2000.
[10] E. Fabre, A. Aghasaryan, A. Benveniste, R. Boubour, and C. Jard. Fault detection and diagnosis in distributed systems: an approach by partially stochastic Petri nets. Journal of Discrete Event Dynamic Systems, vol. 8, pp. 203-231, 1998.
[11] P. E. Lanigan, P. Narasimhan, and T. E. Fuhrman. Experiences with a CANoe-based Fault Injection Framework for AUTOSAR. In Proc. IEEE/IFIP Conf. on Dependable Systems and Networks (DSN'10), Chicago (IL), June 2010.
[12] A. Rowe, G. Bhatia, and R. Rajkumar. A Model-Based Design Approach for Wireless Sensor-Actuator Networks. In Workshop on Analytic Virtual Integration of Cyber-Physical Systems (AVICPS), San Diego (CA), 2010.
[13] J. Keeney and V. Cahill. Chisel: A Policy-Driven, Context-Aware, Dynamic Adaptation Framework. In Proc. IEEE Intl. Workshop on Policies for Distributed Systems and Networks (POLICY'03), June 2003.
[14] Y. Chen, R. Katz, and J. Kubiatowicz. Dynamic Replica Placement for Scalable Content Delivery. In Proc. Intl. Workshop on Peer-to-Peer Systems, LNCS 2429, Springer-Verlag, pp. 306-318, 2002.
[15] K. Ravindran. QoS Auditing for Evaluation of SLA in Cloud-based Distributed Services. In Proc. Cloud Security Auditing Workshop (CSAW), IEEE Services Conference, Santa Clara (CA), June 2013.
[16] J. Abawajy. Determining Service Trustworthiness in Intercloud Computing Environments. In Proc. 10th Intl. Symp. on Pervasive Systems, Algorithms, and Networks, Kaohsiung (Taiwan), Dec. 2009.
[17] K. Ravindran. Model-based Engineering for Certification of Complex Adaptive Network Systems. In Proc. IEEE Workshop on Cyber-Physical Networking Systems, ICDCS'12, Macau (China), June 2012.