Adaptivity Metric and Performance for Restart Strategies in Web Services Reliable Messaging Philipp Reinecke
Katinka Wolter
Humboldt-Universität zu Berlin Unter den Linden 6 10099 Berlin, Germany
[email protected]
[email protected]
ABSTRACT

Adaptivity, the ability of a system to adapt itself to its environment, is a key property of autonomous systems. In this paper we propose a benefit-based framework for the definition of metrics to measure adaptivity. We demonstrate the application of the framework in a case study of the adaptivity of restart strategies for Web Services Reliable Messaging (WSRM). Using the framework, we define two adaptivity metrics for a fault-injection-driven evaluation of the adaptivity of three restart strategies in a WSRM implementation. The adaptivity measurements are complemented by a thorough discussion of the performance of the restart strategies.
Categories and Subject Descriptors C.2.4 [Computer-communication networks]: Client/server, distributed applications; C.4 [Performance of systems]; D.2.8 [Software Engineering]: Metrics—Performance measures
General Terms Experimentation, management, measurement, performance, reliability
Keywords Adaptivity, Adaptivity Metrics, Service-oriented architectures, Web-Services Reliable Messaging
1. INTRODUCTION

Today's computer systems grow ever more complex. Service-Oriented Architectures (SOAs) provide but one example of this trend: Complex systems export their functionality as services, which can be composed to form powerful new composite services. As a consequence of the rise in complexity, however, human administration and management of composite services become less and less feasible. In recent years, academia and industry alike have concentrated on the development of autonomous systems to address this challenge. Autonomous
systems aim to reduce maintenance effort by adaptation to the environment. Adaptivity, the ability of the system to adapt itself to its environment, is a key property of autonomous systems. However, as of yet no metrics to measure this property exist. Approaches such as the robustness radius [1] or metrics from control analysis and sensitivity analysis [6, 5] require that there exists a model of the system under study. This limits their applicability to systems for which an extensive modeling process has been performed. Metrics based on the optimal allocation of trials [7], on the other hand, rely on complete knowledge of the outcome of individual actions of the system, and are therefore interesting only from a theoretical point of view. Finally, payoff accumulation metrics [7] tend to average out the dynamics of the adaptation process. This paper aims to provide metrics for the measurement of adaptivity. We propose a framework for the definition of benefit-based adaptivity metrics, and demonstrate its application by defining two adaptivity metrics for restart strategies in Web Services Reliable Messaging (WSRM). The paper is structured as follows: In Section 2 the general model employed throughout the paper is introduced. In Section 3 we first identify four important properties of an adaptivity metric and then present our benefit-based adaptivity metric framework in Section 3.1. Section 4 illustrates the use of the framework in a case study. We define two adaptivity metrics for restart strategies for Web Services Reliable Messaging (WSRM). The adaptivity metrics are then applied in a fault-injection-driven evaluation of the adaptivity of restart strategies in a WSRM implementation (Section 5).
2. ON ADAPTATION
Figure 1 shows the system-task model employed throughout the paper. In the study of adaptation the focus lies on the system B, which is a client to a system C and acts as a server to another system A. That is, A delegates a task b to B, which completes this task, using an invocation c of C. From A’s perspective, B’s task completion properties determine how ‘useful’ a completed task b is to A. Since the system B delegates a part c of its task b to the server C, B in turn has to contend with C’s task completion properties. Out of two systems B and B ′ , the one that completes tasks in a manner more ‘useful’ to A is the one more suitable. Thus
for B adaptation means a change in the way it operates on tasks, with the aim of becoming more 'useful' to A. Figure 1 shows that B operates in an environment determined by both A and C: C's task completion properties may have a negative impact on B's task completion properties, and thus on its usefulness to A. On the other hand, A often also influences B through the characteristics of the workload it puts on B, e.g. larger tasks or more frequent invocations may require more resources in B.

[Figure 1: The general model considered in the study of adaptation. Client A delegates tasks to system B, which in turn delegates sub-tasks to server C and returns results; together, A and C form B's environment.]
2.1 Parameters of the Adaptation Problem

Adaptation means that the system B modifies its behaviour in order to improve its task completion properties, as they are perceived by B's client system A. Our formalisation must account for the changes that occur in B's treatment of tasks. We then have to refine the vague notion of 'useful' employed so far. Drawing from the formalisation in [7], adaptation is a sequence of trials. In the framework, each invocation b delegated to the system B is a trial. The adaptation problem then has the following parameters:

Structures S: The part of B's internal state that determines the behaviour of the system B is called its structure. The exact nature of a structure will depend on the nature of the task performed by the system.

Payoff P: The payoff encodes how 'useful' a task b completed by the system B is to the system A. The payoff function P maps observations of the metrics that describe B's task completion properties to ℝ. Suppose that, from A's perspective, task completion can be measured using performance metrics M_1, M_2, ..., M_n, describing properties such as completion time, correctness or fairness. Then the function P : M_1 × M_2 × ⋯ × M_n → ℝ provides a measure for how useful the completed task was to A. In the following we assume that P ∈ [0, 1], with P(t) = 1 denoting optimal payoff.

Environments E: The environment comprises all parameters that affect B's task completion properties.

Time T: Trials are ordered by time. Furthermore, both the structure and the environment at the time t ∈ T of a trial may depend on time. Note that the payoff is considered time-invariant.

When studying adaptation, we analyse B's behaviour within these parameters based on an observation function

    Obs : T → P × E × S,   Obs(t) := (P(t), E(t), S(t)),

that, for every time t, encodes the environment E(t), the structure S(t) used within this environment and the payoff P(t) thus obtained. The observation function Obs(t) is
sampled at times t_1, t_2, ... of trials b_1, b_2, .... Note that this observation function implies an omniscient observer that can observe the payoff P, structures S and environment E. To the system B, these parameters are at least partially hidden.
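To make the formalisation concrete, the following minimal Python sketch shows one way the observation function and a payoff function could be represented. The types and the particular payoff mapping are illustrative assumptions, not part of the model above.

```python
from typing import Any, NamedTuple

class Obs(NamedTuple):
    """One observation Obs(t) = (P(t), E(t), S(t)) made by an omniscient observer."""
    payoff: float      # P(t) in [0, 1]; 1 denotes optimal payoff
    environment: Any   # E(t): all parameters that affect B's task completion
    structure: Any     # S(t): the part of B's state that determines its behaviour

def payoff_from_completion_time(completion_time: float, worst_acceptable: float = 1.0) -> float:
    """Illustrative payoff P based on a single performance metric (completion time).

    This linear mapping to [0, 1] is an assumption chosen for the example;
    the model only requires some function P : M_1 x ... x M_n -> [0, 1]."""
    return max(0.0, 1.0 - completion_time / worst_acceptable)

# Observations sampled at the trial times t_1, t_2, ...
trace = [
    Obs(payoff_from_completion_time(0.2), environment="low loss", structure="timeout=4s"),
    Obs(payoff_from_completion_time(0.9), environment="loss burst", structure="timeout=4s"),
]
```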
3. ADAPTIVITY METRICS
The choice of a metric should be guided by the purpose behind using that metric. An adaptivity metric should provide a way to assess the adaptivity of a system, thus facilitating an informed choice between adaptive systems. Furthermore, it may be useful to automate this choice. Based on these premises, the following four desirable properties of a metric can be identified:

Boundedness: The values of the metric should be bounded to some fixed, known interval, e.g. to the interval [0, 1]. Bounded metrics carry some amount of information even in single values, without the need for comparison to others, since one can infer something about the system from how close the metric is to the interval limits.

Comparability: Since our purpose in finding a metric is comparing systems, comparisons between two values v1, v2 of the metric for two systems S1, S2 are necessary. This requires that a relation ≤_M be defined for all values of the metric. Furthermore, the comparison should foster an intuitive interpretation, which is addressed as part of the next property.

Intuitive Interpretation: The use of any metric in the decision-making process always requires some amount of interpretation. Interpretation is necessary in order to select an appropriate metric – What does this metric tell about the system? – and to understand and compare the measurements. Otherwise one may arrive at wrong results due to misconceptions about what was measured or due to wrong conclusions drawn from the measurements. Consequently, the metric should be derived in a way that can be easily interpreted by someone without a strong background in a specific modelling technique, and the relation used to establish the order required for comparisons should also not rely on extensive modelling.

Simple and Efficient Computation: During the study of a system, large amounts of data are often gathered. In order to analyse this data using the chosen metric, simple and efficient means of computing the metric are required. This requirement is even more important for the design of management components that monitor the adaptivity of systems at runtime. These components rely on values for the metric being available in real-time, i.e. in time for the decisions to be made. Therefore on-line computation of the metric must be possible, which necessitates that the metric can be computed within a reasonable time-span and using data available during operation.
3.1 A Framework for Benefit-based Adaptivity Metrics
There are three design goals for the adaptivity metric framework: The framework shall address the dynamic nature of the adaptation process; its applicability should not be limited by the visibility of the parameters of the adaptation problem (Section 2.1); and metrics defined using the framework should possess all of the desirable properties. These considerations inspire the following fundamental ideas:

Study adaptation as a sequence of trials: Similar to the optimal allocation of trials metric [7], the adaptation process is considered a sequence of trials. This addresses the dynamic nature of adaptation. However, unlike the optimal allocation of trials approach in [7], neither the current structure chosen by the system nor an ordering of structures according to their payoff need to be known.

Study system behaviour from the outside: As in the payoff accumulation approach [7], the system is studied as a black box in its environment. Only the payoffs obtained in each trial are assumed to be known. Unlike the payoff accumulation metric, however, the metric to be developed does not focus on the accumulation of payoff. Instead, payoff changes throughout the sequence of trials are studied. Defined in this manner, the framework can be applied as long as the payoff of individual trials is observable (i.e. a system A exists which can measure B's performance).

Normalise metrics to [0, 1]: An optimal system will be used as a reference to normalise metrics defined using the framework. The metrics will be expressed as the ratio between an actual and an optimal system. This bounds the metric to the interval [0, 1].

[Figure 2: Example of adaptive system behaviour. Payoff over trials 1–5 for the optimal system S* (constant at 1) and a real system.]

The derivation of the adaptivity metric framework proceeds as follows: Let the times t_i ∈ T for which observations Obs(t) exist be ordered (i.e. ∀ i < j : t_i < t_j). We then study the sequence of N := |T| > 1 trials

    Obs(t_1), Obs(t_2), ..., Obs(t_N).

Since we observe the system's behaviour from the outside, only the payoff P(t) in each observation Obs(t) is visible. Defining p_i := P(t_i), the payoff value of the ith trial, we have a sequence of payoffs

    p_1, p_2, ..., p_N

that the system obtained in the N trials. E.g. in Figure 2, the payoffs for the real system are p_1 = 0, p_2 = 0.5, p_3 = 0.25, p_4 = p_5 = 0.75.

Each payoff p_i is considered the result of a decision made by the system between the previous trial i−1 and the next trial i. That is, the system could either choose another structure, or keep the current one. The payoff reflects how good either decision was: An increase in the payoff between subsequent trials (p_i > p_{i−1}) implies a positive decision, and a payoff decrease (p_i < p_{i−1}) is the result of a negative decision. A constant payoff (p_i = p_{i−1}) signals a neutral decision. The metric is based on the quality of the decisions. This means that we must define the benefit associated with each type of decision. Let

    D⊖ := {i = 2, 3, ..., N | p_{i−1} > p_i}
    D⊙ := {i = 2, 3, ..., N | p_{i−1} = p_i}
    D⊕ := {i = 2, 3, ..., N | p_{i−1} < p_i}

denote the sets of negative, neutral and positive decisions, respectively. Negative decisions are considered non-beneficent (but not detrimental, either). For the real system in Figure 2, these sets are

    D⊖ = {3},  D⊙ = {5},  D⊕ = {2, 4}.

The benefit of a positive decision shall be

    Δ_i := (p_i + p_{i−1}) / 2.

That is, the benefit of a positive decision is larger than that of staying constant at the lower payoff p_{i−1}, but smaller than that of staying at p_i.

When defining the benefit of a neutral decision we have to take into account that values of the metric are to be normalised using values from an optimally adaptive system as a reference. In every trial, the optimally adaptive system chooses the structure that yields the optimal payoff. This means that for the optimal system the payoff stays constant throughout the sequence (see Figure 2):

    p_1 = p_2 = ⋯ = p_N = 1.

Consequently, in order to ensure that the adaptivity metric correctly reflects the optimally adaptive system's optimality, the benefit of a neutral decision must be equal to the payoff obtained by that decision.

We can now define the metric: The sum of the benefits obtained by all decisions in the sequence is

    Σ_{i∈D⊕} Δ_i + Σ_{i∈D⊙} p_i.

The optimally adaptive system S* is characterised by the fact that its decisions are always optimal, i.e. always yield the optimal payoff. Therefore, for S* all decisions are neutral:

    D*⊙ = {i | i = 2, 3, ..., N},  D*⊖ = D*⊕ = ∅.

The sum of the benefits obtained by the optimal system is

    Σ_{i∈D*⊙} p_i = Σ_{i∈D*⊙} 1 = |D*⊙| = N − 1.

Thus the maximum benefit any system can accumulate is equal to the number of decisions, N − 1. Then,

    Ad := (Σ_{i∈D⊕} Δ_i + Σ_{i∈D⊙} p_i) / (Σ_{i∈D*⊙} p_i)
        = (Σ_{i∈D⊕} Δ_i + Σ_{i∈D⊙} p_i) / (N − 1)          (1)
expresses how close the benefit accumulated by the measured system is to that of the optimal system S*. Since N − 1, the benefit of the optimal system, is equal to the number of decisions, Ad also encompasses the probability that the system makes a beneficial decision.

                          e_{i−1} = e_i                     e_{i−1} ≠ e_i
                  s_{i−1} = s_i    s_{i−1} ≠ s_i    s_{i−1} = s_i    s_{i−1} ≠ s_i
  p_{i−1} > p_i        –                ⊖                ⊖                ⊖
  p_{i−1} = p_i        ⊙                ⊙                ⊙                ⊙
  p_{i−1} < p_i        –                ⊕                ⊕                ⊕

Table 1: Decision Quality

[Figure 3: ETT and URC. Transmission attempts of one message: send times s_1, ..., s_5 at the source and the corresponding receive times at the destination (one transmission is lost); the ETT spans the interval up to the earliest successful receipt.]
Example. Figure 2 illustrates the behaviour of a real and the optimal system. Here, the adaptivity of the real system is

    Ad = (0.25 + 0.5 + 0.75) / 4 = 0.375.
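Because the metric only needs the observed payoff sequence, it is simple to compute. The following sketch (ours, in Python) implements equation (1) and reproduces the value Ad = 0.375 for the example trace of Figure 2.

```python
def adaptivity(payoffs):
    """Benefit-based adaptivity metric Ad of equation (1).

    payoffs: the sequence p_1, ..., p_N of per-trial payoffs, each in [0, 1].
    Positive decisions (p_{i-1} < p_i) contribute (p_i + p_{i-1}) / 2,
    neutral decisions (p_{i-1} = p_i) contribute p_i, and negative decisions
    contribute nothing; the accumulated benefit is normalised by N - 1."""
    if len(payoffs) < 2:
        raise ValueError("at least two trials are required")
    benefit = 0.0
    for prev, cur in zip(payoffs, payoffs[1:]):
        if cur > prev:        # positive decision
            benefit += (cur + prev) / 2
        elif cur == prev:     # neutral decision
            benefit += cur
        # negative decision: no benefit
    return benefit / (len(payoffs) - 1)

print(adaptivity([0.0, 0.5, 0.25, 0.75, 0.75]))  # real system of Figure 2: 0.375
print(adaptivity([1.0, 1.0, 1.0, 1.0, 1.0]))     # optimally adaptive system: 1.0
```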
The metric, as defined here, does not depend on knowledge of either the environments or the structures. However, some refinements may be applied if such knowledge exists. As Table 1 shows, several combinations of changes in Obs may be possible: The environment may stay constant (first two columns), or it may change (second two columns) between decisions. Likewise, for each situation the system may choose a new structure or use the same as before. Then, the outcome of these decisions may be weighted differently, depending on the changes the system had to contend with. For instance, one may consider a payoff decrease due to a change in the environment less severe than one just due to the trial of a new structure. This, however, will be left for future work. The table reflects this limitation: Changes in the environment or the structures have no impact on whether a decision is classified as detrimental (⊖), neutral (⊙) or beneficial (⊕).

It should be noted that with a few modifications metrics defined using the framework presented here can fit the definition of performability metrics [12]. The main difference lies in the fact that the performability framework unifies the structures and environments parameters. By keeping these parameters apart, one can identify decisions and evaluate their usefulness more easily. Distinguishing the structures and environments parameters thus more closely suits the adaptation process, which entails a choice of structures to fit the environment.
4. CASE STUDY: AN ADAPTIVITY METRIC FOR RESTART STRATEGIES IN WSRM

We will now present a case study for the application of the metric framework defined in the previous section to the adaptivity of restart strategies in a Web Services Reliable Messaging (WSRM) implementation. The case study illustrates the use of the adaptivity metric framework, and it will let us explore problems that
may arise in the practical application of the framework. The WSRM standard defines a protocol to provide reliable transmission of SOAP messages over possibly unreliable channels [4]. WSRM implementations utilise acknowledgement schemes to infer the message transmission status, and resend messages if no acknowledgement arrives before a retransmission timeout. Web Services, and thus the WSRM, use typical application-layer protocols (e.g. HTTP, SMTP) of the TCP/IP stack as SOAP transports. In this sense SOAP message retransmissions by the WSRM pose an instance of application-level restart [21, 18]. In order to apply the adaptivity metric framework we must identify the parameters of the adaptation problem for restart in WSRM, as follows: Each application message mi to be transmitted by the WSRM constitutes a trial, i.e. a task bi delegated to the WSRM by the application (systems B and A in Figure 1, respectively). The retransmission timeout constitutes the structure. The environment describes the workload generated by the Web Service application and the characteristics of the TCP/IP stack. The definition of a payoff function requires adequate performance metrics. These will be defined in the next section.
4.1 Performance Metrics for Restart in WSRM
In the following we define three performance metrics. The Effective Transmission Time (ETT) and the Unnecessary Resource Consumption (URC) metrics are generic metrics applicable to a wide range of systems (cf. [18]). The Savings (SAV) metric, on the other hand, measures the reduction of completion times obtained by restart, and is therefore limited to systems that apply restart.
Effective Transmission Time (ETT) metric. Figure 3 illustrates the concept behind the ETT metric: Every application message m_i is sent n_i ≥ 1 times by the WSRM source. For each of these transmission attempts there is a pair of sent/received times (s_ij, r_ij), with j = 1, ..., n_i enumerating the attempts, s_ij ∈ ℕ and r_ij ∈ ℕ ∪ {∞} (where r_ij = ∞ denotes a transmission failure, i.e. message loss). Based on the assumption that failures are transient, we expect at least one transmission k to be successful, i.e. there will be a tuple (s_ik, r_ik) with r_ik < ∞. Let k_i* denote the earliest such transmission, i.e.

    k_i* = argmin_{k ∈ {k : r_ik < ∞}} r_ik.

[...]

The second payoff function rewards a message transmission only if the savings obtained by restart exceed the threshold SAV*:

    P²_SAV(m_i) = 1 if SAV(m_i) > SAV*, and 0 otherwise.          (5)
The second payoff function is bounded to [0, 1] and has its optimum at 1.

5. APPLICATION OF THE ADAPTIVITY METRICS

We will now employ the adaptivity metrics defined in the previous section in an experimental evaluation of the adaptivity of restart strategies for Web Services Reliable Messaging (WSRM). We consider three restart strategies: The first strategy uses the well-known Jacobson/Karn algorithm [9, 11] to compute the retransmission timeout. With the Jacobson/Karn algorithm the timeout is computed from the smoothed mean and the variance of the round-trip time (RTT). The second strategy uses a modified version of the algorithm presented in [20], which we term the QEST algorithm in the following. This algorithm obtains a histogram representation of past completion times and tries to find the timeout that minimises expected completion times. The version used here also employs exponential backoff on restarts. In order to assess the benefit of adaptive behaviour over pre-deployment adaptation we also consider restart after a constant interval of 4 s.

5.1 Experiment Setup

A testbed was designed and implemented to study the adaptivity of the restart strategies in network environments with IP packet loss. The testbed consists of a sample application to generate a workload, a WSRM implementation inside of which the restart mechanisms operate, and an operation environment, where fault injection occurs. These components represent systems A, B and C in Figure 1, respectively. The task b_i to be performed by the WSRM implementation (system B) is the transmission of the ith application message. Tasks c_ij refer to (re-)transmissions of the ith message over the network stack (system C). Within the testbed, a Web Service (WS) client transmits messages asynchronously to a WS server. Message transmissions are uni-directional, i.e. the server does not generate response messages. An enhanced version of Sandesha1/Axis1 [3, 2] provides the WSRM implementation for the experiments. The modifications add support for asynchronous message transmissions (implemented similarly to [23]) and improve Sandesha's stability and performance.²

² The current branch of Sandesha development, Sandesha2/Axis2, supports asynchronous invocations natively. However, this version was not stable enough for the experiments.

The operation environment consists of a 10 Mbit LAN connection, emulated on top of the physical 100 Mbit Ethernet. Fault injection occurs in the 10 Mbit connection. Experiments are organised in pairs of a scenario describing the environment, and a restart strategy. Experiments are run as follows: First, the server is restarted. This is done to avoid the effects of potential software aging in the server. Then, the fault injection environment is configured, followed by the start of the application. In each experiment the application sends 20000 messages with a payload of 256 bytes and a message interarrival time of 100 ms. The experiment duration is limited to 1 hour, after which the experiment is aborted and the next experiment is started. Experiments that had to be aborted are not included in the analysis. Data was gathered and analysed for five non-aborted experiment runs for each combination of a restart strategy and an environment.

Loss model. IP packet loss is generated by a time-driven implementation of a simplified two-state Gilbert loss model. Gilbert loss models generate sequences of alternating loss episodes and loss-free periods with exponentially-distributed lengths (see e.g. [25, 24, 22, 19, 8]). Results for two scenarios are presented here: In the S1 scenario, loss episodes and loss-free periods have a mean length of 0.05 s and 120 s, respectively. For the S2 scenario, these parameters are 1 s and 30 s.

Parameters for the restart algorithms. The restart algorithms use the following parameters: For Jacobson/Karn, k = 4, α = 1/8, β = 1/4, with the initial timeout RTO_initial = 4 s. For the QEST algorithm, the number of buckets is H = 1000, the maximum timeout t_max = 60 s and the initial timeout was RTO_initial = 4 s.
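For illustration, the sketch below renders the standard Jacobson/Karn retransmission-timeout computation with the parameters listed above (k = 4, α = 1/8, β = 1/4, initial timeout 4 s). It is a generic Python rendering of the algorithm from [9, 11], not the code used in the testbed; Karn's rule (ignoring RTT samples of retransmitted messages) and the exponential backoff on expiry are left to the caller.

```python
class JacobsonKarnRTO:
    """Retransmission timeout from the smoothed mean and variance of the RTT."""

    def __init__(self, k=4.0, alpha=1.0 / 8, beta=1.0 / 4, initial_rto=4.0):
        self.k, self.alpha, self.beta = k, alpha, beta
        self.srtt = None       # smoothed round-trip time estimate (seconds)
        self.rttvar = None     # smoothed round-trip time variation (seconds)
        self.rto = initial_rto

    def sample(self, rtt):
        """Feed one RTT sample (seconds) and return the updated timeout."""
        if self.srtt is None:            # first sample initialises the estimators
            self.srtt = rtt
            self.rttvar = rtt / 2
        else:
            self.rttvar = (1 - self.beta) * self.rttvar + self.beta * abs(self.srtt - rtt)
            self.srtt = (1 - self.alpha) * self.srtt + self.alpha * rtt
        self.rto = self.srtt + self.k * self.rttvar
        return self.rto

estimator = JacobsonKarnRTO()
for rtt_sample in [0.012, 0.015, 0.011, 0.250]:   # RTT samples in seconds
    timeout = estimator.sample(rtt_sample)
```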
5.2 Preparation of Measurements

Measurements were obtained by off-line analysis of message send/receive events recorded during the experiments. The analysis consisted of computing the Effective Transmission Time (ETT), the Unnecessary Resource Consumption (URC), the savings (SAV) and the payoffs P1 and P2 for each message. The adaptivity metrics Ad1 and Ad2 were then computed for each sequence. The R statistics package [15] was employed in the statistical analysis of the measurements.

Artifacts. It was observed that many sequences exhibited a transient increase of ETTs shortly after the start of the experiment. The root cause of this behaviour could not be identified. Since this phenomenon affected experiments irrespective of the scenario or the restart strategy used, the effect on the measurements was considered an artifact introduced by the testbed itself. This artifact was removed by only including message numbers 1000, ..., 20000 in the analysis.

Furthermore, the sequences fall into two classes that differ by minimum and median transmission times (best visible in the ETT values). Measurements for Fixed Intervals have minimum values of 8–9 ms (median 9–13 ms), whereas the other measurements have minima of 12–13 ms (median 10–18 ms). All experiments whose measurements have lower minima were performed some time after the experiments with higher minima. The decrease indicates a change in network characteristics in the time in between. The nature and cause of this change is unknown. However, since the difference is very small in comparison to the thresholds ETT* (in c(ETT) in the first payoff metric, P1) and SAV*, the effect on the results is negligible.

Measurement accuracy. Since measurements were computed from timestamps recorded on two different machines, both system clocks needed to be synchronised. System clocks were synchronised using NTP. Clock synchronisation in the testbed was assessed based on NTP log files. It was found that system clocks stayed within +/- 2 ms of each other during the experiments, which is considered sufficiently accurate so as to not necessitate the use of skew removal procedures such as presented in e.g. [14, 13, 10].
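As a rough illustration of this off-line analysis, the sketch below derives per-message values from recorded send/receive events. The authoritative definitions of the metrics are those of Section 4.1; here we assume the straightforward readings that ETT is the interval from the first send to the earliest successful receipt and that URC counts the transmissions beyond the first. The event format and function name are ours.

```python
import math

def ett_and_urc(send_times, receive_times):
    """Per-message ETT and URC from one message's transmission attempts.

    send_times:    send timestamps s_i1, ..., s_in (seconds, source clock)
    receive_times: receive timestamps r_i1, ..., r_in (seconds, destination
                   clock), with math.inf marking a lost transmission.

    Assumed readings (see the lead-in above): ETT is the earliest successful
    receipt minus the first send; URC is the number of extra transmissions."""
    successful = [r for r in receive_times if r < math.inf]
    if not successful:
        raise ValueError("no successful transmission recorded for this message")
    ett = min(successful) - send_times[0]
    urc = len(send_times) - 1
    return ett, urc

# Example: the first transmission is lost and a restart after 4 s succeeds.
ett, urc = ett_and_urc([0.000, 4.000], [math.inf, 4.009])   # -> (4.009, 1)
```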
5.3 Results

Figures 4 and 5 present results for the adaptivity metrics Ad1 and Ad2 in the two scenarios S1 (0.05 s/120 s packet loss) and S2 (1 s/30 s packet loss). The values are based on data from five runs of each strategy in each scenario. Adaptivity in each run is computed from values for the payoff functions P1 and P2, respectively. Data for each run consists of 19001 samples (message numbers 1000, ..., 20000, see above).

[Figure 4: Ad1 values for scenarios S1 and S2, plotted over the weighting factor α, with 95% confidence intervals.]

[Figure 5: Ad2 values for scenarios S1 and S2, plotted over the savings threshold SAV*. Note that the Ad2 axis is scaled logarithmically. Ad2 is zero for Fixed Intervals in S1.]

For the first adaptivity metric, Ad1, Figure 4 shows data for five choices of the weighting factor α, which determines the relative importance of ETT over URC in P1. An α of 0.01 signifies very low importance of ETT, while a high α is used when completion times are much more important than fairness. The threshold value ETT* is set to 100 ms for all computations of Ad1. The second adaptivity metric Ad2 is presented for five values of the minimum savings value SAV*.

Figure 4 shows that, irrespective of the weighting factor α or the scenario, the adaptivity metric Ad1 reports all strategies as about equally adaptive. Curiously, Fixed Intervals are shown to be slightly more adaptive than either Jacobson/Karn or QEST in the S2 scenario. The results for the second adaptivity metric Ad2 (Figure 5) show a different picture: Here, we observe a difference between the non-adaptive Fixed Intervals strategy and the adaptive Jacobson/Karn and QEST algorithms. The Ad2 metric reports the latter two as being more adaptive by one order of magnitude. For higher savings thresholds SAV*, the QEST strategy appears more adaptive than the Jacobson/Karn algorithm. Furthermore, in the S2 scenario (which has a higher fault rate) the adaptivity values Ad2 are higher. However, the relative ordering of the strategies in regard to their adaptivity values stays the same: Fixed Intervals have the lowest Ad2 values, followed by Jacobson/Karn and QEST, which has higher values for larger savings thresholds SAV*.
5.4 Discussion

Section 3 lists four desirable properties of a metric: Boundedness, comparability, intuitive interpretation and simple and efficient computation. Boundedness and intuitive interpretation for both metrics are ensured by their definition. The scalar values by which both metrics represent adaptivity ensure comparability (within values for one metric).

Given values of the payoff metrics P1 and P2, both adaptivity metrics are easily computed. On-line computation, however, relies on the performance metrics ETT, URC and SAV being available. In the experiments values for these metrics were obtained by off-line analysis of events sampled using a dedicated experiment testbed. In particular, ETT measurements require synchronised clocks, and all measurements rely on the existence of a TransmissionID field to distinguish retransmissions of a message. Both facilities often do not exist in a practical scenario. Therefore, computation of the metrics can be difficult in practice. However, one may approximate ETT values by RTT measurements, and a TransmissionID header can be easily added by a WSRM implementation. A study of these issues is left as future work.

This section will discuss the consistency of measurements with system behaviour, and the influence of the environment on the adaptivity values.
5.4.1 Measurement consistency
The previous section showed conflicting results for the adaptivity of the restart strategies, depending on which adaptivity metric is used: The ETT/URC-based adaptivity metric Ad1 reports that all strategies are about equally adaptive. Moreover, using this metric we may consider the nonadaptive Fixed Intervals more adaptive than both Jacobson/Karn and QEST, albeit by a very small margin. The SAV-based adaptivity metric Ad2 , on the other hand, shows that Jacobson/Karn and QEST are more adaptive than Fixed Intervals. We must therefore consider whether the adaptivity metrics do indeed reflect system behaviour correctly. In order to perform such a discussion, we have to check whether adaptivity, as reported by the adaptivity metrics, is consistent with system behaviour, as measured by the performance metrics.
Background: Influence of the TCP-RTO on the HTTP transport. An understanding of the behaviour of the TCP/IP stack is required in order to correctly interpret the results. In the experiments, the WSRM used the HTTP SOAP transport to transmit messages. For each message to be transmitted, the HTTP SOAP transport establishes a new TCP connection to the WSRM destination. This entails a TCP connection setup, a data transfer phase, and a TCP connection tear-down for each message. The TCP uses acknowledgements and a retransmission timeout (RTO) to detect IP packet loss and retransmit lost packets. In the three-way handshake in the connection setup phase the RTO starts with a value of 3 s, which doubles upon each timeout (exponential backoff). In the data transfer phase the timeout is adjusted based on round-trip time (RTT) observations [11]. The initial RTO timeout of 3 s in the connection setup phase causes TCP connections that experience packet loss during connection setup to stall for at least 3 s, because at least 3 s must elapse before the RTO expires and the TCP sender detects that an IP packet has been lost. The completion times of short connections are often strongly affected by these 3 s delays [16, 17, 18]. The messages sent here used a payload of 256 bytes, and thus were transmitted over short connections. We may therefore expect the 3 s delay during connection setup to have a significant impact on Effective Transmission Times (ETT).

[Figure 6: Statistical Properties of the ETT observations in S1: median, 95% and 99% quantiles, maximum, mean and standard deviation (ms) for each restart strategy.]
ETT/URC Measurements. Figure 6 presents ETT statistics for scenario S1. The graph shows average values obtained from the 5 runs for each strategy in each scenario, together with the 95% confidence intervals. We observe that, in regard to the standard statistical properties mean, median and standard deviation, all strategies within each scenario yielded similarly low completion times ETT. The low median (50% quantile), in particular, indicates that most transmissions finished very fast. The first difference between strategies that meets the eye is in the ETT maxima for S1. Here we see that Fixed Intervals result in maxima slightly above 3 s, measurements for Jacobson/Karn have maxima of about 2 s, and maxima for the QEST algorithm are at about 500 ms. This can be explained by the impact of the initial 3 s RTO. The Fixed Intervals algorithm restarts message transmission after 4 s. TCP connections that experience exactly one IP packet loss during connection setup are delayed by 3 s, and therefore do not trigger a restart with a restart interval of 4 s. Both Jacobson/Karn and QEST adjust the timeout based on observations of the Round-Trip Times (RTTs) of previous transmissions, which are typically much smaller than 3 s (indicated by the small median and mean values in Figure 6). Consequently, both strategies computed timeouts below 3 s, which resulted in restarts of those transmissions whose underlying TCP connections stalled during connection setup. However, the stalled connections also reached the destination, albeit some time later. The restarts performed by Jacobson/Karn and QEST therefore resulted in maximum URC values of 1 and 2, respectively, whereas with Fixed Intervals the maximum URC is zero in all runs in the S1 scenario. Results from scenario S1 show that restart improves completion times by avoiding waiting for very slow invocations.
[Figure 7: ETT properties in S2: median, 95% and 99% quantiles, maximum, mean and standard deviation (ms) for each restart strategy.]

[Figure 8: Properties of SAV observations for S1: median, 95% and 99% quantiles, maximum, mean and standard deviation (ms) for each restart strategy.]
On the other hand, scenario S2 (Figure 7) does not exhibit this difference in maximum ETT values. However, here we observe that the 99% quantile for the adaptive algorithms is below 3 s, whereas it is above 3 s for Fixed Intervals. This indicates that the adaptive algorithms still tend to avoid (by restarting them) transmissions that are affected by a single IP packet loss during connection setup. However, the 'shift' from the maximum to the 99% quantile also implies that there are ETTs above 3 s even for the Jacobson/Karn and QEST strategies.
This can be traced back to the exponential backoff upon timeout that both algorithms use: In the first scenario, fault episodes are, on average, very short (0.05 s). Thus it is unlikely that two subsequently sent messages both experience a 3 s delay due to IP packet loss in the TCP connection setup, i.e. it is unlikely that two timeouts are triggered in the restart algorithm without a fast transmission in between. In this case, the algorithm restarts the delayed message and increases the timeout exponentially. The next message (which is not delayed) completes without triggering the (now doubled) timeout, which causes the algorithm to reset the timeout to the lower value computed from RTT samples. In the higher-loss scenario, on the other hand, the average length of loss episodes is 1 s. Longer loss episodes affect several subsequently-sent messages, which results in the timeout elapsing several times in short succession. With each expiry the timeout is incremented exponentially, and thus it quickly grows large enough to let even very slow transmissions complete without restart. These very slow transmissions then account for the growing number of ETT samples above 3 s that we observe in the 'shift' of the quantiles.

Quantiles for the S2 scenario show a peculiar tendency: While for Jacobson/Karn and QEST the 99% quantile is considerably lower than for Fixed Intervals, the 95% quantile for QEST is more than double that of Fixed Intervals and Jacobson/Karn (204 ms and 193 ms vs. 430 ms for QEST). The 50% quantile, however, is very similar for all strategies (10 ms, 16 ms and 15 ms for Fixed Intervals, Jacobson/Karn and QEST). Jacobson/Karn's and QEST's lower 99% quantile implies that both algorithms tend to avoid long completion times by restarting. In fact, the observation that with Fixed Intervals the 99% quantile is at 3 s, the RTO timeout of the TCP, again indicates that large ETTs with Fixed Intervals are mainly due to its restart interval of 4 s. In general, both adaptive strategies avoid completion times of 3 s by restarting earlier. The larger 95% quantile for QEST, on the other hand, can be explained by an increase in system load that resulted in an overall increase of transmission times. This is illustrated by the distribution of ETTs for those messages where no restart occurred (i.e. URC_i = 0 for all ETT_i considered now), shown in Figure 10: For these messages, the ETT is the time taken by the first transmission, without direct interference of the restart algorithm. We observe that for non-restarted messages the 95% and 99% quantiles of the ETT obtained with the QEST algorithm are consistently higher than with Jacobson/Karn, i.e. QEST resulted in a larger number of slow non-restarted transmissions than Jacobson/Karn.³ This difference cannot be caused by a bad choice of restart interval for these messages, since the messages were not restarted. However, in this scenario maximum URC values are 1 for Fixed Intervals, 2 for Jacobson/Karn and 4 for QEST, i.e. for messages that were restarted QEST performed much more frequent restarts than the other strategies. These more frequent restarts increased the load on the network and on both the client and the server machines, thus slowing down all transmissions, including those that were not restarted.

³ The discussion in this paragraph excludes Fixed Intervals, because the distribution of non-restarted ETTs for this strategy is skewed by the 3 s peak caused by the TCP RTO (see above).
SAV Measurements. Figures 8 and 9 present overall statistics for the SAV metric. Here we observe a number of differences between the strategies: In the S1 scenario the Fixed Intervals algorithm did not obtain any savings at all (SAV = 0 in all runs). This can be explained by the timeout of 4 s, which was not triggered in these experiments, as evidenced by the maximum URC values of 0 for this scenario. In contrast, Jacobson/Karn and QEST reduced completion times by restarting. In the S2 scenario, Fixed Intervals also reduced completion times. However, both Jacobson/Karn and QEST achieved higher savings, because, due to their lower timeouts, both could restart slow transmissions earlier. In particular, the Fixed Intervals algorithm could not restart transmissions with a length of 3 s (see the discussion for ETT and URC above). Comparing both adaptive strategies, we see that, according to the 99% quantile, QEST is more successful at reducing the completion time. This is also reflected in the mean SAV values, which in the S2 scenario are 15.7 ms and 47.0 ms for Jacobson/Karn and QEST, respectively.

[Figure 9: Properties of SAV observations for S2: median, 95% and 99% quantiles, maximum, mean and standard deviation (ms) for each restart strategy.]

[Figure 10: Properties of ETT observations for message transmissions without restart in S2.]
5.4.2 Summary
We can thus summarise the discussion as follows: According to averages of the ETT and URC metrics, the strategies do not differ significantly. Compared to Fixed Intervals, the adaptive restart strategies do not necessarily reduce mean transmission times, and may in fact increase transmission times due to increased system load caused by restart. This means that within the tradeoff between timeliness and fairness all three strategies perform equally well, regardless of adaptation. Results for the first adaptivity metric are thus consistent with system behaviour. On the other hand, both Jacobson/Karn and QEST are more successful at avoiding very slow transmissions than restart at constant 4 s intervals. The second adaptivity metric correctly reflects this benefit of adaptive restart strategies.
5.4.3 Influence of the environment
With the second adaptivity metric (Figure 5) we see that adaptivity values depend on the scenario: Although the relative ordering of the strategies according to their adaptivity stays the same, in the higher loss level scenario the absolute values for the adaptivity of each strategy are also higher. Furthermore, adaptivity values are rather low for the Ad2 metric. This can be explained by a property of the metric: Ad2 is normalised to [0, 1] using a theoretical optimally adaptive strategy that obtains a payoff P 2 = 1 in every trial. This optimal strategy can only exist in scenarios where restart yields a significant reduction SAV in the completion time for every message. That is, for every single message the first transmission must have a larger completion time than one of the retransmissions. This was not the case here, where almost all messages were transmitted very fast on the first attempt. However, slower first transmissions occurred more
frequently with higher loss levels (as indicated by the higher URC values for S2 in Figure 5). More slow transmissions that can be sped up by restart bring the situation closer to one in which the optimal strategy described above may exist, and hence the Ad2 values increase for all algorithms.
6. CONCLUSION
In this work adaptivity was defined as the ability of a system to adjust itself to its environment, and to thereby improve its own performance. Based on a list of four desirable properties, an adaptivity metric framework was derived. Metrics defined using this framework measure adaptivity using the benefit of subsequent decisions. Given a payoff function bounded to the interval [0, 1], the abstract definition can be used as a framework to derive an adaptivity metric for a concrete application context.

A case study of the adaptivity of restart strategies in Web Services Reliable Messaging was performed. Two payoff functions for restart mechanisms were defined and used to derive two adaptivity metrics. The first payoff metric measures performance of restart algorithms in the tradeoff between timeliness and fairness. The second payoff metric focuses exclusively on the reduction in completion times obtained by restart. The metrics were employed in a fault-injection-driven evaluation of the adaptivity of three restart strategies for WSRM. According to the first adaptivity metric, all three strategies adapt equally well, while the second metric showed better adaptation in the QEST and Jacobson/Karn strategies than with Fixed Intervals, with QEST being more adaptive than Jacobson/Karn.

The discrepancy in the results depending on the adaptivity metric prompted a discussion of the performance of the restart strategies. It was found that the metrics indeed reflect system behaviour correctly. This highlights the importance of employing payoff metrics that measure relevant performance aspects of the system under study. Furthermore, it should be noted that the performance results differ from those presented in [18], where a different fault model was employed in the fault injection mechanism.
The effect of different fault models on adaptivity should be investigated in more detail.
7. ACKNOWLEDGEMENTS
Philipp Reinecke is supported by the German Science Foundation (DFG), grant number Wo-898/2-1.
8. REFERENCES

[1] S. Ali, A. A. Maciejewski, H. J. Siegel, and J.-K. Kim. Measuring the Robustness of a Resource Allocation. IEEE Transactions on Parallel and Distributed Systems, 15(7):630–641, 2004.
[2] The Apache Software Foundation: Apache Axis. http://ws.apache.org/axis/.
[3] The Apache Software Foundation: Apache Sandesha. http://ws.apache.org/sandesha/.
[4] BEA Systems, IBM, Microsoft Corporation Inc, and TIBCO Software Inc. Web Services Reliable Messaging Protocol (WS-ReliableMessaging), February 2005.
[5] D. G. Cacuci. Sensitivity & Uncertainty Analysis, Volume I: Theory. CRC Press, 2003.
[6] J. L. Hellerstein, Y. Diao, S. Parekh, and D. M. Tilbury. Feedback Control of Computing Systems. Wiley Interscience, 2004.
[7] J. Holland. Adaptation in Natural and Artificial Systems. The University of Michigan Press, 1975.
[8] W. Jiang and H. Schulzrinne. Modeling of Packet Loss and Delay and Their Effect on Real-Time Multimedia Service Quality. In Proc. NOSSDAV, 2000.
[9] P. Karn and C. Partridge. Improving Round-Trip Time Estimates in Reliable Transport Protocols. ACM Transactions on Computer Systems, 9(4):364–373, November 1991.
[10] H. Khlifi and J.-C. Grégoire. Low-complexity offline and online clock skew estimation and removal. Computer Networks: The International Journal of Computer and Telecommunications Networking, 50(11):1872–1884, 2006.
[11] B. Krishnamurthy and J. Rexford. Web Protocols and Practice. Addison Wesley, 2001.
[12] J. F. Meyer. Performability modeling: Back to the future? In Proceedings of the 8th International Workshop on Performability Modeling of Computer and Communication Systems, pages 5–9. CTIT, 2007.
[13] S. B. Moon, P. Skelly, and D. Towsley. Estimation and Removal of Clock Skew from Network Delay Measurements. Technical report, University of Massachusetts, Amherst, MA, USA, 1998.
[14] V. Paxson. End-to-End Internet Packet Dynamics. In Proceedings of the ACM SIGCOMM '97 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communication, volume 27(4) of Computer Communication Review, pages 139–154, Cannes, France, September 1997. ACM Press.
[15] R Development Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2006. ISBN 3-900051-07-0.
[16] P. Reinecke, A. P. A. van Moorsel, and K. Wolter. A Measurement Study of the Interplay Between Application Level Restart and Transport Protocol. In M. Malek, M. Reitenspieß, and J. Kaiser, editors, ISAS, volume 3335 of Lecture Notes in Computer Science, pages 86–100. Springer, 2004.
[17] P. Reinecke, A. P. A. van Moorsel, and K. Wolter. Experimental Analysis of the Correlation of HTTP GET Invocations. In A. Horváth and M. Telek, editors, EPEW, volume 4054 of Lecture Notes in Computer Science, pages 226–237. Springer, 2006.
[18] P. Reinecke, A. P. A. van Moorsel, and K. Wolter. The Fast and the Fair: A Fault-Injection-Driven Comparison of Restart Oracles for Reliable Web Services. In QEST '06: Proceedings of the 3rd International Conference on the Quantitative Evaluation of Systems, pages 375–384, Washington, DC, USA, 2006. IEEE Computer Society.
[19] H. Sanneck, G. Carle, and R. Koodli. A framework model for packet loss metrics based on loss runlengths. In Proceedings of the SPIE/ACM SIGMM Multimedia Computing and Networking Conference 2000, San Jose, CA, January 2000. SPIE/ACM SIGMM.
[20] A. van Moorsel and K. Wolter. Analysis and Algorithms for Restart. In Proc. 1st International Conference on the Quantitative Evaluation of Systems (QEST), pages 195–204, Twente, The Netherlands, September 2004.
[21] A. P. A. van Moorsel and K. Wolter. Analysis of Restart Mechanisms in Software Systems. IEEE Transactions on Software Engineering, 32(8), August 2006.
[22] M. Varela, I. Marsh, and B. Grönvall. A systematic study of PESQ's behavior (from a networking perspective). In Proceedings of the 5th International Conference MESAQIN 2006: Measurement of Audio and Video Quality in Networks, 2006.
[23] U. Zdun, M. Völter, and M. Kircher. Pattern-Based Design of an Asynchronous Invocation Framework for Web Services. Int. J. Web Service Res., 1(3):42–62, 2004.
[24] Y. Zhang, N. Du, V. Paxson, and S. Shenker. On the Constancy of Internet Path Properties. In Proceedings of the ACM SIGCOMM Internet Measurement Workshop, 2001.
[25] Y. Zhang, V. Paxson, and S. Shenker. The Stationarity of Internet Path Properties: Routing, Loss, and Throughput. ACIRI Technical Report, 2000.