Approximating matrix-exponential distributions by global randomization¹

Appie van de Liefvoort
School of Computing and Engineering
University of Missouri - Kansas City
Kansas City, MO 64110, USA
[email protected]

Armin Heindl
Computer Networks and Communication Systems
University of Erlangen-Nuremberg
91058 Erlangen, Germany
[email protected]

Abstract

Based on the general concept of randomization, we develop linear-algebraic approximations for continuous probability distributions that involve the exponential of a matrix in their definitions, such as phase-type and matrix-exponential distributions. The approximations themselves result in proper probability distributions. For such a global randomization with the Erlang-k distribution, we show that the sequences of true and consistent distribution and density functions converge uniformly on [0, ∞). Furthermore, we study the approximation errors in terms of the power moments and the coefficients of the Taylor series, from which the accuracy of the approximations can be determined a priori. Numerical experiments demonstrate the feasibility of the presented randomization technique – also in comparison with uniformization.

1

Introduction

Literally dozens of methods have been proposed for the computation of the exponential of a matrix. The stability and efficiency of various such methods are discussed in the classical paper by Moler and Van Loan [25] and we refer to its recent update [26] for newer developments like Krylov subspace techniques. Often, the particular setting imposes a structure on the matrix, which may be exploited by dedicated methods with advantages over general-purpose solvers. For example, uniformization (and its variants [17, 18, 1, 27]) and Krylov subspace techniques (e.g., [28]) have mostly been applied to the transient analysis of continuous-time Markov chains (CTMCs), where the focus is on computing a probability vector. Many of the methods are compared in [30, 12], especially for large sparse systems and including stiff situations in performability and reliability modeling. In this paper, we focus on the efficient approximation of scalar probability distribution functions (and their densities), whose definitions involve the exponential 1

¹ This work was supported in part by the US NSF under grant ANI 0106640, by UMKC under a Faculty Research Grant, and by the DFG under grant HE3530/1, while Armin Heindl was with SCE at UMKC.


of a matrix. The approximants should themselves be true distributions with pertinent densities and should at least approximately preserve other important properties (like moments). Thus, we require that the approximants be nonnegative, that the density integrate to one, that the derivative of the distribution function yield the density, etc. Preferably, these expressions can be evaluated by symbolic algebra systems in exact arithmetic over the rational numbers. Moreover, we desire uniform convergence on [0, ∞) for an approximating sequence of distributions/densities. We consider randomizing the exponential of a matrix in a global sense. Unlike uniformization (or Jensen’s method or local randomization), which has been designed to cope with huge matrices, our randomization technique is mainly intended for moderately sized matrices (which are common in performance modeling with matrix-exponential or phase-type distributions), and we study the qualitative properties of the (scalar) density/distribution approximants. By making use of rational Laplace-Stieltjes transforms (LSTs) with matrix arguments, we derive (approximate) closed-form algebraic expressions for the exponential of a matrix and also for related distribution functions and densities. In its general form, the proposed global randomization technique thus relies on the randomizing function to have a rational LST. This is best exploited by matrix-exponential (ME) distributions, which as a class are equivalent to the class of distributions with a rational LST. For the distributions to be randomized globally, we also consider ME distributions. For the randomizing distribution function, any ME distribution with a sufficiently low squared coefficient of variation may serve as a promising candidate to form approximating functions, which are themselves true density/distribution functions (after normalization).
For global randomization with Erlang-k distributions, we prove that these approximations converge uniformly on [0, ∞) (for increasing k) and derive explicit expressions for the convergence of Taylor series coefficients and power moments. Both types of expressions lend themselves to an inverse error analysis, i.e., allow us to predetermine the (single) parameter of the approximation for any given error tolerance. The concept of randomization, or more specifically Erlangization, is not new and has been applied in various fields of stochastic modeling, like approximating transition probabilities in CTMCs [29, 3], finite-time ruin probabilities [4] and very recently fluid queues [20]. We have not seen a related discussion with respect to approximations of densities/distribution functions in the literature. The approximants proposed in this paper are useful for the generation of random numbers and the computation of percentiles or probabilities (especially in the tail region of ME or phase-type distributions) in an efficient and reliable (i.e., with error bounds) way. Waiting time distributions of queueing systems, whose percentiles are relevant performance measures in applications, often have an ME representation (e.g.,

see [5] for the GI/ME/1 queue). Such ME representations may also arise from moment-fitting [21]. The paper is organized as follows. Section 2 briefly reviews ME distributions and their properties. In Section 3, we discuss in more technical detail related methods, like uniformization, and their shortcomings (or benefits) in our setting. Section 4 presents the proposed randomization technique, while we prove several convergence properties for the Erlang randomizing function in Section 5. Inverse error analysis and complexity issues are considered in Section 6. Finally, we give numerical experiments in Section 7, followed by concluding remarks.

2

Matrix Exponential Distributions

Matrix Exponential (ME) distributions are probability distributions whose distribution functions and densities are defined by

F(t) = 1 − p exp(−tB) e′   for t ≥ 0,   (1)

and

f(t) = p exp(−tB) B e′   for t ≥ 0,   (2)

respectively. The triple ⟨p, B, e′⟩ is called the representation of the distribution F(t) and consists of a 1-by-m vector p, called the starting vector, an m-by-m matrix B, called the progress rate matrix, and an m-by-1 vector e′ (we use the prime to denote the transpose). The minus sign in exp(−tB) indicates a natural generalization from a scalar exponential distribution to a vector representation. The class of ME distributions includes a number of well-known distributions, in particular all phase types, and has similar closure properties regarding mixtures and convolutions. The n-th power moment of the ME distribution is given by

E[Xⁿ] = n! p B⁻ⁿ e′.   (3)

ME distributions are the continuous-time distributional model of Linear-Algebraic Queueing Theory (LAQT, [23]). There has been growing interest in LAQT and its applications in recent years, and many queueing results are available (see, e.g., [5, 6, 8, 24]). Obviously, the class of phase-type (PH) distributions, which also follows a matrix formalism, bears a close resemblance. When the ME representation can be interpreted as the time till absorption in a CTMC with transient states, it naturally is of phase type, i.e., p is the initial-state probability vector of the transient Markov chain, B is the negative of the infinitesimal generator matrix restricted to the transient states, and e′ is a vector of all ones. Generally, however, the elements of p, B and e′ need

not have a probabilistic interpretation and are only subject to the requirement that F(t) must be a probability distribution function. The resulting freedom in algebraic manipulation opens the way for finding minimal representations, for solutions to inverse problems (e.g., moment matching) and for performing similarity transformations [21]. For any nonsingular m-by-m matrix T (with inverse T⁻¹), the triple ⟨pT, T⁻¹BT, T⁻¹e′⟩ represents the same distribution as the triple ⟨p, B, e′⟩. Thus, a representation is not unique (which is true also for phase types). The order of the representation is m, and the degree of the distribution is the minimal order over all possible representations. As an example, where B has no internal structural properties, we consider the so-called companion form [22, 5]. Assume the following rational Laplace-Stieltjes transform (LST) of a distribution function,

F*(s) = ∫₀^∞ e^{−st} dF(t) = (a₁s^{m−1} + a₂s^{m−2} + … + a_m) / (s^m + b₁s^{m−1} + … + b_m),   (4)

where the aᵢ and bᵢ are real coefficients. Furthermore, if F(t) has no jump at t = 0, then the degree of the numerator is strictly less than that of the denominator, which is assumed in the remainder. We may now define matrix B (based on the denominator of the companion form) and vectors p and e′ as follows:

[B in companion form, with the denominator coefficients of (4) as entries; p and e′ formed from the numerator coefficients]   (5)

With these definitions, the distribution F(t) is ME with representation ⟨p, B, e′⟩ and LST (4). This shows that the class of ME distributions is identical to the class of distributions whose LSTs exist and are rational. Sparse representation (5) is of minimal degree if the numerator and denominator of the LST have no common factors. Note that it is not always possible to construct a triple corresponding to this distribution function such that it is of phase type. A density in the difference set of ME and PH distributions is one with zeros on (0, ∞) – e.g., proportional to (1 − cos ωt)e^{−t} for some ω > 0, which has a rational LST – because PH densities must be strictly positive for all t > 0. We give two alternative representations – one of them being the companion form. Both representations use the same p-vector, but have different B and e′:

[two order-3 representations omitted]   (6)

With its squared coefficient of variation, this ME distribution undershoots the minimal squared coefficient of variation of PH distributions of order 3 (but there are other ME distributions of order 3 with an even smaller one). For queueing analysis, the desired performance measures are often expressed in simple terms of the constituents p, B and e′, possibly involving matrix inverses. Thus, the ME representation of the distribution is often sufficient, and the actual evaluation of the matrix exponential in ME distributions is rarely needed. Only for the computation of percentiles, the generation of random numbers, and other special circumstances must reliable numerical values be found for several different values of t, and fast computation is desired. Furthermore, teletraffic blocking-type applications often demand highly accurate computation (or at least computation with a known error bound) in the tail region of waiting/response time distributions. Thus, the interest of this study is to approximate the matrix exponential exp(−tB) under the restriction that the resulting approximation of F(t) is still a probability distribution. The corresponding density approximation would then be expected to be close (in some sense) to f(t).
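As an illustration of these definitions, the following sketch evaluates (1)-(3) numerically and checks the invariance under a similarity transformation. The Erlang-3 triple (rate 3, unit mean) is a hypothetical running example of ours; being of phase type, it is also ME and has a closed-form cdf to compare against.

```python
import numpy as np
from scipy.linalg import expm
from math import factorial, exp

# Hypothetical Erlang-3 example triple <p, B, e'> (rate 3, unit mean).
p = np.array([1.0, 0.0, 0.0])                 # 1-by-m starting vector
B = np.array([[3.0, -3.0,  0.0],              # m-by-m progress rate matrix
              [0.0,  3.0, -3.0],
              [0.0,  0.0,  3.0]])
e = np.ones(3)                                # m-by-1 vector e'

def F(t):                                     # distribution function, eq. (1)
    return 1.0 - p @ expm(-t * B) @ e

def f(t):                                     # density, eq. (2)
    return p @ expm(-t * B) @ B @ e

def moment(n):                                # n-th power moment, eq. (3)
    return factorial(n) * p @ np.linalg.matrix_power(np.linalg.inv(B), n) @ e

# Non-uniqueness: <pT, T^{-1}BT, T^{-1}e'> represents the same distribution.
T = np.array([[2.0, 1.0, 0.0], [0.0, 1.0, 1.0], [1.0, 0.0, 3.0]])
Ti = np.linalg.inv(T)
p2, B2, e2 = p @ T, Ti @ B @ T, Ti @ e
F2 = lambda t: 1.0 - p2 @ expm(-t * B2) @ e2
```

For this example, F(t) agrees with the closed-form Erlang-3 cdf and F2 agrees with F at every t, although the transformed triple has no probabilistic interpretation.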

3

Related work

The explicit mention of the similarity transforms in Section 2 points to the use of matrix decomposition methods (methods 14 to 18 in [26]) to compute the exponential of a matrix in (1) and (2), i.e., by transforming matrix B into a form more suitable for exponentiation. While these methods may be very efficient, especially for large symmetric B and when results for many different t are required, not much can be said about the collective properties of the approximated density/distribution points. Nevertheless, many PH/ME distributions are already given in a particularly favorable form (Hessenberg, triangular, tridiagonal, symmetric, etc.), so that tailored algorithms become computationally very attractive. Actually, the companion form (5) is such a representation, which is discussed in [26] as method 13. Since the characteristic polynomial ensues directly from the entries of (5), the polynomial method based on the Cayley-Hamilton theorem (method 8) has some appeal


here². In all of these cases, truncation of infinite series or termination of iterations prevents studying the qualitative (i.e., stochastic) properties of the scalar approximations when considered as functions of t. On the other hand, one should keep in mind that for classes of small matrices – matrices of dimensions less than 5-by-5 are not uncommon for ME distributions – explicit expressions for the matrix exponential have been developed [7, 10] and may prove useful. We will now briefly discuss other general methods to compute the exponential of a matrix with respect to their (in)ability to meet our requirements that the approximations of F(t) and f(t) be nonnegative functions and that sequences thereof converge uniformly on [0, ∞). Except for uniformization for phase-type distributions, the methods known to the authors do not warrant the requirement of nonnegativity. In particular, this is true for all ODE solvers (also see methods 5 to 7 in [26]). Popular Padé approximations (method 2 in [26]) may lead to negative values even for the scalar exponential. Methods based on results in approximation theory often use a sequence of orthogonal functions (generally without probabilistic interpretation) that alternate around the original function (Weierstrass, so-called best approximations). This virtually guarantees that fluctuating values in the tail become negative. The strategy of “zeroing” negative values with renormalization, as proposed for some techniques, is unacceptable in our setting, where p exp(−tB) tends to the null vector in the limit t → ∞. As concerns uniform convergence on [0, ∞), Chebyshev rational approximations ([11], method 4) and modified diagonal Padé approximations [19] achieve this for the scalar exponential. While the efficient Chebyshev rational approximations minimize the L∞-norm, modified diagonal Padé approximations [19] deliver good, but slightly undershooting, estimates of the L∞-error. Besides the drawback of possibly negative values for both methods, it does not appear trivial to obtain matricial error bounds.
And even if obtainable, measures based on matrix norms would not be appropriate for distribution functions. In applied probability, including performability/dependability analysis and the transient analysis of CTMCs, uniformization [17, 18] is probably the most popular approach to computing the matrix exponential (or, more efficiently, its product with a vector) when the exponent is a stochastic rate matrix (or a submatrix thereof). It is easy to implement, flexible and performs well in a wide array of problems. Several variants, e.g., with scaling and squaring [1] and adaptive uniformization [27], have been proposed. A numerical comparison of uniformization methods for CTMCs can be found in [12].

² Generally, i.e., when the companion form has to be found first, the class of polynomial methods (methods 8 to 13) is considered inefficient.


Uniformization is based on writing exp(−tB) as the power series

exp(−tB) = Σ_{n=0}^{∞} e^{−λt} ((λt)ⁿ / n!) Pⁿ,   (7)
where P = I − B/λ is elementwise nonnegative (for a suitably large λ) and λ is at least the largest diagonal element of B. If B is the (negative) generator of a CTMC, P can be interpreted as a transition probability matrix of an associated DTMC, which is embedded in a Poisson process with rate λ. Parameter λ, by which all the rates in matrix B are divided, is also called the uniformization rate. To emphasize the role of the Poisson process in (7), uniformization is also called randomization (not in this paper though!). This probabilistic interpretation for Markov chains has primarily contributed to the popularity of uniformization. In fact, the then ensuing nonnegativity of all involved elements warrants numerical stability. In implementations that compute the matrix exponential with this method, the series in (7) needs to be truncated. Depending on the actual values of the Poisson probabilities, truncation may – and especially for large λt should – be performed in both directions using an algorithm by Fox/Glynn [14]. For PH (but not ME) representations, the entries of Pⁿ lie in [0, 1] for all n, and truncation bounds can thus be determined so that the error remains below an arbitrary given error bound (for the considered t). General shortcomings of uniformization are that it does not perform well for stiff systems (e.g., when compared to ODE solvers) and may require many vector-matrix multiplications if λt is large (large upper truncation bound). In our setting, where B is the progress rate operator of a matrix-exponential distribution, the method would have to be generalized to retain some of the favorable properties. In any case, the probabilistic interpretation – along with numerical stability – will be lost when representation ⟨p, B, e′⟩ is not of phase type. Uniformization mainly aims at providing a solution for a specific time t, i.e., it delivers pointwise convergence. Properties of the resulting collection of limit points as a function of t are usually not studied. In the light of truncated series expressions, this is a difficult task.
However, when random variates are to be generated, the series truncation would not result in accurate (underlying) moments. In the following sections, we propose to use global randomization to construct approximants which are themselves distributions/densities (after normalization), as functions of t. Technically, the approach to be presented is inverse to uniformization: instead of starting from an exact expression for the exponential of a matrix (the series in (7)) with subsequent approximation (truncation), a randomized (and thus approximate) expression will be derived, which can be arbitrarily refined.
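For later comparison, a minimal uniformization sketch for a PH progress rate matrix is given below. It uses a plain Poisson-mass truncation rather than the Fox/Glynn algorithm, and the Erlang-3 matrix is a hypothetical example of ours.

```python
import numpy as np
from math import exp

# Sketch of uniformization (7) for a PH progress rate matrix B (the negative
# of a CTMC subgenerator): exp(-tB) = sum_n e^{-lt}(lt)^n/n! P^n, P = I - B/l.
# Truncation: accumulate terms until the remaining Poisson mass is below eps.
def uniformize(B, t, eps=1e-12):
    B = np.asarray(B, dtype=float)
    lam = B.diagonal().max()            # uniformization rate
    P = np.eye(len(B)) - B / lam        # DTMC transition matrix (P >= 0)
    Pn = np.eye(len(B))                 # running power P^n
    w = exp(-lam * t)                   # Poisson weight e^{-lt}(lt)^n/n!
    acc, mass, n = w * Pn, w, 0
    while mass < 1.0 - eps:             # stop once the Poisson tail <= eps
        n += 1
        Pn = Pn @ P
        w *= lam * t / n
        acc += w * Pn
        mass += w
    return acc

B = np.array([[3.0, -3.0, 0.0], [0.0, 3.0, -3.0], [0.0, 0.0, 3.0]])
p, e = np.array([1.0, 0.0, 0.0]), np.ones(3)
```

Since the entries of Pⁿ are in [0, 1] for a PH representation, the truncated Poisson mass directly bounds the elementwise error, as described above.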


4

The global randomization method

In randomization in the general sense, values³ of distribution function and density are actually not computed for the desired fixed t, but for a randomized t, i.e., they are computed as expectations of F and f over a random location governed by a probability distribution R(y). We call R(y) the randomizing distribution function or, in short, the randomizer. We denote the n-th power moments of the distributions F and R by m_n and r_n, respectively, and their squared coefficients of variation by c_F² and c_R². We assume that the randomizer has mean r₁ = t. Otherwise, the only assumption we make about the randomizer is that it is an ME distribution.

The general technique

Randomization approximates the term exp(−tB) (in F(t) and f(t)) by ∫₀^∞ exp(−yB) dR(y). Assuming that integration and summation can be interchanged, we have

∫₀^∞ exp(−yB) dR(y) = ∫₀^∞ Σ_{n=0}^∞ ((−yB)ⁿ/n!) dR(y) = Σ_{n=0}^∞ ((−B)ⁿ/n!) r_n.

The error in the approximating matrix is

exp(−tB) − Σ_{n=0}^∞ ((−B)ⁿ/n!) r_n = Σ_{n=2}^∞ ((−B)ⁿ/n!) (tⁿ − r_n).   (8)

This approximation is exact when R is the deterministic distribution with mean t. More importantly, if R is not the deterministic distribution, then r₂ > t² and the approximation must necessarily be of order c_R². If R is part of a sequence of distributions converging to the deterministic distribution, the corresponding sequence of approximations attempts to simultaneously reduce all coefficients in the Taylor series of the approximation error. Let the LST of the randomizing distribution R exist and be rational, i.e., R is a matrix-exponential distribution itself. It is straightforward to derive that ∫₀^∞ exp(−yB) dR(y) = R*(B) (see equation (4) with B replacing s),

³ Throughout the paper, we will use argument t for the distribution and density that are randomized, and argument y for randomizing distributions and densities.


where the coefficients of the numerator and denominator polynomials depend on the randomizing function R(y). When replacing exp(−tB) by R*(B) in the definitions of F(t) and f(t), we obtain the corresponding approximations F^R and f^R:

F^R(t) = 1 − p R*(B) e′

and

f^R(t) = p R*(B) B e′.

By construction, F^R(t) is between 0 and 1 and f^R(t) is nonnegative. Both the distribution and the density are approximated by an expression involving only a few matrix operations (addition, inversion, and vector multiplications). The computational cost in flops is of order m³, where the evaluation of a density value requires a few more (multiplicative) operations than that for a distribution value. There are a number of observations to be made, however. Even though F^R(t) and f^R(t) approximate the distribution F(t) and the density f(t), some properties of F and f are lost. While guaranteed to be nonnegative, F^R and f^R may not be distributions or densities themselves. For example, it is not implied that F^R(t) is an increasing function in t. Furthermore, f^R(t) is generally no longer the derivative of F^R(t), and the moments of the approximations may not approximate the moments of the original distributions for arbitrary randomizers. Take for example the exponential distribution as the randomizing function. The resulting approximation for exp(−tB) is now (I + tB)⁻¹, i.e., identical to the (0,1)-Padé approximation (with matricial argument tB). For this case, f^R is not the derivative of F^R. Furthermore, the expectation of f^R does not exist (i.e., is infinite) regardless of the distribution being randomized; nor does the integral of f^R exist, i.e., the total probability would be infinite. These problems for global randomization may be remedied by choosing other randomizing distributions. In the next section, we show that Erlang-k randomizers allow us to approximate the density in such a way that the renormalized approximation is again a probability density.
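The failure of the exponential randomizer can be seen numerically: the resolvent expression p (I + tB)⁻¹ B e′ is nonnegative but decays only like 1/t, so its integral (and hence its mean) diverges. A small sketch, with our hypothetical Erlang-3 triple as the distribution being randomized:

```python
import numpy as np

# Exponential randomizer: exp(-tB) is replaced by the resolvent (I + tB)^{-1}.
# The resulting "density" p (I + tB)^{-1} B e' has a 1/t tail: t * f_tilde(t)
# tends to a positive constant, so the total probability mass is infinite.
p = np.array([1.0, 0.0, 0.0])
B = np.array([[3.0, -3.0, 0.0], [0.0, 3.0, -3.0], [0.0, 0.0, 3.0]])
e = np.ones(3)

def f_tilde(t):
    return p @ np.linalg.inv(np.eye(3) + t * B) @ B @ e

for t in (1e2, 1e3, 1e4):
    print(t, t * f_tilde(t))   # t * f_tilde(t) approaches a constant
```

The identity tB(I + tB)⁻¹ = I − (I + tB)⁻¹ shows that t·f_tilde(t) → p e′ = 1, confirming the 1/t tail.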

5

Randomizing with the Erlang-k distribution

The deterministic distribution is often approximated by Erlang distributions, which have minimal squared coefficients of variation in the class of phase types of identical order [2]. Such Erlangizations have successfully been applied in various fields

of stochastic modeling (to approximate transition probabilities in CTMCs [29, 3], in risk models [4] and fluid queues [20]). Let R be an Erlang-k distribution with mean t. Its LST is R*(s) = (1 + st/k)^{−k}; thus we have

exp(−tB) ≈ R*(B) = (I + (t/k)B)^{−k}

and, for t ≥ 0,

f(t) ≈ p (I + (t/k)B)^{−k} B e′.

This approximation for the density is not a probability density itself, as the total mass is k/(k−1), but after normalization and integration global randomization results in a proper density/distribution pair: for t ≥ 0

f^E(t) = ((k−1)/k) p (I + (t/k)B)^{−k} B e′   (9)

and for t ≥ 0

F^E(t) = 1 − p (I + (t/k)B)^{−(k−1)} e′,   (10)

where superscript E indicates affiliation to approximants by means of the Erlang randomizer. Computations of F^E(t) are best accomplished when k − 1 is a power of 2, i.e., k − 1 = 2^j for some j, and then cost about j matrix-matrix multiplications (i.e., O(j·m³) multiplicative flops), once (I + (t/k)B)⁻¹ has been determined (itself at most of order m³). Analogously, computations of f^E(t) cost O(j·m³) flops when k = 2^j for some j. It is known that the sequence of Erlang-k distributions converges fairly slowly to the deterministic distribution, as the squared coefficient of variation is 1/k. Recall that c_R² also occurs in the leading term of the error series in (8). Nevertheless, this appears not to be such a great factor, as the cost is essentially logarithmic in k, i.e., linear in j = log₂ k, as compared with O(k), which is the usual cost for working with the Erlang-k distribution in applied probability. Intuition says that if the function F(t) or f(t) is known to be “slowly varying” in the neighborhood of t, then the method is likely to be very successful for relatively small k. Let N be the number of different values t at which the matrix exponential needs to be computed; then the total effort is of complexity O(N·m³·log₂ k).
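The computation just described can be sketched as follows, assuming approximants of the resolvent-power form in (9) and (10) with normalization constant (k−1)/k; the function name and the Erlang-3 example triple are ours:

```python
import numpy as np

# Sketch of the Erlang-k randomization approximants (9) and (10):
#   f_E(t) = ((k-1)/k) * p (I + (t/k)B)^{-k} B e'
#   F_E(t) = 1 - p (I + (t/k)B)^{-(k-1)} e'
# computed by one matrix inversion followed by repeated squaring
# (np.linalg.matrix_power uses binary exponentiation internally).
def erlang_randomized(p, B, e, t, k):
    m = len(B)
    R = np.linalg.inv(np.eye(m) + (t / k) * B)     # (I + (t/k)B)^{-1}
    f_approx = (k - 1) / k * (p @ np.linalg.matrix_power(R, k) @ B @ e)
    F_approx = 1.0 - p @ np.linalg.matrix_power(R, k - 1) @ e
    return F_approx, f_approx

p = np.array([1.0, 0.0, 0.0])                      # hypothetical Erlang-3 triple
B = np.array([[3.0, -3.0, 0.0], [0.0, 3.0, -3.0], [0.0, 0.0, 3.0]])
e = np.ones(3)
Fk, fk = erlang_randomized(p, B, e, t=1.0, k=2**10)
```

With k = 2¹⁰, both values already agree with the exact Erlang-3 cdf and density at t = 1 to roughly three decimal places, consistent with the O(1/k) error discussed below.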

Properties of density and distribution approximants for Erlang randomizers

The following lemmas study properties of the approximants f^E and F^E, equations (9) and (10). In particular, we will give explicit relationships between the moments

(and Taylor series coefficients) of the true and the approximating distribution, as well as error bounds in terms of derivatives.

Lemma 5.1 The n-th power moment of f^E and its squared coefficient of variation are

E[(X^E)ⁿ] = n! p B⁻ⁿ e′ · kⁿ / ((k−2)(k−3)⋯(k−n−1))   for 1 ≤ n ≤ k−2,   (11)

and

(c^E)² = (c² + 1)(k−2)/(k−3) − 1   for k ≥ 4.

Proof 5.1 The moment formula follows from the scalar integral ∫₀^∞ yⁿ(1 + ay)^{−k} dy = n!/(a^{n+1}(k−1)(k−2)⋯(k−n−1)) (with a > 0 real), combined with (3). The moments are finite if n ≤ k−2 (and undefined otherwise); for k ≥ 4, the first two moments are finite, and the squared coefficient of variation is computed from (c^E)² = E[(X^E)²]/(E[X^E])² − 1.

Case k = 1 has been briefly discussed in Section 4. For k = 2, all power moments (n ≥ 1) are undefined. For k = 3, the squared coefficient of variation of the approximating distribution is infinite. Finally, for all values of k ≥ 4, (c^E)² systematically overestimates the true (finite) value of the squared coefficient of variation. The moment behavior is more or less expected. If an exponentially decaying density is to be approximated with the guarantee that the approximation is nonnegative, then the approximating function will have a heavier-than-exponential tail, which is reflected in the higher moments ultimately being infinite. Indeed, for any given ME distribution, we have constructed a sequence of ‘Pareto’-like distributions that converges to the given ME distribution.

Lemma 5.2 The n-th power moment of f^E converges linearly in 1/k (and thus exponentially in j) to the n-th power moment of f as k → ∞.

Proof 5.2 The ratio of the two moments is, for n ≤ k−2,

kⁿ / ((k−2)(k−3)⋯(k−n−1)) = 1 + O(1/k).

The equality is shown by induction and repeated conversion to partial fractions, so that the leading 1/k-term can be read off.

Let c_n^F be the coefficient of tⁿ in the Taylor series expansion of F(t) (and similar definitions for the other functions). Under our assumption that p e′ = 1, we have F(0) = F^E(0) = 0 as well as f^E(0) = ((k−1)/k) f(0). The following expressions can be derived:

Lemma 5.3 The coefficients in the Taylor series of F^E and f^E are, for n ≥ 1,

c_n^{F,E} = c_n^F · (k−1)k(k+1)⋯(k+n−2) / kⁿ   (12)

and

c_n^{f,E} = c_n^f · (k−1)k(k+1)⋯(k+n−1) / k^{n+1}.   (13)

Proof 5.3 First, note that the binomial series of (I + (t/k)B)^{−(k−1)} and (I + (t/k)B)^{−k} both start with the constant term I (the last equalities also hold in the limit k → ∞). Differentiation of equations (10) and (9) and manipulations analogous to the previous proof yield the equalities of the lemma. The “shifted” derivatives of distribution and density are reflected in the formulae for the respective coefficients of the Taylor series c_n^{F,E} and c_n^{f,E}.

This gives us the opportunity to find the asymptotic order of the differences F(t) − F^E(t) and f(t) − f^E(t) as k (and thus j) grows large.

Lemma 5.4 The differences between the distribution (density) and its approximation are

F(t) − F^E(t) = (1/k) ( t f(t) − (t²/2) f′(t) ) + O(1/k²)   (14)

f(t) − f^E(t) = (1/k) ( f(t) − (t²/2) f″(t) ) + O(1/k²).   (15)

Proof 5.4 The asymptotic expressions are immediate consequences of the Taylor series with the coefficients of Lemma 5.3 and the relationships

f(t) = p e^{−tB} B e′,   f′(t) = −p e^{−tB} B² e′,   f″(t) = p e^{−tB} B³ e′,

respectively, for F and f.

We have constructed a sequence of probability distributions with their densities that converge uniformly on [0, ∞) to the original distribution (density). Uniform convergence on any finite interval follows directly from Lemma 5.4, as all derivatives are continuous functions for ME distributions. To see uniform convergence on [0, ∞), recall that the right-hand sides of (14) and (15) are composed of a finite number of (trigonometric) terms of type tʲe^{−at} (a > 0), which are bounded on [0, ∞). Thus, we can find a k so that F(t) − F^E(t) (or f(t) − f^E(t)) is below a given uniform threshold for all t. Note the difference to uniformization, where different truncation bounds apply for each t. The approximation error depends on the first and second derivatives of the distribution (or density), indicating that for smooth functions it is an excellent approximation. Only for functions with high values of these derivatives should we have to use a high value of k. From (11), (12) and (13), we deduce that the relative errors of the approximate moments and Taylor series coefficients do not depend on the specific distribution (but only on n and k). Regarding the moments, the relative errors are even identical irrespective of whether the distribution function or density is approximated. We will exploit this observation in the next section for an inverse error analysis.
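The O(1/k) rate of Lemma 5.4 can also be checked empirically: doubling k should roughly halve the maximal deviation of the distribution approximant. A sketch assuming the resolvent-power form of (10), with our Erlang-3 example and helper names:

```python
import numpy as np

# Empirical check of the O(1/k) rate: the maximal deviation between the
# Erlang-k distribution approximant (10) and the true cdf should roughly
# halve when k is doubled.  Erlang-3 (rate 3) example.
p = np.array([1.0, 0.0, 0.0])
B = np.array([[3.0, -3.0, 0.0], [0.0, 3.0, -3.0], [0.0, 0.0, 3.0]])
e = np.ones(3)

def F_true(t):                       # closed-form Erlang-3 cdf, rate 3
    return 1.0 - np.exp(-3 * t) * (1 + 3 * t + (3 * t) ** 2 / 2)

def F_E(t, k):                       # approximant (10)
    R = np.linalg.inv(np.eye(3) + (t / k) * B)
    return 1.0 - p @ np.linalg.matrix_power(R, k - 1) @ e

grid = np.linspace(0.01, 6.0, 200)
err = {k: max(abs(F_E(t, k) - F_true(t)) for t in grid) for k in (256, 512, 1024)}
print(err[256] / err[512], err[512] / err[1024])   # both ratios close to 2
```

Unlike uniformization, a single k controls the error uniformly over the whole grid, which is exactly the point of Lemma 5.4.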

6

Inverse error analysis and complexity issues

Inverse error analysis implies that the approximating densities/distribution functions can be computed with a prescribed precision ε. Naturally, for uniformly converging functions, one may attempt to bound the difference terms between approximating and true function for arbitrary t a priori. But we will also consider another approach for an inverse error analysis with some practical appeal: here, the first n moments of the true distribution are captured with a predefined precision.


In global randomization, the error terms in (14) and (15) allow in principle to bound the difference terms for distribution functions and densities, respectively:

|F(t) − F^E(t)| ≈ (1/k) |K^F_err(t)|,   K^F_err(t) = t f(t) − (t²/2) f′(t),   (16)

|f(t) − f^E(t)| ≈ (1/k) |K^f_err(t)|,   K^f_err(t) = f(t) − (t²/2) f″(t).   (17)

Theoretically, if it is required to uniformly approximate the true density/distribution function within an ε-strip on [0, ∞), and since K^F_err and K^f_err are bounded, one only needs to determine k so that

(1/k) max_{t ≥ 0} |K_err(t)| ≤ ε.   (18)

Finding the involved maxima may not be easy in general (though using approximations (10) and (9) instead of the true functions may help). For hyperexponential distributions, the maxima can usually be found easily, as will be outlined in Section 7.4. Generally, studying the error terms (16) and (17) will add insight to our numerical experiments of the next section. As an attractive alternative to the above inverse error analysis – especially in stochastic modeling – the density/distribution approximations may be chosen such that they capture the first n moments with a given accuracy. To this end, we reformulate (11) to (for the largest desired moment index n, since relative errors increase with rising moment index):

kⁿ / ((k−2)(k−3)⋯(k−n−1)) − 1.   (19)

Choosing k such that expression (19) is less than ε enforces that the first n moments of the approximation are within the relative error ε of the true moments. Interestingly, this value of k holds for density and distribution approximants alike, and moreover is independent of the specific distribution. Table 1 assembles several values of j (with k = 2^j) as determined from n and ε, which we deem practical. For example, note that going horizontally from two moments to ten moments (with the same precision) mandates about the same increase in j as raising the precision requirement for the two moments by another order of magnitude (going down vertically). Recall that it may be difficult to compute approximations by uniformization (even for a single t) with a prescribed precision for non-PH representations. In this case the norms of the matrices Pⁿ in (7) may not be bounded. As opposed to uniformization, which has been designed to compute the exponential of a matrix for large CTMCs, global randomization produces a sequence of proper (!) distribution

functions, which converge uniformly to the given matrix-exponential distribution. These ME distributions (as representations of waiting, service, interarrival times, etc.) are typically of moderate size, not exceeding 100 and often much smaller.

Table 1: Choosing parameter j (with k = 2^j) so that the first n moments are all approximated within the desired relative error ε.

             n = 2   3   4   5   6   7   8   9  10  11  12  13  14
ε = 1 %          9  10  11  11  12  12  13  13  13  13  14  14  14
ε = 0.1 %       13  14  14  15  15  16  16  16  16  17  17  17  17
ε = 0.01 %      16  17  18  18  19  19  19  20  20  20  20  20  21
ε = 0.001 %     19  20  21  21  22  22  23  23  23  23  24  24  24
ε = 0.0001 %    23  24  24  25  25  26  26  26  26  27  27  27  27

⁴ In subsequent considerations, we assume a cost of order m³ for matrix inversion via LU decomposition (see [16]). On page 121 of this reference, a more efficient scheme, which avoids computing the inverse explicitly, is suggested when one is interested in a matrix inverse applied to a vector only. In our case, this implies performing the repeated squaring beforehand (as advisable for, of course, sparse matrices).

Despite these qualitatively different goals, it is worthwhile to compare the time complexities of global randomization and uniformization. Counting multiplicative flops only, the time complexity of uniformization is known to be of order η·λt per evaluation point for any “reasonable” accuracy ε, where η is the number of nonzero elements in matrix P. A “reasonable” ε – as suggested in [14] for computing the upper truncation bound of order λt for the Poisson probabilities – is not too small. The given expression of the time complexity tacitly assumes that matrix P has norm less than or equal to 1, which need not be the case for non-PH representations. The time complexity of global randomization is O(N·m³·log₂ k), as discussed in Section 5, where we included the constant to account for the matrix inversion⁴. Comparing the time complexities, we deduce that for (fixed) large λt global randomization may be faster than uniformization, despite the fact that in global randomization inverting I + (t/k)B usually leads to dense matrices, so that sparse vector-matrix multiplications cannot be applied. Note that in [9], where matrices of this type also have to be inverted, dedicated numerical algorithms (studied for PH distributions) are shown to outperform uniformization for stiff models. In particular, the repeated squaring of the matrix (as opposed to the linear progression in the uniformization series) renders randomization more appropriate for ultra-high precision. Repeated squaring is also performed for method 3 of [26]

(scaling and squaring), which ranks among the most competitive procedures for computing the exponential of a matrix. Note that high precision would ultimately be required for uniformization at values of t where the density is practically zero (e.g., in the tail region). Randomization benefits here from the nonnegativity of its approximations. Generally, especially for ME distributions and in light of the qualitative properties of the approximations, randomization does have its advantages – the more so for large time parameters.
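The entries of Table 1 can be regenerated from the relative moment error implied by (11), assuming the n-th moment of the approximant overshoots the true moment by the factor kⁿ/((k−2)(k−3)⋯(k−n−1)); the helper functions below are ours:

```python
from math import prod

# Relative error of the n-th moment of the Erlang-k approximant,
# assuming the overshoot factor k^n / ((k-2)(k-3)...(k-n-1)) from (11).
def rel_moment_error(n, k):
    return k ** n / prod(k - i for i in range(2, n + 2)) - 1.0

def min_j(n_max, eps):
    """Smallest j such that k = 2^j keeps the first n_max moments within eps."""
    j = 1
    while True:
        k = 2 ** j
        # k - 2 >= n_max guarantees all moments up to n_max are finite;
        # the relative error increases with the moment index n.
        if k - 2 >= n_max and all(rel_moment_error(n, k) <= eps
                                  for n in range(1, n_max + 1)):
            return j
        j += 1

print(min_j(2, 1e-2))    # Table 1, column n = 2, row eps = 1 %  ->  9
```

Since the relative error is independent of the specific distribution, this single table lookup fixes k for any ME input.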

7

Numerical experiments

The following numerical examples serve to illustrate the theoretical discussion of the previous sections. The selected distributions cover typical and/or extreme situations – both with respect to modeling and numerics. In Section 7.1, we approximate Erlang distributions, studying the (quite representative) behavior of the error terms for hypoexponential densities and distribution functions. In Section 7.2, we randomize an ME distribution, focusing on the approximation at the zeros of the density (also in comparison with uniformization). The convolution example in Section 7.3 demonstrates the practicability of global randomization in a more general setting, while we show how the inverse error analysis may simplify for hyperexponential distributions in Section 7.4. All distributions are normalized to a mean of one, and all computations were performed with the software tool MAPLE (see http://www.maplesoft.com/products/maple).

7.1 Erlang distributions

In this section, we choose Erlang distributions also as the ones to be randomized (as distinguished from the Erlang randomizers). We give results for Erlang distributions with stage numbers 3, 10 and 100, respectively, providing numbers and insight into both types of inverse error analyses. We depict the behavior of the error terms, which we found representative for many hypoexponential distributions. Knowing the Erlang- density functions explicitly, we can determine the accurate error in our density approximations. Figure 1 (right-hand side) plots both K_err^F (F-coeff, (16)) and K_err^f (f-coeff, (17)) for the Erlang-3 and Erlang-10 distributions. The displayed relationship between corresponding K_err^f and K_err^F is quite representative in that the absolute maxima of the two functions may be located at different values of and that K_err^f is usually greater than K_err^F. It appears that approximating the density is a greater challenge (as expressed in larger values of for identical accuracy in the numerical randomization procedure) than approximating the distribution function.

Figure 1: Convergence of density approximations for the Erlang-10 distribution by means of Erlang- randomizers (left-hand side) and leading coefficients in the error term for densities and distribution functions for two Erlang distributions (right-hand side)

From (18) in Section 6 – applied to approximating the densities of the Erlang distributions ( ) uniformly with –, has to be chosen equal to , respectively.5 For , we have , respectively . Note again that while the number of required matrix-matrix multiplications grows linearly, the order of the randomizing Erlang distributions increases exponentially. Generally, with decreasing squared coefficients of variation ( ), the numerical challenge increases: not only because of the growing dimensions of the matrix representations, but also because the densities exhibit sharper humps with steeper slopes. Also considering the shape of the true density (see left-hand side of Figure 1), the density is more difficult to approximate in its extremum, whereas the distribution function is more difficult to approximate in values of where the density function has steep slopes – an observation that may be generalized. Figure 1 (left-hand side) shows how the approximating densities (9) obtained with Erlang- randomizers converge to the true density function of the Erlang-10 distribution. Already for , which only requires nine matrix-matrix multiplications for the matrix exponential, the approximation cannot be distinguished visually from the original function. We actually approximate an exponential decay by a Pareto tail. Thus, for density approximations, the relative errors will eventually increase for – irrespective of the uniform absolute error. Fortunately, this observation does not transfer to distribution functions.

5 For the Erlang-100 distribution: K_err .
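The convergence behavior on the left-hand side of Figure 1 is easy to reproduce numerically. The sketch below is our own illustration (not the authors' code): it measures the maximum absolute deviation of the Erlang-2^m-randomized density from the true Erlang-10 density on a grid; the extra resolvent factor in the density approximant is our own derivation, and the sup-norm deviation shrinks roughly like 1/k as k doubles.

```python
import numpy as np
from math import factorial, exp

def randomized_density(alpha, T, t, m):
    # Erlang-k randomization, k = 2**m: one inversion plus m squarings.
    n = T.shape[0]
    k = 2 ** m
    M = np.linalg.inv(np.eye(n) - (t / k) * T)
    S = M
    for _ in range(m):
        S = S @ S
    return float(alpha @ (S @ M) @ (-T @ np.ones(n)))

# Erlang-10 distribution with rate 10 (mean 1) as a PH representation.
n, lam = 10, 10.0
T = -lam * np.eye(n) + lam * np.eye(n, k=1)   # bidiagonal generator
alpha = np.zeros(n); alpha[0] = 1.0

grid = np.linspace(0.05, 4.0, 80)
true = [lam**n * t**(n - 1) * exp(-lam * t) / factorial(n - 1) for t in grid]
sup_errs = []
for m in (1, 3, 5, 7):
    approx = [randomized_density(alpha, T, t, m) for t in grid]
    sup_errs.append(max(abs(a - b) for a, b in zip(approx, true)))
print(sup_errs)   # sup-norm error decreases as k = 2^m grows
```

The peaked Erlang-10 density illustrates the remark above: the largest error sits near the extremum of the density.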

Regarding the moments, we note from Table 1 that with6 (for any distribution) the first 3 moments are within a relative error of , or the first 13 moments are within a relative tolerance of – to give two illustrative examples. Only for index (see Lemma 5.1) will the moments of the approximating density be infinite.

7.2 A true ME distribution

In this subsection, we have chosen a density that has zeros, i.e., which is not of phase type. Approximations in the zeros are of particular interest, and we compare the convergence behavior of global randomization with that of uniformization. Consider a hypoexponential family of distributions of degree 3 (taken from [21]), whose density is given in the scalar form

(20)

and with the following LST:

(21)

When , the distribution converges to the Erlang-3 distribution (as can be verified from the LST). For , we obtain the distribution already introduced at the end of Section 2 (see ME representations (6)), which minimizes the squared coefficient of variation for this family of distributions. With decreasing (negative) value of , the oscillating behavior – as visible in (the left-hand side of) Figure 2 – becomes more pronounced. We approximate the density function for and , i.e., ; then . The ME representation is constructed from (21) using (5). The density takes on zero values for integer multiples of , i.e., for with . Our randomization technique is applied in the very same manner as for the Erlang distributions examined above. Again, Erlang- distributions are used as randomizing functions. Figure 2 (left-hand side) shows the true and several approximating densities. Larger, but still efficient, values of are now necessary to obtain satisfactory approximations, especially in the zeros of the density. The functions K_err^f and K_err^F are depicted in Figure 2 (right-hand side). From (18), we compute the minimal parameters in order to achieve uniform convergence to the density in a -strip. By comparing the maxima of the leading error coefficients for densities, we found that choosing in (20) (and thus a

6 We suggest as a reasonable value for acceptable approximation in many cases.


Figure 2: Density approximations (with Erlang randomizers, left-hand side) for true ME distribution and the leading coefficients in the error terms (right-hand side) for density and distribution function

highly oscillating ME distribution) poses a numerical challenge commensurate with the Erlang-100 density with respect to uniform convergence. Ultimately, the critical zero values of the density function are approximated from above with global randomization. Of course, roundoff errors, e.g., introduced by matrix inversion or repeated squaring, may disturb this convergence behavior, which might even lead to slightly negative values for approximations in the zeros (especially for large and/or low machine precision). Such negative values can be precluded in critical situations by performing global randomization over the rational numbers (which is not possible for uniformization). In our experiments, we found that the numerical errors can easily be reduced to several magnitudes below the approximation error, which is intrinsic to the randomization technique and which we aim to control. We have also experimented with the uniformization method for this case. Uniformization then loses its favorable properties (such as stochastic interpretation, nonnegativity of involved values, stability and predictable accuracy), and negative values may be the result of an approximation error (and of roundoff errors). Also, the monotonically increasing convergence is lost. More severely, the non-substochastic nature of the matrices in (7) – with possibly divergent norms – prevents predicting the (upper) truncation bound in order to achieve a prescribed accuracy. It appears difficult to draw conclusions from the norms of the matrices about the convergence behavior in uniformization. Table 2 demonstrates this for 20-digit precision at (the first zero) of the considered example density: for two different choices of – based on the maximal absolute diagonal entry of or the maximal absolute overall entry of (which may be different for ME distributions)


Table 2: Approximations for the density at the leftmost zero, , using uniformization with two different uniformization rates and 20-digit precision

truncation point: 1, 2, 3, 6, 10, 20, 30, 40, 50, 60, 140
approx. density (first rate): +1.343638 , +3.828094 , ...
norm (first rate): 35.37, 47.15, 156.54, 2,866.73, 37,020.02, ...
norm (second rate): 2.91, 4.71, 6.38, 10.38, 12.77, 6.04, 9.20, 10.10, 2.14, 8.17, 2.99

– elements of the corresponding series at various truncation points are shown. (For 20-digit precision, the bottom values do not improve with larger .) Although the norms of diverge in the first case (third column), the convergence of the density approximations exhibits a more regular behavior than in the second case (fifth column), where the norms of are bounded. Note the significant negativity of (intermediate) approximations in both cases – even for truncation points as great as . The limits for uniformization may also be negative (e.g., with 10-digit precision, one obtains the value when truncating the series (7) at in our example).
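The nonnegativity of the randomized approximations at the zeros of an ME density is easy to observe numerically. Since the parameters of the family (20) are not reproduced here, the sketch below is our own illustration with a different standard ME-but-not-PH density, f(t) = c·e^{−t}(1 − cos ωt) with ω = 2π and normalizing constant c = (1 + ω²)/ω², which vanishes at every positive integer; the 3-dimensional representation (p, A, b) and the resolvent-power approximant are our own construction in the spirit of the method.

```python
import numpy as np

w = 2.0 * np.pi
c = (1.0 + w**2) / w**2          # normalizes f(t) = c*exp(-t)*(1 - cos(w*t))
# ME representation with f(t) = p @ expm(A*t) @ b (A is not a Markov generator)
A = np.array([[-1.0, 0.0, 0.0],
              [0.0, -1.0,  w ],
              [0.0,  -w , -1.0]])
p = np.array([c, -c, 0.0])
b = np.array([1.0, 1.0, 0.0])

def randomized_density(p, A, b, t, m):
    # Erlang-k randomization, k = 2**m: one inversion plus m squarings.
    n = A.shape[0]
    k = 2 ** m
    M = np.linalg.inv(np.eye(n) - (t / k) * A)
    S = M
    for _ in range(m):
        S = S @ S
    return float(p @ (S @ M) @ b)

# t = 1 is a zero of the true density; the approximations stay strictly
# positive and decrease toward 0 as k grows.
coarse = randomized_density(p, A, b, 1.0, 5)    # k = 32
fine = randomized_density(p, A, b, 1.0, 10)     # k = 1024
print(coarse, fine)
```

Since the approximant equals the expectation of the nonnegative density under an Erlang-distributed time argument, it can never go negative in exact arithmetic, in contrast to the truncated uniformization series in Table 2.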

7.3 A convolution example

The examples presented so far were chosen to illuminate the numerical properties of the proposed randomization technique. The structure and/or dimensions of these examples actually admitted a symbolic (though expensive) computation of the matrix exponential. This no longer holds for the following distributions to be randomized, which are not of favorable tridiagonal or triangular structure. The distribution function of this subsection arises from a convolution – a typical scenario in performance modeling – of four distributions. The first two components are Erlang-3 distributions with feedback (see Figure 3). The LST of the density is also given by (21), but with nonnegative and a different scalar density function than

Figure 3: Visualization of the Erlang-3 distribution with feedback as a Markov chain


Figure 4: Density approximations (with Erlang randomizers, left-hand side) and the leading coefficient in the error term (right-hand side) for densities for the convolution example

(20). We choose , and thus . The other two components in the convolution are given by the 2-by-2 ME/PH representation

(22)

The squared coefficient of variation of this component is equal to . Each component is normalized to the mean so that the resulting convolution has mean (and ). In analogy to the previous discussions, the left-hand side of Figure 4 shows how the approximations by global randomization converge to the true density function for low values of , while the right-hand side plots the leading error coefficient of (15). From the maximum K_err^f, we obtain and from (18) for the - and -uniform approximations. Note the (not too surprising) qualitative similarity of the presented curves to the Erlang distributions in Section 7.1.

We point out that convolutions naturally occur in performance evaluation, e.g., for the response time, which is the sum of the waiting and the service time. Computing percentiles of the response time distribution in GI/ME/1 systems (via global randomization) involves the convolution of two ME distributions [5].
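The convolution of two PH/ME representations (α1, T1) and (α2, T2) has the well-known block representation α = (α1, 0) and T = [[T1, t1·α2], [0, T2]] with exit vector t1 = −T1·1, which can then be randomized exactly as before. The minimal sketch below is our own illustration (the randomized-density formula with its extra resolvent factor is our own derivation): it convolves two Erlang-2 distributions, for which the result is an Erlang-4 distribution with a known closed-form density.

```python
import numpy as np

def convolve_ph(a1, T1, a2, T2):
    """Block representation of the convolution of two PH distributions."""
    n1, n2 = T1.shape[0], T2.shape[0]
    t1 = -T1 @ np.ones(n1)                       # exit vector of first component
    T = np.block([[T1, np.outer(t1, a2)],
                  [np.zeros((n2, n1)), T2]])
    a = np.concatenate([a1, np.zeros(n2)])
    return a, T

def randomized_density(alpha, T, t, m):
    # Erlang-k randomization, k = 2**m: one inversion plus m squarings.
    n = T.shape[0]
    k = 2 ** m
    M = np.linalg.inv(np.eye(n) - (t / k) * T)
    S = M
    for _ in range(m):
        S = S @ S
    return float(alpha @ (S @ M) @ (-T @ np.ones(n)))

# Two Erlang-2 components with rate 2; their convolution is Erlang-4 with rate 2.
T1 = np.array([[-2.0, 2.0], [0.0, -2.0]])
a1 = np.array([1.0, 0.0])
a, T = convolve_ph(a1, T1, a1, T1)

exact = 2.0**4 * 1.0**3 * np.exp(-2.0) / 6.0     # Erlang-4 density at t = 1
approx = randomized_density(a, T, 1.0, 12)
print(approx, exact)
```

The same composition applies componentwise to the four-fold convolution of this subsection; only the block dimension grows.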

7.4 Hyperexponential distributions

Due to special properties of hyperexponential distributions, inverse error analysis may simplify significantly for such distributions as compared with the hypoexponential distributions studied so far. This is illustrated by means of a non-trivial example of dimension 20. Hyperexponential distributions are often represented or approximated as mixtures of (scalar) exponential distributions (e.g., [13]), for which the matrix exponential is trivially computed (diagonal ). Mixtures of exponentials have completely monotone density functions, and (a relaxed) monotonicity is often a characteristic of arbitrary hyperexponential distributions. From earlier examinations of the error coefficients, we may conclude that (quasi-)monotonicity is a favorable case for our approximations. In addition, for the studied hyperexponential examples, the absolute maximum of K_err^f is attained in . Therefore, computing

K_err^f

forgoes the matrix exponential. Thus, using (18), we can easily and accurately pre-select the parameter to achieve uniform convergence within an -strip of the true density function. Moment-related accuracy requirements are met as outlined in Section 6, so that both approaches to inverse error analysis can simultaneously be pursued in a convenient way (simply choose the larger resulting from the two approaches). The maximum of K_err^F is not as easily found, but – as for hypoexponential distributions – it is usually much smaller than the density counterpart. Thus, the -value resulting from the density error analysis will be sufficient. As our example, we introduce the following ME/PH representation

(23)

By means of factor , we normalize the mean of the distribution to 1.0. We choose dimension with . Then . The given example may be interpreted as a cyclically modulated exponential service, where both the modulation rates (i.e., the off-diagonal elements) and the service rates increase geometrically (with factor 2) until returning to the first state. This representation is numerically interesting, since matrix is stiff and its inverse (and that of ) is dense.

Table 3: Data sets for the hyperexponential example: approximation at with relative errors for the corresponding -value, density values and both error coefficients for different -values

       approx.        rel.err.      t        density     K_err^f      K_err^F
  1    2621.438       0.5           0.000    5242.875    5242.875     0.0
  2    3932.156       0.25          0.001      71.452       1.380097  0.106835
  3    4587.516       0.125         0.002      35.896       0.354034  0.107515
  4    4915.195       0.0625        0.003      23.968       0.141889  0.107745
  5    5079.035       0.03125       0.005      14.400       0.055754  0.107931
  7    5201.915                     0.007      10.291       0.026960  0.108002
 10    5237.755                     0.01        7.206       0.014211  0.108068
 15    5242.715                     0.1         0.721       0.000158  0.108187
 20    5242.870                     1.0         0.072       0.000004  0.108201
 25    5242.874844                  4.0         0.018       0.000229  0.108926
 30    5242.874995                 10.0         0.005      -0.005167  0.094502

Conspicuous is the sharp decay of the density near with a large initial value (fifth column of Table 3), which poses the main difficulty in the approximation (see the absolute maximum of K_err^f at in the sixth column). The first two columns of Table 3 show how randomization approximates the true density in this critical point for increasing values of . It is interesting to note that – in agreement with (15) – the (non-uniform) relative error practically coincides with . Inversely, from (18), the absolute uniform convergence within a - or -strip is realized for and , respectively. This documents the possibly tough requirements for hyperexponential density approximations due to their behavior at . The function K_err^f eventually enters the negative domain (see the bottom value in the sixth column), but does not even remotely approach the maximum at in absolute terms (a negative minimum of approximately is assumed at ). Also note how easily the distribution function is approximated in comparison: with K_err^F – assumed near – being several magnitudes lower than the density counterpart, we obtain - and -uniform convergence already for and , respectively.
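The behavior near t = 0 can be reproduced with a small stiff example. Since the parameters of representation (23) are not spelled out here, the sketch below is our own illustration with a hypothetical two-phase hyperexponential mixing a fast and a slow exponential; the matrix is diagonal, so the exact density is available in closed form, the initial value f(0) = α·t0 is obtained without any matrix exponential, and the sharp decay near zero is visible directly.

```python
import numpy as np

# Hypothetical H2 distribution: mixture of a fast and a slow exponential,
# normalized to mean 1 (0.9/100 + 0.1/lam2 = 1).
p = np.array([0.9, 0.1])
rates = np.array([100.0, 0.1 / 0.991])
T = np.diag(-rates)                              # diagonal generator
alpha = p

def randomized_density(alpha, T, t, m):
    # Erlang-k randomization, k = 2**m: one inversion plus m squarings.
    n = T.shape[0]
    k = 2 ** m
    M = np.linalg.inv(np.eye(n) - (t / k) * T)
    S = M
    for _ in range(m):
        S = S @ S
    return float(alpha @ (S @ M) @ (-T @ np.ones(n)))

f0 = float(alpha @ (-T) @ np.ones(2))               # exact density at t = 0
exact = float(p @ (rates * np.exp(-rates * 0.01)))  # exact density at t = 0.01
approx = randomized_density(alpha, T, 0.01, 12)
print(f0, exact, approx)
```

The large initial value (here about 90) and the steep drop over a hundredth of the mean mirror, in miniature, the difficulty documented in Table 3.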

In summary of all presented numerical examples, the proposed global randomization has been demonstrated to be a viable technique for approximating densities and distribution functions – especially considering that the targeted ME/PH representations are typically of moderate size (e.g., resulting from fitting techniques). Our approach is general, efficient and – most importantly – preserves qualitative properties of distributions.

8 Conclusions

Starting from the general concept of randomization, we developed a technique to obtain linear-algebraic expressions for the exponential of a matrix when the randomizing function has a rational Laplace-Stieltjes transform. Using these expressions in density and distribution functions of ME distributions yields nonnegative approximations of these functions. The approximations can be computed in exact arithmetic over the field of rationals if the elements of the original ME representation are rational, thereby avoiding the introduction of additional errors. We focused on Erlang randomizers in this paper, for which we showed that efficient approximants can be obtained that themselves represent proper pairs of densities and distribution functions. These approximants were proven to converge uniformly on the nonnegative real line, and we also characterized the convergence of moments and Taylor series coefficients. Numerical experiments demonstrate that the proposed randomization techniques can be applied effectively in a wide range of settings. In particular, the a priori specification of moment-related and/or uniform errors is an attractive feature of global randomization, especially for non-PH representations, where uniformization generally cannot guarantee a prescribed absolute error. Although we primarily studied the approximation of density and distribution functions in this paper, the provided technique is likely to find application in other fields where exponentials of a matrix need to be computed and might similarly benefit from ME randomization (e.g., in the supplementary variable approach to non-Markovian systems [15]).

Acknowledgements

We would like to thank Chaitanya Garikiparthi for his programming efforts in providing the numerical results, and Guy Latouche and David Stanford for sharing their pre-print manuscripts on the Erlangization of fluid queues. We also thank the anonymous reviewers, whose comments helped us to improve the presentation of this paper.

References

[1] H. Abdallah and R. Marie. The uniformized power method for transient solutions of Markov processes. Computers & Operations Research, 20(5):515–526, 1993.
[2] D. Aldous and L. Shepp. The least variable phase type distribution is Erlang. Commun. Statist.-Stochastic Models, 3:467–473, 1987.
[3] J. E. Angus. Some bounds on the error in approximating transition probabilities in continuous-time Markov chains. SIAM Review, 34:110–113, 1992.
[4] S. Asmussen, F. Avram, and M. Usábel. Erlangian approximations for finite-time ruin probabilities. ASTIN Bulletin, 32:267–281, 2002.
[5] S. Asmussen and M. Bladt. Renewal theory and queueing algorithms for matrix-exponential distributions. In Proc. 1st Int. Conference on Matrix-Analytic Methods in Stochastic Models, pages 313–341, 1996.
[6] S. Asmussen and C. A. O'Cinneide. Matrix-exponential distributions – distributions with a rational Laplace transform. In S. Kotz and C. Read, editors, Encyclopedia of Statistical Sciences. Wiley & Sons, 1997.
[7] D. S. Bernstein and W. So. Some explicit formulas for the matrix exponential. IEEE Transactions on Automatic Control, 38(8):1228–1232, 1993.
[8] M. Bladt and M. Neuts. Matrix-exponential distributions: Calculus and interpretations via flows. Commun. Statist.-Stochastic Models, pages 113–124, 2003.
[9] R. M. Carmo, E. de Souza e Silva, and R. Marie. Efficient solutions for an approximation technique for the transient solution of Markovian models. Technical Report No. 3055, INRIA, Rennes, France, 1996.
[10] H.-W. Cheng and S. S.-T. Yau. More explicit formulas for the matrix exponential. Linear Algebra and its Applications, 262:131–163, 1997.
[11] W. J. Cody, G. Meinardus, and R. S. Varga. Chebyshev rational approximation to e^{-x} in [0, +∞) and applications to heat conduction problems. J. Approx. Theory, 2:50–65, 1969.
[12] J. D. Diener and W. H. Sanders. Empirical comparison of uniformization methods for continuous-time Markov chains. In Proc. 2nd Int. Conference on the Numerical Solution of Markov Chains, Raleigh, NC, USA, 1995.
[13] A. Feldmann and W. Whitt. Fitting mixtures of exponentials to long-tail distributions to analyze network performance models. Performance Evaluation, 31:245–279, 1997.
[14] B. L. Fox and P. W. Glynn. Computing Poisson probabilities. Communications of the ACM, 31:440–445, 1988.
[15] R. German. Performance Analysis of Communication Systems with Non-Markovian Stochastic Petri Nets. John Wiley & Sons, 2000.
[16] G. H. Golub and C. F. van Loan. Matrix Computations. Johns Hopkins University Press, Baltimore, 1989.
[17] W. K. Grassmann. Finding transient solutions in Markovian queueing systems. Operations Research, 4:47–53, 1977.
[18] D. Gross and D. R. Miller. The randomization technique as a modeling tool and solution procedure for transient Markov processes. Operations Research, 32:345–361, 1984.
[19] A. Iserles and S. P. Nørsett. Error control of rational approximations to the exponential functions. Constructive Approximation, 2:41–57, 1986.
[20] G. Latouche and D. Stanford. Erlangization of fluid queues. In preparation, 2005.
[21] A. van de Liefvoort. The moment problem for continuous distributions. Technical Report WP-CM-1990-02, School of Interdisciplinary Computing and Engineering, University of Missouri – Kansas City, USA, 1990.
[22] A. van de Liefvoort. A note on count processes. Assam Statistical Review, 5(5):1–11, 1991.
[23] L. Lipsky. Queueing Theory: A Linear Algebraic Approach. MacMillan, New York, 1992.
[24] K. Mitchell, K. Sohraby, A. van de Liefvoort, and J. Place. Approximation models of wireless cellular networks using moment matching. In Proc. Conf. on Computer Communications (IEEE Infocom), pages 189–197, 2000.
[25] C. Moler and C. van Loan. Nineteen dubious ways to compute the exponential of a matrix. SIAM Review, 20:801–836, 1978.
[26] C. Moler and C. van Loan. Nineteen dubious ways to compute the exponential of a matrix, twenty-five years later. SIAM Review, 45:3–49, 2003.
[27] A. P. A. van Moorsel and W. H. Sanders. Transient analysis of Markov models by combining adaptive and standard uniformization. IEEE Trans. on Reliability, 46(3):430–440, 1997.
[28] B. Philippe and R. B. Sidje. Transient solutions of Markov processes by Krylov subspaces. In W. J. Stewart, editor, Proc. of 2nd Int. Workshop on the Numerical Solution of Markov Chains, Raleigh, NC, USA, 1995.
[29] S. M. Ross. Approximating transition probabilities and mean occupation times in continuous-time Markov chains. Probability in the Engineering and Informational Sciences, pages 251–264, 1987.
[30] R. B. Sidje and W. J. Stewart. A survey of methods for computing large sparse matrix exponentials arising in Markov chains. Computational Statistics and Data Analysis, 29:345–368, 1999.