A SURVEY OF TES MODELING APPLICATIONS

BENJAMIN MELAMED*    JON R. HILL†

NEC USA, Inc.
C&C Research Laboratories
4 Independence Way
Princeton, NJ 08540
Abstract
TES (Transform-Expand-Sample) is a versatile methodology for modeling general stationary time series, and particularly those that are autocorrelated. From the viewpoint of Monte Carlo simulation, TES represents a new and flexible input analysis approach. The salient feature of TES is its potential ability to simultaneously capture first-order and second-order properties of empirical time series (field measurements): given an empirical sample, TES is designed to fit an arbitrary empirical marginal distribution (histogram) and to simultaneously approximate the leading empirical autocorrelations. Practical TES modeling is computationally intensive and can be effectively carried out only with computer support. A software modeling environment, called TEStool, has been designed and implemented to support the TES modeling methodology through an interactive heuristic or algorithmic search approach employing extensive visualization. The purpose of this paper is to introduce TES modeling and to offer some illustrative examples from a range of applications. The paper first presents an overview of TES modeling and TEStool as well as the underlying TES theory. It then proceeds to survey a number of models from various application domains, including source modeling of compressed video and fault arrivals, financial modeling and texture generation. These examples demonstrate the efficacy and versatility of the TES modeling methodology, and underscore the high fidelity attainable with TES models.
Keywords and Phrases:
Autocorrelated Processes, Input Analysis, Modeling and Simulation, Software Systems, Stochastic Processes, TES Processes.
* Current address: RUTCOR, Rutgers University, P.O. Box 5062, New Brunswick, NJ 08903.
† Current address: Smith Barney, Inc., 1345 Avenue of the Americas, New York, NY 10105.
1 INTRODUCTION

Temporal and spatial dependencies are commonplace in a host of real-world random phenomena. Temporal dependence accounts, to a large extent, for burstiness in telecommunications traffic, especially in the context of high-speed communications networks. File transfer and compressed VBR (variable bit rate) video are typical examples from emerging ISDN (Integrated Services Digital Network) applications. Locality of reference in caches and data bases is underpinned by spatial dependence of successive memory accesses. The combined effects of temporal and spatial dependencies underlie fault cascades observed in network management. When formulated mathematically as real-valued stochastic processes, temporal dependencies are often proxied by autocorrelations within a stochastic process, and by cross-correlations among multiple processes.

Although dependencies in real-world systems abound, in practice, the natural impulse of modelers is to minimize or eliminate dependence from model descriptions in order to simplify analysis. A typical case in point is what might be termed "classical queueing". The bulk of queueing systems are comprised of one or more GI/GI/m queues, where GI (General Independent) stands for a renewal process; that is, interarrival times and service times are specified as iid (independent identically distributed) random sequences. The standard argument in support of this approach is that these assumptions endow the models with analytical or numerical tractability. A related argument is motivated by the insight that analytical models may provide, especially when dealing with systems in the design stage, or with extant systems in the early stages of their life cycle. This argument is entirely legitimate, since in such scenarios one is often merely interested in qualitative understanding of design tradeoffs, and the modeling effort calls for calculation of performance measures which are rough rather than accurate.
Consider, however, performance analysis at a later stage of a system's life cycle. Suppose that the randomness in the system is driven by some stochastic process, and that field measurements from it are available to the modeler (at this point we don't care to distinguish between performance evaluation approaches, be they analytical or Monte Carlo simulation). The experienced analyst is immediately confronted with two generic and sequential questions:

1. Should one ignore dependence in empirical time series and use a simple model, say, from the class of renewal processes?

2. If dependence is to be modeled, how can it be done in an effective and systematic manner?

Experience shows that all too often, the prevailing attitude is to obviate the answer to the second question by simply electing a renewal process model. Certainly, by the Principle of Parsimony, modelers should only care to invest in more elaborate modeling to the extent that the extra effort will buy sufficiently increased accuracy in the resulting performance measures of interest. The point is that modelers should be aware of the fact that oversimplified renewal models can carry grave modeling risks. To fix the ideas, assume that we have field measurements from a traffic (or workload) process, and we wish to gauge system performance measures resulting from offering this traffic to a queueing system. A little introspection on the nature of burstiness in arrival processes should convince the reader of its deleterious effect on waiting times: multiple customers arriving in a burst will obviously suffer from increased waiting times, while the
lulls separating bursts waste server utilization. Indeed, various studies [5, 24, 29] have shown that when autocorrelated traffic is offered to a queueing system, the resulting performance measures are much worse than those corresponding to renewal traffic, differing by orders of magnitude. A growing realization of the impact of bursty traffic on queueing system performance has provided the initial motivation for devising input analysis methods that are able to capture dependence in time series; no doubt, this realization is bound to extend to other modeling domains.

The TES (Transform-Expand-Sample) modeling methodology [25, 11, 12, 13] is just such an input analysis approach, specifically designed to address the second question above. In essence, TES is a modulo-1 reduction of a simple linear autoregressive scheme, followed by additional transformations. Furthermore, while the basic TES formulation is Markovian, these transformations usually result in non-Markovian processes. Thus, TES is a non-linear autoregressive-based scheme, encompassing both Markovian and non-Markovian processes. The TES approach is tailored to a specific world view of what constitutes a good model, namely, that both first-order and second-order properties of empirical time series should be adequately captured. Specifically, it stipulates three requirements that should be satisfied by good prospective models:

Requirement 1: The marginal distribution of the model should match its empirical counterpart.

Requirement 2: The autocorrelation function of the model should approximate its empirical counterpart.

Requirement 3: Sample paths generated by a Monte Carlo simulation of the model should "resemble" the empirical data.

Note that these requirements are arranged in decreasing stringency. The first two constitute quantitative goodness-of-fit criteria, whereas the third one is a qualitative requirement which cannot be defined with mathematical crispness. We argue that, in fact, there is no compelling need to.
Obviously, Requirement 3 is a highly subjective statement, but its intuitive meaning and purpose should be clear: qualitatively "similar" sample paths can considerably enhance a practitioner's confidence in a proposed model. Furthermore, qualitative similarity is emphatically not a substitute for the two preceding quantitative criteria; it is required in addition to them, not instead of them. A similar view is adopted in [23, 21, 32]. An excellent comprehensive survey of methods for constructing stochastic processes with prescribed marginals and autocorrelations may be found in [15], which defines a taxonomy of input analysis methods. In this taxonomy, TES methods, as well as those proposed in [23, 21, 32], would be classified as approximate correlation distortion methods. TES processes and the underlying TES methodology conform neatly to this paradigm. First, the TES modeling methodology guarantees an exact fit to arbitrary marginal distributions. In particular, it can match any empirical density (histogram). Second, it affords a great deal of freedom in approximating the empirical autocorrelations, while maintaining an exact match to the empirical histogram. TES autocorrelation functions have diverse functional forms, including monotone, oscillatory and others. And third, TES processes generate a wide qualitative variety of simulated sample paths, including cyclical and non-directional sequences.

In addition, TES processes enjoy two important computational advantages. The first advantage is that TES sequences are easily generated on a computer, and their periods are much longer than that of the underlying pseudo-random number stream. Their generation time complexity is typically not much higher than that of the underlying pseudo-random number generator, and their space complexity is negligible. The second advantage is that TES autocorrelations can be computed from fast and accurate numerical formulas, obviating time-consuming calculations of simulation-based statistics. Consequently, TES modeling of empirical data may be carried out interactively and visually. This observation motivated the design and implementation of a TES-based modeling package, called TEStool, which makes heavy use of visualization in order to provide a pleasant interactive environment for modeling dependent stationary time series. Two search approaches have been implemented: heuristic and algorithmic. The former affords more user control through direct interactive search, while the latter allows more elaborate algorithmic searches, thereby largely automating the search activity. The TEStool environment speeds up the modeling search procedure, cuts down on modeling errors, and relieves the tedium of repetitive search. Since the search is formulated in visual terms, it is easy to carry out, and can be conducted by experts and non-experts alike. It appears that TES is currently the only input analysis method in the spirit of the three goodness-of-fit requirements listed above that is supported by an elaborate modeling software environment.

This paper is organized as follows. Section 2 contains a brief overview of TES processes, an outline of the TES modeling methodology (including its heuristic and algorithmic variants), and a concise description of the TEStool software modeling environment. Section 3 demonstrates the efficacy of the TES modeling methodology and the TEStool modeling environment by a range of examples from various application domains. Finally, Section 4 contains the conclusion of this paper.
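As a concrete illustration of the burstiness effect motivating this paper, the minimal sketch below feeds a single-server queue once with renewal (iid exponential) interarrivals and once with autocorrelated interarrivals having the same exponential marginal, obtained by distorting a modulo-1 random walk with the exponential inverse CDF (a TES-like construction; the parameters and function names are illustrative assumptions, not taken from the paper). Waiting times are computed with the Lindley recursion.

```python
import math
import random

def lindley_mean_wait(interarrivals, services):
    """Mean waiting time via the Lindley recursion W_n = max(0, W_{n-1} + S_{n-1} - A_n)."""
    w, total = 0.0, 0.0
    for a, s in zip(interarrivals, services):
        w = max(0.0, w + s - a)
        total += w
    return total / len(interarrivals)

def exp_inv_cdf(u, mean):
    """Inverse CDF of the exponential distribution with the given mean."""
    return -mean * math.log(1.0 - u)

def renewal_arrivals(n, mean, rng):
    """iid exponential interarrival times (a renewal process)."""
    return [rng.expovariate(1.0 / mean) for _ in range(n)]

def correlated_arrivals(n, mean, step, rng):
    """Autocorrelated interarrivals with the same exponential marginal:
    a modulo-1 random walk (uniform marginal) passed through the inverse CDF."""
    u, out = rng.random(), []
    for _ in range(n):
        u = (u + rng.uniform(-step, step)) % 1.0
        out.append(exp_inv_cdf(u, mean))
    return out

rng = random.Random(12345)
n, svc_mean, arr_mean = 100_000, 1.0, 1.25   # server utilization ~ 0.8
services = [rng.expovariate(1.0 / svc_mean) for _ in range(n)]
w_renewal = lindley_mean_wait(renewal_arrivals(n, arr_mean, rng), services)
w_bursty = lindley_mean_wait(correlated_arrivals(n, arr_mean, 0.05, rng), services)
print(w_renewal, w_bursty)
```

With identical marginals and utilization, the autocorrelated stream produces far larger mean waits, in line with the "orders of magnitude" degradation reported in [5, 24, 29].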
2 OVERVIEW OF TES MODELING TES processes are treated in some detail in [25, 11, 12, 13]. An extensive overview of TES processes and the corresponding modeling methodology may be found in [26], and a short one in [27]. A brief overview extracted from these references now follows.
2.1 TES Processes
For any real $x$, let $\lfloor x \rfloor = \max\{\text{integer } n : n \le x\}$ be the integral part of $x$, and define $\langle x \rangle = x - \lfloor x \rfloor$ to be the fractional part of $x$. Let $\{V_n\}$ denote a sequence of iid random variables with a common, though arbitrary, density $f_V$. Further, let $U_0$ be uniform on $[0,1)$ and independent of the sequence $\{V_n\}$. The random variables $V_n$ are referred to as innovations. There are two classes of TES processes: TES$^+$ and TES$^-$. Each TES model consists of two random sequences in lockstep, called the background and foreground sequence, respectively. A TES$^+$ background sequence $\{U_n^+\}$ is of the form

$$U_n^+ = \begin{cases} U_0, & n = 0 \\ \langle U_{n-1}^+ + V_n \rangle, & n > 0 \end{cases} \qquad (2.1)$$

while a TES$^-$ background sequence $\{U_n^-\}$ is of the form

$$U_n^- = \begin{cases} U_n^+, & n \text{ even} \\ 1 - U_n^+, & n \text{ odd.} \end{cases} \qquad (2.2)$$
The superscripts in Eqs. (2.1) and (2.2) are suggestive of the fact that TES processes achieve coverage of the full range of feasible lag-1 autocorrelations [11]; TES$^+$ processes cover the positive range $[0, 1]$, while TES$^-$ processes cover the negative range $[-1, 0]$. The superscripts will be omitted when the TES flavor is immaterial. It can be shown [11, 26] that the basic TES processes of Eqs. (2.1) and (2.2) constitute stationary Markovian sequences with uniform marginals on $[0, 1)$, regardless of the innovation sequence selected; in fact, the choice of innovations determines just the second-order structure of a TES process. TES background sequences are auxiliary, the real interest lying in TES foreground sequences, $\{X_n^+\}$ and $\{X_n^-\}$, obtained from Eq. (2.1) or Eq. (2.2) by some transformation $D$ from $[0, 1]$ to the real line (called a distortion), i.e.,
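The recursions of Eqs. (2.1) and (2.2) are straightforward to code. The sketch below (illustrative; the choice of uniform innovations on $[-0.1, 0.1]$ is an arbitrary assumption, not from the paper) generates both background sequences and exhibits the properties just stated: uniform marginals regardless of the innovation density, with positive lag-1 autocorrelation for TES$^+$ and negative for TES$^-$.

```python
import random

def tes_plus_background(n, innovation, rng):
    """Eq. (2.1): U_0 ~ uniform[0,1); U_n = <U_{n-1} + V_n>, the fractional part."""
    u = rng.random()
    seq = [u]
    for _ in range(n - 1):
        u = (u + innovation(rng)) % 1.0   # modulo-1 addition preserves the uniform marginal
        seq.append(u)
    return seq

def tes_minus_background(plus_seq):
    """Eq. (2.2): U_n^- = U_n^+ for even n, 1 - U_n^+ for odd n."""
    return [u if i % 2 == 0 else 1.0 - u for i, u in enumerate(plus_seq)]

def lag1_autocorr(seq):
    """Sample lag-1 autocorrelation."""
    m = sum(seq) / len(seq)
    num = sum((seq[i] - m) * (seq[i + 1] - m) for i in range(len(seq) - 1))
    den = sum((x - m) ** 2 for x in seq)
    return num / den

rng = random.Random(7)
plus = tes_plus_background(50_000, lambda r: r.uniform(-0.1, 0.1), rng)
minus = tes_minus_background(plus)
print(lag1_autocorr(plus), lag1_autocorr(minus))   # positive vs. negative
```

Note that reflecting the odd-indexed variates as in Eq. (2.2) negates each lag-1 product term, which is why the TES$^-$ sequence covers the negative autocorrelation range.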
$$X_n^+ = D(U_n^+), \qquad X_n^- = D(U_n^-). \qquad (2.3)$$
The TES modeling methodology (to be explained in Section 2.2) typically employs compound distortions composed of two successive transformations

$$X_n = F^{-1}(S_\xi(U_n)), \qquad (2.4)$$
where $\{U_n\}$ is any background TES sequence. The inner transformation, $S_\xi$, is called a stitching transformation; it is selected from a family, parameterized by $0 \le \xi \le 1$, of the form

$$S_\xi(y) = \begin{cases} y/\xi, & 0 \le y < \xi \\ (1-y)/(1-\xi), & \xi \le y < 1. \end{cases}$$
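The compound distortion of Eq. (2.4) can be sketched as follows, under the assumption that the target marginal $F$ is exponential (an illustrative choice; any marginal works through its inverse CDF, and the parameter values below are arbitrary). Stitching with $0 < \xi < 1$ preserves the uniform marginal of the background sequence, so the foreground sequence attains the target marginal exactly.

```python
import math
import random

def stitch(y, xi):
    """Stitching transformation S_xi on [0,1); preserves uniformity for any xi."""
    if xi <= 0.0:
        return 1.0 - y
    if xi >= 1.0:
        return y
    return y / xi if y < xi else (1.0 - y) / (1.0 - xi)

def exp_inv_cdf(u, mean):
    """F^{-1} for an exponential marginal with the given mean."""
    return -mean * math.log(1.0 - min(u, 1.0 - 1e-15))  # guard against u == 1

def tes_foreground(n, xi, mean, step, rng):
    """X_n = F^{-1}(S_xi(U_n)) over a TES+ background sequence, per Eq. (2.4)."""
    u, out = rng.random(), []
    for _ in range(n):
        u = (u + rng.uniform(-step, step)) % 1.0   # background step, Eq. (2.1)
        out.append(exp_inv_cdf(stitch(u, xi), mean))
    return out

rng = random.Random(3)
xs = tes_foreground(50_000, xi=0.5, mean=2.0, step=0.1, rng=rng)
print(sum(xs) / len(xs))   # close to the target exponential mean
```

The design intent of stitching is path smoothness: without it, the modulo-1 wraparound in the background sequence produces abrupt jumps between values near 1 and values near 0, whereas $S_\xi$ maps both endpoints to the same value.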