Using Observed Functional Data to Simulate a Stochastic Process via a Random Multiplicative Cascade Model

G. Damiana Costanzo¹, S. De Bartolo², F. Dell’Accio³, and G. Trombetta³

¹ Dip. di Economia e Statistica, UNICAL, Via P. Bucci, 87036 Arcavacata di Rende (CS), Italy, [email protected]
² Dip. di Difesa del Suolo V. Marone, UNICAL, Via P. Bucci, 87036 Arcavacata di Rende (CS), [email protected]
³ Dip. di Matematica, UNICAL, Via P. Bucci, 87036 Arcavacata di Rende (CS), [email protected], [email protected]

Abstract. Considering functional data and an associated binary response, we propose a method based on the definition of special Random Multiplicative Cascades to simulate the underlying stochastic process. We consider a class $\mathcal{S}$ of stochastic processes whose realizations are real continuous piecewise linear functions with a constraint on the increment, and the family $\mathcal{R}$ of all binary responses $Y$ associated to a process $X$ in $\mathcal{S}$. Considering data from a continuous phenomenon evolving in a time interval $[0, T]$ which can be simulated by a pair $(X, Y) \in \mathcal{S} \times \mathcal{R}$, we introduce a tool which makes it possible to predict $Y$ at each point of $[0, T]$. An application to data from an industrial kneading process is considered.

Keywords: functional data, stochastic process, multiplicative cascade

1 Introduction

When data represent functions or curves, it is standard practice in the literature to consider them as paths of a stochastic process $X = \{X_t\}_{t \in [0,T]}$ taking values in a space of functions on some time interval $[0, T]$. Functional data have received considerable interest from researchers in recent years, and the classical tools of finite-dimensional multivariate analysis have been adapted to this kind of data. When dealing with functional data, a major interest is to develop linear regression and classification methods (see Escabias et al., 2004, 2007; James, 2002; Ratcliffe et al., 2002a, 2002b; Saporta et al., 2007). In particular, when the predictors are of functional type (generally, curves or real time functions) and the response is a categorical variable $Y$ defining $K$ groups, $K \geq 2$, linear discriminant analysis (LDA) models for functional data are considered. Preda and Saporta (2005) proposed PLS regression in order to perform LDA on functional data. Following this approach, to address the problem of anticipated prediction of the outcome at time $T$ of the process in $[0, T]$, in Costanzo et al. (2006) we measured the predictive capacity of an LDA model for functional data on the whole interval $[0, T]$. Then, depending on the quality of prediction, we determined a time $t^* < T$ such that the model considered in $[0, t^*]$ gives predictions similar to those of the model considered in $[0, T]$.

We consider here a new approach based on the definition of special Random Multiplicative Cascades (RMC for short) to model the underlying stochastic process. In particular, we consider a class $\mathcal{S}$ of stochastic processes whose realizations are real continuous piecewise linear functions with a constraint on the increment. Let $\mathcal{R}$ be the family of all binary responses $Y$ associated to a process $X$ in $\mathcal{S}$, and consider data from a continuous phenomenon which can be simulated by a pair $(X, Y) \in \mathcal{S} \times \mathcal{R}$. With the same objective of predicting the binary outcome earlier than the end of the process, we introduce the adjustment curve for the binary response $Y$ of the simulated stochastic process $X$. Such a tool is a decreasing function which makes it possible to predict $Y$ at each point in time before time $T$. For real industrial processes this curve can be a useful tool for monitoring and predicting the quality of the outcome before completion.

The paper is organized as follows. In Sec. 2 we describe our method based on the definition of special RMCs. In Sec. 3 we present the adjustment curve. Finally, in Sec. 4 we illustrate an application.

2 The Random Multiplicative Cascade Model

We start by considering a matrix of data of functional type $FD = \left(x^i(t_j)\right)$, $i = 1, \dots, L$, $t_j = j \cdot \frac{T}{S}$ for $j = 0, \dots, S$, where each row represents a continuous curve observed at the discrete times $t_j$, $j = 0, \dots, S$. Next, we consider the column vector $R = (r^i)$, $i = 1, \dots, L$, where $r^i$ is a binary outcome associated to the row $x^i(0) \dots x^i(t_j) \dots x^i(T)$ for $i = 1, \dots, L$; for example $r^i \in \{bad, good\}$. As an example, consider the situation depicted in Fig. 1, where dough resistance (density) has been recorded in a time interval $[0, T]$ during the kneading process for each type of flour. The dough resistance achieved at time $T$ affects the outcome of the process, that is, the quality (good or bad) of the resulting cookies. The obtained curves could be used to predict the quality of the cookies made with this dough before completion of the kneading.

We assume that the data $FD$ and $R$ jointly arise from a continuous phenomenon which can be simulated by a pair $(X, Y)$, where $X$ is a stochastic process whose realizations are real continuous functions with $\{x(0) : x \in X\} = \{x^i_0 : i = 1, \dots, L\}$, linear on the intervals $[t_j, t_{j+1}]$, $t_j = j \cdot \frac{T}{S}$ for $j = 0, \dots, S-1$, with a constraint on the increment, i.e. $|x(t_{j+1}) - x(t_j)| \leq M(x, j)$ for $j = 0, \dots, S-1$. In the simplest case we can assume that the increment does not exceed a certain mean constant value obtainable from the real data, i.e. $M(x, j) = M$ for each $x \in X$, $j = 0, \dots, S-1$. We will denote by $\mathcal{S}$ the class of such stochastic processes, by $X = \{X(t)\}_{t \in [0,T]}$ a stochastic process in $\mathcal{S}$, and by $x = x(t)$ ($t \in [0, T]$) a realization of $X$. Moreover, $\mathcal{R}$ will denote the class of all binary responses $Y$ associated to $X$. Without loss of generality we can assume $Y \in \{bad, good\}$. We propose a method by means of which $X$ and $Y$ can be realized via RMCs
which depend on a certain number of constants (obtained from the data $FD$ and $R$) and real positive parameters. A multiplicative cascade is a single process that fragments a set into smaller and smaller components according to a fixed rule and, at the same time, fragments the measure of the components by another rule. The central role that multiplicative cascades play in the theory of multifractal measures is well known. The notion of multiplicative cascades was introduced in the statistical theory of turbulence by Kolmogorov (1941) as a phenomenological framework intended to accommodate the intermittency and large fluctuations observed in flows. RMCs have been used as models to compress, infer future evolutions and characterize the underlying forces driving the dynamics for a wide variety of other natural phenomena (cf. Pont et al., 2009), such as rainfall (see Gupta and Waymire, 1993), internet packet traffic (see Resnick et al., 2003) and market prices (see Mandelbrot, 1998). Recently, statistical estimation theory for random cascade models has been investigated by Ossiander and Waymire (2000, 2002). We defined an RMC generating recursively a multifractal measure $\mu$ on the family of all dyadic subintervals of the unit interval $[0, 1]$. This measure $\mu$ is recursively generated with the cascade that is schematically depicted in Fig. 2 and fully detailed in Costanzo et al. (2009). In this section we summarize the main steps of our RMC model and describe how to use it to model a real phenomenon using a pair $(X, Y) \in \mathcal{S} \times \mathcal{R}$.

Let us consider the matrix $FD$ of functional data and the vector $R$ of the associated outcomes, and define the sets $B = \{x^i(T) : i = 1, \dots, L \text{ and } r^i = bad\}$ and $G = \{x^i(T) : i = 1, \dots, L \text{ and } r^i = good\}$. We assume that $\min(G) > \max(B)$.

Step 1. We use the data $FD$ and $R$ in order to define, among others, the following constants ($|A|$ denotes the number of elements of the set $A$):

$$q^0 = \frac{|G|}{|G| + |B|} \quad \text{and} \quad 1 - q^0 = \frac{|B|}{|G| + |B|},$$

$$p = \left( \max_{i=1,\dots,L} \left( x^i(T) - x^i(0) \right) - \min_{i=1,\dots,L} \left( x^i(T) - x^i(0) \right) \right) / \delta,$$

$$\delta = \sum_{i=1}^{L} \sum_{j=1}^{S} \left| x^i(t_j) - x^i(t_{j-1}) \right| / (LS).$$
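The constants of Step 1 can be computed directly from a data matrix. The following is a minimal sketch, assuming the curves are stored row-wise in a NumPy array and the outcomes are the strings "good" and "bad"; the function name `step1_constants` and its argument names are hypothetical and not part of the original paper. The sizes $|G|$ and $|B|$ are taken as the numbers of good and bad curves, and $p$ is returned as a real number because the text does not state how it is rounded to a number of stages.

```python
import numpy as np

def step1_constants(fd, labels):
    """Compute q0, delta and p from a data matrix and its binary outcomes.

    fd     : array of shape (L, S + 1); row i holds x^i(t_0), ..., x^i(t_S)
    labels : sequence of L strings, each "good" or "bad"
    """
    fd = np.asarray(fd, dtype=float)
    labels = np.asarray(labels)
    n_curves, n_points = fd.shape
    n_steps = n_points - 1  # S

    # |G| and |B| are counted as numbers of curves (assumption, see lead-in).
    n_good = np.count_nonzero(labels == "good")
    n_bad = np.count_nonzero(labels == "bad")
    q0 = n_good / (n_good + n_bad)

    # delta: mean absolute increment over all curves and all time steps.
    delta = np.abs(np.diff(fd, axis=1)).sum() / (n_curves * n_steps)

    # p: range of the total displacement x(T) - x(0), in units of delta.
    displacement = fd[:, -1] - fd[:, 0]
    p = (displacement.max() - displacement.min()) / delta
    return q0, delta, p
```

For instance, for the three toy curves `[[0, 1, 3], [0, 2, 2], [1, 0, 0]]` with outcomes `["good", "good", "bad"]`, the absolute increments sum to 6 over $LS = 6$ steps, so $\delta = 1$, the displacements are 3, 2 and −1, so $p = 4$, and $q^0 = 2/3$.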

In particular, $q^0$ is the ratio between the number of good realizations of the real process (that is, the number of those curves whose outcome was good at time $T$) and the total number of such curves, while $p$ determines the number of stages (steps) of the multiplicative cascade.

Step 2. Let $(\alpha, \beta)$ be a pair of randomly generated numbers in the square $[10^{-1}, 10] \times [10^{-1}, 10]$. For given $\alpha$, $\beta$ and $q \in (0, 1)$ and for each $i \in \{1, \dots, L\}$, if $r^i = good$ ($r^i = bad$) we start the cascade with $q = q^0$ ($q = 1 - q^0$) and truncate it at the stage $p$. To the resulting $(p+1)$-tuple of positive integers we associate a proof of length $p + 1$, that is, a real piecewise linear function with a constraint on the increment which simulates a single row of the matrix $FD$. The set $E_p$ of the $L$ proofs of length $p + 1$ obtained for $i = 1, \dots, L$ is called an experiment of size $L$ and length $p + 1$. Each simulated experiment $E_p$ is rearranged in a matrix $SFD_{\alpha,\beta}$ of experiment data (simulated functional data).

Step 3. The closeness of the functional data $FD$ to the simulated functional data $SFD = SFD_{\alpha,\beta}$ is evaluated, for the same subdivision of $[0, 1]$ in $K$ classes with endpoints $0, \frac{1}{K}, \dots, \frac{K-1}{K}, 1$, in terms of the frequency distribution $I_{FD}$ of the original data and the corresponding frequency distribution $I_{SFD_{\alpha,\beta}}$ of the simulated data. This allows us to define the set $E_{\eta,\theta}$ of all admissible experiments $E_p$ of size $L$ and length $p + 1$. The two fixed positive real numbers $\eta$ and $\theta$ provide a measure of the closeness of the simulated experiment to the real data. Admissible experiments $E_p$ can be obtained via the Monte Carlo method based on the generated random pairs $(\alpha, \beta) \in [10^{-1}, 10] \times [10^{-1}, 10]$.

Step 4. Let $E_{\eta,\theta}$ be the set of all admissible experiments $E_p$ of size $L$ and length $p + 1$, and denote by $S(E_p)$ the set of the $L$ piecewise linear functions interpolating the data in each single row of $SFD$. We define the stochastic process $X$ as the set

$$X = \bigcup_{E_p \in E_{\eta,\theta}} S(E_p).$$

The associated binary response $Y : X \to \{bad, good\}$, $Y(s) = Y_{E_p}(s)$, is uniquely determined since it does not depend on the particular experiment $E_p$.
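To make the general notion of a multiplicative cascade on the dyadic subintervals of $[0, 1]$ concrete, the following sketch builds a generic conservative random cascade. It is not the cascade of Costanzo et al. (2009), which depends on $\alpha$, $\beta$ and $q$ and is not specified here; the uniform random weights and the name `dyadic_cascade` are assumptions made purely for illustration.

```python
import numpy as np

def dyadic_cascade(n_stages, rng):
    """Return the masses of the 2**n_stages dyadic subintervals of [0, 1].

    Generic illustration only: at each stage every interval is split in two
    halves and its mass is shared between them by a random weight.
    """
    masses = np.array([1.0])  # stage 0: the whole interval carries mass 1
    for _ in range(n_stages):
        # One random weight per parent interval (uniform weights are an
        # assumption, not the rule used in the paper).
        w = rng.uniform(0.0, 1.0, size=masses.size)
        children = np.empty(2 * masses.size)
        children[0::2] = w * masses          # left halves
        children[1::2] = (1.0 - w) * masses  # right halves
        masses = children
    return masses

rng = np.random.default_rng(0)
mu = dyadic_cascade(3, rng)  # masses of the 8 intervals of length 1/8
```

Because each parent's mass is split into fractions $w$ and $1 - w$, the total mass remains 1 at every stage, while the individual masses become increasingly uneven, which is the mechanism behind the multifractal behaviour mentioned above.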

3 The Adjustment Curve for Binary Response of a Simulated Stochastic Process

In Costanzo et al. (2009) we introduced the notion of adjustment curve as a predictive tool for the binary outcome of a process. We first introduced the definition of the adjustment curve $\gamma_{a,D} : [0, T] \to [0, 1]$ for the binary outcome $R$ of the functional data $FD$. We required that for real data the condition $x^i(T) < x^j(T)$ is satisfied for each $i : r^i = bad$ and $j : r^j = good$, $i, j = 1, \dots, L$. That is, we assumed there exists a value $X(T) \in \mathbb{R}$ such that $r^i = bad$ if, and only if, $x^i(T) < X(T)$ and $r^i = good$ if, and only if, $x^i(T) \geq X(T)$ for each $i = 1, \dots, L$. Let $s^i_D : [0, T] \to \mathbb{R}$ ($i = 1, \dots, L$) be the piecewise linear functions whose node-sets are $N^i = \{(j, x^i(t_j)) : t_j = j \cdot \frac{T}{S} \text{ for } j = 0, \dots, S\}$ ($i = 1, \dots, L$). We defined:

$$b_D(j) = \max\{x^i(t_j) : i = 1, \dots, L \text{ and } r^i = bad\} \quad (j = 0, \dots, S) \tag{1}$$

$$g_D(j) = \min\{x^i(t_j) : i = 1, \dots, L \text{ and } r^i = good\} \quad (j = 0, \dots, S). \tag{2}$$

Let $i \in \{1, \dots, L\}$ and $r^i = bad$ (or $r^i = good$). The piecewise linear interpolant $s^i_D$ is called adjustable at the time $t \in [0, T]$ (for short, $t$-adjustable) if there exists $t_j \geq t$ with $j \in \{0, 1, \dots, S\}$ such that $s^i_D(t) \geq g_D(j)$ (or $s^i_D(t) \leq b_D(j)$). The adjustment curve $\gamma_{a,D} : [0, T] \to \mathbb{R}$ for the binary outcome $R$ of the functional data $FD$ is the function

$$\gamma_{a,D}(t) = \frac{\left| \{ s^i_D(t) : i = 1, \dots, L \text{ and } s^i_D(t) \text{ is } t\text{-adjustable} \} \right|}{L} \quad (t \in [0, T]).$$
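The definitions (1), (2) and the adjustment curve can be evaluated on the observation grid as in the following sketch. It assumes the curves are stored row-wise in a NumPy array with outcomes given as the strings "good" and "bad", evaluates $\gamma_{a,D}$ only at the grid times $t_j$ (not in between), and counts adjustable curves rather than distinct values; the function name `adjustment_curve` is hypothetical.

```python
import numpy as np

def adjustment_curve(fd, labels):
    """Relative frequency of t_j-adjustable curves at each grid time t_j.

    fd     : array of shape (L, S + 1); row i holds x^i(t_0), ..., x^i(t_S)
    labels : sequence of L strings; "good" marks good curves, anything
             else is treated as bad
    """
    fd = np.asarray(fd, dtype=float)
    is_good = np.asarray(labels) == "good"
    is_bad = ~is_good

    b = fd[is_bad].max(axis=0)   # b_D(j), eq. (1)
    g = fd[is_good].min(axis=0)  # g_D(j), eq. (2)

    # "There exists t_k >= t_j" is checked through the most favourable
    # future threshold: smallest g_D(k) and largest b_D(k) over k >= j.
    g_future_min = np.minimum.accumulate(g[::-1])[::-1]
    b_future_max = np.maximum.accumulate(b[::-1])[::-1]

    # A bad curve is adjustable if it reaches some future g_D(k);
    # a good curve is adjustable if it lies below some future b_D(k).
    bad_adjustable = is_bad[:, None] & (fd >= g_future_min)
    good_adjustable = is_good[:, None] & (fd <= b_future_max)
    return (bad_adjustable | good_adjustable).mean(axis=0)
```

For the toy curves `[[1, 6, 5], [1, 3, 1], [1, 0.5, 4]]` with outcomes `["good", "bad", "good"]`, all three curves are adjustable at $t_0$, the first good curve stops being adjustable at $t_1$ because it already exceeds every later value of $b_D$, and no curve is adjustable at $t_2 = T$, giving the values $1$, $2/3$ and $0$.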

Given a set of curves coming from a real continuous process, the adjustment curve is a decreasing step function which gives the relative frequency of curves adjustable (with respect to the final outcome in $T$) at each time $t \in [0, T]$. The complementary curve $1 - \gamma_{a,D}(t)$ then gives, at each time, the relative frequency of the curves that are definitively good or bad. Let us observe that by the two data sets (1) and (2) it was possible to deduce the binary response $r^i(t_j)$ associated to $s^i_D$, $i \in \{1, \dots, L\}$, at each time $t_j$, $j = 0, \dots, S-1$, before time $T$; indeed, if $s^i_D(t_j) > b_D(j)$ then $r^i(t_j) = good$, if $s^i_D(t_j) < g_D(j)$ then $r^i(t_j) = bad$, and otherwise $r^i(t_j)$ is not yet definite.

By analogy with the case of real data we can introduce the adjustment curve $\gamma_{a,E_p} : [0, p] \to [0, 1]$ for the binary outcome $Y_{E_p}$ of the admissible experiment $E_p \in E_{\eta,\theta}$. We first consider the change of variable $\tau = \frac{p}{T} \cdot t$ ($t \in [0, T]$) and obtain, for every $E_p \in E_{\eta,\theta}$, $\gamma_{a,E_p}(t) = \gamma_{a,E_p}\left(\frac{p}{T} \cdot t\right)$ ($t \in [0, T]$). The set $\{\gamma_{a,E_p} : E_p \in E_{\eta,\theta}\} = \{\gamma_1, \gamma_2, \dots, \gamma_N\}$ is finite. We then consider the random experiment “obtain an admissible experiment $E_p$” whose sample space is the infinite set $E_{\eta,\theta}$. We set $E^i_{\eta,\theta} = \{E_p \in E_{\eta,\theta} : \gamma_{a,E_p} = \gamma_i\}$ ($i = 1, \dots, N$). Let $\nu_i$ be the frequencies of the curves $\gamma_i$ ($i = 1, \dots, N$). We define the adjustment curve $\gamma_a : [0, T] \to [0, 1]$ for the binary response $Y$ of the process $X$ as the function

$$\gamma_a(t) = \sum_{i=1}^{N} \nu_i \gamma_i(t) \quad (t \in [0, T]).$$

In practice, given a couple $(X, Y) \in \mathcal{S} \times \mathcal{R}$ we can choose a tolerance $\varepsilon > 0$ such that, if $E^1_p = (x^1_1, \dots, x^1_L)$ and $E^2_p = (x^2_1, \dots, x^2_L)$ are two admissible experiments such that $\max_{i=1,\dots,L} \|x^1_i - x^2_i\|_\infty \leq \varepsilon$ (here $\|\cdot\|_\infty$ denotes the usual sup-norm), then $E^1_p$ and $E^2_p$ can be considered indistinguishable. Therefore $X$ becomes a process with a discrete number of realizations, and thus we can assume that for $i = 1, \dots, N$, $\nu_i = \lim_{n \to \infty} \nu^n_i$, where $\nu^n_i$ is the relative frequency of $\gamma_i$ observed on a sample $(\gamma_1, \dots, \gamma_n)$ of size $n$. We set

$$\gamma^n_a = \sum_{i=1}^{N} \nu^n_i \gamma_i \quad (n = 1, 2, \dots).$$

The sequence $\{\gamma^n_a\}$ converges to $\gamma_a$ on $[0, T]$ and the variance $Var(\gamma_a)$ of the random variable $\gamma_a$ is less than or equal to 2. Consequently the classical Monte Carlo method can be used to produce approximations of $\gamma_a$ with the needed precision.
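The Monte Carlo approximation $\gamma^n_a$ can be sketched as follows, assuming the adjustment curves of $n$ sampled admissible experiments are already available as rows of an array evaluated on a common time grid. The function name `monte_carlo_adjustment_curve` is hypothetical. The sketch groups identical sampled curves into the distinct curves $\gamma_i$, computes their relative frequencies $\nu^n_i$, and forms the weighted sum, which coincides with the plain pointwise average of the sample.

```python
import numpy as np

def monte_carlo_adjustment_curve(sampled_curves):
    """Approximate gamma_a from n sampled adjustment curves.

    sampled_curves : array of shape (n, m); each row is the adjustment
                     curve of one admissible experiment on a common grid
    Returns (gamma_n, distinct_curves, frequencies).
    """
    sampled_curves = np.asarray(sampled_curves, dtype=float)

    # Distinct curves gamma_i and how often each one was observed.
    distinct, counts = np.unique(sampled_curves, axis=0, return_counts=True)
    frequencies = counts / counts.sum()  # nu_i^n

    # gamma_a^n = sum_i nu_i^n * gamma_i (pointwise on the grid).
    gamma_n = frequencies @ distinct
    return gamma_n, distinct, frequencies
```

Writing the estimate through the frequencies mirrors the definition in the text; in practice `sampled_curves.mean(axis=0)` gives the same values without the grouping step.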

4 Application

We present an application of our method to a real industrial process; namely, we show how our model can be used to monitor and predict the quality of a product resulting from an industrial kneading process. We use a sample of data provided by Danone Vitapole Research Department (France). In the kneading data from Danone, for a given flour, the resistance of the dough is recorded during the first 480 seconds of the kneading process. There are 136 different flours and hence 136 different curves or trajectories (functions of time). Each curve is observed at 240 equispaced time points (the same for all flours) of the time interval $[0, 480]$. Depending on its quality, after kneading, the dough is processed to obtain cookies. For each flour the quality of the dough can be bad or good. The sample we considered contains 44 bad and 62 good observations. In Fig. 1 the grey curves (black curves) are those corresponding to the cookies that were considered of good quality (bad quality) after the end of the kneading process.

In order to introduce the adjustment curve, we required a clear separation between bad and good curves with respect to the end values of the process; that is, $R$ must depend only on the values at the time $T$ of the real process (see Sec. 3). To meet this condition we introduced the concept of $\varepsilon$-$(m, n)$ separability for two sets, by means of which we find the minimum number of bad and/or good curves that can be discarded in such a way that the ratio $q^0$ is kept within a prefixed error $\varepsilon$. For $\varepsilon = 0.05$ we discarded from our analysis eight good curves and six bad curves; the remaining 54 good curves and 38 bad curves are separated at $T = 480$ at the dough resistance value $c = X(T) = 505$. In Fig. 3 we show one admissible experiment $E_p$ obtained by the method outlined in Sec. 2. In order to obtain the adjustment curve $\gamma_a$ by the Monte Carlo method with an error less than $10^{-1}$ and probability greater than 90%, we need to perform $n = 4000$ admissible experiments. In Fig. 4 we depict the adjustment curve $\gamma_a$ for the binary response $Y$ of the stochastic process $X$ related to Danone’s data. This curve has been computed on the basis of $n = 1000$ admissible experiments $E_p$, obtained by requiring a value of the $\chi^2$ index less than or equal to seven. For each $t \in [0, T]$, the standard deviation is not greater than $0.09 \approx 10^{-1}$.
The intervals of one standard deviation around the mean curve contain a share of the adjustment curves of the admissible experiments $E_p$ which ranges from a minimum of about 65% (in the time interval between about $t = 150$ and $t = 320$) to a maximum of about 87%, while for the intervals of two standard deviations this share ranges between 92% and 97%. These last intervals are illustrated in Fig. 4 by the plus and minus signs, and they contain the adjustment curve $\gamma_{a,D}$ of the real Danone data. In this figure we can observe that the prediction curve of the simulated process gives the same results as that of the real data at times near $t = 186$, which was the same time $t^* < T$ we determined in Costanzo et al. (2006) on the whole interval $[0, 480]$; the average test error rate was about 0.112 for an average $AUC(T) = 0.746$. However, in our case we can observe that after this time and until time $t = 326$ the curve $\gamma_{a,D}$ gives an adjustment higher than $\gamma_a$, which denotes instability for these data, since from 65% to 42% of the curves are not yet definitively bad or good. Let us remark that the mean absolute difference between the two curves is 0.06 on the whole time interval, while its value is 0.112 in the time interval $[186, 326]$. Starting from time $t = 330$, $\gamma_a$ lies above $\gamma_{a,D}$, so that this time seems, for these data, a good time to start predicting. An adjustment value of about $\gamma_a = 0.20$ implies in fact that bad outcomes at this time have a low probability (0.20) of adjustment before the end of the process, and they could be discarded or the process could be modified; at the same time, a good outcome has a high probability $1 - \gamma_a = 0.80$ of remaining the same until the end of the process. Further details on the adjustment curve and its interpretation can be found in Costanzo et al. (2009).
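The comparison between the two curves through their mean absolute difference, on the whole interval and on a subinterval such as $[186, 326]$, can be sketched as follows. The sketch assumes both curves are sampled on the same time grid; the function name `mean_abs_difference` and its arguments are hypothetical, and the numbers in the usage note are toy values, not the Danone data.

```python
import numpy as np

def mean_abs_difference(times, curve_a, curve_b, t_start=None, t_end=None):
    """Mean absolute difference of two curves sampled on the same grid,
    optionally restricted to the grid times in [t_start, t_end]."""
    times = np.asarray(times, dtype=float)
    diff = np.abs(np.asarray(curve_a, dtype=float)
                  - np.asarray(curve_b, dtype=float))

    # Keep every grid point unless a bound is given.
    mask = np.ones(times.shape, dtype=bool)
    if t_start is not None:
        mask &= times >= t_start
    if t_end is not None:
        mask &= times <= t_end
    return diff[mask].mean()
```

With the toy grid `[0, 100, 200, 300, 400]` and curves `[1, 0.8, 0.6, 0.4, 0.2]` and `[1, 0.9, 0.4, 0.3, 0.2]`, the pointwise differences are 0, 0.1, 0.2, 0.1 and 0, so the mean is 0.08 on the whole grid and about 0.133 when restricted to $[100, 300]$.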

5 Conclusion and perspectives

The RMC model proposed in this work is characterized by an intrinsic complexity of the recursive relation generating the cascade structure in terms of multifractal measures. As a consequence, the partition function and the so-called sequence of mass exponents (Feder, 1988) cannot be written down immediately (or directly). The latter is very important since it allows the definition, via the Legendre transform, of the generalized fractal dimensions that control the multi-scaling behaviour of the support of the measures. Work in progress comprises the formulation of this partition function in order to obtain, if it exists, the limit of the multiplicative process and thereby define a multifractal spectrum analytically, as well as the validation of the multifractality of an industrial kneading process by means of the analysis of the relative scalings of the observed functional data. For such a validation, in accordance with the standards required for numerical convergence of the multifractal measures, we need however curves consisting of a very high number of data points (Fuchslin et al., 2001).

Fig. 1. Danone’s data

Fig. 2. The first four stages of the multiplicative cascade

Fig. 3. An admissible experiment $E_p$

Fig. 4. The adjustment curves of the process ($\gamma_a$) and of Danone’s data ($\gamma_{a,D}$)


Acknowledgements. Thanks are due for their support to Food Science & Engineering Interdepartmental Center of University of Calabria and to L.I.P.A.C., Calabrian Laboratory of Food Process Engineering (Regione Calabria APQ-Ricerca Scientifica e Innovazione Tecnologica I atto integrativo, Azione 2 laboratori pubblici di ricerca mission oriented interfiliera).

References

COSTANZO, G. D., PREDA, C., SAPORTA, G. (2006), Anticipated Prediction in Discriminant Analysis on Functional Data for binary response. In: Rizzi, A., Vichi, M. (eds.) COMPSTAT’2006 Proceedings, pp. 821-828. Springer, Heidelberg.
COSTANZO, G. D., DELL’ACCIO, F., TROMBETTA, G. (2009), Adjustment curves for binary responses associated to stochastic processes. Dipartimento di Economia e Statistica, Working paper n. 17, Anno 2009, submitted.
ESCABIAS, A. M., AGUILERA, A. M., VALDERRAMA, M. J. (2004), Principal component estimation of functional logistic regression: discussion of two different approaches. Journal of Nonparametric Statistics 16:365-384.
ESCABIAS, A. M., AGUILERA, A. M., VALDERRAMA, M. J. (2007), Functional PLS logit regression model. Computational Statistics and Data Analysis 51:4891-4902.
FEDER, J. (1988), Fractals. Plenum.
FUCHSLIN, R. M., SHEN, Y., MEIER, P. F. (2001), An efficient algorithm to determine fractal dimensions of points sets. Physics Letters A 285:69-75.
GUPTA, V. K., WAYMIRE, E. (1993), A statistical analysis of mesoscale rainfall as a random cascade. J. Appl. Meteor. 32:251-267.
JAMES, G. (2002), Generalized Linear Models with Functional Predictor Variables. Journal of the Royal Statistical Society Series B 64:411-432.
KOLMOGOROV, A. N. (1941), The local structure of turbulence in incompressible viscous fluid for very large Reynolds number. Dokl. Akad. Nauk SSSR 30:9-13.
MANDELBROT, B. (1998), Fractals and scaling in finance: Discontinuity, concentration, risk. Springer-Verlag, New York.
OSSIANDER, M., WAYMIRE, C. E. (2000), Statistical Estimation for Multiplicative Cascades. The Annals of Statistics 28(6):1533-1560.
OSSIANDER, M., WAYMIRE, C. E. (2002), On Estimation Theory for Multiplicative Cascades. Sankhyā, Series A 64:323-343.
PREDA, C., SAPORTA, G. (2005), PLS regression on a stochastic process. Computational Statistics and Data Analysis 48:149-158.
RATCLIFFE, S. J., LEADER, L. R., HELLER, G. Z. (2002a), Functional data analysis with application to periodically stimulated fetal heart rate data: I. Functional regression. Statistics in Medicine 21:1103-1114.
RATCLIFFE, S. J., LEADER, L. R., HELLER, G. Z. (2002b), Functional data analysis with application to periodically stimulated fetal heart rate data: II. Functional logistic regression. Statistics in Medicine 21:1115-1127.
RESNICK, S., SAMORODNITSKY, G., GILBERT, A., WILLINGER, W. (2003), Wavelet analysis of conservative cascades. Bernoulli 9:97-135.
SAPORTA, G., COSTANZO, G. D., PREDA, C. (2007), Linear methods for regression and classification with functional data. In: IASC-ARS’07 Proceedings, Special Conf., Seoul, 2007 (ref. CEDRIC 1234).
