
Interval arithmetic-based simple linear regression between interval data: discussion and sensitivity analysis on the choice of the metric ⋆

Beatriz Sinova,∗ Ana Colubi, María Ángeles Gil, and Gil González-Rodríguez

Departamento de Estadística, I.O. y D.M., Universidad de Oviedo, 33071 Oviedo, Spain

Abstract

The prediction of a response random interval-valued set from an explanatory one has been examined in previous developments. These developments have considered an interval arithmetic-based linear model between the random interval-valued sets and a least squares regression analysis. The least squares approach involves a generalized L2-metric between interval data; this metric weights squared distances between data location (mid-points/centers) and squared distances between data imprecision (spread/radius). As a consequence, estimators of the parameters in the linear model depend on the choice of the weights in the metric. To investigate a suitable choice of the weights in the generalized mid/spread metric, a theoretical conclusion is first obtained. Finally, the impact of varying the weights is discussed by means of a Monte Carlo simulation study.

Key words: generalized mid/spread metric, interval arithmetic, linear regression analysis, random interval-valued set, t-vector function

⋆ This research has been partially supported by/benefited from the Spanish Ministry of Science and Innovation Grants MTM2009-09440-C02-01 and MTM2009-09440-C02-02, the Principality of Asturias Grants IB09-042C1 and IB09-042C2, and the COST Action IC0702. Beatriz Sinova has also been granted the Ayuda del Programa de FPU AP2009-1197 from the Spanish Ministry of Education, an Ayuda de Investigación 2011 from the Fundación Banco Herrero, and two Short Term Scientific Missions associated with the COST Action IC0702. Their financial support is gratefully acknowledged.

∗ Corresponding author: [email protected], Fax: (+34) 985103354.

Preprint submitted to Elsevier

19 November 2011

1 Introduction

In investigating the relationship between random elements, regression analysis makes it possible to look for the causal effect of one (or several) random element(s) upon another. Regression techniques have long been relevant to many fields. Most regression methods assume that the random elements involved can be formalized as real-valued random variables. However, there is a substantial number of practical situations in which the attributes involved take interval rather than real values. In Lubiano [32], Ferson et al. [14], [15], [16], Billard and Diday [3], D’Urso and Giordani [11], Kreinovich et al. [28], and Chuang [8] one can find many instances of the usual sources of interval-valued data. These include intermittent measurements, censoring, data binning, cyclical fluctuations, ranges, and so on. The statistical analysis of these data is therefore especially interesting in real-life applications.

The problem of linear regression analysis with interval data has been studied from different perspectives and in different frameworks (see, for instance, Diamond [9], Lubiano [32], Billard and Diday [2], Gil et al. [18], [19], [20], Manski and Tamer [33], Montenegro [34], De Carvalho et al. [10], Lima Neto and De Carvalho [30], [31]).

A least squares approach for an interval arithmetic-based linear model has recently been carried out (see González-Rodríguez et al. [24], [23], Gil et al. [21], Blanco et al. [4], and Blanco-Fernández et al. [6]). This approach involves essential and distinctive features, like the following ones:

• The approach is based on the usual interval arithmetic to formalize the linear relationship between the response and explanatory random elements. Consequently, this approach looks jointly at the location and the imprecision characterizing interval data, instead of treating them separately.

• The so-called t-vector function or mid/spread characterization of the nonempty compact intervals makes it possible to identify interval data with certain R²-valued data. This identification allows us to induce a generalized metric between intervals, as well as the model in the probabilistic setting for interval-valued random elements and the associated relevant summary measures of its distribution.

• The least squares methodology is based on the above-mentioned arithmetic and generalized metric.

Estimators of the parameters involved in the linear model have been obtained and analyzed under general conditions (González-Rodríguez et al. [24], [23], Gil et al. [21], Blanco et al. [4], Blanco-Fernández [5], Blanco-Fernández et al. [6]). The estimators depend on the metric between interval data which is considered to formalize the least squares approach. This metric generalizes the well-known metric of Bertoluzza et al. [1] (see also Trutschnig et al. [35] for a related detailed discussion). As outlined by Gil et al. [20] and Montenegro [34], the midpoint/spread (equivalent) expression of this metric has been crucial in interpreting and determining estimators of the parameters of the linear regression problem (see Gil et al. [20]), and in performing tests under the linear model assumption (cf. González-Rodríguez et al. [24], [23], Blanco et al. [4]). Kulpa [29] indicated the interest of the mid-point/spread tandem in some other implications from interval arithmetic. In a different approach to the regression between interval-valued data, its importance has also been pointed out later by other authors (see, for instance, De Carvalho et al. [10], Lima Neto and De Carvalho [30], [31]).

In Section 2 of this paper preliminary concepts and results are presented. Section 3 recalls the regression problem between two interval-valued random elements, and the interval arithmetic-based linear model along with the associated parameters’ estimators. In Section 4, a theoretical search for a suitable choice of the metric on the basis of the mean square error of the estimators is first carried out. The conclusions from the theoretical development are then corroborated by an empirical sensitivity analysis of the estimators in Monte Carlo simulations from relevant representative situations. Finally, some concluding remarks and future directions are given.

2 Preliminaries

The analysis of interval-valued data requires an adequate framework so that statistical developments, and especially inferential ones, can be well formalized. The space of interval values for data to be considered in the study is the class Kc(R) of the nonempty compact intervals.

Remark 2.1 It should be emphasized that interval data can overlap and, hence, grouped data are just a particular case of interval data. In general, no constraints (other than nonempty compactness) will be assumed on interval data. However, if the interval values of the explanatory random element are nested and share the center, then the mid of this random term is a degenerate random variable. In this case, the interval arithmetic-based regression approach is of no real interest, and even a separate regression study for the mids would not make sense either.

After presenting motivating examples, this section recalls the tools required for the regression problems which will be formulated and discussed in Section 3.

2.1 Motivating case-studies

The next real-life examples motivate the interest of the regression problem between interval-valued random elements. The random elements of interest in this problem are intrinsically interval-valued.

Example 2.1 Data in Figure 1 were supplied in 1997 by the Nephrology Unit of the Hospital Valle del Nalón in Langreo (Asturias, Spain). The ‘scatter diagrams’ correspond to the “range of systolic blood pressure over the same day”, X, and the “range of diastolic blood pressure over the same day”, Y, observed over a sample of 59 patients (suffering different types of illness) from a population of 3000 who are hospitalized per year in a given area. Data are collected in Table 1, and additional related interval data can be found, for instance, in Lubiano [32] or Gil et al. [21].

X         Y         X         Y         X         Y         X         Y
118-173   63-102    112-162   62-116    119-212   47-93     130-180   64-121
104-161   71-118    136-201   67-122    122-178   73-105    103-161   55-97
131-186   58-113    90-177    52-104    127-189   74-125    125-192   59-101
105-157   62-118    116-168   58-109    113-213   52-112    97-182    54-104
120-179   59-94     98-157    50-111    141-205   69-133    124-226   57-101
101-194   48-116    98-160    47-108    99-169    53-109    120-180   59-90
109-174   60-119    97-154    60-107    126-191   60-98     100-161   54-104
128-210   76-125    87-150    47-86     99-201    55-121    159-214   99-127
94-145    47-104    141-256   77-158    88-221    37-94     138-221   70-118
148-201   88-130    108-147   62-107    113-183   55-85     87-152    50-95
111-192   52-96     115-196   65-117    94-176    56-121    120-188   53-105
116-201   74-133    99-172    42-86     102-156   50-94     95-166    54-100
102-167   39-84     113-176   57-95     103-159   52-95     92-173    45-107
104-161   55-98     114-186   46-103    102-185   63-118    83-140    45-91
106-167   45-95     145-210   100-136   111-199   57-113

Table 1 Data on the ranges of systolic (X) and diastolic (Y) blood pressure

Values of X and Y are obtained from several registers of the systolic and diastolic blood pressure of each patient, measured at different moments (usually 60 to 70) over the same day. Blood pressure data are often collected by simply taking into account the lowest and highest registers during a day (actually, some devices used for this purpose only record these extreme values over a day). In these cases, the knowledge of the whole set of registers for a day and the associated variation could distort the information on the characteristic which is considered to be relevant in some clinical studies: the range. So, X and Y are intrinsically interval-valued.

To ease the visualization (see Gil et al. [21]) the corresponding scatter diagram (Figure 1) has been drawn by plotting each pair of interval values as a cross, instead of as a rectangle. The cross center is determined by the pair of midpoints of the interval values for the systolic and diastolic blood pressures of a patient. The length of the horizontal arms equals the radius of the systolic range, and the length of the vertical arms equals the radius of the diastolic range.

Fig. 1. Scatter diagrams of ‘systolic vs diastolic blood pressure’

To examine the linear relationship between X and Y in this case study, in Lubiano [32], Gil et al. [19], [20], [21], Montenegro [34], González-Rodríguez et al. [24], [23], Blanco et al. [4], Blanco-Fernández [5], and Blanco-Fernández et al. [6], one can find studies concerning the estimation of the parameters of the linear regression problem and the test of linear independence (with or without the linear model assumption) for these data.

Example 2.2 Academic institutions traditionally conduct surveys of students’ opinions about the courses they have taken. Many questionnaires designed for this purpose include a set of questions about different aspects of each course, and respondents choose their judgements/opinions from a list of pre-established answers, which are often coded by integer numbers (say 1 to 5); some statistics (mainly concerning mean values) are computed later.

A questionnaire was applied on the occasion of the II Summer School of the European Centre for Soft Computing (held in July 2008 in Mieres, Asturias, Spain). The survey aimed to represent students’ opinion/valuation about different key aspects of each of several delivered courses. Students were requested to reply by using intervals in the [0, 100] X-scale (i.e., [0, 100] has been the general support, with 0% indicating the minimum degree of agreement and 100% the maximum degree). In fact, students were requested to use an extension of interval values, namely, fuzzy values (see González-Rodríguez et al. [25] for more details on the survey, as well as González-Rodríguez et al. [26] for a guideline suggesting how to collect fuzzy data associated with random experiments), but in this example we will focus on the (interval-valued) so-called 0-level, which means the set of values that the student considers to be compatible to a greater or lesser extent with their valuation.

Table 2 has been constructed from the full dataset which can be found in http://bellman.ciencias.uniovi.es/SMIRE/Valuations.html. Data in the table correspond to the valuation by 28 students attending the Summer School of the overall rating of two different courses. The corresponding scatterplot is represented in Figure 2.

Course 1   Course 2   Course 1   Course 2
50–78      75–97      50–70      60–100
33–57      79–96      57–100     67–100
44–77      90–100     4–20       60–82
84–97      70–100     60–100     85–95
50–80      70–100     65–80      70–90
50–80      80–100     57–67      77–91
35–66      33–46      40–60      70–100
67–80      83–96      30–50      80–100
60–70      70–100     65–85      72–87
80–90      91–100     80–100     84–95
60–80      80–100     80–100     80–100
30–80      40–71      58–78      74–97
80–100     73–100     60–90      70–100
50–80      70–98      60–99      90–100

Table 2 Interval valuation of 28 students about the overall rating of two different courses in a Summer School

Fig. 2. Scatter diagrams of the Survey study (horizontal axis: Course 1; vertical axis: Course 2)

A visual inspection of Figure 2 reveals the presence of two outliers with quite low mid valuation for Course 2 in relation to the majority. In the subsequent analysis these two outlying observations will be removed, although in general this situation calls for the future development of robust statistical methods for handling random interval-valued sets.

The interval arithmetic-based linear relationship between the valuations of students for the two courses could be analyzed in a similar way to that for the data in Example 2.1.

2.2 Interval-valued data: arithmetic, metric and vectorial identification

Interval arithmetic is a special case of set-valued arithmetic. The sum and the product by a real number (which are the basic elementary operations to deal with interval data in statistics) are extended in a straightforward way by using the image of the considered intervals through the corresponding operation in R.

Given two interval-valued data $K, K' \in \mathcal{K}_c(\mathbb{R})$, the sum of $K$ and $K'$ is defined as the interval in $\mathcal{K}_c(\mathbb{R})$

\[ K + K' = \text{Minkowski sum of } K \text{ and } K' = \{x + y : x \in K,\ y \in K'\}. \]

Given an interval-valued datum $K \in \mathcal{K}_c(\mathbb{R})$ and a real number $\gamma$, the product of $K$ by the scalar $\gamma$ is defined as the interval in $\mathcal{K}_c(\mathbb{R})$ such that

\[ \gamma K = \{\gamma \cdot x : x \in K\}. \]

It should be pointed out that $(\mathcal{K}_c(\mathbb{R}), +, \cdot)$ does not have a linear (but a conical) structure, since in general $K + (-1)K \neq [0, 0]$ = neutral element of the Minkowski sum in $\mathcal{K}_c(\mathbb{R})$.

On the other hand, any interval $K \in \mathcal{K}_c(\mathbb{R})$ can be characterized by the so-called t-vector (or mid/spread characterization) of $K$, $t_K = (\mathrm{mid}\,K, \mathrm{spr}\,K)$, where $\mathrm{mid}\,K$ = mid-point/center of $K$ and $\mathrm{spr}\,K$ = spread/radius of $K$ (see Blanco-Fernández [5], Blanco-Fernández et al. [7], [6]). This characterization makes it possible to embed $\mathcal{K}_c(\mathbb{R})$ into $\mathbb{R}^2$ through the t-vector function given by $t : \mathcal{K}_c(\mathbb{R}) \to \mathbb{R}^2$ s.t. $t(K) = t_K$.

The t-vector function preserves the semilinearity of $\mathcal{K}_c(\mathbb{R})$ since

\[ t(K + K') = t(K) + t(K'), \qquad t(\lambda K) = \lambda \cdot t(K), \]

for all $K, K' \in \mathcal{K}_c(\mathbb{R})$ and $\lambda \ge 0$. This function allows us to induce a family of $L^2$ metrics on $\mathcal{K}_c(\mathbb{R})$ from a family of $L^2$ distances on $\mathbb{R}^2$ like the one given for $\vec{x}, \vec{y} \in \mathbb{R}^2$ (with $\vec{x} = (x_1, x_2)$ and $\vec{y} = (y_1, y_2)$) and $\theta > 0$ by

\[ d_\theta(\vec{x}, \vec{y}) = \sqrt{\langle \vec{x} - \vec{y}, \vec{x} - \vec{y} \rangle_\theta} = \sqrt{\|\vec{x} - \vec{y}\|_\theta^2} = \sqrt{(x_1 - y_1)^2 + \theta \cdot (x_2 - y_2)^2}. \]
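The operations just recalled can be sketched in a few lines of Python. This is only an illustrative sketch, not code from the paper: the names `Interval`, `add`, `scale`, `t_vector` and `d_theta` are hypothetical, and the two intervals are the first systolic and diastolic ranges of Table 1.

```python
import math
from typing import NamedTuple, Tuple


class Interval(NamedTuple):
    """Nonempty compact interval [lo, hi] (hypothetical helper type)."""
    lo: float
    hi: float


def add(k: Interval, kp: Interval) -> Interval:
    # Minkowski sum: {x + y : x in K, y in K'}
    return Interval(k.lo + kp.lo, k.hi + kp.hi)


def scale(gamma: float, k: Interval) -> Interval:
    # Product by a scalar: {gamma * x : x in K}; a negative gamma swaps the ends
    a, b = gamma * k.lo, gamma * k.hi
    return Interval(min(a, b), max(a, b))


def t_vector(k: Interval) -> Tuple[float, float]:
    # (mid K, spr K)
    return ((k.lo + k.hi) / 2.0, (k.hi - k.lo) / 2.0)


def d_theta(k: Interval, kp: Interval, theta: float) -> float:
    # Generalized mid/spread distance for a weight theta > 0
    (m1, s1), (m2, s2) = t_vector(k), t_vector(kp)
    return math.sqrt((m1 - m2) ** 2 + theta * (s1 - s2) ** 2)


K = Interval(118.0, 173.0)   # first systolic range in Table 1
Kp = Interval(63.0, 102.0)   # first diastolic range in Table 1

# K + (-1)K is not the neutral element [0, 0] unless K is a single point
not_zero = add(K, scale(-1.0, K))
```

Representing an interval by its end points and deriving the t-vector on demand keeps the arithmetic exact for the sum, while the distance only ever needs the mid/spread pair.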

This induction can be carried out so that the t-vector embeds the space Kc(R) isometrically onto the cone R × [0, +∞) of R². In this respect, and based on the ideas in Gil et al. [20] and Blanco-Fernández [5], Blanco-Fernández et al. [7], [6] (and, more generally, Trutschnig et al. [35]) the following family of metrics can be introduced:

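Proposition 2.1 Given $\theta \in (0, +\infty)$, the mapping $d_\theta : \mathcal{K}_c(\mathbb{R}) \times \mathcal{K}_c(\mathbb{R}) \to [0, +\infty)$ such that for any $K, K' \in \mathcal{K}_c(\mathbb{R})$

\[ d_\theta(K, K') = d_\theta(t_K, t_{K'}) = \sqrt{(\mathrm{mid}\,K - \mathrm{mid}\,K')^2 + \theta \cdot (\mathrm{spr}\,K - \mathrm{spr}\,K')^2} \]

satisfies that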

i) $d_\theta$ is an $L^2$-type metric on $\mathcal{K}_c(\mathbb{R})$.
ii) $(\mathcal{K}_c(\mathbb{R}), d_\theta)$ is a separable metric space.
iii) The t-vector function $t : \mathcal{K}_c(\mathbb{R}) \to \mathbb{R}^2$ states an isometrical embedding of $\mathcal{K}_c(\mathbb{R})$ (with the interval arithmetic and $d_\theta$) onto a closed convex cone of $\mathbb{R}^2$ (with the functional arithmetic and the distance $d_\theta$).

Remark 2.2 It should be pointed out that the considered metric $d_\theta$ includes, as special cases, a wide and valuable class of metrics for interval data. One of the particular instances, $d_1$, can be proved to be equivalent to the well-known Vitale $L^2$ metric (cf. Vitale [36]) for the one-dimensional case,

\[ \delta_2(K, K') = \sqrt{\tfrac{1}{2}\,|\inf K - \inf K'|^2 + \tfrac{1}{2}\,|\sup K - \sup K'|^2}. \]

A more general instance is that of Bertoluzza et al.’s metric [1]. More precisely, Gil et al. [20] (see Trutschnig et al. [35] for this and an extended result for a distance between fuzzy numbers) have shown that for any $\theta \in (0, 1]$ there exists a normalized weighting measure $W$ on the measurable space $([0, 1], \mathcal{B}_{[0,1]})$ being symmetric (w.r.t. .5) and nondegenerate such that

\[ d_\theta(K, K') = \sqrt{\int_{[0,1]} \left(K^{[\lambda]} - K'^{[\lambda]}\right)^2 \, dW(\lambda)} \]

(with $K^{[\lambda]} = \lambda \sup K + (1 - \lambda) \inf K$ for any $\lambda \in [0, 1]$), and conversely. This coincidence corroborates the richness and versatility of $d_\theta$, and allows different choices of $\theta \in (0, 1]$ to be interpreted; in this respect we could state, for instance, that

\[ d_{1/3}(K, K') = \sqrt{\int_{[0,1]} \left(K^{[\lambda]} - K'^{[\lambda]}\right)^2 \, d\lambda}. \]
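The two special cases in Remark 2.2 can be checked numerically. The sketch below is an illustration under stated assumptions rather than part of the original development: the function names are hypothetical, intervals are plain `(inf, sup)` tuples, and the Lebesgue integral is approximated by a midpoint rule with an arbitrarily chosen number of steps.

```python
import math
from typing import Tuple

Interval = Tuple[float, float]  # (inf, sup), assumed inf <= sup


def d_theta(k: Interval, kp: Interval, theta: float) -> float:
    # Mid/spread form of the metric
    dm = (k[0] + k[1]) / 2.0 - (kp[0] + kp[1]) / 2.0
    ds = (k[1] - k[0]) / 2.0 - (kp[1] - kp[0]) / 2.0
    return math.sqrt(dm ** 2 + theta * ds ** 2)


def bertoluzza_lebesgue(k: Interval, kp: Interval, steps: int = 10_000) -> float:
    # Midpoint-rule approximation of the integral over [0, 1] of
    # (K^[lam] - K'^[lam])^2 with respect to the Lebesgue measure
    total = 0.0
    for j in range(steps):
        lam = (j + 0.5) / steps
        a = lam * k[1] + (1.0 - lam) * k[0]
        b = lam * kp[1] + (1.0 - lam) * kp[0]
        total += (a - b) ** 2
    return math.sqrt(total / steps)


def vitale(k: Interval, kp: Interval) -> float:
    # One-dimensional Vitale L2 metric in terms of the end points
    return math.sqrt(0.5 * (k[0] - kp[0]) ** 2 + 0.5 * (k[1] - kp[1]) ** 2)


K, Kp = (118.0, 173.0), (63.0, 102.0)
gap_one_third = abs(d_theta(K, Kp, 1.0 / 3.0) - bertoluzza_lebesgue(K, Kp))
gap_vitale = abs(d_theta(K, Kp, 1.0) - vitale(K, Kp))
```

Both gaps should be negligible: the first up to the quadrature error, the second up to floating-point rounding.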

On the basis of the fact that dθ covers a very wide and valuable class of metrics between intervals, the problem in this paper could be stated as follows: given the model, one looks for the ‘best’ metric within this wide class of metrics.

The embedding through the t-vector in Proposition 2.1 iii) can be used to induce the notions of Kc(R)-valued random element and associated relevant parameters of its distribution from the concepts of random vectors and associated relevant parameters.

2.3 Induced interval-valued random elements and relevant parameters

In dealing with interval data for statistical purposes (and especially if we have in mind inferential goals), there is a need for modeling the random mechanisms producing these data within a probabilistic setting. Otherwise, the problems to be considered will not be appropriately formalized.

By considering the above-mentioned isometrical embedding, the notion of a random interval-valued set can be immediately induced from that of random element as follows:

Definition 2.1 Given a probability space (Ω, A, P), consider the mapping X : Ω → Kc(R). Then, X is said to be a random interval-valued set (for short RIS) if tX = t ◦ X : Ω → R² is a two-dimensional random vector, that is, a Borel measurable mapping w.r.t. A and the Borel σ-field generated by the topology induced by dθ on R².

Random interval-valued sets can be equivalently formalized as given in

Proposition 2.2 Given a probability space (Ω, A, P), consider the mapping X : Ω → Kc(R). Then, the following statements are equivalent to X being an RIS:

i) X is a compact convex random set, that is, it is a Borel measurable mapping w.r.t. A and the Borel σ-field generated by the topology induced by the Hausdorff metric on Kc(R).
ii) X is a Borel measurable mapping w.r.t. A and the Borel σ-field generated by the topology induced by the metric dθ on Kc(R).
iii) X is a random interval, that is, the real-valued functions inf X : Ω → R, sup X : Ω → R are real-valued random variables.

The Borel measurability of RISs guarantees that one can properly refer (without needing to define them ad hoc) in this setting to concepts like the distribution induced by an RIS, the stochastic independence of RISs, and so on, which are crucial for inferential developments.

In analyzing interval data from an RIS, two relevant summary measures/parameters are to be considered, namely, the mean value and the Fréchet variance, both induced from the expectation and Fréchet’s variance of a random element. Thus,

Definition 2.2 Given a probability space (Ω, A, P) and an associated RIS X such that tX ∈ L¹(Ω, A, P) (i.e., it is integrable), the (Aumann) mean value of X is the interval E[X] ∈ Kc(R) such that tE[X] = E(tX), that is,

E[X] = [E(inf X), E(sup X)].

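Definition 2.3 Given a probability space $(\Omega, \mathcal{A}, P)$ and an associated RIS $X$ such that $\|t_X\|_\theta \in L^2(\Omega, \mathcal{A}, P)$, the θ-Fréchet variance of $X$ is the real number

\[ \sigma_X^2 = \mathrm{Var}(t_X) = E\left[\|t_X - E(t_X)\|_\theta^2\right], \]

that is,

\[ \sigma_X^2 = \mathrm{Var}(\mathrm{mid}\,X) + \theta \cdot \mathrm{Var}(\mathrm{spr}\,X). \]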
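Sample analogues of Definitions 2.2 and 2.3 are straightforward to compute. The following sketch is illustrative only: the function names are hypothetical, intervals are `(inf, sup)` tuples, and the divisor `n` (rather than `n - 1`) is chosen to match the analogue estimators used later in the algorithm of Section 3.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (inf, sup), assumed inf <= sup


def aumann_mean(sample: List[Interval]) -> Interval:
    # Sample Aumann mean: average the lower and the upper end points
    n = len(sample)
    return (sum(lo for lo, _ in sample) / n, sum(hi for _, hi in sample) / n)


def frechet_variance(sample: List[Interval], theta: float) -> float:
    # Sample theta-Frechet variance: Var(mid) + theta * Var(spr), divisor n
    n = len(sample)
    mids = [(lo + hi) / 2.0 for lo, hi in sample]
    sprs = [(hi - lo) / 2.0 for lo, hi in sample]
    m_bar, s_bar = sum(mids) / n, sum(sprs) / n
    var_mid = sum((m - m_bar) ** 2 for m in mids) / n
    var_spr = sum((s - s_bar) ** 2 for s in sprs) / n
    return var_mid + theta * var_spr


toy_sample = [(0.0, 2.0), (2.0, 6.0)]   # made-up data, for illustration only
toy_mean = aumann_mean(toy_sample)
toy_var = frechet_variance(toy_sample, 1.0)
```

For the toy sample the mids are 1 and 4 and the spreads are 1 and 2, so the variance with θ = 1 is the sum of the two componentwise variances.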

Due to the properties of the t-vector function and of random vectors, the mean value of an RIS satisfies the usual properties of linearity, it is the Fréchet expectation w.r.t. $d_\theta$, and it is coherent with the usual interval arithmetic. Moreover, the θ-Fréchet variance of an RIS satisfies the usual properties for this concept.

In analyzing interval data from two (possibly linked) RISs, the covariance plays a key role. Since $\mathbb{R}^2$ has a linear structure, the covariance can be defined on it, although external operations are required. Then, by considering the inner product $\langle \cdot, \cdot \rangle_\theta$ one can state (by following ideas similar to those by Körner and Näther [27]) that

Definition 2.4 If $X, Y$ are RISs such that $\|t_X\|_\theta, \|t_Y\|_\theta, \langle t_X, t_Y \rangle_\theta \in L^1(\Omega, \mathcal{A}, P)$, the θ-covariance between $X$ and $Y$ is defined as

\[ \sigma_{X,Y} = \mathrm{Cov}(t_X, t_Y) = E\left(\langle t_X - E(t_X), t_Y - E(t_Y) \rangle_\theta\right), \]

that is,

\[ \sigma_{X,Y} = \mathrm{Cov}(\mathrm{mid}\,X, \mathrm{mid}\,Y) + \theta \cdot \mathrm{Cov}(\mathrm{spr}\,X, \mathrm{spr}\,Y). \]

Although the covariance of two RISs preserves most of the properties for real-valued random variables, some properties like those related to linear transformations of the RISs are not satisfied in general. In particular, $\sigma_{a \cdot X, Y} \neq a\,\sigma_{X,Y}$ in general (for instance when $a < 0$, since $\mathrm{spr}(aX) = |a|\,\mathrm{spr}\,X$).

3 Interval arithmetic-based simple linear regression model with interval data

As indicated in the introductory section, the problem of linear regression analysis with interval data has been studied from different views and settings (see, for instance, González-Rodríguez et al. [23], Blanco-Fernández [5], Blanco-Fernández et al. [6] for further comments). An interval arithmetic-based linear model between two random interval-valued sets has recently been analyzed, and estimation/testing problems have been discussed (see González-Rodríguez et al. [24], [23], Gil et al. [21], Blanco et al. [4], Blanco-Fernández [5], Blanco-Fernández et al. [6]). Arguments supporting the interest, applicability and ease of interpretation of this model have been detailed in these references.

It should be noted that a linear model based on the usual interval arithmetic in Subsection 2.2 looks jointly at the location and the imprecision of interval data (i.e., at each interval datum as a whole). Instead, we could either examine location and imprecision separately or treat them as independent random variables. However, the interval arithmetic-based model is always well-defined, whereas one cannot guarantee that a separate linear model for spreads always makes sense.

In this section the key previous ideas and results for this model are briefly recalled.

3.1 Regression problem and estimation of the parameters of the linear model

Let $\{(x_i, y_i)\}_{i=1}^n$ be the sample data with $(x_i, y_i) \in \mathcal{K}_c(\mathbb{R}) \times \mathcal{K}_c(\mathbb{R})$ for all $i = 1, \ldots, n$, this sequence being a realization of a simple random sample $\{(X_i, Y_i)\}_{i=1}^n$ obtained from a random element $(X, Y) : \Omega \to \mathcal{K}_c(\mathbb{R}) \times \mathcal{K}_c(\mathbb{R})$ such that $X$ and $Y$ are both RISs.

If an interval arithmetic-based linear model $Y = aX + \varepsilon$ is considered, with $a \in \mathbb{R}$ and $\varepsilon$ being an RIS, then $\mathrm{mid}\,Y = a\,\mathrm{mid}\,X + \varepsilon_m$ and $\mathrm{spr}\,Y = |a|\,\mathrm{spr}\,X + \varepsilon_s$ (where $\varepsilon_m$ and $\varepsilon_s$ are real-valued random errors). Consequently, with such a linear relationship $a$ can be interpreted as the rate at which the location of $Y$ increases per unit increase in the location of $X$, and its absolute value corresponds to the rate of propagation of imprecision from $X$ to $Y$.

Assume that there exists in fact an interval arithmetic-based linear model relating $X$ and $Y$, that is, $Y = aX + \varepsilon$, where $a \in \mathbb{R}$ and $\varepsilon$ is an RIS with mean value $E[\varepsilon] = B \in \mathcal{K}_c(\mathbb{R})$. This implies that $E[Y|x] = ax + B$ for any interval value $x$ of $X$. The assumption of such a linear model entails the existence of the well-known Hukuhara difference $Y -_H aX$ (i.e., for each $\omega \in \Omega$ there exists $(Y -_H aX)(\omega) = Y(\omega) -_H aX(\omega) \in \mathcal{K}_c(\mathbb{R})$ such that $Y(\omega) = aX(\omega) + (Y -_H aX)(\omega)$). This fact should be taken into account in order to estimate the parameter $a$.

In a first approximation to the estimation problem under the linear model assumption, this problem amounts to finding a real value $\hat{a} \in \mathbb{R}$ and an interval value $\hat{B} \in \mathcal{K}_c(\mathbb{R})$ such that

- $\{(x_i, \hat{a}x_i + \hat{B})\}_{i=1}^n$ is as close as possible to the sample data $\{(x_i, y_i)\}_{i=1}^n$, and
- $y_i = \hat{a}x_i + \varepsilon_i$ holds for some $\varepsilon_i \in \mathcal{K}_c(\mathbb{R})$, that is, so that $y_i -_H \hat{a}x_i$ is well-defined for all $i \in \{1, \ldots, n\}$.
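The second requirement can be made concrete with a small sketch of the Hukuhara difference. It is only an illustration: the function names are hypothetical, intervals are `(inf, sup)` tuples, and the existence test uses the standard fact that the difference of y and z exists exactly when the spread of z does not exceed that of y.

```python
from typing import Optional, Tuple

Interval = Tuple[float, float]  # (inf, sup), assumed inf <= sup


def scale(a: float, x: Interval) -> Interval:
    # Product of an interval by a real number; a negative factor swaps the ends
    lo, hi = a * x[0], a * x[1]
    return (min(lo, hi), max(lo, hi))


def hukuhara_diff(y: Interval, z: Interval) -> Optional[Interval]:
    # The interval e with z + e = y, or None when no such interval exists
    lo, hi = y[0] - z[0], y[1] - z[1]
    return (lo, hi) if lo <= hi else None


def is_feasible(a: float, x: Interval, y: Interval) -> bool:
    # True when y -_H a x is well-defined, i.e. |a| * spr x <= spr y
    return hukuhara_diff(y, scale(a, x)) is not None


x0, y0 = (118.0, 173.0), (63.0, 102.0)   # first pair of Table 1
eps_half = hukuhara_diff(y0, scale(0.5, x0))
eps_one = hukuhara_diff(y0, scale(1.0, x0))
```

For this pair the spread ratio is 19.5/27.5, so a slope of 0.5 (in either sign) is feasible while a slope of 1 is not.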

Consider that $\{(X_i, Y_i)\}_{i=1}^n$ is a simple random sample from $(X, Y) : \Omega \to \mathcal{K}_c(\mathbb{R}) \times \mathcal{K}_c(\mathbb{R})$. To formalize the estimation problem above, and to guarantee that a solution exists,

- a Least Squares approach based on the metric $d_\theta$ has been considered,
- either $\mathrm{mid}\,X$ or $\mathrm{spr}\,X$ is nondegenerate for the available sample information, so that $\hat{\sigma}_X^2 > 0$, and
- the unknown value $a$ and interval $B$ are assumed to fulfill the linear regression model $Y = aX + \varepsilon$ (at least for the available sample information).

This has been achieved by considering as the Least Squares Problem that of looking for $\hat{a}$ and $\hat{B}$ such that

\[ \frac{1}{n} \sum_{i=1}^n \left(d_\theta(Y_i, \hat{a}X_i + \hat{B})\right)^2 = \min_{a \in A,\ B \in \mathcal{K}_c(\mathbb{R})} \frac{1}{n} \sum_{i=1}^n \left(d_\theta(Y_i, aX_i + B)\right)^2, \]

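where $A(\omega) = \{a \in \mathbb{R} : Y_i(\omega) -_H aX_i(\omega) \text{ exists for } i = 1, \ldots, n\}$. It is possible to show that for all $\omega \in \Omega$, the set $A(\omega)$ is a nonempty, closed and symmetric interval.

The solution for this problem is due to González-Rodríguez et al. [23] (see also Blanco-Fernández [5]) and is compactly given by

\[ \hat{a} = \left(2 \cdot 1_{[0,\infty)}\!\left(\hat{\sigma}_{X,Y} - \hat{\sigma}_{-X,Y}\right) - 1\right) \min\left\{\hat{a}_0,\ \max\left\{0,\ \frac{\hat{\sigma}_{X,Y}}{\hat{\sigma}_X^2},\ \frac{\hat{\sigma}_{-X,Y}}{\hat{\sigma}_X^2}\right\}\right\}, \]

\[ \hat{B} = \overline{Y} -_H \hat{a}\,\overline{X}, \]

where $1_{[0,\infty)}$ is the indicator function of the interval $[0, \infty)$, $\overline{X}$ and $\overline{Y}$ denote the sample means of $X$ and $Y$, respectively, $\hat{\sigma}_X^2$ = sample (analogue estimator) θ-Fréchet variance of $X$, $\hat{\sigma}_{X,Y}$ = sample (analogue estimator) θ-covariance, and

\[ \hat{a}_0 = \min_{i \,:\, \mathrm{spr}\,X_i > 0} \frac{\mathrm{spr}\,Y_i}{\mathrm{spr}\,X_i}, \]

or $\hat{a}_0 = +\infty$ if $\mathrm{spr}\,X_i = 0$ for all $i \in \{1, \ldots, n\}$.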

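The solution for $\hat{a}$ can be algorithmically computed for a sample of independent couples of interval-valued observations, $(x_1, y_1), \ldots, (x_n, y_n)$, and a prefixed choice of $\theta$ as follows:

Algorithm to compute the estimate $\hat{a}$:

Step 1. Compute the sample estimates of the following moments of the real-valued random variables mids and spreads:

• the means of the mids and spreads of X and Y, that is (and analogously for $\overline{\mathrm{mid}\,y}$ and $\overline{\mathrm{spr}\,y}$),
\[ \overline{\mathrm{mid}\,x} = \frac{1}{n} \sum_{i=1}^n \mathrm{mid}\,x_i, \qquad \overline{\mathrm{spr}\,x} = \frac{1}{n} \sum_{i=1}^n \mathrm{spr}\,x_i; \]

• the variances of the mids and spreads of X, that is,
\[ s^2_{\mathrm{mid}\,x} = \frac{1}{n} \sum_{i=1}^n (\mathrm{mid}\,x_i - \overline{\mathrm{mid}\,x})^2, \qquad s^2_{\mathrm{spr}\,x} = \frac{1}{n} \sum_{i=1}^n (\mathrm{spr}\,x_i - \overline{\mathrm{spr}\,x})^2; \]

• the covariance of the mids of X and Y and the covariance of the spreads of X and Y, that is,
\[ s_{\mathrm{mid}\,x, \mathrm{mid}\,y} = \frac{1}{n} \sum_{i=1}^n (\mathrm{mid}\,x_i - \overline{\mathrm{mid}\,x})(\mathrm{mid}\,y_i - \overline{\mathrm{mid}\,y}), \]
\[ s_{\mathrm{spr}\,x, \mathrm{spr}\,y} = \frac{1}{n} \sum_{i=1}^n (\mathrm{spr}\,x_i - \overline{\mathrm{spr}\,x})(\mathrm{spr}\,y_i - \overline{\mathrm{spr}\,y}). \]

Step 2. Compute the sample estimates of the following moments of RISs:

• the means of X and Y, that is,
\[ \overline{x} = [\overline{\mathrm{mid}\,x} - \overline{\mathrm{spr}\,x},\ \overline{\mathrm{mid}\,x} + \overline{\mathrm{spr}\,x}], \qquad \overline{y} = [\overline{\mathrm{mid}\,y} - \overline{\mathrm{spr}\,y},\ \overline{\mathrm{mid}\,y} + \overline{\mathrm{spr}\,y}]; \]

• the θ-Fréchet variance of X, that is,
\[ \hat{\sigma}_x^2 = s^2_{\mathrm{mid}\,x} + \theta \cdot s^2_{\mathrm{spr}\,x}; \]

• the θ-covariances of X and Y, and of −X and Y, that is,
\[ \hat{\sigma}_{x,y} = s_{\mathrm{mid}\,x, \mathrm{mid}\,y} + \theta \cdot s_{\mathrm{spr}\,x, \mathrm{spr}\,y}, \qquad \hat{\sigma}_{-x,y} = -s_{\mathrm{mid}\,x, \mathrm{mid}\,y} + \theta \cdot s_{\mathrm{spr}\,x, \mathrm{spr}\,y}. \]

Step 3. Compute the sample estimates $\hat{a}_0$ and $\hat{a}$ for the available data, that is,

• if all the $x_i$ are real-valued, then $\hat{a}_0 = \infty$,
• otherwise,
\[ \hat{a}_0 = \min_{i \,:\, \mathrm{spr}\,x_i \neq 0} \frac{\mathrm{spr}\,y_i}{\mathrm{spr}\,x_i}; \]

whereas

• if $\hat{\sigma}_{x,y} < 0$ and $\hat{\sigma}_{-x,y} < 0$, then $\hat{a} = 0$;
• if $\hat{\sigma}_{x,y} \ge 0$ and $\hat{\sigma}_{-x,y} \le \hat{\sigma}_{x,y}$, then
\[ \hat{a} = \min\left\{\hat{a}_0,\ \frac{\hat{\sigma}_{x,y}}{\hat{\sigma}_x^2}\right\}; \]
• if $\hat{\sigma}_{-x,y} \ge 0$ and $\hat{\sigma}_{x,y} \le \hat{\sigma}_{-x,y}$, then
\[ \hat{a} = -\min\left\{\hat{a}_0,\ \frac{\hat{\sigma}_{-x,y}}{\hat{\sigma}_x^2}\right\}. \]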
It should be noted that the solution of the Least Squares Problem is unique, except when $\hat{\sigma}_{X,Y} = \hat{\sigma}_{-X,Y} > 0$, in which case two solutions can be found, one being the opposite of the other. Actually, this happens iff the analogue covariance of the mids vanishes and that of the spreads does not, since in this case the mids cannot determine the ‘sign of the relationship’ and the spreads never do.

3.2 Illustrative examples

In this section the case-studies described in Subsection 2.1 are examined, by analyzing the relation between X as explanatory variable and Y as dependent variable for three different choices of θ (namely, θ = .001, θ = 1 and θ = 1000).

In Figure 3 the linear regression estimate of systolic versus diastolic blood pressure for different values of θ is graphically displayed.

Fig. 3. Linear regression estimates of ‘systolic vs diastolic blood pressure’ for θ = .001, θ = 1 and θ = 1000

Additionally, in Figure 4 the estimation of a for θ ranging in (0, ∞) (more precisely, and equivalently, for θ/(1 + θ) ∈ (0, 1)) is represented.

Fig. 4. Estimate of a for the linear regression of ‘systolic vs diastolic blood pressure’ for different values of θ (horizontal axis: θ/(1 + θ); vertical axis: â)

In this case, the more weight is given to the spreads (greater value of θ), the smaller the estimates obtained for the parameter a. On the contrary, when the mid-points are weighted more heavily, the boundary of the feasible set in the optimization problem is reached, and consequently the estimate equals the limit â0. If θ > 1, the distance between spreads is more important in the metric than the distance between mids, which does not seem reasonable in practice, so θ ≤ 1 (or approximately so) seems advisable. In Figure 3, as one can expect, when θ = 1000 the model tends to reconstruct mainly the spreads.

In connection with Example 2.2, the linear regression estimate for the overall rating of two different courses of the II Summer School of the European Centre for Soft Computing (excluding the two detected outliers) for different values of θ is represented in Figure 5. In Figure 6 the estimation of a for θ ranging in (0, ∞) is represented. The results in this example are similar to the previous ones regarding the behaviour of the estimates as a function of the weights. Nevertheless, in this case the range of relative variation of the estimates of a as θ varies is larger. In general all the obtained estimates are quite small.

Fig. 5. Linear regression estimates of the overall rating of two different courses for θ = 0.001, θ = 1 and θ = 1000 respectively (horizontal axes: Course 1; vertical axes: Course 2). Panel annotations: â = 0.0053 with B̂ = [75.47, 96.66]; â = 0.1315 with B̂ = [68.31, 86.62]; â = 0.1404 with B̂ = [67.81, 85.91]

On the other hand, Figure 6 shows no flat zone, which indicates that the boundary of the feasible set of the minimization problem for a is not reached.

Fig. 6. Estimate of a for the linear regression of the overall rating of two different courses for different values of θ (horizontal axis: θ/(1 + θ); vertical axis: â)

Since the estimates strongly depend on θ, it is essential to analyze which value(s) of θ should be considered in order to get good estimates.

4 Empirical sensitivity analysis and formal/empirical discussions on the choice of the metric

The expression of $\hat{a}$ depends on the choice of θ through the distance involved in the expressions of both the sample variances and covariances. In order to determine the effect of the choice of θ on the estimator, the accuracy of $\hat{a}$ will be studied. To make the dependence on θ clear, the notation $\hat{a}_\theta$ will be used hereinafter instead of $\hat{a}$. A previous empirical analysis (see González-Rodríguez et al. [24]) has shown that $\hat{a}_\theta$ is asymptotically unbiased, but biased at finite sample sizes. To assess the performance of $\hat{a}_\theta$ as θ varies, both the bias and the standard error should be taken into account through the Mean Square Error (for short MSE), that is, we will analyze

\[ \mathrm{MSE}(\theta) = E\left((\hat{a}_\theta - a)^2\right). \]

Since the expression of $\widehat{a}_\theta$ involves non-linear functions of the sample moments, the general analytic expression of MSE(θ) is difficult to obtain. Nevertheless, a wide simulation study considering representative scenarios, along with the theoretical asymptotic developments, will provide information about how to choose θ so that the MSE is approximately as low as possible, and hence $\widehat{a}_\theta$ approximately as efficient as possible.

Firstly, several different relevant situations concerning the relative variation for mids in contrast to that for spreads (in both the explanatory RIS, X, and the RIS error, ε) will be simulated. Secondly, a suggestion for a suitable choice will be made on the basis of the asymptotic results. Finally, some additional simulation studies using the suggested choice will be carried out.

To ease the graphical display and the associated interpretation, instead of discussing the MSE as a function of the weight θ, the alternative weight used in Figure 4 is considered, namely, τ = θ/(1 + θ) ∈ (0, 1). Indeed, there is a one-to-one relationship between values of τ ∈ (0, 1) and values of θ ∈ (0, ∞). Furthermore,

\[
\big[d_\theta(K, K')\big]^2 \propto \big[d_\tau(K, K')\big]^2 = (1-\tau)\,(\mathrm{mid}\,K - \mathrm{mid}\,K')^2 + \tau\,(\mathrm{spr}\,K - \mathrm{spr}\,K')^2,
\]

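which is a convex linear combination of the squared distance between mids and the squared distance between spreads. On the other hand, for purposes of quantifying the MSE of $\widehat{a}_\theta$ in estimating a, it is indifferent to use θ or τ, since

\[
\frac{1}{n}\sum_{i=1}^{n}\big[d_\tau(X_i, \overline{X})\big]^2 = \frac{1}{1+\theta}\,\frac{1}{n}\sum_{i=1}^{n}\big[d_\theta(X_i, \overline{X})\big]^2,
\]

\[
\frac{1}{n}\sum_{i=1}^{n}\big\langle t_{X_i} - t_{\overline{X}},\, t_{Y_i} - t_{\overline{Y}}\big\rangle_\tau = \frac{1}{1+\theta}\,\frac{1}{n}\sum_{i=1}^{n}\big\langle t_{X_i} - t_{\overline{X}},\, t_{Y_i} - t_{\overline{Y}}\big\rangle_\theta,
\]

whence the ratio $\widehat{\sigma}_{X,Y}/\widehat{\sigma}^2_X$ is the same if $d_\theta$ and $\langle\cdot,\cdot\rangle_\theta$ are replaced by $d_\tau$ and $\langle\cdot,\cdot\rangle_\tau$, respectively. To simplify the notation, MSE(τ) will denote MSE($\theta_\tau$), $\theta_\tau$ being the unique value of θ associated with τ.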
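The reparametrization can be checked numerically. The following is a minimal sketch, assuming that the squared generalized metric has the form (mid K − mid K′)² + θ (spr K − spr K′)², which is the form implied by the proportionality stated above; the function names and the two intervals are illustrative and are not taken from the paper.

```python
def tau_from_theta(theta):
    """Map theta in (0, inf) to tau = theta / (1 + theta) in (0, 1)."""
    return theta / (1.0 + theta)


def theta_from_tau(tau):
    """Inverse map: the unique theta associated with tau in (0, 1)."""
    return tau / (1.0 - tau)


def sq_dist_theta(k1, k2, theta):
    """Squared d_theta between intervals given as (mid, spr) pairs (assumed form)."""
    return (k1[0] - k2[0]) ** 2 + theta * (k1[1] - k2[1]) ** 2


def sq_dist_tau(k1, k2, tau):
    """Squared d_tau: convex combination of squared mid and spread distances."""
    return (1.0 - tau) * (k1[0] - k2[0]) ** 2 + tau * (k1[1] - k2[1]) ** 2


if __name__ == "__main__":
    K1, K2 = (80.0, 10.0), (75.0, 4.0)  # hypothetical intervals as (mid, spr)
    for theta in (0.001, 1.0, 1000.0):
        tau = tau_from_theta(theta)
        lhs = sq_dist_tau(K1, K2, tau)
        rhs = sq_dist_theta(K1, K2, theta) / (1.0 + theta)
        print(f"theta={theta:g}  tau={tau:.4f}  d_tau^2={lhs:.4f}  d_theta^2/(1+theta)={rhs:.4f}")
```

Since both the sample variance and the sample covariance are rescaled by the same factor 1/(1 + θ), their ratio is unchanged, which is why the MSE can be studied on the bounded scale τ.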

4.1 Empirical sensitivity analysis of the MSE as a function of the weight θ

The symmetry w.r.t. 0 of the interval A establishes outstanding differences between the empirical conclusions for the cases a ≠ 0 and a = 0. Actually, in cases in which a is far from 0 and positive (or negative), it is rather unlikely that the estimate is achieved around the boundary of the negative (respectively, positive) part of A. However, if a = 0 both the positive and negative arms have the same importance.

For this reason we will discuss five situations for each of the two scenarios, a ≠ 0 and a = 0. Each situation corresponds to different assumptions on the relative variation for the mids in contrast to that for the spreads, and two cases have been considered for each of the five situations, so that examples for different combinations of the relative variation between mids and spreads are examined.

4.1.1 Scenario a ≠ 0

The studies which have been simulated are the following ones:
• Theoretical linear model: Y = 2X + ε;
• Sample size: n = 30 (data from X and ε);
• The MSE(τ) has been approximated by the Monte Carlo method (100,000 iterations) for 20 values of τ (equally spaced from approximately .05 to .95).

| Situation | distr. mid ε | distr. spr ε | distr. mid X | distr. spr X |
|---|---|---|---|---|
| Sit. 1 | M1 | S1 | M2 | S2 |
| Sit. 2 (high Var(mid ε)) | 4M1 | S1 | M2 | S2 |
| Sit. 3 (high Var(spr ε)) | M1 | 4S2 | M2 | S2 |
| Sit. 4 (high Var(mid X)) | M1 | S1 | 4M2 | S2 |
| Sit. 5 (high Var(spr X)) | M1 | S1 | M2 | 4S2 |

•• Case 1: M1, M2 ∼ N(0, 1), S1, S2 ∼ standardized $\chi^2_1$.
•• Case 2: M1, M2 ∼ standardized β(3, 2), S1, S2 ∼ standardized U(0, 1).
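A study of this kind can be sketched as follows. The code is only illustrative and rests on several assumptions: the form of the constrained estimator is reconstructed from the case structure of the MSE expression in the Appendix, the spreads are drawn from an unstandardized $\chi^2_1$ so that they stay non-negative (which need not match the standardization used in the paper), and the number of replications is far smaller than the 100,000 used in the study. All function names are hypothetical.

```python
import random


def _cov(u, v):
    """Sample covariance with denominator n."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / n


def slope_estimate(mid_x, spr_x, mid_y, spr_y, tau):
    """Constrained least squares slope for the weight tau (assumed form)."""
    c_mid, c_spr = _cov(mid_x, mid_y), _cov(spr_x, spr_y)
    s_plus = (1 - tau) * c_mid + tau * c_spr     # weighted covariance of X and Y
    s_minus = -(1 - tau) * c_mid + tau * c_spr   # weighted covariance of -X and Y
    var_x = (1 - tau) * _cov(mid_x, mid_x) + tau * _cov(spr_x, spr_x)
    # Feasibility bound: spr Y - |a| * spr X must remain non-negative.
    a0 = min(sy / sx for sx, sy in zip(spr_x, spr_y) if sx > 0)
    if s_plus >= s_minus and s_plus >= 0:
        return min(a0, s_plus / var_x)
    if s_minus > s_plus and s_minus >= 0:
        return -min(a0, s_minus / var_x)
    return 0.0


def mc_mse(tau, a=2.0, n=30, reps=2000, seed=0):
    """Monte Carlo approximation of MSE(tau) for the model Y = aX + eps."""
    rng = random.Random(seed)  # same seed for every tau: common random numbers
    total = 0.0
    for _ in range(reps):
        mid_x = [rng.gauss(0, 1) for _ in range(n)]
        spr_x = [rng.gauss(0, 1) ** 2 for _ in range(n)]  # chi-square(1) draws
        mid_e = [rng.gauss(0, 1) for _ in range(n)]
        spr_e = [rng.gauss(0, 1) ** 2 for _ in range(n)]
        # Interval arithmetic: mid Y = a mid X + mid eps, spr Y = |a| spr X + spr eps.
        mid_y = [a * mx + me for mx, me in zip(mid_x, mid_e)]
        spr_y = [abs(a) * sx + se for sx, se in zip(spr_x, spr_e)]
        total += (slope_estimate(mid_x, spr_x, mid_y, spr_y, tau) - a) ** 2
    return total / reps


if __name__ == "__main__":
    for tau in (0.05, 0.25, 0.5, 0.75, 0.95):
        print(f"tau={tau:.2f}  approximate MSE={mc_mse(tau):.5f}")
```

Reusing the same seed for every value of τ means that all the curves are computed on the same simulated samples, which reduces the noise in comparisons across τ.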

The simulation results obtained are represented in Figure 7.

[Figure 7: two panels (Cases 1 and 2) plotting MSE (vertical axis) against τ (horizontal axis, ticks from 0.2 to 0.8), with one curve for each of Sit. 1 to Sit. 5.]

Fig. 7. Approximation of the MSE(τ) in Cases 1 and 2 for the 5 situations

These results can be summarized as follows (the optimal value of τ refers to the one leading to the smallest MSE in estimating a by means of $\widehat{a}$):

| (a ≠ 0) Situation-Case | Variation of MSE | Optimal value of τ |
|---|---|---|
| Sit. 1 – Case 1 | very small | τ ≃ .38 (moderate weight to mids’ distance) |
| Sit. 1 – Case 2 | small | τ ≃ .34 (moderate weight to mids’ distance) |
| Sit. 2 – Case 1 | high | τ ≃ .81 (high weight to spreads’ distance) |
| Sit. 2 – Case 2 | high | τ ≃ .76 (high weight to spreads’ distance) |
| Sit. 3 – Case 1 | high | τ ≃ .048 (high weight to mids’ distance) |
| Sit. 3 – Case 2 | high | τ ≃ .048 (high weight to mids’ distance) |
| Sit. 4 – Case 1 | small | τ ≃ .33 (moderate weight to mids’ distance) |
| Sit. 4 – Case 2 | small | τ ≃ .28 (high weight to mids’ distance) |
| Sit. 5 – Case 1 | small | τ ≃ .38 (moderate weight to mids’ distance) |
| Sit. 5 – Case 2 | small | τ ≃ .34 (moderate weight to mids’ distance) |

In summary, in both Case 1 and Case 2 the MSE is very sensitive to the choice of τ whenever either the mid or (and especially) the spread of the error is highly variable in comparison with the other involved variables. In addition, different distributions imply rather substantial differences.

4.1.2 Scenario a = 0

The studies which have been simulated are the following ones:
• Sample size: n = 30 (data from X and ε);
• The MSE has been approximated by the Monte Carlo method (100,000 iterations) for 20 values of τ (equally spaced from approximately .05 to .95).

| Situation | distr. mid ε | distr. spr ε | distr. mid X | distr. spr X |
|---|---|---|---|---|
| Sit. 1 | M1 | S1 | M2 | S2 |
| Sit. 2 (high Var(mid ε)) | 4M1 | S1 | M2 | S2 |
| Sit. 3 (high Var(spr ε)) | M1 | 4S2 | M2 | S2 |
| Sit. 4 (high Var(mid X)) | M1 | S1 | 4M2 | S2 |
| Sit. 5 (high Var(spr X)) | M1 | S1 | M2 | 4S2 |

•• Case 1: M1, M2 ∼ standardized β(3, 2), S1, S2 ∼ standardized U(0, 1).
•• Case 2: As for Case 1 for all the situations, but such that the ratio of spr ε and spr X takes on a value over 1.

Cases 1 and 2 are quite different from a theoretical point of view. In Case 1 no restriction on the ratio between spr ε and spr X is considered. This ratio could be very close to 0 and, consequently, it is expected that the estimates of a are frequently achieved at the boundary of A. In this case, the distribution of the mentioned ratio will have a great impact on the behaviour of the estimator. In order to avoid such an impact, in Case 2 the ratio is restricted to take values far away from 0. Thus, it is expected that the estimator performs as in the previous scenario, where a ≠ 0.

The results obtained are represented in Figure 8.

[Figure 8: two panels (Cases 1 and 2) plotting MSE (vertical axis) against τ (horizontal axis, ticks from 0.2 to 0.8), with one curve for each of Sit. 1 to Sit. 5.]

Fig. 8. Approximation of the MSE(τ) in Cases 1 and 2 for the 5 situations

The interpretation of these results is now summarized.

| (a = 0) Situation-Case | Variation of MSE | Optimal value of τ |
|---|---|---|
| Sit. 1 – Case 1 | small | τ ≃ .95 (high weight to spreads’ distance) |
| Sit. 1 – Case 2 | small | τ ≃ .48 (balanced weight) |
| Sit. 2 – Case 1 | small | τ ≃ .95 (high weight to spreads’ distance) |
| Sit. 2 – Case 2 | high | τ ≃ .95 (high weight to spreads’ distance) |
| Sit. 3 – Case 1 | high | τ ≃ .047 (high weight to mids’ distance) |
| Sit. 3 – Case 2 | high | τ ≃ .047 (high weight to mids’ distance) |
| Sit. 4 – Case 1 | small | τ ≃ .23 (high weight to mids’ distance) |
| Sit. 4 – Case 2 | small | τ ≃ .28 (high weight to mids’ distance) |
| Sit. 5 – Case 1 | small | τ ≃ .95 (high weight to spreads’ distance) |
| Sit. 5 – Case 2 | small | τ ≃ .95 (high weight to spreads’ distance) |

In summary, in Case 1 the MSE is especially sensitive to the choice of τ whenever the spread of the error is highly variable, whereas in Case 2 the MSE is very sensitive to the choice of τ whenever either the spread or (and especially) the mid of the error is highly variable in comparison with the other random elements. Thus, in this case different distributions lead to quite substantial differences. In addition, it should be noted that in Case 1 the range of variation of the MSE is quite small in comparison with the other cases analyzed. As expected, the results for Case 2 are similar to those for a ≠ 0.

The empirical studies in this subsection indicate that the cases a = 0 and a ≠ 0 are quite different, as also happens with the formal discussion. From this empirical analysis one can conclude that, in general, it is very important to choose an appropriate value of θ (or τ) when the variabilities of the spreads and of the mids of the errors are quite different (Situations 2 and 3). Based on these simulations, a tentative approximate suitable choice for these situations seems to be $\widehat{\theta} = s^2_{\mathrm{mid}\,\varepsilon}/s^2_{\mathrm{spr}\,\varepsilon}$ (if the errors are estimated assuming θ = 1).

In Figure 9 the linear regression estimate of the systolic versus diastolic blood pressure in Example 2.1 for $\widehat{\theta} = s^2_{\mathrm{mid}\,\varepsilon}/s^2_{\mathrm{spr}\,\varepsilon}$ = .874/.231 = 3.79 (if both errors and variances are estimated assuming θ = 1) is graphically displayed.

Fig. 9. Linear regression estimates of ‘systolic vs diastolic blood pressure’ for $\widehat{\theta} = s^2_{\mathrm{mid}\,\varepsilon}/s^2_{\mathrm{spr}\,\varepsilon}$

In Figure 10 the linear regression estimate of the Course 2 overall rating versus the Course 1 overall rating in Example 2.2 for $\widehat{\theta} = s^2_{\mathrm{mid}\,\varepsilon}/s^2_{\mathrm{spr}\,\varepsilon}$ = 19.081/18.513 = 1.031 (if both errors and variances are estimated assuming θ = 1) is graphically displayed. It should be emphasized that, as commented above, one can easily detect two outliers in the mids (which have been removed in the analysis in Figure 5) and another influential point on the left, so the scatter diagram in the analysis below has been filtered accordingly.

[Figure 10: scatter diagram of Course 2 (vertical axis, ticks from 60 to 100) against Course 1 (horizontal axis, ticks from 40 to 100). Annotation as extracted: $\widehat{a}$ = 0.0385, $\widehat{B}$ = [74.14, 94.53].]

Fig. 10. Linear regression estimates of ‘Course 2 vs Course 1 overall rating’ for $\widehat{\theta} = s^2_{\mathrm{mid}\,\varepsilon}/s^2_{\mathrm{spr}\,\varepsilon}$ based on data in the filtered scatter diagram
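The ratio used in Figures 9 and 10 can be computed from data along the following lines. This is a minimal sketch, assuming that the errors are recovered from a preliminary slope estimate (obtained, as in the text, with θ = 1) through the interval-arithmetic identities mid ε = mid Y − a · mid X and spr ε = spr Y − |a| · spr X; the function name, the argument names and the data in the usage example are hypothetical.

```python
from statistics import variance


def theta_hat(mid_x, spr_x, mid_y, spr_y, a_prelim):
    """Ratio of the sample variances of the residual mids and residual spreads.

    a_prelim is a preliminary slope estimate (e.g. obtained with theta = 1).
    Residuals follow from Y = aX + eps in interval arithmetic:
    mid eps = mid Y - a * mid X  and  spr eps = spr Y - |a| * spr X.
    """
    mid_eps = [my - a_prelim * mx for mx, my in zip(mid_x, mid_y)]
    spr_eps = [sy - abs(a_prelim) * sx for sx, sy in zip(spr_x, spr_y)]
    return variance(mid_eps) / variance(spr_eps)


if __name__ == "__main__":
    # Hypothetical interval data, stored as mid-points and spreads.
    mid_x = [70.0, 82.0, 65.0, 90.0, 77.0]
    spr_x = [5.0, 8.0, 4.0, 6.0, 7.0]
    mid_y = [78.0, 85.0, 74.0, 88.0, 80.0]
    spr_y = [6.0, 9.0, 5.0, 4.0, 8.0]
    th = theta_hat(mid_x, spr_x, mid_y, spr_y, a_prelim=0.13)
    print(f"theta_hat = {th:.3f}, tau_hat = {th / (1.0 + th):.3f}")
```

The ratio does not depend on whether the sample variances use the denominator n or n − 1, since the same factor appears in the numerator and the denominator.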

The choice $\widehat{\theta} = s^2_{\mathrm{mid}\,\varepsilon}/s^2_{\mathrm{spr}\,\varepsilon}$, conveniently corrected for a general situation, is theoretically supported in the next subsection.

4.2 Formal discussion on the choice of the metric

Finding the optimal value of θ in the MSE sense would be too complex a task, due to the difficulties in stating a general expression for $\mathrm{MSE}(\theta) = E\big((\widehat{a}_\theta - a)^2\big)$. However, we can approximate this optimal value by minimizing the asymptotic expression for n · MSE(θ), the approximation being a valuable choice when we deal with large samples.

The next result establishes the asymptotic expression for n · MSE(θ) as well as the value of θ for which the minimum of the limit is achieved.

Theorem 4.1 Consider two RISs $X, Y : \Omega \to \mathcal{K}_c(\mathbb{R})$ associated with the probability space $(\Omega, \mathcal{A}, P)$, and satisfying that $(\|t_X\|_\theta)^2, \|t_Y\|_\theta, \langle t_X, t_Y\rangle_\theta \in L^1(\Omega, \mathcal{A}, P)$ and Y = aX + ε, with $a \in \mathbb{R}\setminus\{0\}$ and $\varepsilon : \Omega \to \mathcal{K}_c(\mathbb{R})$ being an RIS such that $\sigma^2_{\varepsilon|x}$ does not depend on the interval value x (that is, homoscedasticity is assumed for the conditional distributions of ε). For each $n \in \mathbb{N}$ let $\{(X_i, Y_i)\}_{i=1}^{n}$ be a simple random sample from (X, Y) for which $\widehat{\sigma}^2_X > 0$, assume that all the involved sample estimators fulfil uniform integrability conditions, and compute MSE(θ), which will depend on n. Then, if sgn(a) = sign of a,

\[
\lim_{n\to\infty} n\cdot\mathrm{MSE}(\theta) = \frac{\mathrm{Var}(\mathrm{mid}\,X)\cdot\mathrm{Var}(\mathrm{mid}\,\varepsilon) + \theta^2\,\mathrm{Var}(\mathrm{spr}\,X)\cdot\mathrm{Var}(\mathrm{spr}\,\varepsilon)}{\big(\mathrm{Var}(\mathrm{mid}\,X) + \theta\cdot\mathrm{Var}(\mathrm{spr}\,X)\big)^2} + \mathrm{sgn}(a)\,\frac{2\theta\,\mathrm{Cov}(\mathrm{mid}\,X, \mathrm{spr}\,X)\,\mathrm{Cov}(\mathrm{mid}\,\varepsilon, \mathrm{spr}\,\varepsilon)}{\big(\mathrm{Var}(\mathrm{mid}\,X) + \theta\cdot\mathrm{Var}(\mathrm{spr}\,X)\big)^2}
\]

(which indicates the convergence speed of the sequence of MSEs). Moreover, the value of θ minimizing this asymptotic function of θ is given by

\[
\theta = \frac{\mathrm{Var}(\mathrm{mid}\,X)\,\big[\mathrm{Var}(\mathrm{spr}\,X)\cdot\mathrm{Var}(\mathrm{mid}\,\varepsilon) - \mathrm{sgn}(a)\,\mathrm{Cov}(\mathrm{mid}\,X, \mathrm{spr}\,X)\,\mathrm{Cov}(\mathrm{mid}\,\varepsilon, \mathrm{spr}\,\varepsilon)\big]}{\mathrm{Var}(\mathrm{spr}\,X)\,\big[\mathrm{Var}(\mathrm{mid}\,X)\cdot\mathrm{Var}(\mathrm{spr}\,\varepsilon) - \mathrm{sgn}(a)\,\mathrm{Cov}(\mathrm{mid}\,X, \mathrm{spr}\,X)\,\mathrm{Cov}(\mathrm{mid}\,\varepsilon, \mathrm{spr}\,\varepsilon)\big]}.
\]

In case either mid X and spr X or mid ε and spr ε are uncorrelated, then θ = Var(mid ε)/Var(spr ε).

It should be pointed out that, as shown empirically, the very special case in which X and Y are ‘linearly independent’ (i.e., a = 0) would also be quite different from a formal perspective: the asymptotically optimal choices for θ depend on whether the variation of mid ε or the variation of spr ε is large, and lead to extreme choices which do not make real sense.

4.3 Empirical discussion on the suggested choice for θ

The empirical studies in Subsection 4.1 and the theoretical result introduced in Subsection 4.2 indicate that an estimate of the asymptotically most suitable choice for θ would be given by $\widehat{\theta} = s^2_{\mathrm{mid}\,\varepsilon}/s^2_{\mathrm{spr}\,\varepsilon}$.
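The asymptotic expression in Theorem 4.1 and its minimizer are simple to evaluate numerically. The sketch below codes both formulas as stated in the theorem and compares the closed-form minimizer with a crude grid search; the function names and the population moments in the example are hypothetical, and the closed form is a minimizer only under the assumption that the bracketed term in its denominator is positive.

```python
def asymptotic_n_mse(theta, var_mid_x, var_spr_x, var_mid_e, var_spr_e,
                     cov_x=0.0, cov_e=0.0, sign_a=1):
    """Limit of n * MSE(theta) as stated in Theorem 4.1."""
    num = (var_mid_x * var_mid_e
           + theta ** 2 * var_spr_x * var_spr_e
           + sign_a * 2.0 * theta * cov_x * cov_e)
    return num / (var_mid_x + theta * var_spr_x) ** 2


def optimal_theta(var_mid_x, var_spr_x, var_mid_e, var_spr_e,
                  cov_x=0.0, cov_e=0.0, sign_a=1):
    """Closed-form value of theta given in Theorem 4.1."""
    c = sign_a * cov_x * cov_e
    return (var_mid_x * (var_spr_x * var_mid_e - c)
            / (var_spr_x * (var_mid_x * var_spr_e - c)))


if __name__ == "__main__":
    # Hypothetical population moments (not taken from the paper).
    moments = dict(var_mid_x=1.0, var_spr_x=0.5, var_mid_e=2.0, var_spr_e=0.4,
                   cov_x=0.2, cov_e=0.1, sign_a=1)
    theta_star = optimal_theta(**moments)
    # Crude grid search on (0, 50] as an independent check of the closed form.
    grid = [0.01 * k for k in range(1, 5001)]
    theta_grid = min(grid, key=lambda t: asymptotic_n_mse(t, **moments))
    print(f"closed form: {theta_star:.4f}   grid search: {theta_grid:.2f}")
```

With both covariances left at zero, `optimal_theta` reduces to the ratio Var(mid ε)/Var(spr ε), which is the population counterpart of the suggested estimate.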

For each case and situation under study, the MSE associated with the approximate $\widehat{\theta}$ (MSE($\widehat{\theta}$)), as well as the one associated with a random uniform choice of τ = θ/(1 + θ) in (0, 1) (MSE($\theta_{\mathrm{rand}}$)), have been approximated by the Monte Carlo method on the basis of 100,000 iterations. Specifically, we have obtained the following results:

| (a ≠ 0) Situation-Case | MSE(θ*) | MSE($\widehat{\theta}$) | MSE($\theta_{\mathrm{rand}}$) | MSE($\widehat{\theta}$)/MSE($\theta_{\mathrm{rand}}$) |
|---|---|---|---|---|
| Sit. 1 – Case 1 | 0.00884 | 0.01031 | 0.01218 | 0.84642 |
| Sit. 1 – Case 2 | 0.01086 | 0.01321 | 0.01439 | 0.91752 |
| Sit. 2 – Case 1 | 0.10144 | 0.10202 | 0.13997 | 0.72891 |
| Sit. 2 – Case 2 | 0.08951 | 0.09056 | 0.12556 | 0.72126 |
| Sit. 3 – Case 1 | 0.01793 | 0.01866 | 0.11739 | 0.15894 |
| Sit. 3 – Case 2 | 0.02678 | 0.02792 | 0.16567 | 0.16852 |
| Sit. 4 – Case 1 | 0.00111 | 0.00258 | 0.00212 | 1.21862 |
| Sit. 4 – Case 2 | 0.00162 | 0.00468 | 0.00328 | 1.42652 |
| Sit. 5 – Case 1 | 0.00113 | 0.00377 | 0.00163 | 2.30807 |
| Sit. 5 – Case 2 | 0.00122 | 0.00379 | 0.00157 | 2.42166 |

| (a = 0) Situation-Case | MSE(θ*) | MSE($\widehat{\theta}$) | MSE($\theta_{\mathrm{rand}}$) | MSE($\widehat{\theta}$)/MSE($\theta_{\mathrm{rand}}$) |
|---|---|---|---|---|
| Sit. 1 – Case 1 | 0.00300 | 0.00369 | 0.00343 | 1.07552 |
| Sit. 1 – Case 2 | 0.01511 | 0.01761 | 0.01800 | 0.97834 |
| Sit. 2 – Case 1 | 0.00315 | 0.00388 | 0.00541 | 0.71716 |
| Sit. 2 – Case 2 | 0.01920 | 0.02538 | 0.12955 | 0.19590 |
| Sit. 3 – Case 1 | 0.01825 | 0.01942 | 0.03556 | 0.54608 |
| Sit. 3 – Case 2 | 0.03468 | 0.03703 | 0.14912 | 0.24832 |
| Sit. 4 – Case 1 | 0.00112 | 0.00194 | 0.00141 | 1.38020 |
| Sit. 4 – Case 2 | 0.00211 | 0.00512 | 0.00349 | 1.46684 |
| Sit. 5 – Case 1 | 0.00018 | 0.00037 | 0.00022 | 1.63325 |
| Sit. 5 – Case 2 | 0.00112 | 0.00546 | 0.00204 | 2.67290 |

It should be underlined that, although $\widehat{\theta}$ is not always close to the empirical optimum θ* (i.e., the one minimizing the Monte Carlo approximation of the MSE) in the simulations (see Figure 11), MSE($\widehat{\theta}$) is indeed very close to the minimum MSE obtained in each of the preceding simulations for the important situations (where the MSE varied greatly with the choice of θ). The simulations indicate that, in all the important cases considered, the suggested choice leads to a much better approximation to the optimal MSE than a random choice. On the other hand, the cases for which the results for the suggested and the random choices are closer (or even worse for the suggested choice) are those for which the choice of θ is not very relevant because the MSE is rather constant. Note that when the limitation on the slope plays an important role in the estimation (Case 1 with a = 0), the behaviour is quite different. In particular, in Situation 3 (Case 1 with a = 0) the improvement is quite small in comparison with the other scenarios. Nevertheless, in Case 1 (a = 0) the MSE curves are quite flat, with very low variation in comparison with the other scenarios, so the obtained improvement is not as important as in the other cases.

[Figure 11: two kernel density plots (Density on the vertical axis, τ on the horizontal axis), with the position of τ* marked.]

Fig. 11. Kernel density estimation of the distribution of $\widehat{\tau}$ with a = 2 and Case 2 for Situation 2 (left picture) and Situation 3 (right picture), as well as the optimal value τ* in each case.

5 Concluding Remarks

It has been shown that the parameter estimators of the interval arithmetic-based regression model for random intervals vary depending on the weights assigned to the distances between mids and spreads. It has also been formally proved that the MSE of the estimator of the rate a is related to the relative variability of the mids w.r.t. the spreads. In some cases the MSE is almost constant, and then the choice of θ is almost irrelevant. Nevertheless, when the difference between the variance of the spreads and the variance of the mids is large, the choice of θ is crucial to obtain more efficient estimates (in terms of lower squared error). The empirical results suggest considering θ as a tuning parameter. Specifically, the quotient of the estimated variabilities of the mids and spreads of the error (or a correction suggested for general situations) has shown convenient theoretical and empirical support.

An immediately related future direction is the extension to the multiple linear regression problem with interval data, following the ideas for the interval arithmetic-based model in this paper. Of course, on the basis of the studies in this paper, the extension to multiple regression would be rather straightforward, but the difficulties for such a study lie in the lack of theoretical developments. These developments will be rather complex and will require the use of computational statistical techniques. This is at present an open problem, for which a very introductory analysis has been made in García-Bárzana et al. [17], and it is to be examined in depth in the future.

Another future direction is the development of an empirical sensitivity analysis of the choice of θ in testing independence by using the power of the corresponding test. Finally, since the metric (see Trutschnig et al. [35]) and the regression problem (see D’Urso et al. [12], [13], González-Rodríguez et al. [22]) have also been studied to deal with fuzzy values, it would be convenient to examine whether the discussion and conclusions in this paper can be preserved.

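Acknowledgements

The authors wish to thank the three referees and the Associate Editor handling the manuscript, as well as the Editor-in-Chief, for their very helpful suggestions.

Appendix (Proof of Theorem 4.1).

Indeed, from Large Sample Theory it is well-known that the sample moments of mid X, spr X, (mid X, mid Y) and (spr X, spr Y) are strongly consistent estimators of the population ones. Hence, from Slutsky's Theorem we have that $\widehat{\sigma}_{X,Y}/\widehat{\sigma}^2_X$ and $\widehat{\sigma}_{-X,Y}/\widehat{\sigma}^2_X$ are strongly consistent estimators of $\sigma_{X,Y}/\sigma^2_X$ and $\sigma_{-X,Y}/\sigma^2_X$, respectively.

Furthermore, if Y = aX + ε, then $a = \sigma_{X,Y}/\sigma^2_X$ if a > 0, and $-a = \sigma_{-X,Y}/\sigma^2_X$ if a < 0, whence $\widehat{\sigma}_{X,Y}/\widehat{\sigma}^2_X$ and $\widehat{\sigma}_{-X,Y}/\widehat{\sigma}^2_X$ converge almost surely to a if a > 0 and to −a if a < 0, respectively, and $\widehat{a}_0 = |a| + \min_{i\,:\,\mathrm{spr}\,X_i>0} \mathrm{spr}\,\varepsilon_i/\mathrm{spr}\,X_i$, so that after several computations we get that

\[
\begin{aligned}
\mathrm{MSE}(\theta) &= E\big((\widehat{a}_\theta - a)^2\big) \\
&= E\bigg[\Big(|a| - a + \min_{i\,:\,\mathrm{spr}\,X_i>0}\frac{\mathrm{spr}\,\varepsilon_i}{\mathrm{spr}\,X_i}\Big)^2 \cdot 1_{[0,\infty)}\big(\widehat{\sigma}_{X,Y} - \widehat{\sigma}_{-X,Y}\big) \cdot 1_{[0,\infty)}\big(\widehat{\sigma}_{X,Y}\big) \cdot 1_{(0,\infty)}\Big(\frac{\widehat{\sigma}_{X,Y}}{\widehat{\sigma}^2_X} - \widehat{a}_0\Big)\bigg] \\
&\quad + E\bigg[\Big(|a|\cdot\frac{\widehat{\sigma}_{X,\mathrm{sgn}(a)X}}{\widehat{\sigma}^2_X} + \frac{\widehat{\sigma}_{X,\varepsilon}}{\widehat{\sigma}^2_X} - a\Big)^2 \cdot 1_{[0,\infty)}\big(\widehat{\sigma}_{X,Y} - \widehat{\sigma}_{-X,Y}\big) \cdot 1_{[0,\infty)}\big(\widehat{\sigma}_{X,Y}\big) \cdot 1_{[0,\infty)}\Big(\widehat{a}_0 - \frac{\widehat{\sigma}_{X,Y}}{\widehat{\sigma}^2_X}\Big)\bigg] \\
&\quad + E\bigg[\Big(|a| + a + \min_{i\,:\,\mathrm{spr}\,X_i>0}\frac{\mathrm{spr}\,\varepsilon_i}{\mathrm{spr}\,X_i}\Big)^2 \cdot 1_{(0,\infty)}\big(\widehat{\sigma}_{-X,Y} - \widehat{\sigma}_{X,Y}\big) \cdot 1_{[0,\infty)}\big(\widehat{\sigma}_{-X,Y}\big) \cdot 1_{(0,\infty)}\Big(\frac{\widehat{\sigma}_{-X,Y}}{\widehat{\sigma}^2_X} - \widehat{a}_0\Big)\bigg] \\
&\quad + E\bigg[\Big(|a|\cdot\frac{\widehat{\sigma}_{-X,\mathrm{sgn}(a)X}}{\widehat{\sigma}^2_X} + \frac{\widehat{\sigma}_{-X,\varepsilon}}{\widehat{\sigma}^2_X} + a\Big)^2 \cdot 1_{(0,\infty)}\big(\widehat{\sigma}_{-X,Y} - \widehat{\sigma}_{X,Y}\big) \cdot 1_{[0,\infty)}\big(\widehat{\sigma}_{-X,Y}\big) \cdot 1_{[0,\infty)}\Big(\widehat{a}_0 - \frac{\widehat{\sigma}_{-X,Y}}{\widehat{\sigma}^2_X}\Big)\bigg],
\end{aligned}
\]

whence, due to the convergence properties of the sample moments and the uniform integrability assumptions, one gets that

\[
\lim_{n\to\infty} n\cdot\mathrm{MSE}(\theta) = \frac{\mathrm{Var}(\mathrm{mid}\,X)\cdot\mathrm{Var}(\mathrm{mid}\,\varepsilon) + \theta^2\,\mathrm{Var}(\mathrm{spr}\,X)\cdot\mathrm{Var}(\mathrm{spr}\,\varepsilon)}{\big(\mathrm{Var}(\mathrm{mid}\,X) + \theta\cdot\mathrm{Var}(\mathrm{spr}\,X)\big)^2} + \mathrm{sgn}(a)\,\frac{2\theta\,\mathrm{Cov}(\mathrm{mid}\,X, \mathrm{spr}\,X)\,\mathrm{Cov}(\mathrm{mid}\,\varepsilon, \mathrm{spr}\,\varepsilon)}{\big(\mathrm{Var}(\mathrm{mid}\,X) + \theta\cdot\mathrm{Var}(\mathrm{spr}\,X)\big)^2}.
\]

Let $G(\theta) = \lim_{n\to\infty} n\cdot\mathrm{MSE}(\theta)$. Then, G(θ) is minimized for

\[
\theta = \frac{\mathrm{Var}(\mathrm{mid}\,X)\,\big[\mathrm{Var}(\mathrm{spr}\,X)\cdot\mathrm{Var}(\mathrm{mid}\,\varepsilon) - \mathrm{sgn}(a)\,\mathrm{Cov}(\mathrm{mid}\,X, \mathrm{spr}\,X)\,\mathrm{Cov}(\mathrm{mid}\,\varepsilon, \mathrm{spr}\,\varepsilon)\big]}{\mathrm{Var}(\mathrm{spr}\,X)\,\big[\mathrm{Var}(\mathrm{mid}\,X)\cdot\mathrm{Var}(\mathrm{spr}\,\varepsilon) - \mathrm{sgn}(a)\,\mathrm{Cov}(\mathrm{mid}\,X, \mathrm{spr}\,X)\,\mathrm{Cov}(\mathrm{mid}\,\varepsilon, \mathrm{spr}\,\varepsilon)\big]}.
\]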

References

[1] C. Bertoluzza, N. Corral and A. Salas, On a new class of distances between fuzzy numbers. Mathware Soft Comput. 2 (1995), pp. 71–84.
[2] L. Billard and E. Diday, Regression analysis for interval-valued data. In Data Analysis, Classification, and Related Methods, Proc. 7th Conf. Int. Feder. Clas. Soc. (H.A.L. Kiers et al., Eds.), Springer, Berlin (2000), pp. 369–374.
[3] L. Billard and E. Diday, From the statistics of data to the statistics of knowledge: Symbolic data analysis. J. Am. Statist. Assoc. 98 (2003), pp. 470–487.
[4] A. Blanco, A. Colubi, N. Corral and G. González-Rodríguez, On a linear independence test for interval-valued random sets. In Soft Methods for Handling Variability and Imprecision (D. Dubois et al., Eds.), Springer, Berlin (2008), pp. 111–117.
[5] A. Blanco-Fernández, Análisis estadístico de un nuevo modelo de regresión lineal flexible para intervalos aleatorios. PhD Thesis. University of Oviedo (2009). (http://bellman.ciencias.uniovi.es/SMIRE/Archivos/Teseo-ABlanco.pdf)
[6] A. Blanco-Fernández, N. Corral and G. González-Rodríguez, Estimation of a flexible simple linear model for interval data based on set arithmetic. Comput. Statist. Data Anal. 55 (2011), pp. 2568–2578.
[7] A. Blanco-Fernández, N. Corral, G. González-Rodríguez and A. Palacio, On some confidence regions to estimate a linear regression model for interval data. In Combining Soft Computing and Statistical Methods in Data Analysis (C. Borgelt et al., Eds.), Springer, Berlin (2010), pp. 33–40.
[8] C.-C. Chuang, Extended support vector interval regression networks for interval input–output data. Inform. Sci. 178 (2008), pp. 871–891.
[9] P. Diamond, Least squares fitting of compact set-valued data. J. Math. Anal. Appl. 147 (1990), pp. 531–544.
[10] F.A.T. De Carvalho, E.A. Lima Neto and C.P. Tenorio, A new method to fit a linear regression model for interval-valued data. In Lect. Not. Comp. Sci. 3238 (S. Biundo et al., Eds.), Springer, Berlin (2004), pp. 295–306.
[11] P.P. D’Urso and P. Giordani, A least squares approach to principal component analysis for interval valued data. Chemom. Intel. Lab. Syst. 70 (2004), pp. 179–192.
[12] P.P. D’Urso, R. Massari and A. Santoro, A class of fuzzy clusterwise regression models. Inform. Sci. 180 (2010), pp. 4737–4762.
[13] P.P. D’Urso, R. Massari and A. Santoro, Robust fuzzy regression analysis. Inform. Sci. 181 (2011), pp. 4154–4174.
[14] S. Ferson, L. Ginzburg, V. Kreinovich, L. Longpré and M. Avilés, Computing variance for interval data is NP-Hard. ACM SIGACT News 33 (2002), pp. 108–118.
[15] S. Ferson, L. Ginzburg, V. Kreinovich, L. Longpré and M. Avilés, Exact bounds on finite populations of interval data. Reliab. Comput. 11 (2005), pp. 207–233.
[16] S. Ferson, V. Kreinovich, J. Hajagos, W. Oberkampf and L. Ginzburg, Experimental uncertainty estimation and statistics for data having interval uncertainty. Sandia Nat. Lab. Tech. Rep. SAND2007-0939, Setauket, New York (2007).
[17] M. García-Bárzana, A. Colubi and E.J. Kontoghiorghes, Least-squares estimation of a multiple regression model for interval data. In Abstracts 4th CSDA Int. Confer. Comput. & Finan. Economet. and 3rd Int. Confer. ERCIM WG Comput & Statist. (CFE 10 & ERCIM 10, London), p. 49. (http://www.cfe-csda.org/cfe10/LondonBoA.pdf)
[18] M.A. Gil, M.T. López-García, M.A. Lubiano and M. Montenegro, A linear correlation coefficient for interval-valued random sets. In Proc. 8th Int. Conf. Inform. Proc. Manag. Uncer. (2000), pp. 104–110.
[19] M.A. Gil, M.T. López-García, M.A. Lubiano and M. Montenegro, Regression and correlation analyses of a linear relation between random intervals. Test 10 (2001), pp. 183–201.
[20] M.A. Gil, M.A. Lubiano, M. Montenegro and M.T. López-García, Least squares fitting of an affine function and strength of association for interval-valued data. Metrika 56 (2002), pp. 97–111.
[21] M.A. Gil, G. González-Rodríguez, A. Colubi and M. Montenegro, Testing linear independence in linear models with interval-valued data. Comput. Statist. Data Anal. 51 (2007), pp. 3002–3015.
[22] G. González-Rodríguez, A. Blanco, A. Colubi and M.A. Lubiano, Estimation of a simple linear regression model for fuzzy random variables. Fuzzy Sets and Systems 160 (2009a), pp. 357–370.
[23] G. González-Rodríguez, A. Blanco, N. Corral and A. Colubi, Least squares estimation of linear regression models for convex compact random sets. Adv. Data Anal. Clas. 1 (2007), pp. 67–81.
[24] G. González-Rodríguez, A. Colubi, R. Coppi and P. Giordani, On the estimation of linear models with interval-valued data. In Proc. 17th Conf. IASC-ERS (COMPSTAT’06), Physica-Verlag, Heidelberg (2006), pp. 697–704.
[25] G. González-Rodríguez, W. Trutschnig and A. Colubi, Confidence regions for the mean of a fuzzy random variable. In Abstracts of IFSA World Congress/EUSFLAT Conference (IFSA-EUSFLAT 2009, Portugal) (2009b), pp. 1433–1438. (http://www.eusflat.org/publications/proceedings/IFSA-EUSFLAT_2009/pdf/tema_1433.pdf)
[26] G. González-Rodríguez, A. Colubi and M.A. Gil, Fuzzy data treated as functional data: A one-way ANOVA test approach. Comput. Statist. Data Anal. In press (2011, doi:10.1016/j.csda.2010.06.013).
[27] R. Körner and W. Näther, On the variance of random fuzzy variables. In Statistical Modeling, Analysis and Management of Fuzzy Data (C. Bertoluzza et al., Eds.), Physica-Verlag, Heidelberg (2002), pp. 22–39.
[28] V. Kreinovich, L. Longpré, S.A. Starks, G. Xiang, J. Beck, R. Kandathi, A. Nayak, S. Ferson and J. Hajagos, Interval versions of statistical techniques with applications to environmental analysis, bioinformatics, and privacy in statistical databases. J. Comput. Appl. Math. 199 (2007), pp. 418–423.
[29] Z. Kulpa, Diagrammatic representation for interval arithmetic. Linear Algebra Appl. 324 (2001), pp. 55–80.
[30] E.A. Lima Neto and F.A.T. De Carvalho, Centre and range method to fitting a linear regression model on symbolic interval data. Comput. Statist. Data Anal. 52 (2008), pp. 1500–1515.
[31] E.A. Lima Neto and F.A.T. De Carvalho, Constrained linear regression models for symbolic interval-valued variables. Comput. Statist. Data Anal. 54 (2010), pp. 333–347.
[32] M.A. Lubiano, Medidas de Variación de Elementos Aleatorios Imprecisos. PhD Thesis. University of Oviedo (1999). (http://www.tesisenred.net/TESIS_UOV/AVAILABLE/TDR-0209110-122449//UOV0067TMALG.pdf)
[33] C.F. Manski and E. Tamer, Inference on regressions with interval data on a regressor or outcome. Econometrica 70 (2002), pp. 519–546.
[34] M. Montenegro, Estadística con Datos Imprecisos Basada en una Métrica Generalizada. PhD Thesis. University of Oviedo (2003). (http://www.tesisenred.net/TESIS_UOV/AVAILABLE/TDR-0209110-120109//UOV0066TMMH.pdf)
[35] W. Trutschnig, G. González-Rodríguez, A. Colubi and M.A. Gil, A new family of metrics for compact, convex (fuzzy) sets based on a generalized concept of mid and spread. Inform. Sci. 179 (2009), pp. 3964–3972.
[36] R.A. Vitale, Lp metrics for compact, convex sets. J. Approx. Theory 45 (1985), pp. 280–287.