
Statistics with Fuzzy Random Variables

Ana Colubi (a), Renato Coppi (b), Pierpaolo D’Urso (c) and María Ángeles Gil (a)

(a) Departamento de Estadística, I.O. y D.M., Universidad de Oviedo, 33071 Oviedo, Spain

(b) Dipartimento di Statistica, Probabilità e Statistiche Applicate, Università degli Studi di Roma “La Sapienza”, 00185 Roma, Italy

(c) Dipartimento di Scienze Economiche, Gestionali e Sociali, Università degli Studi del Molise, 86100 Campobasso, Italy

The notion of Fuzzy Random Variable has been introduced to model random mechanisms generating imprecisely-valued data. Probabilistic aspects of these random elements have been discussed in depth in the literature. However, the statistical analysis of fuzzy random variables has received much less attention, even though its implications range over many fields, including Medicine, Sociology and Economics. A summary of the basic ideas underlying fuzzy random variables is presented. Then some related parameters are recalled, and some recent statistical studies are described.

Keywords: arithmetic of fuzzy data, estimation, fuzzy data, hypothesis testing, informational paradigm, LR fuzzy numbers, metrics between fuzzy data, regression analysis.

1 Introduction

Statistical reasoning is a specific and relevant instance of approximate reasoning under uncertainty. In particular, it refers to the analysis of collective phenomena, namely phenomena which are defined with reference to a collection of empirical observations. On each observation one or more variables are measured in various ways. This constitutes the “empirical information” ($\Im_E$), which is a basic component of the reasoning process in Statistics. The other component is the “theoretical information” ($\Im_T$), namely the background theory, the assumptions and the models which allow us to process $\Im_E$ in order to draw the conclusions of the reasoning procedure, which represent the “informational gain” of this piece of knowledge acquisition based on the couple ($\Im_E$, $\Im_T$). This way of looking at statistical methodology has been called the “Informational Paradigm” (Coppi, 2002). The advantage of this perspective is that it allows us to widen the spectrum of statistical theory and applications, considering different kinds of “informational ingredients” (either empirical or theoretical ones) as input of the statistical reasoning process.

In order to illustrate this point, we must recall that some kind of uncertainty generally affects the above-mentioned reasoning process and, specifically, its informational ingredients. As a matter of fact, the observed data may be affected by various types of uncertainty concerning: (i) the measuring system; (ii) the way of expressing the assessment of their measure (e.g., numerically, linguistically, by means of numerical intervals, etc.); (iii) the way they are eventually selected from a larger “population”; (iv) the possible “vagueness” in defining the underlying concepts through the use of observable variables (e.g., measuring an opinion by means of a visual analogue scale; see, for instance, Coppi et al., 2007a). Likewise, some theoretical informational ingredients can be affected by uncertainty.
For example, we may doubt the Gaussian assumption about the distribution of a given quantitative variable, used as a means for drawing inferences from an observed sample (the adoption of a Bayesian framework for coping with this type of uncertainty is often suggested in the literature, besides the use of robust procedures, nonparametric inference or resampling techniques). A different type of uncertainty may concern, for instance, the assumption of a regression model linking the “response” variable with the “explanatory” ones. In classical inferential statistics, a “true”, albeit unknown, regression model is assumed, characterized by a vector of (unknown) precisely measured regression coefficients. This “crisp” assumption may be relaxed by thinking of a “fuzzy” regression relationship, whereby the unknown coefficients are fuzzy numbers to be estimated by means of appropriate procedures. In this case, the uncertainty related to the assumption of a specific regression model is dealt with through the fuzziness of the regression system (see, e.g., Tanaka et al., 1982).

Fuzziness may also be adopted as a tool for coping with some types of uncertainty affecting statistical data (e.g., in case (iv) of the above-mentioned list, or in the case of linguistic data as mentioned in situation (ii)). On the other hand, randomness, managed by suitable probability models, constitutes the usual tool for dealing with data uncertainty due to sampling from finite or infinite populations (type (iii) of the above list), or for processing subjective uncertainty concerning theoretical assumptions (such as the sampling distribution in a statistical setup, or the parameters of a given statistical model) in a Bayesian framework.

Randomness and fuzziness may act separately or jointly on the various informational ingredients of a statistical reasoning process. From a mathematical viewpoint they are respectively managed by means of probability theory and fuzzy set theory. Due to the uncertainty of the ingredients, the conclusions of the process are also uncertain. The latter uncertainty must be expressed as a result of the propagation of the various types of uncertainty involved through the given system of statistical reasoning. This opens new problems in statistical methodology, as compared with traditional inferential statistics, where the whole of the uncertainty in the system was expressed in probabilistic terms (see Coppi et al., 2007b and, for a thorough discussion, the Special Issue of “Computational Statistics and Data Analysis” (2007c, eds. R. Coppi, M.A. Gil, H.A.L. Kiers) devoted to “The Fuzzy Approach to Statistical Analysis”).

In the present work, we will focus in particular on the case where the observed data are jointly affected by two sources of uncertainty: fuzziness (due to imprecision, vagueness, partial ignorance) and randomness (due to sampling or measurement errors of a stochastic nature).
In this framework we will show the theoretical and applied potential of a specific tool developed for coping with this type of complex uncertainty: the notion of “Fuzzy Random Variable”.

In collecting statistical data one frequently comes across an underlying imprecision due, for instance, to inexactness in the measuring process, vagueness of the concepts involved or a certain degree of ignorance about the real values. In many cases, such imprecision can be modelled by means of fuzzy sets more effectively than by considering only a single value or category.

Example 1. Forestry. In Forestry, a typical score for evaluating the quality of trees leads to data corresponding to ‘values’ like “excellent”, “good”, “fair”, “bad” and “very bad”. These values usually derive from the aggregation of different characteristics which are not easy to quantify on a numerical scale (such as leafiness, good structure, and so on). They can be properly described by means of fuzzy sets, such as those represented in Figure 1.

[Figure 1: Quality of trees. Membership functions of the fuzzy sets “Very bad”, “Bad”, “Fair”, “Good” and “Excellent” on a scale from 0 to 4.]

Example 2. Finance. In dealing with the fluctuation of share prices, the registered data are determined by the minimum and maximum prices, which leads to interval-valued data that can overlap. These interval data can be represented by special fuzzy sets, whose membership functions correspond to the characteristic functions of the considered intervals (see, for instance, those in Figure 2). The fuzzy scale is hence more expressive than the real- and interval-valued ones.

[Figure 2: Fluctuation of share prices. Interval-valued data on a price axis from 0 to 200, with membership values 0 or 1.]

Example 3. Health state. Coppi (2003) considers an example of multivariate fuzzy empirical information in which we take into account “a set of individuals (e.g. a population living in a given area). Each individual, from a clinical viewpoint, can be characterized according to her/his ‘health state’. This refers to the ‘normal’ functioning of the various ‘aspects’ of her/his organism. Generally, any ‘aspect’ works correctly to a certain extent. We often use the notion of ‘insufficiency’, related to different relevant functions of parts of the body (e.g. renal or hepatic insufficiency, aortic incompetence, etc.). Insufficiency (as referred to the various relevant above mentioned aspects) is a concept which applies in a certain degree to any individual (depending on several factors such as age, sex, previous illnesses, and so on). This may be expressed by means of a fuzzy number on a standard scale (say, from 0 = perfect functioning, to 10 = complete insufficiency). Consequently, each individual can be more realistically characterized by a vector of fuzzy variables concerning ‘insufficiency’ of various relevant aspects of her/his organism” (Coppi, 2003).

Example 4. Vehicle speed. Pedrycz et al. (1998) remark that the features used for describing and classifying vehicle speed can have different representations. “Consider the speed s of a vehicle. If measured precisely at some time instant, speed s is a real number, say s = 100. Figure 3 (case a) shows the membership function $\mu_1(s)$ for this case, $\mu_1(s) = 1 \Leftrightarrow s = 100$; otherwise, $\mu_1(s) = 0$. This piece of data could be collected by one observation of a radar gun. Next, suppose that two radar guns held by observers at different locations both measure s at the same instant. One sensor might suggest that s = 80, while the second measurement might be s = 110. Uncalibrated instruments could lead to this situation. In this case, several representations of the collected data offer themselves.
One way to represent these temporally collocated data points is by the single interval [80, 110], as shown by the membership function $\mu_2(s)$ in Figure 3 (case b), $\mu_2(s) = 1 \Leftrightarrow 80 < s < 110$; otherwise, $\mu_2(s) = 0$. Finally, it may happen that vehicle speed is evaluated nonnumerically by a human observer who might state simply that ‘s is very high’. In this case, the observation can be naturally modelled by a real fuzzy set. The membership function $\mu_3(s)$ shown in Figure 3 (case c) is one (of infinitely many) possible representations of the linguistic term ‘very high’: $\mu_3(s) = \max\{0, 1 - 0.1\,|100 - s|\}$, $s \in \mathbb{R}$” (Pedrycz et al., 1998).

[Figure 3: Membership functions $\mu_1(s)$ (case a), $\mu_2(s)$ (case b) and $\mu_3(s)$ (case c) for the vehicle speed example.]

Example 5. Medical diagnosis. In medical diagnosis, Di Lascio et al. (2002) suggested suitable linguistic labels for classifying, according to the severity of the symptoms, two samples of patients with diabetic neuropathy hospitalized at different times. The labels denote different severity grades of a symptom. Each label is represented by a triangular fuzzy number. The values are assigned partly by consulting an expert in the field, and partly by activating the phase of initial training. In particular, the labels and the corresponding fuzzy codes are shown in Table 1 (see Sec. 2 for technical details).

Example 6. Decision making. “Suppose, as you approach a red light, you must advise a driving student when to apply the brakes. Would you say, ‘Begin braking 74 feet from the crosswalk’? Or would your advice be more like, ‘Apply the brakes pretty soon’? The latter, of course; the former instruction is too precise to be implemented. This illustrates that precision may be quite useless, while vague directions can be interpreted and acted upon. Everyday language is one example of ways vagueness is used and propagated. Children quickly learn how to interpret and implement fuzzy instructions (‘go to bed about 10’). We all assimilate and use (act on) fuzzy data, vague rules, and imprecise information, just as we are able to make decisions about situations which seem to be governed by an element of chance. Accordingly, computational models of real systems should also be able to recognize, represent, manipulate, and use (act on) [fuzzy uncertainty]” (Bezdek, 1993).

When data are examined for statistical purposes, probabilistic models are considered to formalize both the setting underlying the problem and the mechanism generating the data. For real/vectorial data, random variables/vectors model such a mechanism, whereas for interval-valued (more generally, set-valued) data this model is given by random sets. The mechanisms generating fuzzy data have been properly modelled through the concept of fuzzy random variables (also referred to in the literature as random fuzzy sets).

This paper reviews the key existing statistical developments for fuzzy random variables, namely those related to parametric inference and regression.

2 Formalizing fuzzy data

The fuzzy-valued data we will deal with in this paper are those belonging to the class

$$\mathcal{F}_c(\mathbb{R}) = \big\{ U : \mathbb{R} \to [0, 1] \;\big|\; U_\alpha \text{ is a compact interval for all } \alpha \in [0, 1] \big\},$$

where $U_\alpha$ denotes the α-level of the fuzzy set $U$, that is, $U_\alpha = \{x \in \mathbb{R} \mid U(x) \ge \alpha\}$ if $\alpha \in (0, 1]$, and $U_0 = \mathrm{cl}(\mathrm{supp}\, U) = \mathrm{cl}\{x \in \mathbb{R} \mid U(x) > 0\}$, and $U(x)$ represents the degree of compatibility of $x$ with the property defining $U$ (also degree of membership of $x$ in $U$, or degree of possibility of $x$ being $U$).

In this way each fuzzy datum $U \in \mathcal{F}_c(\mathbb{R})$ to be considered can be characterized by means of the family of compact intervals $\{[\inf U_\alpha, \sup U_\alpha]\}_{\alpha \in [0,1]}$. Alternatively, it can be characterized by the family of vectors $\{(\mathrm{mid}\, U_\alpha, \mathrm{spr}\, U_\alpha)\}_{\alpha \in [0,1]}$, where $\mathrm{mid}\, U_\alpha = (\inf U_\alpha + \sup U_\alpha)/2$ and $\mathrm{spr}\, U_\alpha = (\sup U_\alpha - \inf U_\alpha)/2$ are the center and the radius of the α-level, respectively.

A special type of fuzzy datum which is very often considered is the LR-fuzzy number. An LR-fuzzy number, briefly denoted by $(m, l, u)_{LR}$, is an element $U \in \mathcal{F}_c(\mathbb{R})$ such that

$$U(x) = \begin{cases} L\left(\dfrac{m - x}{l}\right) & \text{if } x \le m, \\[2ex] R\left(\dfrac{x - m}{u}\right) & \text{if } x \ge m, \end{cases}$$

with $m \in \mathbb{R}$, $l > 0$, $u > 0$, and $L : \mathbb{R} \to [0, 1]$, $R : \mathbb{R} \to [0, 1]$ being monotonically nonincreasing functions such that $L(0) = R(0) = 1$ (see, for instance, Figure 4).

[Figure 4: Examples of LR-fuzzy data, with horizontal axis marks at $m - l$, $m$ and $m + u$.]
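The level-wise characterization above can be made concrete with a minimal Python sketch. It assumes the triangular shape functions $L(x) = R(x) = \max(0, 1 - x)$, which is only one admissible choice; the function names are hypothetical and introduced for illustration.

```python
def membership(m, l, u, x):
    """Membership degree U(x) of the LR number (m, l, u).

    Assumption: triangular shapes L(x) = R(x) = max(0, 1 - x).
    """
    if x <= m:
        return max(0.0, 1.0 - (m - x) / l)
    return max(0.0, 1.0 - (x - m) / u)


def alpha_level(m, l, u, alpha):
    """Compact interval [inf U_alpha, sup U_alpha] under the same assumption.

    U(x) >= alpha  <=>  m - l(1 - alpha) <= x <= m + u(1 - alpha);
    for alpha = 0 this is the closure of the support.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return (m - l * (1.0 - alpha), m + u * (1.0 - alpha))


def mid_spr(level):
    """Center and radius (mid, spr) of an alpha-level given as (inf, sup)."""
    lo, hi = level
    return ((lo + hi) / 2.0, (hi - lo) / 2.0)
```

For instance, the 0.5-level of the triangular number with $m = 10$, $l = 2$, $u = 4$ is the interval [9, 12], with center 10.5 and radius 1.5.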

3 Arithmetic of fuzzy data

In handling fuzzy data for probabilistic/statistical purposes the basic operations are the sum and the multiplication by scalars. These operations are usually assumed to be based on Zadeh’s extension principle (1975). The application of this principle to elements of $\mathcal{F}_c(\mathbb{R})$ can be proven to satisfy, for all $\alpha \in [0, 1]$:

$$(U + V)_\alpha = \{u + v \mid u \in U_\alpha,\ v \in V_\alpha\}, \qquad (\lambda U)_\alpha = \{\lambda u \mid u \in U_\alpha\},$$

whatever $U, V \in \mathcal{F}_c(\mathbb{R})$ and $\lambda \in \mathbb{R}$. This arithmetic coincides level-wise with set-valued arithmetic.

The operations above become especially easy to perform when we deal with LR-fuzzy numbers. More concretely, in case $U$ and $V$ involve the same functions $L$ and $R$, then

$$(m, l, u)_{LR} + (m', l', u')_{LR} = (m + m', l + l', u + u')_{LR},$$

and

$$\lambda\,(m, l, u)_{LR} = \begin{cases} (\lambda m, \lambda l, \lambda u)_{LR} & \text{if } \lambda > 0, \\ (\lambda m, -\lambda u, -\lambda l)_{RL} & \text{if } \lambda < 0. \end{cases}$$
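The two LR rules can be sketched in a few lines of Python. The sketch assumes that both operands share the same shape functions and that $L = R$ (for example, triangular numbers), so that the RL result for $\lambda < 0$ is again of the same type; the class and function names are hypothetical.

```python
from typing import NamedTuple


class LR(NamedTuple):
    m: float  # center
    l: float  # left spread
    u: float  # right spread


def add(a, b):
    """Sum of two LR numbers sharing the same L and R."""
    return LR(a.m + b.m, a.l + b.l, a.u + b.u)


def scale(lam, a):
    """Product of the scalar lam and the LR number a.

    For lam < 0 the left and right spreads swap roles (the RL case);
    lam = 0 yields the degenerate crisp value 0, which lies outside
    the l > 0, u > 0 convention.
    """
    if lam >= 0:
        return LR(lam * a.m, lam * a.l, lam * a.u)
    return LR(lam * a.m, -lam * a.u, -lam * a.l)
```

Note that `add(a, scale(-1, a))` has center 0 but spreads `a.l + a.u` on both sides, so it is not the crisp zero: multiplying by $-1$ does not produce an additive inverse.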

With the above-mentioned operations $\mathcal{F}_c(\mathbb{R})$ is endowed with a semilinear structure, the absence of linearity being associated with the fact that there is no inverse for the fuzzy addition. When necessary, the Hukuhara difference will be considered. In this sense, $C$ is called the Hukuhara difference between $A$ and $B \in \mathcal{F}_c(\mathbb{R})$ if $A = B + C$ (that is, $C_\alpha = (A -_H B)_\alpha = [\inf A_\alpha - \inf B_\alpha, \sup A_\alpha - \sup B_\alpha]$, provided that $\inf A_\alpha - \inf B_\alpha \le \sup A_\alpha - \sup B_\alpha$ for all $\alpha \in [0, 1]$).

Since fuzzy data correspond to [0, 1]-valued mappings on $\mathbb{R}$, they are in fact functional data with a very intuitive interpretation. However, whereas functional data are often assumed to be Hilbert space-valued, the space of fuzzy data with the usual arithmetic can only be embedded into a cone of a Hilbert space, and this cone is not a linear space. As a consequence, in the development of many probabilistic and statistical results one should take special care to guarantee that the operations involved do not lead to elements outside the cone. For this reason, the case of fuzzy data often requires a specialized study.
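As an illustration of the existence condition for the Hukuhara difference, the following Python sketch works on a finite grid of α-levels, each given as a pair (inf, sup) and listed by increasing α. The function name is hypothetical, and the nestedness check is an extra sanity check added here on the assumption that the result must itself have nested levels.

```python
def hukuhara_difference(a_levels, b_levels):
    """Levels of C with A = B + C on a common alpha grid, or None.

    a_levels, b_levels: lists of (inf, sup) pairs for increasing alpha.
    Only the finitely many grid levels are checked.
    """
    c_levels = []
    for (a_lo, a_hi), (b_lo, b_hi) in zip(a_levels, b_levels):
        lo, hi = a_lo - b_lo, a_hi - b_hi
        if lo > hi:  # the stated condition fails at this level
            return None
        c_levels.append((lo, hi))
    # Extra check (assumption): levels must shrink as alpha increases.
    for (lo0, hi0), (lo1, hi1) in zip(c_levels, c_levels[1:]):
        if lo1 < lo0 or hi1 > hi0:
            return None
    return c_levels
```

With the levels of two triangular numbers at α = 0, 0.5, 1, say `[(3, 7), (4, 6), (5, 5)]` and `[(1, 3), (1.5, 2.5), (2, 2)]`, the difference of the first minus the second exists, whereas the reverse difference does not, because the second number is "narrower" than the first.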

4 Metrics for fuzzy data

For both probabilistic and statistical studies, appropriate metrics for fuzzy data should be considered. In dealing with interval-valued data a well-known metric is the Hausdorff distance, which for intervals $A$, $B$ is given by

$$d_H(A, B) = \max\{|\sup A - \sup B|,\ |\inf A - \inf B|\}.$$

The Hausdorff metric plays a key role in formalizing the measurability of set-valued mappings (random sets), which model random mechanisms generating set-valued data. Since Puri and Ralescu (1986) formalized the random mechanisms generating fuzzy data as an extension of measurable set-valued mappings, the Hausdorff metric can also be used in this respect.

There are a number of metrics on $\mathcal{F}_c(\mathbb{R})$ extending the Hausdorff one (see, for instance, Puri and Ralescu, 1986, Klement et al., 1986 or Diamond and Kloeden, 1994). However, these kinds of metrics are not operative for analyzing probabilistic and statistical aspects involving the variability of fuzzy random variables (see, for instance, Körner and Näther, 2002). A suitable metric in these cases is the $D_K$-distance (Körner and Näther, 2002), which for fuzzy data $U, V \in \mathcal{F}_c(\mathbb{R})$ is given by

$$D_K(U, V) = \sqrt{\int_{\{-1,1\}^2 \times [0,1]^2} \big(s_U(u, \alpha) - s_V(u, \alpha)\big)\big(s_U(v, \beta) - s_V(v, \beta)\big)\, dK(u, \alpha, v, \beta)},$$

where $s$ is the support function in $\mathcal{F}_c(\mathbb{R})$ (i.e., $s_U(-1, \alpha) = \inf U_\alpha$, $s_U(1, \alpha) = \sup U_\alpha$), and $K$ is a positive definite and symmetric kernel. $D_K$ is the generic $L^2$ distance in the cone in which $\mathcal{F}_c(\mathbb{R})$ is embedded through the support function (see Diamond and Kloeden, 1994).

A particular choice for $D_K$ is the $D_W^\varphi$-metric introduced by Bertoluzza et al. (1995). For $U, V \in \mathcal{F}_c(\mathbb{R})$ the $D_W^\varphi$ distance between $U$ and $V$ is given by

$$D_W^\varphi(U, V) = \sqrt{\int_{[0,1]} \int_{[0,1]} \big[f_U(\alpha, \lambda) - f_V(\alpha, \lambda)\big]^2\, dW(\lambda)\, d\varphi(\alpha)},$$

where $f_U(\alpha, \lambda) = \lambda \sup U_\alpha + (1 - \lambda) \inf U_\alpha$, and $W$ and $\varphi$ are weighting measures which can be identified with probability measures on the measurable space $([0, 1], \mathcal{B}[0, 1])$; $W$ is associated with a non-degenerate distribution, whereas $\varphi$ has a distribution function which is strictly increasing on [0, 1]. The conditions on $W$ and $\varphi$ are imposed to guarantee that $D_W^\varphi$ is in fact a metric. Nevertheless, it should be noted that although $W$ and $\varphi$ are identified with probability measures, they do not have a stochastic meaning.

It should be emphasized that the choice of $W$ is equivalent to the choice of a triple $(\lambda_1, \lambda_2, \lambda_3)$ (with $\lambda_1 > 0$, $\lambda_2 > 0$, $\lambda_3 > 0$, $\lambda_1 + \lambda_2 + \lambda_3 = 1$) weighting three different specific points (for instance, the infimum, mid-point and supremum) for each α-level; the choice of $\varphi$ allows us to associate a specific weight or importance with each α-level.
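To make the two distances tangible, here is a minimal Python sketch for the particular choice $W = \varphi =$ Lebesgue measure on [0, 1]. In that case the inner integral over $\lambda$ has the closed form $(a^2 + ab + b^2)/3$, where $a$ and $b$ are the differences of the suprema and of the infima at level α; the outer integral is approximated by a midpoint rule. Fuzzy data are passed as functions mapping α to (inf, sup); this representation and the function names are assumptions made for illustration.

```python
import math


def hausdorff(a, b):
    """Hausdorff distance between intervals a = (inf, sup), b = (inf, sup)."""
    return max(abs(a[1] - b[1]), abs(a[0] - b[0]))


def dw_distance(level_u, level_v, n_grid=1000):
    """D_W^phi(U, V) with W = phi = Lebesgue measure (assumption).

    level_u, level_v: callables alpha -> (inf, sup).
    Inner integral over lambda in closed form, outer integral over
    alpha by the midpoint rule with n_grid nodes.
    """
    total = 0.0
    for k in range(n_grid):
        alpha = (k + 0.5) / n_grid
        u_lo, u_hi = level_u(alpha)
        v_lo, v_hi = level_v(alpha)
        a = u_hi - v_hi  # difference of the suprema
        b = u_lo - v_lo  # difference of the infima
        total += (a * a + a * b + b * b) / 3.0
    return math.sqrt(total / n_grid)
```

For the intervals [0, 2] and [1, 5] (constant α-levels) the Hausdorff distance is 3, whereas the sketch above gives $\sqrt{13/3}$, since the latter also accounts for the smaller discrepancy between the infima.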

5 Formalizing the random mechanisms generating fuzzy data: fuzzy random variables

Fuzzy random variables, as intended in this section (Puri and Ralescu, 1986), have been considered in the setting of a random experiment to model an essentially fuzzy-valued mechanism, that is, a mechanism associating a fuzzy value with each experimental outcome. The model is stated as follows:

Definition 1. Given a probability space $(\Omega, \mathcal{A}, P)$, a mapping $\mathcal{X} : \Omega \to \mathcal{F}_c(\mathbb{R})$ is said to be a fuzzy random variable (also called random fuzzy set, and referred to as FRV for short) if for all $\alpha \in [0, 1]$ the interval-valued mappings $\mathcal{X}_\alpha$, defined so that for all $\omega \in \Omega$

$$\mathcal{X}_\alpha(\omega) = \big(\mathcal{X}(\omega)\big)_\alpha,$$

are random sets (that is, Borel-measurable mappings with respect to the Borel σ-field generated by the topology associated with the Hausdorff metric on the class of the nonempty compact intervals).

Fuzzy random variables extend the notion of random sets and, therefore, that of real-valued random variables. In fact, the above definition is equivalent to the one given by Kruse and Meyer (1987) (although the situation which is modelled is conceptually different), in accordance with which $\mathcal{X} : \Omega \to \mathcal{F}_c(\mathbb{R})$ is a FRV if and only if for all $\alpha \in [0, 1]$ the real-valued mappings $\inf \mathcal{X}_\alpha$ and $\sup \mathcal{X}_\alpha$ are random variables.

6 Relevant ‘parameters’ of the distribution of fuzzy random variables

Probabilistic and statistical studies for FRVs usually concern fuzzy- and real-valued ‘parameters’. These parameters extend either those from the set-valued case or other ones from the real-valued case.

The best known fuzzy parameter is the ‘fuzzy expected value’, which is defined (Puri and Ralescu, 1986) as a fuzzy-valued measure of the ‘central tendency’ of a FRV as follows:

Definition 2. Given a probability space $(\Omega, \mathcal{A}, P)$, if $\mathcal{X} : \Omega \to \mathcal{F}_c(\mathbb{R})$ is a fuzzy random variable such that $d_H(\{0\}, \mathcal{X}_0)$ is integrable, then the (fuzzy) expected value of $\mathcal{X}$ corresponds to the unique $E(\mathcal{X}) \in \mathcal{F}_c(\mathbb{R})$ such that for all $\alpha \in [0, 1]$

$$\big(E(\mathcal{X})\big)_\alpha = \text{Aumann's integral of the random set } \mathcal{X}_\alpha = \left\{ \int_\Omega X(\omega)\, dP(\omega) \;\Big|\; X : \Omega \to \mathbb{R},\ X \in L^1(\Omega, \mathcal{A}, P),\ X \in \mathcal{X}_\alpha \text{ a.s. } [P] \right\}.$$

It should be remarked that in the $\mathcal{F}_c(\mathbb{R})$-valued case we are considering, we have that

$$\big(E(\mathcal{X})\big)_\alpha = [E(\inf \mathcal{X}_\alpha),\ E(\sup \mathcal{X}_\alpha)] \quad \text{for all } \alpha \in [0, 1].$$
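The level-wise formula can be illustrated on a finite sample, where expectations become arithmetic means. The Python sketch below assumes, purely for illustration, that each fuzzy datum is a trapezoid `(a, b, c, d)` with α-level $[a + \alpha(b - a),\ d - \alpha(d - c)]$; under that assumption the mean of the α-levels coincides with the α-level of the componentwise mean. All names are hypothetical.

```python
def level(t, alpha):
    """alpha-level (inf, sup) of the trapezoid t = (a, b, c, d)."""
    a, b, c, d = t
    return (a + alpha * (b - a), d - alpha * (d - c))


def mean_level(sample, alpha):
    """[mean of inf X_alpha, mean of sup X_alpha] over the sample."""
    n = len(sample)
    lows, highs = zip(*(level(t, alpha) for t in sample))
    return (sum(lows) / n, sum(highs) / n)


def fuzzy_mean(sample):
    """Sample fuzzy mean of trapezoids: componentwise arithmetic mean."""
    n = len(sample)
    return tuple(sum(t[i] for t in sample) / n for i in range(4))
```

For the two trapezoids `(0, 1, 2, 3)` and `(2, 3, 4, 5)`, the fuzzy mean is `(1, 2, 3, 4)`, and its 0.5-level [1.5, 3.5] agrees with the average of the two 0.5-levels [0.5, 2.5] and [2.5, 4.5].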

Properties of the fuzzy mean value can be found, for instance, in Puri and Ralescu (1986), Negoita and Ralescu (1987), Colubi et al. (1999).

To illustrate the concepts and methods in this paper we consider the following example:

Example 7. Clients of a Savings Bank are classified in accordance with their ‘degree of aversion’ to investment. Because of the labels assigned to clients by the Bank managers, the classification process can be viewed as a fuzzy random variable $\mathcal{X}$ which takes on the values $\tilde{x}_1$ = ‘very low degree of aversion’, $\tilde{x}_2$ = ‘low degree of aversion’, $\tilde{x}_3$ = ‘medium degree of aversion’, $\tilde{x}_4$ = ‘high degree of aversion’, $\tilde{x}_5$ = ‘almost total aversion’, with these five values being identified with fuzzy ones described, for instance, by the fuzzy numbers in Figure 5, which are defined in terms of S-curves as follows:

$$\tilde{x}_1 = \begin{cases} 1 & \text{in } [0, 10] \\ 1 - S(10, 20) & \text{in } [10, 15] \\ 0 & \text{otherwise} \end{cases} \qquad \tilde{x}_2 = \begin{cases} S(5, 20) & \text{in } [5, 20] \\ 1 & \text{in } [20, 30] \\ 1 - S(30, 40) & \text{in } [30, 40] \\ 0 & \text{otherwise} \end{cases}$$

$$\tilde{x}_3 = \begin{cases} S(30, 40) & \text{in } [30, 40] \\ 1 & \text{in } [40, 60] \\ 1 - S(60, 70) & \text{in } [60, 70] \\ 0 & \text{otherwise} \end{cases} \qquad \tilde{x}_4 = \begin{cases} S(65, 75) & \text{in } [65, 75] \\ 1 & \text{in } [75, 85] \\ 1 - S(85, 100) & \text{in } [85, 100] \\ 0 & \text{otherwise} \end{cases}$$

$$\tilde{x}_5 = \begin{cases} S(80, 100) & \text{in } [80, 100] \\ 0 & \text{otherwise} \end{cases}$$

with

$$S(a, b)(t) = \begin{cases} 0 & \text{if } t \le a \\[1ex] 2\left(\dfrac{t - a}{b - a}\right)^2 & \text{if } t \in \left[a, \dfrac{a + b}{2}\right] \\[2ex] 1 - 2\left(\dfrac{t - b}{b - a}\right)^2 & \text{if } t \in \left[\dfrac{a + b}{2}, b\right] \\[2ex] 1 & \text{otherwise.} \end{cases}$$

[Figure 5: Values of the ‘degree of aversion to investment’ of clients of a given Savings Bank. Membership functions of “very low”, “low”, “medium”, “high” and “almost total” on a scale from 0 to 100.]
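The S-curve and one of the fuzzy values built from it can be written directly in Python. The sketch below follows the piecewise definitions of $S(a, b)$ and $\tilde{x}_3$ given above; only the function names are introduced here.

```python
def s_curve(a, b, t):
    """S(a, b)(t): smooth nondecreasing curve from 0 (t <= a) to 1 (t >= b)."""
    if t <= a:
        return 0.0
    mid = (a + b) / 2.0
    if t <= mid:
        return 2.0 * ((t - a) / (b - a)) ** 2
    if t <= b:
        return 1.0 - 2.0 * ((t - b) / (b - a)) ** 2
    return 1.0


def x3(t):
    """Membership function of 'medium degree of aversion'."""
    if 30 <= t <= 40:
        return s_curve(30, 40, t)
    if 40 < t <= 60:
        return 1.0
    if 60 < t <= 70:
        return 1.0 - s_curve(60, 70, t)
    return 0.0
```

The curve passes through 0.5 at the midpoint of $[a, b]$, so that, for example, `x3(35)` and `x3(65)` are both 0.5, while `x3(50)` is 1.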

It should be pointed out (see also the example in Figure 1) that in describing imprecisely-valued data, it is often considered that ‘central’ values are more ‘imprecise’ (in the sense of being closer to intervals than to real numbers) than ‘extreme’ ones.

Bank managers are interested in the ‘mean degrees of aversion’ in the three main offices of the Savings Bank in a given area, which will be denoted by Ω1, Ω2 and Ω3. For this purpose, they consider a whole sample of n = 122 clients, and observe the fuzzy random variable $\mathcal{X}$ on the subsamples of sizes n1 = 33, n2 = 44 and n3 = 45 from the three populations, in which they get the following fuzzy data:

Of course the table above could be treated as a contingency one, but in such a treatment one would develop a statistical analysis of ordinal or qualitative data. As a consequence, it would not be possible to consider and compare distances between data, so that the expressive scale of measurement of $\mathcal{F}_c(\mathbb{R})$ and the valuable information associated with distances would be lost. Although alternative ways to handle these imprecise data could be suggested, fuzzy random variables represent an intuitive way to describe them; moreover, operations with the data are easy to perform and no extra information (like stochastic dependence, etc.) is required.

As an illustration of the idea of fuzzy expected value we can now represent those corresponding to the FRV $\mathcal{X}$ in the samples from Ω1, Ω2 and Ω3 (see Figure 6). In accordance with the graphics for these sample mean values, the mean risk aversion to investment is ‘quite moderate’ for the first two samples, whereas it is ‘rather high’ for the one from Ω3.

[Figure 6: Mean (fuzzy) values of the ‘degree of aversion to investment’ in the three considered samples in the Example; horizontal axis from 25 to 75.]

Other useful parameters are those quantifying the absolute/relative variation of a FRV (see, for instance, Lubiano, 1999, Körner and Näther, 2002), when it is intended as a measure of how values differ among them in terms of distances/ratios (respectively). Since variation summary measures are usually employed to compare variables or populations, they are often considered to be real-valued.

A real-valued measure for the ‘absolute variation’ of a FRV can be obtained, for instance, by expressing how much “in error” the fuzzy mean is expected to be as a description of the variable values. This error could be quantified in a natural way as follows:

Definition 3. Let $\mathcal{X} : \Omega \to \mathcal{F}_c(\mathbb{R})$ be a fuzzy random variable such that $\big[d_H(\{0\}, \mathcal{X}_0)\big]^2 \in L^1(\Omega, \mathcal{A}, P)$. Then the variance of $\mathcal{X}$ is given by the value

$$Var(\mathcal{X}) = E\big([D_K(\mathcal{X}, E(\mathcal{X}))]^2\big) = \int_\Omega [D_K(\mathcal{X}(\omega), E(\mathcal{X}))]^2\, dP(\omega).$$
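A sample version of this variance can be sketched in Python for the metric $D_W^\varphi$ with $W = \varphi =$ Lebesgue measure. As a simplifying assumption made only for this illustration, fuzzy data are trapezoids `(a, b, c, d)`; then the squared distance is the integral of a quadratic polynomial in α, for which Simpson's rule with nodes 0, 0.5 and 1 is exact. All function names are hypothetical.

```python
def level(t, alpha):
    """alpha-level (inf, sup) of the trapezoid t = (a, b, c, d)."""
    a, b, c, d = t
    return (a + alpha * (b - a), d - alpha * (d - c))


def dw_sq(s, t):
    """Squared D_W^phi distance, W = phi = Lebesgue (assumption)."""
    def g(alpha):
        s_lo, s_hi = level(s, alpha)
        t_lo, t_hi = level(t, alpha)
        a, b = s_hi - t_hi, s_lo - t_lo
        return (a * a + a * b + b * b) / 3.0  # integral over lambda
    # g is quadratic in alpha for trapezoids, so Simpson's rule is exact.
    return (g(0.0) + 4.0 * g(0.5) + g(1.0)) / 6.0


def fuzzy_mean(sample):
    """Componentwise mean, i.e. the sample fuzzy mean of trapezoids."""
    n = len(sample)
    return tuple(sum(t[i] for t in sample) / n for i in range(4))


def sample_variance(sample, corrected=False):
    """Mean squared distance to the fuzzy mean; n - 1 divisor if corrected."""
    mean = fuzzy_mean(sample)
    total = sum(dw_sq(x, mean) for x in sample)
    return total / (len(sample) - 1 if corrected else len(sample))
```

For the two intervals [0, 2] and [2, 6], coded as `(0, 0, 2, 2)` and `(2, 2, 6, 6)`, the fuzzy mean is the interval [1, 4] and each datum lies at squared distance 7/3 from it, so the uncorrected sample variance is 7/3.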

This variation measure agrees with the Fréchet approach and satisfies the usual properties of the variance (see Lubiano, 1999, Körner and Näther, 2002).

Example 8. For the data in Example 7 the absolute variation of $\mathcal{X}$ in the samples from Ω1, Ω2 and Ω3 can be quantified (by considering $W = \varphi =$ Lebesgue measure). Thus, the sample variation is given by

$$Var(\mathcal{X} \mid (\omega_{1,1}, \ldots, \omega_{1,33})) = 776.585,$$
$$Var(\mathcal{X} \mid (\omega_{2,1}, \ldots, \omega_{2,44})) = 791.930,$$
$$Var(\mathcal{X} \mid (\omega_{3,1}, \ldots, \omega_{3,45})) = 883.148.$$

Since the scales for the fuzzy mean values are similar, absolute variations are comparable, whence we can conclude that the samples from Ω1 and Ω2 show close variations, whereas (in the absolute sense) $\mathcal{X}$ is slightly more variable over Ω3 than over them.

A real-valued measure for the ‘relative variation’ of a FRV can be obtained, for instance, by considering the ‘ratio’ of its values with respect to the mean one (i.e., “how many times the mean value its values amount to”), so it distinguishes between values being above and below the mean. Another distinctive characteristic of the relative variation, as opposed to the absolute one, is that the “contribution” of variable values to the inequality associated with this variable is not necessarily nonnegative (more precisely, the contribution for values above the mean one is usually assumed to be negative, for values below the mean it is positive, and for values coinciding with the mean the contribution is null).

The key problem in the case of fuzzy data will now be that of dividing and ranking them. In Lubiano (1999) we can find a detailed justification of the arguments leading to the following

Definition 4. Assume that the FRV $\mathcal{X} : \Omega \to \mathcal{F}_c(\mathbb{R})$ satisfies the condition that $\mathcal{X}_0(\omega) \subset (0, +\infty)$ for all $\omega \in \Omega$. Let $f : (0, +\infty) \to \mathbb{R}$ be a strictly convex (intended as convex downward) and monotonic function verifying $f(u) + f(1/u) \ge 0$ for all $u \in (0, +\infty)$ and $f(1) = 0$. The f-inequality index associated with $\mathcal{X}$ is the value given by

$$I_f(\mathcal{X}) = \frac{1}{2} \int_{[0,1]} E\left[ f\left(\frac{\inf \mathcal{X}_\alpha}{E(\sup \mathcal{X}_\alpha)}\right) + f\left(\frac{\sup \mathcal{X}_\alpha}{E(\inf \mathcal{X}_\alpha)}\right) \right] d\alpha,$$

whenever the above expectations exist.

Properties of the above measures can be found, for instance, in Lubiano (1999), Lubiano et al. (2000), Lubiano and Gil (2002).

Inequality measures of FRVs have also been expressed in terms of fuzzy-valued indices (see, for instance, Gil et al., 1998), which become useful when the main goal is not that of comparing either populations or fuzzy attributes but rather to give an idea of the magnitude of the inequality.

Example 9. Consider the variable ‘annual income’, $\mathcal{X}$, when the income is not exactly quantified but rather classified in accordance with the classification adopted in some credit assessment systems (intended as a kind of perception of the annual income situation). Following Cox (1994), this variable can be viewed as a variable whose (fuzzy) values are $\tilde{x}_1$ = ‘somewhat high’, $\tilde{x}_2$ = ‘moderately high’, $\tilde{x}_3$ = ‘high’ and $\tilde{x}_4$ = ‘very high’, where $\tilde{x}_1$, $\tilde{x}_2$, $\tilde{x}_3$ and $\tilde{x}_4$ are described (in US thousand euros) by means of the following S- and Π-curves:

$\tilde{x}_1 = 1 - S(100, 125)$ = ‘somewhat high’
$\tilde{x}_2 = \Pi(100, 125, 150)$ = ‘moderately high’
$\tilde{x}_3 = \Pi(125, 147.5, 170)$ = ‘high’
$\tilde{x}_4 = S(147.5, 170)$ = ‘very high’

and $\mathrm{supp}\, \tilde{x}_i \subset [90, 180]$, $i = 1, 2, 3, 4$, for all candidates for a credit in the considered system (see Figure 7).

[Figure 7: ‘Annual income’ values in accordance with some credit assessment systems; horizontal axis marks at 125, 147.5 and 170.]

Assume that a bank adopting the above system wishes to compare two different towns by means of the income inequality, and to this purpose we observe the values of $\mathcal{X}$ in the central offices of these two towns. Suppose there are 125 candidates for a credit in one of the offices (Ω1) during a certain period, 28 of them having a ‘somewhat high’ annual income, 43 ‘moderately high’, 31 ‘high’ and 23 ‘very high’, whereas there are 178 candidates for a credit in the other office (Ω2) during the same period, 63 of them having ‘somewhat high’, 79 ‘moderately high’, 27 ‘high’ and 9 ‘very high’. If we employ the f-inequality index with $f(x) = -\log x$, we obtain that

$$I_f(\mathcal{X}_{\Omega_1}) = 0.01224, \qquad I_f(\mathcal{X}_{\Omega_2}) = 0.009,$$

whence we can conclude that there is more inequality of annual income in the first town.
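The f-inequality index can be approximated numerically for a grouped sample. The Python sketch below is not a reproduction of Example 9: it assumes trapezoidal stand-ins `(a, b, c, d)` with positive support instead of the S- and Π-curves, replaces expectations by frequency-weighted sample means, and integrates over α with a midpoint rule. The default $f(x) = -\log x$ follows the example; the names are hypothetical.

```python
import math


def level(t, alpha):
    """alpha-level (inf, sup) of the trapezoid t = (a, b, c, d)."""
    a, b, c, d = t
    return (a + alpha * (b - a), d - alpha * (d - c))


def inequality_index(values, counts, f=lambda x: -math.log(x), n_grid=200):
    """Sample f-inequality index of fuzzy values observed counts[i] times.

    values: trapezoids with support in (0, +inf) (assumption).
    """
    n = sum(counts)
    total = 0.0
    for k in range(n_grid):
        alpha = (k + 0.5) / n_grid
        levels = [level(v, alpha) for v in values]
        e_inf = sum(c * lo for c, (lo, _) in zip(counts, levels)) / n
        e_sup = sum(c * hi for c, (_, hi) in zip(counts, levels)) / n
        total += sum(
            c * (f(lo / e_sup) + f(hi / e_inf))
            for c, (lo, hi) in zip(counts, levels)
        ) / n
    return 0.5 * total / n_grid
```

As a sanity check, for crisp data the index reduces to the mean of $f(x / \bar{x})$: two equally frequent crisp values 1 and 3 give $\tfrac{1}{2}\log(4/3)$ with $f(x) = -\log x$, and a sample of identical values gives 0.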

7 Estimation/testing on relevant parameters associated with fuzzy random variables

The inferential procedures for fuzzy data that were first developed in the context of FRVs relate to the estimation and hypothesis testing problems concerning the parameters of the associated parent populations. We illustrate below some results obtained in this framework.

7.1 Inferences on the (fuzzy) mean value of a FRV

Consider a FRV $\mathcal{X} : \Omega \to \mathcal{F}_c(\mathbb{R})$ associated with the probability space $(\Omega, \mathcal{A}, P)$ and such that $d_H(\{0\}, \mathcal{X}_0) \in L^1(\Omega, \mathcal{A}, P)$. Let $\mathcal{X}_1, \ldots, \mathcal{X}_n$ be FRVs which are identically distributed as $\mathcal{X}$. Then (see Lubiano et al., 1999),

Theorem 5. The sample fuzzy mean value

$$\overline{\mathcal{X}}_n = \frac{1}{n}\,(\mathcal{X}_1 + \ldots + \mathcal{X}_n)$$

is an unbiased and consistent estimator of the fuzzy parameter $E(\mathcal{X})$; that is, the (fuzzy) mean of the fuzzy-valued estimator $\overline{\mathcal{X}}_n$ over the space of all random samples equals $E(\mathcal{X})$, and $\overline{\mathcal{X}}_n$ converges almost surely to $E(\mathcal{X})$.

Example 10. Since the values in Example 7 correspond to a random sample of clients of each office, the (sample) mean values in Figure 6 are fuzzy estimates of the population mean values.

On the other hand, consider a FRV $\mathcal{X} : \Omega \to \mathcal{F}_c(\mathbb{R})$ associated with the probability space $(\Omega, \mathcal{A}, P)$ and such that $\big[d_H(\{0\}, \mathcal{X}_0)\big]^2 \in L^1(\Omega, \mathcal{A}, P)$. Let $\mathcal{X}_1, \ldots, \mathcal{X}_n$ be FRVs which are independent and identically distributed as $\mathcal{X}$, and let $\mathcal{X}_1^*, \ldots, \mathcal{X}_n^*$ be a bootstrap sample obtained from $\mathcal{X}_1, \ldots, \mathcal{X}_n$. Then (see Montenegro et al., 2004, González-Rodríguez et al., 2006), we have that

Theorem 6. In testing the null hypothesis $H_0 : E(\mathcal{X} \mid P) = U \in \mathcal{F}_c(\mathbb{R})$ at the nominal significance level $\alpha \in [0, 1]$, $H_0$ should be rejected whenever

$$\frac{\big[D_K(\overline{\mathcal{X}}_n, U)\big]^2}{\widehat{S}^2_{Kn}} > z_\alpha,$$

where $z_\alpha$ is the $100(1 - \alpha)$ fractile of the bootstrap distribution of

$$T_n^* = \frac{\big[D_K(\overline{\mathcal{X}}_n^*, \overline{\mathcal{X}}_n)\big]^2}{\widehat{S}^{*2}_{Kn}},$$

with

$$\overline{\mathcal{X}}_n = \frac{1}{n}\,(\mathcal{X}_1 + \ldots + \mathcal{X}_n), \qquad \widehat{S}^2_{Kn} = \sum_{i=1}^n \big[D_K(\mathcal{X}_i, \overline{\mathcal{X}}_n)\big]^2 / (n - 1),$$

$$\overline{\mathcal{X}}_n^* = \frac{1}{n}\,(\mathcal{X}_1^* + \ldots + \mathcal{X}_n^*), \qquad \widehat{S}^{*2}_{Kn} = \sum_{i=1}^n \big[D_K(\mathcal{X}_i^*, \overline{\mathcal{X}}_n^*)\big]^2 / (n - 1).$$

Example 11. Bank managers in Example 7 are interested in checking whether or not the ‘mean degree of aversion’ of each office is ‘medium/high’, where this value is assumed to be described by means of the fuzzy set in Figure 6. By applying bootstrap techniques to test such a hypothesis on the basis of the available sample data, we get the p-values for Ω1, Ω2 and Ω3, which are respectively given by .015, .013 and .474. This means that, at the significance level .05, the ‘mean degree of aversion’ in office Ω3 can be accepted to be ‘medium/high’, whereas this is not sustainable in offices Ω1 and Ω2.
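The bootstrap procedure of Theorem 6 can be sketched in Python as follows. Several choices here are assumptions made for illustration only: the metric is $D_W^\varphi$ with $W = \varphi =$ Lebesgue measure, fuzzy data are trapezoids `(a, b, c, d)` (so that Simpson's rule gives the exact squared distance), degenerate resamples with zero variance are redrawn, and the p-value is taken as the proportion of bootstrap statistics not smaller than the observed one. All names are hypothetical.

```python
import random


def level(t, alpha):
    a, b, c, d = t
    return (a + alpha * (b - a), d - alpha * (d - c))


def dw_sq(s, t):
    """Squared D_W^phi distance between trapezoids, W = phi = Lebesgue."""
    def g(alpha):
        s_lo, s_hi = level(s, alpha)
        t_lo, t_hi = level(t, alpha)
        a, b = s_hi - t_hi, s_lo - t_lo
        return (a * a + a * b + b * b) / 3.0
    return (g(0.0) + 4.0 * g(0.5) + g(1.0)) / 6.0  # exact: g is quadratic


def fuzzy_mean(sample):
    n = len(sample)
    return tuple(sum(t[i] for t in sample) / n for i in range(4))


def studentized(sample, target):
    """[D(mean, target)]^2 divided by the corrected sample variance."""
    mean = fuzzy_mean(sample)
    s2 = sum(dw_sq(x, mean) for x in sample) / (len(sample) - 1)
    return dw_sq(mean, target) / s2


def bootstrap_p_value(sample, null_value, n_boot=1000, seed=0):
    """Bootstrap p-value for H0: E(X) = null_value."""
    if len(set(sample)) == 1:
        raise ValueError("sample has zero variance")
    rng = random.Random(seed)
    t_obs = studentized(sample, null_value)
    mean = fuzzy_mean(sample)
    exceed = done = 0
    while done < n_boot:
        resample = [rng.choice(sample) for _ in sample]
        if len(set(resample)) == 1:  # zero variance: redraw (assumption)
            continue
        done += 1
        if studentized(resample, mean) >= t_obs:
            exceed += 1
    return exceed / n_boot
```

The null hypothesis would be rejected at level α when the returned p-value falls below α, which matches comparing the observed statistic with the bootstrap fractile $z_\alpha$.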

Hypothesis tests can also be developed for the two-sample and multi-sample cases (see Montenegro et al., 2001, and Gil et al., 2006, respectively). The multi-sample case will be explained later in the setting of linear models.

7.2

Inferences on the variance

Consider a FRV $\mathcal{X} : \Omega \to \mathcal{F}_c(\mathbb{R})$ associated with the probability space $(\Omega, \mathcal{A}, P)$ and such that $[d_H(\{0\}, \mathcal{X}_0)]^2 \in L^1(\Omega, \mathcal{A}, P)$. Let $\mathcal{X}_1, \ldots, \mathcal{X}_n$ be FRVs which are independent and identically distributed as $\mathcal{X}$. Then,

Theorem 7. The corrected sample variance
$$\widehat{S}^2_K = \frac{1}{n-1} \sum_{i=1}^n \left[D_K\left(\mathcal{X}_i, \overline{\mathcal{X}}_n\right)\right]^2$$
is an unbiased and consistent estimator of $Var(\mathcal{X})$; that is, the mean of the real-valued estimator $\widehat{S}^2_K$ over the space of all random samples equals $Var(\mathcal{X})$, and $\widehat{S}^2_K$ converges almost surely to $Var(\mathcal{X})$.

Example 12. On the basis of the results in Example 8 and the results above, unbiased estimates of the population variances of $\mathcal{X}$ in Ω1, Ω2 and Ω3 are, respectively, given by 800.853, 810.347, and 903.216.

On the other hand, consider a FRV $\mathcal{X} : \Omega \to \mathcal{F}_c(\mathbb{R})$ associated with the probability space $(\Omega, \mathcal{A}, P)$ and such that $[d_H(\{0\}, \mathcal{X}_0)]^4 \in L^1(\Omega, \mathcal{A}, P)$. Let $\mathcal{X}_1, \ldots, \mathcal{X}_n$ be FRVs which are independent and identically distributed as $\mathcal{X}$. Then, by the asymptotic results in Lubiano (1999) involving the $D_W^\varphi$ metric, we have that

Theorem 8. In testing the null hypothesis $H_0 : Var(\mathcal{X}) = \delta_0 \in \mathbb{R}$ at the nominal significance level $\alpha \in [0, 1]$, $H_0$ should be rejected whenever
$$\frac{\sqrt{n}\,\left| \dfrac{1}{n} \sum_{i=1}^n \left[D_W^\varphi\left(\mathcal{X}_i, \overline{\mathcal{X}}_n\right)\right]^2 - \delta_0 \right|}{\widehat{S^2_{(W,\varphi)}}_n} > z_\alpha,$$
where $z_\alpha$ is the $100(1-\alpha)$ fractile of an $N(0,1)$ distribution, and where
$$\widehat{S^2_{(W,\varphi)}}_n = \frac{1}{n} \sum_{i=1}^n \left( \left[D_W^\varphi\left(\mathcal{X}_i, \overline{\mathcal{X}}_n\right)\right]^2 - \frac{n-1}{n}\, \widehat{S}^2_{(W,\varphi)n} \right)^2.$$

Example 13. Assume that bank managers consider that an absolute variation in the ‘degree of aversion’ lower than 800 is acceptable. Then, to test whether the variation in Ω1, Ω2, and Ω3 satisfies such a condition, the last test has been applied. The corresponding p-values are given by .57, .53, and .22, whence the variability can be accepted in any office at the usual significance levels.
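As a small illustration of the corrected sample variance, the following Python sketch computes it for triangular fuzzy data. It rests on assumptions not made in the source: data are stored as rows (l, m, r), and `sq_dist` (a hypothetical helper) averages the squared endpoint differences of the α-cuts as a stand-in for the squared metric.

```python
import numpy as np

ALPHAS = np.linspace(0.0, 1.0, 101)  # grid of alpha-levels (assumed resolution)

def alpha_cuts(tri):
    """Endpoints (inf, sup) of the alpha-cuts of triangular numbers (l, m, r)."""
    tri = np.asarray(tri, dtype=float)
    l, m, r = tri[..., 0:1], tri[..., 1:2], tri[..., 2:3]
    return l + ALPHAS * (m - l), r - ALPHAS * (r - m)

def sq_dist(a, b):
    """Assumed squared distance: alpha-average of squared endpoint differences."""
    lo_a, hi_a = alpha_cuts(a)
    lo_b, hi_b = alpha_cuts(b)
    return 0.5 * np.mean((lo_a - lo_b) ** 2 + (hi_a - hi_b) ** 2, axis=-1)

def corrected_sample_variance(sample):
    """Sum of squared distances to the sample fuzzy mean, divided by n - 1."""
    sample = np.asarray(sample, dtype=float)
    mean = sample.mean(axis=0)  # componentwise mean for triangular numbers
    return float(sq_dist(sample, mean).sum() / (len(sample) - 1))
```

With this particular distance, data that are crisp (l = m = r) give back the classical corrected sample variance, which is a convenient sanity check.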

7.3

Inferences on the inequality

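Consider a FRV $\mathcal{X} : \Omega \to \mathcal{F}_c(\mathbb{R})$ associated with the probability space $(\Omega, \mathcal{A}, P)$ and such that $\mathcal{X}_0(\omega) \subset (0, +\infty)$ for all $\omega \in \Omega$, and $[d_H(\{0\}, \mathcal{X}_0)]^2 \in L^1(\Omega, \mathcal{A}, P)$. Let $\mathcal{X}_1, \ldots, \mathcal{X}_n$ be FRVs which are independent and identically distributed as $\mathcal{X}$. Then,

Theorem 9. The corrected sample hyperbolic inequality index $\widehat{I_h(\mathcal{X})}_n$, given by
$$\widehat{I_h(\mathcal{X})}_n = \frac{1}{2n(n-1)} \sum_{i=1}^n \sum_{j=1,\, j \neq i}^n \int_{[0,1]} \left[ \frac{\inf(\mathcal{X}_j)_\alpha}{\sup(\mathcal{X}_i)_\alpha} + \frac{\sup(\mathcal{X}_j)_\alpha}{\inf(\mathcal{X}_i)_\alpha} \right] d\alpha,$$
is an unbiased and consistent estimator of $I_h(\mathcal{X})$; that is, the mean of the real-valued estimator $\widehat{I_h(\mathcal{X})}_n$ over the space of all random samples equals $I_h(\mathcal{X})$, and the sample hyperbolic inequality index converges almost surely to $I_h(\mathcal{X})$.

On the other hand, consider a FRV $\mathcal{X} : \Omega \to \mathcal{F}_c(\mathbb{R})$ associated with the probability space $(\Omega, \mathcal{A}, P)$, such that $\mathcal{X}_0(\omega) \subset (0, +\infty)$ for all $\omega \in \Omega$, and satisfying an integrability condition guaranteeing the existence of $I_f(\mathcal{X})$. Let $\mathcal{X}_1, \ldots, \mathcal{X}_n$ be FRVs which are independent and identically distributed as $\mathcal{X}$. Then, by the asymptotic results in Lubiano (1999), we have that

Theorem 10. In testing the null hypothesis $H_0 : I_f(\mathcal{X}) = \iota_0 \in \mathbb{R}$ at the nominal significance level $\alpha \in [0, 1]$, $H_0$ should be rejected whenever
$$\frac{\sqrt{n}\,\left| \dfrac{1}{2n} \sum_{i=1}^n \int_{[0,1]} \left[ f\left( \dfrac{\inf(\mathcal{X}_i)_\alpha}{\sup(\overline{\mathcal{X}}_n)_\alpha} \right) + f\left( \dfrac{\sup(\mathcal{X}_i)_\alpha}{\inf(\overline{\mathcal{X}}_n)_\alpha} \right) \right] d\alpha - \iota_0 \right|}{\widehat{S^2_f}_n} > z_\alpha,$$
where $z_\alpha$ is the $100(1-\alpha)$ fractile of an $N(0,1)$ distribution, and where
$$\widehat{S^2_f}_n = \frac{1}{4n} \sum_{i=1}^n \left( \int_{[0,1]} \left[ f\left( \frac{\inf(\mathcal{X}_i)_\alpha}{\sup(\overline{\mathcal{X}}_n)_\alpha} \right) + f\left( \frac{\sup(\mathcal{X}_i)_\alpha}{\inf(\overline{\mathcal{X}}_n)_\alpha} \right) \right] d\alpha - \frac{1}{n} \sum_{j=1}^n \int_{[0,1]} \left[ f\left( \frac{\inf(\mathcal{X}_j)_\alpha}{\sup(\overline{\mathcal{X}}_n)_\alpha} \right) + f\left( \frac{\sup(\mathcal{X}_j)_\alpha}{\inf(\overline{\mathcal{X}}_n)_\alpha} \right) \right] d\alpha \right)^2.$$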

Example 14. A situation similar to that in Example 9 could be considered for the variable X in Example 3, and the test in Theorem 10 could be then applied.
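The double sum over pairs in the corrected sample hyperbolic inequality index can be evaluated numerically. The Python sketch below is only illustrative and assumes triangular fuzzy numbers (l, m, r) with positive support, and a grid approximation of the integral over α; the function names are hypothetical.

```python
import numpy as np

ALPHAS = np.linspace(0.0, 1.0, 101)  # grid of alpha-levels (assumed resolution)

def alpha_cuts(tri):
    """Endpoints (inf, sup) of the alpha-cuts of triangular numbers (l, m, r)."""
    tri = np.asarray(tri, dtype=float)
    l, m, r = tri[..., 0:1], tri[..., 1:2], tri[..., 2:3]
    return l + ALPHAS * (m - l), r - ALPHAS * (r - m)

def hyperbolic_inequality_estimate(sample):
    """Pairwise estimate: sum over i != j of the alpha-integral of
    inf(X_j)/sup(X_i) + sup(X_j)/inf(X_i), divided by 2n(n-1)."""
    lo, hi = alpha_cuts(sample)  # each of shape (n, number of alpha-levels)
    if np.any(lo <= 0):
        raise ValueError("all supports must lie in (0, +inf)")
    n = lo.shape[0]
    # ratios[i, j] approximates the integral for the ordered pair (i, j)
    ratios = np.mean(lo[None, :, :] / hi[:, None, :]
                     + hi[None, :, :] / lo[:, None, :], axis=-1)
    off_diagonal = ratios.sum() - np.trace(ratios)  # drop the i == j terms
    return float(off_diagonal / (2 * n * (n - 1)))
```

For crisp data every pair contributes $2x_j/x_i$, so the value computed from this expression equals 1 when all observations coincide and grows as the values spread out.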

8

Linear models

As we have remarked above, the considered space of fuzzy sets is not linear. This fact makes the classical theory of Linear models in Statistics rather complex or restrictive whenever the difference operator is needed. In spite of this, there are some studies in the literature involving this kind of models.

8.1

Analysis of variance with fuzzy data

Consider a factor with fixed effects which can act at k possible different levels, and a response fuzzy random variable determining k independent populations. In this way, let $(\Omega_1, \mathcal{A}_1, P_1), \ldots, (\Omega_k, \mathcal{A}_k, P_k)$ be probability spaces and $\mathcal{X}_1 : \Omega_1 \to \mathcal{F}_c(\mathbb{R}), \ldots, \mathcal{X}_k : \Omega_k \to \mathcal{F}_c(\mathbb{R})$ fuzzy random variables associated with these spaces. From the i-th population, $\mathcal{X}_i$, we can generate a simple random sample, $\mathcal{X}_{i1}, \ldots, \mathcal{X}_{in_i}$ ($i = 1, \ldots, k$). Consider the linear model $\mathcal{X}_{ij} = A_i + \epsilon_{ij}$, where $A_i \in \mathcal{F}_c(\mathbb{R})$ and the expected value of the fuzzy random errors $\epsilon_{ij}$ is $B \in \mathcal{F}_c(\mathbb{R})$ for all $i = 1, \ldots, k$ and $j = 1, \ldots, n_i$. The goal of this section is to verify whether the effect of the factor is the same at the k levels, which is equivalent to testing the equality of the fuzzy mean values of the k populations. The technique of analysis of variance (ratio of between-group variation to within-group variation) will be considered.

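Whatever $i \in \{1, \ldots, k\}$ may be, consider a simple random sample from $\mathcal{X}_i$, $\mathcal{X}_{i1}, \ldots, \mathcal{X}_{in_i}$. The sample fuzzy mean in the i-th group is given by
$$\overline{\mathcal{X}}_i = \frac{1}{n_i}\left(\mathcal{X}_{i1} + \ldots + \mathcal{X}_{in_i}\right),$$
and the total sample fuzzy mean is given by
$$\overline{\mathcal{X}} = \frac{1}{n}\left(\mathcal{X}_{11} + \ldots + \mathcal{X}_{1n_1} + \ldots + \mathcal{X}_{k1} + \ldots + \mathcal{X}_{kn_k}\right) = \frac{n_1}{n}\,\overline{\mathcal{X}}_1 + \ldots + \frac{n_k}{n}\,\overline{\mathcal{X}}_k$$
(with $n = n_1 + \ldots + n_k$).

To get bootstrap populations with a common fuzzy mean from the available sample information in this case, we can add to each sample the mean of the other samples. In other words, for each $i = 1, \ldots, k$ we will define a new fuzzy random variable $\mathcal{X}_i^*$ taking on values $\tilde{x}_{i1} + \overline{\mathcal{X}}_{-i}, \ldots, \tilde{x}_{iL_i} + \overline{\mathcal{X}}_{-i}$ with sample distribution vector $f_{n_i} = (f_{n_i 1}, \ldots, f_{n_i L_i})$, where $\overline{\mathcal{X}}_{-i}$ is the (fuzzy) sum of all the available sample means but the i-th one, that is,
$$\overline{\mathcal{X}}_{-i} = \overline{\mathcal{X}}_1 \oplus \ldots \oplus \overline{\mathcal{X}}_{i-1} \oplus \overline{\mathcal{X}}_{i+1} \oplus \ldots \oplus \overline{\mathcal{X}}_k.$$
Then, we will resample from these new populations, that is, for any $i \in \{1, \ldots, k\}$ we draw a large number of samples of $n_i$ independent observations $\mathcal{X}_{i1}^*, \ldots, \mathcal{X}_{in_i}^*$ from population $\mathcal{X}_i^*$. The test statistic will be given by
$$T^*_{(n_1,\ldots,n_k)} = \frac{\displaystyle\sum_{i=1}^k n_i \left[D_W^\varphi\left(\overline{\mathcal{X}_i^*}, \overline{\mathcal{X}^*}\right)\right]^2}{\displaystyle\sum_{i=1}^k \sum_{j=1}^{n_i} \left[D_W^\varphi\left(\mathcal{X}_{ij}^*, \overline{\mathcal{X}_i^*}\right)\right]^2},$$
where for all $i \in \{1, \ldots, k\}$
$$\overline{\mathcal{X}_i^*} = \frac{1}{n_i}\left(\mathcal{X}_{i1}^* + \ldots + \mathcal{X}_{in_i}^*\right), \qquad \overline{\mathcal{X}^*} = \frac{n_1}{n}\,\overline{\mathcal{X}_1^*} + \ldots + \frac{n_k}{n}\,\overline{\mathcal{X}_k^*}.$$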

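Under regularity conditions (see Gil et al., 2006) the following can be shown (with J denoting the number of groups and $\mathcal{X}_{ji}$ the i-th observation in the j-th group):

Theorem 11. To test at the nominal significance level $\alpha \in [0, 1]$ the null hypothesis
$$H_0 : \widetilde{E}(\mathcal{X}_1|P_1) = \ldots = \widetilde{E}(\mathcal{X}_J|P_J),$$
$H_0$ should be rejected if
$$T_{(n_1,\ldots,n_J)} = \frac{\displaystyle\sum_{j=1}^J n_j \left[D_W^\varphi\left(\overline{\mathcal{X}}_j, \overline{\mathcal{X}}\right)\right]^2}{\displaystyle\sum_{j=1}^J \sum_{i=1}^{n_j} \left[D_W^\varphi\left(\mathcal{X}_{ji}, \overline{\mathcal{X}}_j\right)\right]^2} > z_\alpha,$$
where $z_\alpha$ is the $100(1-\alpha)$ fractile of the distribution of $T^*_{(n_1,\ldots,n_J)}$ (this distribution can be approximated by the Monte Carlo method). Moreover, the probability of rejecting $H_0$ under the alternative hypothesis $H_a$ (which assumes that the fuzzy population means do not all coincide) converges to 1 as $n_j \to \infty$ for all $j = 1, \ldots, J$.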

In practice, and especially when samples have quite different sizes, the replacement of the statistic in Theorem 11 and its bootstrap version by
$$\frac{\displaystyle\sum_{j=1}^J n_j \left[D_W^\varphi\left(\overline{\mathcal{X}}_j, \overline{\mathcal{X}}\right)\right]^2}{\displaystyle\sum_{j=1}^J \frac{1}{n_j^2} \sum_{i=1}^{n_j} \left[D_W^\varphi\left(\mathcal{X}_{ji}, \overline{\mathcal{X}}_j\right)\right]^2}$$
usually leads to more stable results.

Example 15. Bank managers in Example 1 are interested in checking whether the ‘mean degree of aversion’ is the same in the three offices. By applying the method in this section, a p-value of .302 is obtained. Thus, at the usual significance levels, this hypothesis is accepted.
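The bootstrap ANOVA procedure of this section (shift each sample by the sum of the other group means, resample, and compare the observed between/within ratio with its bootstrap distribution) can be sketched in Python as follows. The sketch assumes triangular fuzzy numbers stored as rows (l, m, r) and uses a hypothetical `sq_dist` (α-average of squared endpoint differences) in place of $[D_W^\varphi]^2$; none of these names come from the source.

```python
import numpy as np

ALPHAS = np.linspace(0.0, 1.0, 101)  # grid of alpha-levels (assumed resolution)

def alpha_cuts(tri):
    """Endpoints (inf, sup) of the alpha-cuts of triangular numbers (l, m, r)."""
    tri = np.asarray(tri, dtype=float)
    l, m, r = tri[..., 0:1], tri[..., 1:2], tri[..., 2:3]
    return l + ALPHAS * (m - l), r - ALPHAS * (r - m)

def sq_dist(a, b):
    """Assumed squared distance: alpha-average of squared endpoint differences."""
    lo_a, hi_a = alpha_cuts(a)
    lo_b, hi_b = alpha_cuts(b)
    return 0.5 * np.mean((lo_a - lo_b) ** 2 + (hi_a - hi_b) ** 2, axis=-1)

def anova_statistic(groups):
    """Between-group variation divided by within-group variation."""
    means = [g.mean(axis=0) for g in groups]
    n = sum(len(g) for g in groups)
    total = sum(len(g) * m for g, m in zip(groups, means)) / n
    between = sum(len(g) * sq_dist(m, total) for g, m in zip(groups, means))
    within = sum(sq_dist(g, m).sum() for g, m in zip(groups, means))
    return between / within

def bootstrap_anova(groups, n_boot=1000, seed=0):
    """Bootstrap p-value for equality of the fuzzy means of the groups."""
    groups = [np.asarray(g, dtype=float) for g in groups]
    rng = np.random.default_rng(seed)
    t_obs = anova_statistic(groups)
    means = [g.mean(axis=0) for g in groups]
    # add to each sample the sum of the other group means, so that all
    # shifted samples share a common fuzzy mean
    shifted = []
    for i, g in enumerate(groups):
        others = sum(m for k, m in enumerate(means) if k != i)
        shifted.append(g + others)
    t_star = np.empty(n_boot)
    for b in range(n_boot):
        resampled = [s[rng.integers(0, len(s), size=len(s))] for s in shifted]
        t_star[b] = anova_statistic(resampled)
    return float(np.mean(t_star >= t_obs))
```

A small p-value suggests that the fuzzy means differ across groups; the more stable variant mentioned above would only change the denominator of `anova_statistic`.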

8.2

Regression analysis involving fuzzy data

The regression problem involving input and/or output fuzzy data has been widely studied in the literature. Many of these studies treat it as a fitting problem (see, for instance, Tanaka et al., 1982, Celminš, 1987, Diamond, 1988, Bandemer and Näther, 1992, Coppi and D’Urso, 2003, D’Urso, 2003, or Coppi et al., 2006), and only a few works assume an underlying stochastic model expressed through fuzzy random variables (see Körner and Näther, 1998, Näther, 2000, Wünsche and Näther, 2002, Krätschmer, 2004, and Näther, 2006).

From a Least Squares setting, there are two fundamental approaches with a fuzzy response variable. The first approach regards crisp independent variables, whereas the second approach refers to fuzzy independent variables.

Regarding the first approach, FRVs $Y_1, \ldots, Y_n$ with values in $\mathcal{F}_c(\mathbb{R})$ are observed at a set of m-dimensional crisp design points $\{x_1, \ldots, x_n\} \subset \mathbb{R}^m$. The usual choice of kernel K in linear models is
$$K(u, v, \alpha, \beta) = \begin{cases} 1 & \text{if } u = v \text{ and } \alpha = \beta, \\ 0 & \text{otherwise.} \end{cases}$$
The following result assures the existence and uniqueness of the least squares solution when a linear model is considered (see, for instance, Körner and Näther, 2002).

Theorem 12. There exists a unique $(\hat{B}_1, \ldots, \hat{B}_m) \in (\mathcal{F}_c(\mathbb{R}))^m$ so that
$$\sum_{i=1}^n D_K^2\left(Y_i, \sum_{j=1}^m x_{ij} \hat{B}_j\right) = \inf_{(B_1,\ldots,B_m) \in (\mathcal{F}_c(\mathbb{R}))^m} \sum_{i=1}^n D_K^2\left(Y_i, \sum_{j=1}^m x_{ij} B_j\right).$$

In Körner and Näther (2002) explicit solutions are obtained when LR fuzzy numbers are considered. A wider discussion on this issue can be found in Näther (2006), where it is proved that under a stochastic linear regression model
$$E(Y_i) = \sum_{j=1}^m x_{ij} \tilde{B}_j, \qquad i = 1, \ldots, n \text{ and } (\tilde{B}_1, \ldots, \tilde{B}_m) \in (\mathcal{F}_c(\mathbb{R}))^m,$$
it is not possible to find in general a BLUE of the fuzzy parameter if $m \geq 2$. It should be noted that in the simple regression case, $m = 1$, for instance, the sample fuzzy mean is a BLUE.

Concerning the second approach, consider $(X, Y)$ a bidimensional FRV associated with a probability space $(\Omega, \mathcal{A}, P)$ so that $Var(X) < \infty$ and $Var(Y) < \infty$, and let $Cov(X, Y)$ be defined as
$$E\left( \int_{\{-1,1\}^2 \times [0,1]^2} \left(s_X(u, \alpha) - s_{E(X)}(u, \alpha)\right) \left(s_Y(v, \beta) - s_{E(Y)}(v, \beta)\right) dK \right).$$

Wünsche and Näther (2002) proved that $E(Y|X)$ is the best approximation of Y by a measurable function of X w.r.t. $D_K$, which justifies calling $E(Y|X)$ the regression function of Y w.r.t. X. Regarding the problem of finding the best approximation of Y by a linear function of X, that is, searching for $\hat{a} \in \mathbb{R}$ and $\hat{B} \in \mathcal{F}_c(\mathbb{R})$ so that
$$E\left(D_W^2(Y, \hat{a}X + \hat{B})\right) = \inf_{a \in \mathbb{R},\, B \in \mathcal{F}_c(\mathbb{R})} E\left(D_W^2(Y, aX + B)\right),$$
partial solutions were obtained, provided that some of the Hukuhara differences involved exist. In this respect, it should be noted that the lack of linear structure of the space of fuzzy sets makes linear regression models difficult to handle.

Theorem 13. The minimization problem
$$E\left(D_W^2(Y, \hat{a}X + \hat{B})\right) = \inf_{a \in \mathbb{R},\, B \in \mathcal{F}_c(\mathbb{R})} E\left(D_W^2(Y, aX + B)\right)$$
can be easily solved in the following cases:
a) if $a > 0$ and $Cov(X, Y) \geq 0$, then $\hat{a} = Cov(X, Y)/Var(X)$, $\hat{B} = E(Y) -_H \hat{a}E(X)$ is a solution provided that $\hat{B}$ exists.
b) if $a < 0$ and $Cov(-X, Y) \leq 0$, then $\hat{a} = Cov(-X, Y)/Var(X)$, $\hat{B} = E(Y) -_H \hat{a}E(X)$ is a solution provided that $\hat{B}$ exists.

To avoid the difficulties of linear models, Flórez et al. (2006) consider a nonparametric regression model given by
$$\mathcal{Y} = g(\mathcal{X}) + \varepsilon_{\mathcal{X}},$$
where $g : \mathcal{F}_c(\mathbb{R}) \to \mathcal{F}_c(\mathbb{R})$ is any function and $\varepsilon_{\mathcal{X}}$ is a fuzzy random variable defined on $(\Omega, \mathcal{A}, P)$. It should be noted that the arithmetic between fuzzy sets makes meaningless a regression model for which $E(\varepsilon_{\mathcal{X}}) = 1_{\{0\}}$. By means of the support function it is possible to take advantage of some recent results concerning regression models developed for functional data in this context. More concretely, Flórez et al. (2006) propose a kernel estimator of $m = g + B$ on the basis of the random sample $\{(\mathcal{X}_i, \mathcal{Y}_i)\}_{i=1}^n$ obtained from $(\mathcal{X}, \mathcal{Y})$, given by
$$\hat{m}(\tilde{x}) = \frac{\sum_{i=1}^n \mathcal{Y}_i\, K\left(h^{-1} D_K(\tilde{x}, \mathcal{X}_i)\right)}{\sum_{i=1}^n K\left(h^{-1} D_K(\tilde{x}, \mathcal{X}_i)\right)},$$
where h is the bandwidth, and the kernel K satisfies the usual regularity conditions (see, for instance, Ferraty and Vieu, 2004). It should be noted that the choice of the kernel function is not very relevant, whereas the choice of the bandwidth is decisive.
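The kernel estimator $\hat{m}(\tilde{x})$ is a weighted fuzzy average of the observed responses, with weights depending on the distance between $\tilde{x}$ and each observed input. A minimal Python sketch follows, under assumptions not taken from the source: triangular fuzzy numbers stored as (l, m, r), a Gaussian smoothing kernel, and a hypothetical `sq_dist` (α-average of squared endpoint differences) standing in for the squared metric.

```python
import numpy as np

ALPHAS = np.linspace(0.0, 1.0, 101)  # grid of alpha-levels (assumed resolution)

def alpha_cuts(tri):
    """Endpoints (inf, sup) of the alpha-cuts of triangular numbers (l, m, r)."""
    tri = np.asarray(tri, dtype=float)
    l, m, r = tri[..., 0:1], tri[..., 1:2], tri[..., 2:3]
    return l + ALPHAS * (m - l), r - ALPHAS * (r - m)

def sq_dist(a, b):
    """Assumed squared distance: alpha-average of squared endpoint differences."""
    lo_a, hi_a = alpha_cuts(a)
    lo_b, hi_b = alpha_cuts(b)
    return 0.5 * np.mean((lo_a - lo_b) ** 2 + (hi_a - hi_b) ** 2, axis=-1)

def kernel_regression(x_new, xs, ys, h):
    """Kernel-weighted fuzzy average of the responses ys at the fuzzy input x_new."""
    xs = np.asarray(xs, dtype=float)
    ys = np.asarray(ys, dtype=float)
    d = np.sqrt(sq_dist(xs, x_new))      # distances from x_new to each X_i
    w = np.exp(-0.5 * (d / h) ** 2)      # Gaussian kernel (an assumed choice)
    # non-negative weights, so the weighted average of triangular numbers
    # is again triangular and can be computed componentwise
    return (w[:, None] * ys).sum(axis=0) / w.sum()
```

A very small bandwidth `h` makes the estimate follow the nearest observation (and can drive all weights to zero far from the data), while a large one approaches the overall sample fuzzy mean, which illustrates why the bandwidth matters more than the kernel shape.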

9

Concluding remarks

It should be remarked that the notion of fuzzy random variable is not the only one available to model and deal with fuzzy data. Among the alternative notions we can refer to the concept of fuzzy information system (Tanaka et al., 1979), which has also been considered to model a mechanism generating fuzzy data corresponding to fuzzy observations/reports of existing real-valued data, and for which some statistics have been developed (see, for instance, Gil et al., 1988ab, 1989).

Reasons supporting the interest of fuzzy random variables are mainly the following:
• they model essentially fuzzy-valued classification processes, mechanisms, and so on, which are usually associated with perceptions, value judgments and opinions, leading to imprecise assessments;
• they are well formalized in the probabilistic setting, so that concepts and properties used in analyzing real-valued data (like, for instance, p-values, power functions, etc. of hypothesis tests) make sense in this wider context, and several statistical techniques inspired by those for real-valued data can be extended/adapted to the fuzzy-valued case;
• the use of the scale of measurement $\mathcal{F}_c(\mathbb{R})$ allows one to consider distances between data, whence many classical concepts and procedures (especially those based on the squared error) can be extended and applied for statistical purposes.

References

[1] Bertoluzza, C., Corral, N. and Salas, A. (1995). On a new class of distances between fuzzy numbers. Mathware & Soft Computing 2, 71–84.
[2] Bezdek, J.C. (1993). Fuzzy models - What are they, and why? IEEE Transactions on Fuzzy Systems 1, 1–6.
[3] Coppi, R. (2002). A theoretical framework for data mining: the “Informational Paradigm”. Computational Statistics & Data Analysis 38, 501–515.
[4] Coppi, R. (2003). The fuzzy approach to multivariate statistical analysis. Technical report, Department of Statistics, Probability and Applied Statistics 11, University of Rome “La Sapienza”.
[5] Coppi, R. and D’Urso, P. (2003). Regression analysis with fuzzy informational paradigm: a least-squares approach using membership function information. International Journal of Pure and Applied Mathematics 8 (3), 279–306.
[6] Coppi, R., D’Urso, P., Giordani, P. and Santoro, A. (2006). Least squares estimation of a linear regression model with LR fuzzy response. Computational Statistics & Data Analysis (to appear).
[7] Coppi, R., D’Urso, P. and Giordani, P. (2007a). Two-way component models for fuzzy data. Psychometrika (to appear).
[8] Coppi, R., Gil, M.A. and Kiers, H.A.L. (2007b). The fuzzy approach to statistical analysis. Computational Statistics & Data Analysis (to appear).
[9] Cox, E. (1994). The Fuzzy Systems Handbook. Academic Press, Cambridge.
[10] Diamond, P. (1988). Fuzzy least squares. Information Sciences 46, 141–157.
[11] Diamond, P. and Kloeden, P. (1994). Metric Spaces of Fuzzy Sets: Theory and Applications. World Scientific, Singapore.
[12] Di Lascio, L., Gisolfi, A., Albunia, A., Galarti, G. and Meschi, F. (2002). A fuzzy-based methodology for the analysis of diabetic neuropathy. Fuzzy Sets and Systems 129, 203–228.
[13] D’Urso, P. (2003). Linear regression analysis for fuzzy/crisp input and fuzzy/crisp output data. Computational Statistics & Data Analysis 42, 47–72.
[14] Ferraty, F. and Vieu, P. (2004). Nonparametric models for functional data, with application in regression, time-series prediction and curve discrimination. J. Nonparametr. Stat. 16, 111–125.
[15] Flórez, J.L., Colubi, A., González-Rodríguez, G. and Gil, M.A. (2006). Nonparametric regression with fuzzy random variables through the support function. Proc. IPMU 2006.
[16] García, D., Lubiano, M.A. and Alonso, M.C. (2001). Estimating the expected value of fuzzy random variables in the stratified random sampling from finite populations. Inform. Sci. 138, 165–184.
[17] Gil, M.A., Corral, N. and Gil, P. (1988a). The minimum inaccuracy estimates in χ² tests for goodness of fit with fuzzy observations. J. Statist. Plan. Infer. 19, 95–115.
[18] Gil, M.A. and Casals, M.R. (1988b). An operative extension of the likelihood ratio test from fuzzy data. Stat. Pap. 29, 191–203.
[19] Gil, M.A., Corral, N. and Casals, M.R. (1989). The likelihood ratio test for goodness of fit with fuzzy experimental observations. IEEE Trans. Syst., Man, Cyber. 19, 771–779.
[20] Gil, M.A., González-Rodríguez, G., Colubi, A. and Montenegro, M. (2006). Testing Linear Independence in Linear Models with Interval-Valued Data. Comp. Stat. Data Anal. (to appear).
[21] González-Rodríguez, G., Montenegro, M., Colubi, A. and Gil, M.A. (2006). Bootstrap techniques and fuzzy random variables: Synergy in hypothesis testing with fuzzy data. Fuzzy Sets and Systems (to appear).
[22] Klement, E.P., Puri, M.L. and Ralescu, D.A. (1986). Limit theorems for fuzzy random variables. Proc. R. Soc. Lond. A 407, 171–182.
[23] Krätschmer, V. (2004). Least squares estimation in linear regression models with vague concepts. In: López-Díaz, M., Gil, M.A., Grzegorzewski, P., Hryniewicz, O., Lawry, J. (Eds.), Soft Methodology and Random Information Systems. Springer, Heidelberg, pp. 407–414.
[24] Körner, R. and Näther, W. (2002). On the variance of random fuzzy variables. In: Bertoluzza, C., Gil, M.A., Ralescu, D.A. (Eds.), Statistical Modeling, Analysis and Management of Fuzzy Data. Physica-Verlag, Heidelberg, pp. 22–39.
[25] Lubiano, M.A. and Gil, M.A. (1999). Estimating the expected value of fuzzy random variables in random samplings from finite populations. Statist. Papers 40, 277–295.
[26] Lubiano, M.A., Gil, M.A. and López-Díaz, M. (1999). On the Rao-Blackwell theorem for fuzzy random variables. Kybernetika 35, 167–175.
[27] Lubiano, M.A., Gil, M.A., López-Díaz, M. and López-García, M.T. (2000). The $\vec{\lambda}$-mean squared dispersion associated with a fuzzy random variable. Fuzzy Sets and Systems 111, 307–317.
[28] Näther, W. (2006). Regression with fuzzy random data. Computational Statistics & Data Analysis (to appear).
[29] Montenegro, M., Casals, M.R., Lubiano, M.A. and Gil, M.A. (2001). Two-sample hypothesis tests of means of a fuzzy random variable. Inform. Sci. 133, 89–100.
[30] Montenegro, M., Casals, M.R. and Gil, M.A. (2000). Asymptotic comparison of two fuzzy expected values. Proc. JCIS 2000 - 7th FT&T Conference, 150–153.
[31] Montenegro, M., González-Rodríguez, G., Gil, M.A., Colubi, A. and Casals, M.R. (2004a). Introduction to ANOVA with fuzzy random variables. In: López-Díaz, M., Gil, M.A., Grzegorzewski, P., Hryniewicz, O., Lawry, J. (Eds.), Soft Methodology and Random Information Systems. Springer-Verlag, Berlin, pp. 487–494.
[32] Montenegro, M., Colubi, A., Casals, M.R. and Gil, M.A. (2004b). Asymptotic and Bootstrap techniques for testing the expected value of a fuzzy random variable. Metrika 59, 31–49.
[33] Pedrycz, W., Bezdek, J.C., Hathaway, R.J. and Rogers, G.W. (1998). Two nonparametric models for fusing heterogeneous fuzzy data. IEEE Transactions on Fuzzy Systems 6 (3), 411–425.
[34] Puri, M.L. and Ralescu, D.A. (1986). Fuzzy random variables. J. Math. Anal. Appl. 114, 409–422.
[35] Tanaka, H., Okuda, T. and Asai, K. (1979). Fuzzy information and decision in statistical model. In: Gupta, M.M., Ragade, R.K., Yager, R.R. (Eds.), Advances in Fuzzy Sets Theory and Applications. North-Holland, pp. 303–320.
[36] Tanaka, H., Uejima, S. and Asai, K. (1982). Linear regression analysis with fuzzy model. IEEE Transactions on Systems, Man and Cybernetics 12, 903–907.
[37] Wünsche, A. and Näther, W. (2002). Least-squares fuzzy regression with fuzzy random variables. Fuzzy Sets and Systems 130, 43–50.
[38] Zadeh, L.A. (1975). The concept of a linguistic variable and its application to approximate reasoning, Part 1. Inform. Sci. 8, 199–249; Part 2. Inform. Sci. 8, 301–353; Part 3. Inform. Sci. 9, 43–80.

Attribute: Diabetes-age (Di-age)
Labels and fuzzy numbers: Choosing labels for this attribute considers juvenile diabetes, whose diagnosis generally takes place at about 12. Then, the following labels are singled out:
• Very early: if diagnosis occurs before 3 years old, i.e. Di-age ∈ [0, 3). This label is represented by the fuzzy number [0.8, 1.0, 1.0].
• Early: if Di-age ∈ [3, 10] ⇒ the fuzzy number is [0.6, 0.7, 0.9].
• Average: if Di-age ∈ (10, 25] ⇒ the fuzzy number is [0.1, 0.5, 0.7].
• Late: if Di-age ∈ (25, ∞) ⇒ the fuzzy number is [0.0, 0.0, 0.2].

Attribute: HbA1c. This parameter indicates the glycosylated haemoglobin and measures the patient’s hemoglicidic state, in terms of the percentage of haemoglobin affected by glycosylation. Normal values are lower than 6%.
Labels and fuzzy numbers: For healthy people this index takes value zero; however, values below 6% are considered acceptable.
• Serious pathology: if the value is greater than 15%, i.e. HbA1c ∈ [15, ∞). The corresponding fuzzy number is [0.8, 1.0, 1.0].
• High value: if HbA1c ∈ [9, 15] ⇒ the fuzzy number is [0.5, 0.75, 0.9].
• Altered value: if HbA1c ∈ (6, 9) ⇒ the fuzzy number is [0.2, 0.4, 0.6].
• Normal: if HbA1c ∈ [0, 6] ⇒ the fuzzy number is [0.0, 0.0, 0.3].

Attribute: AER. This parameter measures the amount of albumin in urine. High values are symptoms of nephropathy and signal that diabetes is starting to damage the organism.
Labels and fuzzy numbers:
• Serious pathology: if the value is greater than 30, i.e. AER ∈ [30, ∞) ⇒ the fuzzy number is [0.5, 0.7, 1.0].
• High value: if AER ∈ (20, 30] ⇒ the fuzzy number is [0.4, 0.6, 1.0].
• Altered value: if AER ∈ (15, 20] ⇒ the fuzzy number is [0.2, 0.5, 0.6].
• Normal: if AER ∈ [0, 15] ⇒ the fuzzy number is [0.0, 0.0, 0.3].

Attribute: Insulin. It denotes the amount of insulin administered to the patient.
Labels and fuzzy numbers: The values are assigned considering that the daily quantity is usually 1 U/kg.
• High value: if Insulin ∈ (1.1, ∞) ⇒ the fuzzy number is [0.9, 1.0, 1.0].
• Normal value: if Insulin ∈ [0.9, 1.1] ⇒ the fuzzy number is [0.5, 0.7, 1.0].
• Low value: if Insulin ∈ [0, 0.9) ⇒ the fuzzy number is [0.2, 0.5, 0.6].

Attribute: GFR. It measures the amount of plasma filtered by the renal glomerule per unit of time. This parameter gives further information about renal functionality. Normal values range from 90 to 130; greater values may denote early unbalance.
Labels and fuzzy numbers: Regular values lie between 90 and 130; lower ones are viewed as pathological and upper ones deserve further investigation.
• Serious pathology: if GFR ∈ [0, 40) ⇒ the fuzzy number is [0.7, 1.0, 1.0].
• High: if GFR ∈ [40, 60] ⇒ the fuzzy number is [0.4, 0.6, 0.8].
• Borderline: if GFR ∈ (60, 90] ∪ (130, ∞) ⇒ the fuzzy number is [0.2, 0.4, 0.5].
• Normal: if GFR ∈ (90, 130] ⇒ the fuzzy number is [0.0, 0.0, 0.4].

Attribute: Fluor. It denotes fluorangiography, namely the eyeball test devoted to ascertaining the state of vessels.
Labels and fuzzy numbers: As regards fluorangiography there are standard values representing pathological categories, thus we just attach a label to each category.
• Serious pathology: if Fluor = 3 ⇒ the fuzzy number is [0.7, 1.0, 1.0].
• Pathological: if Fluor = 2 ⇒ the fuzzy number is [0.4, 0.6, 0.8].
• Normal: if Fluor = 1 ⇒ the fuzzy number is [0.2, 0.4, 0.5].

Attribute: BdB. It stands for beat to beat and denotes the degree of cardiac enervation. Possible values are: 1 = normal, 2 = borderline, 3 = pathological.
Labels and fuzzy numbers:
• Pathological: if BdB = 3 ⇒ the fuzzy number is [0.7, 1.0, 1.0].
• Borderline: if BdB = 2 ⇒ the fuzzy number is [0.3, 0.5, 0.7].
• Normal: if BdB = 1 ⇒ the fuzzy number is [0.2, 0.4, 0.5].

Attribute: Fundus. It indicates the state of the ocular fundus.
Labels and fuzzy numbers:
• Serious pathology: if Fundus = 3 ⇒ the fuzzy number is [0.7, 1.0, 1.0].
• Pathological: if Fundus = 2 ⇒ the fuzzy number is [0.3, 0.5, 0.7].
• Normal: if Fundus = 1 ⇒ the fuzzy number is [0.1, 0.3, 0.5].

Table 1: Labels and fuzzy representations of clinical and diagnosis data (Di Lascio et al., 2002)

Table 2. Sample data on the classification of clients of three Main Offices of a Savings Bank

       x1   x2   x3   x4   x5
Ω1      5    8   12    5    3
Ω2      6   11   15    8    4
Ω3      5    9   12   13    6
