Statistics with Imprecise Data

María Ángeles Gil a and Olgierd Hryniewicz b

a University of Oviedo, Faculty of Sciences, C/ Calvo Sotelo, s/n, 33007 Oviedo, Spain, e-mail: [email protected]
b Systems Research Institute, Newelska 6, 01-447 Warsaw, Poland, e-mail: [email protected]
Article Outline

Glossary
I. Definition of the Subject and Its Importance
II. Introduction
III. Mathematical modeling of imprecise data
IV. Fuzzy random variables
V. Statistical analysis of random fuzzy perceptions
VI.1 Fuzzy estimators and fuzzy confidence intervals
VI.2 Fuzzy statistical tests
VII. Statistical analysis of random fuzzy variables
VIII. Future directions
Bibliography
Glossary

Fuzzy estimators
Estimators of ‘parameters’ of probability distributions, or other characteristics, of random variables/fuzzy random variables (e.g. the expected value/the fuzzy expected value) when statistical data are imprecise and are described by means of fuzzy sets.

Fuzzy random variable
Random element whose observed values are described by fuzzy sets.

Fuzzy set
Generalization of the classical notion of a set. In contrast to the case of a classical set, each element x of a fuzzy set may belong to it to a degree described by the so-called membership function µ(x). Thus, the fuzzy set may be defined as a set of ordered pairs (x, µ(x)), where x belongs to a set X called the universe of discourse or
referential. Alternatively, the fuzzy set can be identified with its membership function (in the same way that a classical set can be identified with its indicator function).

Fuzzy statistical tests
Statistical tests used for the verification of hypotheses about the values of ‘parameters’ of probability distributions, or other characteristics, of random variables/fuzzy random variables when statistical data are imprecise and are described by fuzzy sets.

Fuzzy statistics
Generalization of traditional statistics that allows imprecise data described in terms of fuzzy sets to be handled.

Imprecise data
Data that cannot be described by either real numbers or vectors with real-valued components.

Random sets
Random elements whose observed values are sets (e.g. intervals, subsets of a plane).
1 Definition of the Subject and Its Importance

Traditional statistics usually deals with precisely defined data. These data may come from different sources. Usually they are collected from observations of random experiments, i.e. experiments whose outcome cannot be predicted in advance with certainty. When knowledge about the nature of these random events is available, we can use the methods of mathematical statistics to draw conclusions about the source of the available data. The uncertainty intrinsic to random outcomes/events is properly described by the formalism of the mathematical theory of probability, and it is generally attributed to future observations of these outcomes/events.

Traditionally, statistical experimental data are described by real numbers or by vectors whose components are real numbers. These numbers are either observed directly as results of measurements (e.g. the height and weight of a person) or represent observed counts of certain categories representing labelled events (e.g. the gender of that person). However, in real-life applications the results of measurements are never precise: the precision of every measurement is limited by the precision of the measuring device. In the majority of practical applications this lack of precision may be neglected, and traditional statistical methods can be used to analyze the available data. If the measurement error cannot be neglected, specialists in metrology recommend the statistical analysis of interval data. However, when statistical data are reported by human beings, we need more sophisticated methods to describe their lack of precision. There is therefore a practical need to generalize traditional statistical methods so that they can handle imprecise data, e.g. data described by statements expressed in plain language (the so-called “linguistic data”).
2 Introduction

There is common agreement that uncertainty characterized by randomness should be described by means of the theory of probability. Thus, mathematical statistics is a proper tool for dealing with data generated by random experiments and described by precisely defined numbers. However, there also exist other types of uncertainty, related to vagueness, imprecision, the existence of only partial information about the experimental outcomes/events of interest, etc. In contrast to randomness, uncertainties of these types are attributed rather to current perceptions/observations. It has to be noted that the mathematical modelling of all these types of uncertainty, which differ from simple randomness, is still the subject of controversy. A good overview of the related problems can be found in the paper by P. Walley [35]. In this article we confine ourselves to the case when randomness is observed together with vagueness, understood as the lack of precision.

Specialists in measurement theory recognize different types of uncertainty. For instance, the ISO/IEC Guide [18] recommends distinguishing between uncertainty related to pure randomness and uncertainty of another nature, such as a lack of precision. However, the lack of precision of statistical data is usually ignored in the statistical analysis of measurements. Consider, for example, the analysis of measurement results coming from a digital meter. Such results are always rounded so that they can be represented by a certain number of digits. When we observe a displayed result of a measurement we never know the actual value of the measured quantity. More importantly, and this is often overlooked by statisticians, we may not increase our knowledge about that value by averaging the results of repeated measurements, as is frequently recommended.
Consider, for example, a series of measurements showing exactly the same result. Given such results we are, in principle, unable to distinguish between two fundamentally different cases: the measured quantity is constant and its value belongs to the range of rounding, or it varies from measurement to measurement with values belonging to that range.

The situation described in the previous paragraph is relatively simple, as the range of possible actual values corresponding to the observed result of a measurement is usually precisely defined. There exist, however, situations when either this range is not precisely known or the values in the range are not equally compatible with the available data/information/events. We face such cases when results of measurements or classifications are evaluated by humans (e.g., by evaluating indications of an analog meter or classifying individuals in accordance with their height). The result of a measurement is even more imprecise when it has been obtained without using any meter (e.g., when we visually evaluate the distance between two points); in such situations statistical data may consist of imprecise statements like ‘around 5 meters’, ‘more or less between 5 and 10 seconds’, etc. The classification of a person in accordance with his/her height often leads to imprecise assessments, like ‘very short’, ‘rather tall’, etc. A similar situation occurs when we deal with retrospective data recalled by human beings; for example, in the reliability analysis of field lifetime data we may face situations when failure times are reported imprecisely using statements like ‘the failure occurred about one month ago’, etc. In all these cases statistical data consist of imprecise perceptions of actual real values.

In the previous paragraphs we considered situations when actual values of the measured quantities exist, but are imprecisely perceived. There exist, however, situations when we have to analyze statistical data that represent imprecisely defined concepts. Take for example the color of human hair described using categories such as ‘blond’ or ‘dark blond’. It is obvious that the border between these two categories is vague; one can try to establish a precise border between them in terms of the results of precise measurements of the spectrum of reflected light, but such attempts seem to be practically senseless. We face a similar situation when classifying a client of a bank office in accordance with his/her degree of aversion to investment, which leads to imprecise assessments like ‘very low’, ‘moderate’, etc. In both these cases we could try to collect precise statistical data, e.g., by either asking a respondent to a questionnaire to indicate exactly one choice or coding it numerically. However, it seems more prudent, natural and informative to expect imprecise answers to questions pertaining to vague notions. If we do so, we face imprecise statistical data for further analysis. In all these cases the actual values of the statistical data are themselves imprecise.

There also exists another source of imprecision in statistical data. For example, in reliability lifetime tests we perform tests in more severe ‘over-stress’ conditions, and then we try to recalculate the test results to conditions which are considered ‘normal’. For this recalculation we can utilize some partial knowledge about possible values of the stress-dependent recalculation coefficients. In such a case we have originally precise lifetime data, but after recalculation these data become imprecise.
In all the cases considered, the lack of precision can be appropriately described by using fuzzy sets, introduced by Lotfi A. Zadeh. In Section 3 we recall some basic definitions related to fuzzy sets, and we present the fuzzy sets methodology as a useful tool for the description of imprecise data. When fuzzy lack of precision is mixed with randomness, either in the sense that the available fuzzy data are supposed to come from the perception of real- or vector-valued data generated by a random mechanism, or in the sense that they are directly generated by a random mechanism, a convenient tool is the notion of a fuzzy random variable. This notion is introduced in Section 4. The interpretation of the fuzzy random variable depends upon the type of observed fuzzy data and events. In this article we distinguish between the two types of data described in this section. We start with the description of statistical methods which are useful for the analysis of fuzzy perceptions of existing precise values. Then, we present statistical methods which are useful in the analysis of intrinsically fuzzy-valued data.
3 Mathematical modeling of imprecise data

There exist competing methods for the description of vagueness. For example, some statisticians claim that the theory of subjective probability is sufficient for the description of all types of uncertainty. However, many other researchers have shown
examples of situations which the classical theory of probability is not sufficient to model. Therefore, other formalisms have been proposed for the description of vagueness/imprecision. One of those formalisms, namely the theory of fuzzy sets proposed by Lotfi A. Zadeh [36], has been slowly but widely accepted as a good methodology for the description of imprecise data, from both the practical and the theoretical (see, e.g., the paper by Terán [32]) points of view.

The basic concept of the theory of fuzzy sets is the universe of discourse or referential X, which may be understood as the set of all possible (feasible) elements that are relevant for the description of a certain concept (quantity). Mathematically speaking, a fuzzy subset A of a set X ≠ ∅ (or a fuzzy set, for short) is a map A : X → [0, 1], where A(x) can be interpreted as the degree of compatibility of x with the ill-defined property characterizing A, or the degree of truth of the assertion “x is A”, or the degree of membership of x in A. Equivalently, but more intuitively for some purposes, a fuzzy set A of X can be defined as a set of ordered pairs {(x, µA(x))}, where x ∈ X and µA : X → [0, 1] is the so-called membership function of A. In other words, a fuzzy set can be identified with its membership function, in the same way that a classical set can be identified with its indicator function. In what follows, we will use A and µA interchangeably to denote a fuzzy subset.

Unfortunately, there is no single generally accepted methodology for the construction of membership functions. The majority of researchers assume that the membership function µA(x) is a purely subjective function provided by a person who describes his/her perception of a certain phenomenon or quantity.
Some authors provide practical methods for the construction of the membership function when it is interpreted, in terms of the theory of possibility, as a possibility distribution (see [8] for more information). Other authors, e.g. Bandemer and Näther [1] or Viertl [34], present methods which may be used to construct membership functions in a more objective way from measurements of physical quantities. We do not enter into a discussion of this point here.

In the analysis of imprecise data we are usually interested in describing the phenomena of interest by numbers. For this purpose we can use the concept of a fuzzy number, defined as follows (see [5]):

Definition 1. The fuzzy subset A of the space of real numbers R, with the membership function µA : R → [0, 1], is a fuzzy number if and only if
(a) A is normal, i.e. there exists at least one element x0 ∈ R such that µA(x0) = 1;
(b) A is fuzzy convex, i.e. µA(λx1 + (1 − λ)x2) ≥ µA(x1) ∧ µA(x2) for all x1, x2 ∈ R and λ ∈ [0, 1];
(c) µA is upper semicontinuous;
(d) the support set, supp A = {x ∈ R : µA(x) > 0}, is bounded (that is, its closure cl(supp A) is compact).

It is easily seen that if A is a fuzzy number then its membership function can be expressed as follows:
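\[
\mu_A(x) =
\begin{cases}
0 & \text{for } x < a_1, \\
r_l^A(x) & \text{for } a_1 \le x < a_2, \\
1 & \text{for } a_2 \le x \le a_3, \\
r_u^A(x) & \text{for } a_3 < x \le a_4, \\
0 & \text{for } x > a_4,
\end{cases}
\tag{1}
\]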
where a1, a2, a3, a4 ∈ R, a1 ≤ a2 ≤ a3 ≤ a4, rlA : [a1, a2] → [0, 1] is a nondecreasing upper semicontinuous function, and ruA : [a3, a4] → [0, 1] is a nonincreasing upper semicontinuous function. The functions rlA and ruA are sometimes called the left and the right ‘arms’ (or ‘sides’) of the fuzzy number, respectively.

A useful notion for dealing with a fuzzy number is the so-called α-level set, also known as the α-cut. For α ∈ (0, 1] the α-level set of the fuzzy number A is the ordinary (non-fuzzy) set defined as

\[
A_\alpha = \{x \in \mathbb{R} : \mu_A(x) \ge \alpha\}
\tag{2}
\]
and the 0-level set is usually taken to be A0 = cl(supp A). The family {Aα : α ∈ [0, 1]} is a set representation of the fuzzy number A. According to the resolution identity proposed by Zadeh, we can represent the membership function as

\[
\mu_A(x) = \sup\{\alpha \cdot 1_{A_\alpha}(x) : \alpha \in (0, 1]\},
\tag{3}
\]

where 1Aα(x) denotes the indicator function of Aα. On the basis of the notion of α-level set, a fuzzy number A can be viewed as a fuzzy subset of R whose α-level sets are nonempty compact convex subsets of R, that is, nonempty compact intervals. Hence, for each α ∈ [0, 1] we have Aα = [AL(α), AU(α)], where

\[
A_L(\alpha) = \inf\{x \in \mathbb{R} : \mu_A(x) \ge \alpha\},
\qquad
A_U(\alpha) = \sup\{x \in \mathbb{R} : \mu_A(x) \ge \alpha\}.
\tag{4}
\]

If the sides of the fuzzy number A are strictly monotonic functions, then from (1) one can easily see that AL(α) and AU(α) are the inverse functions of rlA and ruA, respectively.

In the statistical analysis of random data we use functions (and, in particular, operations) of the observed random samples. These functions, called statistics, can also be defined for fuzzy random data. Their membership functions can be derived by using Zadeh’s extension principle, which has the following formulation [37]. Let X be a Cartesian product of universes, X = X1 × . . . × Xr, and let A1, . . . , Ar be r fuzzy subsets of X1, . . . , Xr, respectively. Let f be a mapping from X = X1 × . . . × Xr to a universe Y such that y = f(x1, . . . , xr). The extension principle allows us to induce from the r fuzzy sets Ai a fuzzy set B on Y through f, B = f(A1, . . . , Ar), such that

\[
\mu_B(y) =
\begin{cases}
\displaystyle \sup_{(x_1,\dots,x_r)\,:\,y = f(x_1,\dots,x_r)} \min\{\mu_{A_1}(x_1), \dots, \mu_{A_r}(x_r)\} & \text{if } f^{-1}(y) \ne \emptyset, \\
0 & \text{if } f^{-1}(y) = \emptyset.
\end{cases}
\tag{5}
\]
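The sup-min rule in (5) can be tried out directly when the universes are finite. The following is only a minimal sketch under that assumption: fuzzy sets are stored as dictionaries mapping elements to membership degrees, and the function name `extend` and the example sets are hypothetical, introduced here purely for illustration.

```python
from itertools import product

def extend(f, *fuzzy_sets):
    """Sup-min extension of f to fuzzy sets given as {element: membership} dicts.

    Elements of Y that do not appear in the result have membership 0,
    which corresponds to the case of an empty preimage in (5).
    """
    image = {}
    for combo in product(*(fs.items() for fs in fuzzy_sets)):
        xs = [x for x, _ in combo]
        degree = min(mu for _, mu in combo)        # min over the r arguments
        y = f(*xs)
        image[y] = max(image.get(y, 0.0), degree)  # sup over the preimage of y
    return image

# Hypothetical finite fuzzy sets (assumptions of this sketch).
about_two = {1: 0.5, 2: 1.0, 3: 0.5}
about_ten = {10: 1.0, 20: 0.4}

total = extend(lambda x1, x2: x1 + x2, about_two, about_ten)
print(total)

# A non-injective mapping, where the supremum in (5) actually matters.
print(extend(abs, {-1: 0.3, 0: 1.0, 1: 0.7, 2: 0.2}))
```

For continuous universes the same rule has to be evaluated analytically or through the level-set representation rather than by enumeration.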
If the Ai, i = 1, . . . , r, are fuzzy numbers and f is either an injective or a continuous function, then for each α ∈ [0, 1] the α-level set of B = f(A1, . . . , Ar) can be shown to be the interval Bα = (f(A1, . . . , Ar))α with endpoints

\[
\begin{aligned}
\bigl(f(A_1, \dots, A_r)\bigr)_L(\alpha) &= \min_{(x_1,\dots,x_r) \in (A_1)_\alpha \times \dots \times (A_r)_\alpha} f(x_1, \dots, x_r), \\
\bigl(f(A_1, \dots, A_r)\bigr)_U(\alpha) &= \max_{(x_1,\dots,x_r) \in (A_1)_\alpha \times \dots \times (A_r)_\alpha} f(x_1, \dots, x_r).
\end{aligned}
\tag{6}
\]
Thus, applying the extension principle to calculate the membership function of y = f(x1, . . . , xr) is equivalent to applying interval arithmetic to the α-level sets of the arguments of this function (see [28] as a basis for the proof). For instance, if A and B are fuzzy numbers, then

\[
(A + B)_L(\alpha) = A_L(\alpha) + B_L(\alpha),
\qquad
(A + B)_U(\alpha) = A_U(\alpha) + B_U(\alpha),
\]

\[
(\lambda \cdot A)_L(\alpha) =
\begin{cases}
\lambda \cdot A_L(\alpha) & \text{if } \lambda \ge 0, \\
\lambda \cdot A_U(\alpha) & \text{if } \lambda < 0,
\end{cases}
\qquad
(\lambda \cdot A)_U(\alpha) =
\begin{cases}
\lambda \cdot A_U(\alpha) & \text{if } \lambda \ge 0, \\
\lambda \cdot A_L(\alpha) & \text{if } \lambda < 0,
\end{cases}
\]

for any λ ∈ R and α ∈ [0, 1].
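The level-wise rules for the sum and the scalar product can be illustrated with a short sketch. It assumes trapezoidal fuzzy numbers, i.e. fuzzy numbers of the form (1) whose sides are linear; the class `Trapezoid` and the helper names are hypothetical and serve only this illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Trapezoid:
    """Fuzzy number with linear sides: support [a1, a4], core [a2, a3]."""
    a1: float
    a2: float
    a3: float
    a4: float

    def cut(self, alpha):
        """Return the alpha-level set (A_L(alpha), A_U(alpha)) for alpha in [0, 1]."""
        lower = self.a1 + alpha * (self.a2 - self.a1)
        upper = self.a4 - alpha * (self.a4 - self.a3)
        return lower, upper

def add_cuts(cut_a, cut_b):
    """alpha-cut of A + B from the alpha-cuts of A and B."""
    return cut_a[0] + cut_b[0], cut_a[1] + cut_b[1]

def scale_cut(lam, cut):
    """alpha-cut of lam * A; the endpoints swap when lam is negative."""
    lo, hi = cut
    return (lam * lo, lam * hi) if lam >= 0 else (lam * hi, lam * lo)

A = Trapezoid(1, 2, 3, 5)
B = Trapezoid(0, 1, 1, 2)   # a triangular number is a trapezoid with a2 == a3
for alpha in (0.0, 0.5, 1.0):
    print(alpha, add_cuts(A.cut(alpha), B.cut(alpha)), scale_cut(-2, A.cut(alpha)))
```

Working on a few α-levels in this way avoids evaluating the supremum in (5) explicitly.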
4 Fuzzy random variables

Uncertainty understood as randomness is well described in probability theory. The concept of a random variable, which is basic in this theory, is well known and its definition need not be recalled in this article. However, when we observe random experimental data which are imprecise, a useful tool to model either the imprecise perception of values coming from real-valued random variables or the random mechanisms directly generating these imprecise data is the concept of fuzzy random variables. Actually, we can consider two different approaches to the concept of fuzzy random variable; the motivation for these approaches and the situations they apply to are different, but the formalization of the second notion and the associated statistical methodology can be applied to the first one.

Historically, the first widely accepted definition of the fuzzy random variable was proposed by Kwakernaak [21], [22]. Kruse [19] proposed an interpretation of this notion according to which a fuzzy random variable Z may be considered as a fuzzy perception of an unknown true real-valued random variable Z0 associated with a random experiment, referred to as ‘the original’ of Z. Below, we recall the version of this definition elaborated by Kruse and Meyer [20].

Definition 2. (Kruse and Meyer [20]). Let (Ω, A, P) be a probability space, where Ω is the set of all possible outcomes of the random experiment, A is a σ-field of subsets of Ω (the set of all possible events of interest), and P is a probability measure associated with (Ω, A). A mapping X : Ω → Fc(R), where Fc(R) is the space of all fuzzy numbers, is called a fuzzy random variable if it satisfies the following properties:
i) {Xα(ω) : α ∈ [0, 1]}, where Xα(ω) = (X(ω))α, is a set representation of X(ω) for all ω ∈ Ω;
ii) for each α ∈ [0, 1] both XαL : Ω → R and XαU : Ω → R, with XαL(ω) = inf Xα(ω) and XαU(ω) = sup Xα(ω), are usual real-valued random variables associated with (Ω, A, P).

The values of a fuzzy random variable in the sense of Definition 2 have been conceived to model fuzzy perceptions of existing real values (the values of the original, see Figure 1). For instance, when we qualify the price of a given item in a specific store, we can perceive/label it as being ‘cheap’, but there is an existing price (although it is assumed to be unknown to the person receiving and processing the data).
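Fig. 1. Fuzzy random variables in Kwakernaak/Kruse and Meyer’s sense [figure not reproduced; diagram labels: original random variable; unknown stage; perception of the variable value]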
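A small simulation may help to visualize the two stages behind Definition 2: an unobserved original and its fuzzy perception. The sketch below is illustrative only. The perception rule (rounding to the nearest unit with a tolerance of one unit, giving a triangular fuzzy number), the uniform distribution of the original and the names `perceive` and `cut` are all assumptions of this example and are not taken from the cited works.

```python
import random

def perceive(value, step=1.0):
    """Triangular fuzzy perception (a1, peak, a4) of a real value.

    Hypothetical rule: the observer reports the value rounded to the
    nearest multiple of `step`, with a tolerance of one step on each side.
    """
    peak = round(value / step) * step
    return (peak - step, peak, peak + step)

def cut(tri, alpha):
    """alpha-level set (lower, upper) of a triangular fuzzy number."""
    a1, peak, a4 = tri
    return (a1 + alpha * (peak - a1), a4 - alpha * (a4 - peak))

rng = random.Random(0)
originals = [rng.uniform(10.0, 20.0) for _ in range(5)]  # unknown stage
sample = [perceive(z) for z in originals]                # observed fuzzy data

# For a fixed alpha, the lower and upper endpoints are ordinary real-valued
# random variables, as required by condition ii) of Definition 2.
lower_half = [cut(x, 0.5)[0] for x in sample]
upper_half = [cut(x, 0.5)[1] for x in sample]
print(list(zip(lower_half, upper_half)))
```

With this perception rule every original value lies inside the 0-level set of its perception, which reflects the idea that the original is compatible with what has been observed.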
Puri and Ralescu [30] introduced another concept, also called fuzzy random variable, as a generalization of the concept of random set or set-valued random element (and hence also as a generalization of the concept of random variable). According to this definition a fuzzy random variable Z may be considered as a random element associating with each experimental outcome a value which is intrinsically fuzzy. Below, we recall Puri and Ralescu’s definition.

Definition 3. (Puri and Ralescu [30]). Given a probability space (Ω, A, P), a mapping X : Ω → Fc(R) is said to be a fuzzy random variable (also referred to as a random fuzzy set) if for each α ∈ [0, 1] the set-valued mapping Xα : Ω → Kc(R), where Kc(R) is the class of the nonempty compact intervals and Xα(ω) = (X(ω))α for all ω ∈ Ω, is a compact convex random set (that is, a Borel-measurable mapping with respect to the Borel σ-field generated by the topology associated with the Hausdorff metric on Kc(R)).

Remark 1. The notion of fuzzy random variable has in fact been introduced in a more general way, by considering as the codomain the space of fuzzy sets of the p-dimensional Euclidean space (or of even more general Banach spaces) whose α-levels are compact subsets of this space. If one restricts to p = 1 and to convex fuzzy sets, one obtains the last definition.

Remark 2. Although the motivation for introducing fuzzy random variables was different in the approaches by Kwakernaak/Kruse and Meyer and by Puri and Ralescu, one can prove that the notion in Definition 3 implies the one in Definition 2. As a consequence, probabilistic ideas and results for the notion in Definition 3 (or for the more
general one in Remark 1) apply to the notion in Definition 2, and the same holds for statistical developments. However, many probabilistic conclusions and most of the statistical procedures for Definition 2 are based on the assumption of an unknown but existing original and on Zadeh’s extension principle, so these conclusions and procedures are usually not applicable to data coming from fuzzy random variables in the sense of Definition 3.

Remark 3. The concept of fuzzy random variable in Definition 3 can alternatively be formalized as a Borel-measurable mapping with respect to the Borel σ-field generated by the topology associated with some metrics on the space Fc(R), among them an operational one we will refer to later. Borel-measurability guarantees that notions such as the distribution induced by a fuzzy random variable, independence of fuzzy random variables, identically distributed fuzzy random variables, and so on, can be immediately formalized in the probabilistic setting.

The values of a fuzzy random variable in the sense of Definition 3 have been conceived to model existing fuzzy values (see Figure 2). For instance, when we classify a client of a bank, in accordance with the degree of aversion to investment, as having a ‘rather high’ degree, there is no underlying real-valued degree; the classification itself is essentially imprecise.
Fig. 2. Fuzzy random variables in Puri and Ralescu’s sense [figure not reproduced; diagram label: fuzzy-valued random variable]
5 Statistical analysis of fuzzy data corresponding to fuzzy perceptions of existing real-valued data

5.1 Fuzzy estimation

When imprecise statistical data correspond to fuzzy perceptions of unobserved/unknown precise (i.e. crisp) statistical data, we can treat them as observed values of fuzzy random variables in the sense of Kwakernaak/Kruse and Meyer. In such a case we can analyze imprecise data in terms of the probability distributions of their unobserved originals, in a similar way as precise statistical data are analyzed using the methods of traditional mathematical statistics. The only difference stems from the fact that, having imprecise input information in the form of fuzzy data, we cannot precisely evaluate the characteristics of the underlying probability distribution. Therefore, instead of finding precise values of the estimators of the ‘parameters’ describing the underlying probability distribution (that of the original), it seems more coherent to find their imprecise fuzzy perceptions.
Assume that we observe a fuzzy random sample 𝒳1, . . . , 𝒳n which is viewed as a fuzzy perception of an unobserved random sample X1, . . . , Xn. Let F(x; θ) be the cumulative distribution function of the original random variable X, characterized by a crisp parameter θ ∈ Θ. Suppose now that an estimator of θ is given by a statistic θ̂ = ϕ(X1, . . . , Xn). By using Zadeh’s extension principle we can consider a fuzzy estimator of θ based on 𝒳1, . . . , 𝒳n. It can alternatively be viewed as a fuzzy estimator of the induced fuzzy parameter ϑ = θ(𝒳), where

\[
\theta(\mathcal{X})(t) = \sup_{X \in E(\Omega, \mathcal{A}, P)\,:\,\theta(X) = t} \ \inf_{\omega \in \Omega} \mu_{\mathcal{X}(\omega)}(X(\omega))
\]

and E(Ω, A, P) is the class of all possible originals of 𝒳. This fuzzy estimator associates with each fuzzy sample information (x̃1, . . . , x̃n) the fuzzy estimate ϕ(x̃1, . . . , x̃n) given by

\[
\mu_{\phi(\tilde{x}_1, \dots, \tilde{x}_n)}(t) =
\begin{cases}
\displaystyle \sup_{(x_1,\dots,x_n)\,:\,t = \phi(x_1,\dots,x_n)} \min\{\mu_{\tilde{x}_1}(x_1), \dots, \mu_{\tilde{x}_n}(x_n)\} & \text{if } t \in \mathrm{Im}(\phi(X_1, \dots, X_n)), \\
0 & \text{otherwise.}
\end{cases}
\tag{7}
\]
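For a continuous statistic such as the sample mean, the supremum in (7) does not have to be evaluated point by point: by the level-wise result recalled in (6), it is enough to apply the statistic to the endpoints of the α-cuts of the observations. The sketch below illustrates this under the assumption that the fuzzy observations are triangular fuzzy numbers given as triples (a, m, b); the function names and the data are hypothetical.

```python
def tri_cut(tri, alpha):
    """alpha-level set (lower, upper) of a triangular fuzzy number (a, m, b)."""
    a, m, b = tri
    return a + alpha * (m - a), b - alpha * (b - m)

def fuzzy_mean_cut(sample, alpha):
    """alpha-level set of the fuzzy sample mean of triangular fuzzy data.

    The mean is nondecreasing in every argument, so its minimum and maximum
    over the product of the alpha-cuts are attained at the lower and upper
    endpoints, respectively.
    """
    cuts = [tri_cut(x, alpha) for x in sample]
    n = len(cuts)
    return sum(lo for lo, _ in cuts) / n, sum(hi for _, hi in cuts) / n

# Hypothetical fuzzy observations, e.g. 'about 5', 'about 8', 'about 2'.
data = [(4.0, 5.0, 6.0), (6.0, 8.0, 9.0), (1.0, 2.0, 3.0)]
for alpha in (0.0, 0.5, 1.0):
    print(alpha, fuzzy_mean_cut(data, alpha))
```

At α = 1 the cut collapses to the ordinary mean of the modal values, and the cuts widen as α decreases.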
In many practical cases, when ϕ(x1, . . . , xn) has a simple form, the calculation of (7) is straightforward. For example, when ϕ(x1, . . . , xn) is the sample mean (x1 + . . . + xn)/n, the α-levels of the estimate of the expected value θ are given by

\[
\bigl(\phi(\tilde{x}_1, \dots, \tilde{x}_n)\bigr)_\alpha = \left[ \frac{1}{n} \sum_{i=1}^{n} (\tilde{x}_i)_L(\alpha),\ \frac{1}{n} \sum_{i=1}^{n} (\tilde{x}_i)_U(\alpha) \right].
\tag{8}
\]

Analogously, whenever the estimator of θ is given by a continuous or an injective function, the α-levels also become rather simple. In other cases, the limits of the α-levels must often be found by solving the nonlinear mathematical programming problems defined by (7).

A similar approach may be applied when we try to construct fuzzy versions of the confidence intervals for the unknown parameter θ. Let us assume that we are able to find confidence intervals for θ using precise (crisp) statistical data. For example, let [πl, +∞), where πl = πl(X1, . . . , Xn; δ), be the one-sided confidence interval for the parameter θ at the confidence level 1 − δ. Kruse and Meyer [20] have shown that when we replace πl with the lower limits of the α-cuts of its fuzzy version we obtain a proper confidence interval for the fuzzy perception of θ. These lower limits can be found for different values of α ∈ (0, 1] from the formula [15]
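\[
\Pi_\alpha^L = \Pi_\alpha^L(\mathcal{X}_1, \dots, \mathcal{X}_n; \delta) = \inf\{t \in \mathbb{R} : \forall i \in \{1, \dots, n\}\ \exists x_i \in \mathcal{X}_{i,\alpha} \text{ such that } \pi_l(x_1, \dots, x_n; \delta) \le t\},
\tag{9}
\]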
where 𝒳i,α, i = 1, . . . , n, are the respective α-cuts of the observed fuzzy data. In a similar way we can define a fuzzy equivalent of the one-sided confidence interval (−∞, πu]:

\[
\Pi_\alpha^U = \Pi_\alpha^U(\mathcal{X}_1, \dots, \mathcal{X}_n; \delta) = \sup\{t \in \mathbb{R} : \forall i \in \{1, \dots, n\}\ \exists z_i \in \mathcal{X}_{i,\alpha} \text{ such that } \pi_u(z_1, \dots, z_n; \delta) \ge t\},
\tag{10}
\]
where πu(z1, . . . , zn; δ) = πl(z1, . . . , zn; 1 − δ). Moreover, exactly the same approach can be applied when we look for two-sided confidence intervals.

Summing up this section, we can say that when imprecise observations may be treated as fuzzy perceptions of precise but unobserved realizations of ordinary random variables, the problem of point and interval estimation of the unknown parameters of the underlying probability distribution can be reduced to finding fuzzy versions of the formulae known from traditional mathematical statistics.

5.2 Fuzzy statistical tests

Testing statistical hypotheses is the second main branch of mathematical statistics. Tests of statistical hypotheses have to be applied if we want to make decisions based on the analysis of random data. When our decisions depend on the values of the parameters of the probability distributions that describe the observed statistical data, we use parametric statistical methods. In such a case we test statistical hypotheses about the values of these parameters, utilizing the well-known equivalence between the set of values of the considered parameter for which the null hypothesis is accepted and a certain confidence interval for this parameter. Kruse and Meyer [20] have shown that the same equivalence exists in the case of statistical tests with fuzzy data.

Let 𝒳1, . . . , 𝒳n denote a fuzzy sample, i.e. a fuzzy perception of the usual random sample X1, . . . , Xn from the population with the distribution Pθ. Let δ be a given number from the interval (0, 1). Grzegorzewski (2000) proposed the following definition of a fuzzy test for vague data:

Definition 4. A function φ : (Fc(R))^n → F({0, 1}) is called a fuzzy test for the hypothesis H, at the significance level δ, if

\[
\sup_{\alpha \in [0, 1]} P\{\omega \in \Omega : \varphi_\alpha(\mathcal{X}_1(\omega), \dots, \mathcal{X}_n(\omega)) \subseteq \{1\} \mid H\} \le \delta,
\tag{11}
\]
where φα is the α-level set (α-cut) of φ(𝒳1, . . . , 𝒳n). The fuzzy test defined above can be regarded as a family of classical tests {φα : α ∈ (0, 1]} for which the significance level is given as the upper bound of the type I error over the whole family.

In order to give an example of a fuzzy statistical test, let us consider the following simple null hypothesis H : θ = θ0 against the composite two-sided alternative K : θ ≠ θ0. Suppose we know a two-sided symmetrical confidence interval [π1, π2] for θ at the confidence level 1 − δ, where π1 = π1(X1, . . . , Xn; δ/2) and π2 = π2(X1, . . . , Xn; δ/2) are the limits of the ordinary two-sided confidence interval. The fuzzy equivalent of this confidence interval can be calculated using the α-cuts Πα = [ΠαL, ΠαU] for all α ∈ (0, 1], where the limits of these α-cuts can be computed from (9)-(10) by replacing δ with δ/2. The fuzzy two-sided statistical test for H : θ = θ0 against K : θ ≠ θ0, at the significance level δ, has been defined by Grzegorzewski [15] as a function φ : (FN(R))^n → F({0, 1}) with the following α-cuts
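\[
\varphi_\alpha(\mathcal{X}_1, \dots, \mathcal{X}_n) =
\begin{cases}
\{0\} & \text{if } \theta_0 \in (\Pi_\alpha \setminus (\neg\Pi)_\alpha), \\
\{1\} & \text{if } \theta_0 \in ((\neg\Pi)_\alpha \setminus \Pi_\alpha), \\
\{0, 1\} & \text{if } \theta_0 \in (\Pi_\alpha \cap (\neg\Pi)_\alpha), \\
\emptyset & \text{if } \theta_0 \notin (\Pi_\alpha \cup (\neg\Pi)_\alpha).
\end{cases}
\tag{12}
\]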
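The construction in (9), (10) and (12) can be traced numerically in a simple special case. The sketch below is illustrative only and rests on several assumptions that are not part of the text: the original is normally distributed with known standard deviation `sigma`, so that the crisp limits are the sample mean minus or plus a normal quantile times sigma/√n; the fuzzy data are triangular fuzzy numbers (a, m, b); the complement is the standard one, with membership 1 − µΠ; and µΠ(θ0) is approximated on a finite grid of α values. All function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def tri_cut(tri, alpha):
    """alpha-level set (lower, upper) of a triangular fuzzy number (a, m, b)."""
    a, m, b = tri
    return a + alpha * (m - a), b - alpha * (b - m)

def fuzzy_ci_cut(sample, alpha, sigma, delta):
    """alpha-cut of the fuzzy two-sided confidence interval for a normal mean.

    Both crisp limits are increasing in every observation, so the infimum
    in (9) and the supremum in (10) are attained at the endpoints of the
    alpha-cuts of the data.
    """
    n = len(sample)
    half = NormalDist().inv_cdf(1 - delta / 2) * sigma / sqrt(n)
    lows, highs = zip(*(tri_cut(x, alpha) for x in sample))
    return sum(lows) / n - half, sum(highs) / n + half

def acceptance_degree(sample, theta0, sigma, delta, steps=1000):
    """Grid approximation of mu_Pi(theta0): largest alpha with theta0 in Pi_alpha."""
    best = 0.0
    for k in range(1, steps + 1):
        alpha = k / steps
        lo, hi = fuzzy_ci_cut(sample, alpha, sigma, delta)
        if lo <= theta0 <= hi:
            best = alpha
    return best

def decision_cut(mu_accept, alpha):
    """alpha-cut of the fuzzy test as in (12), for alpha in (0, 1]."""
    in_pi = mu_accept >= alpha
    in_not_pi = 1 - mu_accept >= alpha
    if in_pi and in_not_pi:
        return {0, 1}
    if in_pi:
        return {0}
    if in_not_pi:
        return {1}
    return set()

data = [(4.0, 5.0, 6.0), (6.0, 8.0, 9.0), (1.0, 2.0, 3.0)]  # hypothetical data
for theta0 in (5.0, 3.0, 10.0):
    m = acceptance_degree(data, theta0, sigma=1.0, delta=0.05)
    print(theta0, m, [decision_cut(m, a) for a in (0.2, 0.5, 0.7)])
```

For a hypothetical value well inside every cut the test returns {0} at all levels, for a value outside all cuts it returns {1}, and in between the outcome depends on α.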
Similarly we may obtain fuzzy tests for one-sided hypotheses using the one-to-one correspondence between the acceptance regions of the tests designated for testing these hypotheses on the significance level δ and one-sided confidence intervals for the parameter θ on the confidence level 1 − δ . Grzegorzewski [15] has shown that the membership function of the fuzzy test for the hypothesis H against K is given by
\[
\mu_\varphi(t) = \mu_\Pi(\theta_0) I_{\{0\}}(t) + \mu_{\neg\Pi}(\theta_0) I_{\{1\}}(t) = \mu_\Pi(\theta_0) I_{\{0\}}(t) + (1 - \mu_\Pi(\theta_0)) I_{\{1\}}(t), \qquad t \in \{0, 1\},
\tag{13}
\]
where Π is a fuzzy acceptance region depending on the considered hypotheses. Thus, the fuzzy test defined by (12), contrary to the classical crisp test, does not lead to a binary decision (to accept or to reject the null hypothesis) but to a fuzzy decision. One may get φ = 1/0 + 0/1, which indicates that we should accept H, or φ = 0/0 + 1/1, which means the rejection of H. However, one may also get φ = µ0/0 + (1 − µ0)/1, where µ0 ∈ (0, 1), which can be interpreted as a degree of conviction that we should accept (µ0) or reject (1 − µ0) the hypothesis H. Thus, in situations when µ0 is neither 0 nor 1, a user must decide, using other criteria, whether to reject or to accept the considered hypothesis.

There exist several approaches that are suitable for solving this problem. One of them, formulated in the language of possibility theory, has been proposed by Hryniewicz [17]. It builds on the results of Dubois et al. [9], who proposed using statistical confidence intervals for the parameters of probability distributions to construct possibility distributions of these parameters. According to their approach, the family of two-sided confidence intervals

\[
[\pi_L(x_1, \dots, x_n; 1 - \delta/2),\ \pi_U(x_1, \dots, x_n; 1 - \delta/2)], \qquad \delta \in (0, 1),
\tag{14}
\]
forms the possibility distribution ϑ̃ of the estimated value of the unknown parameter ϑ. In a similar way it is possible to construct one-sided possibility distributions based on one-sided nested confidence intervals. Hryniewicz [17] proposed to compare this possibility distribution with a hypothetical value of the tested parameter. For this purpose he proposed to use the necessity of strict dominance measure introduced by Dubois and Prade [7] for measuring the necessity of the strict dominance relation Ã ≻ B̃, where Ã and B̃ are fuzzy sets. This measure, called the Necessity of Strict Dominance index (NSD), is defined as
\[
\mathrm{NSD} = \mathrm{Ness}\bigl(\tilde{A} \succ \tilde{B}\bigr) = 1 - \sup_{x,y:\, x \le y} \min\{\mu_A(x), \mu_B(y)\}.
\tag{15}
\]
Hryniewicz [17] has shown that in the classical case of precise statistical data and precisely defined statistical hypotheses the value of the NSD index is equal to the p-value of the test.
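As an illustration of how the index in (15) might be evaluated in practice, the following Python sketch approximates the supremum on a finite grid. The grid-based approximation, the helper `triangular` and the function name `nsd_index` are assumptions made for this example only; the result depends on the grid containing the relevant points of both membership functions.

```python
import numpy as np


def triangular(left, peak, right):
    """Membership function of a triangular fuzzy number (crisp if left == peak == right)."""
    def mu(x):
        if x == peak:
            return 1.0
        if left < x < peak:
            return (x - left) / (peak - left)
        if peak < x < right:
            return (right - x) / (right - peak)
        return 0.0
    return mu


def nsd_index(mu_a, mu_b, grid):
    """Grid approximation of NSD = 1 - sup_{x <= y} min(mu_A(x), mu_B(y))."""
    grid = np.sort(np.asarray(grid, dtype=float))
    a = np.array([mu_a(x) for x in grid])
    b = np.array([mu_b(y) for y in grid])
    # For a fixed y, the supremum of mu_A(x) over x <= y is the running maximum of mu_A.
    a_running = np.maximum.accumulate(a)
    return 1.0 - float(np.max(np.minimum(a_running, b)))


if __name__ == "__main__":
    grid = np.linspace(0.0, 10.0, 81)  # step 0.125, exactly representable
    print(nsd_index(triangular(5, 6, 7), triangular(3, 3, 3), grid))  # A lies entirely above B
    print(nsd_index(triangular(4, 6, 8), triangular(5, 5, 5), grid))  # partial overlap
```

The running maximum avoids a double loop over all pairs (x, y): the supremum over x ≤ y is taken first, and only the outer supremum over y remains.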
In case of fuzzy data, the confidence intervals used for the construction of the possibility distribution of the estimated parameter θ can be replaced by their fuzzy equivalents presented in the previous sections of this article. In his paper, Hryniewicz [17] assumes that the significance level δ of the corresponding statistical test is equal to the possibility degree α that defines the respective α-cut of the possibility distribution θ̃. He also assumes that in the possibilistic analysis of statistical tests at the significance level δ we should take into account only those possible values of the fuzzy sample whose possibility is not smaller than δ. Thus, the α-cuts of the membership function µF(θ), denoted by [µ_{F,L}^{(α)}, µ_{F,U}^{(α)}], are equivalent to the α-cuts of the respective fuzzy confidence intervals at the confidence level 1 − α. Having the possibility distribution of the test statistic, we can use (15) to calculate the degree of necessity that the considered statistical hypothesis has to be accepted. When we set a critical value for this characteristic, we arrive at unequivocal (crisp) decisions. It is worth noting that this approach has been generalized in [17] to the case of testing imprecisely defined hypotheses using fuzzy statistical data.

In the previous paragraphs we have presented fuzzy statistical tests for the case when the class to which the underlying probability distribution belongs is known. Verification of this assumption when the available statistical data are imprecise may be very difficult indeed. Therefore, it would be advisable to use fuzzy equivalents of non-parametric (distribution-free) statistical methods. Unfortunately, there exist only a few papers devoted to such fuzzy tests. The most interesting result has been obtained by Denœux et al. [4], who proposed a general methodology for the construction of fuzzy tests based on rank statistics.

Statistical analysis of fuzzy random data can also be done in the Bayesian framework. First results presenting the Bayesian decision analysis for imprecise data were given in papers by Casals et al. [3] and Gil [11]. Other approaches have been proposed by such authors as Viertl [33], Frühwirth-Schnatter [10], and Taheri and Behboodian [31]. A comprehensive Bayesian model comprising fuzzy data, fuzzy hypotheses, and a fuzzy utility function has been proposed in the paper by Hryniewicz [16].
6 Statistical analysis of existing fuzzy-valued data

When imprecise statistical data correspond to observed fuzzy-valued data, we can treat them as observed values of fuzzy random variables in the sense of Puri and Ralescu. In the literature these data are often treated as categorical/ordinal/interval-valued ones. It should be emphasized that the model given by fuzzy random variables allows us to describe and handle these data on a more expressive scale (in contrast to just ranking or stating simply the interval support of the values). Many statistical developments for real-valued data are based on distances/deviations between values rather than on the diversity of these values. The use of the fuzzy scale allows us to consider metrics with a meaning similar to that for the real-valued case (i.e., distinguishing not only the ranks of variable values w.r.t. a certain criterion, but also a physical distance between them).
The distance we will consider here is the one stated by Bertoluzza et al. [2], so that if A, B ∈ Fc(R)
\[
D_W^{\varphi}(A, B) = \sqrt{\int_{[0,1]} \left[ \int_{[0,1]} \bigl[f_A(\alpha,\lambda) - f_B(\alpha,\lambda)\bigr]^2 \, dW(\lambda) \right] d\varphi(\alpha)}
\]
with fA(α, λ) = λ · AU(α) + (1 − λ) · AL(α), where
• W and φ are normalized weighted measures on [0, 1], formalized as probability measures on ([0, 1], B[0,1]),
• W is associated with a non-degenerate distribution,
• φ is associated with a strictly increasing distribution function on [0, 1].
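A rough numerical illustration of this distance is sketched below in Python. It assumes, purely for the example, that both W and φ are the uniform (Lebesgue) measure on [0, 1], which satisfies the three conditions listed above, that AL(α) and AU(α) are the lower and upper endpoints of the α-level of A, and that fuzzy numbers are trapezoidal and stored as tuples (a, b, c, d) with a ≤ b ≤ c ≤ d. The function names and the midpoint-rule integration are hypothetical choices, not part of the source.

```python
import numpy as np


def alpha_cut(trap, alphas):
    """Lower and upper endpoints of the alpha-levels of a trapezoid (a, b, c, d)."""
    a, b, c, d = trap
    lower = a + alphas * (b - a)
    upper = d - alphas * (d - c)
    return lower, upper


def bertoluzza_distance(A, B, n_grid=200):
    """Approximate D_W^phi(A, B) with W and phi uniform on [0, 1] (an assumption)."""
    grid = (np.arange(n_grid) + 0.5) / n_grid   # midpoint rule on [0, 1]
    a_lo, a_up = alpha_cut(A, grid)
    b_lo, b_up = alpha_cut(B, grid)
    d_lo = (a_lo - b_lo)[:, None]               # rows index alpha
    d_up = (a_up - b_up)[:, None]
    lam = grid[None, :]                         # columns index lambda
    diff = lam * d_up + (1.0 - lam) * d_lo      # f_A(alpha, lambda) - f_B(alpha, lambda)
    return float(np.sqrt(np.mean(diff ** 2)))   # equal weights = uniform W and phi


if __name__ == "__main__":
    crisp_1 = (1.0, 1.0, 1.0, 1.0)
    crisp_3 = (3.0, 3.0, 3.0, 3.0)
    print(bertoluzza_distance(crisp_1, crisp_3))  # reduces to |1 - 3| for crisp numbers
    print(bertoluzza_distance((0, 1, 2, 3), (0, 1.5, 1.5, 3)))
```

For crisp numbers the integrand is constant, so the distance reduces to the absolute difference; other choices of W and φ would replace the equal-weight average by a weighted one.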
Remark 4. It should be remarked on D_W^φ that
• W and φ have no stochastic meaning.
• Considering W is equivalent to considering a measure weighting the points 0, 1 and a certain t0(W) ∈ (0, 1); in case W is symmetric, t0(W) = .5.
• For each α, the choice of W allows us to weight
  – the effect of the distance between the ‘widths’ of the α-levels (i.e., the effect of the ‘shape’ difference),
  – in comparison with the effect of the distance between their t0(W)-points (i.e., the effect of the ‘location’ difference).
• The choice of φ allows us to weight the influence of each level (i.e., degree of ‘imprecision’, ‘consensus’, ‘subjectivity’, ...).
• D_W^φ is versatile and operational in statistical developments with fuzzy numbers, and it behaves especially well when we consider least-squares approaches.

Since the concept of fuzzy random variable in Puri and Ralescu’s sense has been properly stated in a probabilistic context as a random element (i.e., as a Borel-measurable function), concepts like independent and identically distributed fuzzy random variables make immediate sense. Furthermore, all the main ideas, aims and concepts, and several developments can be immediately considered to deal with fuzzy data when coming from fuzzy-valued Borel-measurable mappings. In this respect, notions like either unbiasedness or consistency of a ‘point’ (fuzzy- or real-valued) estimator of a (fuzzy- or real-valued) ‘parameter’ associated with the distribution of the fuzzy random variable, or the p-value and power of a test concerning such a ‘parameter’, make the same sense as in the classical case.

Several developments have been made in connection with both estimation and testing of fuzzy- and real-valued parameters associated with the distribution of a fuzzy random variable in Puri and Ralescu’s sense. We now recall a few results concerning the ‘point’ fuzzy estimation and testing of the population mean of a fuzzy random variable, which is formalized as follows:

Definition 5.
(Puri & Ralescu, [30]). Given a probability space (Ω, A, P) and an associated fuzzy random variable X : Ω → Fc(R) such that max{|inf X0|, |sup X0|} is integrable, the fuzzy expected value (or fuzzy mean) of X is the fuzzy number Ẽ(X|P) ∈ Fc(R) such that for all α ∈ [0, 1]
\[
\bigl(\widetilde{E}(X \mid P)\bigr)_\alpha = \text{Aumann integral of } X_\alpha
= \bigl\{ E(f \mid P) : f : \Omega \to \mathbb{R},\ f \in L^1(\Omega, \mathcal{A}, P),\ f \in X_\alpha \text{ a.s. } [P] \bigr\}
= \bigl[ E(\inf X_\alpha \mid P),\ E(\sup X_\alpha \mid P) \bigr],
\]
where f ranges over the integrable real-valued selections of Xα.

Remark 5. The fuzzy mean satisfies that
• If X(Ω) = {x̃1, . . . , x̃m, . . .} ⊂ Fc(R), then it is coherent with fuzzy arithmetic, that is,
\[
\widetilde{E}(X \mid P) = P(\{\omega \in \Omega \mid X(\omega) = \tilde{x}_1\}) \cdot \tilde{x}_1 + \dots + P(\{\omega \in \Omega \mid X(\omega) = \tilde{x}_m\}) \cdot \tilde{x}_m + \dots
\]
• Strong Laws of Large Numbers are satisfied for different metrics (like D_W^φ, and stronger ones), which also corroborates the suitability of the defined fuzzy mean as the stochastic limit of the sample ones.
• Ẽ(X|P) is the ‘Fréchet expectation’ of X w.r.t. D_W^φ, i.e., for all A ∈ Fc(R):
\[
E\Bigl( \bigl[ D_W^{\varphi}\bigl(X, \widetilde{E}(X \mid P)\bigr) \bigr]^2 \Bigm| P \Bigr) \le E\Bigl( \bigl[ D_W^{\varphi}(X, A) \bigr]^2 \Bigm| P \Bigr).
\]

6.1 Fuzzy estimation

Assume that we observe a fuzzy simple random sample X1, . . . , Xn, which is now viewed as an n-tuple of independent fuzzy random variables identically distributed as X. The associated fuzzy sample mean is the statistic given by
\[
\overline{X}_n = \frac{1}{n} \cdot [X_1 + \dots + X_n].
\]
Then, one can prove that (see Lubiano [26])

Theorem 1. The fuzzy sample mean satisfies that
i) X̄n is an ‘unbiased fuzzy-valued estimator’ of Ẽ(X|P) (in the sense of Puri & Ralescu’s fuzzy expected value). For most of the metrics we can consider, it is also a ‘strongly consistent’ fuzzy-valued estimator of Ẽ(X|P).
ii) One can quantify the mean squared-type error in the fuzzy estimation by considering the real-valued expected value
\[
E\Bigl( \bigl[ D_W^{\varphi}\bigl(\overline{X}_n, \widetilde{E}(X \mid P)\bigr) \bigr]^2 \Bigr).
\]
Actually, Lubiano et al. ([24], [25]) have carried out several developments in connection with the estimation and testing about the D_W^φ-mean squared error associated with the estimation of the population fuzzy mean by means of the sample one.
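The fuzzy sample mean can be illustrated with a small Python sketch. It assumes trapezoidal fuzzy data stored as rows (a, b, c, d) with a ≤ b ≤ c ≤ d; for such data the fuzzy (Minkowski) sum and the product by the positive scalar 1/n act component-wise on the four parameters, so the sample mean is again a trapezoid. The function names are hypothetical, and the restriction to trapezoids is an assumption of the example rather than of the theory above.

```python
import numpy as np


def fuzzy_sample_mean(sample):
    """Fuzzy sample mean of trapezoidal fuzzy numbers given as rows (a, b, c, d)."""
    sample = np.asarray(sample, dtype=float)
    if sample.ndim != 2 or sample.shape[1] != 4:
        raise ValueError("expected an (n, 4) array of trapezoids")
    return sample.mean(axis=0)


def alpha_cut(trap, alpha):
    """Interval [inf, sup] of the alpha-level of a trapezoid (a, b, c, d)."""
    a, b, c, d = trap
    return a + alpha * (b - a), d - alpha * (d - c)


if __name__ == "__main__":
    data = [(0.0, 1.0, 2.0, 3.0), (2.0, 3.0, 4.0, 5.0)]
    mean = fuzzy_sample_mean(data)
    print(mean)
    # Level-wise check: the alpha-level of the mean is the average of the alpha-levels.
    cuts = np.array([alpha_cut(x, 0.5) for x in data])
    print(alpha_cut(mean, 0.5), cuts.mean(axis=0))
```

The level-wise check at the end reflects the interval form of the fuzzy mean: each endpoint of an α-level of the mean is the average of the corresponding endpoints of the data.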
6.2 Statistical tests

One of the problems which has received considerable attention in connection with testing from fuzzy random variables in Puri and Ralescu’s sense is that concerning two-sided tests about the mean of a fuzzy random variable (one-sample case). The aim of the problem can be formalized as follows: Given n independent observations X1, . . . , Xn from X, we wish to test the null hypothesis H0 : Ẽ(X|P) = A ∈ Fc(R), which can be equivalently expressed as
\[
H_0 : D_W^{\varphi}\bigl(\widetilde{E}(X \mid P), A\bigr) = 0.
\]
For this purpose, an exact test for ‘normal’ fuzzy random variables (in Puri and Ralescu’s sense, [?]) has been developed (see Montenegro et al., [27]). Although the method is exact and easy to apply, the assumption of X being normal (X = V + N(0, 1), with V ∈ Fc(R)) is quite restrictive and often unrealistic. On the other hand, asymptotic tests have been developed for the same target, based on Large Sample Theory, but the asymptotic distribution of the statistic involves unknown parameters (Körner, [23]), except when X takes on a finite number of values (Montenegro et al., [27]), and large sample sizes would be required anyway.

Simulation studies have been considered to analyze the extent and applicability of the asymptotic test by Montenegro et al. These studies have confirmed that estimating the eigenvalues entails a substantial loss of precision w.r.t. the nominal significance level. Based on this empirical conclusion, the use of D_W^φ and the Generalized Bootstrapped CLT (Giné and Zinn, [13]) allow us to consider bootstrap techniques in this context. Thus, we get the following method (González-Rodríguez et al., [14]):

Theorem 2. Given a fuzzy random variable X : Ω → Fc(R) associated with the probability space (Ω, A, P) and such that
• max{(inf X0)², (sup X0)²} is integrable,
• X1, . . . , Xn are i.i.d. as X,
• X1∗, . . . , Xn∗ is a bootstrap sample from X1, . . . , Xn,
then, to test H0 at the nominal significance level α ∈ [0, 1], H0 should be rejected whenever
\[
\frac{\bigl[ D_W^{\varphi}\bigl(\overline{X}_n, A\bigr) \bigr]^2}{\widehat{S}_n^{\,2}} > z_\alpha,
\]
where zα is the 100(1 − α) fractile of the bootstrap distribution of
\[
T_n = \frac{\bigl[ D_W^{\varphi}\bigl(\overline{X}_n^{*}, \overline{X}_n\bigr) \bigr]^2}{\widehat{S}_n^{*2}}
\]
with
\[
\overline{X}_n^{*} = \sum_{i=1}^{n} X_i^{*} / n,
\qquad
\widehat{S}_n^{*2} = \sum_{i=1}^{n} \bigl[ D_W^{\varphi}\bigl(X_i^{*}, \overline{X}_n^{*}\bigr) \bigr]^2 / (n - 1).
\]
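The bootstrap procedure of Theorem 2 can be sketched in Python as follows. This is a hedged illustration, not the authors' implementation: it assumes trapezoidal fuzzy data stored as rows (a, b, c, d), takes W and φ uniform on [0, 1], defines the studentizing term of the original sample by analogy with its bootstrap counterpart (the source does not spell it out), and reports a Monte Carlo p-value, so that rejecting when the p-value falls below α corresponds approximately to the fractile rule. All function names are hypothetical.

```python
import numpy as np

GRID = (np.arange(40) + 0.5) / 40  # midpoint grid on [0, 1] for both alpha and lambda


def dist_sq(A, B):
    """Squared D_W^phi between trapezoids (a, b, c, d); W, phi uniform (assumption)."""
    a1, b1, c1, d1 = A
    a2, b2, c2, d2 = B
    d_lo = (a1 + GRID * (b1 - a1)) - (a2 + GRID * (b2 - a2))  # lower endpoints
    d_up = (d1 - GRID * (d1 - c1)) - (d2 - GRID * (d2 - c2))  # upper endpoints
    diff = GRID[None, :] * d_up[:, None] + (1.0 - GRID[None, :]) * d_lo[:, None]
    return float(np.mean(diff ** 2))


def studentized_stat(sample, target):
    """[D(mean, target)]^2 divided by the sample quasi-variance w.r.t. D."""
    mean = sample.mean(axis=0)
    s2 = sum(dist_sq(x, mean) for x in sample) / (len(sample) - 1)
    # Degenerate (re)sample with zero spread: treat the statistic as infinite.
    return dist_sq(mean, target) / s2 if s2 > 0 else np.inf


def bootstrap_mean_test(sample, A, n_boot=500, seed=0):
    """Return (observed statistic, bootstrap p-value) for H0: fuzzy mean equals A."""
    sample = np.asarray(sample, dtype=float)
    rng = np.random.default_rng(seed)
    n = len(sample)
    t_obs = studentized_stat(sample, np.asarray(A, dtype=float))
    mean = sample.mean(axis=0)
    t_boot = np.empty(n_boot)
    for b in range(n_boot):
        resample = sample[rng.integers(0, n, size=n)]  # draw n rows with replacement
        t_boot[b] = studentized_stat(resample, mean)   # centred at the sample mean
    return t_obs, float(np.mean(t_boot >= t_obs))


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    V = np.array([0.0, 1.0, 2.0, 3.0])
    data = V + rng.normal(size=(30, 1))  # X = V + N(0, 1), as in the 'normal' model
    print(bootstrap_mean_test(data, V))
    print(bootstrap_mean_test(data, V + 10.0))
```

Centring the bootstrap statistic at the sample mean rather than at the hypothesized value is what lets the resampling distribution mimic the null distribution even when H0 is false.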
Other tests about means of fuzzy random variables which have been already developed are the following ones:
• One-sided hypothesis tests in the one-sample case.
• Tests for the equality of means of two FRVs
  – for two independent samples,
  – for two linked samples.
• Tests for the equality of means of J FRVs (ANOVA):
  – for J independent samples.
Comparative simulation studies have been developed, so we can conclude empirically that
• For small/medium samples, the bootstrap method performs and behaves much better than the asymptotic one.
• For large sample sizes (over 300), the improvement is not that remarkable, but the bootstrap approach still provides the best approximation to the nominal significance level.
7 Future directions

Statistical analysis of imprecise data is still a developing area of science. Future directions of its development are tightly connected with the development of methods that may be used for the description of uncertainties of different types. It has been pointed out by P. Walley (see, e.g., [35] for a good overview) that traditional probability is not sufficient for a good description of different types of uncertainty. Different mathematical models, such as Dempster-Shafer belief functions, possibility distributions, lower and upper probabilities, lower and upper previsions, and many others, have been proposed for this purpose. However, for the most general models describing uncertainty, appropriate statistical methods have not been proposed yet. Therefore, statistical methods for handling very general imprecise data have to be developed in the future.

Another, but much more specific, future direction for the development of statistical analysis of imprecise data is related to the analysis of intrinsically fuzzy data. In contrast to the situation when fuzzy observations may be considered as fuzzy perceptions of real-valued observations, many notions known from traditional statistics are still waiting for their widely accepted definitions and statistical methods of analysis.

The most challenging future direction is related to Zadeh’s paradigm of “computing with words”. First of all, we need operational methods for the representation of linguistic concepts which could be useful in statistical analysis of imprecisely reported (with words!) statistical data. Moreover, we also need methods for convincing presentation of the results of computations to users who have only limited knowledge of mathematics and statistics.
8 Bibliography

Primary Literature

1. Bandemer H., Näther W., Fuzzy Data Analysis, Kluwer, Dordrecht, 1992.
2. Bertoluzza C., Corral N., Salas A., On a new class of distances between fuzzy numbers, Mathware and Soft Computing, vol. 2 (1995), 71–84.
3. Casals R., Gil M.A., Gil P., The fuzzy decision problem: an approach to the problem of testing statistical hypotheses with fuzzy information, Europ. Journ. of Oper. Res., vol. 27 (1986), 371–382.
4. Denœux T., Masson M.-H., Hébert P.A., Nonparametric rank-based statistics and significance tests for fuzzy data, Fuzzy Sets and Systems, vol. 153 (2005), 1–28.
5. Dubois D., Prade H., Operations on Fuzzy Numbers, International Journal of Systems Science, vol. 9 (1978), 613–626.
6. Dubois D., Prade H., Fuzzy Sets and Systems. Theory and Applications, Academic Press, New York, 1980.
7. Dubois D., Prade H., Ranking fuzzy numbers in the setting of possibility theory, Information Sciences, vol. 30 (1983), 184–244.
8. Dubois D., Prade H., Possibility Theory, Plenum Press, New York, 1988.
9. Dubois D., Foulloy L., Mauris G., Prade H., Probability-possibility transformations, triangular fuzzy-sets and probabilistic inequalities, Proc. of the Ninth International Conference IPMU, Annecy (2002), 1077–1083.
10. Frühwirth-Schnatter S., Fuzzy Bayesian inference, Fuzzy Sets and Systems, vol. 60 (1993), 41–58.
11. Gil M.Á., Probabilistic-Possibilistic approach to some Statistical Problems with Fuzzy Experimental Observations, In: Kacprzyk J., Fedrizzi M. (Eds.), Combining Fuzzy Imprecision with Probabilistic Uncertainty in Decision Making, Springer-Verlag, Berlin (1988), 286–306.
12. Gil M.Á., López-Díaz M., Ralescu D.A., Overview on the development of fuzzy random variables, Fuzzy Sets and Systems, vol. 157 (2006), 2546–2557.
13. Giné E., Zinn J., Bootstrapping general empirical measures, Ann. Probab., vol. 18 (1990), 851–869.
14. González-Rodríguez G., Montenegro M., Colubi A., Gil M.A., Bootstrap techniques and fuzzy random variables: synergy in hypothesis testing with fuzzy data, Fuzzy Sets and Systems, vol. 157 (2006), 2608–2613.
15. Grzegorzewski P., Testing statistical hypotheses with vague data, Fuzzy Sets and Systems, vol. 112 (2000), 501–510.
16. Hryniewicz O., Possibilistic Approach to the Bayes Statistical Decisions, In: Grzegorzewski P., Hryniewicz O., Gil M.Á. (Eds.), Soft Methods in Probability, Statistics and Data Analysis, Physica-Verlag, Heidelberg (2002), 207–218.
17. Hryniewicz O., Possibilistic decisions and fuzzy statistical tests, Fuzzy Sets and Systems, vol. 157 (2006), 2665–2673.
18. ISO/IEC Guide to the expression of uncertainty in measurement (GUM), ISO/IEC, Geneva, 1995.
19. Kruse R., The strong law of large numbers for fuzzy random variables, Information Sciences, vol. 28 (1982), 233–241.
20. Kruse R., Meyer K.D., Statistics with Vague Data, D. Reidel Publishing Company, 1987.
21. Kwakernaak H., Fuzzy Random Variables, Part I: Definitions and Theorems, Information Sciences, vol. 15 (1978), 1–15.
22. Kwakernaak H., Fuzzy Random Variables, Part II: Algorithms and Examples for the Discrete Case, Information Sciences, vol. 17 (1979), 253–278.
23. Körner R., An asymptotic α-test for the expectation of random fuzzy variables, J. Stat. Plann. Inference, vol. 83 (2000), 331–346.
24. Lubiano M.A., Gil M.A., Estimating the expected value of fuzzy random variables in random samplings from finite populations, Statist. Papers, vol. 40 (1999), 277–295.
25. Lubiano M.A., Gil M.A., López-Díaz M., López-García M.T., The λ⃗-mean squared dispersion associated with a fuzzy random variable, Fuzzy Sets and Systems, vol. 111 (2000), 307–317.
26. Lubiano M.A., Medidas de variación de elementos aleatorios, PhD Thesis, University of Oviedo.
27. Montenegro M., Colubi A., Casals M.R., Gil M.A., Asymptotic and Bootstrap techniques for testing the expected value of a fuzzy random variable, Metrika, vol. 59 (2004), 31–49.
28. Nguyen H.T., A note on the extension principle for fuzzy sets, J. Math. Anal. Appl., vol. 64 (1978), 369–380.
29. Puri M.L., Ralescu D.A., The concept of normality for fuzzy random variables, Ann. Probab., vol. 13 (1985), 1373–1379.
30. Puri M.L., Ralescu D.A., Fuzzy Random Variables, J. Math. Anal. Appl., vol. 114 (1986), 409–422.
31. Taheri S.M., Behboodian J., A Bayesian approach to fuzzy hypotheses testing, Fuzzy Sets and Systems, vol. 123 (2001), 39–48.
32. Terán P., Probabilistic foundations for measurement modelling with fuzzy random variables, Fuzzy Sets and Systems, vol. 158 (2007), 973–986.
33. Viertl R., Is it necessary to develop a fuzzy Bayesian inference, In: Viertl R. (Ed.), Probability and Bayesian Statistics, Plenum Publishing Company, New York (1987), 471–475.
34. Viertl R., Statistical Methods for Non-Precise Data, CRC Press, Boca Raton, 1996.
35. Walley P., Measures of uncertainty in expert systems, Artificial Intelligence, vol. 83 (1996), 1–58.
36. Zadeh L.A., Fuzzy sets, Information and Control, vol. 8 (1965), 338–353.
37. Zadeh L.A., The concept of a linguistic variable and its application to approximate reasoning, Inform. Sci., Part 1, vol. 8 (1975), 199–249; Part 2, vol. 8, 301–353; Part 3, vol. 9, 43–80.
Books and Reviews

Bertoluzza C., Gil M.Á., Ralescu D.A. (Eds.), Statistical Modeling, Analysis and Management of Fuzzy Data, Physica-Verlag, Heidelberg and New York, 2002.
Gil M.Á., López-Díaz M., Ralescu D.A., Overview on the development of fuzzy random variables, Fuzzy Sets and Systems, vol. 157 (2006), 2546–2557.
Grzegorzewski P., Hryniewicz O., Gil M.Á. (Eds.), Soft Methods in Probability, Statistics and Data Analysis, Physica-Verlag, Heidelberg, New York, 2002.
Kruse R., Gebhardt J., Gil M.Á., Fuzzy Statistics, In: Webster J.G. (Ed.), Encyclopedia of Electrical and Electronics Engineering, J. Wiley, New York, 1999.
Lawry J., Miranda E., Bugarin A., Li S., Gil M.Á., Grzegorzewski P., Hryniewicz O. (Eds.), Soft Methods for Integrated Uncertainty Modeling, Springer-Verlag, Berlin, Heidelberg, 2006.
López-Díaz M., Gil M.Á., Grzegorzewski P., Hryniewicz O., Lawry J. (Eds.), Soft Methodology and Random Information Systems, Springer-Verlag, Berlin, Heidelberg, New York, 2004.
Taheri S.M., Trends in Fuzzy Statistics, Austrian Journal of Statistics, vol. 32 (2003), 239–257.