Methods of Information in Medicine © F. K. Schattauer Verlag GmbH (1988)
G. Rosenkranz
Tolerance Regions for Simplex-Valued Random Variables (From the Hoechst AG, Frankfurt am Main, FRG)
Meth. Inform. Med. 27 (1988) 84-86

Summary
We propose distribution-free tolerance regions for multidimensional random variables on the simplex. They are helpful when normal ranges of proportions have to be calculated. The work is motivated by the need for the statistical evaluation of findings at birth in embryotoxicological studies.

Key-Words: Tolerance Regions, Simplex-valued Random Variables, Embryotoxicological Studies

1. Introduction
A tolerance region for a random variable is a subset of the sample space which covers, on an average, a certain fraction of the population. Tolerance regions are very useful, for example, when data have to be classified as "normal" or "not normal". The use of tolerance regions as a decision tool in the medical context has recently been proposed by several authors (Abt [1], Passing et al. [4]). The mathematical foundations of the theory of tolerance regions, which are independent of the distribution of the variable under consideration, were laid in the papers of Wilks [8, 9] and, especially for multivariate data, Wald [7]. Tukey [5, 6] gave Wald's results a general and elegant form and generalized the main results to discontinuous variables. The scale-dependency of the multidimensional tolerance regions proposed by Fraser [3] was removed in Abt [1]. Ackermann [2] published a monograph on the theory and application of tolerance regions.

So far, tolerance regions have been considered for variables in $d$-dimensional Euclidean spaces. In what follows, attention is focused on the development of tolerance regions for simplex-valued random variables. These regions may be useful to classify proportions. For example, if $Z$ is multinomially distributed with parameters $n, p_1, \ldots, p_d$, $p_i \ge 0$, $\sum p_i = 1$, the state space of $Z/n$ is the $d-1$ dimensional simplex. The motivation for the present investigation was to obtain tolerance regions for certain variables measured in the course of embryotoxicological studies. In these studies, the effects of drugs on the progeny of rats or rabbits are examined. One criterion of a damaging effect is an unusually high proportion of dead or reabsorbed fetuses at birth or Caesarian section among the treated animals. The sample space of the proportions of live, dead, and reabsorbed fetuses is the two-dimensional simplex. Of course, animals have only a finite number of implantation sites; hence only a finite set of possible proportions can be observed. Therefore, tolerance regions for discrete random variables will have to be considered.

The present paper is organised as follows: In section 2, Tukey's approach to the construction of tolerance regions is briefly reviewed for the convenience of the reader. In section 3, the construction of tolerance regions for simplex-valued random variables of any dimension is described. The example from embryotoxicological studies is further discussed in section 4, where an example with artificial data is also given. Section 5 discusses the stability of tolerance regions.

2. Mathematical Foundations
For any subset $B$ of the sample space $S$ let $C(B)$ be the coverage of $B$, i.e. the proportion of the population with values in $B$. If $B$ depends on some $n$-sample of, say, random variables $X_1, \ldots, X_n$, then $B$ and $C(B)$ are themselves random. If the $X$'s are independently identically distributed and if for some $0 < p < 1$ ($E$ denotes expectation)
$$E\{C(B)\} \le p,$$
we say that $B$ is a tolerance region of index $p$ for any random variable with the same distribution as the $X_i$. $B$ then covers at most $100p\%$ of the population on an average.

Tukey's method of constructing tolerance regions is to have a fixed sequence of functions which are used successively to cut off blocks from the sample space. This is done until the remaining set has a coverage of less than some predetermined number $0 < p < 1$. In the discontinuous case, which we want to study in this paper, the cuts can have a probability greater than zero and must, therefore, be considered in addition to the blocks. This complicates the construction in some way.

We recall Tukey's definitions, slightly modified for our purpose. An $n$-system of functions $G_1, \ldots, G_n$ is defined as follows:
Downloaded from www.methods-online.com on 2011-12-21 | IP: 88.64.169.76 For personal or educational use only. No other uses without permission. All rights reserved.
$$G_k = (g_{k,1}, \ldots, g_{k,j(k)}), \quad k = 1, \ldots, n,$$
where $g_{k,j}$ is a measurable function on the sample space $S$ and $j(k)$ is a positive integer depending on $k$. We can order values of the functions $G_k$ by the lexicographical method: $(a_1, \ldots, a_n) > (b_1, \ldots, b_n)$ if any of the following hold:
$a_1 > b_1$,
$a_1 = b_1$ and $a_2 > b_2$,
$\ldots$
$a_i = b_i$ $(i < n)$ and $a_n > b_n$.

Given an $n$-system of functions and $n$ points $s_1, \ldots, s_n$ in $S$, the corresponding blocks and cuts are defined by the following procedure: Select $i(1)$ to maximize $G_1(s_i)$. (If more than one value of $i$ maximizes $G_1$, choose one at random.) Put
$S_1 = \{s \in S : G_1(s) > G_1(s_{i(1)})\}$,
$T_1 = \{s \in S : G_1(s) = G_1(s_{i(1)})\}$.
Next select $i(2) \ne i(1)$ to maximize $G_2(s_i)$ and set
$S_2 = \{s \in S : G_1(s) < G_1(s_{i(1)}),\ G_2(s) > G_2(s_{i(2)})\}$,
$T_2 = \{s \in S : G_1(s) < G_1(s_{i(1)}),\ G_2(s) = G_2(s_{i(2)})\}$.
[...]
$$\frac{m}{n+1} \ge E\{C(B_\lambda)\}.$$
Hence, if $m/(n + 1) \le p$ for some $p$, $B_\lambda$ is a tolerance region of index $p$.
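The selection procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `select_order` and `locate` are hypothetical, ties are broken by taking the first maximiser instead of a random one, and the rule for the blocks beyond $S_2$ (each further block requires all earlier $G_j$ values to be strictly smaller than the corresponding cut value, the last block being the remainder) is assumed by analogy with $S_1$ and $S_2$. The toy data live on the unit interval with $G_k(s) = (s)$.

```python
from typing import Callable, Sequence, Tuple

Point = Tuple[float, ...]
# One element G_k of the n-system: maps a point to a tuple of values.
# Python compares tuples lexicographically, which is the ordering used in the text.
G = Callable[[Point], Tuple[float, ...]]


def select_order(system: Sequence[G], points: Sequence[Point]) -> list:
    """Return the indices i(1), ..., i(n) (0-based) found by successive maximisation."""
    remaining = list(range(len(points)))
    order = []
    for g in system:
        # Assumption: ties go to the first maximiser (the text allows a random choice).
        best = max(remaining, key=lambda i: g(points[i]))
        order.append(best)
        remaining.remove(best)
    return order


def locate(s: Point, system: Sequence[G], points: Sequence[Point]):
    """Return ("S", k) or ("T", k), with k counted from 1, for the block or cut containing s."""
    order = select_order(system, points)
    for k, (g, i) in enumerate(zip(system, order), start=1):
        cut_value = g(points[i])
        if g(s) > cut_value:
            return ("S", k)      # strictly beyond the k-th cut: block S_k
        if g(s) == cut_value:
            return ("T", k)      # on the k-th cut: T_k
    return ("S", len(points) + 1)  # assumed remainder block S_{n+1}


# Toy example on the unit interval (hypothetical data).
points = [(0.2,), (0.9,), (0.5,)]
system = [lambda s: (s[0],)] * 3

print(select_order(system, points))
print(locate((0.7,), system, points))
```

In the toy example the largest observation is selected first, so the first block is the part of the interval above it and the first cut is the observation itself.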
3. Construction of Tolerance Regions for Simplex-valued Variables
In this section we apply the general result stated above to construct tolerance regions for random variables whose state space is a $d-1$ dimensional simplex. With regard to Tukey's construction, the problem is essentially solved if suitable cut-off functions have been found. Consider first the problem of how a simplex can be cut into two pieces on the basis of a single observation such that both parts have a coverage of about one half. If $d = 2$, $S$ is the unit interval and a single point divides it into two subintervals. If $d = 3$ as in our example, $S$ is an equilateral triangle. A natural partition is obtained by a straight line through the observed point which is parallel to one of the edges. In the case $d = 4$, when $S$ is a tetrahedron, a hyperplane through the observed point parallel to one of the faces may serve as a cut-off set. In each case considered above, we have chosen the cut-off functions
$g_k^+(z_1, \ldots, z_d) = z_k$, $g_k^-(z_1, \ldots, z_d) = 1 - z_k$
or both, depending on whether a lower ($-$) or an upper ($+$) tolerance limit or a two-sided tolerance interval is needed for the $k$th component of the random variable. The $n$-system of functions $G_k$ is then given by
$G_1 = (g_1, g_2, \ldots, g_{d-2})$,
$G_2 = (g_2, g_3, \ldots, g_{d-1})$,
$\ldots$
$G_d = (g_d, g_1, \ldots, g_{d-3})$, $G_{d+1} = G_1$ etc.,
where $g_k$ stands for $g_k^+$ or $g_k^-$ or both. Because of the special choice of the functions $g_k$, it is unnecessary to include more functions in the families $G_k$. A renumbering of the functions $g_k$ does not change the tolerance region essentially if $n$ is large (see section 5). Now take $\lambda$ to be $\{n-r+1, \ldots, n+1\}$, where $r$ is equal to the integer part of $p(n+1)$ for some $p$. Then by the results of the preceding section
$$E\{C(B_\lambda)\} \le \frac{r}{n+1} \le p,$$
and $B_\lambda$ is the tolerance region we are looking for. Note that
$$B_\lambda = T_{n-r+1} \cup S_{n-r+1} \cup \ldots \cup S_{n-1} \cup T_{n-1} \cup S_n \cup T_n \cup S_{n+1}.$$
In any dimension, the sets $T_i$ consist only of the observations $X_i$, i.e. $T_i = \{X_i\}$. The set $S_i$ contains the points "above" the hyperplane through $X_i$ and some points on this hyperplane, minus the points already contained in $S_j$, $T_j$ for $j < i$. This will become clearer in the next section, where we exemplify the construction for dimension $d = 3$.

4. An Example
We now discuss the embryotoxicological example mentioned in the introduction in greater detail. Let the components of a random variable $X$ denote the proportions of resorption sites, dead and live fetuses, respectively. To check whether or not the progeny of treated animals is normal, we need upper bounds for the proportions of dead and reabsorbed fetuses, derived from an untreated population. In such a population, the proportion of live fetuses is generally high. Hence, a tolerance region must furnish lower bounds for the proportion of live fetuses. From these considerations it follows that the cut-off functions have the following form:
$g_1(z_1, z_2, z_3) = z_1$,
$g_2(z_1, z_2, z_3) = z_2$,
$g_3(z_1, z_2, z_3) = 1 - z_3 = z_1 + z_2$,
where $z = (z_1, z_2, z_3)$ is an element of the two-dimensional simplex. The $n$-system $G_1, \ldots, G_n$ then reads:
$G_1 = (g_1, g_2)$, $G_2 = (g_2, g_3)$, $G_3 = (g_3, g_1)$, $G_4 = (g_1, g_2)$, etc.
In order to illustrate the construction of blocks and cuts as described in earlier sections, assume that
$X_1 = (0.4, 0.2, 0.3)$, $X_2 = (0.3, 0.6, 0.1)$, $X_3 = (0.4, 0.4, 0.2)$
have been observed. Note that these figures are, of course, artificial and only serve illustrative purposes. Then $i(1) = 3$, $i(2) = 2$, $i(3) = 1$, and ($X_{i,j}$ denotes the $j$th component of $X_i$)
$T_1 = \{X_3\}$, $S_1 = \{s \in S : s_1 > X_{3,1}$ or $s_1 = X_{3,1}$ and $s_2 > X_{3,2}\}$,
$T_2 = \{X_2\}$, $S_2 = \{s \in S : s_2 > X_{2,2}$ or $s_2 = X_{2,2}$ and $s_1 + s_2 > X_{2,1} + X_{2,2}\} - (S_1 \cup T_1)$,
$T_3 = \{X_1\}$, $S_3 = \{s \in S : s_1 + s_2 > X_{1,1} + X_{1,2}$ or $s_1 + s_2 = X_{1,1} + X_{1,2}$ and $s_1 > X_{1,1}\} - (S_1 \cup T_1 \cup S_2 \cup T_2)$.
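The order in which the three observations define the cuts can be checked with a small script. It is a sketch only: the helper name `G` and the dictionary `X` are illustrative, exact fractions are used to avoid floating-point ties, and $g_3$ is coded as $z_1 + z_2$, the form used in the block definitions. Note that the components of $X_1$ as printed sum to 0.9; using $1 - z_3$ instead would change the value of $g_3(X_1)$ but not the resulting order.

```python
from fractions import Fraction as F

# Cut-off functions of the example: upper limits for z1 and z2, lower limit for z3.
g1 = lambda z: z[0]
g2 = lambda z: z[1]
g3 = lambda z: z[0] + z[1]  # z1 + z2, as in the definitions of S2 and S3


def G(k):
    """k-th element (k = 1, 2, ...) of the cyclic system G1=(g1,g2), G2=(g2,g3), G3=(g3,g1), ..."""
    gs = (g1, g2, g3)
    first, second = gs[(k - 1) % 3], gs[k % 3]
    return lambda z: (first(z), second(z))


# Artificial observations from the text, keyed by their index.
X = {
    1: (F("0.4"), F("0.2"), F("0.3")),
    2: (F("0.3"), F("0.6"), F("0.1")),
    3: (F("0.4"), F("0.4"), F("0.2")),
}

remaining, order = set(X), []
for k in range(1, len(X) + 1):
    # Lexicographic maximisation of G_k over the observations not yet used.
    i = max(sorted(remaining), key=lambda j: G(k)(X[j]))
    order.append(i)
    remaining.remove(i)

print(order)  # indices of the observations defining T1, T2, T3
```

The tie between $X_1$ and $X_3$ in the first component is resolved by the second component of $G_1$, which is why $X_3$ defines the first cut.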
The blocks and cuts constructed above are displayed in Figure 1. The sample space $S$ is the triangle RDL. The block $S_1$ contains all points which lie below the line through $X_3$ parallel to the edge LD and the points on this line below $X_3$. $S_2$ contains all points below the line through $X_2$ parallel to the edge LR and all points on this line below $X_2$, diminished by the points already contained in $S_1$ and $T_1$. A similar result holds for $S_3$. In the case of an $n$-sample of untreated animals, determine $r$ as described in section 3 and cut off $r$ blocks and cuts in the same way as described in the above example. The remaining set, which has the shape of an irregular pentagon containing the angle L of the triangle, is the required tolerance region for the proportions of live, dead, and reabsorbed fetuses. Had we considered two-sided tolerance regions for each proportion, the tolerance region would have had the form of an irregular hexagon lying inside the triangle RDL.

Fig. 1 Blocks and cuts for the example with $d = 3$, $n = 3$, $X_1 = (0.4, 0.2, 0.3)$, $X_2 = (0.3, 0.6, 0.1)$, $X_3 = (0.4, 0.4, 0.2)$.

Table 1 Probability that a tolerance limit of index 0.95 contains at most a fraction of $0.95 + \varepsilon$ of the population
   n    ε = 0.01    ε = 0.005   ε = 0.001
 100    0.788375    0.704953    0.633985
 200    0.819979    0.708771    0.608707
 300    0.848876    0.721108    0.599591
 400    0.872746    0.734445    0.595498
 500    0.892364    0.747394    0.593639
 600    0.908580    0.759628    0.592962
 700    0.922076    0.771095    0.592976
 800    0.933377    0.781822    0.593424
 900    0.942891    0.791863    0.594157
1000    0.950936    0.801275    0.595083
1100    0.957765    0.810112    0.596142
1200    0.963582    0.818424    0.597294
1300    0.968550    0.826256    0.598512
1400    0.972803    0.833647    0.599776
1500    0.976453    0.840632    0.601072
1600    0.979591    0.847244    0.602390
1700    0.982293    0.853509    0.603721
1800    0.984623    0.859454    0.605060
1900    0.986636    0.865100    0.606403
2000    0.988376    0.870468    0.607745
2100    0.989883    0.875576    0.609085
2200    0.991189    0.880441    0.610420
2300    0.992321    0.885079    0.611749
2400    0.993304    0.889502    0.613070
2500    0.994159    0.893725    0.614383
2600    0.994901    0.897758    0.615687
2700    0.995548    0.901613    0.616981
2800    0.996111    0.905299    0.618265
2900    0.996601    0.908826    0.619538
3000    0.997028    0.912202    0.620801
3100    0.997401    0.915435    0.622053
3200    0.997726    0.918532    0.623295
3300    0.998010    0.921501    0.624525
3400    0.998258    0.924348    0.625745
3500    0.998474    0.927079    0.626954
3600    0.998663    0.929699    0.628153
3700    0.998829    0.932214    0.629341
3800    0.998974    0.934629    0.630518
3900    0.999100    0.936949    0.631686
4000    0.999211    0.939178    0.632843
4100    0.999308    0.941319    0.633990
4200    0.999393    0.943378    0.635127
4300    0.999467    0.945358    0.636255
4400    0.999533    0.947262    0.637372
4500    0.999590    0.949093    0.638481
4600    0.999640    0.950855    0.639580
4700    0.999684    0.952551    0.640670
4800    0.999722    0.954184    0.641751
4900    0.999756    0.955755    0.642823
5000    0.999786    0.957269    0.643887
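The tabulated probabilities can be recomputed from the formula given in section 5. The sketch below makes two assumptions: it reads $I_x(a, b)$ as the regularized incomplete beta function, and it evaluates it through the standard identity $I_x(r, n-r+1) = P(\mathrm{Bin}(n, x) \ge r)$. The function name `stability` is hypothetical. The sum runs over the number of "failures" in log space so that large $n$ does not overflow.

```python
from fractions import Fraction
from math import exp, floor, lgamma, log


def stability(n: int, p: str, eps: str) -> float:
    """Evaluate I_{p+eps}(r, n-r+1) with r = integer part of p(n+1).

    Assumes I is the regularized incomplete beta function and uses
    I_x(r, n-r+1) = P(Bin(n, x) >= r).
    """
    r = floor(Fraction(p) * (n + 1))        # exact arithmetic for the integer part
    x = float(Fraction(p) + Fraction(eps))  # p + eps
    log_x, log_q = log(x), log(1.0 - x)
    total = 0.0
    # j counts failures; successes k = n - j run from n down to r.
    for j in range(0, n - r + 1):
        log_term = (lgamma(n + 1) - lgamma(j + 1) - lgamma(n - j + 1)
                    + j * log_q + (n - j) * log_x)
        total += exp(log_term)
    return total


# Compare with the first row of Table 1 (n = 100, index 0.95).
for eps in ("0.01", "0.005", "0.001"):
    print(eps, round(stability(100, "0.95", eps), 6))
```

Passing `p` and `eps` as strings keeps the integer part of $p(n+1)$ exact; with binary floating point a product such as $0.95 \cdot 200$ could in principle land just below an integer.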
5. Stability of Tolerance Regions
Let me complete this article with some remarks concerning the stability of a tolerance region. Stability is measured in terms of two quantities $\beta$ and $\varepsilon$, where $\beta$ denotes the probability that a tolerance limit of index $p$ contains at most a fraction of $p + \varepsilon$ of the population. If $\beta$ is close to one for an $\varepsilon$ close to zero, the tolerance region is called stable. Roughly speaking, $p + \varepsilon$ may be looked upon as an upper $100\beta\%$ confidence limit for the coverage of the tolerance region. If $r$ denotes the integer part of $p(n + 1)$, it follows from Tukey's theorem that
$$\beta = I_{p+\varepsilon}(r, n - r + 1).$$
In Table 1 the values of $\beta$ are tabulated for $\varepsilon = 0.01$, $0.005$ and $0.001$ and values of $n$ ranging from 100 to 5000 in intervals of 100. It can be seen from this table that some thousand data are necessary for a reasonable degree of stability of the tolerance region.

Acknowledgment
The author wants to thank Dr. H. Passing, Hoechst AG, I + K-Software, for drawing his attention to the problem treated and for many stimulating discussions.

REFERENCES
[1] Abt, K.: Scale-independent non-parametric multivariate tolerance regions and their application in medicine. Biometric J. 24 (1982) 27-48.
[2] Ackermann, H.: Mehrdimensionale nichtparametrische Normbereiche. (Berlin - Heidelberg - New York - Tokyo: Springer, 1985).
[3] Fraser, D. A. S.: Sequentially determined statistically equivalent blocks. Ann. Math. Statist. 24 (1951) 372-381.
[4] Passing, H., Bambynek, G., Deyssenroth, H., Griewe, A. P., Helmstadter, G., Knappen, E., Lüdin, E., Peil, H., Unkelbach, H. D.: Normalbereich im Tierversuch. In: J. Vollmar (Edit.): Biometrie in der chemisch-pharmazeutischen Industrie 1, pp. 57-73. (Stuttgart: Gustav Fischer 1983).
[5] Tukey, J. W.: Nonparametric estimation II: Statistically equivalent blocks and tolerance regions - the continuous case. Ann. Math. Statist. 19 (1947) 45-55.
[6] Tukey, J. W.: Nonparametric estimation III: Statistically equivalent blocks and tolerance regions - the discontinuous case. Ann. Math. Statist. 19 (1948) 30-39.
[7] Wald, A.: An extension of Wilks' method of setting tolerance limits. Ann. Math. Statist. 14 (1943) 45-55.
[8] Wilks, S. S.: Determination of sample sizes for setting tolerance limits. Ann. Math. Statist. 12 (1941) 91-96.
[9] Wilks, S. S.: Statistical prediction with special reference to the problem of tolerance limits. Ann. Math. Statist. 13 (1942) 400-409.

Address of the author: Dr. Gerd Rosenkranz, Hoechst A. G., Pharma Forschung Informatik, D-6230 Frankfurt/M. 80, FRG
Meth. Inform. Med., Vol. 27, No.2, 1988