Optimal Post-stratification: An application of stochastic programming
Sira Allende*, Carlos Bouza* (1) and Dante Covarrubias**
*Universidad de La Habana
**Universidad Autónoma de Guerrero
(1) [email protected]

The need to design a monitoring network in the Sierra de Guerrero motivated this paper. Studying the effect of deforestation requires monitoring the diversity of the forest. Diversity is evaluated quantitatively by estimating an index. Commonly, the monitoring of biodiversity is based on the periodic selection of samples for evaluating the diversity. We propose to use the sample information for determining post-strata, which must constitute homogeneous zones in the forest. A stochastic program is developed for determining the post-strata to be used for sampling. The procedure seems to be a good alternative to the use of a heuristic procedure. The results presented are based on data obtained in a study carried out at one of the most important forest diversity reservoirs of Mexico.

Key words: post-stratification, monitoring network, stochastic programming, diversity indexes
Optimal Post-stratification for the study of the sustainability: an application to the monitoring of diversity in Sierra de Guerrero

1. INTRODUCTION
The importance of ecological studies is increasing because of their role in environmental management. Characterizing diversity in ecological communities is very important, and measures (indexes) are needed for quantifying its status in many situations. The importance of this problem in forestry is evident; see Shiver and Borders (2003). Sample data are used for evaluating different aspects, but diversity is not considered. Various approaches can be used for measuring diversity, see Patil and Taillie (1982), and there are different indexes for measuring it. The selection of one of them by a researcher is based on personal preference. The need to study them from a statistical point of view has been pointed out in different papers and books, such as Kempton (2002). Bouza and Covarrubias (2005) and Bouza and Schubert (2003) have studied this problem and proposed estimators for some indexes. Section 3 presents their results on the estimation of the indexes of Simpson, Fager and Shannon.

The need to carry out periodic studies of the sustainability of a forest suggests establishing a network of monitoring points. The area under study should be divided into homogeneous zones, and in each of them some monitoring points or stations will provide the information for evaluating the effect of the management policies established. These policies must establish how to manage harvesting, reforestation and urbanization. Their main objective is to maintain diversity at a level that allows obtaining timber using appropriate harvesting and reforestation schedules.

2. METHODOLOGY
The problem posed is to establish a procedure for characterizing the diversity in the Sierra de Guerrero. Ecologists must use a quantitative measure of biodiversity. In this case the interest is in the diversity of the flora in terms of the species present in the sierra.
Different authors have established that a biodiversity index is a function of the number of species and of their richness; see for example Gove et al. (1994). Let us give some key ideas on biodiversity measurement. An index is a parameter to be estimated. We selected the indexes of Simpson, Fager and Shannon because of their popularity and the ease with which ecologists interpret them. In what follows $K$ is the number of species, $N_i$ is the number of individuals of species $i$ in a population of $N$ individuals, and $\pi_i = N_i/N$.

The index of Simpson (1949)

$$\theta_S = 1 - \sum_{i=1}^{K} \pi_i^2 \qquad (2.3)$$

is very popular among ecologists. The index of Fager (1972) is defined as

$$\theta_F = \frac{N(K+1) - J(K-J)}{2} - \sum_{i=1}^{K} N_i R_i \qquad (2.4)$$

where $J \in [0,K)$ is an integer fixed by the ecologist and $R_1,\dots,R_K$ are ranks. The ranks are assigned to the species in decreasing order of their importance.

The index of Shannon is given by

$$\theta_{Sh} = -\sum_{i=1}^{K} \pi_i \ln(\pi_i) \qquad (2.5)$$
and is derived from the proposal of Shannon (1948) for measuring the entropy of a system. These indexes measure the diversity in a population, but we deal with samples; therefore we must estimate them. The required estimators are developed in the next section.

The survey carried out in the Sierra de Guerrero did not consider the existence of zones with differentiated diversity. To establish a monitoring network it is necessary to establish their homogeneity. As samples are selected, the clustering of the zones is made after sampling. We look for groups of homogeneous zones. The indexes $\theta_t$, $t=1,2,3$, are considered as the variables of interest. A stratum (homogeneous zone) is a set of population units such that the value of the biodiversity index $\theta_t$ ($t=1,2,3$) in each sampling unit $u_j$ is similar. Ideally, once the strata are determined, these parameters must be very different from stratum to stratum.

The strata construction is modeled using the results of Allende and Bouza (1987). They assumed that some multivariate information was available and used it as auxiliary information on the variables of interest (stratification variables). The auxiliary information must be available for each unit in the population. As it is not available in our study, we use the sample data provided by the observed units. Therefore the groups of zones are defined as post-strata, and the original optimization model has to be redefined. As we deal with random data, the original model is no longer valid. Hence a stochastic programming model is derived. It allows determining which zones are clustered in each stratum using the estimated diversity indexes.

3. INDEXES
Denoting by $n_i$ the number of individuals of species $i$ in a sample of $n$ individuals, an unbiased estimator of the index of Simpson, see Bouza and Schubert (2003) and Gimaret-Carpentier et al. (1998), is given by

$$\hat\theta_S = 1 - \sum_{i=1}^{K} \frac{n_i(n_i-1)}{n(n-1)}.$$

When $m$ sites are selected independently, let $n_{ij}$ be the number of individuals of species $i$ observed in site $j$, and take $n^*_j = \sum_{i=1}^{K} n_{ij}$ ($j=1,\dots,m$) and $n^*_i = \sum_{j=1}^{m} n_{ij}$ ($i=1,\dots,K$). The estimator of the index of Simpson using the $m$ sites, Bouza and Covarrubias (2005), is

$$\hat\theta_{SS} = \frac{1}{m}\sum_{j=1}^{m} \hat\theta_j, \qquad \hat\theta_j = 1 - \sum_{i=1}^{K} \frac{n_{ij}(n_{ij}-1)}{n^*_j(n^*_j-1)}.$$

This estimator is unbiased and its variance is

$$Var(\hat\theta_{SS}) = \frac{1}{m^2}\sum_{j=1}^{m} Var(\hat\theta_j),$$

where

$$Var(\hat\theta_j) = \frac{2}{n^*_j(n^*_j-1)}\left[2(n^*_j-2)\sum_{i=1}^{K}\pi_i^3 + \sum_{i=1}^{K}\pi_i^2 - (2n^*_j-3)\left(\sum_{i=1}^{K}\pi_i^2\right)^2\right].$$
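The Simpson estimators described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names and the count data are hypothetical, and each site is assumed to be given as a list of counts per species, in the same species order for every site.

```python
def simpson_unbiased(counts):
    """Unbiased estimate of Simpson's index from the species counts of one sample."""
    n = sum(counts)
    if n < 2:
        raise ValueError("at least two individuals are needed")
    return 1.0 - sum(c * (c - 1) for c in counts) / (n * (n - 1))


def simpson_separate(sites):
    """Separate estimator: average of the per-site unbiased estimates."""
    return sum(simpson_unbiased(site) for site in sites) / len(sites)


def simpson_combined(sites):
    """Combined estimator: pool the counts of each species over all sites first."""
    pooled = [sum(column) for column in zip(*sites)]
    return simpson_unbiased(pooled)


# Hypothetical data: rows are sites, columns are species.
sites = [[4, 3, 3], [6, 2, 2], [5, 5, 0]]
print(simpson_separate(sites), simpson_combined(sites))
```

The two estimators generally differ on the same data, because the separate version averages site-level estimates while the combined version treats the pooled counts as a single sample.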
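Taking $n^*_i = \sum_{j=1}^{m} n_{ij}$ and $n^* = \sum_{j=1}^{m}\sum_{i=1}^{K} n_{ij}$, the combined estimator

$$\hat\theta_{SC} = 1 - \sum_{i=1}^{K} \frac{n^*_i(n^*_i-1)}{n^*(n^*-1)}$$

is also unbiased, see Bouza and Covarrubias (2005), and its sampling error is

$$V(\hat\theta_{SC}) = \frac{2}{n^*(n^*-1)}\left[2(n^*-2)\sum_{i=1}^{K}\pi_i^3 + \sum_{i=1}^{K}\pi_i^2 - (2n^*-3)\left(\sum_{i=1}^{K}\pi_i^2\right)^2\right].$$

The index of Fager (1972) was studied by Bouza and Schubert (2003), who proposed the transformation

$$\theta^*_F = \frac{N(K+1) - J(K-J)}{2N} - \sum_{i=1}^{K} \pi_i R_i$$

to make it more tractable from a statistical point of view. Taking $\theta_0 = [N(K+1) - J(K-J)]/2N$, an estimator of this transformed index is

$$\hat\theta^*_F = \theta_0 - \sum_{i=1}^{K} \hat\pi_i R_i = \theta_0 - \sum_{i=1}^{K} \frac{n_i}{n} R_i,$$

whose variance is

$$Var(\hat\theta^*_F) = \frac{1}{n}\left[\sum_{i=1}^{K} R_i^2\pi_i(1-\pi_i) - \sum_{i \neq i'} R_i R_{i'}\pi_i\pi_{i'}\right].$$

When an independent random sample of $m$ sites is selected, an unbiased estimator of the index of Fager is

$$\hat\theta_{FS} = \frac{1}{m}\sum_{j=1}^{m}\hat\theta_{jF} = \theta_0 - \frac{1}{m}\sum_{j=1}^{m}\sum_{i=1}^{K} R_i \frac{n_{ij}}{n^*_j},$$

where $\hat\theta_{jF}$ denotes the estimator of Fager's index in site $j$, and

$$V(\hat\theta_{FS}) = \frac{1}{m^2}\sum_{j=1}^{m} \frac{1}{n^*_j}\left[\sum_{i=1}^{K} R_i^2\pi_i - \left(\sum_{i=1}^{K} R_i^2\pi_i^2 + \sum_{i \neq i'} R_i R_{i'}\pi_i\pi_{i'}\right)\right]$$

is the sampling error.

Another alternative is to combine the information and to estimate $\pi_i$ using $\hat\pi^*_i = n^*_i/n^*$. As $E(n_{ij}) = n^*_j\pi_i$, $i=1,\dots,K$, $j=1,\dots,m$, an unbiased estimator is

$$\hat\theta_{FC} = \theta_0 - \sum_{i=1}^{K} R_i\hat\pi^*_i$$

and

$$V(\hat\theta_{FC}) = \frac{1}{n^*}\left[\sum_{i=1}^{K} R_i^2\pi_i - \left(\sum_{i=1}^{K} R_i^2\pi_i^2 + \sum_{i \neq i'} R_i R_{i'}\pi_i\pi_{i'}\right)\right]$$

is its variance.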
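A small sketch may help to fix ideas on the transformed Fager estimators. It rests on assumptions that should be checked against the original papers: the rank-weighted sum is subtracted from the constant term (the sign is not legible in every copy of the formula), the ranks are supplied by the ecologist, and all function names and numbers are hypothetical.

```python
def fager_theta0(N, K, J):
    """Constant term [N(K+1) - J(K-J)] / (2N) of the transformed index."""
    return (N * (K + 1) - J * (K - J)) / (2 * N)


def fager_transformed(counts, ranks, theta0):
    """Plug-in estimate: theta0 minus the rank-weighted sample proportions (assumed sign)."""
    n = sum(counts)
    return theta0 - sum(r * c for r, c in zip(ranks, counts)) / n


def fager_separate(sites, ranks, theta0):
    """Average of the per-site estimates."""
    return sum(fager_transformed(s, ranks, theta0) for s in sites) / len(sites)


def fager_combined(sites, ranks, theta0):
    """Estimate computed from the counts pooled over all sites."""
    pooled = [sum(column) for column in zip(*sites)]
    return fager_transformed(pooled, ranks, theta0)


def fager_variance(props, ranks, n):
    """Multinomial variance of sum_i R_i * pi_hat_i for a sample of n individuals."""
    first = sum(r * p for r, p in zip(ranks, props))
    second = sum(r * r * p for r, p in zip(ranks, props))
    return (second - first * first) / n


# Hypothetical example: three species ranked 1, 2, 3 and J = 0.
theta0 = fager_theta0(N=1000, K=3, J=0)
sites = [[5, 3, 2], [2, 3, 5]]
print(fager_separate(sites, [1, 2, 3], theta0), fager_combined(sites, [1, 2, 3], theta0))
```

The variance helper uses the identity that the sum of squared-rank terms minus the squared and cross-product terms equals the second moment minus the squared first moment of the ranks under the proportions.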
The index of Shannon, $\theta_{Sh} = -\sum_{i=1}^{K}\pi_i\ln(\pi_i)$, is used frequently because its roots are justified, in physics, by its role in measuring entropy. An estimator of it, see Hsieh and Li (1998), is

$$\hat\theta_{Sh} = -\sum_{i=1}^{K}\hat\pi_i\ln(\hat\pi_i).$$

They studied its behavior when a random sample is selected and established that it is a maximum likelihood estimator. Its variance is given by

$$V(\hat\theta_{Sh}) = \frac{1}{n}\left[\sum_{i=1}^{K}\pi_i\ln^2(\pi_i) - \theta_{Sh}^2\right].$$

If $m$ independent random samples are selected, the use of the separate estimators $\hat\theta_{Shj}$ yields, as a naïve estimator,

$$\hat\theta_{ShS} = \frac{1}{m}\sum_{j=1}^{m}\hat\theta_{Shj} = -\frac{1}{m}\sum_{j=1}^{m}\sum_{i=1}^{K}\hat\pi_{ij}\ln(\hat\pi_{ij}).$$

Hence if

$$V_{Shj} = \frac{1}{n^*_j}\left[\sum_{i=1}^{K}\pi_{ij}\ln^2(\pi_{ij}) - \theta_{Shj}^2\right], \qquad j=1,\dots,m,$$

the variance of the estimator is

$$V(\hat\theta_{ShS}) = \frac{1}{m^2}\sum_{j=1}^{m} V_{Shj}.$$

Using the combination of the counts of the samples, the corresponding combined estimator is

$$\hat\theta_{ShC} = -\sum_{i=1}^{K}\hat\pi^*_i\ln(\hat\pi^*_i)$$

and its variance can be estimated by

$$\hat V_{ShC} = \frac{1}{n^*}\left[\sum_{i=1}^{K}\hat\pi^*_i\ln^2(\hat\pi^*_i) - \hat\theta_{ShC}^2\right].$$
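The Shannon estimators can be illustrated in the same way. The sketch below is only indicative: the function names and counts are hypothetical, variances are obtained by plugging the sample proportions into the variance expression, and species with zero count are skipped (taking 0 ln 0 as 0).

```python
import math


def shannon_plugin(counts):
    """Plug-in (maximum likelihood) estimate of Shannon's index from species counts."""
    n = sum(counts)
    return -sum((c / n) * math.log(c / n) for c in counts if c > 0)


def shannon_variance(counts):
    """Plug-in estimate of the variance: (sum p ln^2 p - H^2) / n."""
    n = sum(counts)
    h = shannon_plugin(counts)
    second = sum((c / n) * math.log(c / n) ** 2 for c in counts if c > 0)
    return (second - h * h) / n


def shannon_separate(sites):
    """Mean of the per-site estimates and the variance of that mean."""
    m = len(sites)
    estimate = sum(shannon_plugin(s) for s in sites) / m
    variance = sum(shannon_variance(s) for s in sites) / m ** 2
    return estimate, variance


def shannon_combined(sites):
    """Estimate and estimated variance computed from the pooled counts."""
    pooled = [sum(column) for column in zip(*sites)]
    return shannon_plugin(pooled), shannon_variance(pooled)


# Hypothetical data: rows are sites, columns are species.
sites = [[4, 3, 3], [6, 2, 2], [5, 5, 0]]
print(shannon_separate(sites))
print(shannon_combined(sites))
```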
Ecologists generally hesitate when deciding which index is to be used. Determining the strata using the three indexes as stratification variables will produce strata that satisfy their different preferences.

4. THE POST-STRATIFICATION
The usual models deal with pre-determined strata. The strata should be constructed if they are unknown in advance. Some particular techniques may be used for grouping the units of a population; take multivariate clustering, for example. The units are identified by a vector, a certain distance is determined and U is partitioned into H mutually disjoint clusters $U_1,\dots,U_H$. A stratum (homogeneous zone) is a set of population units such that the different biodiversity indexes, evaluated in each sampling unit of the stratum, have similar values. We take $t=1$ to mean that Simpson's index is used, $t=2$ that Fager's is computed and $t=3$ to identify the index of Shannon, so each unit is described by the vector $\theta = (\theta_1, \theta_2, \theta_3)^T$. The variables of interest, evaluated in each sampling unit of a stratum, are the indexes. They should be close to the value of the index in the stratum $h$: $\theta_t(h)$. Ideally, once the strata are determined, these parameters must be very different from stratum to stratum. This fact explains why estimates based on stratified random sampling (SRS) are more accurate than those based on simple random sampling (RS), see Cochran (1981).

Allende and Bouza (1987) studied the problem of determining optimal strata using multivariate information. They assumed that some multivariate information was available and used it as auxiliary information on the variables of interest (stratification variables). The auxiliary information must be obtained for each unit in the population. Unfortunately, in the study of a forest that information commonly does not exist. The practical solution is to select a
sample of sites in the studied area in order to obtain it. The diversity indexes in the selected sites are computed. The usual models for optimal stratification, see Cochran (1981), are no longer usable because of the randomness of the data in hand. When we have one variable, a solution is to use post-stratification, but it assumes that the strata are known and that the sampled units are classified a posteriori in them.

Let us consider that in each stratum $U_h$ of size $M_h$ the arithmetic mean of a diversity index $t$ is $\theta^*_t(h)$. The ecologist wants, for any $t$, that (i) $\theta^*_t(h)$ be close to the individual sample site values, and (ii) $\theta^*_t(h)$ be very different from the population mean of the index, $\theta^*_t$.

The measurement of the number of individuals in a site $q$ provides the vector $\theta_q = (\theta_{q1}, \theta_{q2}, \theta_{q3})^T$, while the dispersion of the diversity in the population is measured by

$$\sigma_t^2 = \frac{1}{M}\sum_{q=1}^{M}\left(\theta_{qt} - \theta^*_t\right)^2 \qquad (4.1)$$

and within the stratum $U_h$ by

$$\sigma_t^2(h) = \frac{1}{M_h}\sum_{q \in U_h}\left(\theta_{qt} - \theta^*_t(h)\right)^2 \qquad (4.2)$$
The population mean can be expressed as a linear function of the strata means, see Cochran (1981),

$$\theta^*_t = \sum_{h=1}^{H} W_h\,\theta^*_t(h), \qquad t=1,\dots,3,$$

where $W_h = M_h/M = Prob(u_q \in U_h)$, $h=1,\dots,H$.

An approach for determining the strata is to consider that the decision maker (DM) fixes a set of points in $\mathbb{R}^3$. They represent points that characterize notable behaviors of the diversity. We may then assume that these points are centroids and denote them as $\Theta = \{\theta_r = [\theta_{rS}, \theta_{rF}, \theta_{rSh}]^T,\ r=1,\dots,R,\ R \geq H\}$, where $\theta_r$ is one of the centroids determined by the DM. With these vectors we can define a cost function

$$C_{qr} = \sum_{t=1}^{3}\left(\theta_{qt} - \theta_{rt}\right)^2 = \sum_{t=1}^{3} V_{qrt} \qquad (4.3)$$

for each unit. Then we may determine a set of zones such that the clustered units are close to the corresponding centroid and the centroids are as far apart as possible. Allende and Bouza (1987) studied this problem and proposed an optimization program. Its solution yields an "approximately minimum multivariate variance" stratification.

Our objective is to use the information provided by a sample $s$ to establish a stratification in U. It is basically the idea present in clustering, but the difference with the usual post-stratification is that the strata are not known. The optimal stratification procedures in the literature use $T=1$ variable, but in our case $T>1$. As we are using sample information, P1 is no longer adequate because we deal with a stochastic program. The modeling proposed in this paper is based on the approach of Albareda-Sambola and Fernández (2000). Using their results we have a frame for studying the post-stratification problem. We are using the information provided by a random sample $s$ for determining post-strata. Due to the particularities of the randomness of our data we will use the following modification of P3:

P*3: $$\min \sum_{q \in s}\sum_{r=1}^{R} c^*_{qr}x_{qr}$$

subject to:

$$\sum_{r=1}^{R} x_{qr} = 1, \qquad q \in s,$$
$$Prob\left\{\xi_p(r) = \sum_{q \in s}\xi_{pq}x_{qr} \leq b_r\right\} \geq 1-\alpha, \qquad r=1,\dots,R,$$
$$x_{qr} \in \{0,1\}, \qquad q \in s,\ r=1,\dots,R.$$

P*3 is a stochastic problem with probabilistic constraints. Note that $\xi_p(r)$ is the sum of independent Bernoulli variables with expectation $W_r$. Hence it is a Binomial variable with expected value

$$E\left[\xi_p(r) \mid x\right] = W_r\sum_{q \in s} x_{qr} \qquad (4.5)$$

Using these results we obtain that, with $n_r = \sum_{q \in s} x_{qr}$,

$$Prob\left\{\sum_{q \in s}\xi_{pq}x_{qr} \leq b_r\right\} = \sum_{k=0}^{b_r}\binom{n_r}{k}W_r^k(1-W_r)^{n_r-k} \qquad (4.6)$$

The DM is able to fix $b_r$ by determining the proportion of elements expected to be in $U_r$.

Note that we can determine a quantile of order $1-\alpha$ using the fact that we are dealing with a Binomial distribution. The usual choice is to calculate

$$n_r(\alpha) = \max\left\{n \;\middle|\; \sum_{k=0}^{b_r}\binom{n}{k}W_r^k(1-W_r)^{n-k} \geq 1-\alpha\right\} \qquad (4.7)$$

The $r$-th constraint is satisfied if $\sum_{q \in s} x_{qr} \leq n_r(\alpha)$. Hence the deterministic equivalent program of P*3 is

PD1: $$\min\left\{\sum_{q \in s}\sum_{r=1}^{R} c^*_{qr}x_{qr} \;\middle|\; \sum_{r=1}^{R} x_{qr} = 1,\ q \in s;\ \sum_{q \in s} x_{qr} \leq n_r(\alpha),\ r=1,\dots,R;\ x_{qr} \in \{0,1\}\right\} \qquad (4.8)$$
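To make the passage from a probabilistic constraint to a deterministic one more concrete, the following sketch works through a tiny instance. It is an illustration under explicit assumptions, not the authors' procedure: the count attached to a centroid is taken to be Binomial(n, W_r) when n sites are assigned to it, the chance constraint asks that this count stay at or below b_r with probability at least 1 - alpha, and the resulting capacity is the largest n that satisfies it. The assignment is solved by brute-force enumeration, which is only viable for a handful of sites. All names and data are hypothetical.

```python
import itertools
import math


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p ** i * (1 - p) ** (n - i)
               for i in range(min(k, n) + 1))


def capacity(b, p, alpha, n_max):
    """Largest n <= n_max such that P(Binomial(n, p) <= b) >= 1 - alpha."""
    best = 0
    for n in range(1, n_max + 1):
        if binom_cdf(b, n, p) >= 1 - alpha:
            best = n
        else:
            break  # the probability cannot increase again as n grows
    return best


def cost(site, centroid):
    """Squared Euclidean distance between the index vectors of a site and a centroid."""
    return sum((a - c) ** 2 for a, c in zip(site, centroid))


def assign(sites, centroids, caps):
    """Cheapest assignment of every site to one centroid within the capacities."""
    best_cost, best_plan = math.inf, None
    for plan in itertools.product(range(len(centroids)), repeat=len(sites)):
        if any(plan.count(r) > caps[r] for r in range(len(centroids))):
            continue  # this plan violates a deterministic capacity
        total = sum(cost(sites[q], centroids[r]) for q, r in enumerate(plan))
        if total < best_cost:
            best_cost, best_plan = total, plan
    return best_plan, best_cost


# Hypothetical (Simpson, Fager, Shannon) values for four sites and two centroids.
sites = [(0.70, 0.30, 1.00), (0.72, 0.28, 1.10),
         (0.40, 0.60, 0.50), (0.42, 0.58, 0.55)]
centroids = [(0.70, 0.30, 1.00), (0.40, 0.60, 0.50)]
caps = [capacity(b=2, p=0.5, alpha=0.1, n_max=len(sites)) for _ in centroids]
print(caps, assign(sites, centroids, caps))
```

Once the capacities are fixed, the constraints have the form of a transportation problem (each site supplies one unit, each centroid accepts at most its capacity), which is why a linear relaxation is a natural tool for realistic sample sizes; enumeration is used here only to keep the example dependency-free.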
A relaxation of the integer constraints allows the use of a Transport Model for solving PD1. The zones determined by the solution make it possible to establish a network of monitoring sample points. The effect of the management policies is measured by their capability of maintaining a certain level of diversity (calculated by the indexes). A level is adequate if it supports the sustainability of the forest. Samples will be selected periodically in the network by fixing some sample points in each zone. Because of its homogeneity, the number of sampled points will be small. By comparing the results at different moments, it is possible to evaluate the management of the forest in terms of its sustainability.

5. STUDY OF THE BASIN OF SAN JUAN RIVER
The basin of the San Juan river is located in the north of the Estado de Guerrero, in the Sierra de Guerrero, Mexico. It has a surface area of 33 447 hectares. Of this area, 50% is covered by vegetation, 25% has no vegetation, 30% is cultivated by farmers and the rest are urbanized areas. Logging for timber is considered a cause of deforestation in one of the most important biodiversity reservoirs of Mexico. The basin contains five municipalities. The central government provided funds for developing the project Ordenamiento Ecológico Territorial (Territorial Ecological Ordering). The ecologists identified eight types of vegetation, comprising Tropical and Subtropical woods, secondary vegetation, Cedar woods, Encino woods, Pasture, Palms and Encino-Pine woods. A large sample-based study was carried out by Almazán et al. (2003). They collected 8 570 sample sites scattered
in the basin. The individuals in each sample site were identified and classified into 400 types, which belonged to 103 families and 66 species. Species with fewer than 9 counts were not considered individually; a single group was defined for them. In total, 22 860 individuals were counted and classified. The most frequent species observed were Pinus (ocote), White Encino, Dodonaea viscosus and Brahena Dulcis. The populations of Red Cedar and Black Encino were scarce. A policy for recovering these species should be designed because of their economic interest.

The ecologists considered that the continuation of the study would be feasible by determining a number of zones P ∈ [20, 30]. Monitoring points would be installed in them and samples would be collected periodically in each of them. Therefore the zones are considered as strata. They should be identified by looking for an optimal stratification. Our proposal is used for determining an adequate value of P and the optimal strata. The efficiency of the management policy will be evaluated periodically using estimated diversity indexes. A clear preference for one of them is not based on objective evidence. Using the theoretical frame developed in this paper, we considered assigning different weights to their importance. We considered the sets of weights given in Table 1.

Table 1. Sets of weights of the indexes

Weight set    wS      wF      wSh
(1)           1/3     1/3     1/3
(2)           0.5     0.2     0.3
(3)           0.3     0.2     0.5
(4)           0.5     0.3     0.2
(5)           0.75    0.2     0.05
(6)           0.75    0.05    0.2
(7)           0.2     0.05    0.75

The evaluation of the post-stratification was made considering the relative efficiency obtained by the set of post-strata determined by each weight set. Taking D(v(d), P), d=1,…,7, as the set of indexes of the determined post-strata, the corresponding cost for a certain number of post-strata P is
$$\Gamma(v(d),P \mid z) = \sum_{t=1}^{3}\ \sum_{r \in D(v(d),P)}\ \sum_{q \in U(r)} w_{td}\left(\theta_{qrt} - \hat\theta_{trz}\right)^2 / m = \sum_{t=1}^{3} V_t \qquad (5.1)$$

where $\theta_{qrt}$ is the index $t$, $t=1,2,3$, evaluated in the site $q$ included in the post-stratum $U(r)$; $\hat\theta_{trz}$ is the estimate of the same index using the $m(r)$ sites classified in $U(r)$ under the criterion $z$ ($z$ = Separate, Combined); and $w_{td}$ is the weight assigned to the index $t$ by the vector $v(d)$. The cost of not stratifying is measured by

$$\Gamma(v(d) \mid z) = \sum_{t=1}^{3}\sum_{q=1}^{m} w_{td}\left(\theta_{qt} - \hat\theta_{tz}\right)^2 / m = \sum_{t=1}^{3} V^*_t \qquad (5.2)$$

It is compared with the cost obtained by using the proposed method. The relative efficiency of the post-stratification was measured by the ratios

$$\Gamma'(v(d),P \mid z) = 100\left[1 - \Gamma(v(d),P \mid z)/\Gamma(v(d) \mid z)\right] \qquad (5.3)$$
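The relative efficiency can be computed directly from site-level index values once each site carries a post-stratum label. The sketch below is a simplified illustration: it uses the plain mean of the site values within each group as a stand-in for the separate or combined estimate of the index, which is an assumption made here for brevity, and all names and data are hypothetical.

```python
def weighted_cost(values, groups, weights):
    """Weighted within-group squared deviation per site.

    values[q] is the (Simpson, Fager, Shannon) vector of site q,
    groups[q] its post-stratum label, weights the vector (wS, wF, wSh).
    """
    m = len(values)
    total = 0.0
    for label in set(groups):
        members = [v for v, g in zip(values, groups) if g == label]
        for t, w in enumerate(weights):
            center = sum(v[t] for v in members) / len(members)
            total += w * sum((v[t] - center) ** 2 for v in members)
    return total / m


def relative_efficiency(values, groups, weights):
    """Percentage reduction in cost with respect to not stratifying."""
    unstratified = weighted_cost(values, [0] * len(values), weights)
    stratified = weighted_cost(values, groups, weights)
    return 100.0 * (1.0 - stratified / unstratified)


# Hypothetical index values for four sites split into two post-strata.
values = [(0.70, 0.30, 1.00), (0.72, 0.28, 1.10),
          (0.40, 0.60, 0.50), (0.42, 0.58, 0.55)]
print(relative_efficiency(values, [0, 0, 1, 1], (0.5, 0.2, 0.3)))
```

A value near 100 means the post-strata absorb almost all of the weighted variability, while a value near 0 means that grouping brings no gain over treating the sample as a whole.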
We considered P = 5, 10, 15, 20, 25, 30 in the study. The results obtained when the separate criterion is used are given in Table 2. Note that the use of a set of 15 post-strata seemed to the specialists to be as good as P = 20 and 25; the difference among them was not considered large. These results support the view that the role of the weights is important. Note that when a large importance is assigned to the index of Shannon we observe a decrease in the gain in accuracy due to post-stratification, while the opposite occurs with the index of Simpson.
Table 2. Relative efficiency Γ'(v(d),P|S) of the separate estimators and number of post strata (NP) determined

         d=1         d=2         d=3         d=4         d=5         d=6         d=7
P      Γ'    NP    Γ'    NP    Γ'    NP    Γ'    NP    Γ'    NP    Γ'    NP    Γ'    NP
5     10.6    3   12.3    3   14.6    3   15.1    3   25.2    3   28.5    3   10.0    3
10    18.1    7   11.0    3   18.7    5   20.4    6   29.2    6   31.8    6   10.8    6
15    26.8   11   25.5   11   29.6   12   21.5   12   35.9   11   39.4   10   16.3   11
20    26.5   15   32.1   15   39.5   15   22.1   15   39.3   15   41.5   15   21.7   14
25    26.4   15   34.9   15   39.9   15   23.0   15   48.1   15   51.4   15   22.6   19
30    26.8   18   35.2   15   40.1   16   24.6   15   48.2   15   51.5   15   23.0   19
In Table 3 a similar behavior is observed for P. The use of a larger weight for the index of Shannon is related to smaller gains in accuracy due to the post-stratification derived. In both cases the largest gains are given by d=6, when Simpson's index receives the largest weight and Shannon's is considerably smaller.

Table 3. Relative efficiency Γ'(v(d),P|C) of the Combined estimators and number of post strata (NP) determined

         d=1         d=2         d=3         d=4         d=5         d=6         d=7
P      Γ'    NP    Γ'    NP    Γ'    NP    Γ'    NP    Γ'    NP    Γ'    NP    Γ'    NP
5     16.3    3   17.4    3   10.2    2   13.6    3   24.0    3   33.4    3    7.0    2
10    16.2    7   19.6    5   18.3    7   19.2    7   35.2    5   36.2    8    7.2    7
15    25.9   13   29.2   12   23.2   10   26.4   13   41.4    9   45.8    8    7.8   12
20    25.9   15   33.9   15   24.2   15   34.5   15   41.6   15   48.3   15   11.0   18
25    26.1   15   33.9   15   24.3   15   34.9   15   41.5   15   48.5   15   15.2   21
30    26.9   22   34.1   16   24.6   20   35.1   15   41.8   12   49.3   15   15.2   24
The analysis of the results led the ecologists to accept the post-strata defined by the Combined estimator for P=20 and d=5. Note that the set of weights plays an important role in the determination of the post-strata. However, the ranking of the gains in accuracy across the values of P is similar for the different values of d. The weight vectors are determined by the judgment of the experts; therefore, the gain in accuracy is affected by their expertise. The variability of Shannon's index is larger than that of the other two indexes, and it seriously affects the gains in accuracy due to optimal post-stratification. The analysis of the specialists will determine which set of post-strata is to be used for establishing the nodes of the monitoring network.

References
Albareda-Sambola, M. and E. Fernández (2000): The stochastic generalized assignment problem with Bernoulli demands. TOP, 8, 165-190.
Allende, S.M. and C. Bouza (1987): Optimization criteria for multivariate strata construction. In "Approximation and Optimization" (Lecture Notes in Mathematics 1353), 227-233. Springer, N. York.
Almazán A., Urbán G., González R., Tapia J., Villerías S., Beltrán E. and Almazán M. (2003): Ordenamiento Ecológico territorial de la subcuenca del Río San Juan del Estado de Guerrero. Reporte Técnico de Investigación. SIBEJ-UAGro.
Bouza C. N. and Covarrubias D. (2005): Estimación del índice de diversidad de Simpson en m sitios de muestreo. Inv. Operacional, 26, 186-195.
Bouza C. N. and Schubert L. (2003): The Estimation of the Biodiversity and the Characterization of the Dynamics: An Application to the Study of a Pest. Revista de Matematica e Estatistica, 21, 85-98.
Cochran W. G. (1981): Técnicas de Muestreo. CECSA, Mexico.
Fager E. W. (1972): Diversity: a sampling study. American Naturalist, 106, 293-310.
Gimaret-Carpentier C., Pélissier R., Pascal J. and Houllier F. (1998): Sampling Strategies for the Assessment of Tree Species Diversity. Journal of Vegetation Science, 9, 161-172.
Gove J. H., Patil G. P., Swindel B. F. and Taillie C. (1994): Ecological diversity and forest management. Handbook of Statistics, Vol. 12 (Eds. G. P. Patil and C. R. Rao). Elsevier, N. York.
Hsieh H. L. and Li L. A. (1998): Rarefaction diversity: a case study of polychaete communities using an amended FORTRAN program. Zoological Studies, 37, 13-21.
Kempton R. A. (2002): Species Diversity. In Encyclopedia of Environmetrics, 4, 2086-2092 (Eds. Abbel H. El-Shaarawi and W. Piegorsh). John Wiley & Sons, N. York.
Klein Haneveld W. K. and M. H. van der Vlerk (1999): Stochastic Integer Programming: general models and algorithms. Ann. of Oper. Res., 85, 39-57.
Maradiaga C. F. S. (2004): Desarrollo de un Sistema de Inventario y Monitoreo de Maguey Papalote (Agave cupreata Trel. & Berger) en el estado de Guerrero. Fundación PRODUCE Guerrero A.C., Programa de Recursos Biológicos Colectivos (CONABIO) e Instituto de Investigación Científica Área Ciencias Naturales de la UAGro. Chilpancingo, Gro., México.
Patil G. P. and Taillie C. (1982): Diversity as a Concept and its Measurement. Journal of the American Statistical Association, 77, 548-567.
Shannon C. E. (1948): A mathematical theory of communication. The Bell System Technical Journal, 27, 379-423.
Shiver B. D. and Borders B. E. (2003): Sampling Techniques for Forest Resource Inventory. J. Wiley, N. York.
Simpson E. H. (1949): Measurement of diversity. Nature, 163.