Optimization of sampling schemes for vegetation mapping using fuzzy classification
Rafael Tapia
February 2004
Optimization of sampling schemes for vegetation mapping using fuzzy classification by Rafael Tapia
Thesis submitted to the International Institute for Geo-information Science and Earth Observation in partial fulfilment of the requirements for the degree of Master of Science in Geoinformatics.
Degree Assessment Board Chairman
Prof. Dr. Alfred Stein
First supervisor
Prof. Dr. Alfred Stein
Second supervisor
Dr. Ir. Wietske Bijker
External examiner
Dr. Gerard B.M. Heuvelink
INTERNATIONAL INSTITUTE FOR GEO-INFORMATION SCIENCE AND EARTH OBSERVATION, ENSCHEDE, THE NETHERLANDS
Disclaimer This document describes work undertaken as part of a programme of study at the International Institute for Geo-information Science and Earth Observation (ITC). All views and opinions expressed therein remain the sole responsibility of the author, and do not necessarily represent those of the institute.
Abstract

The multivariate fuzzy-c-means classifier is used to model vegetation distribution from remote sensing data. The classification results in areas with high and low uncertainty, and the confusion index is used to quantify this uncertainty. Subareas with different levels of confusion are then used to constrain the allocation of a fixed number of sample points in a selected study area, using a simulated annealing program that minimizes the mean of shortest distances; in this way optimal sampling is combined with fuzzy classification. Two sites of natural vegetation were selected to test the methodology. The first, located in the Amazonian region of Peru and without field data, was used to propose a sampling scheme for model validation. The proposed measures aim at exploring predictor variables in areas of high uncertainty, or "vague" zones. The second area, Budongo forest in Uganda, was selected for an accuracy assessment. A predictive model with factors related to biodiversity was used, species distribution was modelled under an assumption of high validity, and "hard" areas were selected for the survey. Optimization of the sampling schemes through an almost equilateral triangular design ensures the most even spreading of points required in model-based sampling. Optimization was also obtained by constraining the survey to areas with different levels of uncertainty. Comparison with field data shows little correlation with the hypothesis in the second case. The original field design was in transects, and the lack of coincidence between points resulted in different units and hence in different expected values. The assumption of high validity of the model can also be doubted, considering its poor definition.
Keywords optimization, sampling schemes, vegetation mapping, fuzzy-k-means, SSA, MMSD
Acknowledgements

My first appreciation must be extended to my first supervisor, Prof. Alfred Stein, for the constant guidance he has given me throughout this work, sharing his knowledge generously and challenging me to move forward in the research. I would also like to extend great thanks to Dr. Wietske Bijker, whose ever constructive comments and corrections helped me to shape this study. I wish to thank Dr. John Smith from the Instituto del Bien Comun in Peru, who supported my aspirations to pursue further studies from the very first moment. My extended appreciation goes to all the staff at IBC for their friendship and support, especially to Ermeto for providing me with the necessary data. Special thanks to Ms. Grace Nangendo for sharing her data and time with me. Surveying is a very noble activity: collecting information with great effort in the field and allowing others to benefit from it. Thanks to Arko Lucieer for providing me with his PARBAT software, a promising package for the exploration of uncertainties with a friendly interface, which allowed me to do the present work. My sincere thanks to the members of ITC staff who helped directly and indirectly in the development of this thesis. Special thanks to Prof. Rolf de By for showing us the virtues of LaTeX, to Prof. Theo Bouloucos for his advice and to M.Sc. Sokhon Phem for his kind help. Thanks to Wan Bakx and Gerard Reinink from Geo Technical Support, the department from which I received constant "support" to accomplish my tasks. Thanks to software developer Willem Nieuwenhuis for writing the BNA export in ILWIS, and to Ard Blenke for solving problems that appeared only on my computer. I wish to extend a general appreciation to all the people working at ITC for making our stay in the Netherlands more pleasant, providing me with support for documentation, teaching materials and a hot meal.
I would like to express my sincere appreciation to all my friends at ITC, my Peruvian fellows Helbert, Piero, Brando and Javier, who joined me in this study adventure abroad. To all my friends from the GFM2 course, Blanca, Bikram, Gang, John, Nandini, Paul, Poom, Potjio, Shreeharsha, Tsolmongerel and Yuhua, people from other countries with the same goals: I feel honoured to have worked with and learned from them, and especially to have received their friendship. Finally I want to dedicate this thesis to my family: to my parents Mario and Ana Maria for giving me support throughout my life, to my brothers Nicolas and Maria Elena for their flow of encouragement, and especially to my wife Dora for her love, patience and continuous support in the times when I needed it.
Contents

Abstract i
Acknowledgements iii
List of Tables vii
List of Figures ix

1 Introduction 1
  1.1 Mapping vegetation is uncertain 1
  1.2 Objectives 2
  1.3 Sampling theory 3
  1.4 Improving sampling 3
  1.5 Sampling for kriging and fuzzy-kriging 4
  1.6 Purposes on sampling vegetation 5
  1.7 Fuzzy modelling 6
  1.8 Fuzzy classification using RS derived information 7
    1.8.1 Landform parameters 7
    1.8.2 Vegetation Indices 8
  1.9 Data Fusion Techniques 9

2 Methodology 11
  2.1 Sample space and units 11
    2.1.1 Sampling approach 11
  2.2 Model-based sampling 12
  2.3 Fuzzy sampling scheme 12
  2.4 Stochastic model with image combination 13
  2.5 Fuzzy-c-means classification 14
  2.6 Sensitivity analysis based on entropy and partition 16
    2.6.1 Map of vague zones 17
  2.7 Spatial Simulated Annealing (SSA) 17
    2.7.1 SSA algorithm 19
  2.8 Constrained MMSD program 20
    2.8.1 Input parameters 21
    2.8.2 Point-polygon algorithm 21
    2.8.3 MMSD computation 22
  2.9 Accuracy evaluation 23
    2.9.1 Shannon index 23
    2.9.2 Kriging comparatives 24
  2.10 Methodology diagram 24

3 Data description 27
  3.1 Study areas 27
    3.1.1 Training area 27
    3.1.2 Budongo forest 28

4 Results 31
  4.1 Training area 31
    4.1.1 Sampling scheme 31
    4.1.2 Predictive model 31
    4.1.3 Sensitivity analysis 33
    4.1.4 Map of confusion 35
    4.1.5 Sampling allocation 35
  4.2 Budongo forest 36
    4.2.1 Sampling scheme 37
    4.2.2 Diversity predictive model 38
    4.2.3 Fuzzy classification 39
    4.2.4 Samples for the study area 39
    4.2.5 Accuracy assessment 40
    4.2.6 Regional modelling 43
    4.2.7 Kriging analysis 43

5 Discussion 47
  5.0.8 Applications 48

6 Conclusions 49
  6.1 About the procedure 49
  6.2 About the results 49

Bibliography 51
Constrained MMSD 55

List of Tables

2.1 General fuzzy sampling scheme 12
4.1 Sampling scheme for the training area 31
4.2 Boundaries for the Yanachaga research area, UTM-18S, WGS84 32
4.3 Sampling scheme for a local biodiversity research in Budongo forest 37
4.4 Boundaries for Budongo research area, UTM-36N, Arc1960, Clarke 1880 spheroid 38
4.5 Number of field samples in the study area 41
4.6 Sampling scheme for a regional biodiversity research in Budongo forest 43
4.7 Boundaries for Budongo regional model 43

List of Figures

2.1 Fuzzy-sampling and model-constrained approach 13
2.2 Intersection point of two lines 21
2.3 Equations for µa and µb 22
2.4 Parameter definition for the MMSD program 23
2.5 Fuzzy-sampling workflow diagram 25
3.1 Location of Yanachaga-Chemillen National Park 28
3.2 Location of Budongo forest 29
3.3 Landsat image from 1995 showing burning areas in Budongo forest 29
4.1 Landsat subset for the study area 32
4.2 Thematic images for the training area: a) elevation, b) slope, c) NDVI 33
4.3 H and F scaled for an overlap of 1.5 33
4.4 H and F scaled for an overlap of 2.5 34
4.5 Confusion index maps for an overlap of 1.5 and for 2, 5 and 10 classes respectively 34
4.6 Confusion index map: bright areas show high confusion 35
4.7 Unconstrained disposition of 100 samples 36
4.8 Location of 100 samples over the Yanachaga study area 37
4.9 Landsat subset for the Budongo forest study area, containing field samples 38
4.10 Thematic images for the Budongo tree-diversity model, from left to right and top to bottom: 1) elevation, 2) aspect, 3) slope, 4) greenness, 5) brightness, 6) wetness, 7) NDVI change 39
4.11 Budongo forest: H and F scaled for an overlap of 2.4 40
4.12 Confusion index maps for 4 classes and overlap coefficients of 1.5, 2 and 2.5 respectively 40
4.13 Allocation of samples for the biodiversity model in Budongo forest 41
4.14 Right: samples obtained in the field; left: selection of samples corresponding to the study area 42
4.15 Comparison of positions from field data and the proposed sampling scheme 42
4.16 Right: samples obtained in the field; left: selection of samples corresponding to the study area, with the proposed sample points in red squares 44
4.17 Predicted distribution of values for a) number of species, b) Shannon index, c) evenness index 44
4.18 Kriging with simulated values for hard and vague zone ranges resulting from the fuzzy-sampling procedures 45
Chapter 1

Introduction

1.1 Mapping vegetation is uncertain
Mapping natural vegetation, like mapping other natural resources, is a complex activity. The definition of classes is often subjective, depending on the level of knowledge acquired through observation in the field and on the distribution patterns of the species to be mapped. Major factors influencing the spatial distribution of vegetation are temperature, humidity and soil, among others. Weather and soil conditions are difficult to measure in zones of natural vegetation: the installation and maintenance of ground stations is rarely justified by the records obtained, which are strongly influenced by micro-climate, and in the case of soil sampling accessibility is a major impediment.

Remote sensing is the appropriate surveying tool in such cases, but at present it also has limitations. Sensors giving direct readings of weather conditions still operate at coarse spatial resolutions, trading spatial detail for good temporal resolution. On the other hand, physical properties of the land surface can be measured with new sensors of increasingly high spectral and spatial resolution, yet a good understanding of these natural processes, and reliable predictions from them, have not been fully achieved. This difficulty stems mostly from the multivariate character of the processes involved: many of the predictor variables may not have been properly identified, and hence not measured, either by sensors or directly in the field. Improving our predictions therefore requires a proper set of variables and measurements; in mapping work this translates into proper class definitions and accuracy.

In mapping, spatial correlation is an important predictor that should be considered. However, unlike univariate and relatively stable phenomena, such as the presence of minerals in the soil, to which kriging procedures can be applied, multivariate and more dynamic phenomena, such as the spatial distribution of vegetation, do not yield good approximations.
The "spatial contiguity in fuzzy classes" (Jetten, 1994; Burrough, 1998) [5] that results from the presence of similar environmental conditions at different locations could be the way to deal with such multivariate processes. Fuzzy classification has the advantage of handling many attributes at once, and the resulting uncertainty in class assignment can be used to find a proper sampling scheme. The set of proposed locations is then correlated to a model that improves iteratively each time information about the variables, and their corresponding measurements at each point, is obtained.
1.2 Objectives
The main objective of the study is to determine optimal sampling procedures in the presence of fuzzy classes. Secondary and related objectives are:
• To extract information on vegetation predictors from remote sensing images.
• To propose predictive models for vegetation distribution using image combination.
• To select two study sites for a sensitivity analysis with respect to the parameters of fuzzy-c-means classification and for different types of vegetation research.
• To delineate an optimal sampling criterion using simulated annealing, based on the uncertainty identified by the classification.
• To apply the sampling procedures to determine optimal schemes for vegetation research carried out in natural forest.
• To evaluate accuracy for a hypothetical biodiversity research.
1.3 Sampling theory
In statistical theory, the first step in making inference about a measurable property of a population is to take a sample. A statistic is then a number describing the sample, used to estimate a parameter that applies to the whole population. Difficulties arise because it would be too expensive to observe as many units as the size of the population. The solution is to rely on random variables, governed by the laws of probability, which for unlimited populations lead to normal sampling distributions [16]. In nature, data are also influenced by random effects (read: chance), and in theory it is possible to apply the laws of probability to make inferences about a phenomenon. In this case, however, the number of variables is large, while the population to be analysed may be limited. Models based on random functions can then be used to make inference while keeping the sample size small.
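The basic estimation step can be sketched as follows; the population values and sample size below are hypothetical, chosen only to illustrate how a sample statistic approximates a population parameter:

```python
import random
import statistics

def sample_mean_estimate(population, n, seed=1):
    """Estimate the population mean and its standard error from a
    simple random sample of size n, drawn without replacement."""
    rng = random.Random(seed)
    sample = rng.sample(population, n)
    mean = statistics.mean(sample)
    se = statistics.stdev(sample) / n ** 0.5  # standard error of the mean
    return mean, se

# Hypothetical population of 10,000 values of some measurable property.
population = [random.Random(i).gauss(50.0, 10.0) for i in range(10_000)]
estimate, std_error = sample_mean_estimate(population, n=100)
```

With only 1% of the units observed, the estimate typically falls within a few standard errors of the true population mean, which is the probabilistic basis for inference described above.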
1.4 Improving sampling
The main reason to design statistical sampling schemes is that they guarantee scientific objectivity and the possibility of improving the results by resampling (Stein, 2003) [20]: knowing the locations of the samples, the survey can be improved by sampling at unvisited sites, and accuracy can be checked if necessary. This is also useful when the locations that can be visited are limited by the available resources; surveying in forest, for instance, has many constraints due to difficult accessibility.

The other important element, besides the number of samples, is the sampling unit. The objective of the research determines what is sampled in the field and at which resolution. In the case of vegetation, if distribution maps based on images are to be produced, the final resolution or pixel size of the image indicates the lower bound for the unit size. If a predictive model needs to be validated, the unit must be at the same scale; for studies of tree diversity, for example, a much larger surveyed surface per sample is needed than for studies of soil fauna [20].

As noted above, classical sampling theory suggests that a sampling design must contain some element of randomization to ensure that the estimates obtained from it are unbiased and to provide a probabilistic basis for inference [27]; a fully random design is a simple but elementary example. Every unit in the sample is chosen without regard to any other, and all units have the same chance of selection. If an acceptable precision is achieved by simple random sampling, it may still be possible to obtain the same results with a reduced number of samples, which is one of the aims of improving a sampling scheme. One way to reduce the number of samples, if there is any spatial dependence, is classification (equivalent to statistical stratification): the variance is then transferred to the differences between classes, while the within-class variance of each class or cluster decreases [27].
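The effect of stratification can be illustrated with a minimal sketch; the two strata below are hypothetical vegetation classes with clearly different means, showing that the mean within-stratum variance is much smaller than the overall variance:

```python
import statistics

# Hypothetical measurements from two strata (e.g. two vegetation classes)
# whose means differ strongly.
stratum_a = [10.0, 11.0, 9.5, 10.5, 10.2]
stratum_b = [30.0, 29.0, 31.0, 30.5, 29.5]
pooled = stratum_a + stratum_b

overall_var = statistics.pvariance(pooled)            # variance ignoring classes
within_var = (statistics.pvariance(stratum_a)
              + statistics.pvariance(stratum_b)) / 2  # mean within-class variance
```

Since the precision of a stratified estimate depends on the within-class variance, the same precision can be reached with fewer samples once the classes absorb the between-class variation.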
A common way to improve a sampling scheme, when restricted to a given area and sample size, is systematic sampling, which provides the most even cover of the study area. In one dimension the sampling points are placed at equal intervals along a line, a transect. In two dimensions the points may be placed at the intersections of an equilateral triangular grid for maximum precision or efficiency. With this configuration the maximum distance between any unsampled point and the nearest point on the grid is the smallest possible. The main disadvantage of systematic sampling is that classical theory provides no means of determining the variance or standard error from the sample, because no randomization is present [27].
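A small sketch of an equilateral triangular grid follows; the area dimensions and spacing are hypothetical, and the raster scan is a crude helper that approximates the largest distance from any location to its nearest grid point:

```python
import math

def triangular_grid(width, height, spacing):
    """Points of an equilateral triangular grid covering a width x height area."""
    row_height = spacing * math.sqrt(3) / 2  # vertical distance between rows
    points, row, y = [], 0, 0.0
    while y <= height:
        x = (spacing / 2) if row % 2 else 0.0  # shift every other row
        while x <= width:
            points.append((x, y))
            x += spacing
        y += row_height
        row += 1
    return points

def max_gap(points, width, height, step=2.0):
    """Largest distance from any raster location to its nearest grid point."""
    worst, yy = 0.0, 0.0
    while yy <= height:
        xx = 0.0
        while xx <= width:
            worst = max(worst, min(math.dist((xx, yy), p) for p in points))
            xx += step
        yy += step
    return worst

grid = triangular_grid(100.0, 100.0, 10.0)
```

For a spacing s, interior locations are never farther than s/sqrt(3) (about 5.8 for s = 10) from a grid point, which is the "least maximum distance" property mentioned above.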
1.5 Sampling for kriging and fuzzy-kriging
In spatial sampling, most efforts to improve sampling schemes have been related to interpolation, in the hope that the phenomenon studied shows continuous variation that can be fitted to a model. Yfantis et al. (1987, in [23]) compared the performance of square, equilateral triangular and hexagonal grids. They found that the equilateral triangular grid yielded the most reliable variogram estimation when the nugget was relatively low and the sampling density acceptable; otherwise hexagonal grids yielded the lowest kriging variances. Odeh and McBratney (1990) [18] considered fuzzy memberships to determine optimal spacing between samples in soil studies, and concluded that "the sample variogram of membership of a given class merely indicates the spatial dependence and the range of variation along a transect but does not indicate the physical locations of soils class boundaries". Corsten and Stein (1994, in [23]) showed that nested sampling designs produced inaccurate experimental variograms compared to random and systematic sampling designs.

Later, Van Groenigen (1999) [23] presented Spatial Simulated Annealing (SSA) as a method to optimize spatial sampling schemes. Sampling schemes were optimized at point level, taking into account sampling constraints and preliminary observations. The method was illustrated with two optimization criteria: the first optimized even spreading of the points over a region, whereas the second optimized variogram estimation by minimizing the kriging variance. SSA was shown to be superior to conventional methods of designing sampling schemes: improvements of up to 30% occurred for the first criterion, and an almost complete solution was found for the second. If no prior data from the area are available, a model can be used to determine the optimal sampling grid for point kriging or block kriging, given an accuracy requirement.
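The even-spreading criterion of SSA can be sketched as follows. The study-area size, sample count, cooling schedule and evaluation raster are hypothetical choices for illustration; the fitness function is the mean of shortest distances (MMSD) from every evaluation point to its nearest sample, which SSA minimizes by randomly perturbing one sample at a time under a Metropolis acceptance rule:

```python
import math
import random

rng = random.Random(42)

def mmsd(samples, raster):
    """Mean of the shortest distances from each raster point to its nearest sample."""
    return sum(min(math.dist(p, s) for s in samples) for p in raster) / len(raster)

# Coarse evaluation raster over a hypothetical 100 x 100 study area.
raster = [(x, y) for x in range(5, 100, 10) for y in range(5, 100, 10)]

samples = [(rng.uniform(0, 100), rng.uniform(0, 100)) for _ in range(8)]
initial = mmsd(samples, raster)

best, best_samples = initial, list(samples)
current, temperature = initial, 5.0
for _ in range(400):
    i = rng.randrange(len(samples))
    old = samples[i]
    # Perturb one sample point, keeping it inside the study area.
    samples[i] = (min(100.0, max(0.0, old[0] + rng.gauss(0, 5))),
                  min(100.0, max(0.0, old[1] + rng.gauss(0, 5))))
    candidate = mmsd(samples, raster)
    # Metropolis rule: always accept improvements, sometimes accept worse moves.
    if candidate <= current or rng.random() < math.exp((current - candidate) / temperature):
        current = candidate
        if candidate < best:
            best, best_samples = candidate, list(samples)
    else:
        samples[i] = old  # reject: restore the previous position
    temperature *= 0.95  # cooling schedule
```

Accepting occasional deteriorations early on lets the scheme escape local optima; as the temperature decreases, the configuration settles toward an evenly spread design.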
It has been shown that if the spatial variation is second-order stationary and isotropic, then equilateral triangular grids, or systematic sampling, usually render the most accurate predictions, closely followed by square grids (de Gruijter, 1999) [9]. Under second-order stationarity we assume a constancy of the ensemble, so that all possible values at a point x_i have the same mean, variance and covariance, which depend only on the separation, or lag, and not on absolute positions [27]. In an equilateral triangular grid the separation between values is equidistant, which ensures a constant lag. If point data from the area already exist, a predictive model can be used to find good locations for additional samples based on a contour map of the kriging variance; this is a practical technique with widespread application [9]. Additional sampling is then projected preferably in regions with high variance, as this provides the largest reduction of uncertainty. The technique is only approximative, however, in the sense that it does not lead to an exactly optimal configuration of sampling points. As shown by Verma [26], for studies where data have already been sampled with a classical randomized scheme or following a univariate stratification and are then used in kriging modelling, uncertainties will exist about the corresponding variogram parameters, and predictions are therefore hard to obtain.
1.6 Purposes on sampling vegetation
Samples are taken for many purposes. The recurring aim is to detect changes of values between and within samples. Wollum (1994, in [20]) expresses this objective of sampling as "the interpretation of differences and similarities between two or more measurements". In mining or soil contamination studies the objective is to measure differences in concentrations of elements, sometimes with the final goal of finding the highest concentration of a mineral, whereas in soil classification an objective could be to look for diversity in soil properties. What the objectives have in common is their reliance on proper sampling schemes, which in turn rely on the optimization of certain parameters. For soil contamination the reduction of kriging variance seems to be the optimal objective function, with the disadvantage that a large amount of preliminary data is needed to model the spatial correlation.

In the specific case of vegetation mapping there can be many objectives, among them species distribution, tree diversity and monitoring. If a project plans to sample in the field in areas where little ancillary data are available, a proper set of hypotheses must be formulated based on what exists; in that case it is certain that remotely sensed information and knowledge about the physical properties of the surface will guide the predictions. One hypothesis related to species distribution, justified by several studies, concerns correlation with environmental conditions. Townsend (2000) [21] proposed that in complex ecosystems, such as the mixed deciduous forest of the eastern United States, vegetation communities intergrade along a continuum of multiple environmental gradients. This refers to the "individualistic concept" of plant association first postulated by Gleason (1926) [21], in which the ranges of individual species are distributed independently along environmental gradients (Austin and Smith, 1989) [21].
Townsend [21] also proposes that, although sets of species may be regularly distributed along those gradients and therefore form identifiable associations, patterns in the distribution and abundance of individual species may vary widely according to environmental constraints, disturbance history, and competitive interactions. This uneven distribution of the units affects sampling in the sense that these gradients will not necessarily follow a spatial arrangement that can be used for the allocation of samples. Again, conventional spatial modelling of species distribution through interpolation is very difficult, almost impossible at scales where many variables are involved: from global variables such as weather to very local ones such as soil fertility, and from long-term dynamics such as soil formation to events such as burning activities [17].
1.7 Fuzzy modelling
Recent works have aimed to solve this issue of complexity in vegetation mapping by focussing on classification procedures, particularly on models based on fuzzy classification. Fuzzy sets represent a desirable paradigm for accuracy assessment of natural vegetation classes, because locations on the ground frequently bear resemblance to multiple classes on a map (Gopal and Woodcock, 1994, in [21]). Burrough [5] presented fuzzy logic as an alternative to conventional logic that resolves such paradoxes by admitting partial truths, or class overlap. Fuzziness is not a probabilistic attribute, in which the degree of membership of a set is linked to a statistically defined probability function; rather, it is an admission of the possibility that an individual is a member of a fuzzy set, i.e. of inexactly defined classes. The assessment of the possibility that a pixel belongs to a certain class can be based on subjective, intuitive ('expert') knowledge or preferences [5], but it can also be related to uncertainties due to predictor variables that are not present in the remote sensing data and for which more intensive sampling is necessary.

Lark (1998) [11] presented a method for the generation of classes with a spatially coherent distribution from multivariate data, based on fuzzy clustering of the data followed by spatially weighted averaging of the class memberships within a local neighbourhood. He concluded that fuzzy classification with spatially weighted smoothing of the class memberships is a useful strategy for the formation of spatially coherent regions by multivariate classification. De Bruin and Stein (1998) [4] used fuzzy-c-means clustering of attribute data derived from a digital elevation model (DEM) to model the soil-landscape; their study confirms that fuzzy-c-means clustering of terrain attribute data enhances conventional modelling, as it allows representation of the fuzziness inherent to soil-landscape units.
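The membership computation at the core of fuzzy-c-means can be sketched as follows; the class centres and the three-attribute observation below are hypothetical, and only a single membership evaluation is shown, not the full iterative clustering:

```python
import math

def fcm_memberships(point, centres, m=2.0):
    """Fuzzy-c-means membership of one observation in each class centre.
    m > 1 is the fuzziness (overlap) exponent; m close to 1 gives hard classes."""
    d = [max(1e-9, math.dist(point, c)) for c in centres]  # attribute-space distances
    return [1.0 / sum((d[i] / d[j]) ** (2.0 / (m - 1.0)) for j in range(len(d)))
            for i in range(len(d))]

# Two hypothetical class centres in a three-attribute space
# (e.g. scaled elevation, slope and NDVI).
centres = [(0.2, 0.1, 0.8), (0.7, 0.5, 0.3)]
u = fcm_memberships((0.25, 0.15, 0.75), centres)  # observation close to centre 0
```

Memberships always sum to one over the classes; an observation near a class centre receives a membership close to 1 for that class, while one midway between centres is split between them, which is exactly the partial-truth notion discussed above.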
Combinations of techniques involving fuzzy classification have been carried out as well. In the Guyana rainforest, for instance, a fuzzy-k-means vegetation classification was performed (Jetten, 1994, in [5]) using presence/absence data of the original species. The procedure resulted in three stable classes. However, when combined with kriging it was found that, unlike the pollution classes, they were not spatially well correlated, based on sill-to-nugget ratios above 1 for the fitted variograms, and it was concluded that the spatial distribution of the forest classes is not clear. Maselli et al. (1997) [14], in a study on monitoring vegetation in Mediterranean environments using NDVI derived from sensors with different resolutions, identified an optimum methodology composed of fuzzy classification, mean degradation, and multivariate regression procedures. Townsend [21], using the fuzzy similarity index assessment (FSA) methodology, concluded that it provides results that can be used to report the nature of variation within natural vegetation classes, and that it provides a quantitative measure of vegetation gradients on maps that for practical purposes treat vegetation classes as discrete units. Even so, researchers wishing to use map classifications to study natural ecosystems were advised, as a critical point, to assess the inherent variability of the vegetation in the classification through detailed composition-based measurements.
1.8 Fuzzy classification using RS derived information
Townsend (2000) [21] pointed out, in relation to fuzzy classifications and mixture/subpixel models used to map vegetation, that "they do not necessarily convey the expected variability of the composition of species within classes. Rather, they provide information on variations in spectral signature between pixels and not on differences in the on-the-ground content of mapped vegetation". This is particularly true in studies of vegetation, because a degree of interpretation is lacking in the raw images. Many indices therefore exist to interpret the information carried in digital images that relates to vegetation distribution. For instance, data derived from DEMs can be used to interpret landscape units (Burrough et al. [6]; de Bruin and Stein, 1998 [4]), while a vegetation index can help to summarize vegetation cover. Burrough et al. [7] showed a strong relation between a derived index named M NDVI and topo-climatic classes in the Greater Yellowstone Area in the USA, and concluded that "the fuzzy-k-mean procedure yielded sensible and stable topo-climatic classes that can be used for the rapid mapping of large areas".
1.8.1 Landform parameters
Burrough et al. [6] state that, if each point in a DEM grid is considered as a cell with a value assigned by interpolation, fuzzy-k-means can be used to define landforms. The attributes comprise all 'derived' values that can be computed using only the x, y and z positions typical of a DEM. Burrough et al. used seven derived attributes, calculated with the respective equations: elevation, slope, profile curvature, planimetric curvature, mean wetness index, ridge proximity and annual irradiation. With data corrected for artifacts, and using a sampling approach in the case of large data sets, they computed a fuzzy-k-means classification for an arbitrary number of classes between 2 and 9.
The parameters F and H were then evaluated for each iteration in search of the best number of classes, and finally a description was assigned to each class based on its prominent attributes. At this point it is important to highlight the concept of a hard class: the class to which all observations having their highest membership in that same class are assigned [6]. Observations that cannot be assigned in this way are named "intragrades" and lie between two or more classes.
1.8.2 Vegetation Indices
Vegetation indices are mainly derived from reflectance data from discrete red (R) and near-infrared (NIR) bands, like NDVI or M NDVI. But every band that contains some information related to vegetation could be used, as is the principle in Tasseled Caps.
NDVI. The Normalized Difference Vegetation Index (NDVI) contrasts the intense chlorophyll pigment absorption in the red band against the high reflectance of plant materials in the NIR band, and is expressed by the well-known formula NDVI = (NIR - R)/(NIR + R) (Rouse et al., 1973, in [14]). This is the most widely used index, especially when analyzing data from satellite platforms. Van Straten [25], studying the post-disturbance dynamics of woodland ecosystems in Budongo forest, showed that the NDVI values of the 2002 and 1985 images explained 32% of the changes in species composition through succession. The NDVI change between the image years also corroborated the successional index and explained 28% of the variation in compositional changes.
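To make the formula concrete, a short sketch of a per-pixel NDVI computation with NumPy; the function name and the zero-denominator convention are our own, not part of the original formula:

```python
import numpy as np

def ndvi(nir: np.ndarray, red: np.ndarray) -> np.ndarray:
    """NDVI = (NIR - R) / (NIR + R), computed per pixel.

    Bands are cast to float; pixels where NIR + R == 0 are returned as 0
    to avoid division by zero (a common, though arbitrary, convention).
    """
    nir = nir.astype(np.float64)
    red = red.astype(np.float64)
    denom = nir + red
    out = np.zeros_like(denom)
    np.divide(nir - red, denom, out=out, where=denom != 0)
    return out

# Dense vegetation reflects strongly in NIR and absorbs red: NDVI near +1.
print(ndvi(np.array([200.0, 50.0]), np.array([40.0, 50.0])))
```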
Tasseled Cap. Crist et al. (1986, in [12]) explain the principle: the data structure (of a multispectral image) can be considered a multidimensional hyperellipsoid. The principal axes of this data structure are not necessarily aligned with the axes of the data space (defined by the bands of the input image); they are more directly related to the absorption spectra. For viewing purposes it is advantageous to rotate the N-dimensional space such that one or two of the data structure axes are aligned with the viewer X and Y axes. In particular, the axes that are largest for the data structure produced by the absorption peaks of special interest for the application can be viewed. For example, absorption features related to vegetation, like greenness and wetness, will be the data structure of interest. To maximize the visibility of these data structures, a simple calculation (a linear combination) rotates the data space to present any of these axes. These rotations are sensor-dependent, so the coefficients vary from sensor to sensor.
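The rotation is simply a fixed linear combination of the input bands. The sketch below shows the mechanics with placeholder coefficients; the real coefficient matrix is sensor-specific and must be taken from the published Tasseled Cap literature (e.g. for LANDSAT TM), so the numbers here are illustrative only:

```python
import numpy as np

# Illustrative only: the 3x6 coefficient matrix below is a placeholder,
# NOT the published Landsat coefficients. Each output axis is a weighted
# sum of the six input bands.
COEFFS = np.array([
    [0.30, 0.28, 0.47, 0.56, 0.51, 0.19],      # "brightness" (placeholder)
    [-0.28, -0.24, -0.54, 0.72, 0.08, -0.18],  # "greenness"  (placeholder)
    [0.15, 0.20, 0.33, 0.34, -0.71, -0.46],    # "wetness"    (placeholder)
])

def tasseled_cap(bands: np.ndarray) -> np.ndarray:
    """Rotate a (6, rows, cols) band stack into (3, rows, cols) TC axes."""
    flat = bands.reshape(bands.shape[0], -1)    # (6, N) pixel matrix
    return (COEFFS @ flat).reshape(3, *bands.shape[1:])

stack = np.ones((6, 2, 2))                      # dummy 2x2 six-band image
print(tasseled_cap(stack).shape)                # (3, 2, 2)
```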
1.9 Data Fusion Techniques
Ancillary data used to study natural vegetation can be obtained from many sources. This results in differences of resolution: geometric, spectral and temporal; and there are major axioms about resolution to take into account for a research study [8]:
• Geometry: a pixel should be less than half the size of the smallest detail to be evaluated (i.e. the Nyquist rate). With an image resolution of 15 or 30 m, our objects of interest must be in the order of 30 to 60 m, which in most cases exceeds the size of a single specimen. We therefore need to model the classification at the level of vegetation patches, or use images with better resolution.
• Spectrum: natural forest has a high spectral overlap with exotic trees and permanent crops (e.g. tea, coffee, fruits).
• Time: the spectral information provided by single-date remotely sensed images is often not enough to distinguish among objects with similar reflectance behavior.
Image fusion at pixel level is fusion at the lowest level, referring to the merging of measured physical parameters, and is useful for visual interpretation. Pixel-based methods require certain processing steps. After the images have been corrected for system errors, the data are further radiometrically processed using filters or other algorithms. Next, the data are geocoded or, in some cases, only co-registered to coincide on a pixel-by-pixel basis. The geocoding plays an essential role, because misregistration causes artificial colors or features in multi-sensor data sets which falsify the later interpretation. It includes georeferencing to a map projection and resampling of the image data to a common pixel spacing.
A drawback of conventional image-to-map registration is the difference in appearance of the features used as control or tie points; a better approximation can be achieved by registering to imagery of the same satellite, as in the case of LANDSAT TM, which appeared to provide a more accurate coordinate reference than 1:50,000 scale maps (Welch et al., 1985, in [19]). Better use of the images can be made by fusion at feature level, where decision or interpretation follows segmentation, and value is added to the data by processing each image individually. The information extraction can be done through segmentation procedures to extract objects recognized in the various data sources. From single images like a DEM it is possible to compute parameters related to landforms, and from several images with different dates it is possible to calculate indices and detect changes. Spectral and temporal resolution limitations can be solved using different sources of data and sensors. Radar is especially useful for studies in the rain forest and has been used to improve classification of tropical land covers. For instance, simulation of the level and variation of the backscatter signal using time series can reveal land cover structures (Bijker, 1997 [2]).
Chapter 2
Methodology

2.1 Sample space and Units
In probability, the sample space S of a random phenomenon is the set of all possible outcomes. The name "sample space" is natural in random sampling, where each possible outcome is a sample and the sample space contains all possible samples. To specify S, we must state what constitutes an individual outcome and then state which outcomes can occur (Moore, 2002 [16]). In 2D or 3D space, a location and all the possible values arising from it constitute the sample, and the boundaries of the study area define the sample space. In our vegetation research the units are the data contained in a pixel after the data combination, together with their equivalent in the field: the supports must have the same size. The possible outcomes are the pixels contained in the study area.
2.1.1 Sampling approach
By sampling approach we mean the choice between a design-based and a model-based approach. De Gruijter (1990) [10] made a clear differentiation between these two approaches for spatial sampling by giving a hypothetical example using planar point-sampling. Suppose an errorless measurement of a quantity z is taken at n points in a region A. The values obtained are denoted by z(x_i), where x_i is the vector of coordinates of the i-th sampling point. Either the values or the coordinates (or both) may be random variables. To distinguish between random and fixed components, random variables are written in upper case and non-random ones in lower case. In the design-based approach, the n sample points are randomly selected according to a sampling design p, and the values are fixed. The variables involved can thus be written as z(X_i), i = 1, ..., n. They represent random variables because the locations X_i are random. Whether z(X_i) and z(X_j) for i ≠ j are stochastically independent or not is completely determined by the sampling design, and not by the spatial variation in A. The model-based approach starts with a random function ε that generates random values over A. Sampling at fixed points thus yields, in the given notation, the random variables Z(x_i); Z(x_i) and Z(x_j) for i ≠ j are stochastically dependent, as determined by the random function ε, which is partly specified by the assumed geostatistical model.
2.2 Model-based sampling
The best location for a sample begins by considering the existence of a "spatial dependence" between the variable Z and its position. This dependence must then be fitted to a geostatistical model, after which a systematic sampling scheme can be improved. Two procedures based on stochastic models of the spatial variation can be used:
• Modelling with variograms: if detailed data from certain locations (previous sampling at better resolution) are available, kriging or even fuzzy kriging can be applied to interpolate memberships for the whole area and derive a model with parameters.
• Design with constraints: this basically comprises simulation runs until a better-fitted sampling scheme is obtained, considering constraints of certainty. The technique of Spatial Simulated Annealing (SSA) minimizing the Weighted Means of Shortest Distances (WMSD) is the most relevant.
For studies of vegetation with little or no previous field data available, our option is to base the sampling on a constrained design. Moreover, to handle multivariate studies, our option is to use stochastic models based on fuzzy classification.
2.3 Fuzzy sampling scheme
Before going into the classification and optimization of sampling, a general scheme for the research must be defined. The definition of a sampling scheme based on fuzzy classification includes several items, summarized in table 2.1:

Table 2.1: General fuzzy sampling scheme

Study area: name of the area selected for research.
Purpose of sampling: objectives of the sampling.
Predictive model: previous data available.
Fuzzy-classification parameters: degree of overlap between classes and number of classes resulting from the sensitivity analysis.
Target area: for fuzzy sampling, the target and the constrained areas can change from hard to vague zones depending on the purpose of the sampling.
Measurements: type of measurements to be performed; they can be expressed in measurement scales like the Shannon index, or as predictor variables such as soil properties, micro-climate and land use.
It is important to explain that the proposed fuzzy-sampling methodology is a model-constrained approach with a boolean choice, where the areas excluded from sampling are composed either of hard or of vague zones. We constrain the sampling to vague or uncertain zones when the purpose of sampling is to improve and validate a predictive model for a certain process; there the measurements should be focused on the identification of more predictors and their influence on the process. The criterion applied in this case is that the uncertainty originates mainly from a poor definition of the predictive model. The second option constrains the sampling to the hard or certain zones. The criterion in this case is that the model is good enough to explain the process and that the larger range of variability lies in the hard zones. Additional constraints, such as low accessibility due to lack of roads, health dangers, research terms or budget, can also be included in the sampling scheme to make it more efficient. The results of previous sampling can be included as well.
Figure 2.1: Fuzzy-sampling and model-constrained approach
2.4 Stochastic model with image combination
The predictive model to be used for the vegetation research, and from which the uncertainty constraints are derived, is obtained from the combination of different thematic images as a simple form of data fusion, including landform parameters, spectral indices and change analysis. Landform parameters can be obtained with many software packages and many algorithms. The ENVI software, for instance, uses the methodology developed by Wood [28], which generates Lambertian (shaded relief) surfaces to extract parametric information, including slope, aspect, and various convexities and curvatures. All of the parameters are calculated by fitting a quadratic surface to the digital elevation data for the entered kernel size and taking the appropriate
derivatives. In the software it is possible to change the kernel size, so that multi-scale topographic information can be extracted. In the group of spectral indices we used NDVI and the Tasseled Cap. The Tasseled Cap transformation offers a way to optimize data viewing for vegetation studies and is also available in commercial software. Research has produced three data structure axes that define the vegetation information content:
• Brightness: a weighted sum of all bands, defined in the direction of the principal variation in soil reflectance.
• Greenness: orthogonal to brightness, a contrast between the near-infrared and visible bands; strongly related to the amount of green vegetation in the scene.
• Wetness: relates to canopy and soil moisture.
The Tasseled Cap options are restricted to certain sensors in the available software, because the linear combination of bands is particular to each sensor. ENVI includes the coefficients for images from LANDSAT 4, 5 and 7. Thematic maps of change detection can be computed if images from different dates are available and a good registration is done. Change is expressed in absolute values for negative and positive differences between pixel values. For the classification, the thematic images corresponding to elevation are equally georeferenced, checking for a proper overlap. Then the pixel values need to be stretched to 8-bit format to avoid a bias towards a certain predictor variable. Finally the different derived themes are stacked as separate layers in a multiband BSQ image according to the predictive model.
2.5 Fuzzy-c-means classification
The fuzzy-c-means or fuzzy-k-means algorithm is used in this study to classify images with many attributes (themes); it works by weighting membership values to a number of classes for each pixel. This allows us to derive a more realistic definition of "vague pixels". In FCM the membership values are weighted by their distances to the cluster center (Bezdek, 1995 [1]). The same author made a comparison between results obtained from hard-c-means (HCM) and fuzzy-c-means (FCM) using the Euclidean norm for distance. FCM, as a generalization of HCM, gives the same labelling results for the cluster centers after hardening as the overlap exponent m → 1, and also when the number of classes is small. Burrough (2001) [7] describes the procedure for fuzzy-k-means as follows: fuzzy-k-means uses an iterative procedure that usually starts with an initial random allocation of N objects to K clusters. Given the cluster allocation (expressed in terms of the memberships µ_ic in the range 0-1) and a weight of the attribute values, the cluster center C of the c-th cluster for the j-th attribute x is calculated as:
$$C_{cj} = \frac{\sum_{i=1}^{N} (\mu_{ic})^q \, x_{ij}}{\sum_{i=1}^{N} (\mu_{ic})^q} \qquad (2.1)$$

where the fuzzy exponent q determines the amount of fuzziness or overlap. In the next step, objects are reallocated among the classes according to the relative similarity between objects and clusters. Similarity is expressed by the distance measure d, a diagonal metric in which the attributes are scaled by dividing each distance by the sample variance S_j^2:

$$(d_{ic})^2 = \sum_{j=1}^{K} \left[ \frac{x_{ij} - C_{cj}}{S_j} \right]^2 \qquad (2.2)$$

Then, in ordinary fuzzy-k-means, the membership µ of the i-th object to the c-th cluster is determined by:

$$\mu_{ic} = \frac{\left[(d_{ic})^2\right]^{-1/(q-1)}}{\sum_{c=1}^{K} \left[(d_{ic})^2\right]^{-1/(q-1)}} \qquad (2.3)$$

Burrough et al. (2000) [6] used the value q = 1.5 in this formula because it "appears to result in mutually exclusive k-means classes". Reallocation proceeds by iteration until a stable solution is reached in which similar objects are grouped together in a cluster. Once the sample variances and the optimal class centroids for the sample have been computed, unsampled objects (cells) can also be assigned membership values using the above equations. For our application the thematic maps included in the stochastic model constitute the attributes, and the computation was done using all the pixel values in the image, with the exception of the cluster center itself to avoid null distances. As a difference from the method used by Burrough et al. [7], where only a sample of pixels from a large image was used in order to reduce computation, the distance of equation 2.2 is simplified to:

$$(d_{ic})^2 = \sum_{j=1}^{K} (x_{ij} - C_{cj})^2 \qquad (2.4)$$
In this case we avoid bias by previously scaling each of the themes equally to an 8-bit format. The PARBAT software developed by Arko Lucieer (2003) [13] was used to perform an unsupervised fuzzy-c-means classification. The algorithm used in this software complies with the procedures developed by Bezdek (1995) [1] for fuzzy-c-means. The software produces four kinds of files:
1. hard classes based on the highest membership,
2. membership values for each class,
3. the computed confusion index, and
4. the computed entropy for each pixel.
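For illustration, a minimal sketch of the fuzzy-c-means iteration described by equations 2.1, 2.3 and the simplified distance of equation 2.4. The random initialization, fixed iteration count and the small floor guarding against null distances are our own choices; PARBAT's actual implementation may differ in such details:

```python
import numpy as np

def fuzzy_c_means(x, k, q=1.5, iters=50, seed=0):
    """Minimal fuzzy-c-means on data x of shape (N, n_attrs).

    Centers are membership-weighted means (eq. 2.1), distances are
    unscaled squared Euclidean (eq. 2.4), and memberships follow the
    inverse-distance formula with fuzzy exponent q (eq. 2.3).
    """
    rng = np.random.default_rng(seed)
    n = x.shape[0]
    u = rng.random((n, k))
    u /= u.sum(axis=1, keepdims=True)            # rows sum to 1
    for _ in range(iters):
        w = u ** q
        centers = (w.T @ x) / w.sum(axis=0)[:, None]               # eq 2.1
        d2 = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(-1)  # eq 2.4
        d2 = np.maximum(d2, 1e-12)               # guard against null distances
        inv = d2 ** (-1.0 / (q - 1.0))
        u = inv / inv.sum(axis=1, keepdims=True)                   # eq 2.3
    return u, centers

# Two well-separated 1-D clusters: memberships harden towards 0 and 1.
data = np.array([[0.0], [0.1], [10.0], [10.1]])
u, c = fuzzy_c_means(data, k=2)
print(np.round(u, 2))
```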
2.6 Sensitivity analysis based on entropy and partition
With the resulting images of membership values, confusion index and entropy computed with PARBAT, an analysis of the best parameters for overlap and number of classes was carried out in order to obtain a significant classification. Van der Wel (1997) [22] proposes to use entropy for a detailed exploration of the uncertainties underlying a remote-sensing classification. Built on the notion of weighted uncertainty, this measure originates from information theory. For a pixel in a remote-sensing classification, viewed as a statistical variable C, the uncertainty in class C_i is defined as:

$$-\log_2 Pr(C = C_i \mid x) \qquad (2.5)$$
for i = 1, ..., n, where x denotes the available data; the uncertainty is measured in bits of information. Generally, the true class of the pixel is not known and, as a consequence, the amount of information required to reveal the pixel's class is unknown. The entropy of the pixel is therefore defined as the expected information content of a piece of information that would reveal its true class. To this end, the entropy measure combines the uncertainties in the various classes of the pixel by weighting them by their probabilities:

$$-\sum_{i=1}^{n} Pr(C = C_i \mid x) \cdot \log_2 Pr(C = C_i \mid x) \qquad (2.6)$$
The pixel’s entropy is minimal if the uncertainty as to its true class has been resolved. Thus, if P r(C = Ci |x) = 1 for some class Ci , 1 ≤ i ≤ n, that is, if class Ci has been establish with perfect accuracy, then the entropy equals zero and there is no further information required to reveal the pixel’s true class. In the PARBAT software the value of entropy for each pixel Ei is calculated in a similar way as equation 2.6, with the difference that the pixel value is weighted by a membership value instead of the probability: Ei = −
K X
µij · log2 · µij
(2.7)
j=1
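Equation 2.7 is straightforward to compute per pixel; the sketch below uses the usual convention 0 · log 0 = 0, which is our assumption rather than something stated in the text:

```python
import numpy as np

def pixel_entropy(memberships: np.ndarray) -> np.ndarray:
    """Per-pixel entropy E_i = -sum_j mu_ij * log2(mu_ij)  (equation 2.7).

    `memberships` has shape (..., K); zero memberships contribute 0,
    using the convention 0 * log(0) = 0.
    """
    mu = np.asarray(memberships, dtype=float)
    term = np.where(mu > 0, mu * np.log2(np.where(mu > 0, mu, 1.0)), 0.0)
    return 0.0 - term.sum(axis=-1)

print(pixel_entropy([1.0, 0.0]))      # certain pixel: entropy 0 bits
print(pixel_entropy([0.5, 0.5]))      # maximally confused: 1 bit
```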
Burrough (2000) [6] repeated the fuzzy-c-means classification for a range of numbers of classes and determined the optimal number of classes using two parameters that express the overall fuzziness of the classification: the partition coefficient F and the classification entropy H, expressed as:

$$F = \frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{K} (\mu_{ic})^2, \qquad 1/K \le F \le 1, \qquad (2.8)$$

$$H = \frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{K} -\mu_{ic} \ln(\mu_{ic}), \qquad 1 - F \le H \le \ln(K).$$
The F ratio is comparable to the F-ratio of the pooled within-cluster variance and the between-cluster variance, and is closest to 1 for the most significant clustering.
Both F and H depend on the number of clusters K; Bezdek et al. (1984, in [7]) therefore proposed to use scaled values:

$$F_{scaled} = \frac{F - 1/K}{1 - 1/K}, \qquad H_{scaled} = \frac{H - (1 - F)}{\ln(K) - (1 - F)} \qquad (2.9)$$
Burrough et al. concluded that the success of the classification was determined by a maximum partition coefficient F and a minimum entropy H, giving the most appropriate number of classes.
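The partition coefficient, classification entropy and their scaled versions (equations 2.8 and 2.9) can be computed directly from a membership matrix; a small sketch, again with the zero-membership convention 0 · ln 0 = 0 as our own assumption:

```python
import numpy as np

def partition_stats(u: np.ndarray):
    """Partition coefficient F, classification entropy H and their scaled
    versions (equations 2.8 and 2.9) from a membership matrix u of shape
    (N, K), each row summing to 1."""
    n, k = u.shape
    f = (u ** 2).sum() / n                                         # eq 2.8
    h = -(np.where(u > 0, u * np.log(np.where(u > 0, u, 1.0)), 0.0)).sum() / n
    f_scaled = (f - 1.0 / k) / (1.0 - 1.0 / k)                     # eq 2.9
    h_scaled = (h - (1.0 - f)) / (np.log(k) - (1.0 - f))
    return f, h, f_scaled, h_scaled

# A crisp (hard) partition gives F = 1 and H = 0: the most significant
# clustering by the criterion described in the text.
crisp = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
print(partition_stats(crisp))
```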
2.6.1 Map of Vague Zones
The resulting images were used to derive a map of "vague zones", meaning the regions where the uncertainty, expressed in terms of the confusion index (CI) and entropy, is maximal. The ratio of the first sub-dominant to the dominant membership value for each object is a useful index of the degree of class overlap. Burrough et al. [6] termed this the Confusion Index, represented as:

$$CI = \mu_{i,max2} / \mu_{i,max1} \qquad (2.10)$$

Mapping the confusion index may indicate parts of the landscape where spatial change between classes is clear and abrupt, or diffuse and vague. In order to facilitate computations, the image of confusion indices was then polygonized based on a threshold of CI = 0.5. This hardening resulted in two different zones, confusion zones and hard zones, which were then used as constrained areas in the sampling optimization. The ILWIS software was used for the polygonization, and the map containing the polygons was exported to BNA format.
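The confusion index of equation 2.10 and the CI = 0.5 hardening can be sketched as follows; the function names are ours, and the actual computation in the study was done with PARBAT and ILWIS:

```python
import numpy as np

def confusion_index(memberships: np.ndarray) -> np.ndarray:
    """CI = mu_max2 / mu_max1 (equation 2.10) for memberships of shape
    (..., K): ratio of the second-highest to the highest membership."""
    s = np.sort(memberships, axis=-1)
    return s[..., -2] / s[..., -1]

def vague_mask(memberships: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Hardening used in the text: pixels with CI >= threshold are 'vague';
    the rest belong to the hard zone."""
    return confusion_index(memberships) >= threshold

mu = np.array([[0.90, 0.05, 0.05],     # confident pixel, CI ~ 0.056
               [0.45, 0.40, 0.15]])    # confused pixel,  CI ~ 0.889
print(vague_mask(mu))                  # [False  True]
```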
2.7 Spatial Simulated Annealing (SSA)
Simulated annealing (SA) (Aarts and Korst, 1989, in [23]) is a combinatorial optimization algorithm originating from statistical physics. It is also known as Monte Carlo annealing, statistical cooling, probabilistic hill-climbing, stochastic relaxation, and the probabilistic exchange algorithm. It is based on an analogy taken from thermodynamics (Michalewicz, 2000) [15]: to grow a crystal, you start by heating the raw materials to a molten state, and then reduce the temperature of this crystal melt until the crystal structure is frozen in. If the cooling is done too quickly, irregularities are locked into the crystal structure and the trapped energy level is much higher than in a perfectly structured crystal. In many studies SA has been applied successfully as a universal optimization method. Related algorithms have been applied to the optimization of spatial sampling and to the restoration of degraded images, because of its insensitivity to local extremes [23].
In Spatial Simulated Annealing (SSA) the coordinates of the sample points are the variables to be optimized with respect to the cost function φ, which represents the loss of energy, or cooling, while the substance changes from one state to another (solidification). As we saw in section 2.2, SSA has the main advantage of handling different optimization criteria; we can mention four of them tested by Groenigen (1999) [24]:
• perfect random distribution,
• anisotropy where there is spatial variation,
• spatial interpolation, minimizing the kriging variance,
• reduced health risk.
Setting an optimality criterion in advance makes SSA insensitive to local maxima and minima; hence it has been widely applied in geostatistics for the simulation of ReV's, reproducing variograms and combining stochastic simulation at the same time. SSA is thus especially useful in studies with many sampling constraints. The Minimization of the Mean of Shortest Distances (MMSD) criterion used in SSA (Groenigen, 1999) [23] aims at a regular spreading of all sampling points over the sampling region. Regular spreading can be formulated as minimizing the expectation of the distance between an arbitrarily chosen point within the region and its nearest sampling point. For sampling scheme S, minimizing this expectation leads to the following minimization function:

$$\min_S \int_A \| \vec{x} - V_S(\vec{x}) \| \, d\vec{x} \qquad (2.11)$$

where \(\vec{x}\) is a two-dimensional location vector and \(V_S(\vec{x})\) denotes the location vector of the nearest sampling point \(\vec{x}_i \in S\). An equilateral triangular grid optimizes this criterion in theory; furthermore, with MMSD the schemes are improved at the borders, because a finite sampling area is considered. The methodology developed by [23] for Spatial Simulated Annealing with the MMSD criterion was used. First, a fixed number n of 100 observations was randomly distributed and the location of these observations x_1, ..., x_n was established. Then a well-defined, quantitative criterion φ(S) is optimized to reach a final sampling scheme S = {x_1, ..., x_n}. A central concept in SA is the fitness function φ(S) that has to be optimized. Suppose that we can define a combinatorial optimization problem in which φ(S) has to be minimized. Starting with S_0, let S_i and S_{i+1} represent two solutions with fitness φ(S_i) and φ(S_{i+1}), respectively. Typically, S_{i+1} is derived from the neighborhood of S_i by a random perturbation of one of the variables of S_i. A probabilistic acceptance criterion decides whether S_{i+1} is accepted or not. The probability P_c(S_i → S_{i+1}) of S_{i+1} being accepted can be described as:

$$P_c(S_i \to S_{i+1}) = \begin{cases} 1, & \text{if } \phi(S_{i+1}) \le \phi(S_i) \\ \exp\!\left(\dfrac{\phi(S_i) - \phi(S_{i+1})}{c}\right), & \text{if } \phi(S_{i+1}) > \phi(S_i) \end{cases} \qquad (2.12)$$
where c denotes a positive control parameter. The parameter c is lowered according to a cooling schedule as the process evolves, to find the global minimum. A transition takes place if S_{i+1} is accepted. Next a solution S_{i+2} is derived, and the probability P_c(S_{i+1} → S_{i+2}) is calculated with an acceptance criterion similar to the equation above. A mathematical description of the SA algorithm is given by the theory of finite Markov chains (Seneta, 1981). At each value of c, several transitions have to be made before the annealing can proceed and c can take its next value.
2.7.1 SSA algorithm
Sacks and Schiller (1988) proposed several Simulated Annealing-related algorithms for optimizing sampling schemes using geostatistical criteria. Although the research of Groenigen (1999) [23] was related to this proposed method, some crucial differences were developed. In order to modify simulated annealing for the optimization of spatial sampling, the fitness function, a generation mechanism and the cooling scheme have to be decided upon (Aarts and Korst, 1989, in [23]).

Fitness function. Let the total research region be denoted by A_R and the sub-region that can be sampled by A_S ⊂ A_R, thus excluding roads, houses etc. Next, the MMSD criterion is estimated by the fitness function φ_MMSD(S), which is an estimator of the function formulated in equation 2.11:

$$\phi_{MMSD}(S) = \frac{1}{n_e} \sum_{j=1}^{n_e} \| \vec{x}_j^{\,e} - V_S(\vec{x}_j^{\,e}) \| \qquad (2.13)$$

with \(\vec{x}_j^{\,e} \in A_R\) denoting the j-th evaluation point. The n_e evaluation points are located on a finely meshed grid over the whole area. In order to yield a reliable value of φ_MMSD(S), the number of evaluation points should be much higher than the number of sampling points. By choosing the evaluation points on a finely meshed grid over the whole region A_R, while locating the sampling points strictly in A_S, the algorithm spreads the sampling points optimally over the whole region A_R while taking physical sampling constraints into account. This was an important difference with the method of Sacks and Schiller, which could not handle such sampling constraints.

Generation mechanism. The aim of a generation mechanism is to generate a new solution S_{i+2} out of the solution S_{i+1} by means of a random perturbation in one of the variables of S_{i+1} (Davis, 1990, in [23]). In SSA this is done by moving one randomly chosen sampling point \(\vec{x}_i\) over a vector \(\vec{h}\), with the direction of \(\vec{h}\) drawn randomly and \(\|\vec{h}\|\) taking a random value between 0 and h_max. One of the modifications of SSA as compared to ordinary SA and the method of Sacks and Schiller is that h_max
initially is equal to half the length of the sampling region, and decreases with time. This increases the efficiency of the demanding recalculations after each modification of the sampling scheme, because it can be expected that, as the SSA process advances, successful modifications consist of increasingly smaller values of \(\|\vec{h}\|\). This is because the process deals with many similar variables (i.e. the coordinates of the sampling points); therefore, moving sampling points randomly over large distances will not contribute much to finding the minimum towards the end of the optimization process. Furthermore, contrary to earlier optimization methods, the coordinates of the sampling points are treated as continuous variables rather than chosen from a discrete grid. This is in line with earlier studies where SA was applied to continuous problems (Bohachevsky, 1986; Vanderbilt and Louie, 1984; in [23]). At the end of the process the control parameter h_max will be almost equal to zero.

Cooling schedule. For the cooling schedule, which expresses c as a function of the progress of the optimization, [23] used a basic set of empirical rules that have been proposed in many studies. Starting from a random configuration, we choose an initial value c_0 for which 95% of the initial transitions are accepted, i.e. an acceptance ratio of 0.95 or higher for alternative solutions. The decrement of c is given by

$$c_{k+1} = \alpha \cdot c_k, \qquad k = 1, 2, \ldots \qquad (2.14)$$

with 0 < α < 1. The maximum period of time for one Markov chain k to remain at any value of c is fixed, and the final value of c is given explicitly to the SSA algorithm. From these data α can be calculated. The acceptance criterion is similar to the one given in equation 2.12, substituting φ_MMSD for φ. The use of a variable c, which ensures that inferior solutions are accepted with decreasing probability as the process evolves, is the most important difference from the algorithm proposed by Sacks and Schiller.
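A toy version of SSA with the MMSD fitness may clarify how equations 2.12 to 2.14 interact. The unit-square region, the evaluation-grid density, the step count and the decay rates below are arbitrary choices for illustration, not the parameters used in the thesis:

```python
import math
import random

def mmsd(samples, eval_points):
    """Fitness phi_MMSD (equation 2.13): mean distance from each
    evaluation point to its nearest sampling point."""
    total = 0.0
    for ex, ey in eval_points:
        total += min(math.hypot(ex - sx, ey - sy) for sx, sy in samples)
    return total / len(eval_points)

def ssa_mmsd(n_samples=8, steps=1500, c0=1.0, alpha=0.996, seed=1):
    """Toy SSA on the unit square: perturb one randomly chosen point per
    step and accept with the criterion of equation 2.12, while the control
    parameter c (equation 2.14) and the jump radius h_max decay."""
    rng = random.Random(seed)
    grid = [(i / 14, j / 14) for i in range(15) for j in range(15)]  # eval mesh
    s = [(rng.random(), rng.random()) for _ in range(n_samples)]
    phi, c, h_max = mmsd(s, grid), c0, 0.5
    for _ in range(steps):
        k = rng.randrange(n_samples)
        x, y = s[k]
        cand = list(s)
        cand[k] = (min(1.0, max(0.0, x + rng.uniform(-h_max, h_max))),
                   min(1.0, max(0.0, y + rng.uniform(-h_max, h_max))))
        phi_new = mmsd(cand, grid)
        # Equation 2.12: always accept improvements, sometimes accept worse.
        if phi_new <= phi or rng.random() < math.exp((phi - phi_new) / c):
            s, phi = cand, phi_new
        c *= alpha       # cooling schedule, equation 2.14
        h_max *= 0.999   # shrinking perturbation radius
    return s, phi

points, fitness = ssa_mmsd()
print(len(points), round(fitness, 3))
```

As the text notes, the optimized configuration tends towards an even, roughly triangular spreading over the region.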
2.8 Constrained MMSD Program
To apply the methodology developed by Groenigen and Stein (1999) [23] for Spatial Simulated Annealing with the MMSD criterion, including constraints of uncertainty, a program called "CONSTRAINED MMSD" was written in the PASCAL language, based on the algorithms for MMSD and point-polygon intersection provided by Alfred Stein. The complete code is included in Appendix A. The procedures included in the program are:
• Reading of the BNA file containing the constrained areas,
• Input of the parameters for the study area: number of rows, columns, support size and boundary coordinates.
• Creation of a fixed number n of 100 observations randomly distributed in the sampling area: polygons with ID = 1.
• Selection of a sufficiently large number of test points inside the sampling area.
• Definition of sampling locations based on the minimization of the mean of shortest distances (MMSD).
2.8.1 Input parameters
After reading the BNA file and identifying the islands, the program requests some parameters for the subsequent computations: the number of rows, the number of columns and the support size. These values are used to create the meshed grid of test points, which must result in a sufficiently large number of points. The resolution of this grid can be varied, but must always coincide with the total size of the study area. It is also necessary to input the boundaries of the image in UTM coordinates, so that the final positions of the samples are presented in the same coordinate system and the allocation is computed for the selected study area; the program accepts easting values with 6 digits and northing values with 6 to 7 digits.
2.8.2 Point-polygon algorithm
To detect whether a sample or a test point is inside a constrained polygon, an algorithm based on line intersections was used. Each segment of the polygon is tested for intersection, and the number of intersections determines the relative position: an odd number of intersections if the point is inside, and an even number if it is outside. The algorithm presented by Bourke (1989) [3] to detect the intersection point of two lines in two-dimensional space was used. The idea is explained in figure 2.2, where P2 is the point to be tested, P1 is an external fixed point and P3-P4 is the segment of the polygon.
Figure 2.2: intersection point of two lines
To find the intersection point, two parameters µ_a and µ_b are included in the corresponding equations for each line. After solving the equations simultaneously, the two parameters can be calculated as follows:

Figure 2.3: equations for µ_a and µ_b

If the intersection of line segments is required, it is only necessary to test whether µ_a and µ_b lie between 0 and 1. If either one lies within that range, the corresponding line segment contains the intersection point; if both lie within the range 0 to 1, the intersection point is within both line segments [3]. An additional case, where the two lines are parallel or coincident, was also treated in the program as a no-intersection case, occurring when the equation denominators for µ_a and µ_b are equal to 0. This procedure of testing the relative position of a point was used in three steps of the program to ensure that the sample points lie inside the target area:
1. creating random sampling points inside the target area,
2. selecting test points for the sample space,
3. random movement of the sampling points inside the target area during the MMSD procedure.
To compute the cases of polygons inside other polygons ("islands"), a procedure with counters of legal and illegal intersections was implemented.
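A sketch of the two procedures, using Bourke's µ_a/µ_b formulation for segment intersection and the odd/even crossing count. The fixed external point is our own choice and should be picked so that the ray avoids polygon vertices, a degenerate case this simple counter does not handle:

```python
def segments_intersect(p1, p2, p3, p4):
    """Bourke's test: solve for the parameters mu_a and mu_b of the two
    parametric line equations; the segments intersect iff both lie in
    [0, 1]. Parallel or coincident lines (zero denominator) count as no
    intersection, as in the program described in the text."""
    (x1, y1), (x2, y2), (x3, y3), (x4, y4) = p1, p2, p3, p4
    denom = (y4 - y3) * (x2 - x1) - (x4 - x3) * (y2 - y1)
    if denom == 0:
        return False
    mu_a = ((x4 - x3) * (y1 - y3) - (y4 - y3) * (x1 - x3)) / denom
    mu_b = ((x2 - x1) * (y1 - y3) - (y2 - y1) * (x1 - x3)) / denom
    return 0 <= mu_a <= 1 and 0 <= mu_b <= 1

def point_in_polygon(point, polygon, outside=(-1e6, -7e5)):
    """Crossing test: count intersections of the segment from a far
    external point to `point` with every polygon edge; odd means inside."""
    n = len(polygon)
    crossings = sum(
        segments_intersect(outside, point, polygon[i], polygon[(i + 1) % n])
        for i in range(n))
    return crossings % 2 == 1

square = [(0, 0), (10, 0), (10, 10), (0, 10)]
print(point_in_polygon((5, 5), square))    # True
print(point_in_polygon((15, 5), square))   # False
```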
2.8.3 MMSD computation
As explained in section 2.7.1, the MMSD algorithm consists of three essential parts, described here as they are performed in the program. Generation mechanism: the position of one sampling point at a time is moved randomly over the sampling area, and the distance from each test point to each sampling point after the movement is computed. The shortest distance from each test point to a sample is selected and summed; a mean distance is then calculated and evaluated in the fitness function. Fitness function: better allocations of the sampling points are obtained each time the mean of the shortest distances to the test points decreases. New configurations are accepted not only when this value decreases, but also by evaluation against a distribution function with a decreasing probability of acceptance.
Cooling schedule: this includes a control parameter c defining the probability of acceptance of new configurations against a uniform random distribution. At the same time the algorithm restricts the movement of a point to at most half the range of the sampling area and decreases the movement distance at each iteration.
Figure 2.4: Parameter definition for the MMSD program
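The three parts described above can be illustrated with a minimal simulated-annealing sketch. This is Python, not the thesis' Pascal program, and the decay factors for the movement range and the acceptance temperature are assumptions made for illustration.

```python
import math
import random

def mean_shortest_distance(samples, test_points):
    """Fitness criterion (MMSD): the mean, over all test points, of the
    distance to the nearest sampling point."""
    total = 0.0
    for tx, ty in test_points:
        total += min(math.hypot(tx - sx, ty - sy) for sx, sy in samples)
    return total / len(test_points)

def mmsd_anneal(samples, test_points, extent, iterations=600, seed=1):
    """Perturb one point at a time; always accept improvements, accept worse
    schemes with a decreasing probability, and shrink the movement range."""
    rng = random.Random(seed)
    current = [list(p) for p in samples]
    f_cur = mean_shortest_distance(current, test_points)
    best, f_best = [p[:] for p in current], f_cur
    step = extent / 2.0          # start by moving up to half the area range
    temp = f_cur * 0.1           # initial acceptance temperature (heuristic)
    for _ in range(iterations):
        i = rng.randrange(len(current))
        old = current[i][:]
        current[i][0] = min(extent, max(0.0, old[0] + (rng.random() - 0.5) * step))
        current[i][1] = min(extent, max(0.0, old[1] + (rng.random() - 0.5) * step))
        f_new = mean_shortest_distance(current, test_points)
        if f_new < f_cur or rng.random() < math.exp(-(f_new - f_cur) / max(temp, 1e-12)):
            f_cur = f_new
            if f_new < f_best:
                best, f_best = [p[:] for p in current], f_new
        else:
            current[i] = old     # reject: restore the previous position
        step *= 0.995            # cooling: reduce the movement range
        temp *= 0.995            # cooling: reduce the acceptance probability
    return best, f_best

# toy run: 5 samples on an 8 x 8 area with an 8 x 8 grid of test points
test_pts = [(x + 0.5, y + 0.5) for x in range(8) for y in range(8)]
rng0 = random.Random(0)
init = [(rng0.random() * 8, rng0.random() * 8) for _ in range(5)]
opt, f_opt = mmsd_anneal(init, test_pts, extent=8.0)
```

The constrained version additionally rejects any move that places a point outside the target polygons, using the point-in-polygon test of section 2.8.2.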
2.9 Accuracy evaluation
2.9.1 Shannon index
For the evaluation of tree-diversity measurements the Shannon index was chosen; it is a commonly used index to express the diversity of a sample or quadrat as a single number. It indicates the abundance and the evenness of spread of the species. It is calculated from the equation:

H' = − Σ_{i=1}^{s} p_i ln p_i    (2.15)

where
s : the number of species
p_i : the proportion of individuals, or the abundance of the ith species expressed as a fraction of the total cover.

The formula starts with a negative sign to cancel out the minus sign created when taking logarithms of the proportions. Values of the index usually lie between 1.5 and 3.5, but in exceptional cases the value can exceed 4.5 (Kent & Coker, 1996; Nangendo, 2000 [17]).
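The Shannon index above, together with the evenness index used later for the kriging comparison, can be computed with a small illustrative sketch:

```python
import math

def shannon_index(counts):
    """H' = -sum(p_i * ln p_i) over the s species present; `counts` holds
    the abundance (or cover) of each species in the plot."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)

def evenness(counts):
    """Pielou's evenness J = H' / ln(s): 1 when all species are equally
    abundant, lower when a few species dominate."""
    s = sum(1 for c in counts if c > 0)
    return shannon_index(counts) / math.log(s) if s > 1 else 0.0

# four equally abundant species: H' = ln 4, J = 1
print(shannon_index([10, 10, 10, 10]))
print(evenness([10, 10, 10, 10]))
```

A plot dominated by a single species gives H' = 0, matching the intuition that the index rewards both richness and evenness.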
2.9.2 Kriging comparatives
Some kriging interpolations were applied to the data in order to analyze the results and make a comparison with common procedures.
2.10 Methodology diagram
A workflow including all the steps of the fuzzy-sampling methodology is presented in figure 2.5.
Figure 2.5: Fuzzy-sampling workflow diagram
Chapter 3
Data description
3.1 Study Areas
Two sites were selected to apply the fuzzy-sampling methodology with different approaches. The first site is a subarea on the border of the Yanachaga-Chemillen national park in Peru, for which no previous studies were available. A simple predictive model using a combination of thematic images was proposed; the purpose of sampling was to validate the model, and therefore the target area for the allocation of samples was the uncertain zones. This area was also used to test the methodology and to carry out a sensitivity analysis for the fuzzy classification, varying the classification parameters. The site was named the ”training area”. For the second approach a site with existing data was selected, Budongo forest national park in Uganda. A predictive model based on previous studies was proposed; assuming that it has good validity, the model should indicate the presence of higher values of tree diversity in the sampling space. The sampling space, area for sampling or simply target area, is in this case constituted by the hard zones. An accuracy assessment was performed with the results obtained from this area.
3.1.1 Training Area
The Yanachaga-Chemillen national park is situated in the Amazon region. The selected spot is located at the far west of the Amazonia, in the Andean mountain range of Peru. It is especially suitable for a multivariate classification because it has a large range of elevation (500 to 4500 m.a.s.l.) and different densities of vegetation cover. The data available for this site includes:
• a Landsat TM 5 image, path-row 7-067, taken on 1999-08-05;
• contour lines at 50 m intervals, digitized from national topographic maps at 1:100,000 scale.
Figure 3.1: Location of Yanachaga-Chemillen National park
3.1.2 Budongo Forest
Budongo forest national park is located in the western part of Uganda, near Lake Albert. Uganda lies between the drier eastern savannas and the western rain forests of central Africa, and has 4500 recorded species of flowering plants. Budongo is a reserve of natural closed forest surrounded by grasslands, with an altitude range between 600 and 1200 m. It has an important tree diversity; Nangendo [17] recorded 105 species on a selected site of 2 x 1.5 km with a combined coverage of forest and grass. The Shannon diversity indices calculated for the samples taken in that study yielded values up to H = 2.3. Data available for the area was supplied by Grace Nangendo and consists of:
• Landsat 5 TM multispectral 172/59 (14/01/85)
• Landsat 5 TM multispectral 172/59 (26/01/95)
• Landsat 7 ETM P+MS 172/59 (06/02/02)
• a DEM with 30 m resolution for the same area
• sets of sampled data with 593 records (260+333)
The records correspond to plots measured in recent campaigns by Nangendo (260 samples) and Oliver (333 samples). They yield maximum Shannon index values up to H = 2.5. The study [17] indicates that values of tree diversity in the area were influenced by burning activities; figure 3.3 shows a large column of smoke registered in a Landsat image. It was also proposed that topographic factors like elevation and slope influence the species distribution, as well as canopy and humidity.

Figure 3.2: Location of Budongo forest

Figure 3.3: Landsat image from 1995 showing burning areas in Budongo forest
Chapter 4
Results
4.1 Training area
On the area selected to train the sampling procedures, a hypothetical research problem was defined. Based on a simple model with three thematic maps, the fuzzy classification was carried out considering different parameters for overlap and number of classes. Using appropriate parameters, the confusion index map was computed and polygonized with the threshold CI = 0.5. The map was then used to compute the final sampling scheme for 100 samples.
4.1.1 Sampling scheme

Table 4.1: Sampling scheme for the training area

Study area: Yanachaga-Chemillen National Park
Purpose of sampling: definition of ecosystems based on a topo-land cover model, model validation
Predictive model: NDVI + slope + elevation
Target area: vague zones, high confusion index
Classification parameters: φ = 1.5, classes: 10
Measurements to be performed: predictor variables such as soil, microclimate and land cover
A subset of the Landsat image 7-067 was selected with a size of 800 x 800 rows and columns, covering an area of 576 km2 (see figure 4.1), and its boundaries in UTM coordinates were registered (table 4.2).
4.1.2 Predictive model
A subset for all thematic images was determined using the same boundaries in UTM projection.
Figure 4.1: Landsat subset for the study area

Table 4.2: Boundaries for the Yanachaga research area, UTM-18S, WGS84

Corner  Coordinate
LLX     400415
LLY     8850015
URX     424415
URY     8874015
Landform parameters
The contour lines were used to compute a DEM with a pixel resolution of 30 m. The slope was then calculated using the procedure available in the ILWIS software.

Vegetation indices
The NDVI index was derived from the Landsat image and the resolution was kept at 30 m per pixel. Size and boundaries of the images were made to coincide with those of the landform images.
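The NDVI derivation and the 8-bit scaling used for stacking can be sketched as follows. This is illustrative only: the band arrays and the linear stretch are assumptions, since the exact scaling applied in the thesis is not specified.

```python
import numpy as np

def ndvi(red, nir):
    """NDVI = (NIR - Red) / (NIR + Red) per pixel; for Landsat TM,
    band 3 is red and band 4 is near infrared."""
    red = red.astype(float)
    nir = nir.astype(float)
    denom = nir + red
    out = np.zeros_like(denom)
    np.divide(nir - red, denom, out=out, where=denom != 0)  # guard /0
    return out

def scale_to_byte(img):
    """Linear stretch to 0..255, the 8-bit format used for stacking."""
    lo, hi = img.min(), img.max()
    if hi == lo:
        return np.zeros(img.shape, dtype=np.uint8)
    return np.round(255 * (img - lo) / (hi - lo)).astype(np.uint8)

# tiny 2 x 2 example with hypothetical digital numbers
red = np.array([[50.0, 80.0], [60.0, 90.0]])
nir = np.array([[150.0, 90.0], [180.0, 60.0]])
print(ndvi(red, nir))
```

All layers stretched this way share a common 0..255 range, so no single band dominates the distance computations of the fuzzy classifier.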
The images were equally scaled to 8 bit format and stacked in a BSQ file for the computations.
Figure 4.2: Thematic images for the training area: a) elevation, b) slope and c) NDVI.
4.1.3 Sensitivity analysis
A sensitivity analysis of the fuzzy-c-means classification was applied to the training area using two sets of parameters, to compare the effects of the overlap coefficient φ and the number of classes. Several runs with the PARBAT software were done to make the comparison. Figure 4.3 shows the variation of the scaled partition coefficient together with the scaled entropy coefficient for a φ value of 1.5. Figure 4.4 shows the variation of the same coefficients against the number of classes, but for a larger φ = 2.5.
Figure 4.3: H and F scaled for an overlapping of 1.5
Figure 4.4: H and F scaled for an overlapping of 2.5
The first chart shows that the maximum partition coefficient F (calculated with equation 2.6) is reached with a low number of classes (3 to 4) and also with 10 classes, while the intermediate value of 6 classes has the lowest coefficients. For the entropy, the minimal values, which indicate a better classification, also point to a suitable number of classes that is either low or high (4 or 10). Following this analysis of the classification parameters, we decided on a combination of an overlap φ = 1.5 and 10 classes.
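The partition coefficient F and entropy H used in this analysis can be sketched as follows. The scaling of both coefficients to [0, 1] is an assumed common form; the exact scaling used by PARBAT is not given in the text.

```python
import numpy as np

def partition_coefficient(u):
    """F = (1/n) * sum of squared memberships; u has shape (n_pixels, k).
    F = 1 for a crisp partition, 1/k for a maximally fuzzy one."""
    return float(np.mean(np.sum(u ** 2, axis=1)))

def partition_entropy(u, eps=1e-12):
    """H = -(1/n) * sum(u * ln u); 0 for a crisp partition, ln k when
    maximally fuzzy (eps guards against log 0)."""
    return float(-np.mean(np.sum(u * np.log(u + eps), axis=1)))

def scaled(u):
    """Assumed common scaling mapping both coefficients to [0, 1]:
    F' = (F - 1/k) / (1 - 1/k) and H' = H / ln k."""
    k = u.shape[1]
    return ((partition_coefficient(u) - 1.0 / k) / (1.0 - 1.0 / k),
            partition_entropy(u) / np.log(k))

crisp = np.array([[1.0, 0.0], [0.0, 1.0]])  # every pixel fully in one class
fuzzy = np.full((2, 2), 0.5)                # maximal overlap between classes
print(scaled(crisp))
print(scaled(fuzzy))
```

High scaled F together with low scaled H thus flags a well-structured classification, which is how the 4-class and 10-class optima were identified in the charts.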
Figure 4.5: Confusion index maps for overlapping of 1.5 and for 2, 5 and 10 classes respectively
4.1.4 Map of confusion
With the values of the first and second highest memberships the confusion index map was computed. This image was transformed to a 1-bit format using the threshold CI = 0.5. For the polygonization, a sequence of degradations of the image was necessary until sufficiently integrated areas were obtained. The final resolution of the image was 100 x 100 rows and columns with a 240 m pixel size (see figure 4.6). The polygon map was built using the ILWIS software and exported from there to the BNA format. Polygons representing uncertain or vague zones were assigned ID = 1, while the hard zones were assigned ID = 0.
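The confusion index computation and the 0.5 thresholding described above can be sketched as follows. CI is taken here as the ratio of the second-highest to the highest membership, which is the usual definition.

```python
import numpy as np

def confusion_index(memberships):
    """CI = (second highest membership) / (highest membership) per pixel;
    memberships has shape (rows, cols, n_classes). CI near 1 means a vague
    pixel, CI near 0 a hard one."""
    s = np.sort(memberships, axis=-1)      # ascending along the class axis
    return s[..., -2] / s[..., -1]

def vague_mask(memberships, threshold=0.5):
    """1-bit map: True where the pixel belongs to a vague zone."""
    return confusion_index(memberships) >= threshold

# one confident pixel (CI = 0.125) and one confused pixel (CI ~ 0.89)
m = np.array([[[0.80, 0.10, 0.10],
               [0.45, 0.40, 0.15]]])
print(confusion_index(m))
```

Polygonizing the boolean mask (after the degradation step mentioned above) then yields the ID = 1 vague polygons used to constrain the sampling.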
Figure 4.6: Confusion index map: bright areas are showing high confusion
4.1.5 Sampling allocation
Calibration
First, a trial of the CONSTRAINED-MMSD program was carried out over a single polygon covering the entire study area, in order to check the program's performance. The results are shown in figure 4.7. The result is an almost equilateral triangular grid in which the samples are almost equidistant. Notice that a square area does not allow a perfect disposition of the 100 samples.
Figure 4.7: Unconstrained disposition of 100 samples
Samples for the study area
Using the MMSD program, the confusion map was used to constrain the allocation of 100 samples inside the target area. The program reads the polygon file and creates a set of random points inside the sampling area identified with the number 1 (for the Yanachaga model, inside the vague zones). After this, a set of test points is defined for the same area according to a defined resolution; for this model the settings were 150 x 150 columns and rows and a support size of 160 x 160 m. This results in a set of 10429 test points, a sufficiently large number for the MMSD procedure. The best allocation of the samples was then computed after running 3000 iterations, and the resulting positions were stored in a CSV file and plotted over the map (figure 4.8).
4.2 Budongo forest
For the area in Budongo forest we proposed to identify the best places to cover the largest possible range of tree diversity, first at a local scale. The model includes three topographic parameters derived from a DEM, three data structures from the tasselled cap transformation, and one temporal NDVI change index derived from the 1985 and 2002 images. The parameters chosen for the classification were a low number of classes (4) and a relatively high overlap (2.4), according to the results obtained in section 4.1.3. The target area for sampling is the hard zones, where a clearer definition of the classes, and thus of the range of biodiversity, is expected; areas with intermediate values are therefore excluded from the survey.

Figure 4.8: Location for 100 samples over the Yanachaga study area
4.2.1 Sampling scheme

Table 4.3: Sampling scheme for a local biodiversity research in Budongo forest

Study area: Budongo Forest National Park
Purpose of sampling: identify areas with different tree diversity
Predictive model: elevation + aspect + slope + greenness + brightness + wetness + NDVI change
Target area: areas with low confusion index (hard zones)
Fuzzy-classification parameters: φ = 2.4, classes: 4
Measurements to be performed: tree biodiversity
A study area was defined considering the location of the sampling data taken in the field; the subset includes 404 of the total 593 samples, resulting in an image of 150 columns by 150 rows (see figure 4.9). A pixel resolution of 40 m was chosen to correlate with the support size in the field (20 x 20 m); this gives a total area of 36 km2. The boundaries in UTM coordinates are presented in table 4.4.
Figure 4.9: Landsat subset for the Budongo forest study area, containing field samples
Table 4.4: Boundaries for Budongo research area, UTM-36N, Arc1960, Clark 1880 Spheroid

Corner  Coordinate
LLX     354400
LLY     208700
URX     360400
URY     214700

4.2.2 Diversity predictive model
Using the topographic modelling available in ENVI, parametric information was extracted from the DEM image, including slope, aspect, and various convexities and curvatures. To preserve the scale, the minimum kernel size available (3 x 3) was chosen. The slope is measured in degrees, with the convention of 0 degrees for a horizontal plane. From the Landsat 7 ETM image, tasselled cap data structures showing absorption peaks for brightness, greenness and wetness were calculated using the corresponding option in ENVI.
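A 3 x 3 slope computation can be sketched as follows. Horn's method is assumed here purely for illustration; the exact algorithm ENVI applies is not specified in the text.

```python
import numpy as np

def slope_degrees(dem, cell):
    """Slope in degrees from a 3x3 neighbourhood (Horn's weighted-difference
    method, a common choice). Border cells are left at 0."""
    rows, cols = dem.shape
    out = np.zeros_like(dem, dtype=float)
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            z = dem[r - 1:r + 2, c - 1:c + 2].astype(float)
            # weighted differences across the window, scaled by cell size
            dzdx = ((z[0, 2] + 2 * z[1, 2] + z[2, 2]) -
                    (z[0, 0] + 2 * z[1, 0] + z[2, 0])) / (8 * cell)
            dzdy = ((z[2, 0] + 2 * z[2, 1] + z[2, 2]) -
                    (z[0, 0] + 2 * z[0, 1] + z[0, 2])) / (8 * cell)
            out[r, c] = np.degrees(np.arctan(np.hypot(dzdx, dzdy)))
    return out

# a plane rising 1 m per 30 m cell in x: slope = atan(1/30), about 1.9 degrees
dem = np.tile(np.arange(5, dtype=float), (5, 1))
print(slope_degrees(dem, 30.0)[2, 2])
```

With a 30 m DEM the 3 x 3 kernel spans 90 m on the ground, which is why the minimum kernel best preserves the local scale mentioned above.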
From the Landsat images of 1985 and 2002 the corresponding NDVI images were obtained, and an image reflecting the absolute changes in index values was computed to include a temporal variable in the analysis. The resulting images were equally scaled to 1-byte format and stacked in a BSQ file to build the predictive model (figure 4.10).
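The temporal NDVI-change layer can be sketched as follows. This is illustrative; the [0, 2] range assumed for the byte scaling (the maximum possible spread of NDVI differences) is not stated in the text.

```python
import numpy as np

def ndvi_change(ndvi_t1, ndvi_t2):
    """Absolute per-pixel change in NDVI between two dates."""
    return np.abs(ndvi_t2.astype(float) - ndvi_t1.astype(float))

# hypothetical NDVI values for the two dates
ndvi_1985 = np.array([[0.60, 0.20], [0.40, -0.10]])
ndvi_2002 = np.array([[0.30, 0.25], [0.40, 0.50]])
change = ndvi_change(ndvi_1985, ndvi_2002)

# scale to 1-byte range for stacking, assuming change lies in [0, 2]
byte = np.round(change / 2.0 * 255).astype(np.uint8)
print(change)
```

Pixels with large change values flag disturbance (for instance burning) between the two dates, which is the temporal signal the model tries to exploit.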
Figure 4.10: Thematic images for the Budongo tree-diversity model. From left to right and top to bottom: 1) elevation, 2) aspect, 3) slope, 4) greenness, 5) brightness, 6) wetness and 7) NDVI change.
4.2.3 Fuzzy classification
A higher overlap (2.4) combined with a relatively low number of classes (4) gave better defined, more compact areas of confusion, which made the polygonization process more operational. The settings of these two parameters were based on a sensitivity analysis of partition and entropy, as shown in figure 4.11.

Map of confusion
After the classification with the defined parameters, the same polygonization procedure as for the training area was performed, with the difference that no degradation of the image was necessary, so the resolution was preserved at 40 m per pixel. In this case the areas having CI < 0.5, the hard zones, were identified with ID = 1 to indicate the sampling area.
4.2.4 Samples for the study area
The CONSTRAINED-MMSD program for the sample allocation was run with 150 x 150 rows and columns, a 40 m support size and 3000 iterations. The resulting coordinates are shown in figure 4.13.
Figure 4.11: Budongo forest: H and F scaled for an overlapping of 2.4
Figure 4.12: Confusion index maps considering 4 classes and overlapping coefficients of 1.5, 2, and 2.5 respectively
4.2.5 Accuracy assessment
An evaluation of the accuracy of the proposed design was made using field data from the surveys by Nangendo and Olivier. Even though their research had other objectives, the measurements of tree biodiversity with the Shannon index are useful for an analytical comparison with the proposed scheme. This assessment evaluates how closely the fuzzy-sampling procedure achieves the purposes of the sampling. The sources of error include a poor model definition and the error due to the allocation of samples. Measuring the allocation error is unfeasible, because coincidence between the proposed locations and the field data is hardly to be found. A spatial selection of the field samples included in the study area was done: from a total of 593 samples in the data set, 404 were included in the study area and, of these, 180 lie inside the target area (excluding 4 samples with no data); see table 4.5. The position of the selected samples is represented in figure 4.14. The mean of the corresponding Shannon index values obtained for the study area, excluding records without information, was 1.43. On the other hand, the mean for the set of selected samples (again excluding records without information) was 1.53, a difference of 7%.

Figure 4.13: Allocation of samples for the biodiversity model on Budongo forest
Table 4.5: Number of field samples in the study area

             Oliver  Grace  Total  Shannon mean
Samples      333     260    593
Study area   194     210    404    1.43
Target area  82      102    184
With data    82      98     180    1.53
Figure 4.14: Right: samples obtained on the field, left: selection of samples corresponding to the study area
Figure 4.15: Comparative of positions from field data and proposed sampling schema
42
Chapter 4. Results
4.2.6 Regional modelling
An additional model was computed for a larger area in Budongo forest, covering the whole set of field data, in order to find correlations for a different model (see table 4.6) that considers the predictor variables at a regional scale.

Table 4.6: Sampling scheme for a regional biodiversity research in Budongo forest

Study area: Budongo Forest National Park
Purpose of sampling: identify areas with different tree diversity for a larger area
Predictive model: aspect + slope + greenness + brightness
Target area: areas with low confusion index (hard zones)
Fuzzy-classification parameters: φ = 2.1, classes: 4
Measurements to be performed: tree biodiversity
Table 4.7: Boundaries for the Budongo regional model

Corner  Coordinate
LLX     400400
LLY     8850000
URX     424400
URY     8874000
The resulting allocation of samples is shown in figure 4.16. The contrast with the values obtained in the field shows no differences in range or mean values of the Shannon index, which confirms that it is the dissimilarity of the sampling schemes that makes it impossible to correlate the field data with the allocated points, and not effects at different scales.
4.2.7 Kriging analysis
Prediction maps using ordinary kriging were elaborated with the field data in order to make a comparative analysis. Three maps were composed: one based on the number of species, a second based on the Shannon index and a third using the evenness index values (see figure 4.17); darker colors indicate higher values. Using different ranges of values for the hard zones in Budongo (higher range) and the vague zones (lower range), we simulated values for 200 sample points distributed over the area in order to examine the spatial distribution of a hypothetical outcome. Figure 4.18 shows this interpolation for the study area and presents a simulated result in which the procedures of fuzzy classification, simulated annealing and kriging interpolation are combined. The differences with figure 4.17 reflect the overall errors due to the predictive model and to differences in the sampling scheme parameters.
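Ordinary kriging as used for the prediction maps can be sketched as follows. This is a minimal illustration: the spherical variogram and its parameters are assumptions, not values fitted to the Budongo data.

```python
import numpy as np

def spherical(h, nugget, sill, rng):
    """Spherical semivariogram; gamma(0) = 0, reaching the sill at range rng."""
    h = np.asarray(h, dtype=float)
    g = np.where(h < rng,
                 nugget + (sill - nugget) * (1.5 * h / rng - 0.5 * (h / rng) ** 3),
                 sill)
    return np.where(h == 0, 0.0, g)

def ordinary_krige(xy, z, x0, variogram):
    """Ordinary kriging prediction at x0 from samples (xy, z): solve the
    kriging system with a Lagrange multiplier enforcing weights summing to 1."""
    n = len(z)
    d = np.linalg.norm(xy[:, None, :] - xy[None, :, :], axis=-1)
    A = np.ones((n + 1, n + 1))
    A[:n, :n] = variogram(d)     # semivariances between samples
    A[n, n] = 0.0
    b = np.ones(n + 1)
    b[:n] = variogram(np.linalg.norm(xy - x0, axis=-1))  # sample-to-target
    w = np.linalg.solve(A, b)
    return float(w[:n] @ z)      # weighted mean of the observations

# four samples at the corners of a unit square, hypothetical Shannon values
xy = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
z = np.array([1.0, 2.0, 3.0, 4.0])
vg = lambda h: spherical(h, 0.0, 1.0, 2.0)
print(ordinary_krige(xy, z, np.array([0.5, 0.5]), vg))  # symmetric: the mean
```

With a zero nugget the predictor is exact at the data locations, which is why the kriged maps honour the field measurements while smoothing between them.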
Figure 4.16: Right: samples obtained on the field, left: selection of samples corresponding to the study area. In red squares: the proposed sample points
Figure 4.17: Predicted distribution of values for a) number of species b)Shannon index and c)Evenness index
Figure 4.18: Kriging with simulated values for hard and vague zones ranges resulting from the fuzzy-sampling procedures
Chapter 5
Discussion

The sensitivity analysis based on the entropy and partition coefficients of the fuzzy classification shows, in general terms, that meaningful classes can be obtained with a lower overlap coefficient φ in combination with either a low (4) or a high (10) number of classes, while a high overlap only allows a rather low number of classes (4) (see figure 4.4). The identification of compact areas of uncertainty using the confusion index required a low overlap with a large number of classes in the Yanachaga forest case study; this could be due to the more extensive range of variation of the predictors: large differences in elevation and NDVI values. A higher overlap and a low number of classes were necessary in the Budongo forest case, because the ranges of values were shorter and the scale more regional.

The boolean level of constraint, which defines only ”vague” and ”hard” areas, loses some precision and information about uncertainty, but it is operational for performing point allocations. The confusion index threshold of 0.5 is likewise an arbitrary value for defining areas, but a first approximation to interpreting discontinuous geographic spaces with similar attributes, and its suitability needs to be proven by surveying.

The MMSD procedure is successful in allocating positions for a number of samples under many constraints, among them the constrained areas of certainty and uncertainty created by the fuzzy classification, but compared with the data resulting from the field it does not give a complete solution. The predictive models used in this study are not based on a precise analysis of biodiversity and were used to test operational procedures for fuzzy sampling. Therefore they failed to give reliable results, because they are not good enough to explain the distribution of values.
The variables considered, like the topographic parameters, have a clearly more regional effect, which is out of scale for the Budongo subset area. Individual species are influenced by all these factors, but diversity shows different relations: for instance, alpha diversity is affected by very local effects, while beta diversity receives the influence of more regional variables. Relative weights for the variables are also not considered in the predictive model. We also need to consider that, in natural vegetation, areas with more green biomass usually show a larger diversity than those with lower biomass; NDVI, however, is not a good indicator here because its values saturate at a certain level of green biomass and hence it is not sensitive to differences within forest.

The disparity between the sampling schemes of the original studies carried out in Budongo forest and the proposed schemes makes the accuracy assessment invalid. The disposition of the field data in transects, which is particularly different from the equilateral triangular grids resulting from the constrained-MMSD procedure, gives no coincidence with the proposed sample coordinates. The constrained target area is also spatially different and, especially, values from points on the border will not fit in the predicted range (see figure 4.14).
5.0.8 Applications
Optimization of sampling can be applied to many fields of research and production. For instance, in precision agriculture, sampling target areas could be defined to search for spots of disease and pests for effective treatment and control, decreasing costs and pollution. The same could be done for soil fertilization, proposing optimal sampling procedures based on indicators and producing maps to support differential application of fertilizer. Crop diversity is also a multivariate phenomenon, important to study in regions that are repositories of large numbers of species, varieties and cultivars, like the Andean mountain range. The large number of factors influencing this process, from traditional farming activities and environmental conditions to market accessibility, could be modelled and used to predict its spatial and temporal distribution.
Chapter 6
Conclusions
6.1 About the procedure
• The fuzzy-c-means classification proves to be a promising tool to build models for multivariate studies, by identifying proper parameters and making use of the interpretation of remotely sensed data.
• Uncertainty of classes can be used to indicate constrained areas for sampling in processes with discontinuous spatial dependency, such as vegetation distribution.
• Clear definition of the research parameters of the sampling scheme, such as study area, purpose of sampling, predictive model, target area, units, classification parameters and measurements to be performed, contributes to a more efficient survey and to further overall sampling optimization.
• Optimization of the target area and of the scale of measurements was obtained by defining compact areas with different levels of uncertainty, using the confusion index as discriminant.
• Optimization of the sampling unit was proposed through the correlation of image resolution with the support size.
• Optimization of the sampling allocation for model-based design was achieved with simulated annealing procedures minimizing the mean of the shortest distances, yielding an even spread of points in an equidistant grid arrangement, and hence a better solution for interpolation purposes.
• Optimization of the survey is possible by adding constrained areas that consider accessibility and time limitations.
6.2 About the results
• The sensitivity analysis for different areas and models shows a general tendency for classification uncertainty to increase as the overlap and the number of classes increase.
• For the proposed case studies, considering a low overlap (φ = 1.5), the optimal number of classes can be either low or high (4 or 10). For higher overlaps (φ = 2.5), an optimum was only obtained with a relatively low number of classes (4).
• The sampling scheme proposed for the unvisited site in the Amazon region can be validated by measuring the predictor variables at the proposed locations.
• The assumption of validity of the species distribution model for Budongo forest is doubtful because of its poor definition.
• The largest ranges of biodiversity (beta diversity) for a given area can be localized based on models including topographic, spectral and temporal data.
Bibliography

[1] Bezdek, J. C. Fuzzy and neural pattern recognition: video short course: Parts 1-2, Parts 3-4, Part 5 and course notes. The International Society for Optical Engineering (SPIE), 1995.
[2] Bijker, W. Radar for rain forest, a monitoring system for land cover change in the Colombian Amazon. PhD thesis, International Institute for Aerospace Survey and Earth Sciences - ITC, 1997.
[3] Bourke, P. Intersection point of two lines. Web site: http://astronomy.swin.edu.au/~pbourke/geometry/lineline2d/, 1989.
[4] Bruin, D., and Stein, A. Soil-landscape modelling using fuzzy-c-means clustering of attribute data derived from a digital elevation model (DEM). Geoderma 83 (1998), 17-33.
[5] Burrough, P. A., and McDonnell, R. A. Principles of Geographical Information Systems, first ed. Oxford University Press, 1998.
[6] Burrough, P. A., van Gaans, P. F. M., and MacMillan, R. A. High-resolution landform classification using fuzzy k-means. Elsevier Science B.V. (2000), 37-52.
[7] Burrough, P. A., Wilson, J. P., van Gaans, P. F. M., and Hansen, A. J. Fuzzy k-means classification of topo-climatic data as an aid to forest mapping in the Greater Yellowstone area, USA. Landscape Ecology 16 (2001), 523-546.
[8] Carvalho, L. A. Mapping and monitoring forest remnants, a multiscale analysis of spatio-temporal data. PhD thesis, Wageningen University, 2001.
[9] de Gruijter, J. Spatial sampling schemes for remote sensing. In Spatial Statistics for Remote Sensing, vol. 1. Kluwer Academic Publishers, 1999, ch. 13, pp. 211-242.
[10] de Gruijter, J., and ter Braak, C. Model-free estimation from spatial samples: a reappraisal of classical sampling theory. Mathematical Geology 22, 4 (1990), 407-415.
[11] Lark, R. Forming spatially coherent regions by classification of multivariate data: an example from the analysis of maps of crop yield. Geographical Information Science 12, 1 (1998), 83-98.
[12] ERDAS LLC. ERDAS Field Guide, sixth ed. Leica Geosystems, 2002.
[13] Lucieer, A. Parbat version 0.24. http://parbat.lucieer.net, 2003.
[14] Maselli, F., Gilabert, M. A., and Conese, C. Integration of high and low resolution NDVI data for monitoring vegetation in Mediterranean environments. Remote Sensing of Environment 63 (1998), 208-218.
[15] Michalewicz, Z., and Fogel, D. B. How to Solve It: Modern Heuristics. Springer-Verlag, Berlin-Heidelberg, Germany, 2000.
[16] Moore, D., and McCabe, G. P. Introduction to the Practice of Statistics, fourth ed. W. H. Freeman and Company, 2002.
[17] Nangendo, G. Assessment of the impact of burning on biodiversity using geostatistics, geographical information systems (GIS) and field surveys. Master's thesis, ITC, 2000.
[18] Odeh, I. O. A., McBratney, A. B., and Chittleborough, D. J. Design of optimal sample spacings for mapping soil using fuzzy-k-means and regionalized variable theory. Geoderma 47, 1-2 (1990), 93-122.
[19] Pohl, C. Geometric Aspects of Multisensor Image Fusion for Topographic Map Updating in the Humid Tropics. PhD thesis, ITC, 1996.
[20] Stein, A., and Ettema, C. An overview of spatial sampling procedures and experimental design of spatial studies for ecosystem comparisons. Agriculture, Ecosystems and Environment 94 (2003), 31-47.
[21] Townsend, P. A. A quantitative fuzzy approach to assess mapped vegetation classifications for ecological applications. Remote Sensing of Environment 72 (2000), 253-267.
[22] van der Wel, F. J., van der Gaag, L. C., and Gorte, B. G. H. Visual exploration of uncertainty in remote-sensing classification. Pergamon (1997), 335-343.
[23] van Groenigen, J. W. Constrained Optimization of Spatial Sampling. PhD thesis, Wageningen Agricultural University and ITC, 1999.
[24] van Groenigen, J. W., Siderius, W., and Stein, A. Spatial simulated annealing for optimizing sampling. Geoderma 87 (1999).
[25] van Straten, O. Changing woodland ecosystems: post-disturbance woody species succession dynamics and spatial trends. Master's thesis, ITC, 2003.
[26] Verma, M. Handling spatial data uncertainty using a fuzzy geostatistical approach for modelling methane emission at the island of Java. Master's thesis, ITC, Netherlands, 2002.
[27] Webster, R., and Oliver, M. A. Geostatistics for Environmental Scientists, first ed. Statistics in Practice. John Wiley & Sons, Ltd, 2001.
[28] Wood, J. The geomorphological characterization of digital elevation models. PhD thesis, University of Leicester, Department of Geography, Leicester, UK. http://www.geog.le.ac.uk/jwo/research, 1996.
Constrained MMSD program code

program CMMSD(input,output);
{$N+}
const
  ns  = 100;
  eps = 1e-9;

type
  real = double;

var
  i,i0,j,k,l,m,n,ni,le,po,n1,n2      : integer;
  no,nrow,ncol,sup                   : integer;
  x,y,x1,y1,x3,y3,x4,y4,t            : longint;
  xa,ya,xb,yb,id,id1,con             : longint;
  LLX,URX,LLY,URY                    : longint;
  c,xc,yc,xr,yr,xs,ys,muad,muan,mubn : real;
  d,dmin,dt,dtold                    : real;
  cn                                 : array[1..30] of integer;
  s,s1                               : array[1..ns,1..2] of real;
  ready,change                       : boolean;
  bna,spa,cou,ras,sam,sap,test       : text;
  b,polname,pntname,outname          : string;
{procedure for the point-poly intersection} procedure intersect; begin muan:=((x4-x3)*(y1-y3))-((y4-y3)*(x1-x3)); muad:=((y4-y3)*(xc-x1))-((x4-x3)*(yc-y1)); mubn:=((xc-x1)*(y1-y3))-((yc-y1)*(x1-x3)); end; begin writeln(’ ’); writeln(’ START CONSTRAINED-MMSD PROGRAM’); writeln(’ ’); writeln(’READING BNA FILE FOR THE SAMPLING AREA ...’); write(’BNA polygon file : ’); readln(polname); writeln(’ ’); assign(bna,polname); reset(bna); assign(spa,’spa.res’); rewrite(spa); {rewriting the bna format to space separated: tab becareful: i am using the fake bna with the partial counter version from ilwis} while not eof(bna) do begin b:=’ ’; readln (bna,b); if (b[1] = ’"’) then begin write (spa,b[2],chr(0));
i:=2; repeat i:=i+1; until (b[i]=’"’); writeln (spa,b[i+2],b[i+3],b[i+4]); end; if (b[1] ’"’) then begin write (spa,b[1],b[2],b[3],b[4],b[5],b[6],chr(0)); i:=14; repeat i:=i+1; until (b[i]=’.’); j:=i-14; for i:=1 to (j-1) do write(spa,b[13+i]); writeln(spa,b[13+j]); end; end; close(bna); close(spa); writeln(’IDENTIFYING COUNTERS FOR ISLANDS ....’); writeln(’ ’); {for a maximum of 30 islands} assign(cou,’cou.res’); rewrite(cou); reset(spa); readln(spa,id,con); while not eof(spa) do begin i:=0; id1:=id; readln(spa,xa,ya); for j:=1 to 30 do cn[j]:=0; repeat n:=0; repeat readln(spa,xb,yb); id:=xb; n:=n+1; until ((xb=xa) and (yb=ya)); i:=i+1; cn[i]:=n; readln(spa,xb,yb); id:=xb; until ((id=1) or (id=0)); write(cou,id1:4,i:4); for k:=1 to 29 do write(cou,cn[k]:4); writeln(cou,cn[30]:4); end; close(spa); close(cou); writeln(’INPUT DEFINITION FOR STUDY AREA:’); writeln(’Image size:’); write(’Number of rows : ’);readln(nrow); write(’Number of columns : ’);readln(ncol); write(’Support size : ’);readln(sup); {write(’Number of samples : ’);readln(no);} writeln(’Image coordinates (pixel corner):’); write(’LLX : ’);readln(LLX); write(’URX : ’);readln(URX); write(’LLY : ’);readln(LLY); write(’URY : ’);readln(URY); writeln(’ ’); {calculate ranges for the sampling space} xr := (URX-LLX); yr := (URY-LLY); {assign external point} x1:= LLX-1000; y1:= LLY-1500; writeln(’CREATING RANDOM SAMPLING POINTS ...’); writeln(’ ’); assign(ras,’ras.res’); rewrite(ras); no:=0; repeat {the grid starts at the uper left corner} xc := LLX + (random*xr); yc := LLY + (random*yr); {writeln(’C1:gridpoint ’,xc:14:5,yc:14:5);{CHECK1} reset(spa); reset(cou); le:=0; while not eof(spa) do begin read(cou,id,po); for k:=1 to 29 do read(cou,cn[k]); readln(cou,cn[30]); {writeln(’C1 id ’,id,’po ’,po,’prin ’,c[1]);} readln(spa,x4,y4);readln(spa,x4,y4); {writeln(’C2’,x4,’ ’,y4);} ni:=0; for n := 1 to cn[1] do begin
        x3 := x4; y3 := y4;
        readln(spa, x4, y4);
        {writeln('C3a ',x4,' ',y4);}
        intersect;
        if (muad = 0) then continue;
        if (((muan/muad) >= 0) and ((mubn/muad) >= 0) and
            ((muan/muad) < 1) and ((mubn/muad) < 1)) then ni := ni+1;
        {writeln('C3 ni =',ni);}
      end;
      if ((ni mod 2 > 0) and (id = 0)) then le := le - 1;
      if ((ni mod 2 > 0) and (id = 1)) then le := le + 1;
      {writeln('C4 le =',le);}
      for n1 := 1 to (po-1) do
      begin
        readln(spa, x4, y4);
        {writeln('C5 ',x4,' ',y4);}
        ni := 0;
        for n2 := 1 to (cn[n1+1]-1) do
        begin
          x3 := x4; y3 := y4;
          readln(spa, x4, y4);
          intersect;
          if (muad = 0) then continue;
          if (((muan/muad) >= 0) and ((mubn/muad) >= 0) and
              ((muan/muad) < 1) and ((mubn/muad) < 1)) then ni := ni+1;
          {writeln('C6 nis =',ni);}
        end;
        if ((ni mod 2 > 0) and (id = 0)) then le := le + 1;
        if ((ni mod 2 > 0) and (id = 1)) then le := le - 1;
        readln(spa, x4, y4);
        {writeln('C7 le =',le);}
      end;
      {if (po > 1) then readln(spa, x4, y4);}
      {writeln('C8 ',x4,' ',y4);}
    end;
    close(cou); close(spa);
    {writeln('C9 le ',le);}
    if (le > 0) then
    begin
      {extra line for future use of pointer}
      no := no+1;
      s[no,1] := xc;
      s[no,2] := yc;
      writeln(ras, xc:6:0, yc:8:0);
    end;
  until (no = ns);
  close(ras);

  {option for samples created}
  {assign(sam,'samoa.txt'); reset(sam);
   for i := 1 to ns do
   begin
     readln(sam,u,v);
     s[i,1] := u;
     s[i,2] := v;
   end;
   close(sam);}

  writeln('SELECTING TEST POINTS FOR THE SAMPLE SPACE...');
  {building a grid of points, checking point by point and polygon by polygon}
  assign(test, 'test.res'); rewrite(test);
  t := 0;
  for i := 1 to nrow do
    for j := 1 to ncol do
    begin
      {the grid starts at the upper left corner}
      xc := ((LLX-(sup/2))+(sup*j));
      yc := ((URY+(sup/2))-(sup*i));
      {writeln('C1:gridpoint ',xc:14:5,yc:14:5);} {CHECK1}
      reset(spa);
      reset(cou);
      le := 0;
      while not eof(spa) do
      begin
        read(cou, id, po);
        for k := 1 to 29 do read(cou, cn[k]);
        readln(cou, cn[30]);
        {writeln('C1 id ',id,' po ',po,' prin ',cn[1]);}
        readln(spa, x4, y4); readln(spa, x4, y4);
        {writeln('C2 ',x4,' ',y4);}
        ni := 0;
        for n := 1 to cn[1] do
        begin
          x3 := x4; y3 := y4;
          readln(spa, x4, y4);
          {writeln('C3a ',x4,' ',y4);}
          intersect;
          if (muad = 0) then continue;
          if (((muan/muad) >= 0) and ((mubn/muad) >= 0) and
              ((muan/muad) < 1) and ((mubn/muad) < 1)) then ni := ni+1;
          {writeln('C3 ni =',ni);}
        end;
        if ((ni mod 2 > 0) and (id = 0)) then le := le - 1;
        if ((ni mod 2 > 0) and (id = 1)) then le := le + 1;
        {writeln('C4 le =',le);}
        for n1 := 1 to (po-1) do
        begin
          readln(spa, x4, y4);
          {writeln('C5 ',x4,' ',y4);}
          ni := 0;
          for n2 := 1 to (cn[n1+1]-1) do
          begin
            x3 := x4; y3 := y4;
            readln(spa, x4, y4);
            intersect;
            if (muad = 0) then continue;
            if (((muan/muad) >= 0) and ((mubn/muad) >= 0) and
                ((muan/muad) < 1) and ((mubn/muad) < 1)) then ni := ni+1;
            {writeln('C6 nis =',ni);}
          end;
          if ((ni mod 2 > 0) and (id = 0)) then le := le + 1;
          if ((ni mod 2 > 0) and (id = 1)) then le := le - 1;
          readln(spa, x4, y4);
          {writeln('C7 le =',le);}
        end;
        {writeln('C8 ',x4,' ',y4);}
      end;
      close(cou); close(spa);
      {writeln('C9 le ',le);}
      if (le > 0) then
      begin
        t := t+1;
        writeln(test, xc:6:0, yc:8:0);
      end;
    end;
  close(test);
  writeln('Test points :', t);
  writeln(' ');

  writeln('CONSTRAINED MMSD STARTS...');
  write('Number of iterations : '); readln(m);
  writeln(' ');
  assign(sam, 'sam.csv'); rewrite(sam);
  writeln(sam, 'x', chr(44), 'y');
  assign(sap, 'sap.res'); rewrite(sap);
  ready := false;
  c := 1;
  dtold := 1e30;
  for l := 1 to m do
  begin
    for i := 1 to ns do
      for j := 1 to 2 do
        s1[i,j] := s[i,j];
    i0 := round(ns*random+0.5); {select a random point}
    {repeat
     xs := s[i0,1]+((random-0.5)*xr)/sqrt(k);
     ys := s[i0,2]+((random-0.5)*yr)/sqrt(k);
     until ((xs > LLX) and (xs < URX) and (ys > LLY) and (ys < URY));}
    repeat
      xc := s[i0,1]+((random-0.5)*xr)/sqrt(l);
      yc := s[i0,2]+((random-0.5)*yr)/sqrt(l);
      reset(spa);
      reset(cou);
      le := 0;
      while not eof(spa) do
      begin
        read(cou, id, po);
        for k := 1 to 29 do read(cou, cn[k]);
        readln(cou, cn[30]);
        readln(spa, x4, y4); readln(spa, x4, y4);
        ni := 0;
        for n := 1 to cn[1] do
        begin
          x3 := x4; y3 := y4;
          readln(spa, x4, y4);
          intersect;
          if (muad = 0) then continue;
          if (((muan/muad) >= 0) and ((mubn/muad) >= 0) and
              ((muan/muad) < 1) and ((mubn/muad) < 1)) then ni := ni+1;
        end;
        if ((ni mod 2 > 0) and (id = 0)) then le := le - 1;
        if ((ni mod 2 > 0) and (id = 1)) then le := le + 1;
        for n1 := 1 to (po-1) do
        begin
          readln(spa, x4, y4);
          ni := 0;
          for n2 := 1 to (cn[n1+1]-1) do
          begin
            x3 := x4; y3 := y4;
            readln(spa, x4, y4);
            intersect;
            if (muad = 0) then continue;
            if (((muan/muad) >= 0) and ((mubn/muad) >= 0) and
                ((muan/muad) < 1) and ((mubn/muad) < 1)) then ni := ni+1;
          end;
          if ((ni mod 2 > 0) and (id = 0)) then le := le + 1;
          if ((ni mod 2 > 0) and (id = 1)) then le := le - 1;
          readln(spa, x4, y4);
        end;
      end;
      close(cou); close(spa);
    until ((xc > LLX) and (xc < URX) and (yc > LLY) and (yc < URY) and (le > 0));
    s1[i0,1] := xc;
    s1[i0,2] := yc;
    dt := 0;
    reset(test);
    while not eof(test) do
    begin
      readln(test, x, y);
      {writeln(x:14,y:14,w:5,' ','CK4 testpoint');} {CHECK4}
      dmin := 1e30;
      for i := 1 to ns do
      begin
        d := sqrt(sqr(x-s1[i,1])+sqr(y-s1[i,2]));
        if (d < dmin) then dmin := d;
      end;
      dt := dt+dmin;
    end;
    close(test);
    dt := dt/t;
    {writeln(dt,' ','CK6 mean distances');} {CHECK6}
    change := false;
    if (dt < dtold) then change := true;
    if (dt > dtold) then
      if ((dtold-dt) > c*ln(random)) then change := true;
    if change then
    begin
      for j := 1 to 2 do s[i0,j] := s1[i0,j];
      dtold := dt;
    end;
    c := c*0.9;
    if (abs(dtold-dt) < eps) then ready := true;
    if change then
    begin
      writeln(dt:20, ' ', l:5, i0:5);
      {read(i);}
    end;
    if change then
    begin
      writeln(sap, dt:20, ' ', l:5, i0:5);
      {read(i);}
    end;
  end;
  for i := 1 to ns do
    writeln(sam, s[i,1]:6:0, chr(44), s[i,2]:8:0);
  close(sap); close(sam);
end.
60