CEFAGE Working Paper
2016/06
Disaggregating Statistical Data at Field Level: An Entropy Approach
António Xavier 1, Maria Belém C. Freitas 1, Maria do Socorro Rosário 2, Rui Fragoso 3
1 Universidade do Algarve and MeditBio
2 Direção de Serviços de Estatística, GPP (Gabinete de Planeamento e Políticas)
3 Universidade de Évora and CEFAGE-UE
Universidade de Évora, Palácio do Vimioso, Largo Marquês de Marialva, 8, 7000- 809 Évora, Portugal Telf: +351 266 706 581 - E-mail:
[email protected] - Web: www.cefage.uevora.pt
DISAGGREGATING STATISTICAL DATA AT FIELD LEVEL: AN ENTROPY APPROACH
António Xavier, Universidade do Algarve, Campus de Gambelas, 8000-117 Faro. MeditBio. E-mail: [email protected]
Maria de Belém Costa Freitas, Assistant Professor with Habilitation (Professora Auxiliar c/Agregação), Universidade do Algarve, Faculdade de Ciências e Tecnologia, Campus de Gambelas, 8000-117 Faro. MeditBio. E-mail: [email protected]
Maria do Socorro Rosário, Direção de Serviços de Estatística, GPP (Gabinete de Planeamento e Políticas). E-mail: [email protected]
Rui Fragoso, Assistant Professor with Habilitation (Professor Auxiliar c/Agregação), Universidade de Évora and CEFAGE-UE (Center for Advanced Studies in Management and Economics). E-mail: [email protected]
Abstract
The objective of this paper is to present an entropy approach to disaggregate agricultural data (areas of temporary and permanent crops) at a local level. The approach comprises three steps. In the first step, an HJ-Biplot methodology and a cluster analysis are implemented. In the second step, an iterative procedure is developed to establish relations between the land use cartography and the statistics, and data allocation coefficients are then created for each variable. Finally, in the third step, a minimum cross entropy process is implemented in order to manage the information inputs and guarantee consistency. The model was applied to the Algarve region and showed satisfactory results, since the estimated values were a good approximation to the true values.
Key words: data disaggregation, minimum cross entropy, land uses, Algarve.
1. Introduction
The European Union faces very complex challenges at the beginning of a new programming period. Detailed and disaggregated analyses that consider the whole territory and the spatial implications of the Common Agricultural Policy (CAP) are often disregarded in policy evaluation, and having the information to carry them out could be of great value (Xavier et al., 2011; Martins et al., 2011). The spatial and temporal dynamics of crop distribution at a local level can reveal the pattern of agricultural production over different periods and provide relevant data on agricultural ecosystem patterns and functions, as well as on the impact of global changes on agricultural production (Foley et al., 2005; Portmann et al., 2010; Tan, 2014).
Detailed data are needed to develop policies and agricultural economics analyses. In Europe, agricultural data are supplied mainly by the Farm Accountancy Data Network (FADN), which covers the whole European Union. These data are collected in an annual survey carried out by the Member States and provided in an aggregated form at the level of regions, according to the Nomenclature of Territorial Units for Statistics II (NUTS II) (Chakir, 2009). Only the Agricultural Census provides more detailed information, but it is still limited. In Portugal, as in other countries, the National Agricultural Census is carried out every 10 years, and between censuses no information is available at municipality or parish level. However, geo-referenced information exists for several farms, which could be used to update the land use across the whole territory. The increased availability of geo-referenced data and more accessible geographic information systems (GIS) are providing better opportunities for spatial analyses (You et al., 2009). Several impacts have induced changes in agricultural systems, and their analysis is based on artificial statistical units, which often include very heterogeneous areas (Xavier et al., 2011).
In Portugal, studies show that the changes in the Common Agricultural Policy (CAP) have recently led to an extensification of the production systems in areas such as the Alentejo region (Xavier and Freitas, 2014). This calls for disaggregated agricultural data with precise geographical references in order to evaluate agricultural policies correctly (Just, 2000, cited by Chakir, 2009). Note that natural conditions usually differ within administrative boundaries, so the assumption of identical cropping patterns, yields or input use cannot be maintained (Kempen et al., 2005). Data disaggregation processes are needed to tackle such problems (Howitt and Reynaud, 2003; Fragoso et al., 2008; Martins et al., 2011). In recent years, there has been an increase in the demand for tools to analyse the impact of policies and technological innovations on agricultural sustainability (Louhichi et al., 2012).
In order to overcome these difficulties, several simulations and studies have been carried out by several authors in Portugal and at international level (You and Wood, 2006, 2009; Chakir, 2009; Kempen et al., 2005; Goch and Röeder, 2011). These studies use logistic regression, expert knowledge, homogeneous units, or a combination of different sources of geographical data to create an information prior. However, some techniques, when combined, may be very important for creating a more informative prior and hence for improving results. These include less frequently used techniques such as the HJ-Biplot methodology (Galindo, 1986) and dasymetric mapping techniques (Martins et al., 2012), which make it possible to relate statistical data to land cover classes.
Thus, the objective of this work is to present a combined entropy approach to disaggregate agricultural data (areas of temporary and permanent crops) at a local level, using dasymetric mapping (Martins et al., 2012) and the HJ-Biplot (Galindo, 1986) to obtain a better information prior. This application includes important improvements in order to achieve a consistent data disaggregation process in a situation of incomplete information. The first is the use of a cross entropy approach as an alternative to traditional regression analysis methods. Entropy methods make it possible to deal with likelihood problems with negative degrees of freedom. This is a great advantage over traditional regression methods, which require positive degrees of freedom and hence a large amount of data. Entropy also makes it possible to include additional constraints in the problem, such as past limits of land use. Applying cross entropy requires an information prior.
However, the required information is sometimes lacking, although a prior can be built when several sources of information (statistics, mapping and experts' opinion) are available. This is another improvement offered by the approach proposed in this paper, through the use of HJ-Biplot analysis and dasymetric mapping.
The remainder of the paper is organised as follows: section two describes the selected approach and its mathematical formulation; section three presents the empirical implementation of the methodology and the modifications made; section four presents and discusses the results. Finally, section five presents the main conclusions.
2. The methodological approach
The proposed approach was implemented considering the existing areas of temporary and permanent crops, and combines several techniques: HJ-Biplot, cluster analysis, dasymetric mapping and minimization of the cross entropy.
The HJ-Biplot builds on the biplot, a multivariate analytical technique proposed by Gabriel (1971) in which any matrix of rank two can be displayed graphically, with the vectors of each row and each column chosen such that any element of the matrix is exactly their inner product. More recently, it was demonstrated that the HJ-Biplot is able to produce better results than the classic biplot methods proposed by Gabriel (Galindo, 1986). This method achieves an optimal representation quality for both rows and columns, as they are represented on the same reference system (Galindo, 1986; Garcia-Talegon et al., 1999; Cabrera et al., 2006). The coordinates resulting from this methodology may also be used to implement a cluster analysis and define homogeneous groups of units, which may be important for data disaggregation.
Within relatively homogeneous areas, iterative processes can be implemented to define the relations between the land cover and the statistical data. Galego and Peedell (2001) proposed such a method for population analysis, which was transposed to the analysis of agricultural data by Martins et al. (2012) in the following steps:
1) The exclusion of the target zones for which the statistical variable does not exist (binary method);
2) The application of an iterative process to define the most precise densities for distributing the data;
3) The stratification/definition of sub-units with homogeneous characteristics, if the results of the previous step are not satisfactory.
However, this information is neither consistent nor compatible at a unit level. An entropy approach (Xavier et al., 2011) can combine all the sources of information with the information prior previously built.
The methodological approach combines cluster analysis, iterative methods and entropy to disaggregate agricultural data at a local level, through several steps (figure 1):
1) Collection of an extensive database of the territory, including land use, partially disaggregated data and biophysical data;
2) Application of an HJ-Biplot and a cluster analysis to the main land uses provided by the land use cartography. In this step, homogeneous groups are created, and the data disaggregation process is developed individually in each of them;
3) Implementation of the iterative dasymetric mapping process proposed by Galego and Peedell (2001), which allows, at an early phase, the redistribution of the farm data (main land uses, temporary crops and permanent crops) across the land use classes of the cartography. It is assumed that, inside each land use class, the national agricultural census data tend to have the same density, as these are in general homogeneous areas;
4) Improvement of the previous estimate using experts' analysis, depending on the available information;
5) Implementation of a cross entropy minimization process, which disaggregates the data while respecting the previous estimate, ensuring consistency with the aggregate and ensuring that the biophysical and historical restrictions on the farms' land uses are respected.
[Figure 1 diagram. Elements shown: land use maps, soil capacity, slopes, hypsometry, historical information and experts; HJ-Biplot and cluster analysis; dasymetric mapping with the iterative process of Galego and Peedell (2001) and Martins et al. (2012); aggregate statistical data; land use data, redistributed data and experts' knowledge; previous estimate; historical restrictions and biophysical restrictions; cross entropy; disaggregated data at pixel level.]
Figure 1- The methodological framework
2.1. The HJ-biplot and cluster analysis
The biplot is a data analysis tool that allows the visual representation of large data matrices (Gabriel, 1971). In detail, it is a graphical representation of a data matrix $X_{(n \times p)}$ using markers $a_1, \ldots, a_n$ for rows and markers $b_1, \ldots, b_p$ for columns, chosen in such a way that the inner product represents the element $x_{ij}$ of the initial matrix, $x_{ij} = a_i^{T} b_j$. The initial matrix can be written according to the singular value decomposition:

$$X = U_{(n \times r)} \Lambda V'_{(r \times p)} = (U \Lambda^{\gamma})(\Lambda^{1-\gamma} V') = AB'$$ (1)

where $U_{(n \times r)}$ is a matrix whose columns contain the eigenvectors of XX', and $V_{(p \times r)}$ is a matrix whose columns contain the eigenvectors of X'X. In order to guarantee the unity of the representation, a factorization is made.
The HJ-Biplot is a symmetric simultaneous representation technique in which both markers (rows and columns) can be represented in the same reference system and the quality of representation is the same for rows and columns (Galindo Vilardón et al., 1996). An HJ-Biplot representation for a data matrix X containing the units is defined as a graphical representation by multivariate markers $j_1, j_2, \ldots, j_n$ for the rows and $h_1, h_2, \ldots, h_p$ for the columns of X, selected so that both sets of markers may be overlaid in the same reference system with a high quality of representation. The rows are represented by dots and the columns by vectors (Galindo, 1986). Thus, the HJ-Biplot is based on the singular value decomposition (SVD) of the data matrix, by which any matrix may be factored as the product of three matrices such that (Castela and Villardón, 2010; Xavier and Freitas, 2014):

$$X_{(n \times p)} = U_{(n \times r)} \Lambda_{(r \times r)} V'_{(r \times p)} \quad \text{with} \quad U'U = V'V = I_r$$ (2)

where $U_{(n \times r)}$ is the matrix of eigenvectors of XX'; $V_{(p \times r)}$ is the matrix of eigenvectors of X'X; and $\Lambda_{(r \times r)}$ is a diagonal matrix of $\lambda_1 \geq \lambda_2 \geq \lambda_3 \geq \ldots \geq \lambda_r$, the singular values, which are the square roots of the r non-zero eigenvalues of XX' or X'X. The elements of $X_{(n \times p)}$ are given by:

$$x_{ij} = \sum_{k=1}^{r} \lambda_k u_{ik} v_{jk}, \quad i = 1, 2, \ldots, n; \; j = 1, 2, \ldots, p$$ (3)
A cluster analysis may then be applied to the resulting HJ-Biplot coordinates. This analysis includes different algorithms and methods for grouping objects with similar characteristics into categories. In fact, a general question facing researchers in many areas of inquiry is how to organize observed data into meaningful structures or to develop taxonomies. In other words, cluster analysis is an exploratory data analysis tool that aims to sort different objects into groups in such a way that the degree of association between two objects is maximal if they belong to the same group and minimal otherwise (STASOFT, 2013). The joining or tree clustering method uses the dissimilarities (similarities) or distances between objects when forming the clusters. At the first step, when each object represents its own cluster, the distances between those objects are defined by the chosen distance measure. However, once several objects have been linked together, the distances between the new clusters are determined using linkage rules such as single linkage or Ward's method.
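The two operations described in this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in this study: the row and column markers are taken as J = UΛ and H = VΛ, which is the usual HJ-Biplot definition but is not spelled out above; double centring is used as the pre-treatment; the function names hj_biplot and ward_groups and the land cover shares are hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster


def hj_biplot(X, n_axes=2):
    """Return HJ-Biplot row markers, column markers and retained inertia."""
    X = np.asarray(X, dtype=float)
    # Double centring (assumed pre-treatment): remove row and column means.
    Xc = X - X.mean(axis=0, keepdims=True) - X.mean(axis=1, keepdims=True) + X.mean()
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    rows = U[:, :n_axes] * s[:n_axes]    # assumed row markers J = U * Lambda
    cols = Vt[:n_axes].T * s[:n_axes]    # assumed column markers H = V * Lambda
    inertia = float((s[:n_axes] ** 2).sum() / (s ** 2).sum())
    return rows, cols, inertia


def ward_groups(coords, n_groups):
    """Hierarchical clustering of the row markers with Ward's linkage."""
    Z = linkage(coords, method="ward")   # Euclidean distances by default
    return fcluster(Z, t=n_groups, criterion="maxclust")


# Hypothetical land cover shares (rows: parishes, columns: land cover classes).
shares = np.array([
    [0.70, 0.10, 0.15, 0.05],
    [0.65, 0.15, 0.15, 0.05],
    [0.10, 0.60, 0.20, 0.10],
    [0.15, 0.55, 0.20, 0.10],
    [0.20, 0.20, 0.10, 0.50],
    [0.25, 0.15, 0.10, 0.50],
])
row_markers, col_markers, inertia = hj_biplot(shares)
groups = ward_groups(row_markers, n_groups=3)
print(f"retained inertia: {inertia:.3f}")
print("groups:", groups)
```

With these markers, the products JJ' and HH' reproduce XX' and X'X when all axes are kept, which is why rows and columns share the same quality of representation.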
2.2. The dasymetric mapping method
According to the method proposed by Galego and Peedell (2001) and Martins et al. (2012), the regional data are disaggregated in two steps. In the first step, a "binary" dasymetric procedure is used, which excludes from the process the classes with zero density. In the second step, an iterative process is applied, in which the density of each class is assumed to be the same in all source zones (administrative units). If the initial results are not satisfactory, the data can be stratified. Thus, the value of a variable S for a region r is given by:

$$S^{r} = \sum_{j} C_{jr} U_j W_r$$ (4)

The density for each land cover j in the region r is:

$$D_{rj} = U_j \frac{S^{r}}{\sum_{j} C_{jr} U_j}$$ (5)

where $C_{jr}$ is the area of land cover j in region r; $U_j$ is the weighting coefficient of land cover j; $D_{rj}$ is the density of the variable attributed to land cover j in region r; and $W_r$ is an adjustment factor ensuring that the total population in each region coincides with the known total. The value estimated for a certain variable k in each county or parish i of region r is:

$$S_{ke}^{i} = \sum_{j} C_{ij} D_{rj}$$ (6)

For each region, a set of coefficients weighting the population by land cover category has to be calculated iteratively until the difference indicator becomes stable. The difference indicator $\delta_r$ for the region r is computed as the sum of the absolute values of the differences between the population attributed to each county and the known value:

$$\delta_r = \sum_{i \in r} \left| S_{ke}^{i} - S^{i} \right|$$ (7)

The ratio between the attributed population and the known population is given by:

$$\varphi_i = \frac{S_{ke}^{i}}{S^{i}}$$ (8)
In order to adjust the weighting coefficients, a correlation $P_{jr}$ has to be calculated between the ratio of the population attributed to each administrative unit to the known population in that unit ($\varphi_i$) and the ratio between the area of land cover and the total area of the region ($c_{ki}$). This correlation is used to compute the new value of $U_j$ according to the equation:

$$U'_{rj} = U_j \left( 1 - \frac{P_{jr} \, \delta_r}{2 S^{r}} \right)$$ (9)

where $\delta_r$ is the difference indicator of region r and $S^{r}$ is the known regional total.
Finally, the calculated areas $S_{ke}^{i}$ for each land use or variable k are transformed into a set of probabilities or percentage values ($x_k^{i}$) to be used in the next step:

$$x_k^{i} = \frac{S_{ke}^{i}}{\sum_{k=1}^{K} S_{ke}^{i}}$$ (10)
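The iterative procedure of this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in this study: the function name dasymetric_densities and the toy data are hypothetical; the land cover ratio used in the correlation is taken as the share of each land cover class in the area of each unit; the known values are assumed to be strictly positive; and the correction is applied with a negative sign, so that a class that is over-represented in overestimated units has its coefficient reduced.

```python
import numpy as np


def _corr(a, b):
    """Pearson correlation; returns 0 when either input is constant."""
    a = a - a.mean()
    b = b - b.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    return 0.0 if denom == 0 else float((a * b).sum() / denom)


def dasymetric_densities(C, S, tol=1e-8, max_iter=1000):
    """C[i, j]: area of land cover j in unit i; S[i]: known value in unit i."""
    C = np.asarray(C, dtype=float)
    S = np.asarray(S, dtype=float)
    U = np.ones(C.shape[1])                       # initial coefficients U_j
    S_r = S.sum()                                 # known regional total
    cover_share = C / C.sum(axis=1, keepdims=True)
    for _ in range(max_iter):
        W = S_r / (C.sum(axis=0) @ U)             # adjustment factor W_r
        D = U * W                                 # densities per land cover
        est = C @ D                               # value attributed to each unit
        delta = np.abs(est - S).sum()             # difference indicator
        if delta / S_r < tol:
            break
        phi = est / S                             # attributed / known ratio
        P = np.array([_corr(phi, cover_share[:, j]) for j in range(C.shape[1])])
        U = U * (1.0 - P * delta / (2.0 * S_r))   # assumed sign of the update
    return D, est, delta


# Hypothetical region: four units, two land cover classes, true densities 2.0 and 0.5.
C = np.array([[10.0, 0.0], [5.0, 5.0], [0.0, 10.0], [8.0, 2.0]])
S = C @ np.array([2.0, 0.5])
D, est, delta = dasymetric_densities(C, S)
print("densities:", np.round(D, 4))
```

In this toy case the known values are generated from fixed densities, so the procedure can recover them; with real data the difference indicator will generally stabilise above zero, which is when stratification becomes relevant.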
2.3. The disaggregation process
In this step, an entropy process is applied. Shannon (1948) introduced information entropy to measure the uncertainty of the expected information:

$$H(p) = -\sum_{k} p_k \log p_k$$ (11)

where $p_k$ is the probability of variable k. Jaynes (1957) proposed the maximum entropy principle in statistical inference and stated that the theory of information entropy provided a constructive criterion for the establishment of probability distributions, using partial or incomplete knowledge (You et al. 2009, Golan et al., 2006).
Good (1963) developed the minimum cross entropy, which is a measure of the discrepancy between the posterior probabilities p and their priors q, i.e. the distance between two distributions:

$$CE(p, q) = \sum_{k} p_k \log \left( \frac{p_k}{q_k} \right)$$ (12)

The cross-entropy (CE) approach can be defined as a minimization problem in which the cross entropy $CE(p, q)$ is minimized subject to the applicable constraints and the prior knowledge.
Therefore, maximizing entropy is in fact a special case of minimizing cross entropy with respect to a uniform distribution (You et al., 2009). Golan et al. (1996) developed the generalized cross entropy (GCE) for estimating parameters, which takes into account the unknown distribution and measurement errors. The GCE approach has been used successfully by several authors in agricultural data disaggregation, such as You and Wood (2006), You et al. (2009), Chakir (2009), Martins et al. (2011), Xavier et al. (2010, 2011) and Tan et al. (2014). The proposed model is based on a GCE (Chakir, 2009) with the inclusion of an error term to disaggregate the regional values, and it guarantees data compatibility among the different sources of information:

$$\min H(x_k^{i} / R_k^{i}) = \sum_{i}^{I} \sum_{k}^{K} x_k^{i} \log \left( \frac{x_k^{i}}{R_k^{i}} \right) + \sum_{k}^{K} \sum_{n}^{N} e_{kn} \log (e_{kn})$$ (13)

subject to:

$$\sum_{k=1}^{K} x_k^{i} = 1 \quad \text{and} \quad x_k^{i} \in [0, 1] \quad \forall i$$ (14)

$$\sum_{i}^{I} x_k^{i} \, ST_i + \sum_{n}^{N} v_n e_{kn} = STAT_k \quad \forall k$$ (15)

$$x_k^{i} \, ST_i \leq LAND_k^{i} \quad \forall i \text{ and } k$$ (16)

$$x_k^{i} \, ST_i \leq B_k^{i} \quad \forall i \text{ and } k$$ (17)

$$HM_k^{i} \leq x_k^{i} \leq HMX_k^{i} \quad \forall i \text{ and } k$$ (18)

$$\sum_{n=1}^{N} e_{kn} = 1 \quad \text{and} \quad e_{kn} \in [0, 1] \quad \forall k$$ (19)
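A small numerical sketch may help to see how the problem in equations (13) to (19) behaves. The Python example below is only an illustration under stated assumptions, not the implementation used in this study: the function name disaggregate_gce and all the toy values are hypothetical; SciPy's SLSQP solver is one possible choice of optimizer; and constraints (16) to (18) are merged into simple lower and upper bounds on the shares. Here R holds the prior shares, ST the area weights of the units, STAT the regional statistics and v the error support.

```python
import numpy as np
from scipy.optimize import minimize


def disaggregate_gce(R, ST, STAT, v, lower=None, upper=None):
    """Minimise the cross entropy of shares x against the prior R, with an error term."""
    R, ST, STAT, v = (np.asarray(a, dtype=float) for a in (R, ST, STAT, v))
    I, K = R.shape
    N = v.size
    lower = np.full((I, K), 1e-8) if lower is None else np.maximum(lower, 1e-8)
    upper = np.ones((I, K)) if upper is None else np.asarray(upper, dtype=float)

    def unpack(z):
        return z[:I * K].reshape(I, K), z[I * K:].reshape(K, N)

    def objective(z):
        x, e = unpack(np.clip(z, 1e-12, None))       # keep the logarithms finite
        return np.sum(x * np.log(x / R)) + np.sum(e * np.log(e))

    constraints = [
        # Shares of each unit add up to one.
        {"type": "eq", "fun": lambda z: unpack(z)[0].sum(axis=1) - 1.0},
        # Weighted shares plus the error term match the regional statistics.
        {"type": "eq", "fun": lambda z: ST @ unpack(z)[0] + unpack(z)[1] @ v - STAT},
        # Error probabilities of each crop add up to one.
        {"type": "eq", "fun": lambda z: unpack(z)[1].sum(axis=1) - 1.0},
    ]
    bounds = list(zip(lower.ravel(), upper.ravel())) + [(1e-8, 1.0)] * (K * N)
    z0 = np.concatenate([np.clip(R, lower, upper).ravel(), np.full(K * N, 1.0 / N)])
    res = minimize(objective, z0, method="SLSQP", bounds=bounds,
                   constraints=constraints, options={"ftol": 1e-10, "maxiter": 500})
    x, e = unpack(res.x)
    return x, e, res


# Hypothetical example: three units, two crops.
R = np.array([[0.6, 0.4], [0.3, 0.7], [0.5, 0.5]])   # prior shares
ST = np.array([0.5, 0.3, 0.2])                       # area weights of the units
STAT = np.array([0.55, 0.45])                        # regional shares to be matched
v = np.array([-0.5, 0.0, 0.5])                       # error support
x, e, res = disaggregate_gce(R, ST, STAT, v)

SA = np.array([500.0, 300.0, 200.0])                 # hypothetical unit areas (ha)
areas = x * SA[:, None]                              # estimated area per crop and unit
print(np.round(x, 3))
```

When the prior already aggregates to the regional statistics, the solution stays at the prior; otherwise the discrepancy is split between a shift in the shares and the error term.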
In equations (13) to (19), $x_k^{i}$ are the probabilities of each crop k in area i; $R_k^{i}$ are the probabilities of each crop in area i resulting from prior information or previous estimates; $ST_i$ is the area weight of each disaggregated unit; $STAT_k$ are the regional statistics for crop or occupation k; $LAND_k^{i}$ is the land use available for that occupation in disaggregated unit i, which can be expressed as a share; $B_k^{i}$ are the biophysical limits; $HM_k^{i}$ are the minimum historical limits and $HMX_k^{i}$ are the maximum historical limits for each disaggregated unit. Finally, $e_{kn}$ refers to an error term re-parameterized in terms of probabilities, such that $e_k = \sum_{n} v_n e_{kn}$, where $\{v_1, \ldots, v_N\}$, with N ≥ 2 points, is the support vector associated with the probabilities $\{e_{k1}, \ldots, e_{kN}\}$, according to the formalism of Golan et al. (1996).
Equation (13) is the objective function, which minimizes the composite cross entropy of the estimated probability distribution ($x_k^{i}$) relative to the information prior ($R_k^{i}$), together with that of the error distribution ($e_{kn}$). Equation (14) guarantees that $x_k^{i}$ has the characteristics of a probability distribution, that is, its sum over k must be one for each unit. Equation (15) states that the disaggregated shares $x_k^{i}$ must be compatible with the aggregate at the regional level. Equations (16) and (17) ensure that the available land cover limits, reinterpreted from the land use cartography, and the biophysical limits are not exceeded. Equation (18) relates to the historical limits that must be respected. Finally, equation (19) ensures that the error respects the properties of a probability distribution. Once the shares have been calculated, the regional data are simply redistributed using the following equation:

$$\hat{S}_k^{i} = x_k^{i} \cdot SA_i$$ (20)

in which $\hat{S}_k^{i}$ is the estimated area of crop or land use k for unit i and $SA_i$ is the area of unit i.
3. Empirical implementation
3.1. The study area
The study area is the Algarve region in southern Portugal, with an area of 4,996.8 km2, as shown in the following figure. It is composed of 16 municipalities and 84 parishes. The climate has Mediterranean characteristics, although there are differences between the inner and coastal areas. The conditions for developing agriculture in the inner region are harsher, as the slopes are steeper and the soils have less capacity.
The main land uses of the region are presented in Figure 3, which shows that more than 68% of the region's area is occupied by forest and shrubs. Regarding the farms, Table 1 shows that the average farm area is low (about 7 hectares) and that the utilized agricultural land (UAL) is mostly concentrated in permanent crops, followed by pastures.
Figure 2-The Geographical localization of the Region of the Algarve
Figure 3 - The main land uses (source: COS 2007)

Table 1 - Main land uses of farms - UAL
Numbers
  Numb. farms                          88297
  Utilized agr. land (UAL)/farm         7.13
Areas (ha)
  UAL                                  88297
  Arable land                          22327
  Temp. crops                           7981
  Fallow                               14346
  Kitchen garden                         628
  Permanent crops                      45007
  Pastures                             20335
(source: model results)
3.2. Model implementation
Several sources of information were considered for the application of the proposed model. The land use cartography comprises the Corine Land Cover 2000 (CLC 2006), the land use cartography of the AFN, originally dated 1995, and the more recent land use cartography of 2007 at a scale of 1:25,000 (COS 2007). The soil capacity map is at a scale of 1:250,000, but the IDRHA cartography of soil capacity, which also presents the different types of soils, is available as well. Slope and hypsometry data come from a slope map created from the vector data used as the basis for the 1:25,000 military relief data (Xavier et al., 2011). The remaining data are from the Instituto do Ambiente and have a very low resolution.
The main land use classes used to implement the HJ-Biplot method and the iterative process proposed by Galego and Peedell (2001) result from simplifying the COS 2007 into a reduced number of classes, taking into account the detail of the available information and the objectives of the study: artificial areas (ARTA), forest and shrubs areas (FESHRB), heterogeneous agricultural areas (HETAGRAREAS), permanent crops (PC), permanent pastures (PP), temporary crops (TCROPS), water and other related areas (WRA) and areas with little vegetation (AREFVEG).
For the transformation process in the HJ-Biplot method, we considered the experts' opinion on the final results and selected double centring for the analysis of the main land cover classes. The different homogeneous groups of municipalities were defined by applying a hierarchical cluster analysis to the Biplot coordinates. Euclidean distances were used as the dissimilarity index (Rajaraman et al., 2010) and Ward's method as the linkage method. Ward's minimum variance criterion minimizes the total within-cluster variance: at each step, the pair of clusters with the minimum between-cluster distance is merged.
To implement this method, at each step it is necessary to find the pair of clusters whose merger leads to the minimum increase in total within-cluster variance.
The model was implemented to disaggregate several groups of temporary and permanent crops. The groups of temporary crops considered were: cereals (CER), leguminous plants (LEG), temporary pastures and forages (PCF), horticultural crops and potatoes (HORTEBAT) and other temporary crops (OUT), which include all other crops, industrial crops and ornamental flowers.
The groups of permanent crops resulted from a small grouping available in the official statistics: fresh fruits (FRTFRES), citrus (CITR), nut fruits (FTRCRIJ), olive trees (OLIV), vineyards (VIN) and other permanent crops (OCP). These variables were disaggregated to the pixel level, totalling 6832 disaggregated units, which were then aggregated in the required situations.
The entropy process guarantees that the data are disaggregated while respecting the existing biophysical conditions and the general aggregate, which is known, providing data compatibility between the several layers of information treated in a previous step of the model. These biophysical restrictions result from information on soil capacity, slope, hypsometry and climate, which allowed suitability areas to be defined for each of the crops or occupations considered.
The error definition followed previous studies, namely Chakir (2009), Martins et al. (2011) and Fragoso et al. (2008), so the three sigma rule was used as a reference. The error limits chosen were those that provided the best results. The error support considered was therefore v = {-0.5, 0, 0.5} for the pixel-level disaggregation process of Group 1, for both temporary and permanent crops, and v = {-1, 0, 1} in all other cases.
3.3. Validation
The model was validated by comparing its results with real statistical data, using several deviation measures and the opinion of experts (Xavier et al., 2010, 2014; Martins et al., 2010, 2012; Fragoso et al. 2008). In addition, the validation procedures of You and Wood (2006) and You et al. (2009), who used the correlation and determination coefficient (R2), were followed. The statistical measures used to compare the estimated and real data are detailed below.
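The Percentage Absolute Deviation (PAD) measures the deviations between the estimates and the real statistical data of the 2009 National Agricultural Census. The weighted percentage absolute deviation (WPAD) was calculated considering the total deviation for the disaggregated parishes and the weight of the land uses:

$$PAD_k^{i} = \frac{\left| S_k^{i} - \hat{S}_k^{i} \right|}{S_k^{i}}, \qquad WPAD_k^{i} = \frac{\left| S_k^{i} - \hat{S}_k^{i} \right|}{S_k^{i}} \, P_k^{i} \qquad \text{and} \qquad WPAD^{i} = \sum_{k=1}^{K} WPAD_k^{i}$$ (21)

where $P_k^{i}$ is the weight of land use k in unit i, and the deviations are expressed as percentages. At the aggregate level, according to the aggregation of the disaggregated units:

$$WPAD = \sum_{i=1}^{I} \frac{s_i}{S} \, WPAD^{i}$$ (22)

where $s_i / S$ is the weight of unit i in the aggregate.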
The WPADki expresses the percentage of deviation in each crop class regarding the PAD values weighted by its true importance at disaggregated level i. The WPADi corresponds to the sum of WPADki values by crop k, which allows giving the idea of the real total deviation for the values of the unit i. The WPAD corresponds to the weighted sum of the WPADi .
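The deviation indicators can be computed as in the following Python sketch. It is an illustration under stated assumptions rather than the code used in this study: the function name deviation_indicators and the numbers are hypothetical; the weight of each crop is taken as its share in the observed total of the unit; and the units are weighted by their observed totals unless other weights are supplied.

```python
import numpy as np


def deviation_indicators(S_true, S_est, unit_weights=None):
    """S_true, S_est: arrays (units x crops) of observed and estimated areas."""
    S_true = np.asarray(S_true, dtype=float)
    S_est = np.asarray(S_est, dtype=float)
    abs_dev = np.abs(S_true - S_est)
    with np.errstate(divide="ignore", invalid="ignore"):
        pad = 100.0 * abs_dev / S_true          # nan or inf where the observed value is zero
    unit_total = S_true.sum(axis=1, keepdims=True)
    # PAD times the crop weight S_k / sum_k S_k simplifies to |S - S_est| / sum_k S_k,
    # which stays defined when a single crop has an observed value of zero.
    wpad_ki = 100.0 * abs_dev / unit_total
    wpad_i = wpad_ki.sum(axis=1)
    if unit_weights is None:
        unit_weights = unit_total.ravel() / unit_total.sum()   # assumed unit weights
    wpad = float(np.asarray(unit_weights, dtype=float) @ wpad_i)
    return pad, wpad_ki, wpad_i, wpad


# Hypothetical observed and estimated areas (two units, two crops).
S_obs = np.array([[10.0, 30.0], [20.0, 20.0]])
S_hat = np.array([[12.0, 28.0], [20.0, 25.0]])
pad, wpad_ki, wpad_i, wpad = deviation_indicators(S_obs, S_hat)
print(np.round(pad, 1), np.round(wpad_i, 2), round(wpad, 2))
```

Because each deviation is scaled by the importance of the crop in the unit, a large PAD on a marginal crop contributes little to the unit total, which is why the two indicators can tell different stories.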
Another validation measure used was the Pearson correlation coefficient, $R(S_k^{i}, \hat{S}_k^{i})$, which determines how the two variables are related to each other. R2 was calculated from these correlation coefficients and represents the "strength" or "magnitude" of the relationship (STASOFT, 2013). The modelling efficiency (EF) indicator is a normalized measure for evaluating model performance proposed by You et al. (2009), and is given by:

$$EF = 1 - \frac{\sum \left( S_k^{i} - \hat{S}_k^{i} \right)^2}{\sum \left( S_k^{i} - \bar{S}_k^{i} \right)^2}$$ (23)

where $S_k^{i}$ is the observed value, $\hat{S}_k^{i}$ is the model result and $\bar{S}_k^{i}$ is the average of the $S_k^{i}$ values.
4. Results and discussion
4.1. The HJ-biplot
The results of the HJ-Biplot analysis and of the resulting cluster analysis are presented below. In the HJ-Biplot, two axes were retained, with 84.246% of the accumulated inertia. The relative contributions of the factor to the element are presented in Table 2. The analysis shows that the first axis is highly correlated with the artificial areas (ARTA), forest and shrubs areas (FESHRB) and heterogeneous agricultural areas (HETAGRAREAS), meaning that it represents the forest systems and the agro-forest systems. The second axis is correlated with permanent crops (PC) and water and other related areas (WRA), representing the systems oriented towards productive permanent crops. The HJ-Biplot graphical representation and the resulting groups of parishes are presented in figure 5. There seems to be a correlation between the permanent crops (PC) and the heterogeneous agricultural areas (HETAGRAREAS), and also between the artificial areas (ARTA), areas with little vegetation (AREFVEG) and water and other related areas (WRA). All these land cover classes are inversely correlated with the forest and shrubs areas (FESHRB).
Table 2 - The relative contributions of the factor to the element
Column         Axis 1   Axis 2
ARTA               15      395
FESHRB              0      996
HETAGRAREAS        95      209
PC                261      615
PP                 63        2
TCROPS              1        1
WRA               253      578
AREFVEG           191      174
(source: model results)
Figure 5- The HJ-Biplot representation (source: model results)
Four groups of parishes can also be identified in figure 5:
Group 1 - Parishes that tend to be highly oriented towards the permanent crops (PC) land cover class.
Group 2 - Parishes with diverse orientations, but mainly towards heterogeneous agricultural areas (HETAGRAREAS), artificial areas (ARTA), areas with little vegetation (AREFVEG) and water and other related areas (WRA). These parishes lie near the coast and have developed infrastructures connected with tourism, with many agricultural areas being converted into artificial spaces.
Group 3 - Parishes highly oriented towards water and other related areas (WRA). These are coastal parishes with urban uses, related to fishery activities or tourism.
Group 4 - Parishes oriented towards forest and shrubs areas (FESHRB). This group comprises all the municipalities of the inner region, where forest uses are most important, including the parishes situated in the mountain areas of Monchique and Caldeirão. Several of these parishes are in demographic decline and hence agricultural activity is decreasing.
4.2. The dasymetric mapping method
The iterative dasymetric mapping process proposed by Galego and Peedell (2001) was implemented in order to generate similar densities for distributing the data in each group of municipalities, for the several classes of temporary and permanent crops considered. Validation was also carried out using the PAD for the different clusters, and is presented in terms of the PAD median (tables 3 and 4). Note that, for group 3, calculations were not made using the iterative method of Galego and Peedell (2001) followed by Martins et al. (2012), because of the limited area of this group and the fact that the values of several crops are concentrated in only one parish. Results show that in several cases the PAD median is unsatisfactory, since the values are clearly above 50%. Moreover, these differences may reflect the need for further grouping by cluster analysis. However, the experts consulted highlighted that this region is very heterogeneous in agricultural terms and that, unlike in other regions of Portugal such as the Alentejo, data disaggregation and the prediction of crops may be a very difficult task. In addition, this disaggregation process does not consider historical restrictions or biophysical ones regarding soil capacity, nor some expert knowledge, which may clearly improve the data.

Table 3 - The PAD's median for the temporary crops
Crop        Group 1   Group 2   Group 4
CER            67.5     129.3      47.3
HORTEBAT       50.1      51.9     100.0
OUT            59.7      46.9       3.6
PCF            79.2     325.8     230.8
LEG            93.0      33.0      50.9
CER - cereals, LEG - leguminous plants, PCF - temporary pastures and forages, HORTEBAT - horticultural crops and potatoes, OUT - other temporary crops (source: model results)
Table 4 - The PAD's median for the permanent crops
Group       OCP   FTRCRIJ    VIN   OLIV    CITR   FRTFRES
Group 1    54.6      40.4   82.6   23.3    27.9      39.9
Group 2    80.0      79.3   90.9   69.3    98.2      92.5
Group 4     0.0      55.7   74.3   87.8   155.6      67.9
FRTFRES - fresh fruits, CITR - citrus, FTRCRIJ - nut fruits, OLIV - olive trees, VIN - vineyards, OCP - other permanent crops (source: model results)
4.3. The disaggregation process
To implement the data disaggregation process, two models (one for temporary and one for permanent crops) were constructed for each group of municipalities. These were then unified in a regional map, allowing a cartographical representation of each crop, as presented in Figure 6 for one temporary and one permanent crop: temporary pastures and forages (PCF) and citrus (CITR). Results show that, for temporary pastures and forages, there is a considerable allocation in inner municipalities such as Alcoutim and Vila do Bispo. In the case of citrus, results show a more coastal distribution, with a considerable area in the municipality of Silves.
[Figure 6 panels: Shares per disaggregated unit; Areas per disaggregated unit; Pixel-level disaggregation]
Figure 6 - The spatial distribution of the final results (source: model results)
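The consistency step behind such a disaggregation can be illustrated with a small numerical sketch. It is not the model used in this paper: it assumes that the only constraints are the known crop totals of a municipality and the agricultural area available in each parish, and that both add up to the same total. Under these assumptions the minimum cross entropy solution with respect to a prior allocation can be obtained by iterative proportional fitting of the prior. All names and figures are hypothetical.

```python
def min_cross_entropy_ipf(prior, crop_totals, parish_areas,
                          tol=1e-10, max_iter=1000):
    """Scale a prior allocation (crops x parishes) to match both margins.

    Iterative proportional fitting converges to the allocation that
    minimises the cross entropy with respect to `prior`, subject to the
    row totals (crops) and the column totals (parishes).
    """
    if abs(sum(crop_totals) - sum(parish_areas)) > 1e-9:
        raise ValueError("crop totals and parish areas must add up "
                         "to the same area")
    x = [row[:] for row in prior]
    for _ in range(max_iter):
        # Rescale each crop (row) to its known municipal total.
        for i, target in enumerate(crop_totals):
            s = sum(x[i])
            if s > 0:
                x[i] = [v * target / s for v in x[i]]
        # Rescale each parish (column) to its available area.
        for j, target in enumerate(parish_areas):
            s = sum(row[j] for row in x)
            if s > 0:
                for row in x:
                    row[j] *= target / s
        # Columns are now exact; stop once the rows are close enough.
        gap = max(abs(sum(x[i]) - t) for i, t in enumerate(crop_totals))
        if gap < tol:
            break
    return x


# Hypothetical prior areas (ha) from land use maps: 2 crops x 3 parishes.
prior = [[30.0, 10.0, 10.0],
         [10.0, 20.0, 20.0]]
crop_totals = [60.0, 40.0]          # municipal statistics per crop
parish_areas = [50.0, 30.0, 20.0]   # agricultural area per parish

estimate = min_cross_entropy_ipf(prior, crop_totals, parish_areas)
for row in estimate:
    print([round(v, 2) for v in row])
```

The estimate stays as close as possible to the prior, in the cross entropy sense, while reproducing the aggregated statistics exactly. Additional restrictions, such as historical or biophysical limits, would require a general constrained optimiser rather than this simple scaling scheme.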
Validation was carried out using the correlation between the predicted and real shares at the parish level, and also through the PAD, WPADi and WPAD indicators, which allow the deviations from the real shares to be analysed. The synthesis of the PAD analysis is presented in Table 5. It shows that, in the case of temporary crops, the average presents mostly unsatisfactory values, as it is influenced by extreme values. However, in most cases the median reveals quite satisfactory results: cereals (CER) with 12,0%, other temporary crops (OUT) with 22,1%, temporary pastures and forages (PCF) with 36,1% and leguminous plants (LEG) with 0%. The average and median of the WPADi are unsatisfactory, as they are influenced by the high deviations in some crops. Regarding the permanent crops, the average results are unsatisfactory for all crops except other permanent crops (OCP). The median is satisfactory for nut fruits (FTRCRIJ) with 32,0%, olive trees (OLIV) with 33,0% and other permanent crops (OCP) with 0,0%. The summary results for the WPADi reveal an improvement in permanent crops in both the median and the average.
Table 5 - The statistical deviation indicators (PAD) for the different crops - parishes
Crops               Median   Average   Max      Min
Temporary crops:
CER                 12       125,9     1980     0
HORTEBAT            50,7     122,1     1040     0
OUT                 22,1     35,7      200      0
PCF                 36,1     125,2     6170     0
LEG                 0        18,3      243,4    0
WPADi               57,3     74,9      351,5    0
Permanent crops:
FTRCRIJ             32       81,5      1700     0
VIN                 51,8     113,2     2420     0
OLIV                33       113,9     3900     0
CITR                76,4     86,6      620      0
FRTFRES             57,9     83,6      845      0
OCP                 0        20        677,8    0
WPADi               46,1     61,5      391,8    0
CER-Cereals, LEG-Leguminous plants, PCF-Temporary pastures and forages, HORTEBAT-horticultural crops and potatoes, OUT-other temporary crops, FRTFRES-Fresh fruits, CITR-citrus, FTRCRIJ-nut fruits, OLIV-olive trees, VIN-vineyards, OCP-other permanent crops. (source: model results)
The summary statistics of the PAD at municipality level (Table 6) show that, in the case of temporary crops, all crops except horticultural crops and potatoes (HORTEBAT) present very satisfactory results for the median. The analysis of the average also shows several improvements compared with the previous validation at parish level. Regarding the permanent crops, the average value of the PAD is satisfactory for nut fruits (FTRCRIJ) with 21,0% and other permanent crops (OCP) with 9,9%. The median values are mostly very satisfactory, with the exception of fresh fruits (FRTFRES), where the value is 46,8%.
The resulting WPAD for the parishes is 86% for the temporary crops and 38% for the permanent crops. This means that at the parish level the global results are not satisfactory, although the value for permanent crops is more acceptable. For the municipalities, a WPAD of 39% was obtained for the temporary crops and of 20,65% for the permanent crops, which means that validation at this level provided satisfactory results, and very good results for the permanent crops, as the WPAD is low. This situation shows the limits of the prior information used at local level, namely land use maps: these have a minimum mapping unit of 1 hectare, whereas farm areas in the Algarve region are quite small and diverse.
Table 6 - The summary statistical deviation indicators (PAD) for the different crops - municipalities
Crops               Median   Average   Max      Min
Temporary crops:
CER                 47,9     119,4     660,0    0,0
HORTEBAT            63,9     94,1      421,5    0,5
OUT                 32,1     35,3      94,0     0,0
PCF                 22,4     413,9     6170,0   3,4
LEG                 32,5     45,3      120,0    0,0
WPADi               42,9     61,3      351,5    13,6
Permanent crops:
FTRCRIJ             8,3      21,0      80,4     0,0
VIN                 22,2     50,7      279,8    0,0
OLIV                31,3     97,9      982,5    0,0
CITR                38,5     44,3      142,9    0,0
FRTFRES             46,8     72,1      482,5    5,6
OCP                 0,9      9,9       35,0     0,0
WPADi               29,5     31,8      88,1     4,7
CER-Cereals, LEG-Leguminous plants, PCF-Temporary pastures and forages, HORTEBAT-horticultural crops and potatoes, OUT-other temporary crops, FRTFRES-Fresh fruits, CITR-citrus, FTRCRIJ-nut fruits, OLIV-olive trees, VIN-vineyards, OCP-other permanent crops. (source: model results)
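One reason why deviations tend to shrink between the parish and the municipality level is that over- and underestimates in neighbouring parishes partly cancel out once they are summed. The sketch below illustrates this effect. It assumes that the WPAD is the sum of the absolute deviations divided by the sum of the observed values, in percent, that is, a PAD weighted by the observed areas. This definition, the function names and the data are assumptions made for illustration only.

```python
from collections import defaultdict


def wpad(estimated, observed):
    """Weighted PAD in percent (assumed definition: sum|e - o| / sum o)."""
    total = sum(observed)
    if total == 0:
        raise ValueError("observed values must not all be zero")
    deviations = sum(abs(e - o) for e, o in zip(estimated, observed))
    return deviations / total * 100.0


def aggregate(values, groups):
    """Sum unit-level values by group label, in first-seen group order."""
    sums = defaultdict(float)
    for v, g in zip(values, groups):
        sums[g] += v
    return list(sums.values())


# Hypothetical areas (ha) of one crop in five parishes of two municipalities.
municipality = ["A", "A", "A", "B", "B"]
observed = [40.0, 10.0, 50.0, 30.0, 70.0]
estimated = [30.0, 25.0, 50.0, 45.0, 50.0]

parish_wpad = wpad(estimated, observed)
municipal_wpad = wpad(aggregate(estimated, municipality),
                      aggregate(observed, municipality))
print(parish_wpad, municipal_wpad)
```

In this hypothetical case the WPAD is 30% at parish level and 5% at municipality level. Under the assumed definition the aggregated WPAD can never exceed the parish-level one, so the comparison between the two levels says more about where the errors are located than about the overall totals.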
The correlation coefficients and R2, as well as the modelling efficiency indicator (EF) are presented in Table 7 and were calculated following the validation process of You and Wood (2006) and You et al. (2009).
The values of the Pearson coefficient are always higher than 0.6, meaning that the results are quite satisfactory. R2 also presents good values for the temporary crops, the lowest being those of cereals (CER), 0.476, and horticultural crops and potatoes (HORTEBAT), 0.441; all the other values are higher than 0.5. Regarding permanent crops, the results for these indicators are also very satisfactory. Most correlation coefficients are close to one, with an almost perfect correlation in the case of citrus (CITR) and other permanent crops (OCP). For R2, all crops except vineyards (VIN) present good results.
Table 7 - Pearson correlation coefficient, R2 and the modelling efficiency (EF) indicator
Crops               Pearson   R2      Modelling Efficiency
Temporary crops:
CER                 0.690     0.476   0.426
HORTEBAT            0.664     0.441   0.438
OUT                 0.764     0.584   0.502
PCF                 0.771     0.595   0.426
LEG                 0.735     0.540   0.456
Permanent crops:
FTRCRIJ             0.899     0.809   0.796
VIN                 0.623     0.388   0.233
OLIV                0.851     0.725   0.665
CITR                0.950     0.903   0.848
FRTFRES             0.725     0.526   0.477
OCP                 0.984     0.968   0.967
CER-Cereals, LEG-Leguminous plants, PCF-Temporary pastures and forages, HORTEBAT-horticultural crops and potatoes, OUT-other temporary crops, FRTFRES-Fresh fruits, CITR-citrus, FTRCRIJ-nut fruits, OLIV-olive trees, VIN-vineyards, OCP-other permanent crops. (source: model results)
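The three indicators in Table 7 can be computed from paired observed and predicted values as sketched below. The sketch assumes that R2 is the square of the Pearson coefficient, which is consistent with the values in Table 7 (for example, 0.690 squared is about 0.476), and that EF follows the usual modelling efficiency definition: one minus the ratio of the sum of squared prediction errors to the sum of squared deviations of the observations from their mean. The function names and the data are hypothetical.

```python
from math import sqrt


def pearson(observed, predicted):
    """Pearson correlation coefficient between two equally long series."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p)
              for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    if var_o == 0 or var_p == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_o * var_p)


def modelling_efficiency(observed, predicted):
    """EF = 1 - sum((o - p)^2) / sum((o - mean(o))^2) (assumed definition)."""
    mean_o = sum(observed) / len(observed)
    sse = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    sst = sum((o - mean_o) ** 2 for o in observed)
    if sst == 0:
        raise ValueError("EF is undefined for constant observations")
    return 1.0 - sse / sst


# Hypothetical observed and predicted values for four units.
observed = [0.10, 0.20, 0.30, 0.40]
predicted = [0.20, 0.20, 0.40, 0.40]

r = pearson(observed, predicted)
r2 = r ** 2
ef = modelling_efficiency(observed, predicted)
print(round(r, 3), round(r2, 3), round(ef, 3))
```

In this example R2 is about 0.8 while EF is about 0.6. Unlike R2, EF penalises systematic over- or underestimation, so under these definitions it cannot exceed R2, which agrees with the pattern in Table 7, where EF is below R2 for every crop.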
Finally, the EF indicator presents, for temporary crops, values between 0.426 for cereals (CER) and temporary pastures and forages (PCF) and 0.502 for other temporary crops (OUT). Permanent crops exhibit better levels of this indicator, with the maximum value (0.967) obtained for other permanent crops (OCP). Most land uses under permanent crops present very good values for the EF indicator: this is the case of olive trees (OLIV), nut fruits (FTRCRIJ) and citrus (CITR), with values of 0.665, 0.796 and 0.848, respectively. The worst value of the EF indicator for permanent crops, 0.233, was obtained for vineyards (VIN).
We can consider that the model results are, in general, very acceptable. You and Wood (2006) and You et al. (2009), in a similar validation process carried out in Brazil, obtained Pearson correlation coefficients between 0.4 and 0.65 and satisfactory P values. The same authors, in another study carried out in Africa, obtained R2 values between 0.4 and 0.8, which are not better than ours. For the EF indicator, the best performance reached by You et al. (2009) was 0.71, but only for a crop with an R2 of around 0.8; the values for the remaining crops were always lower.
5. Concluding remarks
The proposed approach allowed data to be disaggregated at a local level with satisfactory results. The entropy approach is a useful methodology for achieving consistency and combining several sources of data, ensuring that restrictions regarding the historical land uses of crops are respected. The restrictions considered at the historical and biophysical levels contributed to the improvement of the proposed methodology. The use of expert knowledge to build the prior information was of great value to the methodology and the results.
The proposed approach proved to be a good tool for solving the problem of the lack of local data. In the context of rural development, such a tool can be used to evaluate the CAP reform in Europe. Moreover, it shows that combining a cluster analysis with a dasymetric mapping procedure to create the prior information can help to improve the process of data disaggregation. Knowing the relations between land cover data and statistical data, and assuming that these relations are maintained, the model can be replicated for other years for which land cover data and aggregated statistical data are available.
In order to achieve more satisfactory results, new research is being carried out, which includes improvements in the calculation of the agricultural areas. In addition, several research streams were defined for developing the methodology further. The first concerns the development of highest posterior density (HPD) methods for handling different sources of information. This approach will allow an alternative to minimum cross entropy to be tested, dealing with several sources of data in a more precise way. The second is the application of the proposed approach to other areas. We believe that this methodology may be a very good way of providing disaggregated data for the Alentejo region, since its territory is more homogeneous and hence its municipalities and parishes are more similar to each other.
References
Bielecka, E. (2005). "Dasymetric population density map of Poland". Proceedings of the 22nd International Cartographic Conference, 9 to 15 July, Coruna, Spain.
Castela, E. and M. Purificacion Villardón (2010). Ecological inference for the characterization of electoral turnout: the Portuguese case. Spatial and Organizational Dynamics. Quantitative Methods Applied to Social Sciences. Discussion Papers Nº 3, 6-25.
Cabrera, J., M. Martínez, E. Mateos and S. Tavera (2006). Study of the evolution of air pollution in Salamanca (Spain) along a five-year period (1994–1998) using HJ-Biplot simultaneous representation analysis. Environmental Modelling & Software 21, 61–68.
Carvalho, M. and M. Godinho (2011). A nova reforma da política agrícola comum e suas consequências num sistema agrícola mediterrâneo de Portugal. Organizações Rurais & Agroindustriais 13(2), 165-175.
Chakir, R. (2009). Spatial downscaling of agricultural land use data: an econometric approach using cross–entropy. Land Economics 85(2): 238–251.
Direção Geral do Território (2007). Carta de Uso e Ocupação do Solo de Portugal Continental para 2007. Accessed on 10-05-2014 at [http://www.dgterritorio.pt/cartografia_e_geodesia/cartografia/cartografia_tematica/cos/cos__2007].
Foley, J. A., DeFries, R., Asner, G. P. (2005). Global consequences of land use. Science 309: 570–574.
Fragoso, R., Martins, M.B., and Lucas, M.R. (2008). Generate disaggregated soil allocation data using a Minimum Cross Entropy Model. WSEAS Transaction on Environment and Development 9(4): 756-766.
Gabriel, K.R. (1971). The Biplot Graphic Display of Matrices with Application to Principal Component Analysis. Biometrika 58: 453-467.
Galindo, M. (1986). Una alternativa de representacion simultanea: HJ-Biplot. Questio 10(1), 13-23.
Gallego, F.J. and Peedell, S. (2001). "Using CORINE Land Cover to map population density. Towards Agri-environmental indicators". Topic report 6/2001, European Environment Agency, Copenhagen, pp. 92-103.
Gallego-Ayala, J. and J. Gómez-Limón (2011). Future scenarios and their implications for irrigated agriculture in the Spanish region of Castilla y León. New Medit 1/2011, 4-16.
Garcia-Talegon, J., M. Vicente, E. Molina-Ballesteros and S. Vicente-Tavera (1999). Determination of the origin and evolution of building stones as a function of their chemical composition using the inertia criterion based on an HJ-Biplot. Chemical Geology 153: 37–51.
Golan, A., Judge, G. and Miller, D. (1996). Maximum Entropy Econometrics: Robust Estimation with Limited Data. New York, USA: John Wiley & Sons.
Good, I. (1963). Maximum entropy for hypothesis formulation, especially for multidimensional contingency tables. The Annals of Mathematical Statistics 34(3): 911–934.
Hidalgo, M. (2011). HJ-Biplot aumentado. Departamento de Estadística, Universidad de Salamanca.
Howitt, R. and Reynaud, A. (2003). Spatial disaggregation of agricultural production data using maximum entropy. European Review of Agricultural Economics 30(3): 359–387.
INE-Instituto Nacional de Estatística (2001d). Recenseamento geral da agricultura de 1999. Lisbon, Portugal: INE.
INE-Instituto Nacional de Estatística (2002e). Recenseamento geral da população - Censos 2001. Lisbon, Portugal: INE.
INE-Instituto Nacional de Estatística (2011). Recenseamento geral da agricultura de 2009. Lisbon, Portugal: INE.
Kempen, M., Heckelei, T., Britz, W., Leip, A., Koeble, R., & Marchi, G. (2005). Computation of a European Agricultural Land Use Map - Statistical Approach and Validation. Discussion Paper. Institute for Food and Resource Economics, Bonn.
Louhichi, K., Jacquet, F., Butault, J.P. (2012). Estimating input allocation from heterogeneous data sources: A comparison of alternative estimation approaches. Agricultural Economics Review 13(2): 83-102.
Martín-Rodriguez, J., M. Galindo-Villardon and J. Vicente-Villardon (2002). Comparison and integration of subspaces from a biplot perspective. Journal of Statistical Planning and Inference 102, 411–423.
Martins, M. B., Fragoso, R., Xavier, A. (2011). Spatial disaggregation of agricultural data in Castelo de Vide, Alentejo, Portugal: an approach based on maximum entropy. J.P. Journal of Biostatistics 5(1), 1-16.
Martins, M.B., A.M. Xavier, R. Fragoso (2012). Redistributing agricultural data by a dasymetric mapping methodology. Agricultural and Resource Economics Review 41/3: 351-366.
Portmann, F. T., Siebert, S., Döll, P. (2010). MIRCA 2000 - Global monthly irrigated and rainfed crop areas around the year 2000: A new high-resolution data set for agricultural and hydrological modeling. Global Biogeochemical Cycles 24 (1011): 1–24, doi: 10.1029/2008GB003435.
Rajaraman, A., J. Leskovec and J. Ullman (2010). Mining of Massive Datasets. Stanford University, Cambridge University Press.
Roeder, N., & Gocht, A. (2011). Municipality disaggregation of German's agricultural sector model Raumis. In 122nd Seminar, February 17-18, 2011, Ancona, Italy (No. 99248). European Association of Agricultural Economists.
Shannon, C. (1948). A mathematical theory of communication. Bell System Technical Journal 27, 379–423.
StatSoft, Inc. (2013). Electronic Statistics Textbook. Tulsa, OK: StatSoft. WEB: http://www.statsoft.com/textbook/.
Tan, J., Yang, P., Liu, Z., Wu, W., Zhang, L., Li, Z., ., Li, Z. (2014). Spatio-temporal dynamics of maize cropping system in Northeast China between 1980 and 2010 by using spatial production allocation model. Journal of Geographical Sciences 24(3): 397-410.
Vicente-Villardón, J. (2013). MULTBIPLOT program (Beta version). Salamanca: Statistics Department, University of Salamanca.
Vicente-Villardón, J. (W.D.). The BIPLOT methods [in Spanish]. Accessed 15-06-2011 at [http://bi-plotbiplot.dep.usal.es/classicalbi-plotbiplot/documentation/notas-sobre-biplotbiplot-clasico-.pdf].
You, L., Wood, S. (2006). An entropy approach to spatial disaggregation of agricultural production. Agricultural Systems 90: 329–347.
You, L., Wood, S. and Wood-Sichra, U. (2009). Generating plausible crop distribution maps for Sub-Saharan Africa using a spatially disaggregated data fusion and optimization approach. Agricultural Systems 99 (2-3): 126-140.
Xavier, A., Costa Freitas, M. B. & Fragoso, R. (2014a). Disaggregation of Statistical Livestock Data Using the Entropy Approach. Advances in Operations Research, Volume 2014, Article ID 397675, 9 pages.
Xavier, A., Costa Freitas, M. B. (2014b). Recent dynamics and trends of Portuguese agriculture - a Biplot analysis. New Medit 13(4), in press.
Xavier, A., Martins, M. B., Fragoso, R. (2010). Combined disaggregation of agricultural land uses, livestock numbers and crops' production: an entropy approach. In "Advances in Mathematical and Computational Methods", proceedings of the 12th WSEAS International Conference on Mathematical and Computational Methods in Science and Engineering, WSEAS, Faro, Portugal.
Xavier, A., Martins, M. B., Fragoso, R. (2011). A minimum cross entropy model to generate disaggregated data at the local level. 122nd EAAE Seminar "Evidence-based agricultural and rural policy making: Methodological and empirical challenges of policy evaluation", Ancona, 17-18 February, 2011.