Modeling Farmers’ Crop Choice Using Data Mining Approach: A Revisit
Kamol Ngamsomsuke1,* and Benchaphun Ekasingh1
1 Department of Agricultural Economics, Faculty of Agriculture, Chiang Mai University, Thailand
* Corresponding author:
[email protected]
Abstract
Ekasingh et al. (2005) introduced a data mining approach to study farmers’ crop choice in the watershed areas of Mae Rim, Mae Kuang and Mae Ping Part II, Thailand. The authors found a set of socioeconomic conditions determining the farmers’ crop choice decision. This study used the same approach to model farmers’ crop choice in two other watershed areas in Thailand, i.e., the Chan and San watersheds, to test the applicability of the approach in different settings. The study yielded results very similar to those of the earlier study. Soil condition, whether expressed as a more specific concept (i.e., land unit: LU) or a more general one (i.e., soil series), growing season, cash investment and gross margin were found to play important roles in farmers’ decision to grow a crop. In addition, this study showed that land characteristics such as paddy, lowland, upland or slope area also determined farmers’ crop selection. The model’s predictive accuracy was 85-87 percent, lower than that of the previous study (96 percent). Nevertheless, these findings confirmed that the methodology used was appropriate and the results were robust. The resulting farmers’ crop choice model was one of the key inputs for integrated water resource management and its decision support system.
KEY WORDS: Data mining approach, crop choice, integrated water resource management
Introduction
Land, water and forest resources in northern and northeastern Thailand are becoming scarce and depleted. Many forest areas have been converted into agricultural land. Soil erosion and declining soil fertility have become major problems among farmers. More seriously, water resources have been substantially reduced in both the uplands and the lowlands. Conflicts between uplanders and lowlanders over the use of water resources have become severe in many areas of these regions. Declining water quality and sedimentation, resulting partly from declining forest cover, are also important issues for policy makers and farmers. The key issue underlying these problems is how people use and manage their land. Proper land utilization and management is the most suitable solution to the problems, and it requires an integrative approach and tools. The integrated water resource assessment and management project (IWRAM project) arose from these ideas and attempted to create an integrative decision support system (DSS) for such geographical areas. This involved the interaction of agronomic, hydrological, soil erosion and socio-economic models (Ekasingh et al., 2005; Jakeman et al., 2005). Besides studies on land use and its costs and benefits, the socio-economic model also dealt with farmers’ decision making on crop choice. Once a crop is chosen, the water yield derived from the hydrological model is fed into the crop model to predict crop yields and suggest better farm practices. At the same time, based on the chosen crop and water yield, soil erosion is calculated in the erosion model. Crop yields and the suggested practices are fed into the calculation of costs and benefits to derive financial returns to farmers. The water yield, erosion and financial returns are indicators of proper land and water management in the system. Figure 1 shows the concept of the project.
Unacceptable indicator values call for a change of the crop grown and for better farm practices in this integrative DSS. By changing these, the DSS would finally produce a geographical information system for water resources assessment and management which can be utilized by policy makers and farmers.

Ngamsomsuke and Ekasingh: ASIMMOD2007, Chiang Mai, Thailand

Figure 1: Concept of the IWRAM project

During the second phase of IWRAM, Ekasingh et al. (2005) applied a data mining technique to derive farmers’ crop choice decision trees in three watersheds of northern Thailand. In the third phase of IWRAM, this paper again describes a model of farmers’ crop choice, this time in two other watersheds: one in northern Thailand (Chan watershed) and one in northeastern Thailand (San watershed). The same approach has been used in order to find similarities and differences. Such findings would be very useful for applying the decision support system for integrated water resource assessment and management (IWRAM-DSS) model in a wider scope.
Objective of the study
This paper focuses on work undertaken by the socioeconomic team of IWRAM III in developing a model of farmers’ crop choice decision making processes in the two study catchments. Its objective is to model farmers’ decision making on crop choice in the watersheds under study.
The study sites
Two watersheds belonging to the Mekong (Mae Kong) river basin were chosen for the study: Chan watershed in the north and San watershed in the northeast. The Chan watershed is in Chiang Rai province of northern Thailand. It covers parts of Mae Fahluang, Mae Chan and Chiang Saen districts. Its western part, in Mae Fahluang district, consists of undulating mountainous areas. The elevation of the area ranges from 360 to 1,180 meters, with an average of 770 meters above mean sea level (msl). To the east, the area from Mae Chan to Chiang Saen is characterized by upland and flat areas along the Mae Chan, Mae Khum, Mae Khum Luang and Mae Sa-long streams. These streams join and run into the Mekong river in Chiang Saen district. Farmers in this watershed grow several types of cash crops. Rice, maize, soybean and tobacco are the major crops in the lowland, while maize and fruit trees (longan and lychee) are grown in the upland areas. Tea, coffee, maize and upland rice are common among highland farmers.
The San watershed, on the other hand, is in Loei province of northeastern Thailand. It covers parts of Dan Sai and Phu Roer districts. Only a small area of the watershed is in Thali and Muang districts. The whole watershed is generally characterized by undulating mountainous areas, with some small flat areas located between steeply sloping mountains. The important streams in the area include San, San Tom, San Noi, Huay Kaen Yanang, Huay Khao-man, Huay Cha-nang, Huay Pong, Huay Hmee and Huay Ma-nao. These streams join in the east and also run into the Mekong river via the Hueang river. The elevation of the area ranges from 320 to 1,365 meters above mean sea level (msl), with an average of 842.5 meters. Only a few crops are grown in the area. Most people grow maize. Some farmers also grow rice, ginger, cassava, lychee, tangerine and sweet tamarind.
The approach to land classification used in this study focuses on soil series rather than the land unit adopted during Phase II. The two study areas have been classified into different soil series by the Department of Land Development. There are 11 soil series in Chan watershed, namely Alluvial (AL), Ban Chong (Bg), Chiang Rai (Cr), Chiang Saen (Ce), Hang Dong (Hd), Mae Sai (Ms), Nong Mot (Nm), Phimai (Pm), Tha Muang (Tm), Thad Panom (Tp) and Slope Complex (SC). There are 8 soil series in San watershed: Dan Sai (Ds), Hang Dong (Hd), Khao Yai (Ky), Loei (Lo), Lat Ya (Ly), Ngao (No), Ta Yang (Ty) and Slope Complex (SC). This biophysical classification of land has been adopted by the different researchers to simplify integration between the socioeconomic and biophysical components of the IWRAM III project.
Materials and methods
Data collection
The survey of households in these two watershed areas was conducted in 2005 as part of the socioeconomic component of the IWRAM III project. Farmers were surveyed by selecting farmer representatives in each soil series. A total of 507 households were included in the survey (5-8 households for each soil series): 340 from Chan watershed and 167 from San watershed. Global Positioning System (GPS) equipment together with detailed administrative maps was used to pinpoint the exact location of the soil series under survey. Since many crops are grown in each study area, the survey covered only the major crops of each watershed. Rice, upland rice, maize, soybean, pineapple, sweet corn, baby corn, tobacco, lychee, tangerine, tea and coffee were selected for Chan watershed, while maize, rice, upland rice, Job’s tears, ginger, lychee, tangerine and sweet tamarind were chosen for San watershed. In most cases, about five to twenty-three farmers in each selected soil series were selected for interviews to represent each major crop. However, because farmers growing ginger, baby corn, tobacco, pineapple, Job’s tears and tangerine were few and widely dispersed, the study team could interview only 1-3 such farmers in each soil series. Questions covered household characteristics, land tenure and utilization, crops grown in different seasons, problems of farming, plot characteristics, production cost and revenue, financial support, use and management of irrigation systems and environmental problems (Table 1).
Data mining
The data collected represent comprehensive information on crop activities and on household and plot characteristics, suitable for classifying the crop choice decision-making behavior of farmers in the study area. Data mining techniques were then applied to these data to derive a set of decision rules describing wet and dry season cropping decisions in terms of household and plot attributes. Data mining procedures and techniques are well discussed, for example, in Ekasingh et al. (2005) and Whitten and Frank (1991). Readers interested in the details of the techniques are referred to that literature. The discussion here focuses on practical procedures as applied in standard computer software. This study makes use of the software named “Waikato Environment for Knowledge Analysis Version 3.5.3” (WEKA 3.5.3), developed by Waikato University, New Zealand.

Table 1: Survey Information Collected for the Study
Part | Requested information
Part 1 | General: household characteristics: farm and household size
Part 2 | Land type, tenure and land utilization, crop year 2005, plot characteristics
Part 3 | Production costs for annual crops and perennial crops including fertilizers, materials, machinery and labor use; output, product sold and income for annual or perennial crops
Part 4 | Income from other sources and capital availability
Part 5 | Environmental problems
Part 6 | Past use of land, competition of annual crops, farmers’ attitude
Part 7 | Use and management of irrigation water
Part 8 | Description of farmers’ crop choice decision making
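WEKA commonly reads its input from a plain-text ARFF file, in which each attribute is declared as nominal or numeric before the data rows. The paper does not describe how its files were prepared, so the following is only a minimal sketch of how survey records might be written in that format. The function name `to_arff`, the relation name and the two example plots are hypothetical, and the shortened nominal labels (e.g. `paddy`) are an assumption made to avoid values containing spaces, which would need quoting.

```python
def to_arff(relation, attributes, rows):
    """Render records as ARFF text (illustrative helper, not part of WEKA)."""
    lines = [f"@relation {relation}", ""]
    for name, kind in attributes:
        if isinstance(kind, (list, tuple)):
            # Nominal attribute: list the allowed values in braces.
            lines.append(f"@attribute {name} {{{','.join(kind)}}}")
        else:
            # Numeric (or other built-in) attribute type.
            lines.append(f"@attribute {name} {kind}")
    lines += ["", "@data"]
    for row in rows:
        lines.append(",".join(str(row[name]) for name, _ in attributes))
    return "\n".join(lines) + "\n"


# Hypothetical attribute subset and records, for illustration only.
ATTRIBUTES = [
    ("LANCHAR", ["paddy", "lowland", "upland", "slope"]),
    ("AREA_P", "numeric"),
    ("DROUGHT", ["yes", "no"]),
    ("CROP", ["rice", "maize"]),  # class variable, listed last
]
ROWS = [
    {"LANCHAR": "paddy", "AREA_P": 5.0, "DROUGHT": "no", "CROP": "rice"},
    {"LANCHAR": "upland", "AREA_P": 12.0, "DROUGHT": "no", "CROP": "maize"},
]

if __name__ == "__main__":
    print(to_arff("crop_choice", ATTRIBUTES, ROWS))
```

Placing the class variable last follows a common WEKA convention, although the class attribute can also be selected explicitly in the software.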
Ekasingh et al. (2005) state that one expression of a decision tree is a graphical tree in which each branch represents a decision rule and each leaf node represents the selected choice. This decision tree can be built from the survey data using an appropriate classification scheme (Whitten and Frank, 1991). According to Ekasingh et al. (2005), Whitten and Frank (1991), Buntine (1993) and Quinlan (1996), the most appropriate decision tree algorithm, or classification scheme, is C4.5 as popularized by Quinlan (1993). This algorithm is available as the J48 tree classifier in the WEKA software package (Whitten and Frank, 1991 and 2005; University of Waikato, 2003). A detailed discussion of this methodology is available in Ekasingh et al. (2005).
In order to perform data mining on the survey results, a data set must be prepared. This data set must consist of a class (decision) variable and other variables considered to be descriptors of the decision choice. In this study, the crop grown on the plot was the class variable. The variables from the survey considered by the data mining analysis as possible descriptors of crop choice included soil series, planted area, land characteristics, etc. (Table 2). In some cases these variables were grouped into discrete values to facilitate the analysis. A description of the variables and the groups used is given in Table 2. For ease of understanding, the variable names and labels used in this table are consistent with the labels used for variables in the final decision trees. Wet and dry season crop choices were analyzed separately using the C4.5 algorithm.
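C4.5 chooses the attribute to split on by comparing gain ratios, i.e. the information gain of a split divided by the entropy of the split itself. The following is a minimal sketch of that criterion only, not of J48 as a whole: it uses a multiway split on a nominal attribute (whereas binary splits were used in this study), it omits pruning and numeric thresholds, and the four example plots are hypothetical.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a list of nominal values."""
    total = len(values)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(values).values())


def gain_ratio(records, attribute, target="CROP"):
    """Information gain of splitting on `attribute`, divided by split info."""
    labels = [r[target] for r in records]
    groups = {}
    for r in records:
        groups.setdefault(r[attribute], []).append(r[target])
    total = len(records)
    # Expected class entropy after the split, weighted by branch size.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    split_info = entropy([r[attribute] for r in records])
    if split_info == 0:
        return 0.0  # attribute takes a single value: no usable split
    return (entropy(labels) - remainder) / split_info


# Hypothetical plots: land characteristic separates the crops, drought does not.
plots = [
    {"LANCHAR": "paddy", "DROUGHT": "no", "CROP": "rice"},
    {"LANCHAR": "paddy", "DROUGHT": "yes", "CROP": "rice"},
    {"LANCHAR": "upland", "DROUGHT": "no", "CROP": "maize"},
    {"LANCHAR": "upland", "DROUGHT": "yes", "CROP": "maize"},
]

if __name__ == "__main__":
    for name in ("LANCHAR", "DROUGHT"):
        print(name, gain_ratio(plots, name))
```

In this toy data set LANCHAR would be preferred as the first split because it separates the two crops completely, while DROUGHT carries no information about the crop.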
Results and discussions
The number of cases representing each crop differed in the original data set (footnote 1) used in this study. In order to perform the classification effectively, the data set was re-sampled so as to balance the number of cases per crop. The training data set (the re-sampled one) was loaded into WEKA version 3.5.3.
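The paper does not state which re-sampling procedure was used to balance the crops. As one possible illustration, the sketch below oversamples every crop with replacement up to the size of the largest crop class. The function name `balance_by_class`, the fixed seed and the example records are assumptions made for this sketch.

```python
import random
from collections import Counter, defaultdict


def balance_by_class(records, target="CROP", seed=0):
    """Oversample each class with replacement up to the largest class size."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    by_class = defaultdict(list)
    for record in records:
        by_class[record[target]].append(record)
    size = max(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(group)  # keep every original case
        # Draw the missing cases at random from the same class.
        balanced.extend(rng.choices(group, k=size - len(group)))
    rng.shuffle(balanced)
    return balanced


# Hypothetical unbalanced sample: three maize plots, one rice, one ginger.
sample = [
    {"CROP": "maize"}, {"CROP": "maize"}, {"CROP": "maize"},
    {"CROP": "rice"}, {"CROP": "ginger"},
]

if __name__ == "__main__":
    print(Counter(r["CROP"] for r in balance_by_class(sample)))
```

Because oversampling repeats cases from the rarer crops, a tree tested on the same data can look more accurate than it would be on new farms, which is worth keeping in mind when reading classification rates.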
Table 2: Variables used in Data Mining
Variable name | Description | Values used in the analysis
SOIL | Soil series as defined by the Department of Land Development (DLD) | Values as defined by DLD
AREA_P | Area planted | Numeric value in rai (6.25 rai = 1 ha)
LANCHAR | Land characteristics, defined as paddy field, lowland, upland and slope area | Nominal values: paddy field, lowland, upland, and slope area
TENURE | Land tenure status of the farmer | Nominal values: own, lease, and both own and lease
WATUSE | Source of water available and used for crop cultivation. Note that land unit may be correlated with this variable | Nominal values: irrigation, rain, and both irrigation and rain
FLOOD | Whether the plot has a flood problem | Nominal values: yes and no
DROUGHT | Whether the plot has a drought problem | Nominal values: yes and no
TVCACGR | Redefined from the actual cash cost of production (TVCAC). This variable indicates the level of cash investment farmers want to make in a particular crop | Nominal values: 2000 if TVCAC ≤ 2,000 Baht; 4000 if TVCAC = 2,001 – 4,000 Baht; 6000 if TVCAC = 4,001 – 6,000 Baht; 8000 if TVCAC = 6,001 – 8,000 Baht; 10000 if TVCAC = 8,001 – 10,000 Baht; 15000 if TVCAC = 10,001 – 15,000 Baht; and 20000 if TVCAC > 15,000 Baht
TCGR | Redefined from the total cost of production (TC). This variable indicates the level of investment that farmers want to make in a particular crop | Nominal values: 2000 if TC ≤ 2,000 Baht; 4000 if TC = 2,001 – 4,000 Baht; 6000 if TC = 4,001 – 6,000 Baht; 8000 if TC = 6,001 – 8,000 Baht; 10000 if TC = 8,001 – 10,000 Baht; 15000 if TC = 10,001 – 15,000 Baht; and 20000 if TC > 15,000 Baht
GRMGR | Calculated from the gross margin level (GRM). Profit aspiration is divided into 8 groups. Certainly a farmer wants more profit rather than less, but usually more profit means more risk, skills and management. One can think of this as a variable indicating risk and skill levels. Level one of GRMGR is low risk, low returns and requires easy skills. Higher levels of GRMGR mean higher risk, higher returns and higher skill requirements | Nominal values: 0 if GRM < 0 Baht; 3000 if GRM = 0 – 3,000 Baht; 6000 if GRM = 3,001 – 6,000 Baht; 9000 if GRM = 6,001 – 9,000 Baht; 12000 if GRM = 9,001 – 12,000 Baht; and 15000 if GRM > 15,000 Baht. Every category is expressed per rai
NPGR | Redefined from the net profit (NP). Certainly a farmer wants more profit rather than less, but usually more profit means more risk, skills and management. One can think of this as a variable indicating risk and skill levels. Level one of NPGR is low risk, low returns and requires easy skills. Higher levels of NPGR mean higher risk, higher returns and higher skill requirements | Nominal values: 2000 if NP ≤ 2,000 Baht; 4000 if NP = 2,001 – 4,000 Baht; 6000 if NP = 4,001 – 6,000 Baht; 8000 if NP = 6,001 – 8,000 Baht; 10000 if NP = 8,001 – 10,000 Baht; 15000 if NP = 10,001 – 15,000 Baht; and 20000 if NP > 15,000 Baht
HHMEM | Number of household members | Numeric values (persons)
HHAL | Number of units of household labor | Numeric values (persons)
FARMSIZE | Total area of the household farm | Numeric values (rai)
LLRTIO | Farm size divided by the units of household labor. Low values indicate land scarcity in relation to labor. High values indicate relative land abundance in relation to labor | Numeric values (rai/person)
LIVOTR | Whether the farmer has income from livestock | Nominal values: yes and no
ONFTR | Whether the farmer has income from off-farm employment | Nominal values: yes and no
OFFTR | Whether the farmer has income from non-farm employment | Nominal values: yes and no
CAPRAIGR | Calculated from the owned capital availability adjusted for farm size (CAPRAI) | Nominal values: 5000 if CAPRAI ≤ 5,000 Baht; 10000 if CAPRAI = 5,001 – 10,000 Baht; and 20000 if CAPRAI > 10,000 Baht
P_LOAN | Proportion of external loans in the farmer’s total investment in crop production | Numeric values (percent)
ALTCROP | Whether the farmer considers that an alternative crop is available | Nominal values: yes, no and do not know
WATOMEM | Farmer’s membership status in a water users’ association | Nominal values: yes and no
CROP | The class (decision) variable: name of the crop grown on the plot | Nominal values: glutinous rice, non-glutinous rice, upland rice, maize, sweet corn, baby corn, Job’s tears, soybean, ginger, tobacco, pineapple, tea, coffee, lychee, orange and sweet tamarind

Footnote 1: Originally, there were 794 cases in the data set. They were separated into 657 cases for the wet season and 137 cases for the dry season.
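The grouping of continuous cost variables into the nominal classes of Table 2 can be sketched as a simple lookup. The function names below are hypothetical, and the treatment of values that fall between the integer bounds listed in the table (e.g. 2,000.50 Baht) is an assumption: each class is taken to include everything above the previous upper bound.

```python
import bisect

# Upper bounds (Baht) of the cost classes listed for TVCACGR and TCGR in Table 2.
COST_UPPER_BOUNDS = [2000, 4000, 6000, 8000, 10000, 15000]
# Class labels; the last one (20000) covers every value above 15,000 Baht.
COST_LABELS = [2000, 4000, 6000, 8000, 10000, 15000, 20000]


def cost_group(value_baht):
    """Map a cost in Baht to its Table 2 class label (TVCACGR or TCGR)."""
    index = bisect.bisect_left(COST_UPPER_BOUNDS, value_baht)
    return COST_LABELS[index]


def land_labor_ratio(farm_size_rai, household_labor):
    """LLRTIO: farm size (rai) divided by units of household labor."""
    return farm_size_rai / household_labor


if __name__ == "__main__":
    for cost in (1500, 2000, 2001, 15000, 15001):
        print(cost, "->", cost_group(cost))
```

Using `bisect_left` makes each upper bound inclusive, which matches the "≤ 2,000" and "2,001 – 4,000" style of the class definitions.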
The J48 classifier was chosen for the analysis. The binary splits, reduced error pruning and sub-tree pruning options were set to “true” for all analyses. Optionally, the minimum number of objects to be classified into a particular leaf may be reset to any number, depending on how far the tree size is to be reduced. The larger this minimum number of objects, the smaller the tree. In this analysis, the test option was set to the original data set in order to assess the predictive ability of the constructed model with confidence. Wet and dry season crop choices were analyzed separately, using the above-mentioned settings on the data set described in Table 2. With the minimum number of objects set at 2, the crop choice models classified crop choice correctly in the original data set at a rate of 87% for the wet season and almost 85% for the dry season. However, the wet season decision tree was very large, with 112 leaves compared to only 23 leaves for the dry season (the corresponding size of the tree is given by 2 × leaves − 1). To avoid a very big and complicated decision tree, we tried to improve the trees by increasing the minimum number of objects in order to remove the small leaves. After several attempts, the more appropriate minimum numbers of objects were 35 for the wet season and 5 for the dry season. These improved trees had 29 and 21 leaves respectively. At the same time, the correct classification rates were 53% and 75% for the wet and dry seasons respectively (again using the original data set as the test set). Even though the correct classification rate for the wet season was lower, the gain from the smaller tree was judged to outweigh the loss in the correct classification rate. We could not find a significant improvement for the dry season crop choice model. This is simply because the first try already yielded a small tree, and the increase in the minimum number of objects was small (only 5, compared to 35 for the wet season). We were therefore satisfied with the result from the first try for the dry season crop choice decision. Figures 2 and 3 present the most appropriate farmers’ crop choice decision trees in this study. Note that the notation “!=” stands for “not equal to”.
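The relation between leaves and tree size quoted above follows from the binary splits: a binary tree with L leaves has L − 1 internal nodes, hence 2L − 1 nodes in total. The short sketch below applies this to the leaf counts reported in the text; the function name and the dictionary layout are illustrative only.

```python
def tree_size(leaves):
    """Total number of nodes in a binary tree with the given number of leaves."""
    return 2 * leaves - 1


# Leaf counts reported in the text, keyed by (season, minimum number of objects).
REPORTED_LEAVES = {
    ("wet", 2): 112,
    ("dry", 2): 23,
    ("wet", 35): 29,
    ("dry", 5): 21,
}

if __name__ == "__main__":
    for (season, min_objects), leaves in REPORTED_LEAVES.items():
        print(f"{season} season, minimum objects {min_objects}: "
              f"{leaves} leaves, {tree_size(leaves)} nodes")
```

For the wet season this gives 223 nodes for the first tree against 57 for the reduced one, which shows how much simpler the reduced tree is to read.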
[Decision tree figure: only node labels survive; the tree layout is not recoverable. Split variables appearing in the figure: GRMGR, TVCACGR, TCGR, SOIL, LANCHAR, FARMSIZE, NPGR, P_LOAN, DROUGHT, WATERUSE, AREA_P, HHAL, CAPPAIGR. Leaves appearing in the figure: ginger, orange, tobacco, glutinous rice, Job’s tears, pineapple, sweet tamarind, sweet corn, coffee, non-glutinous rice, lychee.]