Generating thematic pedologic maps by using data mining and interpolations Elma Hot, Vesna Popović-Bugarin, member, IEEE, Ana Topalović, Mirko Knežević
Abstract—The problem of soil clustering and visualisation of the obtained results is analysed in this paper. K-means is adapted for soil data clustering. Clustering is done based on chemical parameters of soil. The soil database of Montenegro, which contains values of physical and chemical parameters of soil, is used. The clustered soil data are presented on a dynamic map and compared with the existing pedologic map made by human experts. In addition, by using inverse distance weighting spatial interpolation and the results of K-means clustering, different types of thematic pedologic maps were made and integrated into a WEB application.
Index Terms—soil, data mining, k-means, visualization, pedologic maps, R programming language, spatial interpolation
Elma Hot, Faculty of Electrical Engineering Podgorica, University of Montenegro, Džordža Vašingtona bb, 81000 Podgorica (e-mail: [email protected]).
Vesna Popović-Bugarin, Faculty of Electrical Engineering Podgorica, University of Montenegro, Džordža Vašingtona bb, 81000 Podgorica (e-mail: [email protected]).
Ana Topalović, Biotechnical Faculty Podgorica, University of Montenegro, Mihaila Lalića 15, 81000 Podgorica (e-mail: [email protected]).
Mirko Knežević, Biotechnical Faculty Podgorica, University of Montenegro, Mihaila Lalića 15, 81000 Podgorica (e-mail: [email protected]).
I. INTRODUCTION
Data mining is a set of techniques for analyzing large data sets in order to extract new knowledge. Data mining techniques have been used for industrial, commercial and scientific purposes. They include anomaly detection, association rule learning, clustering, classification, regression and summarization. Cluster analysis, or clustering, aims to form groups (clusters) of similar objects in big databases. This grouping is done without using known structures in the data. It is widely used in many fields, including machine learning, pattern recognition, image analysis, bioinformatics, psychology, biology, data compression and information retrieval.
In this paper, the K-means clustering method is presented and applied to soil clustering. K-means clustering aims to form a specified number of clusters from data samples based on the distance (similarity of the observed parameters) between samples. This makes the method well suited to distributing soil samples into a specified number of soil types. In our work, the algorithm of this clustering method has
been implemented in Java and adapted for soil clustering. The challenge has been to extract useful knowledge from a big soil database. This research aims to apply data mining techniques to a soil science database and to find a proper way of visualising the results. The outcome of the research may have many benefits in agriculture, especially in improving fertilization, soil management and environmental protection. Data mining techniques have never before been applied to the soil database of Montenegro.
The almost thirty-year-old soil database of Montenegro was analysed with K-means, and the obtained results were visualized and used for making pedologic maps. The database contains 22310 rows, with more than 200 attributes per row. It contains chemical and mechanical-physical parameters of soil horizons. Data samples of surface horizons with coordinates are used for the experiments. There are more than 6,000 such samples. In order to validate and understand the results of the K-means algorithm, the findings have been discussed with soil experts and compared to previous results of soil analysis. In addition, different types of soil maps were made: a map of the soil samples used for the experiments, thematic pedologic maps which present values of chemical parameters of soil, made by K-means and spatial interpolation, etc.
The paper is organized as follows. In Section II the soil database of Montenegro is presented; in Section III data mining and K-means are reviewed, while experimental results and maps are analysed in Section IV.
II. SOIL DATABASE IN MONTENEGRO
The systematic investigation of about 2000 soil profiles, their description and soil classification were performed during the period 1958-1988. The main result of this long process is a soil map at a scale of 1:50 000. Unfortunately, the original data and maps are available only in hard copy. An enormous effort was invested in the digitalization of the database.
After all data had been connected and digitalized, extensive checks of the entered data were made. Today, Montenegro has a useful multi-purpose soil database, available to the wide professional community and to land users. The digitalization of the soil database is the result of the work of researchers of the BIO-ICT Centre of Excellence.
The database contains information about chemical and mechanical-physical characteristics of soil. For visualization, only data samples with coordinates can be used. The database contains more than 6,000 such samples. The coordinates of the data samples in the database are given in meters in the coordinate reference system (CRS) MGI 1901 / Balkans zone 6. One step before visualization on maps was to convert these coordinates to longitude and latitude. Topsoil horizons, i.e. horizons at the surface with minimum depth equal to zero, were extracted from the database and used for visualization. The analysed database contains 3493 such data samples.
In this paper emphasis is given to the chemical parameters of soil, which are present in more than 95% of the data samples in the database. The basic clustering of topsoil horizons is done based on: active acidity – pH(H2O), potential acidity – pH(KCl), total carbonates – CaCO3, humus – major soil organic matter, available phosphorus – P2O5 and available potassium – K2O.
A. Soil analysis
Soil is a valuable non-renewable resource with broad diversity, which provides essential support to ecosystems, human life and society. Therefore, it is imperative to maintain soil functions and qualities in order to sustain ecosystems and human life. Soil analysis provides information about physical and chemical characteristics, which are important for the quality of soil suitability assessment and the reliability of soil use. For the purpose of plant production, determining the level of soil acidity, total and active carbonates, organic matter, the fraction of available forms of nutrients and soil texture makes it possible to create the conditions for optimum crop nutrition through rational application of fertilizer along with environmental protection. Connecting ICT and soil analysis allows the creation of a better integrated soil management system for both the private and public sectors.
III. DATA MINING
Data mining is a set of techniques that aims to discover knowledge in big data and allows focus on the most important information in the data. Data mining is a computer-assisted process which “digs” into databases to find hidden patterns and predictive information that experts may miss [1]. Nowadays, huge amounts of data are constantly collected.
Thus, there is a need to find useful knowledge in such databases. That is why data mining has attracted great attention in the scientific community. Data mining can be considered an evolution of information technology. Knowledge discovery is a complex process that consists of several steps (Fig. 1):
1. Data cleaning;
2. Data integration;
3. Data selection;
4. Data transformation;
5. Data mining;
6. Knowledge presentation.
The steps performed before data mining is applied constitute the process of preparing databases for data mining.
Fig. 1. Steps in the process of knowledge discovery
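The preparation steps listed above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the field names (`min_depth`, `x`, `y`, `ph_h2o`, and so on) are hypothetical, and the sketch only assumes that each database row is available as a dictionary. It selects topsoil horizons with coordinates, drops samples with missing chemical parameters, and transforms the rest into numeric feature vectors.

```python
# Hypothetical field names for the six chemical parameters used for clustering.
PARAMETERS = ("ph_h2o", "ph_kcl", "caco3", "humus", "p2o5", "k2o")


def prepare_samples(rows):
    """Return (coordinates, feature vector) pairs for clean topsoil samples."""
    prepared = []
    for row in rows:
        # Data selection: keep only topsoil horizons (minimum depth equal to zero).
        if row.get("min_depth") != 0:
            continue
        # Data selection: samples without coordinates cannot be placed on a map.
        if row.get("x") is None or row.get("y") is None:
            continue
        # Data cleaning: drop samples with a missing chemical parameter.
        if any(row.get(name) is None for name in PARAMETERS):
            continue
        # Data transformation: build a numeric feature vector for data mining.
        features = tuple(float(row[name]) for name in PARAMETERS)
        prepared.append(((row["x"], row["y"]), features))
    return prepared
```

Calling `prepare_samples` on a list of row dictionaries returns only the rows that pass all three checks, in their original order.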
Data cleaning removes data samples containing noise and those with missing data. The second step is the process of combining multiple data sources. Data selection and transformation allow focus on the relevant data and transform it into forms appropriate for mining [2].
Data mining includes: anomaly detection; association rule learning; clustering; classification; regression; summarization [3]. Anomaly detection is the identification of unusual data records. It can be used for early detection of plant diseases in agriculture, in beekeeping, etc. Association rule learning aims to find relationships between variables in data records. This technique is applied in market basket analysis, where supermarkets investigate customer purchasing habits. Clustering is the task of discovering groups of similar data samples without using known structures in the data. In this paper, results of soil clustering are presented. Classification aims to classify new data based on a known structure. Regression is used for making predictions based on available data records. For example, it can be used for sales predictions, or to predict the value of a house based on its location, number of rooms, lot size, etc. Summarization involves techniques for finding a compact description of a dataset. It can be used for visualization and report generation.
A. K-means clustering
K-means (KM) clustering, also known as "hard" or crisp clustering, is a widely used partitioning method. Clustering is used for discovering groups of similar data samples in a database, without using known structures in the data. Thus, this method aims to partition the data into a given number of mutually exclusive clusters, which means that a data sample can be a member of exactly one cluster. KM aims to form K clusters from n data samples, each defined by d parameters. Each of the K clusters is defined by one centroid, its central point.
The centroid is determined by a certain combination of the parameters contained in each data sample and is recalculated in each iteration. Initial centroids $c_1^1, c_2^1, \ldots, c_K^1$ are randomly selected from the data samples, $c_i \in \mathbb{R}^d$. For better illustration we may consider the data samples as points in d-dimensional space. Then, we may say that the ”distance” (similarity) among data samples is calculated by comparing each of the d parameters of each data sample to the parameters of each centroid. The next step is finding the nearest centroid to each data sample and declaring the data sample a member of that nearest cluster, $c_i$, $1 \le i \le K$ [1]. KM can thus be regarded as a method of vector quantization, since partitioning is done by calculating the squared Euclidean distance between data samples and centroids. New centroids for the next iteration are estimated as the mean of all data samples which are members of the corresponding cluster. This procedure is repeated until convergence has been reached or for a specified number of iterations. Convergence is reached when $|c_k^t - c_k^{t-1}| \le \varepsilon\,|c_k^t|$, where t is the iteration number and $\varepsilon$ is the sensitivity threshold. For the clustering presented in this paper the sensitivity threshold is 0.01.
The input data for KM are the database for clustering and the number of clusters. Before clustering, the maximum number of iterations and the sensitivity threshold should be defined. An example of basic KM clustering based on two parameters is illustrated in Fig. 2. Each color represents one cluster, while i is the number of iterations. The KM algorithm is adapted for soil clustering and implemented in Java. The results of KM clustering are used for creating a soil map of Montenegro.
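The procedure described above can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' Java implementation: the function and parameter names are hypothetical, the relative-change convergence test is one reading of the criterion with sensitivity threshold 0.01, and keeping the old centroid for an empty cluster is an added assumption that the text does not address.

```python
import math
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two d-dimensional points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def norm(v):
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in v))


def assign(samples, centroids):
    """Index of the nearest centroid for every sample."""
    return [min(range(len(centroids)),
                key=lambda i: squared_distance(s, centroids[i]))
            for s in samples]


def k_means(samples, k, max_iterations=100, threshold=0.01, seed=0):
    """Partition samples into k clusters; returns (centroids, labels)."""
    rng = random.Random(seed)
    # Initial centroids are randomly selected from the data samples.
    centroids = [tuple(s) for s in rng.sample(samples, k)]
    for _ in range(max_iterations):
        labels = assign(samples, centroids)
        new_centroids = []
        for i in range(k):
            members = [s for s, lab in zip(samples, labels) if lab == i]
            if members:
                # New centroid: mean of all members of the cluster.
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                # Assumption: an empty cluster keeps its previous centroid.
                new_centroids.append(centroids[i])
        # Converged when every centroid moved by at most threshold * |centroid|.
        converged = all(
            norm([a - b for a, b in zip(new, old)]) <= threshold * norm(new)
            for new, old in zip(new_centroids, centroids))
        centroids = new_centroids
        if converged:
            break
    return centroids, assign(samples, centroids)
```

For soil data, each sample would be the vector of the six chemical parameters of one topsoil horizon, and `k` the desired number of soil types. Because the parameters have different units and ranges, the result depends on how they are scaled; the sketch leaves the values unscaled.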
Fig. 2. Example of KM clustering based on two parameters; each color represents one cluster, i is the number of iterations
IV. EXPERIMENTAL RESULTS
Visualization of data from the soil database is implemented in the R programming language. Leaflet, Google Maps and Google Earth are also used. Maps of the values of different chemical parameters, such as pH(H2O), pH(KCl), CaCO3, humus, P2O5 and K2O, were made. In this paper, maps of the values of P2O5, humus, pH(H2O) and pH(KCl) are presented. Fig. 3 presents a map of the soil profiles contained in the analysed soil database. It can be seen that the database does not cover the whole territory of Montenegro. However, it contains information on soil samples from most of the territory.
Before plotting the maps, spatial interpolation of the existing data samples was performed to fill most of the territory of Montenegro. Interpolation is a method of estimating new data points within the range of a discrete set of known data points. Spatial interpolation was performed using inverse distance weighting. This kind of interpolation is available as the function idw in the package spatstat of the R programming language. Unknown points get values which are calculated as a weighted average of the values available at the known points. If the known values are v[1],...,v[n] at locations x[1],...,x[n] respectively, then each value v[i] contributes to the estimate at a location u with a weight that decreases with the inverse of the distance d(u,x[i]), where d(u,x[i]) is the Euclidean distance from u to x[i].
One of the analysed chemical parameters of soil is available phosphorus, expressed as P2O5. It is the fraction of phosphorus readily available for plant uptake. The map of its values is shown in Fig. 4. Colors represent values of P2O5, as shown by the color bar. The minimum value of this parameter in the database is zero and the maximum is 100 (concentration unit mg/100g). On the map, blue represents the minimum value and red the maximum. The average value of P2O5 in the database is 4.33, which is why blue is dominant in the map of P2O5.
Humus is a specific soil organic matter which significantly influences the bulk density of soil and contributes to moisture and nutrient retention. The map of humus values is shown in Fig. 5. The maximum value of this parameter in the database is 43% and the average value is 7.7%. Red represents the maximum and blue the minimum of humus. Visualisation can help to find extreme values in the database and to detect possibly wrong values. A few extreme values appear in the map of humus, but their number is insignificant.
Fig. 3. Map of soil samples from the soil database of Montenegro
Fig. 4. Map of Montenegro, colors represent values of P2O5
Fig. 5. Map of Montenegro, colors represent values of humus
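The inverse distance weighting used for the maps above can be sketched as follows. The paper relies on the idw function of the R package spatstat; the Python code below is only an illustration of the underlying weighted average, with hypothetical function names. The exponent `power=2` is an assumption, since the text does not state which power of the distance is used.

```python
import math


def idw_estimate(u, points, values, power=2.0):
    """Inverse-distance-weighted estimate at location u.

    points are the known locations x[1..n], values the known v[1..n].
    power is an assumed exponent; weights are 1 / d(u, x[i]) ** power.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for x, v in zip(points, values):
        d = math.dist(u, x)  # Euclidean distance from u to x[i]
        if d == 0.0:
            return v  # u coincides with a known sample: use its value
        w = 1.0 / d ** power
        weighted_sum += w * v
        weight_total += w
    return weighted_sum / weight_total


def idw_grid(points, values, xs, ys, power=2.0):
    """Estimates on a regular grid, one row per y, suitable for a raster map."""
    return [[idw_estimate((x, y), points, values, power) for x in xs]
            for y in ys]
```

A point halfway between two samples receives the mean of their values, while a point close to one sample is dominated by that sample, which is why isolated extreme values show up as small colored spots on the interpolated maps.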
Visualization of spatial objects in Google Earth is done with the plotKML package. plotKML is an R package for plotting points, lines, polygons and gridded objects on a dynamic map. The package allows data from databases to be plotted on a map, and the color and size of the markers to be adjusted. It can also be used to visualize field photographs, for example of trees, land cover and similar. Photos of soil profiles are important for soil analyses, and it is useful to add such photos to the other soil data on Google Earth maps. The bubbles shown in Fig. 6 and Fig. 7 present the value of P2O5 of the data samples. The size and color of each bubble are a function of the value of P2O5, and the value of the parameter is written beside every bubble. This way of visualisation is simple and suitable for presenting different kinds of data, such as data from sensors and sensor networks.
Fig. 6. Google Earth – Map of Montenegro, color and size of bubbles present value of P2O5 of topsoil horizons
Fig. 7. Google Earth – Map of Boka Bay, color and size of bubbles present value of P2O5 of topsoil horizons
Leaflet is an open-source JavaScript library for interactive maps. The corresponding R package makes it easy to integrate and control OpenStreetMap maps in R. Maps created directly from RStudio support interactive panning and zooming, map tiles, markers, polygons, lines and popups. Thus, spatial objects and data frames with longitude and latitude can be placed on a dynamic map.
The goal of this research is to make all results and all useful data from the existing soil database public and easy to understand. So far, a WEB application has been implemented with maps of the locations of data samples, maps of the values of chemical parameters of soil samples and maps which present the results of KM soil clustering. The user can thus get a lot of different soil data through a dynamic map and choose the type of map via radio buttons; legends are available for all maps (Fig. 8., Fig. 9., Fig. 10.). Zooming into the map allows the user to get the chosen soil data for a specific place. Fig. 8 presents the map of the value of pH(KCl). The color of the markers depends on the value of pH(KCl): red represents the minimum value and blue the maximum. Samples which do not have this parameter are marked as NA and their markers are gray. All maps in this WEB application have pop-ups, an easy way to see the exact value of the chosen parameter at a point.
Fig. 8. WEB application with dynamic soil maps, OpenStreetMap, color of markers depends on value of pH(KCl) of soil samples
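The coloring of markers by parameter value, with gray for samples marked NA, can be illustrated with a small Python sketch. It is not taken from the WEB application: the function name, the linear interpolation and the hex color format are assumptions, and the blue-to-red direction follows the description of the P2O5 map.

```python
def value_to_color(value, minimum, maximum):
    """Map a parameter value to a hex color between blue (min) and red (max).

    Missing values (None, shown as NA on the map) are drawn in gray.
    """
    if value is None:
        return "#808080"
    if maximum == minimum:
        t = 0.0  # degenerate range: treat every value as the minimum
    else:
        t = (value - minimum) / (maximum - minimum)
    t = min(1.0, max(0.0, t))  # clamp values outside the legend range
    red = round(255 * t)
    blue = round(255 * (1.0 - t))
    return f"#{red:02x}00{blue:02x}"
```

With a linear scale like this, a parameter whose average lies close to the minimum, such as P2O5, produces a map dominated by the minimum color; a legend based on quantiles would spread the colors more evenly.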
The second type of map in this WEB application is the raster map. After interpolation, raster maps are made for the value of pH(H2O) and for the results of KM clustering. Interpolation and raster maps make it possible to produce maps of greater coverage. Fig. 9 presents the map of the value of pH(H2O) of soil samples.
Fig. 9. WEB application with dynamic soil maps, OpenStreetMap, raster map, color depends on value of pH(H2O) of soil samples
KM clustering results were first illustrated and validated by visualising the results of clustering based on two and three parameters (Fig. 10). The second type of validation was made by visualizing the results on maps and comparing the resulting soil map with the soil map made by human experts.
Fig. 10. Example of KM clusters, clustering based on two parameters a) K=3 b) K=4
The results of the clustering algorithm, which discovers groups of similar data samples, were also analysed and presented on maps. Fig. 11 presents a basic soil map of Montenegro made using only the results of the KM clustering. Different colors on the map represent different types of soil, as estimated by the KM clustering. Colors change from blue to red, depending on the value of the KM results. As mentioned above, some parts of the map are empty so far, such as the north-western and southern parts of Montenegro and the area around the capital, because there are no data samples for these areas in the soil database. The soil database of Montenegro is still being developed, and digital soil maps like the one in Fig. 11 will be improved in the future.
Fig. 11. WEB application with dynamic soil maps, OpenStreetMap, raster map, color depends on value of soil type (KM results)
The experts’ soil map of Montenegro is used for comparison with the soil map made using only the results of the KM clustering (Fig. 12) [4]. On both maps two soil types are dominant. After overlapping both pedologic maps, it was found that significant similarity is achieved. Although the KM clustering is a basic soil clustering, because it is based on only six chemical parameters, the two dominant types are in the same areas on both maps. Future work is to improve the KM clustering and use more soil parameters, which will result in a more accurate pedologic map.
Fig. 12. Pedologic map of Montenegro [4]
V. CONCLUSION
The soil database of Montenegro, which contains almost 30 000 data samples, is analysed. The R programming language is used for visualization. Leaflet, Google Maps and Google Earth are also used for this purpose. The goal of this paper is to make a detailed pedologic map of Montenegro using data mining techniques for clustering and spatial interpolation, thus visually presenting important information in the database. Results of KM clustering based on six chemical parameters are presented. Our future work will be dedicated to improving soil clustering using mechanical-physical characteristics of soil and to making a detailed soil map of Montenegro. A further goal is to improve the WEB application and publish all the maps made. In this manner, through the process of digitalization and filtering of data, the use of data mining techniques for extracting knowledge and the visualization of the obtained results, data collected for 30 years will be properly presented.
ACKNOWLEDGMENT
This work has been supported by the Ministry of Science of Montenegro and the HERIC project through the BIO-ICT Centre of Excellence (Contract No. 011001). The authors are grateful to Dr Budimir Fustic, senior soil expert, for help in searching for relevant data and understanding it.
REFERENCES
[1] E. Hot, V. Popović-Bugarin, “Soil data clustering by using K-means and fuzzy K-means algorithm,” 23rd Telecommunications Forum TELFOR 2015, Belgrade, November 2016.
[2] D. Rajesh, “Application of Spatial Data Mining for Agriculture,” International Journal of Computer Applications (0975 – 8887), vol. 15, no. 2, February 2011.
[3] U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, “From Data Mining to Knowledge Discovery in Databases,” AI Magazine, vol. 17, no. 3, 1996.
[4] B. Fustic, G. Djuretic, “The Soils of Montenegro,” University of Montenegro and Biotechnical Institute, Podgorica, Montenegro, 2000.
[5] S. Ghosh, S. K. Dubey, “Comparative Analysis of K-Means and Fuzzy C-Means Algorithms,” (IJACSA) International Journal of Advanced Computer Science and Applications, vol. 4, no. 4, 2013.
[6] J. L. Armstrong, D. Diepeveen, R. Maddern, “The application of data mining techniques to characterize agricultural soil profiles,” Sixth Australasian Data Mining Conference, Gold Coast, Australia, 2007.
[7] G. Nasrin Fathima, R. Geetha, “Agriculture Crop Pattern Using Data Mining Techniques,” International Journal of Advanced Research in Computer Science and Software Engineering, vol. 4, no. 5, May 2014.
[8] H. Patel, D. Patel, “A Brief survey of Data Mining Techniques Applied to Agricultural Data,” International Journal of Computer Applications, vol. 95, no. 9, June 2014.
[9] J. Solanki, Y. Mulge, “Different Techniques Used in Data Mining in Agriculture,” International Journal of Advanced Research in Computer Science and Software Engineering, vol. 5, no. 5, May 2015.
[10] Andrew Ng, “CS229 Lecture notes,” Machine Learning Course Materials.
[11] J. Balkovič, Z. Rampasekova, V. Hutar, J. Sobocka, R. Skalsky, “Digital Soil Mapping from Conventional Field Soil Observations,” Soil & Water Res., vol. 8, no. 1, pp. 13–25, 2013.
[12] S. Har-Peled, B. Sadri, “How Fast is the k-means Method?,” January 2, 2010.
[13] L. G. Vendrusculo, A. L. Kaleita, “Terrain Analysis And Data Mining Techniques Applied To Location Of Classic Gully In A Watershed,” 2013 ASABE Annual International Meeting.
[14] “Vector Quantization and Clustering,” Courses of Electrical Engineering and Computer Science, Massachusetts Institute of Technology.
[15] L. Rokach, O. Maimon, “Clustering Methods,” The Data Mining and Knowledge Discovery Handbook, pp. 321–352, 2005.