2015 12th International Conference on Fuzzy Systems and Knowledge Discovery (FSKD'15)
A Distributed Inverse Distance Weighted Interpolation Algorithm Based on the Cloud Computing Platform of Hadoop and Its Implementation

Zhong Xu, Jihong Guan
School of Electronics and Information, Tongji University, Shanghai, China

Jiaogen Zhou*
Institute of Subtropical Agriculture, Chinese Academy of Sciences, Changsha, Hunan Province, China
Abstract—The centralized inverse distance weighted (IDW) interpolation method is simple and widely used, but it can hardly meet the requirements of massive data processing. The Hadoop cloud computing technology offers simple application portability, high system reliability and dynamic load balancing across nodes, so extending the centralized IDW to a distributed version based on Hadoop is an effective way to handle massive data. This paper presents a distributed IDW algorithm under the MapReduce framework of Hadoop. The core ideas of the algorithm are: (1) the data set to be interpolated is divided into multiple subsets, and each Map task runs the serial IDW interpolation algorithm on one subset; (2) the Reduce task merges the interpolation results of all Map tasks and outputs the final result. Experimental results show that the distributed IDW algorithm has good acceleration performance for large-scale data sets and significantly improves the computational efficiency of spatial interpolation.

Keywords—Hadoop; cloud computing; spatial interpolation; IDW
I. INTRODUCTION

A geospatial variable is a quantitative or descriptive measure of a geographic feature. Accurate and precise maps of geographical variables provide effective data support for spatial planning, decision-making and land degradation assessment [1]. In practice, geospatial variable information can be obtained by sampling analysis or by remote sensing. The number of samples is limited by the available manpower, and although remote sensing is an effective means of obtaining geospatial variable information, some specific geographical elements are either inaccessible to it or cannot be measured with sufficient accuracy. To some extent, the prediction of specific geographical elements at medium and large scales is therefore still a problem of small-sample statistical inference. Many methods have been developed to infer geospatial variables statistically. Spatial interpolation methods are often used in the geosciences, and spatial autocorrelation theory is their core theoretical basis [2]. This theory has given rise to many interpolation methods, such as inverse distance weighting (IDW) [3], polynomial interpolation, ordinary Kriging [4], Co-Kriging [5], Simple Kriging and geographically weighted regression [6].
*Corresponding author: [email protected]
978-1-4673-7681-5 ©2015 IEEE
Compared with other interpolation methods, inverse distance weighting is simple in principle and easy to implement. It has been widely used in geology, soil science, geophysics, oceanography, meteorology, ecology and environmental studies [7-9]. As study areas grow and finer resolutions are demanded in geographic environmental variable mapping, centralized interpolation methods gradually run into a computational bottleneck. The main way to solve this problem is to use distributed computing, performing a calculation on multiple computers at the same time to reduce the running time of the program [10-13]. The Hadoop platform is a Java implementation of the MapReduce distributed computing model. Like traditional distributed computing technologies, the Hadoop platform automatically distributes a program to a cluster composed of multiple compute nodes and executes it in parallel. The purposes of this study are: (1) to extend the centralized IDW to a distributed version based on the Hadoop platform in order to deal with massive data processing needs; and (2) to evaluate the influence of data scale on the distributed IDW algorithm.
II. METHODS

A. Introduction of Hadoop

Hadoop is a distributed system infrastructure developed by the Apache Foundation. It manages a computer cluster for large-scale distributed computing across the cluster nodes. The core of its design is a Java implementation of Map/Reduce together with HDFS, a distributed file system modelled on the Google GFS. HDFS provides storage for massive data, while Map/Reduce provides the computation on it [14]. Map/Reduce is a programming paradigm that expresses a large-scale distributed computation as distributed operations on key/value pair data sets. A Map/Reduce computation is divided into two stages, Map and Reduce, and its input is a key/value pair data set. In the Map stage, the framework splits the input data set into segments according to the form specified by the user, and each segment is assigned
to a Map task; each Map task is executed by a node in the cluster. A Map task calls the user-defined map function and transforms each input key/value pair (K, V) into a different key/value pair (K', V'). After the Map phase, the framework sorts the intermediate key/value pairs (K', V') so that all values belonging to the same key appear together, producing a set of tuples (K', V'*). These tuples are then partitioned into fragments, the number of fragments being equal to the number of Reduce tasks. In the Reduce stage, the framework assigns each fragment of (K', V'*) tuples to a Reduce task, and each Reduce task is executed by a node in the cluster. Each Reduce task reads the tuple fragments (K', V'*) assigned to it, calls the user-defined reduce function, and transforms them into output (K, V) key/value pairs. Tasks in each stage of Map/Reduce are executed in a fault-tolerant way: if some nodes fail while running tasks, those tasks are redistributed to the remaining nodes. A large number of Map tasks and Reduce tasks contributes to load balance, and a failed task can be re-executed with little runtime overhead [16].
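To make the (K, V) → (K', V') → (K', V'*) flow concrete, the following is a minimal, generic MapReduce sketch (the classic word count) written against the org.apache.hadoop.mapreduce API. It is an illustrative example only, not code from the paper, and all class names are ours.

```java
// A minimal, generic MapReduce sketch (word count) illustrating the
// (K,V) -> (K',V') -> (K',V'*) flow described above; not the paper's code.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountSketch {
  // Map: (line offset, line text) -> (word, 1)
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer it = new StringTokenizer(value.toString());
      while (it.hasMoreTokens()) {
        word.set(it.nextToken());
        context.write(word, ONE);          // emit intermediate (K', V')
      }
    }
  }

  // Reduce: (word, [1,1,...]) -> (word, count); the framework groups values by key
  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) sum += v.get();
      context.write(key, new IntWritable(sum));
    }
  }
}
```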
B. IDW Algorithm

IDW is one of the earliest spatial interpolation algorithms; it is relatively simple and the most widely used [3]. Its core ideas are: 1) an unknown value is influenced more strongly by nearby observation points; 2) the weight of an observation point's contribution to the unknown point is inversely proportional to the distance between the two points. The IDW algorithm uses the k nearest observation points to estimate the unknown point, and the estimate is a weighted sum of these k observations. The calculation formulas are

\hat{z}(x_0) = \sum_{i=1}^{k} \lambda_i \, z(x_i)                     (1)

\lambda_i = \frac{1/d_{i0}^{\,p}}{\sum_{i=1}^{k} 1/d_{i0}^{\,p}}      (2)

where z(x_i) is the observed value at x_i, \lambda_i is the weight of the contribution of observation point x_i, d_{i0} is the distance between the observation point x_i and the unknown point x_0, and p is a power index (generally 2).

The implementation steps of the serial IDW algorithm are as follows:
1) Set the input parameter values for K and P, and create the training sample set and the valuation set.
2) Perform a KNN search in the training set for a point in the valuation set, then calculate and output its estimate according to (1) and (2).
3) Repeat step 2) until all valuation points have been processed.
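As a concrete illustration of Eqs. (1) and (2) and the serial steps above, here is a minimal serial IDW sketch in Java. The class and method names (SerialIdw, Sample, estimate) are ours, and the neighbour search is a deliberately simple brute-force sort rather than an optimized KNN structure.

```java
// A minimal serial IDW sketch implementing Eqs. (1) and (2) with a brute-force
// k-nearest-neighbour search; class and method names are illustrative only.
import java.util.Arrays;
import java.util.Comparator;

public class SerialIdw {
  /** A training sample: coordinates (x, y) and the observed value z(x_i). */
  public static class Sample {
    final double x, y, z;
    public Sample(double x, double y, double z) { this.x = x; this.y = y; this.z = z; }
  }

  /** Estimate z at (x0, y0) from the k nearest training samples with power index p. */
  public static double estimate(Sample[] training, double x0, double y0, int k, double p) {
    // Sort samples by distance to the valuation point (brute force; fine for a sketch).
    Sample[] sorted = training.clone();
    Arrays.sort(sorted, Comparator.comparingDouble((Sample s) -> dist(s, x0, y0)));

    double weightedSum = 0.0, weightTotal = 0.0;
    for (int i = 0; i < Math.min(k, sorted.length); i++) {
      double d = dist(sorted[i], x0, y0);
      if (d == 0.0) return sorted[i].z;          // valuation point coincides with a sample
      double w = 1.0 / Math.pow(d, p);           // Eq. (2): weight proportional to 1 / d^p
      weightedSum += w * sorted[i].z;            // numerator of Eq. (1)
      weightTotal += w;                          // denominator of Eq. (2)
    }
    return weightedSum / weightTotal;            // Eq. (1): weighted mean of k observations
  }

  private static double dist(Sample s, double x0, double y0) {
    double dx = s.x - x0, dy = s.y - y0;
    return Math.sqrt(dx * dx + dy * dy);
  }
}
```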
C. Distributed IDW Algorithm Based on the MapReduce Framework

In the IDW algorithm, the estimate at a valuation point depends only on the training set and is independent of the other valuation points. The estimation task can therefore be split into numerous subtasks, each of which works like the serial IDW algorithm: it first finds the k nearest neighbours of a valuation point in the training set and then estimates the value according to (1) and (2).

Section A showed that in the Map stage the MapReduce framework splits the computing task into a number of Map tasks and automatically assigns them to the nodes; in the Reduce stage it feeds the outcome of the Map tasks into the Reduce tasks, which finally output the results. Intuitively, the distributed IDW algorithm could be implemented on the MapReduce framework by assigning each valuation point to its own Map task, letting each Map task perform the serial IDW algorithm, and merging the results of all Map tasks in the Reduce phase as the final output. This naive scheme, however, has an obvious defect: no matter how few valuation points a Map task handles, it has to read the whole training sample set during execution and transfer its result to the node running the Reduce task, and the time spent on disk reads/writes and network transmission is considerable [17]. Moreover, starting and ending Map tasks also incurs additional system overhead. If, on the other hand, all valuation points were computed by a single Map task, the overhead of disk access and network transmission would in theory be minimal, but the algorithm would degenerate into the serial IDW, because all valuation points would be calculated by only one node and the parallelism of the cluster could not be exploited. Balancing these two extremes, the specific steps of the distributed IDW algorithm on the MapReduce framework are as follows:

1) Set the parameters K, P and linespermap (LPM).
2) Register the file containing the training sample set as a distributed cache file, so that it is distributed to every node.
3) Divide the valuation set into subsets of linespermap lines each, and submit each subset to a Map task for processing.
4) As in step 2) of the serial IDW, each Map task performs a KNN search in the training set for every point in its subset, then calculates and outputs the estimate according to (1) and (2), repeating until the whole subset has been processed.
5) The results of all Map tasks are sent to a Reduce task, in which they are merged into the final result.

Step 3 is implemented in the class MyRecordReader, which inherits from RecordReader. Step 4 is implemented in the class MapClass, which implements the Mapper interface. In the (K, V) key/value pairs received by MapClass, K is the line number of the packet in the original data and V is a packet produced by MyRecordReader. Each valuation point in the packet is calculated and output as the V' of the output key/value pair, while K is output directly as its key. Step 5 is implemented in the class ReduceClass, which implements the Reducer interface; it simply receives the values V and merges them according to the order of the corresponding keys K. A hedged code sketch of this structure is given below.
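The following sketch approximates the structure just described, written against the org.apache.hadoop.mapreduce API (Hadoop 2.x style, whereas the paper used Hadoop 1.2.1). It is not the paper's code: instead of the custom MyRecordReader it uses Hadoop's built-in NLineInputFormat to obtain linespermap-line splits, each map() call handles a single valuation line, the training set is read from HDFS rather than the distributed cache, the configuration keys (idw.training.path, idw.k, idw.p) are assumed names, and it reuses the SerialIdw.estimate sketch shown earlier.

```java
// A hedged sketch of the distributed IDW job described above; all names are ours.
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class DistributedIdwSketch {

  public static class IdwMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
    private SerialIdw.Sample[] training;
    private int k;
    private double p;

    @Override
    protected void setup(Context context) throws IOException {
      Configuration conf = context.getConfiguration();
      k = conf.getInt("idw.k", 10);
      p = conf.getFloat("idw.p", 2.0f);
      // Load the (small) training sample set once per map task; assumed format: "x y z".
      List<SerialIdw.Sample> samples = new ArrayList<>();
      Path path = new Path(conf.get("idw.training.path"));
      try (BufferedReader in = new BufferedReader(
          new InputStreamReader(FileSystem.get(conf).open(path)))) {
        String line;
        while ((line = in.readLine()) != null) {
          String[] f = line.trim().split("\\s+");
          samples.add(new SerialIdw.Sample(
              Double.parseDouble(f[0]), Double.parseDouble(f[1]), Double.parseDouble(f[2])));
        }
      }
      training = samples.toArray(new SerialIdw.Sample[0]);
    }

    @Override
    protected void map(LongWritable offset, Text value, Context context)
        throws IOException, InterruptedException {
      String[] f = value.toString().trim().split("\\s+");   // valuation point: "x y"
      double x0 = Double.parseDouble(f[0]);
      double y0 = Double.parseDouble(f[1]);
      double z = SerialIdw.estimate(training, x0, y0, k, p);
      // Key = position of the valuation line, so the reduce phase can restore order.
      context.write(offset, new Text(x0 + " " + y0 + " " + z));
    }
  }

  public static class IdwReducer extends Reducer<LongWritable, Text, LongWritable, Text> {
    @Override
    protected void reduce(LongWritable key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
      // Each key carries exactly one estimate; pass it through in key order.
      for (Text v : values) context.write(key, v);
    }
  }

  // Driver (sketch): steps 1)-3) -- set K, P and linespermap, point the job at the
  // training file, and let NLineInputFormat split the valuation set into packets.
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.setInt("idw.k", 10);
    conf.setFloat("idw.p", 2.0f);
    conf.set("idw.training.path", args[0]);

    Job job = Job.getInstance(conf, "distributed-idw-sketch");
    job.setJarByClass(DistributedIdwSketch.class);
    job.setInputFormatClass(NLineInputFormat.class);
    NLineInputFormat.setNumLinesPerSplit(job, 1000000);     // linespermap
    job.setMapperClass(IdwMapper.class);
    job.setReducerClass(IdwReducer.class);
    job.setOutputKeyClass(LongWritable.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[1]));   // valuation set
    FileOutputFormat.setOutputPath(job, new Path(args[2])); // result directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

With NLineInputFormat, the LongWritable key is the byte offset of each valuation line, which plays the same role as the packet line number in the paper's design: sorting on keys before the Reduce phase restores the original order of the valuation points.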
To improve the overall efficiency of the system, a reasonable value of linespermap, the number of valuation points assigned to each Map task, must be found. It can be determined experimentally by comparing the time consumed under different linespermap values.

III. EXPERIMENTAL RESULTS AND DISCUSSION

A. Experimental Platform and Data

The cloud computing platform of this experiment is a Hadoop cluster consisting of 22 PCs with the same configuration. Each PC is equipped with an Intel dual-core processor (2.93 GHz) and 3 GB of memory, and runs the Linux 2.6.35-28-generic operating system, the Java virtual machine (Java 1.6.0_20) and the Hadoop 1.2.1 software package. The PCs are connected by a Gigabit LAN, and all code is written in Java. The experiments evaluate the influence of the number of unmeasured locations assigned to each Map task on the performance of the algorithm. The data sets used are three groups of soil data of significantly different sizes: 8 MB, 100 MB and 1 GB. The 8 MB data set is real soil observation data, including 1023 training samples (each sample contains the geographical coordinates and the organic carbon content, 3 fields in total) and 160000 valuation points (each point contains only the geographical coordinates, 2 fields in total). The soil observations were collected in the Jinjing River basin, Changsha City, Hunan Province.

Because large-scale training samples are very difficult to obtain in practice, and the size of the valuation set is the key factor influencing the computational efficiency, the training sample set of the above soil data was kept unchanged while the valuation set was copied multiple times to generate two data sets of 100 MB and 1 GB, respectively. They were used to test the computational efficiency of the distributed IDW algorithm on medium and large data sets.

B. Performance Analysis

In the MapReduce programming framework, putting more data into each Map task leads to fewer Map tasks, so some nodes are assigned no task and the efficiency drops; putting less data into each Map task leads to more Map tasks, the communication overhead between nodes increases, and the efficiency drops as well. Therefore, different values were assigned to the parameter linespermap in order to test the performance of the proposed algorithm.

Since the distributed system is a generalized parallel processing system, efficiency, the standard evaluation index of parallel systems, is introduced to evaluate the performance of the distributed IDW algorithm. Efficiency equals the speedup divided by the total number of processes in the cluster [17], and should be as close to 1 as possible. Speedup is the ratio of the time taken by the same task on a single-processor system to the time taken on the parallel processing system:

\mathrm{Speedup} = T_s / T_p      (3)

where T_s is the time consumed on the single-processor system and T_p is the time consumed on the distributed system with P nodes.
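As a hedged illustration of how these two metrics are computed, take the serial and distributed timings reported in Section IV (2210 s serial versus roughly 250 s distributed on the 1 GB data set) and assume, purely for the sake of the example, that 20 worker nodes were in use (the paper only says "enough nodes"):

```latex
% Illustrative calculation only; the 20-node assumption is ours, not the paper's.
\[
  \mathrm{Speedup} = \frac{T_s}{T_p} = \frac{2210\,\mathrm{s}}{250\,\mathrm{s}} \approx 8.8,
  \qquad
  \mathrm{Efficiency} = \frac{\mathrm{Speedup}}{P} = \frac{8.8}{20} \approx 0.44 .
\]
```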
The experimental results on the small data set (8 MB) are shown in Fig. 1 and Fig. 2. The time consumed by the distributed IDW algorithm generally ranged from 20 to 100 seconds. Its efficiency is relatively low, ranging from 0.01 to 0.25, and decreases as the number of nodes increases; for a fixed number of nodes, the efficiency increases with the value of linespermap.

The experimental results on the medium data set (100 MB) are shown in Fig. 3 and Fig. 4. The time consumed by the distributed IDW algorithm commonly ranged from 41 to 200 seconds. The response of the efficiency to changes in the number of nodes and in linespermap is similar to that on the small data set, but the efficiency values are clearly higher, indicating a significant improvement on the medium data set.

The experimental results on the 1 GB data set are shown in Fig. 5 and Fig. 6. The time consumed by the distributed IDW algorithm commonly ranged from 270 to 1000 seconds. Compared with the small and medium data sets, the efficiency on the large data set improves further, exceeding 0.8 when the number of nodes is small (and hence the network communication between nodes is low). The efficiency still decreases as nodes are added, but it no longer follows the trend of increasing with the value of linespermap.

Figure 1. Running times of the distributed IDW on the small dataset with different numbers of computing nodes
Figure 2. Efficiency of the distributed IDW on the small dataset with different numbers of computing nodes

Figure 3. Running times of the distributed IDW on the medium dataset with different numbers of computing nodes
Figure 4. Efficiency of the distributed IDW on the medium dataset with different numbers of computing nodes

Figure 5. Running times of the distributed IDW on the large dataset with different numbers of computing nodes

Figure 6. Efficiency of the distributed IDW on the large dataset with different numbers of computing nodes

The efficiency of the algorithm improves significantly as the data scale increases. Regardless of the size of the data set, the switching of Map tasks in the Hadoop framework consumes additional time (network communication and the creation and teardown of Map tasks), and this overhead does not change much when the amount of data decreases. When the amount of data is small, the additional overhead accounts for most of the total time; when the amount of data is large enough, the time spent assigning and transmitting tasks is only a small fraction of the total, and the efficiency of the algorithm improves markedly.
In the experiment on the large data set, the time consumed by the distributed IDW algorithm decreases as the number of nodes increases when linespermap is 1000000 or 2000000, but it starts to increase once the number of nodes exceeds 10. The initial decrease occurs because, as nodes are added, fewer tasks are assigned to each node and the computation time per node drops. Beyond a certain number of nodes, however, the network traffic in the LAN grows significantly and becomes a new bottleneck, so the consumed time increases.
IV. CONCLUSION AND PROSPECT

This paper proposed a distributed IDW algorithm based on Hadoop. In the experiment on the 1 GB data set, the distributed IDW algorithm took less than 250 seconds when run with a proper value of linespermap and enough nodes, whereas the serial IDW algorithm took 2210 seconds under the same conditions. The distributed IDW algorithm therefore runs nearly ten times faster than the serial IDW and shows good acceleration performance on large amounts of data.

Producing high-resolution maps of geographical variables over large areas requires a large amount of computation. For example, to output a soil element distribution map with a resolution of 30 meters, a grid map of a city (Changsha) requires the computation of 13,000,000 grid cells, a grid map of a province (Hunan) requires 235,000,000 grid cells, and extending to the whole nation rapidly increases the amount of computation to 10,700,000,000 grid cells. The distributed IDW algorithm can be used to solve this problem; as shown in the experiments, more than nine tenths of the time consumed in the traditional way can be saved. However, the analysis in Section III shows that when the cluster grows beyond a certain size, the acceleration is no longer obvious and the overall efficiency begins to decrease. How to reduce the influence of network transmission on the performance of this algorithm requires further research.

ACKNOWLEDGMENT

This study is funded by the National Natural Science Foundation (No. 41201299).

REFERENCES
[1] T. Hengl, A Practical Guide to Geostatistical Mapping, pp. 5-9, 2009.
[2] W. R. Tobler, "A computer movie simulating urban growth in the Detroit region," Economic Geography, 46(2), pp. 234-240, 1970.
[3] P. A. Burrough and R. A. McDonnell, Principles of Geographical Information Systems. Oxford: Oxford University Press, pp. 333, 1998.
[4] N. A. C. Cressie, "The origins of kriging," Mathematical Geology, 22(3), pp. 239-252, 1990.
[5] D. J. J. Walvoort and J. J. de Gruijter, "Compositional kriging: a spatial interpolation method for compositional data," Mathematical Geology, 33(8), pp. 951-966, 2001.
[6] A. S. Fotheringham, M. E. Charlton, and C. Brunsdon, "Geographically weighted regression: a natural evolution of the expansion method for spatial data analysis," Environment and Planning A, 30, pp. 1905-1927, 1998.
[7] G. Papari and N. Petkov, "Reduced inverse distance weighting interpolation for painterly rendering," Proceedings of the 13th International Conference on Computer Analysis of Images and Patterns (CAIP 2009), North Rhine-Westphalia, Germany, Lecture Notes in Computer Science, vol. 5702, Springer, Berlin, pp. 509-516, 2009.
[8] C. L. Chang, S. L. Lo, and S. L. Yu, "The parameter optimization in the inverse distance method by genetic algorithm for estimating precipitation," Environmental Monitoring and Assessment, 117, pp. 145-155, 2006.
[9] W. W. Jason and D. C. Jeffrey, "Spatial characterization, resolution, and volumetric change of coastal dunes using airborne LIDAR, Cape Hatteras, North Carolina," Geomorphology, 48, pp. 269-287, 2002.
[10] C. W. Yang, W. W. Li, J. B. Xie, and B. Zhou, "Distributed geospatial information processing: sharing distributed geospatial resources to support Digital Earth," International Journal of Digital Earth, 1(3), pp. 259-278, 2008.
[11] M. P. Armstrong and R. J. Marciano, "Local interpolation using a distributed parallel supercomputer," International Journal of Geographical Information Systems, 10(6), pp. 713-729, 1996.
[12] X. F. Guan and H. Y. Wu, "Parallel optimization of IDW interpolation algorithm on multicore platform," Proceedings of Geoinformatics 2008 and Joint Conference on GIS and Built Environment: Advanced Spatial Data Models and Analyses, Guangzhou, China, 7146Y, pp. 1-9, June 28, 2008.
[13] F. Huang, Research on the Key Techniques and Prototype System of Parallel GIS Based on Cluster. Institute of Remote Sensing Applications, Chinese Academy of Sciences, Beijing, China, pp. 168, 2008.
[14] Apache Hadoop online documentation: http://wiki.apache.org/hadoop/#User_Documentation.
[15] J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters," OSDI '04: 6th Symposium on Operating Systems Design and Implementation, San Francisco, USA, 2004.
[16] T. White, Hadoop: The Definitive Guide. USA: O'Reilly Media, Inc., pp. 167-188, 2012.
[17] B. Brawer, Introduction to Parallel Programming. Academic Press Professional, Inc., San Diego, CA, USA, pp. 422, 1989.