A Parallel Finite Element Surface Fitting Algorithm for Data Mining

Peter Christen, Institut für Informatik, Universität Basel, CH-4056 Basel, Switzerland, christen@ifi.unibas.ch

Irfan Altas, School of Information Studies, Charles Sturt University, Wagga Wagga, NSW 2678, Australia, [email protected]

Markus Hegland and Stephen Roberts, Computer Science Laboratory, RSISE, Australian National University, Canberra, ACT 0200, Australia, [email protected]

Kevin Burrage and Roger Sidje, Department of Mathematics, University of Queensland, St. Lucia, QLD 4072, Australia, [email protected]

Abstract

A major task in Data Mining is to develop automatic techniques to process and detect patterns in very large data sets. Multivariate regression techniques are widely used in Data Mining applications. Thin plate splines have recently been used to model interaction terms in large data sets by reducing the high-dimensional nonparametric regression problem to the determination of a set of interdependent surfaces, that is, the estimation of functions of two variables. Obtaining standard thin plate splines requires the solution of a dense linear system of equations of order n, where n is the number of observations. Since the number of observations in Data Mining applications is usually in the millions, standard thin plate splines may not be practical. We have developed a finite element approximation of thin plate splines that can handle data sizes with millions of records. The resolution of the method can be chosen independently of the number of observations, and the observation data need only be read once from secondary storage and do not need to be stored in memory. In this paper we discuss the parallel implementation of this method in an MPI environment.

1 Introduction

In the last decade, there has been an explosive growth in the amount of data collected due to advances in data collection. The computerization of business transactions and the use of bar codes in commercial outlets have provided businesses with enormous amounts of data. Revealing patterns and relationships in a data set can improve the goals, missions and objectives of many organisations. For example, sales records can reveal highly profitable retail sales patterns. It is therefore important to develop automatic techniques to process and detect patterns in very large data sets [1]. This process is known as the "Knowledge Discovery from Databases" (KDD) process or simply "Data Mining" (DM). DM techniques are used to spot trends in data that may not be easily detectable by traditional database query tools, which rely on narrow human queries to produce results. DM tools reveal hidden relationships and patterns as well as uncover correlations that are not detected by human queries [2]. A typical example of data mining is insurance data with several hundred thousand policies of tens of variables each. Insurance companies analyse such data sets to understand claim behaviour. Some insurants

lodge more claims than others. What are the typical characteristics of insurants who lodge many claims [1]? Would it be possible to predict whether a particular insurant will lodge a claim and accordingly offer an appropriate insurance premium? A major task in DM is to find answers to these types of questions.

An important DM technique is multivariate regression or estimation. Multivariate regression is used to model functional relationships in high-dimensional data sets, which introduces the curse of dimensionality. Additive and interaction splines can be used to overcome the curse of dimensionality problem [3]. If interaction terms are limited to order-two interactions, generalized additive models reduce to the determination of coupled surfaces and curves. Thus, an important part of any data analysis algorithm for these kinds of problems is the determination of an approximating surface for large data sets. Surface fitting is therefore an important technique for the DM of large data sets.

We have developed a generic surface fitting algorithm that can handle data sizes with millions of records. The algorithm combines the favourable properties of finite element surface fitting with those of thin plate splines (TPS). The surface fitting technique, called the thin plate spline finite element method (TPSFEM), is discussed in Section 2. We discuss the numerical solution of the linear equations arising from the TPSFEM technique in Section 3. In order to interactively analyse big data sets, a considerable amount of computational power is needed in most cases. High-performance parallel and distributed computing is crucial for ensuring system scalability and interactivity as data sets grow in size and complexity. The parallel implementation of the algorithm is discussed in Section 4 and test results are given in Section 5. Conclusions and future work are given in Section 6.

2 Surface Fitting Algorithm

Data mining is attracting researchers from a number of fields such as statistics, machine learning, artificial intelligence, computational mathematics and databases, and statistics has an important role in DM. Surface fitting and smoothing spline techniques are widely used to fit data. We have developed a surface fitting algorithm, TPSFEM, that can be used for various applications such as DM and digital elevation models (DEM). For example, it can be used to model interaction terms in very large data sets for DM. The TPSFEM algorithm can handle data sizes with millions of records and can be viewed as a discrete thin plate spline. The following is a brief summary of the derivation of the method; the interested reader is referred to [4, 5] for details.

The standard thin plate spline considered here is the function $f$ that minimises the functional

$$M_\alpha(f) := \frac{1}{n} \sum_{i=1}^{n} \left( f(x_i) - y_i \right)^2 + \alpha \int \left( f_{11}(x)^2 + 2 f_{12}(x)^2 + f_{22}(x)^2 \right) dx,$$

where $x = (x_1, x_2)^T$ and $x_i = (x_{i,1}, x_{i,2})^T \in \mathbb{R}^2$, $y_i \in \mathbb{R}$, $i = 1, \ldots, n$ denote the observations, and $f_{jk}$ denotes the second partial derivative of $f$ with respect to the $j$-th and $k$-th coordinates. The smoothing parameter $\alpha$, which depends on the size of the data errors, can be determined by generalised cross validation.
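The criterion itself is not spelled out in the paper; assuming the classical formulation from the smoothing spline literature, generalised cross validation chooses the $\alpha$ that minimises

$$V(\alpha) = \frac{\frac{1}{n} \left\| (I - A(\alpha)) y \right\|^2}{\left( \frac{1}{n} \operatorname{tr}(I - A(\alpha)) \right)^2},$$

where $A(\alpha)$ denotes the influence matrix that maps the observations $y$ to the fitted values.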

In order to reduce the smoothness requirements of this problem, the functional can be reformulated in terms involving only first-order derivatives. Let $u_0$ be in the closed subspace of the Sobolev space $H^1$ and be the solution of

$$(\nabla v, \nabla u_0) = (\nabla v, u) \quad \text{for all } v \in H^1,$$

and let the function values of $u_0$ at the observation points be denoted by

$$P u := (u_0(x_1), \ldots, u_0(x_n))^T.$$

Furthermore, let

$$X := \begin{pmatrix} 1 & \cdots & 1 \\ x_1 & \cdots & x_n \end{pmatrix}^T \in \mathbb{R}^{n \times 3}$$

and $y := (y_1, \ldots, y_n)^T$. With these definitions one can introduce the functional

$$J(u, c) = \frac{1}{n} (P u + X c - y)^T (P u + X c - y) + \alpha \left( |u_1|_1^2 + |u_2|_1^2 \right) \qquad (1)$$

where the minimum is taken over all functions $u = (u_1, u_2)$ with zero mean and all constants $c = (c_0, c_1, c_2)^T$. It follows that the functional has exactly one minimum if the observation points $x_i$ are not collinear. It can also be seen that the function defined by $f(x) = u_0(x) + c_0 + c_1 x_1 + c_2 x_2$ minimises the original functional $M_\alpha(f)$ and is thus the thin plate smoothing spline. Formal proofs of these results can be found in [4]. After some manipulations and discretisation [5], one obtains the linear system of equations equivalent to (1):

$$\begin{pmatrix} A & 0 & 0 & 0 & -B_1^T \\ 0 & A & 0 & 0 & -B_2^T \\ 0 & 0 & M & F & A \\ 0 & 0 & F^T & E & 0 \\ -B_1 & -B_2 & A & 0 & 0 \end{pmatrix} \begin{pmatrix} u_1^h \\ u_2^h \\ u_0^h \\ c \\ w^h \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \\ n^{-1} N^T y \\ n^{-1} X^T y \\ 0 \end{pmatrix} \qquad (2)$$

where $w$ is the Lagrange multiplier vector, $u_0$ is a discrete approximation to $f$, and $u_1$ and $u_2$ are discrete approximations of the first derivatives of $f$. The index $h$ denotes the discretisation level and will be omitted in the subsequent discussion.

3 Solution of Linear Systems

The size of the linear system (2) is independent of the number of observations n. The observation data are visited only once, during the assembly of the matrices M, E and F and the vectors $N^T y$ and $X^T y$ in (2), so the time complexity of forming (2) is O(n). If the finite element discretisation has m nodes, then the size of (2) is 4m + 3, which is independent of n. All sub-systems in (2) have dimension m, except the fourth one, which has dimension 3. Thus, the total amount of work required for the surface fitting algorithm is O(n) to form (2) plus the work required to solve it.
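To make the O(n) assembly concrete, the following sketch streams the observation file once and scatters each record $(x_1, x_2, y)$ into the fixed-size arrays for $M$ and $n^{-1} N^T y$. It is a minimal illustration only: find_element, eval_basis and add_packed are hypothetical placeholders for the element search, basis evaluation and packed-storage update, not the authors' code.

    /* Single-pass assembly sketch: every observation is read once from
     * secondary storage; the accumulated arrays depend only on the grid
     * size m, not on the number of observations n.                      */
    #include <stdio.h>

    #define NB 4  /* bilinear basis functions per rectangular element */

    extern int  find_element(double x1, double x2, int nodes[NB]);
    extern void eval_basis(int elem, double x1, double x2, double phi[NB]);
    extern void add_packed(double *M, int row, int col, double v);

    void assemble_data_terms(FILE *data, long n, double *M, double *NTy)
    {
        double x1, x2, y, phi[NB];
        int nodes[NB];

        while (fscanf(data, "%lf %lf %lf", &x1, &x2, &y) == 3) {
            int e = find_element(x1, x2, nodes);  /* element containing obs. */
            eval_basis(e, x1, x2, phi);           /* basis values at (x1,x2) */
            for (int i = 0; i < NB; i++) {
                NTy[nodes[i]] += phi[i] * y / (double)n;
                for (int j = 0; j < NB; j++)      /* M += phi phi^T / n */
                    add_packed(M, nodes[i], nodes[j],
                               phi[i] * phi[j] / (double)n);
            }
        }
    }

Since the loop body costs O(1) per record, the whole pass is O(n), matching the complexity stated above.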

It can be seen that the system matrix of (2) is symmetric indefinite and sparse. One of the efficient techniques to solve such systems is the Uzawa algorithm [6, 7]. We applied a modified version of the Uzawa algorithm to (2). If several adjacent elements do not contain any observation points, the matrix M in (2) becomes singular and the Uzawa algorithm becomes inapplicable. In order to avoid this problem, we interchange the third and the fifth block rows of the system (2). The variant of the Uzawa algorithm used here then has the form of a block Gauss-Seidel iteration and can be written as

$$\begin{pmatrix} A & 0 & 0 & 0 & 0 \\ 0 & A & 0 & 0 & 0 \\ -B_1 & -B_2 & A & 0 & 0 \\ 0 & 0 & F^T & E & 0 \\ 0 & 0 & M & F & A \end{pmatrix} \begin{pmatrix} u_1^{k+1} \\ u_2^{k+1} \\ u_0^{k+1} \\ c^{k+1} \\ w^{k+1} \end{pmatrix} = \begin{pmatrix} B_1^T \\ B_2^T \\ 0 \\ 0 \\ 0 \end{pmatrix} w^k + \begin{pmatrix} 0 \\ 0 \\ 0 \\ n^{-1} X^T y \\ n^{-1} N^T y \end{pmatrix} \qquad (3)$$
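Written out block by block, each iteration of (3) amounts to four solves with the matrix A (block rows 1, 2, 3 and 5) and one 3 x 3 solve (block row 4). The following sketch shows one such sweep; solve_A, solve_E and the matvec_* helpers are hypothetical names standing in for the packed-diagonal kernels described in Section 4, not the authors' actual interface.

    /* One block Gauss-Seidel sweep for system (3).  t1 and t2 are scratch
     * vectors of length m; NTy and XTy hold n^{-1} N^T y and n^{-1} X^T y. */
    extern void solve_A(int m, const double *rhs, double *x);    /* A x = rhs  */
    extern void solve_E(const double *rhs3, double *c3);         /* 3x3 system */
    extern void matvec_B1 (int m, const double *v, double *out); /* B1 v       */
    extern void matvec_B2 (int m, const double *v, double *out); /* B2 v       */
    extern void matvec_B1t(int m, const double *v, double *out); /* B1^T v     */
    extern void matvec_B2t(int m, const double *v, double *out); /* B2^T v     */
    extern void matvec_M  (int m, const double *v, double *out); /* M v        */
    extern void matvec_F  (int m, const double *c3, double *out);/* F c        */
    extern void matvec_Ft (int m, const double *v, double *out3);/* F^T v      */

    void uzawa_iteration(int m, const double *NTy, const double *XTy,
                         double *u1, double *u2, double *u0, double *c,
                         double *w, double *t1, double *t2)
    {
        /* rows 1 and 2: A u1 = B1^T w,  A u2 = B2^T w (independent) */
        matvec_B1t(m, w, t1);  solve_A(m, t1, u1);
        matvec_B2t(m, w, t2);  solve_A(m, t2, u2);

        /* row 3: A u0 = B1 u1 + B2 u2 */
        matvec_B1(m, u1, t1);
        matvec_B2(m, u2, t2);
        for (int i = 0; i < m; i++) t1[i] += t2[i];
        solve_A(m, t1, u0);

        /* row 4: E c = n^{-1} X^T y - F^T u0 (negligible 3x3 system) */
        double r3[3];
        matvec_Ft(m, u0, r3);
        for (int i = 0; i < 3; i++) r3[i] = XTy[i] - r3[i];
        solve_E(r3, c);

        /* row 5: A w = n^{-1} N^T y - M u0 - F c */
        matvec_M(m, u0, t1);
        matvec_F(m, c, t2);
        for (int i = 0; i < m; i++) t1[i] = NTy[i] - t1[i] - t2[i];
        solve_A(m, t1, w);
    }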

As the sketch above makes explicit, this system requires the solution of four equations with the matrix A at each iteration step. The convergence behaviour and some variations of this approach are discussed in [5]. In the next section we discuss the parallel implementation of this approach.

4 Implementation and Parallelization

The original implementation of the TPSFEM algorithm was done in Matlab [4, 5]. For increased performance and parallelization, C and MPI were chosen for a second implementation. The main purpose of this implementation is to analyse the parallelization aspects of the TPSFEM algorithm.

Several matrices are needed within this algorithm. The matrices M and A have nine diagonals with non-zero elements and are symmetric, so only the upper part (five diagonals) has to be stored. B1 and B2 also have nine diagonals with non-zero elements, but as these matrices are non-symmetric, all nine diagonals need to be stored. The chosen data structure for these matrices is a two-dimensional array of m rows and 5 or 9 columns, respectively, where each column contains one of the diagonals with non-zero elements. With this packed data structure, good data locality and thus good cache utilization can be achieved; a sketch of a matrix-vector product in this format is given below. Finally, the matrix F is of dimension m × 3 and E is a small 3 × 3 matrix. The dot product and vector addition (DAXPY) implementations use loop unrolling for better pipeline usage.
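As an illustration of this storage scheme, the following sketch computes a matrix-vector product for a symmetric matrix such as A or M held as five upper diagonals. The diagonal offsets assume a nine-point stencil on an s × s tensor-product grid with m = s² nodes (consistent with the grid sizes used in Section 5, e.g. 2601 = 51²); this layout is our assumption, not a detail given in the paper.

    /* y = S x for a symmetric nine-diagonal matrix S stored as five upper
     * diagonals in an m x 5 row-major array, one column per diagonal.
     * Entries of a diagonal that fall outside the stencil (for example
     * across a grid row boundary) are assumed to be stored as zeros.     */
    void symm_diag_matvec(int m, int s, const double *S,
                          const double *x, double *y)
    {
        const int off[5] = { 0, 1, s - 1, s, s + 1 };  /* upper offsets */

        for (int i = 0; i < m; i++)
            y[i] = 0.0;

        for (int i = 0; i < m; i++) {
            y[i] += S[5 * i] * x[i];                   /* main diagonal */
            for (int d = 1; d < 5; d++) {
                int j = i + off[d];
                if (j < m) {
                    y[i] += S[5 * i + d] * x[j];       /* upper part    */
                    y[j] += S[5 * i + d] * x[i];       /* mirrored part */
                }
            }
        }
    }

Because the five stored diagonals of a row sit next to each other in memory, the inner loop walks the array with unit stride, which is what gives the cache behaviour mentioned above.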

As the data sets can be quite large, file reading and the assembly of the matrices M, F and E as well as the vectors N^T y and X^T y are the time-consuming steps. Fortunately, this assembly process can be parallelized easily. First, the data set file is split into P equally sized smaller files by a cyclic distribution. These files can then be processed independently by the P processors, each of which assembles local matrices and vectors. After the assembly, these local matrices and vectors have to be collected and summed to obtain the final matrices M, F, E and vectors N^T y and X^T y. The amount of data to be communicated depends on the matrix dimension m, but not on the number of data points. If the amount of communicated data is small compared to the number of data points to be processed from file, an almost ideal speedup can be achieved for the assembly process. The assembly of the matrices A, B1 and B2 depends only on the matrix dimension m and takes much less time than the assembly of M and E.

The resulting linear system (3) contains five sub-systems, four of dimension m and one of dimension 3. As can be seen from (3), the first and second sub-systems can be solved in parallel, and only the resulting vectors have to be communicated. The parallel implementation applies functional parallelism in the following way:

- The assembly of the matrices M, F, E and the vectors N^T y and X^T y is done by all P processors, each of which reads one locally stored data file with n/P data points. After the assembly, the local matrices and vectors are collected and summed on the host processor P0 (with a simple call to MPI_Reduce). Two messages are communicated, one containing M (5m floating-point values), the other containing F, N^T y, E and X^T y (4m + 12 floating-point values); a sketch of this reduction is given at the end of this section.

- The matrices A, B1 and B2 are assembled only on processors P0 and P1.

- The iterative solution with the Uzawa algorithm runs only on P0 and P1. After the initialization, processor P0 starts solving the first sub-system in (3). Processor P1 solves the second sub-system in (3) and sends the resulting vector u2 to processor P0, which can then solve the third and fifth sub-systems. The fourth sub-system has dimension 3, so its solution cost is negligible. After the convergence check, processor P0 sends the vector w to P1, which then starts the next solve of the second sub-system.

Applying Amdahl's law, the solver part of the algorithm can achieve a speedup of at most 4/3 ≈ 1.333 if the communication of vectors is neglected and all four large sub-systems need the same time to be solved.
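The collect-and-sum step in the first item above might be organised as follows. This is a minimal sketch: assemble_local stands for the single-pass assembly of Section 3, the buffer sizes follow the message sizes quoted above (5m values for M; 4m + 12 values for F, N^T y, E and X^T y packed together), and MPI_Init is assumed to have been called by the surrounding program.

    /* Parallel assembly reduction: each processor assembles contributions
     * from its own slice of the data file, then the host P0 obtains the
     * global matrices and vectors with two calls to MPI_Reduce.          */
    #include <mpi.h>
    #include <stdlib.h>

    extern void assemble_local(const char *fname, int m,
                               double *M, double *FNEX);

    void parallel_assembly(const char *fname, int m,
                           double **M_out, double **FNEX_out)
    {
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double *M    = calloc(5 * (size_t)m,      sizeof *M);
        double *FNEX = calloc(4 * (size_t)m + 12, sizeof *FNEX);
        double *Mg = NULL, *FNEXg = NULL;
        if (rank == 0) {                /* receive buffers on the host */
            Mg    = calloc(5 * (size_t)m,      sizeof *Mg);
            FNEXg = calloc(4 * (size_t)m + 12, sizeof *FNEXg);
        }

        assemble_local(fname, m, M, FNEX);   /* read the local file once */

        /* sum the local contributions on the host processor P0 */
        MPI_Reduce(M,    Mg,    5 * m,      MPI_DOUBLE, MPI_SUM, 0,
                   MPI_COMM_WORLD);
        MPI_Reduce(FNEX, FNEXg, 4 * m + 12, MPI_DOUBLE, MPI_SUM, 0,
                   MPI_COMM_WORLD);

        free(M);  free(FNEX);
        *M_out = Mg;  *FNEX_out = FNEXg;
    }

The key property is that the message volume is O(m) while the file reading is O(n/P), so for n in the millions the communication cost is negligible and the assembly speedup approaches P.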

5 Test Results from the Parallel Implementation

The TPSFEM algorithm has already been applied to fit surfaces to large data sets from the insurance, flow field, digital elevation and magnetic field areas. In this section we demonstrate the efficiency of the parallel implementation of the TPSFEM algorithm using two large digital elevation data sets, although the algorithm is equally applicable as a non-parametric regression technique in DM. Elevation data is an important data component required to transform a Geographical Information System (GIS) from a 2-dimensional map storage system into a 3-dimensional information system [8]. Hence, an efficient surface fitting algorithm such as TPSFEM is an important tool for GIS applications. The involvement of end-users in this area is usually interactive; therefore, they can benefit significantly from a fast surface fitting algorithm.

The two digital elevation data sets were obtained by digitizing the map of the Murrumbidgee region in Australia. The first data set has n = 547453 observation points whereas the second has n = 1887250 observation points. The test platform is a cluster of 10 Sun SPARC5 workstations networked with 10 Mbit/s twisted-pair Ethernet. We did the test

runs for m = 2601, 10201 and 63001. The following table shows timings in milliseconds for the assembly stage of the first data set. All timings presented here are averages of ten test runs on an otherwise idle platform.

        m    Serial assembly          Parallel assembly
                                 Computation  Communication    Total
     2601              87159           8781           1258    10039
    10201              86809           8770           5162    13932
    63001              87250           8935          29827    38762

In the following table we give timings for the second data set with m = 2601. The serial assembly and solution steps took 285780 ms and 40893 ms, respectively. The parallel solution step with two processors took 32384 ms. As explained in the previous section, the parallel solution stage employs only two processors.

    Number of             Parallel assembly
    processors   Computation  Communication    Total
        3              97231           1123    98354
        7              41653           1524    43177
       10              28187           1800    30987

With 10 processors the assembly speedup over the serial assembly time of 285780 ms is about 9.2, close to the ideal value. In the next table we give timings for the second data set with m = 10201. The serial assembly and solution steps took 284939 ms and 624287 ms, respectively. The parallel solution step with two processors took 548180 ms.

    Number of             Parallel assembly
    processors   Computation  Communication    Total
        3              98143           1762    99905
        7              41856           4382    46238
       10              29374           5769    35143

To illustrate the TPSFEM algorithm, a subset of the first elevation data set is presented in Figure 1, and the surface obtained from the TPSFEM algorithm is given in Figure 2. The units in Figures 1 and 2 denote the longitudes and latitudes of the region.

6 Conclusions

In this work we concentrated on a parallel version of the TPSFEM algorithm. The TPSFEM method can fit smooth surfaces to very large data sets. Two parts of the algorithm consume most of the processing time: forming the matrices M, F, E and the vectors N^T y and X^T y in the system (2), and solving the system (3). We demonstrated that an almost ideal speedup can be achieved in the formation of the matrices for very large data sets. Unfortunately, the solver part of this parallel implementation is not scalable, in contrast to the assembly part. Our future research will therefore concentrate on how to achieve scalability by using distributed algorithms to solve the linear systems; as we deal with sparse matrices, general parallel solver routines do not necessarily scale well. We are also developing a generalised cross validation technique to determine the smoothing parameter in (2). This work is in progress and will be reported elsewhere.

References

[1] Data Mining: An Introduction, Data Distilleries, DD19961, 1996. (http://www.ddi.nl)

[2] A. A. Freitas and S. H. Lavington, Mining Very Large Databases with Parallel Processing, Kluwer Academic Publishers, 1998.

[3] T. J. Hastie and R. J. Tibshirani, Generalized Additive Models, Monographs on Statistics and Applied Probability, Chapman and Hall, 1990.

[4] M. Hegland, S. Roberts and I. Altas, Finite Element Thin Plate Splines for Data Mining Applications, in Mathematical Methods for Curves and Surfaces II, M. Daehlen, T. Lyche and L. L. Schumaker, eds., pp. 245-253, Vanderbilt University Press, 1998. (ftp://farrer.riv.csu.edu.au:/pub/irfan/hegra97a.ps.gz)

[5] M. Hegland, S. Roberts and I. Altas, Finite Element Thin Plate Splines for Surface Fitting, in Computational Techniques and Applications: CTAC97, B. J. Noye, M. D. Teubner and A. W. Gill, eds., pp. 289-296, World Scientific, 1997. (ftp://farrer.riv.csu.edu.au:/pub/Hegland/hegra97b.ps.gz)

[6] M. Fortin and R. Glowinski, Augmented Lagrangian Methods: Applications to the Numerical Solution of Boundary-Value Problems, North-Holland, 1983.

[7] H. C. Elman and G. H. Golub, Inexact and Preconditioned Uzawa Algorithms for Saddle Point Problems, SIAM J. Numer. Anal., 31 (1994), pp. 1645-1661.

[8] L. Lang, GIS Goes 3D, Computer Graphics World, March 1989, pp. 38-46.
