A Framework for Finding Distributed Data Mining Strategies That are Intermediate Between Centralized Strategies and In-Place Strategies

Andrei L. Turinsky
Laboratory for Advanced Computing, University of Illinois at Chicago, 851 South Morgan St., 322 SEO, Chicago, IL 60607, USA
[email protected] ABSTRACT
Distributed data mining is emerging as a fundamental computational problem. A common approach to distributed data mining is to build separate models at geographically distributed sites and then to combine the models. At the other extreme, all of the data can be moved to a central site and a single model built. With the commodity internet and large data sets, the former approach is the quickest but often the least accurate, while the latter approach is more accurate but generally quite expensive in terms of the time required. Of course, there are a variety of intermediate strategies in which some of the data is moved while the rest is left in place, analyzed locally, and the resulting models are moved and combined. These intermediate cases are becoming of practical significance with the explosion of fiber and the emergence of high performance networks. In this paper, we examine this intermediate case in a setting in which high performance networks are present and the cost function represents both computational and communication costs. We reduce the problem to a convex programming problem so that standard techniques can be applied. We illustrate our approach through the analysis of an example showing the complexity and richness of this class of problems.
Keywords
Distributed data mining, high performance networks, linear programming.
Robert L. Grossman†
Laboratory for Advanced Computing, University of Illinois at Chicago, 851 South Morgan St., 322 SEO, Chicago, IL 60607, USA
[email protected]
http://www.ncdm.uic.edu

†Point of Contact. R. Grossman is also with Magnify, Inc.
1. INTRODUCTION
Because moving large data sets over the commodity internet can be very time consuming, a common strategy today for mining geographically distributed data is to leave the data in place, build local models, and combine the models at a central site. Call this an in-place strategy. At the other extreme, when the amount of geographically distributed data is very small, the most naive strategy is simply to move all the data to a central site and build a single model there. Call this a centralized strategy.

Given geographically distributed data, we can either a) move data, b) move the results of applying algorithms to data (models), or c) move the results of applying models to data (result vectors). It is not uncommon for there to be a 10x-100x difference in the sizes of the data, model, and result vectors.

Consider a cost function that measures the total cost to produce a model and includes both the communication and processing costs. As the size of the data grows and the speed of the link connecting two sites decreases, an in-place strategy is, generally speaking, less expensive but also less accurate. Conversely, the centralized strategy is generally more expensive but also more accurate. Given a minimally acceptable accuracy, it is plausible that there is an intermediate strategy which produces a model with this level of accuracy at the minimum possible cost. Call these intermediate strategies. In this paper, we show that this is indeed the case and describe an algorithm called OPTDMP (OPTimal Data and Model Partitions) for finding such a strategy. We also give an example to show that such intermediate strategies occur rather naturally.

Today, the capacity of the broadband communications infrastructure is doubling every 9-12 months, faster than the 18-month doubling of processor speeds (Moore's law). For example, a 155 Mb/s OC-3 link can move 10 Gigabytes of data in about 15 minutes.
Given this infrastructure and the growing importance of large distributed data sets, intermediate strategies between in-place and centralized strategies of the type described here should be of growing interest.

This paper makes the following contributions:

1. We introduce the problem of computing intermediate strategies in distributed data mining and point out that these types of strategies will become more and more important with the emergence of wide area high performance networks.
2. We provide a mathematical framework for analyzing intermediate distributed data mining strategies.
3. We introduce an algorithm called OPTDMP for finding intermediate strategies in the case of a linear cost function.
4. We show with an example that intermediate strategies are interesting, even for the simple case of linear cost functions.

Given the analysis of the paper, it is straightforward to define versions of OPTDMP for a variety of other cost functions. In future work we will investigate the class of strategies that result and present experimental studies using the OPTDMP algorithm.

Our point of view is to reduce finding intermediate strategies to a mathematical programming problem which minimizes a cost function subject to an error constraint. The framework of Sections 3 and 4 holds for a wide range of cost functions and mathematical programming algorithms. The purpose of this paper is to introduce these ideas with a simple cost function and a simple example.

Section 2 of this paper describes related work. Section 3 describes the computational model. Section 4 describes the OPTDMP algorithm, and Section 5 works out an example in detail. A fuller version of this paper containing additional details and more examples is in preparation [6].
2. BACKGROUND AND RELATED WORK
As mentioned above, a common approach to distributed data mining is centralized learning, where all the data is moved to a single central location for analysis and predictive modeling. Another common approach is local learning, where models are built locally at each site and then moved to a common site where they are combined. Ensemble learning [3] is often used as a means of combining models built at geographically distributed sites. Methods for combining models in an ensemble include voting schemes [3], meta-learning [11], knowledge probing [7], Bayesian model averaging and model selection [10], stacking [12], mixtures of experts [13], etc.

Several systems for the analysis of distributed data have been developed in recent years. These include the JAM system developed by Stolfo et al. [11], the Kensington system developed by Guo et al. [7], and BODHI developed by Kargupta et al. [8], [9]. These systems differ in several ways. For example, JAM uses meta-learning, which combines several models by building a separate meta-model whose inputs are the outputs of the collection of models and whose output is the desired outcome. Kensington employs knowledge probing, which considers learning from a black-box viewpoint and creates an overall model by examining the input and the output of each model, as well as the desired output. The BODHI system employs so-called collective mining, which relies in part on ideas from Fourier analysis to combine different models. In terms of data and model transfer, JAM, Kensington, and BODHI all use local learning.

A variety of load balancing techniques have long been utilized in parallel computing applications. Load balancing is aimed at finding an optimal regime of moving data to the nodes of a supercomputer or, more recently, of a network of workstations. Zaki [14] provides an example of a load balancing method that optimizes the efficiency of parallel computation on a network of compute nodes. Other examples and motivating discussion can be found in [2] and [4]. These techniques, however, do not directly target the specific issues that arise in distributed data mining, such as ways of combining predictive models and the accuracy of the resulting predictive system.

The work presented in this paper is, to our knowledge, the first attempt to identify a fundamental trade-off in distributed data mining; namely, the trade-off between the efficiency and cost-effectiveness of a distributed data mining application on one side, and the accuracy and reliability of the resulting predictive system on the other. We provide evidence that the most efficient application may give unacceptably inaccurate predictive results, while the most accurate predictions may require an inefficient data processing strategy. We also explore a variety of intermediate strategies in this paper.

A new system for distributed data mining called Papyrus is now being developed at the National Center for Data Mining [5].
Among other features, it is designed to support different data and model strategies, including local learning, centralized learning, and a variety of intermediate strategies, that is, hybrid learning. More work is under way to develop a methodology for choosing an information transfer strategy that is optimized for a particular data mining task. This paper presents some of the motivation and ideas behind these efforts.
3. COMPUTATIONAL MODEL
We use a computational model consisting of a collection of geographically distributed processors, each with dedicated memory, connected by a high performance network. We assume that a network access is substantially more expensive than a local memory access. Naturally, this assumption may not hold for very fast networks, where remote access to memory might in fact be faster than a local disk access. However, here we focus on a situation where the processors are connected via a high speed broadband-type network that adheres to quality of service requirements.
Network Configuration
Formally, we assume that there are n different sites connected by a network. The cost of processing data at the i-th node into a predictive model is c_i dollars per Gigabyte, and the optimal cost of moving data from the i-th to the j-th node via the cheapest route between the two nodes is c_ij dollars per Gigabyte. One of the nodes is the network root, where the overall result will be computed.
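Since c_ij is defined as the cheapest-route cost, it can be derived from raw per-link prices with an all-pairs shortest path computation. A minimal sketch follows (Floyd-Warshall is our choice of method here, not something the paper prescribes; the link prices are the ones used in the example of Section 5, where routing Branch 1 to Branch 2 through Branch 3 at $1 + $3 beats the direct $4.5 link):

```python
import numpy as np

def cheapest_route_costs(link):
    """All-pairs cheapest per-gigabyte transfer costs (Floyd-Warshall)."""
    c = link.copy()
    n = len(c)
    for k in range(n):
        for i in range(n):
            for j in range(n):
                c[i, j] = min(c[i, j], c[i, k] + c[k, j])
    return c

# Raw link prices in $/Gb for the three-branch example of Section 5
link = np.array([[0.0, 4.5, 1.0],
                 [4.5, 0.0, 3.0],
                 [1.0, 3.0, 0.0]])
c = cheapest_route_costs(link)
# c[0, 1] is 4.0: routing through Branch 3 beats the direct $4.5 link
```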
Building Models

Our assumption is that at each node, a choice must be made: either ship raw data across the network to another node for processing, or process the data locally into a predictive model and ship the model across the network. Given this viewpoint, building models consists of the following steps:

1. Re-distribute data across the network.
2. At every node, compute a predictive model.
3. Re-distribute all local predictive models to the root.
4. At the root, combine all models into a single predictive model.

Our objective is to determine the procedure of data transfers over the network that minimizes the overall cost of building models. Let D_i be the initial amount of data (in Gigabytes) at the i-th node. After the re-distribution in Step 1, this node accumulates D̃_i of data. Let M_i be the size of the predictive model computed from D̃_i in Step 2. It is later transferred to the root. We assume that when data is processed into a predictive model, its amount is compressed uniformly for each node with a coefficient α, so that M_j = α D̃_j for all j. Without loss of generality, we can assume that the K-th node is chosen as the root.

4. THE OPTDMP ALGORITHM

In this section, we describe an algorithm for finding OPTimal strategies for Data and Model Partitions called OPTDMP. The basic idea is to reduce the problem of finding optimal strategies for building models to a constrained optimization problem.

Strategies

A strategy X is a matrix of numbers X = [x_ij], i, j = 1, ..., n, where x_ij is the amount of the data D_i that is moved from the i-th node to the j-th node for processing. This portion of data contributes to D̃_j, is processed, and later transferred to the root as a part of M_j. Note that

x_ij ≥ 0,    Σ_{j=1..n} x_ij = D_i,    D̃_j = D̃_j(X) = Σ_{i=1..n} x_ij.

In-place and Centralized Strategies

We define a centralized or naive strategy to be a strategy X_0 = X_0(K) of moving data from all other nodes to the root in Step 1 for further computation. For such a strategy, x_iK = D_i for each i and the rest of the x_ij are zero. We also define an in-place strategy to be a strategy X_p of processing all data locally. For such a strategy, x_ii = D_i for each i and the rest of the x_ij are zero.

Cost Function

The overall cost function for a strategy X is easily computed as

C(X) = Σ_ij (c_ij x_ij + c_j x_ij + α c_jK x_ij) = Σ_ij ĉ_ij x_ij.

The coefficients c_ij represent the cost of network communication between nodes i and j per unit volume of data, while the coefficients c_j correspond to the algorithm processing cost at a specific node per unit volume of data. Therefore, the first component in the sum represents the cost of moving raw data from node i to node j for processing, the second component represents the cost of processing this data at node j, and the third component represents the cost of moving the obtained results to the root node K for combining with other results. We do not include the cost of combining the results at the root because this cost component does not depend on the particular strategy; it would only change the cost function by a constant term, which has no effect on the optimization problem. Since data processing reduces the amount of storage by a factor α, as discussed before, the coefficient α is applied to the amount x_ij of data in the third term of the cost equation. We assume a linear cost of data processing here, which will later lead to a linear programming problem. More generally, similar algorithms would work for convex non-linear cost functions. The actual values of the coefficients may be estimated from network throughput capacities, the cost of particular processing equipment, etc. The cost differs among the naive strategies; let C_0 be the best cost available under a naive strategy. Also, C_p = C(X_p) is the cost of the in-place strategy.
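The combined per-Gigabyte coefficients ĉ_ij = c_ij + c_j + α c_jK can be tabulated once the network and processing costs are known. A minimal sketch, using the cost figures of the example worked out in Section 5:

```python
import numpy as np

def combined_cost_matrix(c_link, c_proc, alpha, root):
    """Per-gigabyte cost c_hat[i, j] of shipping raw data i -> j,
    processing it at j, and shipping the compressed model j -> root."""
    n = len(c_proc)
    c_hat = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            c_hat[i, j] = c_link[i, j] + c_proc[j] + alpha * c_link[j, root]
    return c_hat

# Cheapest-route link costs and processing costs of the Section 5 example
c_link = np.array([[0.0, 4.0, 1.0],
                   [4.0, 0.0, 3.0],
                   [1.0, 3.0, 0.0]])
c_proc = np.array([4.0, 1.5, 2.0])
c_hat = combined_cost_matrix(c_link, c_proc, alpha=0.02, root=0)
# c_hat[0] is [4.00, 5.58, 3.02], the first row of the matrix in Section 5
```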
Error
We assume that there are two factors that introduce error. First, the loss of accuracy may be due to the nature of the computational algorithm itself, giving an error level ε_0 regardless of the strategy. Secondly, accuracy is lost if the data is processed at different nodes. In this case, there is a loss from using a model built from local data instead of a model built at a central location after moving the data. The more distributed D̃(X) = (D̃_1(X), ..., D̃_n(X)) is over the network, the higher the potential for this error. We use the following form of the error function for a strategy X:

ε(X) = ε_0 + M (1 − ‖D̃(X)‖ / |D|),    M = const,

where |D| = Σ_j D̃_j is the overall amount of data in the network and ‖D̃‖ = sqrt(Σ_j D̃_j²) is the usual Euclidean norm of the vector D̃(X). Various other forms of ε(X) are available. Note that naive strategies have only the first error component ε_0 and hence are the most accurate.
It follows that the data mining error takes its minimal value ε_0 when all data is available at the (single) processing node, and its largest possible value ε_d when the data is evenly distributed among the nodes. These values may be computed once for a particular type of data mining application by placing the data in the appropriate fashion, i.e., distributed or in-place, and processing it. The coefficient M is chosen so that ε_0 ≤ ε(X) ≤ ε_d, that is, to guarantee that the maximum possible value of the error function is ε_d. Note that other (convex) forms of the error function could be used without substantial changes to the algorithm.
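The error model above is straightforward to transcribe. The sketch below also calibrates M: in the evenly distributed case the ratio ‖D̃‖/|D| attains its minimum value 1/sqrt(n), so setting ε equal to ε_d there fixes M. The constants reproduce those of the Section 5 example:

```python
import numpy as np

def error(X, eps0, M):
    """eps(X) = eps0 + M * (1 - ||D~(X)|| / |D|)."""
    D_tilde = X.sum(axis=0)               # data accumulated at each node
    return eps0 + M * (1 - np.linalg.norm(D_tilde) / D_tilde.sum())

def calibrate_M(eps0, eps_d, n):
    """Choose M so the evenly distributed case attains eps_d; the norm
    ratio ||D~|| / |D| is then at its minimum value 1/sqrt(n)."""
    return (eps_d - eps0) / (1.0 - 1.0 / np.sqrt(n))

M = calibrate_M(0.03, 0.11, 3)            # ~0.1893, as in Section 5
X_inplace = np.diag([300.0, 550.0, 150.0])
# error(X_inplace, 0.03, M) is ~0.0973, the 9.73% quoted in Section 5
```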
The OPTDMP Algorithm
The OPTimal Data and Model Partition (OPTDMP) strategy X* = [x*_ij] is a solution of the following optimization problem:

minimize over [x_ij]:  C(X) = Σ_ij ĉ_ij x_ij
subject to:  x_ij ≥ 0,    Σ_j x_ij = D_i,
             ε(X) = ε_0 + M (1 − ‖D̃(X)‖ / |D|) ≤ ε_max,

where ε_max is the maximum error level allowed and the vector D̃ is given by D̃_j = Σ_i x_ij. The optimal solution is a strategy X* that gives the least cost C* = C(X*) among all sufficiently accurate strategies.

The first two conditions define a linear programming problem over a convex bounded polyhedral domain B. It can easily be solved, and its solution X̄ gives the best cost attainable in the absence of accuracy restrictions [1]. Geometrically, X̄ is the "lowest" vertex of the polyhedron, where the direction is determined by the level sets of the linear cost function. Note that the error function ε(X) has convex level surfaces. The third condition of the optimization problem is, therefore, an additional convex non-linear restriction. Geometrically, it discards a convex subset Q of the polyhedron B, thus producing a set of acceptable strategies A ⊆ B. The lowest point of A is the optimal solution X*. If X̄ satisfies the accuracy requirement and hence is not discarded, it is the lowest point of the set A; in this case, the optimal strategy X* = X̄. Otherwise, X* lies somewhere on the intersection of B and the non-linear ε_max-error level surface. By a convexity argument, the lowest point of A must be on the intersection of the error level surface with one of the edges of the polyhedron B. Therefore, to find X*, we can start at the formerly lowest point X̄ of B and go "up" (in the direction of increasing cost) along each edge until it intersects the error level surface ε(X) = ε_max, which occurs when the lowest boundary of the set A of acceptable strategies is reached. It is then a simple task to choose X* as the lowest of the intersection points.
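Because the only linear constraints are the row sums Σ_j x_ij = D_i, the accuracy-free linear program decouples by row and the lowest vertex X̄ can be read off without a solver: each node i simply ships all of D_i to the node j with the smallest combined cost ĉ_ij. A minimal sketch (the ĉ matrix is the one computed in the Section 5 example):

```python
import numpy as np

def cheapest_vertex(c_hat, D):
    """Minimize sum(c_hat * X) subject to X >= 0 with row i summing to
    D[i].  The rows are independent, so row i puts all of D[i] into its
    cheapest column; this is the "lowest" vertex of the polyhedron B."""
    X = np.zeros_like(c_hat)
    for i, d in enumerate(D):
        X[i, np.argmin(c_hat[i])] = d
    return X

# c_hat from the example worked out in Section 5
c_hat = np.array([[4.00, 5.58, 3.02],
                  [8.00, 1.58, 5.02],
                  [5.00, 4.58, 2.02]])
X_bar = cheapest_vertex(c_hat, [300.0, 550.0, 150.0])
# (c_hat * X_bar).sum() is $2078, the unconstrained optimum of Section 5;
# the accuracy restriction still has to be checked separately.
```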
This algorithm is simple yet effective and, as shown in the following example, produces a non-trivial optimal cost solution. Note that the decision of which instances of the data to move is not covered by this algorithm, just the fraction of the data to move. In the simplest case, we would assume that the data is homogeneous, or homogeneous within each of several stratified subsets, and move random samples of the data or of the stratified subsets.
5. AN EXAMPLE – MINING CUSTOMER PROFILES DATA
Suppose a department store chain is interested in launching an advertising campaign. As a preparation, it wants to identify customers who are likely to respond positively. A collection of customer records is stored at several branches of the chain. Our data mining application must determine a 2% set of "target" customers. Let there be n = 3 department store branches, with Branch 1 as the Main Branch. All branches of the department store chain (nodes) are linked by a network. The following technique is adopted by the management: re-balance customer records across the network for cheaper processing; at every branch, a classification algorithm chooses 2% of the customer records received at that branch as "targets"; such local results are then shipped to the Main Branch, combined into a single set, and reported as the output of the data mining application. In terms of our model, the D_i are the initial amounts of customer records at each branch, the D̃_j are the amounts after the data has been re-balanced, the M_j are the amounts of selected local targets, and α = 0.02. The Main Branch is the network root. The initial data distribution D_i, the computing cost c_i at each node, and the cost of data transfer for each network connection are shown in the figure below. Note that |D| = Σ_i D_i = 1000 Gb = 1 Terabyte.
[Figure: The three-branch network. Branch 1 (Main Branch): D1 = 300 Gb, c1 = $4/Gb. Branch 2: D2 = 550 Gb, c2 = $1.5/Gb. Branch 3: D3 = 150 Gb, c3 = $2/Gb. Link costs: Branch 1-Branch 3 at $1/Gb, Branch 2-Branch 3 at $3/Gb, Branch 1-Branch 2 at $4.5/Gb.]
By simple observation or after applying the OPTDMP algorithm, we conclude that the direct connection between Branch 1 and Branch 2 is too expensive and should be ignored. Using the other two connections, we obtain the matrix ĉ of the optimal cost (in dollars per Gigabyte) of moving customer records from Branch i to Branch j for processing and subsequent transfer of the results to Branch 1:
[c_ij] =
    0     4     1
    4     0     3
    1     3     0

[ĉ_ij] = [c_ij + c_j + α c_j1] =
    4.00  5.58  3.02
    8.00  1.58  5.02
    5.00  4.58  2.02

Assume the possible error of the application is known to be within the range from ε_0 = 0.03 to ε_d = 0.11. Then M = 0.189282. Let the maximum acceptable error be ε_max = 8%, so that the error function is

ε(X) = 0.03 + 0.189282 (1 − ‖D̃(X)‖ / 1000),    where D̃_j = Σ_i x_ij.
Consider the convex polyhedral domain of the linear program:

B = { X : x_ij ≥ 0,  x_11 + x_12 + x_13 = 300,  x_21 + x_22 + x_23 = 550,  x_31 + x_32 + x_33 = 150 }

The in-place strategy appears to be relatively cheap but too inaccurate:

X_p = [300, 0, 0; 0, 550, 0; 0, 0, 150]  ⟹  C_p = Σ_ij ĉ_ij x_ij = $2372,  ε(X_p) = 9.73% > ε_max

On the other side of the spectrum, there are 3 possible naive strategies. All of them are highly accurate, giving the minimum possible error level of ε_0 = 3%. Consider the expenses:

The strategy of moving all data to Branch 1 costs C = $6350.
The strategy of moving all data to Branch 2 costs C = $3230.
The strategy of moving all data to Branch 3 costs C = $3970.

Therefore, the best naive strategy cost is C_0 = $3230, which is considerably higher than the cost of the in-place strategy. This leads us to the conclusion that there must be an optimally balanced strategy in between.

In the absence of accuracy restrictions, our optimization problem translates into the following linear program: minimize C(X) over [x_ij], where

C(X) = Σ_ij ĉ_ij x_ij = 4.00 x_11 + 5.58 x_12 + 3.02 x_13 + 8.00 x_21 + 1.58 x_22 + 5.02 x_23 + 5.00 x_31 + 4.58 x_32 + 2.02 x_33

subject to

x_11 + x_12 + x_13 = 300,  x_21 + x_22 + x_23 = 550,  x_31 + x_32 + x_33 = 150,  x_ij ≥ 0,

the solution to which is the cheapest of all possible strategies:

X̄ = [0, 0, 300; 0, 550, 0; 0, 0, 150],  C(X̄) = $2078

However, ε(X̄) = 8.48% > ε_max and hence X̄ is unacceptable. We turn next to the OPTDMP solution.

The strategy X̄ is the "lowest" vertex of B, and there are 6 edges emanating from it "upward". Their parametric equations are:

e1: [x_11, 0, 300 − x_11; 0, 550, 0; 0, 0, 150],  0 ≤ x_11 ≤ 300
e2: [0, x_12, 300 − x_12; 0, 550, 0; 0, 0, 150],  0 ≤ x_12 ≤ 300
e3: [0, 0, 300; x_21, 550 − x_21, 0; 0, 0, 150],  0 ≤ x_21 ≤ 550
e4: [0, 0, 300; 0, 550 − x_23, x_23; 0, 0, 150],  0 ≤ x_23 ≤ 550
e5: [0, 0, 300; 0, 550, 0; x_31, 0, 150 − x_31],  0 ≤ x_31 ≤ 150
e6: [0, 0, 300; 0, 550, 0; 0, x_32, 150 − x_32],  0 ≤ x_32 ≤ 150

Now the procedure is as follows: on each edge e_k, search for a strategy X ∈ e_k that first reaches the required accuracy threshold ε_max and hence might be the optimal solution. (If the acceptable set A is not reached, continue along all new edges upward.) After substituting the parametric equations of the edges into ε(X) = ε_max, we find that the lowest intersection with the ε_max-error surface occurs on edges e2 and e6, both giving the same cost:

X* = [0, 93.99, 206.01; 0, 550, 0; 0, 0, 150]  or  [0, 0, 300; 0, 550, 0; 0, 93.99, 56.01]

C* = $2318.60,  ε(X*) = 8%

Although it looks like there are two different optimal strategies, in reality both of them represent the same optimal transfer policy:
Ship all 300 Gb of customer records from Branch 1 to Branch 3 and combine with 150 Gb of customer records that have been there already.
Of the resulting 450 Gb of records at Branch 3, ship 93.99 Gb to Branch 2 and combine with 550 Gb of customer records that have been there already.
Analyze all customer records at Branches 2 and 3 with a data mining algorithm and select target records.
Move the local classification results from Branch 2 to Branch 3, combine them with the local results that were there, and move the entire target record set to Branch 1.
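The whole example can be checked numerically. The sketch below reimplements the edge search under the simplifying assumption that only the first layer of edges out of the cheapest vertex needs to be explored (which, as shown above, is enough for this example); it recovers the $2318.60 strategy:

```python
import numpy as np

def edge_walk(c_hat, D, eps0, M, eps_max, tol=1e-6):
    """Sketch of the OPTDMP edge search for a polytope with row-sum
    constraints only.  Start at the cheapest vertex; along each edge,
    bisect for the point where eps(X) first drops to eps_max; return
    the cheapest such point.  Only the first layer of edges out of the
    vertex is explored, which suffices for the Section 5 example."""
    n = len(D)
    total = sum(D)

    def eps(X):
        return eps0 + M * (1 - np.linalg.norm(X.sum(axis=0)) / total)

    def cost(X):
        return float((c_hat * X).sum())

    # Cheapest vertex: each row ships everything to its cheapest column.
    best_cols = c_hat.argmin(axis=1)
    X_bar = np.zeros((n, n))
    for i in range(n):
        X_bar[i, best_cols[i]] = D[i]
    if eps(X_bar) <= eps_max:
        return X_bar  # already accurate enough

    best = None
    for i in range(n):            # edge: shift mass in row i ...
        for j in range(n):        # ... from best_cols[i] into column j
            if j == best_cols[i]:
                continue

            def X_at(t):
                X = X_bar.copy()
                X[i, best_cols[i]] -= t
                X[i, j] += t
                return X

            # eps is concave along an edge (the norm of an affine path
            # is convex), so if the far endpoint is still infeasible the
            # whole edge is infeasible.
            if eps(X_at(D[i])) > eps_max:
                continue
            lo, hi = 0.0, D[i]    # eps(lo) > eps_max >= eps(hi)
            while hi - lo > tol:
                mid = 0.5 * (lo + hi)
                if eps(X_at(mid)) > eps_max:
                    lo = mid
                else:
                    hi = mid
            cand = X_at(hi)
            if best is None or cost(cand) < cost(best):
                best = cand
    return best

c_hat = np.array([[4.00, 5.58, 3.02],
                  [8.00, 1.58, 5.02],
                  [5.00, 4.58, 2.02]])
best = edge_walk(c_hat, [300.0, 550.0, 150.0],
                 eps0=0.03, M=0.189282, eps_max=0.08)
# best rebalances the data to D~ = (0, ~643.99, ~356.01) at a total
# cost of ~$2318.6, matching the optimal policy derived above
```

Note that the two tied optima (on e2 and e6) produce the same rebalanced distribution D̃, so either one realizes the policy described above.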
Its network geometry is depicted in the following figure:

[Figure: The optimal transfer policy. Step 1, re-balance the customer records: Branch 1 (D1 = 300) moves everything to Branch 3 (D3 = 150), and Branch 3 moves 93.99 Gb on to Branch 2 (D2 = 550). Step 2, mine the customer records at Branches 2 and 3 and produce target records (M2 = 12.88, M3 = 7.12). Step 3, ship all target records to Branch 1: Branch 2 moves everything to Branch 3, which moves everything to Branch 1 (M1 = 0).]

Unlike the in-place strategy, X* is sufficiently accurate and hence acceptable. On the other hand, it provides 28% savings over the best of the (acceptable) naive strategies, and is thus superior to both of the simplistic approaches to distributed data mining.

6. CONCLUSION AND SUMMARY

In this paper, we have introduced a new framework and methodology for distributed data mining. It allows us to choose a cost-optimal balance between local computation and node-to-node communication and data transfer. We show that this framework effectively bridges two simple approaches to distributed data mining which are common today: one that computes all data locally (in-place mining) and one that moves all data to a single processing node (centralized mining). We call the strategies in between intermediate strategies. The framework reduces the problem of finding intermediate strategies to a mathematical programming problem which minimizes a cost function, incorporating both communication and processing terms, subject to an error constraint. We show by example that this problem is interesting even for linear cost functions. Finally, we introduce an algorithm, OPTDMP, for finding intermediate strategies.

In future work, we will investigate variants of the OPTDMP algorithm for more complicated cost functions and present experimental studies. We will also investigate different strategies for selecting which data to move from node to node.

7. REFERENCES

[1] V. Chvátal. Linear Programming. Freeman and Co., 1983.
[2] A. Cheung and A. Reeves. High performance computing on a cluster of workstations. Proceedings of the 1st International Symposium on High Performance Distributed Computing, pages 152-160, September 1992.
[3] T. G. Dietterich. Machine learning research: Four current directions. AI Magazine, 18:97-136, 1997.
[4] A. Grimshaw, J. Weissman, E. West, and E. Loyot. Metasystems: An approach combining parallel processing and heterogeneous distributed computing systems. Journal of Parallel and Distributed Computing, 21(3):257-270, 1994.
[5] R. Grossman, S. Bailey, A. Ramu, B. Malhi, H. Sivakumar, and A. Turinsky. Papyrus: A system for data mining over local and wide area clusters and super-clusters. Proceedings of Supercomputing 1999, 1999.
[6] R. L. Grossman and A. Turinsky. Optimal strategies for moving data and models in distributed data mining. In preparation, 2000.
[7] Y. Guo, S. M. Rueger, J. Sutiwaraphun, and J. Forbes-Millott. Meta-learning for parallel data mining. Proceedings of the Seventh Parallel Computing Workshop, pages 1-2, 1997.
[8] H. Kargupta, I. Hamzaoglu, and B. Stafford. Scalable, distributed data mining using an agent based architecture. Proceedings of the Third International Conference on Knowledge Discovery and Data Mining, pages 211-214, 1997.
[9] H. Kargupta, B. Park, E. Johnson, E. Sanseverino, L. D. Silvestre, and D. Hershberger. Collective data mining from distributed vertically partitioned feature space. Workshop on Distributed Data Mining, International Conference on Knowledge Discovery and Data Mining, 1998.
[10] A. E. Raftery, D. Madigan, and J. A. Hoeting. Bayesian model averaging for linear regression models. Journal of the American Statistical Association, 92:179-191, 1996.
[11] S. Stolfo, A. L. Prodromidis, and P. K. Chan. JAM: Java agents for meta-learning over distributed databases. Proceedings of the Third International Conference on Knowledge Discovery and Data Mining, 1997.
[12] D. Wolpert. Stacked generalization. Neural Networks, 5:241-259, 1992.
[13] L. Xu and M. I. Jordan. EM learning on a generalized finite mixture model for combining multiple classifiers. In Proceedings of the World Congress on Neural Networks, 1993.
[14] M. Zaki, W. Li, and S. Parthasarathy. Customizing dynamic load balancing for a network of workstations. Journal of Parallel and Distributed Computing: Special Issue on Performance Evaluation, Scheduling, and Fault Tolerance, June 1997.