Performance Models for Co-ordinating Parallel Data Classification

John Darlington, Moustafa M. Ghanem, Yike Guo, Hing Wing To
Email: {jd, mmg, yg, [email protected]}
Department of Computing, Imperial College, 180 Queens Gate, London, SW7 2BZ.
Abstract

In this paper we investigate the use of performance models for structuring parallel programs through a case study in data mining. Performance models have been shown to be an integral part of providing a more structured approach to the problems of performance portability and resource allocation in parallel programming. This is particularly true in the context of skeletons, where parallel programs are expressed as combinations of predefined, often higher-order, functions. The use of performance models has, to some extent, been limited by the difficulty in applying the approach to irregular and dynamic parallel algorithms. We explore this problem in the context of a well known data mining algorithm, C4.5, which exhibits both irregular and dynamic characteristics. C4.5 is rich in inherent parallelism, making the choice of a suitable parallel implementation for a given architecture non-trivial. We demonstrate how a structured approach to developing the performance models enables a comparison to be made between different implementations without solving the irregular or dynamic components of the models. The predicted implementation choices are compared with execution times obtained for the parallel implementations on an AP1000.
1 Introduction

In this paper we investigate the use of performance models for structuring parallel programs through a case study in data mining. (This paper is an abridged version of Chapter 6 of [8].) Previous studies into performance models have predominantly focussed on regular applications or pipelines with known arrival and service times [1, 3, 6, 4, 7, 11, 12]. Little has been done in the field of irregular dynamic problems, which has limited the scope of applications to which the technique can be applied. Owing to their very nature, it is very difficult to predict the execution time of irregular dynamic algorithms without performing computation similar to that required to execute them. In this paper we investigate techniques which enable certain implementation choices to be made for irregular dynamic algorithms without computing the irregular or dynamic components of the performance models of these algorithms.

The proposed techniques for modelling irregular dynamic problems are investigated by applying them to a data mining algorithm. The algorithm we consider is a well known tree induction algorithm, C4.5 [10], which automatically derives a function for classifying items in a database into one of a number of predefined classes. The algorithm is rich in inherent parallelism and a number of different parallel implementations are possible. The choice of an appropriate implementation to use for a given architecture is not obvious owing to the great variations in the potential implementations.

Section 2 gives an overview of the skeletons programming methodology used in this paper. Section 3 introduces the main algorithm used in C4.5 and describes how it is used in classifying data items. Section 4 describes the different strategies that can be used to parallelise the algorithm and presents in detail two of the alternatives. The performance
models for these two alternatives are also presented. In Section 5 a comparison of the two implementations is given based on the performance models, and the predicted results are compared to the experimental results acquired when the implementations were run on the AP1000 parallel machine.
2 Performance models in parallel programming

Performance models have been proposed as a more structured approach to the problems of performance portability and resource allocation [1, 6]. This is particularly true in the context of skeletons, where parallel programs are expressed as combinations of predefined, often higher-order, functions [13, 5]. Skeletons, or combinations of them, can have several different implementations, each suited to different circumstances such as machine type. To choose between the different implementations and to optimise resource usage, some indication of their potential performance is needed. Performance models enable this by providing a prediction of the execution time for a given program. As skeletons are predefined components it is natural to associate performance models with them. Several authors have used performance models to guide resource allocation and to make implementation decisions [1, 3, 6, 4, 11, 12]. However, much of the existing work has focussed on regular applications.

The level of detail captured by a performance model can vary from the asymptotic to modelling the intricacies of an interconnection network. Darlington, Ghanem and To suggest that no one particular level of detail is appropriate to all situations [7]. Indeed, it is shown that for certain resource allocation decisions a more abstract model is appropriate. This concept is investigated in this paper by demonstrating how an abstract level of modelling enables decisions to be made without computing the predicted execution time of the irregular and dynamic components of a model.
3 A classification algorithm: C4.5

The irregular dynamic application studied is C4.5, a tree induction algorithm for data mining [10]. Given a training set and a set of predefined classes, C4.5 aims to generate a function which maps the data items in a database into the predefined classes. In order to generate this function, C4.5 uses an inductive-learning approach, examining the training set to generate a decision tree. Each data item in the training set must be pre-classified into one of the predefined classes.

  Outlook   Temp(F)  Humid  Windy  Class
  sunny     71-75    false  true   Play
  sunny     76-80    true   true   Don't Play
  sunny     81-85    true   false  Don't Play
  sunny     71-75    true   false  Don't Play
  sunny     66-70    false  false  Play
  overcast  71-75    true   true   Play
  overcast  81-85    true   false  Play
  overcast  61-65    false  true   Play
  overcast  81-85    false  false  Play
  rain      71-75    true   true   Don't Play
  rain      61-65    false  true   Don't Play
  rain      71-75    true   false  Play
  rain      66-70    true   false  Play
  rain      66-70    true   false  Play

  (a) Example training set.

  Outlook:
    sunny    -> test Humid:  false -> Play;  true -> Don't Play
    overcast -> Play
    rain     -> test Windy:  true -> Don't Play;  false -> Play

  (b) Resulting decision tree.

Figure 1: Example training set and the decision tree induced from it.

Figure 1.a shows an example of a training data set. Each item in the training data is a record containing the values of the different attributes describing the data item, together with the class to which it belongs. The output of the program is a decision tree such as the one shown in Figure 1.b. Each internal node represents a decision
node, that is, a test that can be carried out on a single attribute value. The decision tree provides a sequence of tests that can be applied to classify an unseen data item. This is done by starting at the root node of the tree and moving down until a leaf is encountered. The branch taken at each decision node is determined by the outcome of applying the node's test to the given case. In general, after generating the tree, several post-processing steps are usually applied in order to simplify the tree and to translate it into a set of production rules [10]. In this paper we concentrate only on the tree generation phase, which consumes the bulk of the algorithm's processing time.
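To make the classification procedure concrete, the following is a minimal Haskell sketch of one possible tree representation and the walk from the root to a leaf. The datatype and the valueOf helper are assumptions made purely for this illustration; they are not part of C4.5 itself.

  -- A decision tree: either a leaf labelled with a class, or an internal
  -- decision node that tests one attribute and has one subtree per value.
  data DecisionTree attr value cls
    = Leaf cls
    | Node attr [(value, DecisionTree attr value cls)]

  -- Classify an unseen item by walking from the root to a leaf, following
  -- the branch whose value matches the item's value for the tested attribute.
  classify :: Eq value
           => (attr -> item -> value)          -- assumed attribute accessor
           -> DecisionTree attr value cls -> item -> Maybe cls
  classify _       (Leaf c)            _    = Just c
  classify valueOf (Node att branches) item =
    case lookup (valueOf att item) branches of
      Just subtree -> classify valueOf subtree item
      Nothing      -> Nothing                  -- unseen attribute value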
3.1 The core algorithm

To build a decision tree the algorithm first selects the attribute to be used as the test for the root node. This attribute is found by computing and comparing the information gain of each attribute. The information gain is a measure of the attribute's ability to classify the training set [10]. The attribute with the highest information gain is selected. A decision node is then created with one child for each possible value of the attribute, and the training set is partitioned amongst the children: a data item is placed with the child corresponding to the item's value for the selected attribute. The algorithm then proceeds in a depth-first divide-and-conquer manner, generating each of the branches as a separate decision tree, each with its own partition of the training data set. The recursion in a branch stops when all the data items in that branch have the same class. These steps can be described as follows:

1. Using the current data set, compute the information gain of each attribute.

2. Select the attribute which yields the highest information gain and use it as the partitioning criterion for the current node of the tree.

3. Generate all possible branches of the current node according to the distinct values of the chosen attribute and partition the data between the branches according to those values.

4. Perform steps 1-4 recursively on each branch of the tree until all data items at a particular node are of the same class, at which point label that leaf with the class name.
The tree construction algorithm can be expressed succinctly in terms of a divide-and-conquer skeleton, DC. The meaning of DC is given in functional notation by:

  DC trivial solve divide combine x
    | trivial x = solve x
    | otherwise = (combine ∘ map (DC trivial solve divide combine) ∘ divide) x
Using DC we can express the tree construction algorithm as:

  treeConstruction ts = DC trivial solve divide combine ts
    where
      trivial x = singleClass x
      solve   x = labelLeaf x
      divide  x = partition (selectAttribute x) x
      combine t = createNode t
The function singleClass determines if all the data items in a training set belong to the same class. The function labelLeaf labels a leaf with the appropriate class. selectAttribute selects the attribute with the highest information gain in the current training set by computing the information gain of each attribute. The function partition partitions a training set by a given attribute. This is simply achieved by traversing the current training set and testing the value of the attribute under consideration.
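For illustration, the trivial and divide parameters might be rendered in Haskell roughly as follows. The record representation and the use of lookup are assumptions made only for this sketch; the information-gain computation behind selectAttribute is described in Section 3.2.

  import Data.List (nub)

  -- A record is a list of (attribute, value) pairs plus its class label
  -- (an assumed, simplified representation).
  type Record attr value cls = ([(attr, value)], cls)

  -- trivial: do all records in the current training set share one class?
  singleClass :: Eq cls => [Record attr value cls] -> Bool
  singleClass records = length (nub (map snd records)) <= 1

  -- divide: split the training set into one subset per value of the
  -- selected attribute, i.e. one subset per branch of the new decision node.
  partition :: (Eq attr, Eq value)
            => attr -> [Record attr value cls] -> [[Record attr value cls]]
  partition att records =
    [ [ r | r <- records, valueOf r == Just v ] | v <- values ]
    where
      valueOf (fields, _) = lookup att fields
      values              = nub [ v | Just v <- map valueOf records ]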
3.2 Selecting an attribute
To develop the performance models for the parallel implementations it is necessary to understand how selectAttribute works. selectAttribute computes the information gain of each attribute. The information gain of a given attribute is computed by scanning the current training set linearly and accumulating information on the frequency distribution of classes. The attribute with the highest information gain is then selected. The function selectAttribute is captured by the following pseudo-code:
  /* compute the information gain of each attribute */
  for att = 1 to maxAtt {
      gain[att] = 0;
      if notTested(att) {
          /* accumulate the frequency distribution for this attribute */
          for rec = 1 to maxRec {
              fTable[att][rec.att][class(rec)]++;
          }
          gain[att] = calcInfoGain(fTable[att]);
      }
  }
  return att with highest gain

3.3 Parameters affecting the performance

The time taken to generate the decision tree depends heavily on the shape of the tree. A detailed and accurate performance model of both the sequential and parallel implementations of this algorithm would thus depend on the shape of the tree. Unfortunately the shape of the tree cannot be determined by any simple study of the training set. However, certain parameters which can be directly extracted from the training set do contribute to the performance of the algorithm. In this paper we use these parameters to develop our performance models. Note that these parameters are not sufficient to determine the shape of the decision tree, but they do enable abstract performance models to be developed. The parameters which can be directly extracted from the training set are:

  N   Number of training data items.
  A   Number of attributes.
  V   Branching factor - the number of distinct values of an attribute.
  C   Number of classes.

3.4 Sequential performance model

A model of the sequential algorithm is required by the parallel performance models and for the purposes of comparison. The execution times of labelLeaf and createNode are negligible compared with the other functions and thus can be ignored. The cost of treeConstruction is then the sum of the cost of performing each of the intermediate recursive divides and tests for triviality. The cost of computing the ith recursive call (or node) is given by:

  t_{node}(i) = t_{single}(i) + t_{freq}(i) + t_{info}(i) + t_{div}(i)

where t_{single} is the execution cost of performing singleClass and t_{div} is the cost of performing partition. The cost of performing selectAttribute consists of the time to compute the frequency distributions of each attribute and the cost of deriving the information gain from these (cf. Section 3.2). These components can be expressed in terms of the parameters extracted from the training set:

  t_{freq}(i)   = k_1 A_i N_i
  t_{info}(i)   = k_2 C A_i V
  t_{div}(i)    = k_3 N_i
  t_{single}(i) = k_4 N_i

where each k_j is a constant which can be derived by benchmarking the application. The number of data items considered at any given node of the tree, together with the number of attributes tested there, varies from one node to another depending on the depth of the recursion and the actual dataset being used. This is reflected by A_i and N_i, which are the number of attributes and data items at the particular tree node being modelled. A performance model for the overall cost of the sequential algorithm can be derived by summing the costs of all the recursive calls made in building the decision tree:

  T_{seq} = \sum_i t_{node}(i)
          = \sum_i ( t_{freq}(i) + t_{info}(i) + t_{div}(i) + t_{single}(i) )
Acquiring an exact value for T_{seq} is difficult without knowing the exact shape of the decision tree, which cannot be determined until runtime. This can be simplified by acquiring average values under certain assumptions for the shape of the tree and solving recurrence equations for each of the individual components of the model [8]. This is not pursued further in this paper.
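As an illustration of how the sequential model composes, the sketch below evaluates t_{node} and T_{seq} for a hypothetical profile of tree nodes. The constants k1..k4 and the per-node (A_i, N_i) values are placeholders that would come from benchmarking and from the actual tree, which is unknown before runtime.

  -- Machine constants k1..k4, obtained by benchmarking the application.
  data Consts = Consts { k1, k2, k3, k4 :: Double }

  -- t_node(i) for a node with a_i attributes and n_i data items,
  -- given c classes and branching factor v.
  tNode :: Consts -> Double -> Double -> (Double, Double) -> Double
  tNode (Consts c1 c2 c3 c4) c v (a_i, n_i) =
      c1 * a_i * n_i       -- t_freq:   frequency distributions
    + c2 * c * a_i * v     -- t_info:   information gain from the tables
    + c3 * n_i             -- t_div:    partitioning the training set
    + c4 * n_i             -- t_single: testing for a single class

  -- T_seq: the sum of the per-node costs over all nodes of the tree.
  tSeq :: Consts -> Double -> Double -> [(Double, Double)] -> Double
  tSeq consts c v nodes = sum (map (tNode consts c v) nodes)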
4 Parallel C4.5

There are two main approaches to exploiting parallelism in C4.5. In the first approach, inter-node parallelism, a parallel task is created for each of the branches of a node; thus separate recursive calls are computed in parallel. The second approach, intra-node parallelism, computes each individual recursive call in parallel; thus the functions singleClass, selectAttribute and partition are parallelised. The differences between these two approaches are highlighted in Figure 2. A mixture of both approaches can also be used. In this paper we only consider alternative implementations of the second approach. A more detailed study involving the first approach is given in [8], and implementations of the first approach are described in [2].

[Figure 2: Parallelisation approaches for C4.5. (a) Inter-node scheme: exploits the parallelism available between the generation of independent sub-trees. (b) Intra-node scheme: exploits the parallelism available in the work required for individual node expansion.]

4.1 Intra-node parallelism

In the intra-node parallelism approach, the computation involved in a single recursive call is parallelised in an SPMD manner. Much of the parallelism arises from performing selectAttribute, which can be parallelised in two ways: either the computation of the information gain of a single attribute is parallelised across the different data items, or the information gains of all the attributes are computed concurrently. These two options form the basis for our two different implementations, which are described in detail in Sections 4.2 and 4.3. Once a recursive call has been computed, all the processors co-operate in computing the next recursive call, and the process continues until the algorithm terminates. This behaviour is shown in Figure 2.b, where all processors co-operate to expand node A, followed by node B, and so on. Thus the general performance model for the intra-node parallel versions of C4.5 is:

  T_{par} = t_{dist}(D, P) + \sum_i t_{node}(i, D, P)

where t_{node} is the cost of computing one recursive call. Note that the data must be initially distributed before the computation can begin; the cost of this is represented by t_{dist}.
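This general model can be phrased directly as a small Haskell sketch; the distribution cost and the per-node cost function are left abstract here, since they are refined differently by each of the schemes below.

  -- T_par = t_dist(D, P) + sum_i t_node(i, D, P): an initial distribution
  -- cost plus the per-node costs over the nodes of the (runtime) tree.
  tPar :: Double              -- t_dist(D, P)
       -> (node -> Double)    -- t_node(i, D, P)
       -> [node]              -- the nodes actually expanded
       -> Double
  tPar tDist tNodeCost nodes = tDist + sum (map tNodeCost nodes)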
4.2 SPMD by record
In the first intra-node parallel scheme, SPMD-rec, the data items (records) of the training set are distributed evenly across the processors. A single recursive call can then be parallelised by computing the information gain of each attribute in parallel. This is achieved by each processor computing the frequency distribution array of its local records in parallel with the other processors. The global frequency distribution array can then be constructed from the local frequency distribution arrays by performing an elementwise addition of corresponding entries in the local arrays. This can be implemented efficiently
using a standard message-passing global communication function, AllReduce. The general intra-node parallel performance model can thus be refined to:

  T_{SpmdRec} = t_{scatter}(D, P) + \sum_i t_{node}(i, D, P)
The performance model for t_{node} is the sum of the components which constitute a recursive call. The major data structure communicated between the processors when computing selectAttribute is the frequency distribution array of size C V A_i. Communication is also required in the computation of singleClass and partition. The communication involved in computing singleClass can be captured by the function AllReduce. Similarly, the division of the data set into different branches for partition also requires an exchange of information, which can again be captured by AllReduce. The size of the data communicated for these functions is constant for all the nodes in the tree and can be represented by the constant values k_5, k_6 and k_7. Thus the performance models of the four functions involved in the computation of one node can be given by:
  t_{freq}(i)   = k_1 A_i N_i / P + T_{allred}(k_5, P)
  t_{div}(i)    = k_3 N_i / P + T_{allred}(k_6, P)
  t_{info}(i)   = k_2 C V A_i + T_{allred}(C V A_i, P)
  t_{single}(i) = k_4 N_i / P + T_{allred}(k_7, P)
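The T_{allred} terms above account for combining the per-processor frequency tables. Conceptually this AllReduce is an elementwise sum, as the following sketch shows; the table representation is an assumption of the sketch and the message-passing layer itself is omitted.

  -- Local frequency tables indexed by [attribute][attribute value][class];
  -- AllReduce leaves every processor with their elementwise sum.  Only the
  -- reduction operator is shown, not the communication.
  type FreqTable = [[[Int]]]

  addTables :: FreqTable -> FreqTable -> FreqTable
  addTables = zipWith (zipWith (zipWith (+)))

  allReduceFreq :: [FreqTable] -> FreqTable
  allReduceFreq = foldr1 addTables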
4.3 SPMD by attributes

The second scheme computes the information gain of each attribute in parallel with the other attributes. Hence each processor is responsible for calculating the information gain of some subset of the total set of attributes. Once the local information gains have been computed, the global gain array can be constructed by performing an AllGather operation. The processors then proceed by choosing the attribute with the highest information gain and partitioning the current training set. Owing to the structure of the code it is not possible to distribute the data set by attributes, so the entire training set must be broadcast to every node. Thus the program begins by broadcasting the entire training data set to all the processors. The general intra-node parallel performance model can thus be refined to:

  T_{SpmdAttr} = t_{bcast}(D, P) + \sum_i t_{node}(i, D, P)
Note that the number of communication operations in this scheme is smaller than in the SPMD-rec scheme. The only function that requires communication is the gain calculation t_{info}; none of the other functions incurs any communication overhead, since each operates on the entire data-set belonging to a node. The performance models for the components of a recursive call are given by:
  t_{freq}(i)   = k_1 A_i N_i / P
  t_{div}(i)    = k_3 N_i
  t_{info}(i)   = k_2 C V A_i / P + t_{allgath}(A_i / P, P)
  t_{single}(i) = k_4 N_i
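For comparison, the only collective operation in this scheme gathers the locally computed gains so that every processor can pick the best attribute. A sketch of that combination follows; the communication layer is again omitted and the names are illustrative.

  import Data.List (maximumBy)
  import Data.Ord (comparing)

  -- Each processor computes the gains of its own subset of attributes;
  -- AllGather concatenates the per-processor (attribute, gain) lists so
  -- every processor can then select the attribute with the highest gain.
  allGatherGains :: [[(attr, Double)]] -> [(attr, Double)]
  allGatherGains = concat

  bestAttribute :: [[(attr, Double)]] -> attr
  bestAttribute = fst . maximumBy (comparing snd) . allGatherGains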
5 Implementation selection
5.1 Model predicted choice
A direct approach to selecting the appropriate implementation for a given problem and machine pair is to solve the performance models T_{SpmdRec} and T_{SpmdAttr} for the given parameters. However, this would require solving the recurrence summation, which depends on the unknown shape of the resulting decision tree. This problem can be overcome by observing that both performance models have the same recurrence structure, since for a given problem both implementations must generate the same tree. Hence we can determine which implementation is better suited to a given set of parameters by comparing the t_{node} component of the two equations. This only determines the relative performance of the two implementations; using this technique it is not possible to quantify the difference in performance between them. Fortunately, for the purposes of selection a relative measure is sufficient. Comparing the t_{node} components of the implementations we observe the following:
The SPMD-rec implementation parallelises singleClass, partition and the frequency distribution calculation in selectAttribute, while incurring communication overheads for each. It does not, however, parallelise the gain calculation in selectAttribute. SPMD-attr parallelises both components of selectAttribute, incurring communication overheads when gathering the gain information array, but it does not parallelise partition and singleClass. Thus the SPMD-rec scheme performs better for large training sets (i.e. large N) with moderate numbers of attributes, A, and classes, C. The SPMD-attr scheme requires less communication overhead per node and is better suited to training sets with large numbers of attributes, but performs relatively worse for large training data-sets. These observations can be used as guidelines for selecting an efficient implementation for any given data set.
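The selection rule of this section can be sketched as a direct comparison of the two per-node models. The machine constants k1..k7 and the cost functions assumed for AllReduce and AllGather are illustrative placeholders, not measured AP1000 values.

  -- Benchmarked constants plus assumed collective-communication cost
  -- functions (data size -> number of processors -> cost).
  data Machine = Machine
    { k1, k2, k3, k4, k5, k6, k7 :: Double
    , tAllred, tAllgath          :: Double -> Double -> Double
    }

  tNodeRec, tNodeAttr :: Machine -> Double -> Double -> Double
                      -> Double -> Double -> Double
  tNodeRec m c v p a_i n_i =
      k1 m * a_i * n_i / p + tAllred m (k5 m) p          -- t_freq
    + k3 m * n_i / p       + tAllred m (k6 m) p          -- t_div
    + k2 m * c * v * a_i   + tAllred m (c * v * a_i) p   -- t_info
    + k4 m * n_i / p       + tAllred m (k7 m) p          -- t_single

  tNodeAttr m c v p a_i n_i =
      k1 m * a_i * n_i / p                               -- t_freq
    + k3 m * n_i                                         -- t_div
    + k2 m * c * v * a_i / p + tAllgath m (a_i / p) p    -- t_info
    + k4 m * n_i                                         -- t_single

  -- Predicted choice for a representative node: the scheme with the
  -- smaller per-node cost (the recurrence structure cancels out).
  choose :: Machine -> Double -> Double -> Double -> Double -> Double -> String
  choose m c v p a_i n_i
    | tNodeRec m c v p a_i n_i <= tNodeAttr m c v p a_i n_i = "SPMD-rec"
    | otherwise                                             = "SPMD-attr"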
5.2 Experimental results on an AP1000

To test the accuracy of the selections chosen using the performance models, the execution times of both implementations were compared using the Soybean, Mushroom and Connect-4 training sets obtained from the UCI Repository of Machine Learning Databases [9]. Figure 3 shows the results for a varying number of processors from 2 to 16. The training sets have very differing characteristics. As predicted by the models, the SPMD-attr version performs better for Connect-4 and Soybean, which have large numbers of classes and attributes. In contrast, the SPMD-rec scheme offers better performance for the Mushroom training set, where the data is characterised by a large N and a low number of classes and attributes.

[Figure 3: Execution time of different training sets on an AP1000. Each plot shows the execution time (sec) of SPMD-rec and SPMD-attr against the number of processors P. (a) Connect-4: A=42, C=3, N=27558. (b) Mushroom: A=22, C=2, N=24372. (c) Soybean: A=35, C=19, N=17075.]

6 Conclusions and future work
In this paper we have investigated performance models for structuring parallel programs through a case study in data mining. Data mining is a typical example of an irregular and dynamic algorithm rich in inherent parallelism. These characteristics make it difficult to select a suitable implementation for a given problem and machine pair. We have shown how performance models at a relatively high level of abstraction can be used to compare and select between two parallel implementations of C4.5. There are, however, other decisions, including resource allocation decisions such as optimising the number of processors to use, which will require more detailed predictions of the performance of an implementation. Similarly, more detailed models are required to compare implementations that have a different overall structure, for example implementations of the inter-node parallelism scheme. We are currently developing techniques to tackle these problems.
Acknowledgements

We would like to thank Fujitsu for providing the facilities at IFPC which made this work possible. This work has been conducted under a British Council Scholarship to the second author. The fourth author gratefully acknowledges support from the EPSRC funded project GR/K69988.

References

[1] Peter Au, John Darlington, Moustafa Ghanem, Yike Guo, Hing Wing To, and Jin Yang. Co-ordinating heterogeneous parallel computation. In Luc Bouge, Pierre Fraigniaud, Anne Mignotte, and Yves Robert, editors, Euro-Par'96 Parallel Processing, volume I, pages 601-614. Springer-Verlag, August 1996.

[2] Jaturon Chattratichat, John Darlington, Moustafa Ghanem, Yike Guo, Harald Huning, Martin Kohler, Janjao Sutiwaraphun, Hing Wing To, and Dan Yang. Large scale data mining: The challenges and the solutions. In The Third International Conference on Knowledge Discovery and Data Mining (KDD-97), August 1997.

[3] M. Danelutto, R. Di Meglio, S. Orlando, S. Pelagatti, and M. Vanneschi. A methodology for the development and the support of massively parallel programs. In D. B. Skillicorn and D. Talia, editors, Programming Languages for Parallel Processing, pages 205-220. IEEE Computer Society Press, 1994.

[4] J. Darlington, M. Ghanem, and H. W. To. Structured parallel programming. In Programming Models for Massively Parallel Computers, pages 160-169. IEEE Computer Society Press, September 1993.

[5] J. Darlington, Y. Guo, H. W. To, and J. Yang. Functional skeletons for parallel coordination. In Seif Haridi, Khayri Ali, and Peter Magnussin, editors, Euro-Par'95 Parallel Processing, pages 55-69. Springer-Verlag, August 1995.

[6] John Darlington, Moustafa Ghanem, Yike Guo, and Hing Wing To. Guided resource organisation in heterogeneous parallel computing. Submitted to the Journal of High Performance Computing, 1997.

[7] John Darlington, Moustafa Ghanem, and Hing Wing To. Accuracy in decision making with performance models. Technical Report IFPC-TR-97-3, Department of Computing, Imperial College, May 1997.

[8] Moustafa Ghanem. Structured Parallel Programming Using Performance Models and Skeletons. PhD thesis, Department of Computing, Imperial College, in preparation, 1997.

[9] C. J. Merz and P. M. Murphy. UCI repository of machine learning databases. University of California, Department of Information and Computer Science, http://www.ics.uci.edu/~mlearn/MLRepository.html, 1996.

[10] J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, Inc., 1993.

[11] Roopa Rangaswami. HOPP: A higher-order parallel programming model. In M. Moonen, editor, Algorithms and Parallel VLSI Architectures. Elsevier, 1995.

[12] Hing Wing To. Optimising the Parallel Behaviour of Combinations of Program Components. PhD thesis, Department of Computing, Imperial College, September 1995.

[13] Jin Yang. Co-ordination Based Structured Parallel Programming. PhD thesis, Department of Computing, Imperial College, in preparation, 1997.