Proceedings of the International Conference on Computer, Communication and Control Technologies 2003
Parallel computation of kernel density estimates classifiers and their ensembles Elio Lozano and Edgar Acuña University of Puerto Rico at Mayaguez, Department of Mathematics Mayaguez, PR 00680
ABSTRACT Nonparametric supervised classifiers are interesting because they do not require distributional assumptions for the class conditional density, such as normality or equal covariance. However, their use is not widespread because they take a long time to compute due to the intensive use of the available data. On the other hand, bundling classifiers to produce a single one, known as an ensemble, leads in general to an improvement in the accuracy of the classification. Two popular methods for creating ensembles are Bagging, introduced by Breiman (1996), and AdaBoosting, by Freund and Schapire (1996). In this paper we have used parallel programming to compute the cross validated misclassification error for classifiers based on kernel density estimation as well as their ensembles. Key words: Bagging, AdaBoosting, kernel density estimation, ensembles, classification. 1. INTRODUCTION Research in nonparametric classifiers, which do not require distributional assumptions, has increased recently due to advances in computer technology that make their computation faster. Still, the computation time of a nonparametric classifier is longer than that required for a parametric classifier. Classifiers based on kernel density estimation are a typical example of nonparametric classifiers [1], [2], [10], [14]. On the other hand, many researchers are investigating the technique of combining the predictions of multiple classifiers to produce a single classifier (see [1], [3], [6], [9]). The resulting classifier, also known as an ensemble, is generally more accurate than the individual classifiers making it up. Two popular methods for creating ensembles are Bagging (Bootstrap aggregating), introduced by Breiman [6], and AdaBoosting, by Freund and Schapire [9]. These methods rely on resampling techniques to obtain different training sets for each of the classifiers.
Breiman [5] heuristically defines a classifier as unstable if a small change in the learning data L can make large changes in the classification. Unstable classifiers have low bias but high variance, while the opposite holds for stable classifiers. Decision tree and neural network classifiers are unstable, whereas linear discriminant analysis and k-nearest neighbor classifiers are stable. On the other hand, experiments have shown that kernel density estimation (KDE) classifiers do not have a high degree of instability [1]. Bagging and AdaBoosting are very effective for unstable classifiers [3], [6], [9]. AdaBoosting generally performs better than Bagging, but not uniformly so; sometimes both perform even worse than the single classifier. A parallel computation of the kernel density estimator has been proposed recently by Racine [13]. The Bagging algorithm can be implemented in parallel in a straightforward manner, but AdaBoosting is essentially sequential and its parallelization requires fine granularity. As far as we know, parallel Boosting and Bagging computation has been done only for OC1 decision tree classifiers [16], using the BSP library as the parallel programming environment. In this paper we develop algorithms for the parallel computation of the misclassification error rates of kernel density estimate classifiers and their ensembles, using the well known MPI library on a computer with 8 processors. We tested our algorithms on 12 datasets from the Machine Learning Database Repository at the University of California, Irvine. 2.
CLASSIFIERS BASED ON KERNEL DENSITY ESTIMATION
From a Bayesian point of view, supervised classification is equivalent to comparing, with each other, estimates of the probabilities of belonging to each class, assigning an object with measurement vector x to the class with the largest f̂(j|x), j = 1, 2, ..., J. Such estimates can be obtained indirectly via the class conditional density f(x|j) using Bayes' theorem. Kernel density estimators can be used to carry out that task. For a given class j and a random sample X1, X2, ..., Xn of the p-dimensional random vector X with continuous components, the product kernel estimate of the class conditional density at the point x is given by
f̂(x) = [1 / (n h_1 h_2 ... h_p)] Σ_{i=1}^{n} Π_{v=1}^{p} K( (x_v − X_{iv}) / h_v )    (1)
where the kernel K is usually a symmetric unimodal density function, for instance the normal density, and h_v represents the bandwidth of the v-th predictor. There are several approaches to select an adequate bandwidth [14]. In the Statlog Project [10], where 23 classifiers were compared on 22 datasets, classifiers based on kernel density estimators (ALLOC80) performed better than decision tree classifiers. Classifiers based on kernel density estimation may be unstable for two reasons: first, the presence of singularities in the log-likelihood function, and second, the effect of outliers on the selection of the bandwidth. In this paper we have used kernel density estimation with both fixed and adaptive bandwidths. The fixed bandwidth is obtained by assuming that the class conditional density is normally distributed [10]. That is, h_j = s_j (4 / (n(p + 2)))^{1/(p+4)}, where p is the number of predictors, n is the number of instances and s_j is the standard deviation of the j-th predictor. In the adaptive kernel, the bandwidth varies from one point to another, and its value depends on the concentration of data in the neighborhood of the point. The basic idea of this approach is to combine standard kernel density estimation with the nearest neighbor approach [14]. The estimate of the density function using the adaptive kernel is given by
f̂(x) = [1 / (n h_1 h_2 ... h_p)] Σ_{i=1}^{n} (1 / λ_i^p) Π_{v=1}^{p} K( (x_v − X_{iv}) / (λ_i h_v) )    (2)
where λ_i is a bandwidth factor obtained by using a pilot density estimate of f(x).

3.
PARALLEL COMPUTATION OF THE MISCLASSIFICATION ERROR RATE BY CROSS VALIDATION
The preferred method to estimate the misclassification error (ME) of a classifier is the n-fold cross validation technique. Each fold in turn is used as the test sample while the remaining folds form the training sample, and the numbers of errors are then averaged. For each instance in the test sample we need to use the whole training sample to compute the kernel classifier. Hence we divide the test sample according to the number of processors, applying the SPMD style of parallel programming. Each processor computes its respective misclassification rate and, after finishing, sends its result to the master process, which adds up the results and obtains the misclassification error rate of the whole classifier. The steps to compute the cross validation error rate with the parallel algorithm are the following (see figure 1):
1. The master process reads the instances of the dataset and sends them to all the slave processes.
2. Every slave process calculates its own density estimate.
3. Every processor calculates its own cross validation error.
4. All the slave processes send their results to the master process, which adds them up and yields the final result.
The time complexity of the parallel algorithm depends on the number of processors, the number of instances in the training sample, the number of classes and the number of features. Hence the time complexity is O(n_train^2 n_class n_feat / n_process).
Figure 1. Parallel Computation of the Cross validation error rate
4.
PARALLEL COMPUTATION OF THE BAGGED CLASSIFIER
In the Bagging algorithm we use bootstrapping (sampling with replacement from the original sample) to obtain a given number of new training samples. Then a classifier is built for each of the bootstrap samples. The bagged classifier is obtained by voting, and its misclassification error can be estimated through cross validation. The computations required to obtain the classifiers for the bootstrap samples are independent of each other. Therefore we can assign tasks to each processor in a balanced manner; in the end each processor has obtained a part of the bagged classifier. In this case we use the master-slave parallel programming technique. The method starts with the master splitting the work into small tasks and assigning them to the slaves. The master then iterates: whenever a slave returns a result (meaning it finished its work), the master assigns it another task if there are still tasks to be executed. Once all the tasks have been carried out, the master process collects the results and orders the slaves to finish, since there are no more tasks to be carried out. The steps of the algorithm are the following:
1. The master process sends the instances of the dataset to the slaves.
2. The master process sends one task to each slave.
3. Each slave process executes its task and returns the result.
4. The master process receives the result computed by a given slave and assigns it another task if there is a task to be executed; otherwise it sends the slave a message telling it to finish.
5. The master process adds up the results obtained by the slaves and yields the final result.
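Each bagging task in steps 2-3 amounts to drawing one bootstrap sample, training a classifier on it, and later combining the classifiers by majority vote. The two resampling/voting ingredients can be sketched in plain C as follows (function names are ours; the vote is shown for a single test instance, and the KDE training itself is omitted):

```c
#include <stdlib.h>

/* Draw one bootstrap sample of size n: indices into the original
   training sample, sampled with replacement.  Call srand() once
   beforehand to seed the generator. */
void bootstrap_indices(int n, int *out) {
    for (int i = 0; i < n; i++)
        out[i] = rand() % n;
}

/* Majority vote over the predictions for one test instance.
   predictions[k] is the class label (0 .. nclass-1) produced by the
   classifier built on the k-th bootstrap sample. */
int bagged_vote(const int *predictions, int nclassifiers, int nclass) {
    int counts[32] = {0};            /* assumes nclass <= 32 */
    int best = 0;
    for (int k = 0; k < nclassifiers; k++)
        counts[predictions[k]]++;
    for (int j = 1; j < nclass; j++)
        if (counts[j] > counts[best])
            best = j;
    return best;
}
```

Because each call to `bootstrap_indices` and the classifier built from it are independent, the master can hand these tasks to whichever slave is idle, which is what makes the coarse-grained master-slave scheme effective here.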
Figure 2. Parallel Bagging Algorithm
The time complexity of the parallel Bagging algorithm depends on the number of bootstrap samples, the number of processors, the number of instances in the training sample, the number of classes and the number of features. Hence the time complexity is O(n_train^2 n_class n_feat n_boot / n_process).
5. PARALLEL COMPUTATION OF THE BOOSTED CLASSIFIER
The main difference between Boosting and Bagging is that in the former the bootstrap samples are drawn using weights for each instance. These weights do not change for a correctly classified instance; otherwise their values depend on the misclassification error of the classifier built using the previous bootstrap sample. In the end the boosted classifier is obtained by weighted voting. We use the domain decomposition parallel programming technique. In this method the dataset is partitioned into equal parts according to the number of processes. Each process computes one part of the classifier and the corresponding misclassification error rate. Once the partial errors are obtained, all the processes send their results to the master process, which adds them up and then sends the total error back to the processes to update the weights of the misclassified instances. The algorithm is as follows:
1. The master process sends the instances of the dataset with their corresponding weights to the slave processes.
2. In each iteration:
   i) The master process obtains the bootstrapped sample of the learning dataset and sends it to the slave processes.
   ii) All the slave processes compute the corresponding error rates and send their results to the master process.
   iii) The master process adds up the partial errors received from the slave processes and sends back the total error to update the weights.
3. Each process sends its result to the master process, which obtains the final result.
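The weight update of step 2.iii can be sketched as follows, in the style of the AdaBoost.M1 scheme of Freund and Schapire [9] (a serial sketch with our own names; the MPI communication is omitted, and `err` is the total weighted error collected by the master):

```c
/* AdaBoost.M1-style weight update: down-weight the correctly classified
   instances by beta = err / (1 - err), then renormalize so the weights
   sum to one.  correct[i] is 1 if instance i was classified correctly
   this round, 0 otherwise.  Assumes 0 < err < 0.5. */
double adaboost_update(double *w, const int *correct, int n, double err) {
    double beta = err / (1.0 - err);
    double total = 0.0;
    for (int i = 0; i < n; i++) {
        if (correct[i])
            w[i] *= beta;        /* correctly classified: shrink weight  */
        total += w[i];           /* misclassified: weight kept, so its   */
    }                            /* relative share grows after rescaling */
    for (int i = 0; i < n; i++)
        w[i] /= total;           /* renormalize to a distribution */
    return beta;                 /* log(1/beta) serves as the round's
                                    voting weight in the final ensemble */
}
```

Only this short update needs the globally summed error, which is why the rest of each round parallelizes over the data partitions while the update itself forces the fine-grained synchronization noted above.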
Figure 3. Parallel Boosting Algorithm

The time complexity of the parallel Boosting algorithm depends on the number of bootstrap samples, the number of processors, the number of instances in the training sample, the number of classes and the number of features.
Table 1. Datasets used in this paper

Dataset       Cases  Classes    C    B    N    O
Iris            150     3       4    -    -    -
Sonar           208     2      60    -    -    -
Heartc          303     2       5    3    3    2
Bupa            345     2       6    -    -    -
Ionosphere      351     2      34    -    -    -
Crx             690     2       6    4    5    -
Breastw         699     2       9    -    -    -
Diabetes        768     2       8    -    -    -
Vehicle         846     4      18    -    -    -
German         1000     2       7    2   10    1
Segment        2310     7      19    -    -    -
Landsat        4435     6      36    -    -    -
6. RESULTS Our study was carried out using 12 datasets from the Machine Learning Database Repository at the University of California, Irvine (UCI). In Table 1 we show the 12 datasets used in this paper, including the number of instances, the number of classes and the number of each type of feature.
For each dataset we have performed the following procedure: the dataset is randomly divided into 10 parts; the first one is taken as the test sample and the remaining ones are considered as the learning sample. Next, 50 bootstrapped samples are taken from the learning sample and a KDE classifier is constructed with each of them. Finally, each instance of the test sample is assigned to a class by voting using the 50 classifiers previously constructed. The proportion of instances incorrectly assigned is the bagged misclassification error. We repeat the steps considering now the second part as the test set, and continue in this way until the tenth part has been considered as the test set. The procedure is repeated 10 times. The misclassification error of a single classifier is estimated by 10-fold cross validation and averaged over 50 runs. The bagged misclassification error is averaged over 50 repetitions. We also computed the ratio of the misclassification errors of the bagged classifier versus the single one. A similar process is followed for Boosting, but using a weighted resampling scheme and weighted voting. The running times of the cross validation error computation for the single, the bagged and the boosted classifiers are shown in Tables 2, 3, and 4 respectively. For the experiments we used an SGI Origin 2000 computer located at the high performance computing facility of the University of Puerto Rico at Mayaguez. It is an 8-processor cache-coherent nonuniform memory access (NUMA) machine. Each RISC R10000 processor has 1 GB of main memory, runs at 195 MHz and is capable of 390 Mflops (2 ops per cycle) [8]. Furthermore, each processor has 64 floating point registers and 64 integer registers. The machine has separate level-one instruction and data caches, but only one level-two cache. The level-one caches are 32 KB each with two-way set associativity. The instruction cache has a line size of 64 bytes, while the data cache line size is smaller at 32 bytes.
The 4 MB level-two cache is two-way associative with a 128-byte line size. In this work we have used the MPI (Message Passing Interface) library to communicate among the processors. MPI was developed in 1994 and its functions can be called from programs written in C, C++ or Fortran [12]. In particular we have used MPICH, the version of MPI available from the Argonne National Laboratory. The programs were written in the C language and are available at math.uprm.edu/~edgar/software.html.
Table 2. Running time for parallel Cross Validation

                            Number of Processors
Dataset           1        2        3        4        5        6        7        8
Crx           2.903    1.504    1.030    0.814    0.659    0.564    0.489    0.392
Iris          0.053    0.032    0.023    0.020    0.020    0.018    0.017    0.016
Diabetes      2.364    1.229    0.847    0.668    0.550    0.463    0.405    0.387
Ionosphere    1.996    1.052    0.728    0.578    0.468    0.416    0.358    0.343
Breastw       1.948    1.013    0.707    0.548    0.443    0.392    0.352    0.338
German        8.739    4.474    3.075    2.319    1.932    1.662    1.417    1.189
Heart         0.539    0.297    0.198    0.164    0.130    0.120    0.101    0.093
Bupa          0.374    0.199    0.144    0.114    0.095    0.083    0.073    0.069
Sonar         1.357    0.730    0.505    0.429    0.373    0.337    0.274    0.262
Vehicle       6.327    3.249    2.222    1.705    1.379    1.193    1.053    0.979
Segment      42.227   21.443   14.415   11.018    9.035    7.577    6.506    5.961
Landsat     343.180  172.801  115.875   87.424   70.422   58.947   51.163   46.114
7. CONCLUDING REMARKS

Our experiments have led us to the following conclusions:
a) The parallel algorithms presented in this paper achieved almost linear speedup. Due to the lack of homogeneity in most of the datasets, we use symmetric load balancing to avoid the latency problem.
b) We found that the SPMD style parallel algorithm to compute the cross validation error is well suited to the Origin 2000 configuration.
c) We found that the parallel Bagging algorithm benefits from coarse granularity, since it requires little communication.
d) Boosting was parallelized successfully in spite of its sequential appearance. Parallelization of Boosting requires fine granularity.
Table 3. Running Time for Parallel Bagging

                              Number of Processors
Dataset             1         2        3        4        5        6        7        8
Crx           141.186   141.197    70.66   48.129   36.751   28.327   22.543   18.678
German        441.356   440.534  224.784  149.409   115.88   89.989   72.456   61.431
Heart          25.851    25.879   12.957    8.813    6.748    5.216    4.398    3.645
Iris             2.37      2.39     1.2      0.82     0.63     0.51     0.45     0.39
Diabetes       114.95    115.55    57.54    39.16    29.97    23.11    20.76    18.46
Ionosphere      95.65     95.54    47.85    32.55    24.89    19.53    17.26    15.34
Breastw         95.05     95.16    47.7     32.51    24.68    19.03    17.12    15.21
Bupa            17.79     17.81     8.92     6.08     4.66     3.76     3.24     2.88
Sonar           63.82     63.84    31.96    21.77    16.66    12.88    11.56    10.26
Vehicle        311.19    310.7    155.44   105.75    80.9     62.38    56.07    49.79
Segment       2075.83   2075.96  1037.99   706.71   540.07   415.89   373.95   332.45
Landsat      17019.03  17020.85  8511.34   5792.4  4443.51  3409.03  3064.14  2728.21
Table 4. Running Time for Parallel Boosting

                              Number of Processors
Dataset             1         2         3         4         5         6        7        8
Crx            367.35    166.45    113.63     87.30     63.77     56.30    48.67    38.15
German         439.89    237.51    156.24    118.55     93.80     78.44    69.82    60.57
Heart          105.06     52.22     35.54     26.89     21.74     18.36    16.67    15.23
Iris             9.43      5.87      3.64      2.96      2.45      1.94     1.85     1.78
Diabetes       457.10    233.88    158.07    119.69     97.21     81.69    72.65    64.42
Ionosphere     383.33    193.58    129.67     98.33     78.30     65.99    60.83    56.89
Breastw        419.31    192.48    136.01     96.79     90.64     66.41    58.90    49.56
Bupa            70.32     36.21     24.69     18.82     16.23     13.07    11.69    10.23
Sonar          225.64    127.55     85.67     65.03     52.97     44.80    38.25    36.97
Vehicle        977.24    594.20    302.32    168.97    127.80    107.07    92.36    82.25
Segment       8267.93   4193.64   2765.05   2117.20   1667.18   1427.94  1225.91  1078.21
Landsat      59521.78  34270.76  22282.03  16737.64  13739.07  11466.32  9543.75  8432.56
e) Bagging classifiers give better results when applied to datasets where the single classifier performs poorly. In most of the datasets of this paper the kernel density classifiers performed quite well.
Acknowledgment. This work was supported by the Office of Naval Research grants N00014-00-1-0360, N00014-00-10360 and by the PRECISE group of the Engineering School at UPR-Mayaguez, funded by NSF grant EIA 99-77071.

REFERENCES
1. Acuña, E. (2002). Combining classifiers based on kernel density estimators. Submitted to the Journal of Statistical Computation and Simulation. Available at www.uprm.edu/~edgar/comkde.pdf.
2. Acuña, E. and Rojas, A. (2001). Bagging classifiers based on kernel density estimators. Proceedings of the International Conference on New Trends in Computational Statistics with Biomedical Applications. Osaka University, Japan. 343-350.
3. Bauer, E. and Kohavi, R. (1999). An empirical comparison of voting classification algorithms: Bagging, Boosting and variants. Machine Learning, 36, 105-139.
4. Blake, C. and Merz, C. (1998). UCI repository of machine learning databases. Department of Computer Science and Information, University of California, Irvine, USA.
5. Breiman, L. (1996a). Heuristics of instability and stabilization in model selection. The Annals of Statistics, 24, 2350-2383.
6. Breiman, L. (1996b). Bagging predictors. Machine Learning, 26, 123-140.
7. Efron, B. and Tibshirani, R. (1993). An Introduction to the Bootstrap. Chapman and Hall, London.
8. Fox, G.C., Williams, R.D. and Messina, P.C. (1994). Parallel Computing Works. Morgan Kaufmann Publishers Inc., San Francisco, CA.
9. Freund, Y. and Schapire, R. (1996). Experiments with a new boosting algorithm. In Machine Learning: Proceedings of the Thirteenth International Conference, 148-156. Morgan Kaufmann, San Francisco.
10. Michie, D., Spiegelhalter, D.J. and Taylor, C.C. (1994). Machine Learning, Neural and Statistical Classification. Ellis Horwood, London.
11. Minami, H., Komiya, Y. and Mizuta, M. (2001). Evaluation of execution time on data analysis with parallel virtual machine. Proceedings of the International Conference on New Trends in Computational Statistics with Biomedical Applications. Osaka University, Japan. 321-326.
12. Pacheco, P. (1997). Parallel Programming with MPI. Morgan Kaufmann Publishers Inc., San Francisco, CA.
13. Racine, J.S. (2002). Parallel distributed kernel density estimation. Computational Statistics and Data Analysis, 40, 293-302.
14. Silverman, B.W. (1986). Density Estimation for Statistics and Data Analysis. Chapman and Hall, London.
15. Wilkinson, B. and Allen, M. (1999). Parallel Programming: Techniques and Applications Using Networked Workstations and Parallel Computers. Prentice Hall, Inc.
16. Yu, C. and Skillicorn, D.B. (2001). Parallelizing Boosting and Bagging. Technical Report, Department of Computing and Information Science, Queen's University, Kingston, Canada.