J Supercomput (2008) 43: 21–41 DOI 10.1007/s11227-007-0136-2

Exploitation of a parallel clustering algorithm on commodity hardware with P2P-MPI Stéphane Genaud · Pierre Gançarski · Guillaume Latu · Alexandre Blansché · Choopan Rattanapoka · Damien Vouriot

Published online: 5 May 2007 © Springer Science+Business Media, LLC 2007

Abstract The goal of clustering is to identify subsets called clusters which usually correspond to objects that are more similar to each other than they are to objects from other clusters. We have proposed the MACLAW method, a cooperative coevolution algorithm for data clustering, which has shown good results (Blansché and Gançarski, Pattern Recognit. Lett. 27(11), 1299–1306, 2006). However, the complexity of the algorithm increases rapidly with the number of clusters to find. We propose in this article a parallelization of MACLAW based on a message-passing paradigm, as well as an analysis of the application performance supported by experimental results. We show that we reach near-optimal speedups when searching for 16 clusters, a typical problem instance for which the sequential execution duration is an obstacle to the MACLAW method. Further, our approach is original because we use the P2P-MPI grid middleware (Genaud and Rattanapoka, Lecture Notes in Comput. Sci., vol. 3666, pp. 276–284, 2005), which provides both the message-passing library and infrastructure services to discover computing resources. We also put forward that the application can be tightly coupled with the middleware to make the parallel execution nearly transparent for the user.

Keywords Clustering · Evolutionary algorithms · Grid · Parallel algorithms · Java

S. Genaud · G. Latu · C. Rattanapoka LSIIT-ICPS, Louis Pasteur University, Strasbourg – UMR 7005 CNRS-ULP, Blvd. S. Brant, BP 10413, 67412 Illkirch, France S. Genaud e-mail: [email protected] P. Gançarski () · A. Blansché · D. Vouriot LSIIT-AFD, Louis Pasteur University, Strasbourg – UMR 7005 CNRS-ULP, Blvd. S. Brant, BP 10413, 67412 Illkirch, France e-mail: [email protected]


1 Introduction

Classification algorithms are generally divided into two types, supervised and unsupervised. A supervised algorithm requires labeled training data, i.e. it requires more a priori knowledge about the training set: the aim of supervised learning is to discover a function or model (also called a classifier) from these training data which classifies new, previously unseen examples correctly. Unsupervised classification, also called clustering, is an optimization problem aiming at partitioning a data set into subsets (clusters), so that the data in each subset share some similarity, where similarity is often defined as proximity according to some defined distance measure. A clustering algorithm does not need any a priori information about the data, except (in most approaches) the expected number of clusters. In this paper we focus only on data clustering. Algorithms designed to solve such problems often have a high complexity, and hence parallel processing is a natural idea to speed up computations. Thus, several ad hoc parallelizations of clustering algorithms have been proposed [6, 8, 21] based on the message-passing programming paradigm (all implemented with MPI [23]). In these works, experimental runs often achieve quasi-linear speed-ups on specialized hardware (e.g. a parallel supercomputer, a homogeneous PC cluster, etc.). Such speed-ups are appealing as most data-clustering applications rarely run for less than several hours. However, in practical usage, the user must overcome the burden of using a dedicated computing facility and the subsequent constraints to benefit from accelerated runs. Constraints may include remote login to the computing facility site, mandatory reservation for exclusive access (e.g. via batch schedulers such as LSF or PBS), or a limited number of processors on small or overloaded multi-processor computers. Another alternative may be to install an MPI implementation (e.g. mpich, LAM/MPI) on some individual PCs connected to the LAN. In that case, keeping the fixed configuration of available machines up to date is so time-consuming that it is an unacceptable maintenance task. In [2], a new clustering method called MACLAW (a Modular Approach for Clustering with Local Attributes Weighting) has been introduced. This method is based on the evolutionary approach. However, it is well known that evolutionary algorithms require much computing time, and thus their parallelization is even more crucial. In this paper we present the improvements we bring to Mustic, the Java-based implementation of the MACLAW clustering method, in order to decrease its execution times while preserving its ease of use. The contribution of this paper is twofold. First, we propose a parallelization of MACLAW. We explain the method in Sect. 2.2, analyze its complexity in Sect. 2.3 and propose a parallelization based on the message-passing paradigm in Sects. 3.1 and 3.2. The second contribution lies in the implementation with P2P-MPI [12], an original approach which makes it possible to exploit MACLAW on computing grids. P2P-MPI is introduced in Sect. 3.3 and we discuss its advantages over some other projects, in particular its dynamic resource discovery capabilities. We also put forward the ease of use resulting from the integration of parallel run invocations in the Mustic GUI. Finally, some experimental results are presented in Sect. 3.4 to illustrate the performance of our parallel version of MACLAW in both homogeneous and heterogeneous environments.


2 MACLAW

2.1 Context

The goal of clustering is to identify subsets called clusters which usually correspond to objects that are more similar to each other than they are to objects from other clusters. Clustering is an unsupervised classification process in the sense that it operates with only the intrinsic properties of the objects (and without a predefined notion of the clusters). Figure 1 shows an example of an image clustering: on the left is the image to be classified and on the right are the four images which each correspond to a cluster. Note that the clusters have no semantics. Pixel colors have been manually chosen by an expert thanks to his knowledge of this area. In these images, buildings and roads are shown in grey (top-left), vegetation in green (top-right), water in blue (bottom-right) and soil pixels in brown (bottom-left). A complete panorama of unsupervised classification methods is given in [1, 26]. In this paper, we mainly focus on hard clustering, in which each object belongs to one and only one cluster, hence the result is a set of clusters that forms a partition of the data set, and on soft clustering, in which each object belongs to one or several clusters, with a membership degree equal to one each time. Today, data are becoming more and more complex since data items (or objects) are described by an increasing number of features (e.g. the number of bands in hyperspectral images) which may be correlated, or appear noisy or irrelevant. A possible approach to deal with this complexity is to associate a weight with each feature (in a global weight vector) in order to give relative importance to the features: ideally, noisy attributes are assigned low weights whereas relevant attributes must have a higher weight. In [15, 16, 24], the authors assume that although all features are relevant, their relative importance depends on the classes to extract.

Fig. 1 Remotely sensed image from the city of Strasbourg (France): SPOT 4 data with three channels (XS1, XS2, XS3) at standard resolution (200 × 250 pixels, 20 meters/pixel) (left) and the 4 clusters extracted (right)


Thus, instead of assigning a global weight vector to the entire data set, recent methods such as the COSA (Clustering On Subsets of Attributes) [9] and LAC (Locally Adaptive Clustering) [7] algorithms assign a local weight vector to each cluster (i.e., a weight is assigned to each attribute in each cluster), which is applied to the objects during classification. A family of unsupervised methods based on the K-means algorithm [22] has also been developed [5, 10, 17]: the aim of these algorithms is to simultaneously optimize the (local) weights and the classes built using these (local) weights. Like the K-means algorithm, these methods try to minimize a cost function based on the distance between the objects and the cluster centers through hill-climbing search. Thus, the hybrid method presented in [5] wraps a feature optimization step into the weighted K-means algorithm, which is an extension of the well-known K-means clustering paradigm [22]. By an iterative process, it attempts to optimize the cost function:

$$\mathrm{Cost}_{WKM}(D, K, W, C) = \sum_{k=1}^{K} \sum_{x_j \in S_k} \sum_{t=1}^{d} w_k^t \cdot \mathrm{dist}_t(x_j, c_k)$$

where $W = ((w_1^1, \ldots, w_1^d), \ldots, (w_k^1, \ldots, w_k^t, \ldots, w_k^d), \ldots, (w_K^1, \ldots, w_K^d))$ is the local weight vector, $C$ is a partition of the dataset $D$ and $\mathrm{dist}_t(x_j, c_k)$ the distance between object $x_j$ and center $c_k$ of the k-th cluster on the t-th attribute. This process is based on a hill-climbing approach where each step consists of three partial optimizations. The first two are the same as in the weighted K-means algorithm (each object is assigned to the nearest center according to the distance measure, then each center is recalculated as the gravity center of all objects associated with it), while the last optimization consists in recomputing the local feature weights according to the new centers.
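To make the cost function above concrete, the following Java sketch evaluates Cost_WKM for a given assignment of objects to clusters. It is only an illustration of the formula, not the Mustic/MACLAW source code; every identifier is hypothetical and the per-attribute distance is assumed here to be the squared difference.

```java
/**
 * Minimal sketch of the weighted K-means cost Cost_WKM(D, K, W, C).
 * Hypothetical layout: data[j][t] is attribute t of object x_j,
 * centers[k][t] is attribute t of center c_k, weights[k][t] is w_k^t,
 * and assignment[j] gives the cluster S_k that currently owns x_j.
 */
public final class WeightedKMeansCost {

    /** Per-attribute distance dist_t(x_j, c_k); squared difference assumed here. */
    static double distT(double[] x, double[] c, int t) {
        double diff = x[t] - c[t];
        return diff * diff;
    }

    /** Cost_WKM = sum over clusters, objects and attributes of w_k^t * dist_t(x_j, c_k). */
    static double cost(double[][] data, int[] assignment,
                       double[][] centers, double[][] weights) {
        double total = 0.0;
        for (int j = 0; j < data.length; j++) {
            int k = assignment[j];                       // cluster S_k owning object x_j
            for (int t = 0; t < data[j].length; t++) {
                total += weights[k][t] * distT(data[j], centers[k], t);
            }
        }
        return total;
    }
}
```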

2.2 The MACLAW method

In [2] we have proposed and validated an improved clustering method called MACLAW which integrates a local feature weighting algorithm. Whereas [5] uses a hill-climbing approach, our clustering method is based on a cooperative coevolution algorithm with K populations of individuals. In MACLAW, we search for a global solution which consists of K clusters together with their associated weight vectors. For any of these clusters to come out, at the g-th generation each individual $i$ of a population $k$ extracts from the dataset $D$ one and only one cluster, called $X_k^{i,g}$. The extraction consists in a weighted K-means execution, parameterized by the individual's feature weights coded by its chromosome. The best extracted cluster (i.e. the cluster which yields the best clustering quality) in population k is then chosen as the k-th cluster for generation g. As clusters are independently computed, an object may temporarily belong to zero or more clusters in the global solution. Hence, such a global solution is called a weighted partial soft clustering (WPSC) and is represented by a vector of K elements, the k-th element being a cluster called $C_k$ together with its associated weight vector $W_k$. Finally, the goal of MACLAW is to produce the best possible WPSC. Ideally, the clustering is a partition, but in many cases there remain a few objects which are
classified into several clusters or which are unclassified. The MACLAW implementation integrates a final postprocessing stage to transform the WPSC obtained into a partition, by assigning each misclassified object to the cluster with the nearest center. The above-mentioned goal translates into an optimization problem which consists in finding some appropriate sets of weights, and is solved through the evolution process. A WPSC is evaluated according to a new quality criterion Q which takes into account two criteria. The first one, called $Q_p(\mathrm{WPSC})$, judges the structure of the WPSC: if the clustering is a partition of the data set the quality will be high, whereas overlapping and unclassified objects will yield poor quality. Its evaluation consists in counting the number $n_{o_j}$ of clusters, out of the K clusters $C_k$ in the global solution, to which each object $o_j$ belongs:

$$Q_p(\mathrm{WPSC}) = \max\Big(0, \frac{1}{\mathrm{Card}(D)} \sum_{o_j \in D} \big(1 - |n_{o_j} - 1|\big)\Big). \quad (1)$$

One can notice that $Q_p(\mathrm{WPSC}) = 1$ if and only if the WPSC is a partition, otherwise $Q_p(\mathrm{WPSC}) < 1$. The second one, called $Q_c(\mathrm{WPSC})$, judges the quality of the clusters. It is often based on the cluster inertia as follows:

$$Q_c(\mathrm{WPSC}) = \sum_{(C_k, W_k) \in \mathrm{WPSC}} r(C_k, W_k), \quad (2)$$

where the inertia r of a cluster is defined as the sum of all the distances between objects belonging to the cluster and the center of the cluster, and where the distance measure takes into account the local feature weights associated with the clusters. Finally, the quality of a WPSC is defined by

$$Q(\mathrm{WPSC}) = Q_p(\mathrm{WPSC}) \times Q_c(\mathrm{WPSC}). \quad (3)$$
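As an illustration of criteria (1)–(3), the sketch below computes $Q_p$, $Q_c$ and $Q$ from two pre-computed arrays: the per-object membership counts $n_{o_j}$ and the per-cluster weighted inertias $r(C_k, W_k)$. The names and data layout are hypothetical, not those of the MACLAW implementation.

```java
/** Sketch of the WPSC quality criteria (1)-(3); illustrative names only. */
public final class WpscQuality {

    /** Q_p (Eq. 1): equals 1 iff every object belongs to exactly one cluster. */
    static double structureQuality(int[] membershipCounts) {
        double sum = 0.0;
        for (int n : membershipCounts) {
            sum += 1.0 - Math.abs(n - 1);   // penalize overlaps (n > 1) and unclassified objects (n = 0)
        }
        return Math.max(0.0, sum / membershipCounts.length);
    }

    /** Q_c (Eq. 2): sum of the weighted inertias r(C_k, W_k) over the K clusters. */
    static double clusterQuality(double[] inertias) {
        double sum = 0.0;
        for (double r : inertias) {
            sum += r;
        }
        return sum;
    }

    /** Q (Eq. 3): the criterion used to compare candidate WPSCs. */
    static double quality(int[] membershipCounts, double[] inertias) {
        return structureQuality(membershipCounts) * clusterQuality(inertias);
    }
}
```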

In our context, the problem's objective is thus to find which individuals (and hence which sets of weights) lead to the clustering with maximum quality. A straightforward way to find an optimal solution from the local solutions to the above problem could be exhaustive evaluation: finding the optimal WPSC for K populations with I individuals each would then involve evaluating $I^K$ combinations. But as typical values for K are between 5 and 20, and I typically ranges between 20 and 50, an exhaustive evaluation is intractable in most real cases. To circumvent this obstacle, the MACLAW method proposes to evaluate an individual of the g-th generation only according to the improvements it produces on the following generations. Let $B^g = (B_1^g, \ldots, B_k^g, \ldots, B_K^g)$ be the best solution at any generation g (also called the current best WPSC), where $B_k^g = (C_k^g, W_k^g)$, with $C_k^g$ the k-th cluster and $W_k^g$ its associated local feature weight vector. An overview of the method highlights three phases that are iterated at each generation g:

– Phase 1: Each individual extracts a cluster and is evaluated according to the improvement it brings to the best current WPSC $B^g$. The best of each population is selected.


– Phase 2: All combinations made of best individuals from the previous generation and from the current generation are evaluated, and the combination yielding the highest quality is selected as the new current best WPSC $B^{g+1}$.
– Phase 3: A classical reproduction process is performed to diversify individuals.

The method is detailed in Algorithm 1 hereunder.¹

Algorithm 1 The MACLAW method

¹ We use argmax to denote the value which maximizes a function f:

$$\operatorname*{argmax}_{x \in S} f(x) = \{x \in S \mid \forall y \in S, y \neq x: f(y) < f(x)\}.$$


The aim of Phase 1 (lines 6 to 18) is to determine the best cluster $X_k^{i,g}$ in each population, in the context of the current best WPSC. Some preliminary steps are performed to extract the cluster $X_k^{i,g}$:

– A clustering is performed on the data set D, by running the weighted K-means algorithm using the feature weights W coded by the chromosome of the individual, to obtain a set of K clusters $\{S_1, \ldots, S_K\}$.
– The quality of each of these clusters is then evaluated (line 13): the intrinsic cluster quality is assessed through r, the inertia criterion, and the cluster's adequacy with the current best WPSC is evaluated with $Q_p$ (1).

In Phase 2, we iterate over all of the $2^K$ possible combinations (line 22) built by replacing one or several clusters in $B^g$ by their corresponding best local solutions from $Y^g$. Each of them is evaluated (line 23). The combination with the highest quality is chosen as the new best WPSC $B^{g+1}$ (line 25) and will be used for the next generation. In Phase 3, the genetic reproduction makes each individual evolve within its own population: a roulette-wheel method (fitness-proportional selection) is used to select individuals according to their evaluations, crossovers are carried out by combining seeds and their associated weights, and mutation operators are used to disturb weights or seeds. These three phases are iterated as long as the average evaluation of the individuals varies from one generation to the next by more than a threshold given by the user, and the number of iterations is less than a maximum number of iterations specified by the user.

2.3 Complexity analysis

The time complexity of the sequential MACLAW algorithm depends on the following parameters: N and d are respectively the number of objects to be classified and the number of features, I is the number of individuals in each population and K is the number of clusters. Note that generally $N \gg d$ and $N \gg K$. The global complexity of the method can be calculated by evaluating the complexities of the successive phases. To evaluate the computation cost, we first evaluate the time complexity of building a WPSC. A distance calculation on one attribute between an object and a cluster center is considered as an elementary operation ($t_{dist}$).

– Phase 1: production and evaluation of the individuals' solutions
• (line 11) It is known that the complexity of the weighted K-means algorithm is in $O(dKN)$.
• (line 13) The inertia evaluation of a cluster $C_j$ requires $dN_j$ distance calculations, where $N_j = \mathrm{Card}(C_j)$. The evaluation of $Q_p$ consists in counting the number of clusters to which each object belongs (1), that is KN comparisons and increments. Then, the evaluation of all clusters requires $\sum_{k=1}^{K}(dN_k + KN)$ elementary arithmetic operations ($t_{arth}$), i.e. $N(d + K^2)$, because a result produced by a K-means-based algorithm is always a partition.


• (line 15) The evaluation of the argmax function requires $O(K)$ operations, which is negligible compared to the previous computations of Phase 1.

From several benchmarks, we evaluated the time of an elementary distance calculation in the K-means algorithm to be 8 times longer than an elementary operation of the $Q_p$ evaluation ($t_{dist} \approx 8 \times t_{arth}$). This factor cannot be neglected given the values of the parameters K and d. Given that $IK$ individuals compose all the populations, the time complexity to extract and evaluate all the clusters is

$$T_1 \approx \alpha_1 I N (K^3 + 8dK^2 + dK), \quad (4)$$

with $\alpha_1$ an unknown constant that depends on the deployment platform.

– Phase 2: best WPSC update
As explained in the previous section, we can form $2^K$ candidate combinations for the new best WPSC (line 22). For each combination, the most expensive computation is the evaluation of $Q_p$, which means $O(KN)$ operations. Then, the time complexity of Phase 2 (depending on an unknown constant $\alpha_2$) is

$$T_2 \approx \alpha_2 2^K K N. \quad (5)$$

– Phase 3: reproduction process
The time needed to compute a new population is linear in d and I. Then the time required by the third phase, which is negligible compared to the previous phases (depending on an unknown constant $\alpha_3$), is

$$T_3 \approx \alpha_3 I K d. \quad (6)$$

In summary, because $T_3$ is usually small, the overall time complexity T can be estimated as:

$$T = T_1 + T_2 + T_3 \approx \Big(\underbrace{\alpha_1 I (K^3 + 8dK^2 + dK)}_{\text{Phase 1}} + \underbrace{\alpha_2 2^K K}_{\text{Phase 2}}\Big) N. \quad (7)$$
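To give an order of magnitude for the two bracketed terms in (7), one can take the parameter values used in the experiments of Sect. 3.4 (I = 20, d = 3); the constants $\alpha_1$ and $\alpha_2$ are platform-dependent and unknown, so only the growth of the terms they multiply can be compared:

$$I(K^3 + 8dK^2 + dK) = 9200,\ 41440,\ 205760 \quad\text{and}\quad 2^K K = 64,\ 2048,\ 1048576 \quad\text{for } K = 4, 8, 16.$$

The polynomial Phase 1 term thus dominates by orders of magnitude for K = 4 and K = 8, while for K = 16 the exponential Phase 2 term has caught up, which is consistent with the behavior observed in the experiments.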

One can notice that Phase 1 depends only on local information such as the best current WPSC, whereas Phase 2 needs all the best individuals $Y^g = (Y_1^g, \ldots, Y_K^g)$ that are outputs of Phase 1. Consequently, a parallel version of the algorithm will eventually imply the communication of the data $Y^g$ at the end of Phase 1. The Phase 1 complexity mainly depends on the total number of individuals (i.e. IK), while in Phase 2 the complexity depends only on the number of clusters K, since only one individual is selected within each population. This observation is useful to assess the role of the parameters K and I in the duration of Phases 1 and 2, and to understand the results of the experiments presented in Sect. 3.4. For small values of K, Phase 2 has an inexpensive computation cost. If the communication overhead associated with Phase 2 is not prohibitive, this phase will require a small amount of time compared to Phase 1. Nevertheless, for K chosen large enough, Phase 2 will be far more costly than Phase 1 because of the multiplicative term $2^K$ in $T_2$. Thus, for a given value of I, the ratio of computation cost between Phases 1 and 2 mainly depends on K. This fact will be used in the following to explain the performance results of the parallel version of MACLAW.
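The $2^K$ combination search that dominates $T_2$ can be written as a simple loop over bitmasks; the sketch below is illustrative only (the quality function and the boolean encoding of a combination are placeholders, not MACLAW data structures). In the parallel version of Sect. 3.2, each processor scans only a sub-range of these masks (Phase 2(a)) before the best candidates are exchanged.

```java
import java.util.function.ToDoubleFunction;

/**
 * Illustration of the exhaustive search over the 2^K combinations of "old"
 * versus "new" best clusters. Bit k of the mask set to 1 means: replace the
 * k-th cluster of the current best WPSC by the best cluster extracted for
 * population k at this generation. Placeholder types, not the MACLAW code.
 */
public final class BestCombinationSearch {

    /** Returns the combination maximizing the supplied quality function. */
    static boolean[] bestCombination(int K, ToDoubleFunction<boolean[]> quality) {
        boolean[] best = null;
        double bestQ = Double.NEGATIVE_INFINITY;
        for (long mask = 0; mask < (1L << K); mask++) {    // 2^K candidates
            boolean[] choice = new boolean[K];
            for (int k = 0; k < K; k++) {
                choice[k] = ((mask >> k) & 1L) != 0L;      // true: take the new cluster for population k
            }
            double q = quality.applyAsDouble(choice);      // dominated by the O(KN) Q_p count
            if (q > bestQ) {
                bestQ = q;
                best = choice;
            }
        }
        return best;
    }
}
```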


3 Parallelization

The MACLAW method is implemented in the Mustic software tool, which provides a convenient graphical user interface (GUI) to specify all classification parameters. From a user point of view, even if it is agreed that runs may last long, a strong requirement was to keep using the same tool whatever the computing resources available. We first developed a multi-threaded version of the application in order to benefit from the capabilities of shared-memory multi-processor computers. The speed-ups obtained are satisfactory but limited: though SMP (Symmetric Multiprocessor) computers are quite common and affordable nowadays, they typically have at most two or four CPUs. To go beyond this limit we have developed a parallelized version of the MACLAW method. The parallelization follows a message-passing paradigm and uses the MPI constructs. Thus, we can potentially run this version on any multi-processor hardware, ranging from SMPs to clusters or massively parallel computers. Our primary target, however, is what is commonly called NOWs (Networks of Workstations), that is, a set of computers connected to the LAN. In our context, keeping a limited budget for hardware is not the only reason for this target. As will be shown later, the software solution we propose enables a user to invoke parallel executions directly from the Mustic GUI.

3.1 Parallelization strategy

Let us consider that our application may use P processors, for K populations, with a total of IK individuals. Note that often P > K, since the number of clusters is typically less than 20, whereas having 20 available processors is quite common in the targeted environments. A natural parallelization strategy is to distribute individuals and the associated computations on processors. At first glance, the choice remains between two alternatives: (i) Distribute whole populations to processors, with the objective of minimizing inter-population interactions. Indeed, in Phase 1 the method finds the best individual of each population, which requires comparing locally all the individuals' scores for a given population. (ii) Distribute individuals without consideration of the population they belong to (in other terms, populations may be scattered across different processors), with the objective of balancing computations as evenly as possible. We have noticed that the latter solution performs better, so we adopt this parallelization scheme. Figure 2 depicts the parallelization for 6 processors and 4 populations. The adopted strategy has good performance because, even if a population can elect its best individual without communication in solution (i) (Phase 1 in Algorithm 1), the counterpart is the imbalance in the number of individuals per processor. Solution (ii) offers a good load balance at the cost of a small communication overhead: processors owning a part of a population scattered onto different processors must send their best candidate individual so that a consensus is made on the best individual for this population (as shown in Phase 1(b) of Fig. 2).


Fig. 2 Execution of the parallel MACLAW application with 6 processors and 4 populations

3.2 Parallel algorithm

Thus, with P processors we map $\lceil IK/P \rceil$ or $\lfloor IK/P \rfloor$ individuals onto each processor (hence the imbalance is at most one individual). Note also the mapping of populations onto processors: if we assume $P \geq K$, populations are scattered onto $\lceil P/K \rceil$ or $\lfloor P/K \rfloor$ processors. Let us now review how the main phases of Algorithm 1 translate into a parallel implementation following the steps sketched in Fig. 2.
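A minimal sketch of this mapping is given below. It assumes individuals are numbered globally, population by population, so that the reference processor of a population (introduced in Phase 1(b) below) is simply the owner of its first individual; all names are illustrative and not taken from the Mustic code.

```java
/**
 * Sketch of the block distribution of the I*K individuals over P processors:
 * every processor receives either floor(IK/P) or ceil(IK/P) individuals.
 * Individuals are assumed to be numbered population by population
 * (population k owns global indices k*I .. (k+1)*I - 1).
 */
public final class IndividualMapping {

    /** Number of individuals mapped onto the processor of the given rank. */
    static int individualsOnRank(int I, int K, int P, int rank) {
        int total = I * K;
        return total / P + (rank < total % P ? 1 : 0);
    }

    /** Rank of the processor owning a given global individual index. */
    static int ownerOf(int globalIndex, int I, int K, int P) {
        int total = I * K;
        int big = total % P;                 // the first 'big' ranks hold ceil(IK/P) individuals
        int ceil = (total + P - 1) / P;
        if (globalIndex < big * ceil) {
            return globalIndex / ceil;
        }
        return big + (globalIndex - big * ceil) / (total / P);
    }

    /** Reference processor of population k: the owner of its first individual. */
    static int referenceProcessor(int k, int I, int K, int P) {
        return ownerOf(k * I, I, K, P);
    }
}
```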


• An initialization phase (Phase 0) is required to transmit the data, the classification parameters, as well as an initial WPSC from a chosen master processor (processor $P_0$ here) to all processors. Independently of the population they belong to, $\lceil IK/P \rceil$ or $\lfloor IK/P \rfloor$ individuals are assigned to each processor.
This communication takes place only once per run.
• In the cluster evaluation (Phase 1(a)), all processors proceed to the cluster extraction associated with each individual they own, that is, they evaluate the score of each of their assigned individuals with the fitness function. On each processor and for each population or population part, the individual yielding the best score is selected as the best individual candidate for the whole population.
• The next step (Phase 1(b)) aims at electing only one best individual for each population. We decide that each population has a reference processor, which is the processor where the first individual of the population resides. The scores and the indices of the various candidates are sent to their reference processors, which determine which is the best candidate. In the example of Fig. 2, $P_0$, $P_1$, $P_3$ and $P_4$ are the reference processors for populations 0, 1, 2 and 3 respectively. All population parts that are not hosted on their reference processor send their candidate to it. For example, $P_1$ has a part of population 0 and sends the score of the best individual of that population part to its reference processor $P_0$. During this step, K messages of 8 bytes are exchanged (0 or 1 message sent per processor in the case P > K).
• Phase 1(c): reference processors must then inform the processors owning the best individual of the population they control to broadcast the individual itself. This again requires K messages of 8 bytes (sent in parallel).
• Phase 1(d): in the last step of this phase, the processors instructed to broadcast the best individual(s) they host locally send to all other processors a message whose size depends on the dataset to classify. It is about 450 KB for the image used in the experiments (Sect. 3.4). Hence each processor receives K such messages (1 message from each of the K reference processors).
• Phase 2 aims at finding the best WPSC for this generation by evaluating all the $2^K$ distinct combinations of previous and current best individuals. In Phase 2(a), each processor independently evaluates a share of about $2^K/P$ of the combinations.
• Phase 2(b): when all processors have finished their evaluations, they all send their best candidate WPSC with its corresponding score to all other processors. Each processor sends P messages made of a byte array of length K. Each processor then sorts all its candidate WPSCs and keeps the best one.
• Last, in Phase 3, the reproduction process takes place in parallel on each processor. The algorithm then iterates with the next generation, branching back to Phase 1.

Notice that the communication costs described above for Phases 1 and 2 are linear in the number of populations and processors, while the computation costs for these phases are respectively polynomial and exponential in the number of populations. Also, we observe that Phase 1 has a higher communication overhead than Phase 2: the comparison of the communications in Phase 1(d) and Phase 2(b) shows that the volume of data sent per processor ($4.5 \cdot 10^5 K$ bytes vs. $PK$ bytes respectively) is much larger in the first phase since $P \ll 10^5$. We now come to the implementation of the parallelization described above. The implementation is based on P2P-MPI, developed in our team, which provides a
message-passing library as well as a middleware. In the following, we give a synthetic description of P2P-MPI.

3.3 P2P-MPI

P2P-MPI may simply be viewed as the MPI implementation we chose to code the parallel version of MACLAW. However, the functionalities of P2P-MPI go largely beyond a simple communication library, which makes the exploitation of the application more comfortable than with traditional environments devoted to parallel programming. We only give here a short overview of P2P-MPI, and the reader is referred to [12, 14] for details. P2P-MPI's overall objective is to provide a grid programming environment for parallel applications. P2P-MPI has two facets: it is a middleware and, as such, it has the duty of offering appropriate system-level services to users, such as finding requested resources, transferring files, launching remote jobs, etc. The other facet is the parallel programming API (Application Programming Interface) it provides to programmers.

API Most of the current research projects which target grid computing on commodity hardware enable the computation of jobs made of independent tasks only, and the underlying programming model proposed is a client-server (or RPC) model. The advantage of this model lies in its suitability to distributed computing environments, but it lacks expressivity for parallel constructs. P2P-MPI offers a more general programming model based on message passing, of which the client-server model can be seen as a particular case. Contained in the P2P-MPI distribution is a communication library which exposes an MPI-like API. Actually, our implementation of the MPI specification is in Java and we follow the MPJ recommendation [4]. Though Java is used for the sake of code portability, the primitives are quite close to the original C/C++/Fortran specification [23].

Middleware A user can simply make their computer join a P2P-MPI grid (it becomes a peer) by typing mpiboot, which runs a local gatekeeper process. The gatekeeper plays two roles:
– It advertises the local computer as being available to the peer group, and decides to accept or decline other peers' job requests as they arrive.
– When the user issues a job request, it is in charge of finding the requested number of peers and of organizing the job launch.

Launching an MPI job requires assigning a unique number to each task (or process) and then synchronizing all processes at the MPI_Init barrier. When a user issues a job request involving several processes, its local gatekeeper initiates a discovery to find the requested number of resources during a limited period of time. P2P-MPI uses the JXTA library [18] to handle all usual peer-to-peer operations such as discovery. Resources can be found because they advertised their presence together with their technical characteristics when they joined the peer group. Once enough resources have been selected, the gatekeeper transfers the program to execute, along with the input data or a URL to fetch the data from, to each selected host. Once
all hosts have acknowledged the transfer, a list of the numbers assigned to each process (in MPI terms, the rank of each process) is broadcast to the participating hosts. On reception of its rank, a process passes the MPI_Init barrier and enters the effective application part. Thus, at each execution request and without any intervention from the user, P2P-MPI dynamically builds an execution platform (i.e. the selected hosts gathered for the execution) to fulfill the user request. Users do not have to bother about which hosts are currently available since all available resources advertise their presence themselves. The drawback of this seamless management lies in the random selection of hosts, since several execution requests may lead to different execution platforms each time. However, we believe that this is an acceptable price to pay for alleviating the burden of locating resources. Moreover, the next release of P2P-MPI will provide an optional scheduler capable of selecting hosts so that the execution time of an application is nearly optimal. Note that such scheduling requires precise information about the application behavior as well as about the network and the performance of the available processors, which is only realistic for a small set of hosts. P2P-MPI can be compared to the batch schedulers present on most parallel supercomputers, which select appropriate resources so as to complete the job request.

Integration P2P-MPI also enables a tight integration of the Java code of MACLAW using the P2P-MPI library with the middleware layer. In the MACLAW implementation, a call to the MPI.Init() function with arguments such as the number of processors is sufficient to trigger in the middleware the resource discovery and selection mechanisms, the transfer of programs and the remote executions on the selected hosts. In practice, this allows the user to keep using the same desktop PC that served for the sequential application, from where the middleware starts searching for other CPUs. By contrast, the parallelization of an application usually requires moving the application to the parallel system to run it. In our case, from the user's point of view, a sequential or parallel execution is simply requested via a simple menu in the usual graphical interface.

Related work Let us briefly review the alternatives to the use of P2P-MPI. In the last decade, several projects have proposed message-passing libraries designed for grid environments. Amongst these are MagPIe [20], PACX-MPI [11] and mpich-G2 [19], which relies on the Globus toolkit. However, the effort is put here on the optimization of communications depending on the type of links, but little attention is paid to the integration of the communication library with the middleware. For instance, none of the above projects offers the feature of automatically finding available processors; even mpich-G2 requires writing a description of the resource locations (the RSL file) for each execution, despite the presence of a directory of resources (MDS) in Globus. In addition, these MPI implementations require distributing and maintaining OS-dependent binaries at each host, which is very error-prone. An alternative approach, close to P2P-MPI, is the P3 project [25]. To the best of our knowledge, this is the only other project to combine a message-passing programming model with middleware services. In P3, JXTA is also used for discovery: host entities automatically register in a peer group of workers and accept work requests according to the resource owner's policy.
Secondly, a master-worker and message passing
paradigm, both built upon a common library called object passing, are proposed. Unlike P2P-MPI, P3 uses JXTA for its communications (JxtaSockets). This allows communication without consideration of the underlying network constraints (e.g. firewalls) but incurs an overhead when the logical route established goes through several peers. On the contrary, in P2P-MPI we have favored performance by implementing the MPI primitives with Java sockets. The collective operations also benefit from well-known optimizations. Last, P2P-MPI provides a transparent fault-tolerance facility [13], which is important for long runs and which P3 lacks.

3.4 Experiments

Experiment settings In everyday use, P2P-MPI enables MACLAW to accelerate runs by picking available processors automatically. Though very useful, using a heterogeneous set of processors makes it impossible to evaluate to which degree the parallelization can accelerate the clustering execution. In this paper, we focus on application deployments using one to three clusters. The tests on a single cluster serve to assess the application's scalability, while multi-cluster executions mainly aim at finding how much overhead is paid for wide-area communications. The platform we use to evaluate the application is the Grid'5000 testbed. Grid'5000 [3] is a federation of dedicated computers hosted across 9 campus sites in France, organized in a VPN (Virtual Private Network) over Renater, the national education and research network. Each site currently has about 100 to 700 processors arranged in one or several clusters. The total number of processors is currently around 3000 and is funded to grow up to 5000 processors. The testbed is partly heterogeneous concerning the processors: 75% are AMD Opteron, the remainder being Itanium 2, Xeon and G5. The first test, which addresses the scalability of the application on a single cluster, uses the cluster in Nancy, composed of AMD Opteron 2 GHz processors with 2 GB RAM. Figure 3 reports the performance for a clustering into K = 4, 8 and 16 clusters. We use logarithmic scales to facilitate reading of the curve slopes. The right-column figures show the corresponding speedups. The data set to cluster is an image such as the one in Fig. 1, for which our parameters are I = 20 individuals, d = 3 attributes and N = 50000 pixels. We iterate for G = 20 generations. These values correspond to the most common usage of MACLAW, even if they are slightly lower than what would be used to get accurate results. Note also that the computation time for a given image only depends on these parameters, and not on the image itself.

Results First, the figures give an illustration of the typical execution time depending on the number of clusters: the sequential execution takes 24 minutes for 4 clusters, 85 minutes for 8 clusters and more than 8 hours for 16 clusters. For such a problem instance, we immediately see the practical benefit of the parallelization: the long-running clustering into 16 clusters was done in 12 minutes in our tests with 60 processors. Concerning the general behavior of the parallel application, the situation is contrasted. For 4 clusters, the speedup is linear up to 16 processors, then decreases slowly, stabilizes at 30–32 processors and dramatically drops at 48 processors. The plot is similar for 8 clusters, except that the speedup is linear until 24 processors and drops at 48.


Fig. 3 Performances for 4, 8 and 16 cluster extractions

For 16 clusters, the speedup remains linear up to 60 processors, with a speedup of 47 using 60 processors. We should also note that the initialization phase in the parallel implementation induces an overhead due to the transfer of the dataset to the remote hosts. This overhead is specific to the problem and may hinder the scalability of the parallel program if the computation time is small relative to the data transmission overhead. In our case, the time to broadcast the image (less than 10 s) is nearly constant and small on the whole, and does not raise more complex issues about data management.

Analysis The measurements strongly corroborate the complexity analysis in Sect. 2.3. Thus, considering (4), we can theoretically show (given that d = 3) that

$$\frac{T_1(K=8)}{T_1(K=4)} \approx 4.5 \quad\text{and}\quad \frac{T_1(K=16)}{T_1(K=8)} \approx 4.9,$$


which are close to the ratios 3.5 and 4.2 we observe in sequential executions. For Phase 2, we predict

$$\frac{T_2(K=8)}{T_2(K=4)} \approx 16 \quad\text{and}\quad \frac{T_2(K=16)}{T_2(K=8)} \approx 512,$$

which agrees with our measurements: we find 22 and 500 respectively for execution times on a single processor. If we consider the configurations with K = 4 and K = 8, Phase 1 is largely dominating and scales well, up to 48 processors and 56 processors for 4 and 8 clusters respectively. For such values of K, Phase 2 involves very few computations relative to Phase 1. Note that a heavy computation imbalance appears quickly. For instance, for K = 4 there is a total of 80 individuals, so that using more than 40 processors means that each processor has a load of either 1 or 2 individuals. There is no speedup in Phase 2 when we use more processors because the communications (Phase 2(b)) are too costly relative to the small amount of computation. Moreover, it induces significant idle times, because of the collective communications and resulting synchronizations, when the number of processors increases. In this application, there is no parallel overhead due to extra computation or large load imbalance. The extra communications when using many processors, and the collective synchronizations, are the cause of the speedup decrease. However, as the most time-consuming phase benefits from a good speed-up, acceptable execution times for K = 4 are reached with 16 processors (120 s), while the minimum is 93 s with 32 processors. Similarly, for K = 8 a good trade-off seems to be 24 processors, to classify the image in 246 s. The figure for K = 16 is quite different: the execution times for the two phases are nearly equal due to the increase of the Phase 2 computation complexity. As a consequence, the overhead due to communications tends to be small relative to the computation cost, resulting in a very good speed-up throughout the tests, up to 60 processors. Thus, in this case, the parallelization allows the clustering to be achieved in tens of minutes instead of several hours, which enhances the usability of the method.

3.5 Multi-site test

The second test consists in running the application distributed across a couple of geographically distant sites (Nancy, Lille, Nice or Bordeaux, parts of Grid'5000). We test the case K = 16, since we have seen that many processors are of no use with 4 and 8 clusters. Among the multitude of possible configurations, we choose a few executions which we assume to be representative of runs across several sites. We report the multi-site performances in the table of Fig. 4 (the rows whose test uses more than one site); for the sake of comparison, we also report underneath the performances for the same number of processors on a single cluster. Two factors should make the performance drop in this test. First, the wide-area network communications (sites are distant from 1000 to 2000 km) induce incompressible latency. As both local and wide-area links are used simultaneously, we may expect an imbalance in communication times. Secondly, multi-cluster executions may involve heterogeneity of the processors used.


test id | total procs | Nancy (type A) | Lille (type A) | Nice (type A) | Nice (type B) | Bordeaux (type C) | total time (s)
--------|-------------|----------------|----------------|---------------|---------------|-------------------|---------------
1       | 24          | 9              | 15*            | 0             | -             | -                 | 1940
2       | 24          | 5              | -              | -             | 7*            | 12                | 2122
3       | 24          | 2              | -              | -             | 8*            | 14                | 2080
4       | 24          | 24             | -              | -             | -             | -                 | 1609
5       | 24          | -              | -              | -             | -             | 24                | 2053
6       | 32          | 17*            | 10             | 5             | -             | -                 | 1602
7       | 32          | 3              | -              | -             | 9*            | 20                | 1632
8       | 32          | 32             | -              | -             | -             | -                 | 1294
9       | 32          | -              | -              | -             | -             | 32                | 1484
10      | 48          | 38             | 0              | 10*           | -             | -                 | 1198
11      | 48          | 10*            | 22             | 16            | -             | -                 | 1261
12      | 48          | 48             | -              | -             | -             | -                 | 881
13      | 48          | -              | -              | -             | -             | 48                | 1128
14      | 56          | 15*            | 20             | 21            | -             | -                 | 1001
15      | 56          | 10             | -              | -             | 20*           | 26                | 1219
16      | 56          | 56             | -              | -             | -             | -                 | 771
17      | 56          | -              | -              | -             | -             | 56                | 1020

Fig. 4 Distribution of processes across sites. A '-' sign means the site was not solicited. A '*' sign indicates that the input data is owned by a processor at this site

However, it is out of the scope of paper to study the behavior of all possible configurations. In this respect, we essentially limit heterogeneity to the kind of network links used. For about half of the executions involving several sites, we restrict the available resources to a set of homogeneous processors, namely AMD Opteron 2.0 GHz (type A), to isolate latency effects from processor heterogeneity. These are tests numbered 1, 6, 10, 11 and 14 in Fig. 4. The other half set of tests uses different processor types: in addition to type A, we use dual-core AMD Opteron 2.2 GHz (type B) and Intel Xeon EM64T 3 GHz (type C) processors. The aim of mixing three types of processors is to determine wheter involving some faster processors on distant sites may improve the total execution time or, on the contrary, the gain due to processor power is absorbed into wide-area communication overheads. These are tests number 2, 3, 7 and 15. In order to get an overview of the performances, we plot on a linear scale in Fig. 5 (left) the average execution times (and the vertical line indicates the standard deviation) for a given number of processors for multi-site runs. In order to compare, we also plot on these figures the corresponding performances obtained on homogeneous clusters (of types A and C) with the same number of processors. Right side of Fig. 5 presents the speedups on multi-clusters mean times compared to the best speedup obtained with an homogeneous cluster (type A processors). The overall performances of the application distributed across two or three sites are quite similar for a given number of processors, whatever the configuration. The curves make appear clearly that the use of distant sites induces a constant penalty due to latency. Furthermore, this overhead can be precisely observed on Fig. 4. For processors of type A, one can compare execution times for three sites versus one site experiments for 24, 32, 48 and 56 processors. The respective ratios for execution times (corresponding to tests 1/4, 6/8, 10/12 and 14/16) indicate that executions on three sites are about 30% slower (20%, 23%, 43% and 32% respectively). However,


Fig. 5 Performances for K = 16 on one cluster and several clusters

However, even if the multi-site speedups between 24 and 56 processors are lower than for one site, the speedup increases steadily in this range. The second observation is related to processor heterogeneity. We compare single-site execution times on the type A cluster versus the type C cluster (tests 4/5, 8/9, 12/13, 16/17). Type A configurations are from 14% to 32% faster than type C executions (27%, 14%, 28% and 32% for 24, 32, 48 and 56 processors respectively). So, it happens that this difference due to CPU power is similar to the penalty paid for multi-site communications. Not surprisingly, comparing three-site type A configurations with type C single-cluster results (tests 1/5, 6/9, 10 or 11/13, 14/17) shows nearly equal execution times (the ratios of the former to the latter show a difference ranging from −6% to +7%). Last, we investigate three-site configurations involving either only type A processors or all types of processors (with nearly half of the processors being of type C each time). The algorithm that we designed only deals with a homogeneous distribution of work between processors. Considering a platform made of processors with varied capacities hence induces idle times on the most powerful machines. The sequence of tests 1/2, 6/7 and 14/15 exemplifies this situation. Whatever the number of processors, the use of some processors of type C slows down the execution in comparison to a platform with only type A processors. Let us notice that the presence of type B processors does not affect performance at all, because these are the fastest processors. In the end, the difference in performance between the A and C processor types is relatively small. So, depending on the availability of machines, the user could benefit from such a mixed-processor configuration.

4 Conclusion

We have described in this paper the parallelization of MACLAW. MACLAW is a clustering method which integrates local feature weighting through a cooperative coevolution process, and whose aim is to eventually produce a hard clustering. After a complexity analysis of the algorithm's main phases, we have proposed a parallelization strategy based on the message-passing paradigm. Besides the parallelization, we propose an original implementation in the P2P-MPI framework. In this context, P2P-MPI provides both the communication library and a convenient
service for handling execution requests with almost no intervention from the user. We have carried out experiments with this implementation based on P2P-MPI. On a single cluster, the application’s speedup is linear up to 16 or 24 processors for a small number of clusters and drops afterwards because of the disadvantageous ratio of communications to computations in the second phase of the application. For a larger number of clusters, the speedup is linear due to the increasing computing weight of this second phase. For this number of clusters, an experiment involving three distant sites shows a similar speedup curve evolution despite the overhead due to the latency on WAN network links. Finally, the parallelization enhances the usability of the MACLAW method, allowing for clusterings with a large number of classes in tens of minutes instead of hours in the sequential version. In addition, a noteworthy aspect of P2P-MPI is that it improves usability in practice by allowing users to keep running the application from their usual computer, as the middleware transparently discovers available computing resources.

References 1. Berkhin P (2002) Survey of clustering data mining techniques. Technical report, Accrue Software, San Jose, CA 2. Blansché A, Gançarski P (2006) MACLAW: a modular approach for clustering with local attribute weighting. Pattern Recognit Lett 27(11):1299–1306 3. Cappello F et al (2005) Grid’5000: a large scale, reconfigurable, controlable and monitorable grid platform. In: Proceedings of the 6th IEEE/ACM international workshop on grid computing Grid’2005, November 2005. http://www.grid5000.org 4. Carpenter B, Getov V, Judd G, Skjellum T, Fox G (2000) MPJ: MPI-like message passing for Java. Concurr Pract Experience 12(11), September 5. Chan EY, Ching WK, Ng MK, Huang JZ (2004) An optimization algorithm for clustering using weighted dissimilarity measures. Pattern Recognit 37:943–952 6. Dhillon IS, Modha DS (2000) A data-clustering algorithm on distributed memory multiprocessors. In: Revised papers from large-scale parallel data mining, workshop on large-scale parallel KDD systems, SIGKDD Springer, New York, pp 245–260 7. Domeniconi C, Gunopulos D, Ma S, Yan B, Al-Razgan1 M, Papadopoulos D (2007) Locally adaptive metrics for clustering high dimensional data. Data Min Knowl Discov 14(1):63–97 8. Forman G, Zhang B (2000) Linear speedup for a parallel non-approximate recasting of centerbased clustering algorithms, including k-means, k-harmonic means, and em. In: ACM SIGKDD workshop on distributed and parallel knowledge discovery, KDD-2000 9. Friedman JH, Meulman JJ (2004) Clustering objects on subsets of attributes. J Roy Stat Soc 66(4):815–849 10. Frigui H, Nasraoui O (2004) Unsupervised learning of prototypes and attribute weights. Pattern Recognit 34:567–581 11. Gabriel E, Resch M, Beisel T, Keller R (1998) Distributed computing in an heterogeneous computing environment. In: EuroPVM/MPI. Lecture notes in comput sci, vol 1497. Springer, New York, pp 180– 187 12. Genaud S, Rattanapoka C (2005) A peer-to-peer framework for robust execution of message passing parallel programs. In: Di Martino B et al (eds) EuroPVM/MPI 2005. Lecture notes in comput sci, vol 3666. Springer, New York, pp 276–284, September 13. Genaud S, Rattanapoka C (2007) Fault management in P2P-MPI. In: Proceedings of international conference on grid and pervasive computing, GPC’07. Lecture notes in comput sci. Springer, May 14. Genaud S, Rattanapoka C (2007) P2P-MPI: a peer-to-peer framework for robust execution of message passing parallel programs. J Grid Comput 5:27–42 15. Gnanadesikan R, Kettenring JR, Tsao SL (1995) Weighting and selection of variables for cluster analysis. J Classif 12(1):113–136 16. Howe N, Cardie C (1997) Examining locally varying weights for nearest neighbor algorithms. In: ICCBR, pp 455–466


17. Huang JZ, Ng MK, Rong H, Li Z (2005) Automated variable weighting in k-means type clustering. IEEE Trans Pattern Anal Mach Intell 27(2):657–668 18. JXTA http://www.jxta.org 19. Karonis NT, Toonen BT, Foster I (2003) MPICH-G2: a grid-enabled implementation of the message passing interface. J Parallel Distributed Comput special issue on Comput Grids 63(5):551–563, May 20. Kielmann T, Hofman RFH, Bal HE, Plaat A, Bhoedjang RAF (1999) MagPIe: MPI’s collective communication operations for clustered wide area systems. ACM SIGPLAN Notices 34(8):131–140, August 21. Kruengkrai C, Jaruskulchai C (2002) A parallel learning algorithm for text classification. In: Eighth ACM SIGKDD international conference on knowledge discovery and data mining, July 22. MacQueen JB (1967) Some methods for classification and analysis of multivariate observations. In: Proceedings of the 5th Berkeley symposium on mathematical statistics and probability, Berkeley, CA, 1967. University of California Press, pp 281–297 23. MPI (1995) A message passing interface standard, version 1.1. Technical report, University of Tennessee, Knoxville, TN, USA, Jun 24. Parsons L, Haque E, Liu H (2004) Subspace clustering for high dimensional data: a review. SIGKDD explorations, newsletter of the ACM special interest group on knowledge discovery and data mining 6(1):90–106 25. Shudo K, Tanaka Y, Sekiguchi S (2005) P3: P2P-based middleware enabling transfer and aggregation of computational resource. In: 5th intl workshop on global and peer-to-peer computing, in conjunc with CCGrid05. IEEE, May 26. Xu R, Wunsch D (2005) Survey of clustering algorithms. IEEE Trans Neural Netw 16(3):645–678 Stéphane Genaud received a Ph.D. in Computer Science from Strasbourg University (France) in 1997. He has been an associate professor at Strasbourg University since 1998. His research interests involve languages and methods for parallel programming, cluster and Grid computing.

Pierre Gançarski received his Ph.D. in Computer Science from the Strasbourg University (France) in 1988. He has been an associate professor of Computer Science at Strasbourg University since 1992. His current research interests include collaborative multi-strategical clustering with applications to complex data mining and remote sensing analysis.

Guillaume Latu received his Ph.D. in Computer Science from Bordeaux University (France) in 2002. He has been an associate professor of Computer Science at Strasbourg University (France) since 2003. He is interested in parallel algorithmics, simulation techniques, and high-performance computing.


Alexandre Blansché received his Ph.D. in Computer Science from Strasbourg University (France) in 2006. His main research interests include unsupervised learning and feature weighting for clustering, with an application to remote sensing analysis. He is currently in a post-doctoral position at the University of Tokyo and works on data mining for material design.

Choopan Rattanapoka received his B.Eng. in Computer Engineering from Kasetsart University (Thailand) and his master's degree in Computer Science from Strasbourg University (France) in 2004. He is currently a Ph.D. candidate at Strasbourg University. His current research interests include peer-to-peer, grid and parallel computing.

Damien Vouriot received his Master degree in Computer Science and Embedded Systems from Strasbourg University (France) in 2006. His master’s work was devoted to the parallelization of clustering algorithms on various architectures. He is now working in the industry.