Towards Parallel and Distributed Computing in Large-Scale Data Mining: A Survey

Han Xiao
Technical University of Munich
D-85748 Garching near Munich, Germany
[email protected]

April 8, 2010∗

∗ For the latest revision, please download from http://home.in.tum.de/~xiaoh
Abstract

The implementation of data mining ideas in high-performance parallel and distributed computing environments is becoming crucial for ensuring system scalability and interactivity as data continue to grow inexorably in size and complexity. This paper surveys the parallelization of well-known data mining techniques covering classification, link analysis, clustering and sequential learning, which are among the most important topics in data mining research and development. Basic terminology related to data mining and parallel computing is introduced. For each algorithm, we provide a description, then review and discuss current research on its parallel implementations.
1 Introduction

The wide availability of large-scale data sets from different domains, such as collections of images, text, and related data, has created a demand to automate the process of extracting information from them. Data Mining and Knowledge Discovery are commonly defined as the extraction of patterns or models from observed data. Such large data sets offer the ability to explore much richer and more expressive models, as well as new and interesting domains for the application of learning algorithms. Examples range from book recommendation at Amazon and social connection mining on Facebook to clustering large collections of images on Flickr. Meanwhile, real-time and archival data increase as fast as or faster than computing power. Researchers are realizing that parallel processing is an effective technique for scaling up these algorithms. Although there are various reasons for performing data mining algorithms in a distributed manner, the most immediate and practical motivation is to develop learning algorithms that can take advantage of the increasing availability of multi-processor and grid computing technology. For instance, in Web
mining the input data is usually large and high-dimensional, and the computations have to be distributed across hundreds or thousands of machines in order to finish in a reasonable amount of time. On a deeper level, there are fundamental questions about distributed learning from the viewpoints of artificial intelligence and cognitive science.

However, using parallelization to speed up and scale up data mining implementations involves a number of challenges. First, the appropriate parallelization strategy may depend upon the data mining task; there is no general parallelization technique for data mining, so each popular algorithm requires a specialized implementation, and such one-off implementations rarely achieve widespread use. Secondly, maintaining, debugging, and performance-tuning a parallel application are extremely time-consuming tasks. Usually, it is not easy to modify existing code to achieve high performance on parallel systems, and an inappropriate parallelization buries the originally simple computation under large amounts of complex code for distributing the data and handling failures.

In this survey, we attempt to review work on the parallelization of state-of-the-art data mining algorithms, which is relevant for researchers using or trying to introduce parallel techniques into data mining. The survey is restricted to the application of parallel computing to complex data mining and knowledge discovery tasks involving large data sets. The rest of this paper is organized as follows: In Section 2, we introduce basic terminology on data mining, parallel environments and parallel programming models. In Section 3, we describe previous work on the parallelization of the nine most well-known data mining algorithms in the research community: k-Nearest Neighbors, Decision Tree, Naïve Bayes Classifier, k-Means Clustering, Expectation-Maximization, PageRank, Support Vector Machine, Latent Dirichlet Allocation, and Conditional Random Field. For each algorithm, different parallelizations are discussed, together with the reported experiments. Section 4 gives an overall view of the performance analysis and a comparison between the different forms of parallelism. Conclusions are presented in Section 5.
2 Basic Concepts and Terminology

In this section, we introduce basic concepts and terminology in data mining as well as some widely used programming models, frameworks and techniques in parallel computing, which will help the reader understand the parallel algorithms in Section 3.
2.1 Data Mining and Machine Learning

Data Mining is: "The nontrivial extraction of implicit, previously unknown and potentially useful information from data" [39]. The analysis of data using machine learning and statistical techniques aims at finding hidden patterns and connections in these data. Machine Learning is an area of artificial intelligence concerned with the development of techniques which allow computers to "learn" through the analysis of data sets. The focus of most machine learning methods is on automatically recognizing complex patterns and making intelligent decisions based on data; the field is also concerned with the algorithmic complexity of computational implementations. [67] presents many of the
commonly used Machine Learning methods. Statistics has its grounds in mathematics and deals with the science and practice of analyzing empirical data. As we will see in Section 3, many methods from statistics are used in the field of Data Mining; good overviews are given in [50, 10, 52]. Other areas related to Data Mining are Databases [20] and Information Visualization [61]. Data Mining methods have different goals and commonly involve four classes of tasks [44]: regression, classification, clustering, and identifying meaningful associations between data attributes. It should also be noted that several methods with different goals may be applied successively to achieve a desired result. For example, to recommend books that customers are likely to buy, a business analyst might need to analyze the association between user profiles and the descriptions of books, and then apply a content-based filtering approach that matches a new book's description to those books known to be of interest to the user.
2.2 Parallel and Distributed Computing

Parallel computing is a form of computation in which many calculations are carried out simultaneously [3]. It is based on the principle that large problems can often be divided into smaller ones, which are then solved concurrently. Some of the more commonly used terms associated with parallel computing are listed below [87, 78].

Speedup: one of the simplest and most widely used indicators of a parallel program's performance. It is defined as s_n = t_s / t_p, where t_s is the execution time using only one processor and t_p is the execution time using n processors. The maximum speedup that can be reached is linear speedup.

Scaleup: captures how well the parallel algorithm handles larger data sets when more processors are available. A scaleup study measures execution times while keeping the problem size per processor fixed and increasing the number of processors.

Shared Memory System: multiple processors operate independently but share the same memory resources. Data sharing between tasks is therefore fast and uniform due to the proximity of memory to the CPUs. However, adding more CPUs can geometrically increase traffic on the shared memory-CPU path.

Distributed Memory System: each processor has its own local memory, and a communication network connects the inter-processor memories. Increasing the number of processors increases the size of memory proportionately. Since the concepts of cache coherency and a global address space do not apply in a distributed memory system, the programmer is responsible for many of the details associated with data communication between processors.

There are two parallel programming models in common use: threads and message passing. Threaded implementations are not new in computing; two different but well-known implementations of threads are POSIX Threads [16] and OpenMP [26]. In the message passing model, multiple tasks can reside on the same physical machine as well as across an arbitrary number of machines. Tasks exchange data through communications
by sending and receiving messages. The most widely used message-passing library is MPI, which is now the "de facto" industry standard [38]. Collective communication is one of the remarkable features of MPI, since it can transmit data among all processors efficiently. A group of global reduction operations (such as sum, max and min) is also supported in MPI; in some cases, a more complex reduction operation must be defined for computing or updating the global parameters of the model. Figure 1 gives a pictorial representation of four basic collective functions.

Figure 1: Collective communications in MPI. Each row represents a processor with 4 bytes of memory, one byte per box; boxes filled with different textures represent different data. In the case of broadcast, processor P0 sends its data to all other processors, which gives rise to the same data on every processor. For the gather operation, P0 receives the data from the other processors and writes them into its local memory. A reduction operation like allreduce basically performs an allgather with some extra computation, which can be as trivial as a sum or delicately designed for a specific algorithm. In the context of data mining, broadcast and scatter operations are often used at the beginning of an algorithm, where data have to be distributed from a master node to all slave nodes; gather and allgather operations are often called at the end of an algorithm to combine values from all processors and update the model parameters.

Recently, a distributed programming model called MapReduce [28] has attracted the attention of researchers and developers. MapReduce was developed by Google for processing massive amounts of data in large clusters. It is implemented as two functions: Map, which applies a function to all the members of a collection and returns a list of results based on that processing, and Reduce, which collates and resolves the results from two or more Maps executed in parallel by multiple threads, processors, or stand-alone systems. Both Map and Reduce may run in parallel, though not necessarily on the same system at the same time.
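The collective operations above form the communication backbone of most of the parallel algorithms in Section 3. The following is a minimal sketch of the broadcast/allreduce/gather pattern, assuming the mpi4py bindings and numpy are installed and the script is launched under mpirun; the local statistic is a placeholder for the real per-node computation.

```python
# Sketch of the broadcast / allreduce / gather pattern (assumes mpi4py and
# numpy; run e.g. with: mpirun -n 4 python collectives_demo.py).
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Master node builds the initial model parameters and broadcasts them.
params = np.zeros(8) if rank == 0 else None
params = comm.bcast(params, root=0)

# Each node computes some local statistic on its own data shard
# (here just a placeholder vector depending on the rank).
local_stat = np.full(8, float(rank + 1))

# allreduce: every node obtains the element-wise sum of all local statistics,
# which is the typical way global model updates are assembled in Section 3.
global_stat = np.empty_like(local_stat)
comm.Allreduce(local_stat, global_stat, op=MPI.SUM)

# gather: the master collects all local statistics for inspection.
all_stats = comm.gather(local_stat, root=0)
if rank == 0:
    print("global sum:", global_stat)
    print("gathered:", len(all_stats), "local vectors")
```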
The recent improvements of Graphics Processing Units (GPUs) offer a powerful processing platform for both graphics and non-graphics applications [84, 63]. A GPU has a parallel "multi-core" architecture, with each core capable of running thousands of threads simultaneously; if an application is suited to this kind of architecture, the GPU can offer large performance benefits. However, a typical computation running on the GPU must express thousands of threads in order to use the hardware capabilities effectively, so finding large-scale parallelism is important when programming on a GPU. The introduction of NVIDIA CUDA (Compute Unified Device Architecture) brought, through a C-based API, an easy way to take advantage of the high performance of GPUs for parallel computing. CUDA also exposes a fast shared memory region (16KB in size) that can be shared amongst threads. We will see in the next section that some data mining algorithms can be significantly accelerated by exploiting the GPU architecture.
3 Parallel Data Mining

3.1 k-Nearest Neighbors
The k-Nearest Neighbors (k-NN) algorithm is an easy-to-understand and easy-to-implement classification technique in data mining. It is based on a majority vote of the k closest training examples in the feature space [54]. If k = 1, then the object is simply assigned to the class of its nearest neighbor. In practical applications, k is in the units or tens rather than in the hundreds or thousands. Various measures (e.g. Euclidean distance, cosine similarity, KL-divergence) can be used to compute the distance between two data points; the most suitable distance metric may differ between applications. k-NN is a type of lazy learning where the function is only approximated locally and all computation is deferred until classification. Thus, building the model is cheap, but classifying unknown objects is relatively expensive since it requires computing the distance of the unlabeled object to all the objects in the labeled set. Unlike the other data mining algorithms mentioned in this paper, the parallelization of k-NN is applied not to the training phase but to the prediction of unobserved instances, which proceeds as follows:

1. Partition the dataset D into P blocks D_1, ..., D_P; each processor handles roughly ∥D∥/P instances.

2. Given an unknown object, processor P_r calculates the k nearest neighbors N_r among the local training samples D_r.

3. A global reduction computes the overall k nearest neighbors N_global from N_1, ..., N_P, and the object is assigned to the class most common amongst N_global.

The main drawback of k-NN lies in its computational burden, as it grows linearly with the size of the training set. A number of techniques have been developed for efficient computation that exploit the structure in the data and do some preprocessing to avoid an "exhaustive search". A typical example is the BBD-tree [4], an approximate nearest neighbor search algorithm that is commonly used in practice because of its good performance.
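A minimal sketch of the three-step parallel prediction above, with a loop over partitions standing in for the P processors; in a distributed run, step 3 would be an MPI gather or reduction of the local candidate lists. The synthetic data, k = 5 and P = 4 are illustrative assumptions; numpy only.

```python
# Parallel k-NN prediction: local top-k per partition, then a global reduction.
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 16))          # labeled training points
y = rng.integers(0, 3, size=1000)        # class labels
query = rng.normal(size=16)              # unknown object to classify
k, P = 5, 4

# Step 1: partition the training set into P blocks.
blocks = np.array_split(np.arange(len(X)), P)

# Step 2: each "processor" finds its k local nearest neighbors.
local_results = []
for idx in blocks:
    d = np.linalg.norm(X[idx] - query, axis=1)
    order = np.argpartition(d, k)[:k]
    local_results.append((d[order], y[idx[order]]))

# Step 3: the global reduction keeps the k smallest distances overall,
# then the class label is decided by majority vote.
dists = np.concatenate([d for d, _ in local_results])
labels = np.concatenate([l for _, l in local_results])
topk = np.argpartition(dists, k)[:k]
predicted = Counter(labels[topk]).most_common(1)[0][0]
print("predicted class:", predicted)
```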
[55] evaluated parallel k-NN on top of the Active Data Repository developed at the University of Maryland [18]. They used 8 Sun Microsystems Ultra Enterprise 450s, each of which has 4 250MHz Ultra-II processors and 1 GB RAM, connected by a Myrinet switch. The data set is 2.7 GB with points in a 3-dimensional space, and the value of k used is 10. They reported speedups on 2, 4, and 8 nodes with a single thread of 1.93, 4.04 and 7.70, respectively. They also measured the time taken by a version of the code that only performs I/O and no computation, and showed that the code is I/O bound and cannot benefit from additional threads for computation. It is worth highlighting that the k-NN algorithm can be significantly accelerated using a GPU architecture. The comparison in [41] was between a standard k-NN implemented in C, a k-NN implemented in CUDA, and a BBD-tree implemented in C, on a Pentium 4 3.4 GHz with 2GB of DDR memory and an NVIDIA GeForce 8800 GTX graphics card. This graphics card has 128 stream processors clocked at 1.35 GHz, a core clock of 575 MHz, and 768 MB of 384-bit GDDR3 memory at 1.8 GHz. The results on 38400 points with 96 dimensions showed that k-NN-CUDA is 100 times faster than k-NN-C, and 40 times faster than BBD-tree-C. Additionally, researchers from computational geometry also find this algorithm intriguing. A k-Nearest Neighbor Graph (k-NNG) is a graph in which two vertices p and q are connected by an edge if the distance between p and q is among the k smallest distances from p to the other vertices. The k-NNG is widely used in meshing, rendering and geometric embedding, so building the k-NNG in an efficient manner is a crucial problem in computational geometry. [24] presented a parallel algorithm for k-NNG construction that uses Morton ordering. They performed experiments on multi-core processors with Intel, AMD and Sun architectures, and showed that the algorithm performs best on point sets that use integer coordinates and scales well as more processing power becomes available.
3.2 Decision Tree

Decision tree algorithms are widely used in Data Mining because they can be expressed in a rule-based manner and can be easily converted into SQL statements that access databases efficiently [1]. Each leaf represents a value of the target variable given the values of the input variables represented by the path from the root to the leaf. A decision tree can be learned by splitting the source set into subsets based on an attribute value test; this process is repeated on each derived subset in a recursive manner. C4.5 is an algorithm that generates a decision tree using a divide-and-conquer strategy [74]. Given a training set with N instances, each object is represented as a feature vector (x_1, ..., x_d) with a class label v. The general case of the C4.5 algorithm is described as follows:

1. For each feature f_i, compute the information gain g_i, i.e. the reduction in the entropy H = -\sum_{v=1}^{V} p_v \log_2 p_v obtained by splitting on f_i, where v takes on values in {1, ..., V} and p_v is the fraction of items labeled with value v in the set.

2. Let f_best be the feature with the highest normalized g_i, and create a decision node n_m that splits on f_best.
3. Recurse on the sublists obtained by splitting on f_best, and add those nodes as children of n_m.

4. Repeat on every node until all instances in the subset at a node have the same value of the target variable, or until splitting no longer adds value to the predictions.

The parallelization of decision tree construction algorithms relies on task parallelism, data parallelism or hybrid parallelism. The task-parallel approach proposed by [27] dynamically distributes the decision nodes among the processors for further expansion. A single processor using the whole training set starts the construction phase; when the number of decision nodes equals the number of processors, the nodes are split among them. At this point each processor proceeds with the construction of the decision subtrees rooted at the nodes assigned to it. They experimented on two data sets: Census-Income¹, which contains 27,222 cases with 14 attributes (five of the 14 attributes are continuous), and Letter-Recognition², which contains 20,000 cases with 16 continuous attributes. They observed average speedups of [8 nodes, 1.5x] and [16 nodes, 2.5x] on a large-scale MIMD parallel computer, the Fujitsu AP1000. Beyond 16 processors the performance remains static or degrades because of poor load balancing; that is, the sizes of the subtrees allocated to the processors vary, leading to an uneven distribution of work between the processors.

[77] proposed a data-parallel approach that distributes the instances of the data set evenly across the processors. Each processor keeps in its memory only a distinct subset of examples of the training set. The possible splits of the examples associated with a node are evaluated by all the processors, and a global communication is performed at the end to find the global values of the splitting criteria and, from these, the best split. They implemented this parallelization on a 16-node IBM SP2 Model 9076 using MPI. Each node in the multiprocessor is a 370 Node consisting of a POWER1 processor running at 62.5MHz with 128MB of RAM, and the nodes communicate with each other through the High-Performance Switch with HPS-tb2 adapters. The speedups on 1.6M examples were 1, 1.9, 3.7 and 5.7 on 2, 4, 8 and 16 nodes respectively. On the other hand, [40] split the data by attributes: each processor keeps in its memory only the values of the set of attributes assigned to it and the class values. During the evaluation of the possible splits, each processor is responsible only for the evaluation of its own attributes. Because the evaluation of continuous attributes requires more processing than the evaluation of discrete attributes, this parallelism still suffers from load imbalance.

Hybrid parallelism combines horizontal or vertical data distribution with task parallelism. For the nodes covering a significant number of examples, data parallelism is used to avoid the problems of load imbalance; for the nodes covering fewer examples, one of the processors continues the construction of the tree rooted at the node alone (task parallelism). Two parallel decision tree construction algorithms using hybrid parallelism are described in [79, 59]. [79] evaluated their hybrid algorithm on the same distributed environment and the same data set used in [77].
The speedups on 1.6M examples were 2, 3.9, 7.4 and 13 on 2, 4, 8 and 16 nodes respectively, which is significantly better than both data parallelism and task parallelism.

¹ http://archive.ics.uci.edu/ml/support/Census+Income
² http://archive.ics.uci.edu/ml/datasets/Letter+Recognition
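A minimal sketch of the attribute-parallel (vertically partitioned) split evaluation of [40] discussed above: each simulated processor computes the gain only for its own attributes, and a global reduction picks the best split. The synthetic categorical data, the partition into three blocks, and the plain information-gain criterion (rather than C4.5's gain ratio) are assumptions for illustration; numpy only.

```python
# Attribute-parallel evaluation of the best split for one decision node.
import numpy as np

rng = np.random.default_rng(1)
X = rng.integers(0, 4, size=(500, 12))   # categorical attributes
y = rng.integers(0, 2, size=500)         # class labels
P = 3                                    # number of simulated processors

def entropy(labels):
    # H = -sum_v p_v log2 p_v over the class values present in `labels`
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(feature, labels):
    # reduction in entropy obtained by splitting on one attribute
    gain = entropy(labels)
    for value in np.unique(feature):
        mask = feature == value
        gain -= mask.mean() * entropy(labels[mask])
    return gain

# Each processor evaluates a disjoint set of attributes and reports its best.
attribute_blocks = np.array_split(np.arange(X.shape[1]), P)
local_best = []
for attrs in attribute_blocks:
    gains = [(information_gain(X[:, a], y), a) for a in attrs]
    local_best.append(max(gains))

# Global reduction: the attribute with the highest gain becomes the split node.
best_gain, best_attr = max(local_best)
print(f"split on attribute {best_attr} (gain {best_gain:.4f})")
```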
3.3 Naïve Bayes

Naïve Bayes is an important supervised classification method. It is based on Bayes' theorem with the assumption that the presence of a particular feature of a class is unrelated to the presence of any other feature. Naïve Bayes classifiers often work much better in many complex real-world situations than one might expect; [89] explained why Naïve Bayes still works well even with strong dependencies. General discussions of the Naïve Bayes method and its merits are given in [34, 37]. Given a training set with n instances in k classes, where each object is represented as a feature vector (x_1, ..., x_d) with a class label v, the general case of the Naïve Bayes classifier is described as follows:

1. Estimate P(y = v) directly from the proportion of class-v objects in the training set.

2. Estimate P(x_i = u | y = v). If x_i is categorical, taking only a few values, this estimation can be done simply as the fraction of "y = v" records that also have x_i = u. If x_i is continuous, a common strategy is to assume that x_i has a Gaussian probability distribution.

To predict the value y given the observations (x_1, ..., x_d) of a new object, compute

y = \arg\max_v P(y = v | x_1 = u_1, ..., x_d = u_d) = \arg\max_v P(y = v) \prod_{i=1}^{d} P(x_i = u_i | y = v)
The parallelization of Naïve Bayes is straightforward. Under the assumption that the components of x are independent, each of the distributions P(x_i = u | y = v) can be estimated separately. [33] ran an MPI implementation of the algorithm on a cluster with 6 nodes, where each node has a 1.6GHz CPU and 256MB physical memory and the nodes are connected by Ethernet. They evaluated the performance on the Reuters dataset³ with 9603 training and 3299 test documents; the speedup was [2 nodes, 1.3x], [4 nodes, 1.6x], [6 nodes, 1.8x]. On the other hand, [22] parallelized Naïve Bayes using the MapReduce model on a shared-memory system. They assign different sets of mappers to calculate these distributions, and the reducer then sums up the intermediate results to obtain the final values of the parameters. Their experiment was run on a 16-way Sun Enterprise 6000 running Solaris 10. They evaluated the average speedup on ten datasets from the UCI Machine Learning repository⁴ of different sizes (from 30,000 to 2,500,000 instances), which makes their report more convincing. The result showed that the speedup was [4 nodes, 4x], [8 nodes, 7.8x], [16 nodes, 13x].

³ http://www.daviddlewis.com/resources/testcollections/reuters21578/
⁴ http://www.ics.uci.edu/~mlearn/
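The following sketch illustrates the count-summation pattern behind the MapReduce parallelization of [22], under the assumption of categorical features: each shard computes class and feature-value counts, the counts are summed as a reducer (or an MPI allreduce) would do, and prediction uses the resulting log-probabilities. The data sizes and the Laplace smoothing are illustrative assumptions; numpy only.

```python
# Data-parallel Naive Bayes: local counts per shard, global summation, prediction.
import numpy as np

rng = np.random.default_rng(2)
N, D, V, K, P = 2000, 6, 5, 3, 4          # samples, features, values, classes, shards
X = rng.integers(0, V, size=(N, D))
y = rng.integers(0, K, size=N)

class_counts = np.zeros(K)
feature_counts = np.zeros((K, D, V))
for shard in np.array_split(np.arange(N), P):      # one iteration per "processor"
    Xs, ys = X[shard], y[shard]
    local_class = np.bincount(ys, minlength=K).astype(float)
    local_feat = np.zeros((K, D, V))
    for k in range(K):
        for d in range(D):
            local_feat[k, d] = np.bincount(Xs[ys == k, d], minlength=V)
    class_counts += local_class          # in MPI this would be an allreduce
    feature_counts += local_feat

# Turn the global counts into (Laplace-smoothed) probabilities.
log_prior = np.log(class_counts / class_counts.sum())
log_cond = np.log((feature_counts + 1) /
                  (feature_counts.sum(axis=2, keepdims=True) + V))

def predict(x):
    scores = log_prior + sum(log_cond[:, d, x[d]] for d in range(D))
    return int(np.argmax(scores))

print("predicted class of first instance:", predict(X[0]))
```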
3.4 k-means
k-means is an unsupervised method of cluster analysis that aims to partition n observations into k clusters, in which each observation belongs to the cluster with the nearest mean. [46] provided a nice historical background for k-means placed in the larger context of hill-climbing algorithms, and a detailed history of k-means along with descriptions of several variations is given in [35]. Given a set of d-dimensional vectors D = {x_i | i = 1, ..., N}, the algorithm is initialized by picking k "centroids" randomly or by some heuristic, and then proceeds by alternating between two steps until convergence:

1. Data assignment. In iteration t, each data point is assigned to its closest centroid:

S_i^{(t)} = \{ x_j : \|x_j - m_i^{(t)}\| \le \|x_j - m_{i^*}^{(t)}\| \text{ for all } i^* = 1, ..., k \}

The default measure of closeness is the Euclidean distance; in some applications, the KL-divergence is used to measure the distance between two data points representing discrete probability distributions.

2. Relocation of "means". Calculate the new centroid of the data points in each cluster:

m_i^{(t+1)} = \frac{1}{|S_i^{(t)}|} \sum_{x_j \in S_i^{(t)}} x_j
There is no guarantee that the algorithm will converge to the global optimum, and the result may depend on the initial clusters; therefore, it is common to run the algorithm multiple times with different initial centroids. There has been some work studying the advantages of parallelism in the k-means procedure. Parallel k-means has been studied by [32, 80, 88, 57] for very large databases. The speedup and scaleup variation with respect to the number of documents (vectors), the number of clusters and the dimension of each document has been studied in [32]. The parallel k-means algorithm in master-slave mode is described as follows:

1. Partition the dataset into P blocks D_1, ..., D_P; each processor handles roughly N/P instances.

2. Processor P_0 builds the initial k centroids (m_1, ..., m_k)_global and broadcasts them to all processors.

3. Processor P_r reads the part D_r of the dataset it is responsible for and determines the centroid to which each of its x_i is closest using the global parameters, one x_i at a time. Then P_r computes its local centroids (m_1, ..., m_k)_local and sends them to P_0.

4. After P_0 collects all local centroids, it computes the new global centroids and broadcasts them to all processors.
Steps 3 and 4 are iterated until convergence. [80] tested this k-means clustering algorithm on a PC cluster with 10 MBit Ethernet. The data set consists of 100,000 objects in 20 clusters, each with 20 continuous attributes; they were able to achieve about 90% efficiency for configurations of up to 32 processors. [32] implemented an SPMD version of this algorithm using MPI on an IBM POWERparallel SP2 with a maximum of 16 nodes. Each node in the multiprocessor is a Thin Node 2 consisting of an IBM POWER2 processor running at 160 MHz with 256 megabytes of main memory; the processors all run AIX level 4.2.1 and communicate with each other through the High-Performance Switch with HPS-2 adapters. They report three sets of experiments using artificially generated data sets, varying N, d, and k, respectively. They observed a speedup of 15.62 on 16 processors for the largest data set with n = 2^21, and a flattened speedup of 6.22 on 16 processors for n = 2^11. By studying the performance on different data sets, they also found that the speedups are essentially independent of d and k. They also reported that their implementation of parallel k-means has linear scaleup in n and k, and surprisingly better than linear scaleup in d. Moreover, [9] used k-means as a test case to investigate scalable implementations of out-of-core I/O-intensive data mining algorithms on clusters of workstations. [22] developed the same algorithm using MapReduce and obtained linear speedup with up to 16 processors. It is well known that the k-means algorithm is a hard-threshold version of the expectation-maximization (EM) algorithm; meanwhile, the EM algorithm has a natural connection to Gibbs sampling methods for Bayesian inference, which are widely used in mixture models. One can envision that the EM algorithm and Gibbs sampling can be effectively parallelized using essentially the same strategy as that used in parallel k-means.
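A numpy-only sketch of the master-slave k-means above, with a loop over shards standing in for the P processors: each shard accumulates per-cluster sums and counts against the current global centroids, and the combined statistics (an allreduce in MPI) yield the new global centroids. The synthetic data and the fixed number of iterations are assumptions for illustration.

```python
# Parallel k-means: local per-cluster sums and counts, then a global update.
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(3000, 2))
K, P, iters = 4, 4, 10

centroids = X[rng.choice(len(X), K, replace=False)]    # step 2: initial centroids
shards = np.array_split(np.arange(len(X)), P)          # step 1: partition

for _ in range(iters):
    sums = np.zeros((K, X.shape[1]))
    counts = np.zeros(K)
    for idx in shards:                                  # step 3: local work
        pts = X[idx]
        d = np.linalg.norm(pts[:, None, :] - centroids[None, :, :], axis=2)
        assign = d.argmin(axis=1)
        for k in range(K):
            sums[k] += pts[assign == k].sum(axis=0)     # local partial sums
            counts[k] += (assign == k).sum()
    # step 4: combine local statistics (allreduce) and update global centroids
    nonempty = counts > 0
    centroids[nonempty] = sums[nonempty] / counts[nonempty, None]

print("final centroids:\n", centroids)
```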
3.5 Expectation-Maximization

In statistics, the Expectation-Maximization (EM) algorithm is used for finding maximum likelihood estimates of parameters in probabilistic models that depend on latent variables [31, 51]. It iteratively alternates between an expectation (E) step and a maximization (M) step. The EM algorithm has become a popular tool in data clustering and mixture estimation problems involving incomplete data [65]. In document classification, labeling documents is usually expensive, while unlabeled documents are commonly available; by applying the EM algorithm, we can use the unlabeled documents to augment the available labeled documents in the training process [70]. Given a likelihood function L(θ|x, z), where θ is the parameter vector, x is the observed data and z represents the latent variables or incomplete data, the maximum likelihood estimate (MLE) is determined by the marginal likelihood of the observed data L(θ|x); however, this quantity is often intractable. The EM algorithm seeks to find the MLE by iteratively applying the following two steps:

E-step: In iteration t, calculate the expected value of the log-likelihood function with respect to the current estimate of the parameters θ^{(t)}:

Q(θ|θ^{(t)}) = E_{z|x,θ^{(t)}}[\log L(θ|x, z)]
M-step: Find the parameter that maximizes this quantity Q(θ|θ^{(t)}):

θ^{(t+1)} = \arg\max_θ Q(θ|θ^{(t)})

The new parameters are then used to determine the distribution of the latent variables in the next E-step. Convergence is assured since the algorithm is guaranteed to increase the likelihood at each iteration. Like k-means, the EM algorithm only yields a locally optimal solution. The parallel EM algorithm in the SPMD model can be described as follows:

1. Partition the dataset into P blocks D_1, ..., D_P; each processor handles roughly N/P instances.

2. Processor P_0 builds the initial global parameters θ_global^{(0)} and broadcasts them to all processors.

3. Processor P_r reads the part D_r of the dataset it is responsible for, and iterates steps 4-6 until convergence.

4. E-step: each processor estimates the quantity Q_r(θ|θ_global^{(t)}, D_r).

5. M-step: each processor re-estimates its own local parameters θ_r^{(t+1)} by maximizing Q_r(θ|θ_global^{(t)}, D_r).

6. A collective communication operation obtains the new global parameters θ_global^{(t+1)} from the local parameters θ_1^{(t+1)}, ..., θ_P^{(t+1)}, and then returns θ_global^{(t+1)} to all processors.

Many parallel implementations of the EM algorithm in different domains have been proposed in recent years. [58] employed the SPMD model of parallel EM for text classification on the PIRUN Cluster, which consists of 72 nodes connected with a Fast Ethernet Switch 3COM SuperStackII. Each node is a 500 MHz Pentium III with 128 MB of memory running Linux, and the processors communicate with each other using MPI. They ran three groups of experiments using 10000, 5000 and 2500 documents drawn from the 20 Newsgroups dataset⁵. They claimed that their parallel algorithm yields better performance for the larger data sets: it achieved speedups of [2 nodes, 1.97x], [4 nodes, 3.72x], [8 nodes, 7.16x] and [16 nodes, 12.16x] on the largest set with 10000 documents. On the smaller sets of 5000 and 2500 documents, the speedup curves tend to drop below the linear curve. [43] reported a hybrid-memory parallelization of the EM algorithm using the FREERIDE middleware [55]. Their experiments were conducted on a cluster with 6 nodes, where each node has a 700 MHz Pentium CPU and 1 GB memory and the nodes are connected through Myrinet LANai 7.0. In the experiments, they generated three data sets of different sizes, each containing millions of 10-dimensional points to be clustered; each data set was partitioned into thousands of chunks to make it disk-resident. They reported average speedups over the three data sets of [2 nodes, 1.76x],
[4 nodes, 3.47x] and [8 nodes, 6x] with 1 thread. When increasing the number of threads up to 3, their parallelization demonstrated an additional 10% speedup because the reduction object is small enough to be cached in some of the instances; however, adding a 4th thread creates CPU contention and thus does not result in additional speedup. [22] also showed that the average speedup of parallel EM using the MapReduce model on a 16-core server was [2 nodes, 2x], [4 nodes, 3.8x], [8 nodes, 7x], [16 nodes, 10x]; the reason for the sub-unity slope is the increasing communication overhead.

⁵ http://people.csail.mit.edu/~jrennie/20Newsgroups/
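A sketch of the SPMD parallel EM above for a one-dimensional Gaussian mixture, under the assumption that the local E-step statistics (responsibility sums, weighted sums and squared sums) are what gets combined in step 6; the shard loop stands in for the processors and the synthetic data are illustrative.

```python
# Parallel EM for a 1-D Gaussian mixture: local E-step statistics, global M-step.
import numpy as np

rng = np.random.default_rng(4)
X = np.concatenate([rng.normal(-2, 1, 1500), rng.normal(3, 0.5, 1500)])
K, P, iters = 2, 4, 20

weights = np.full(K, 1.0 / K)
means = rng.normal(size=K)
variances = np.ones(K)
shards = np.array_split(X, P)

def gaussian(x, mean, var):
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)

for _ in range(iters):
    Nk = np.zeros(K)
    Sx = np.zeros(K)
    Sxx = np.zeros(K)
    for xs in shards:                               # local E-step per "processor"
        resp = weights * gaussian(xs[:, None], means, variances)
        resp /= resp.sum(axis=1, keepdims=True)
        Nk += resp.sum(axis=0)                      # local sufficient statistics,
        Sx += (resp * xs[:, None]).sum(axis=0)      # summed as an allreduce would
        Sxx += (resp * xs[:, None] ** 2).sum(axis=0)
    # global M-step from the combined statistics
    weights = Nk / Nk.sum()
    means = Sx / Nk
    variances = Sxx / Nk - means ** 2

print("means:", means, "variances:", variances)
```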
3.6 PageRank

PageRank is a link analysis algorithm [14], named after Larry Page and used by the Google Internet search engine, that assigns a numerical weight to each element of a hyperlinked set of documents. The intuition behind PageRank is that a web page is important if several other important web pages point to it. Generally, this algorithm can be used in any graph with link structure to measure the relative importance of the nodes. In addition to ranking web pages, PageRank has recently been proposed as a replacement for the traditional Institute for Scientific Information impact factor [29]. In text mining, PageRank has also been used to automatically rank WordNet synsets according to how strongly they possess a given semantic property, such as positivity or negativity [36]. Even in ecosystems, a modified version of PageRank may be used to determine species that are essential to the continuing health of the environment [2]. Given a directed graph G = (V, E) consisting of a set of pages V (vertices) and a set of directed links E (edges) that connect pages, the PageRank score vector of the pages can be worked out as follows:

r = α · T · r + (1 - α) · \frac{1}{N} · 1_N

where α is the damping factor, which is generally set around 0.85 [14], and T is the transition matrix:

T(p, q) = \begin{cases} 0 & \text{if } (q, p) \notin E \\ \frac{1}{\text{outdegree}(q)} & \text{if } (q, p) \in E \end{cases}

Parallelization of PageRank has great significance: since the connection graph of the real Web usually has one billion or more links, a system that can compute results within minutes is a desideratum. Fortunately, parallelizing PageRank is not a new problem, since PageRank can be computed by basic linear algebra operations. One approach is to solve in parallel the linear system (I - U) · r = b, where U = α · T and b = (1 - α) · \frac{1}{N} · 1_N. Linear system solvers are available from well-developed scientific computing packages, such as the Portable, Extensible Toolkit for Scientific Computation (PETSc) [8, 7, 6]. The complete distributed algorithm is given as follows:

1. Partition the matrices U and b into P sub-matrices {U_1, ..., U_P} and {b_1, ..., b_P} by dividing the rows.
2. Processor P_0 builds the initial global rank vector r_global^{(0)} and broadcasts it to all processors.

3. In iteration k, each processor P_i uses a linear solver, such as Power iterations, Jacobi iterations or Krylov subspace methods, to calculate the local vector r_i. With Jacobi iterations, r_i^{(k)} = U_i · r_global^{(k-1)} + b_i.

4. Concatenate the local rank vectors into a global rank vector by collective communication: r_global^{(k)} = [r_1^{(k)}, r_2^{(k)}, ...]^T.

Steps 3 and 4 are iterated until convergence. [42] did experiments on their parallel computer, a Beowulf cluster of RLX blades connected in a star topology with gigabit Ethernet. They had seven chassis, each composed of 10 dual-processor Intel Xeon blades with 4 GB of memory (140 processors and 280 GB of memory in total). Each blade inside a chassis was connected to a gigabit switch and the seven chassis were all connected to one switch. Their parallel PageRank codes use PETSc to implement basic linear algebra operations and basic iterative procedures on parallel sparse matrices. Their experiment on a dataset with 1.4 billion nodes took 35.5 minutes (2128 secs) for PageRank and 28.2 minutes (1690 secs) for BiCGSTAB on the full cluster of 140 processors, while the most efficient implementation of the serial PageRank algorithm took 12.5 hours on this graph using a quad-processor Alpha 667MHz server and approximately 10 hours using an 800MHz Itanium [15]. They studied the parallel performance of many linear solvers, including basic Power iterations, Jacobi iterations, Generalized Minimal Residual (GMRES), Biconjugate Gradient (BiCG) and Biconjugate Gradient Stabilized (BiCGSTAB). They argued that BiCGSTAB and GMRES have the highest rate of convergence; nevertheless, their actual runtime can be longer than the runtime for simple Power iterations. They also showed experimentally that when the communication and work load balance are approximately preserved, increasing the number of processors leads to a smaller computation time, but that the speedup eventually saturates. Another experiment, by [62], was run on a PC cluster of eight Opteron 240 machines, networked via Gigabit Ethernet and running the Linux operating system. Each machine is equipped with 3GB of main memory and a UW-SCSI hard disk. Their proposed parallel algorithm was written in C using the standard MPICH v1.2.5 library. They used a web graph derived from a crawl during January 2003 within the Thailand (.th) domain, which contained around 10.9 million web pages and 97 million links. They reported speedups of [2 nodes, 1.5x], [4 nodes, 2x], [8 nodes, 2.9x]. They also created additional artificial sets of web graphs by concatenating several copies of the base graph and connecting those copies by rerouting some of the links. Their experiment on this artificial dataset (roughly 174.4 million web pages, 1.55 billion links) gave [2 nodes, 1.9x], [4 nodes, 3.8x], [8 nodes, 6.5x]. The slope of their speedup curves gets closer to the linear one as the size of the virtual data becomes larger.
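A small numpy sketch of the row-partitioned Jacobi iteration above: the rows of U and b are split across simulated processors, each computes its slice of the new rank vector from the previous global vector, and the slices are concatenated as an allgather would do. The random 8-node graph is an assumption for illustration.

```python
# Row-partitioned PageRank via Jacobi iterations on a small dense example graph.
import numpy as np

rng = np.random.default_rng(5)
N, P, alpha, iters = 8, 2, 0.85, 50

# Random adjacency matrix; T(p, q) = 1/outdegree(q) if q links to p.
A = (rng.random((N, N)) < 0.3).astype(float)
np.fill_diagonal(A, 0)
outdeg = np.maximum(A.sum(axis=0), 1)
T = A / outdeg

U = alpha * T
b = np.full(N, (1 - alpha) / N)
r = np.full(N, 1.0 / N)                       # initial global rank vector

row_blocks = np.array_split(np.arange(N), P)  # step 1: partition the rows
for _ in range(iters):
    local = []
    for rows in row_blocks:                   # step 3: local Jacobi update
        local.append(U[rows] @ r + b[rows])
    r = np.concatenate(local)                 # step 4: allgather into global r

print("PageRank scores:", np.round(r, 4))
```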
3.7 Support Vector Machine

The Support Vector Machine (SVM) is a supervised learning method used for binary classification [13, 25, 49]. Intuitively, the aim of the SVM is to find the optimal separating hyperplane by maximizing the margin between the two classes, which offers the best generalization ability for future data. When the training data are not linearly separable, a
kernel function k(\vec{x}_i, \vec{x}_j) can be used to define a variety of nonlinear relationships between its inputs. This allows the algorithm to fit the maximum-margin hyperplane in a transformed feature space. Much work in recent years has gone into the study of different kernels for SVM classification [76]. Given a training set with N instances, where each object is represented as a feature vector \vec{x}_i with a class label y_i ∈ {1, -1}, the general SVM training algorithm can be summarized as follows:

1. Choose a kernel function k(\vec{x}_i, \vec{x}_j).

2. Maximize the function below, subject to α_i ≥ 0 and \sum_{i=1}^{N} α_i y_i = 0:

W(\vec{α}) = \sum_{i=1}^{N} α_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} α_i α_j y_i y_j k(\vec{x}_i, \vec{x}_j)

where the α_i are non-negative Lagrange multipliers indicating the "support level" of instance \vec{x}_i for the hyperplane; α_i = 0 means that removing \vec{x}_i from the training set does not affect the position of the hyperplane.

3. The bias b is found as follows:

b = -\frac{1}{2} \left[ \min_{i|y_i=1} \sum_{j=1}^{N} α_j y_j k(\vec{x}_i, \vec{x}_j) + \max_{i|y_i=-1} \sum_{j=1}^{N} α_j y_j k(\vec{x}_i, \vec{x}_j) \right]

4. Given a new object \vec{z}, the optimal α_i go into the decision function: D(\vec{z}) = \text{sign}\left( \sum_{i=1}^{N} α_i y_i k(\vec{x}_i, \vec{z}) + b \right)

There are several important extensions of the above basic formulation of the SVM. The "soft margin" idea was introduced to extend the SVM algorithm [83] so that the hyperplane allows a few noisy data points to exist on the wrong side of the margin. To solve problems that involve more than two classes, we can repeatedly use one of the classes as the positive class and the rest as the negative class to train several SVM models, which is known as the one-vs-all method. Moreover, the SVM can easily be extended to perform regression analysis [83]. The core of SVM training is a quadratic programming (QP) problem. Although several approaches for accelerating the QP, such as "chunking" [13, 56], Sequential Minimal Optimization (SMO) [73] and "shrinking" [56], have been proposed, improving compute speed through parallelization is difficult due to dependencies between the computation steps. [23] used a mixture of several SVMs, each of which has a weight and is trained on only a part of the data set. The training method can be implemented in a master-slave model as follows:

1. Partition the dataset into P blocks D_1, ..., D_P; each processor handles roughly N/P instances.

2. Processor P_r reads the part D_r of the dataset it is responsible for and builds a local SVM S_r.
Figure 2: Schematic of a binary Cascade architecture. The data are split into subsets and each one is evaluated individually for support vectors in the first layer. The results are combined two-by-two and entered as training sets for the next layer. The resulting support vectors are tested for global convergence by feeding the result of the last layer together with the non-support vectors into the first layer. TD: Training data, SVi: Support vectors produced by optimization i.

3. Processor P_0 trains the weight matrix w ∈ R^{P×N} by minimizing the cost function

C = \sum_{i=1}^{N} \left[ \tanh\left( \sum_{r=1}^{P} w_{ri} S_r(\vec{x}_i) \right) - y_i \right]^2
where S_r(\vec{x}_i) is the output of S_r given input \vec{x}_i. Their experiment on 100,000 examples showed that even on a single processor the training time decreased from 3231 minutes to 237 minutes by using the mixture of SVMs instead of a single SVM. They claimed that the reason is that their algorithm scales linearly with the number of training examples, whereas the standard SVM complexity scales much closer to O(N^3). They also trained the mixture of SVMs on 50 machines, and the time decreased to 73 minutes (about a 3.2x speedup); it is not clear what parallel environment they used. In addition, they observed a significant improvement in the generalization of the mixture of SVMs. Another notable parallelization is the Cascade SVM proposed in [45], which filters the non-support vectors out of the optimization in a hierarchical fashion, as shown in Figure 2. The Cascade provides several advantages over a single SVM because it can reduce compute as well as storage requirements. Their experiment was on a Linux cluster with 16 nodes, where each node has dual AMD 1.8GHz processors and 2GB RAM, and the data set consists of 1,016,736 vectors. A Cascade with 1-5 layers was executed to find the global solution; the fully converged solution was found in 3 iterations, and the average speedup compared to the standard SVM was 1.5x, 3.3x, 4.0x and 4.5x on 2, 4, 8 and 16 nodes. The main limitation of this algorithm is that it only works on 2^k processors, and in higher layers it uses fewer processors to do the optimization; this is why the acceleration saturates at a relatively small number of layers. Moreover, [19] developed a parallel SVM algorithm (PSVM), which reduces memory use by performing a row-based, approximate matrix factorization and loads only essential data onto each machine to perform the parallel computation. Their experiments showed that PSVM enjoys linear speedup when the number of machines is up to 30, and gave 90x and 120x speedups on 150 and 250 machines, respectively. [17] described a solver for SVM training running on a GPU, using Platt's Sequential Minimal Optimization algorithm; they implemented the parallel SVM using MapReduce and CUDA. The experiment was conducted on an Nvidia GeForce 8800 GTX and achieved an 81-138x speedup compared to LIBSVM on an Intel Core 2 Duo 2.66 GHz processor.
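A single-machine sketch of the Cascade SVM filtering idea, assuming scikit-learn's SVC as the base solver: the data are split into subsets, only the support vectors of each local SVM survive, and the surviving sets are merged two by two and retrained, mirroring Figure 2. The synthetic two-Gaussian data, the four first-layer subsets and the RBF kernel with C = 1 are illustrative assumptions, and this single-pass cascade omits the convergence test over the full training set.

```python
# Cascade-style filtering of non-support vectors (single-machine simulation).
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(6)
X = np.vstack([rng.normal(-1, 1, (400, 2)), rng.normal(1, 1, (400, 2))])
y = np.concatenate([-np.ones(400), np.ones(400)])

def support_set(Xs, ys):
    # train a local SVM and return only its support vectors
    clf = SVC(kernel="rbf", C=1.0).fit(Xs, ys)
    return Xs[clf.support_], ys[clf.support_]

# First layer: independent SVMs on 4 disjoint subsets of the training data.
perm = rng.permutation(len(X))
layer = [support_set(X[idx], y[idx]) for idx in np.array_split(perm, 4)]

# Later layers: merge the surviving support vectors two by two and retrain.
while len(layer) > 1:
    merged = []
    for i in range(0, len(layer), 2):
        Xs = np.vstack([layer[i][0], layer[i + 1][0]])
        ys = np.concatenate([layer[i][1], layer[i + 1][1]])
        merged.append(support_set(Xs, ys))
    layer = merged

final_X, final_y = layer[0]
final_model = SVC(kernel="rbf", C=1.0).fit(final_X, final_y)
print("support vectors in final model:", len(final_model.support_))
```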
3.8 Latent Dirichlet Allocation

Latent Dirichlet Allocation (LDA) is a Bayesian network that generates a document using a mixture of topics [12]. It assumes a generative probabilistic model in which documents \vec{d} are represented as random mixtures over latent topics \vec{z}, where each topic z is characterized by a probability distribution over words \vec{w}. The complete generative process and the equivalent graphical model are shown below.

φ_z ∼ Dirichlet(β)
θ_d ∼ Dirichlet(α)
z_{di} | θ_d ∼ Multinomial(θ_d)
w_{di} | z_{di}, φ_{z_{di}} ∼ Multinomial(φ_{z_{di}})
LDA can capture the heterogeneity in grouped data which exhibit multiple patterns. In the recent past, LDA has emerged as an attractive framework to model, visualize [53] and summarize large document collections in a completely unsupervised fashion. Several extensions of the LDA model have been proposed, such as the Topics Over Time model that permits us to analyze the popularity of various topics as a function of time [85], Hidden Markov-LDA that integrates topic modeling with syntax [48], and the Author-Persona-Topic model that models words and their authors [66]. In each case, graphical model structures are carefully designed to capture the relevant structure and co-occurrence dependencies among the data. Although LDA is still a relatively simple model, exact inference is generally intractable. The solution is to use approximate inference algorithms, such as mean-field variational EM [12] and Gibbs sampling [47]. Gibbs sampling is a typical MCMC method for Bayesian inference; it directly follows the generative process, which makes it easier to understand and implement compared to variational EM. The general description of the Gibbs sampling estimator for standard LDA is as follows:

1. Assign an initial (random) topic z_{di} to every word in every document.

2. In each iteration, update the topic assignment z_{di} by sampling from the full conditional posterior distribution:

P(z_{di} | \vec{z}_{-di}, \vec{w}, α, β) ∝ (n_{d_{di}, z_{di}} + α_{z_{di}}) × \frac{n_{z_{di}, w_{di}} + β_{w_{di}}}{\sum_{v=1}^{V} (n_{z_{di}, v} + β_v)}

where n_{d,z} is the number of tokens in document d assigned to topic z, n_{z,v} is the number of tokens of word v assigned to topic z, and T is the number of topics.

3. After the burn-in period the sampling algorithm gives direct estimates of z for every word. The document-topic distributions θ and topic-word distributions φ can be obtained from

θ_{dz} = \frac{n_{d,z} + α_z}{\sum_{z=1}^{T} (n_{d,z} + α_z)}, \qquad φ_{vz} = \frac{n_{z,v} + β_v}{\sum_{v=1}^{V} (n_{z,v} + β_v)}

respectively.

Figure 3: Graphical representation of LDA. LDA is a hierarchical generative model that describes the complete generative process of a given corpus. It encapsulates three plates, or repetitive processes. The outermost plate describes the generation process for each document; this repeats D times, where D is the number of documents in the corpus. The embedded smaller plate shows the generation process for each word in a document; this process repeats N_d times, once for each word in document d. Another small plate on the right side denotes the generation of all T topics, each of which is described by a multinomial distribution over the vocabulary. In general, T, α and β have to be hand-tuned for each data set.
In each iteration, the sampler has to go through the whole corpus and assign a topic to each word, which gives Gibbs sampling poor efficiency. When dealing with a large-scale document collection, standard serialized Gibbs sampling is computationally infeasible. Prior work has explored multiple alternatives for speeding up LDA, including parallelizing both Gibbs sampling and variational EM across multiple machines. [5] presented an asynchronous distributed Gibbs sampling algorithm. [69] presented two synchronous methods, AD-LDA and HD-LDA, to perform distributed Gibbs sampling. From a data-flow perspective, AD-LDA is similar to the parallel EM described above. The AD-LDA algorithm works as follows:
1. Partition the dataset into P blocks D_1, ..., D_P. Processor P_r works with its own word content \vec{w}^{|r} and corresponding topic assignments \vec{z}^{|r}, and maintains local counts n_{d,z}^{|r} and n_{z,v}^{|r}.

2. Processor P_0 builds the initial global counts n_{z,v}^{|global} and broadcasts them to all processors.

Iterate until stop:

3. Each processor P_r samples every z_{di} ∈ \vec{z}^{|r} from the approximate posterior distribution:

P(z_{di} | \vec{z}_{-di}^{|r}, \vec{w}^{|r}, α, β) ∝ (n_{d_{di}, z_{di}}^{|r} + α_{z_{di}}) × \frac{n_{z_{di}, w_{di}}^{|global} + β_{w_{di}}}{\sum_{v=1}^{V} (n_{z_{di}, v}^{|global} + β_v)}

4. Each processor updates its local counts n_{d,z}^{|r} and n_{z,v}^{|r} according to the new topic assignments.

5. Use collective communication to obtain the new global counts n_{z,v}^{|global} from the local counts n_{z,v}^{|1}, ..., n_{z,v}^{|P}, and then return them to all processors.
[86] implemented the AD-LDA algorithm in both the MPI and MapReduce models and named it PLDA. They applied PLDA to document summarization on the Wikipedia dataset, which consists of 2,122,618 articles with 447,004,756 words. The experiments were conducted on 256 machines at Google's distributed data centers, where each machine is configured with a CPU faster than 2GHz and memory larger than 4GB. Their results showed that both MPI-PLDA and MapReduce-PLDA enjoy approximately linear speedup when the number of machines is up to 100. However, as the number of machines continues to increase, MPI-PLDA achieved speedups of [128 nodes, 94x] and [256 nodes, 169x], while MapReduce-PLDA yielded [128 nodes, 57x] and [256 nodes, 72x]. They claimed that in the absence of machine failures, MPI-PLDA is more efficient because no disk I/O is required between computational iterations; when the number of machines is large and the mean time to machine failure becomes a legitimate concern, the target application should either use MapReduce-PLDA or force checkpoints with MPI-PLDA. [21] applied MPI-PLDA to the Orkut data set for a community recommendation task; the data set consists of 492,104 users and 118,002 communities. The result showed that the speedup is approximately linear on up to 8 machines, but after that, adding more machines yields diminishing returns ([16 nodes, 7.45x], [32 nodes, 10.66x]), since communication time takes up more and more of the total running time. With 32 machines they reduced the training time from 8 hours to less than 46 minutes. Additionally, [68] built parallel implementations of the variational EM algorithm for LDA in a multiprocessor architecture as well as in a distributed setting. They used a Linux machine with four 2.40GHz CPUs sharing 4GB RAM for the shared-memory system, and a 96-node cluster, each node equipped with a Transmeta Efficeon TM8000 1.2GHz processor with 1MB of cache and 1GB RAM, for the distributed-memory system. They showed that the multiprocessor implementation achieved a speedup of only 1.85 from 1 to
4 threads, while the distributed implementation achieved a significantly higher speedup of 14.5 from 1 to 50 nodes. They claimed that the multiprocessor implementation may not scale to large collections, since it stores the entire data set in memory and suffers from read conflicts between the various threads.

Figure 4: Graphical representation of a linear-chain CRF in which the transition score depends on its neighboring observations.
3.9 Conditional Random Fields

Conditional Random Fields (CRFs) are a framework for building graphical probabilistic models to segment and label sequence data [60]. They inherit the characteristics of discriminative models and can encode non-independent features. Instead of modeling the subtle dependencies in the input space p(x), CRFs concentrate directly on modeling the conditional distribution p(y|x) between the observation data and the state sequence. Let G(V, F, E) be a factor graph, where V is the set of vertices connected by the edges in E and F represents the set of weights; y is indexed by the vertices and T = card(V). Suppose Y is the set of all possible state sequences, so that y ∈ Y. Given a vertex y_t, \tilde{y}_t is the set of all vertices tied to y_t and \tilde{x}_t is the set of all observations tied to y_t. Using Bayes' rule, the general conditional random field can be written in the form:

p(y|x) = \frac{p(x, y)}{\sum_{y \in Y} p(x, y)} = \frac{1}{Z(x)} \prod_{c \in C} \exp\left\{ \sum_{t=1}^{T} \sum_{k \in K} λ_k^c f_k^c(y_t, \tilde{y}_t^c, \tilde{x}_t^c) \right\}

where

Z(x) = \sum_{y \in Y} \prod_{c \in C} \exp\left\{ \sum_{t=1}^{T} \sum_{k \in K} λ_k^c f_k^c(y_t, \tilde{y}_t^c, \tilde{x}_t^c) \right\}

is a normalization term, C is a set of clique templates, and K is defined as the set of all state-state pairs and state-observation pairs. f_k^c(y_t, \tilde{y}_t^c, \tilde{x}_t^c) is the feature function weighted by λ_k^c [82, 81, 75]. In a specific problem, {f_k(y_t, \tilde{y}_t, \tilde{x}_t)} depends on the structure of the graph. Consider the linear-chain structure, which has been used in sequential data mining tasks like named-entity recognition [64] and part-of-speech tagging [60]: y_t is tied only to y_{t-1} and x_t, there is only one clique template, and the feature function can therefore be written as f_k(y_t, y_{t-1}, x_t). Parameter estimation of CRFs aims to determine the best parameters Λ = {λ_k^c} for given data sequences (x^i, y^i), i ∈ [1, s], by maximizing the conditional log-likelihood l(Λ) = \sum_{i=1}^{s} \log p(y^i | x^i). In general, numerical approaches like stochastic gradient
ascent, steepest ascent or quasi-Newton methods (BFGS) [11] are used to solve this optimization problem. A stochastic gradient training process can be described as follows:

1. Initialize Λ_0 as a set of random numbers.

2. For each given data sequence (x^i, y^i), calculate the marginal distributions p(y_t^i, y_t' | x^i, Λ_n) from the forward and backward recursions.

3. The gradient with respect to Λ_n can then be calculated by:

\frac{\partial l}{\partial λ_k^c} = \sum_{i=1}^{s} \sum_{t=1}^{T} f_k^c(y_t^{i}, \tilde{y}_t^{i,c}, \tilde{x}_t^{i,c}) - \sum_{i=1}^{s} \sum_{t=1}^{T} \sum_{y_t' \in \tilde{y}_t^{i,c}} f_k^c(y_t^{i}, \tilde{y}_t^{i,c}, \tilde{x}_t^{i,c}) \, p(y_t^i, y_t' | x^i, Λ_n) - \frac{λ_k^c}{σ^2}

4. Update the parameters: Λ_{n+1} = Λ_n + w∇(Λ_n), where w is the learning step.

5. If the stopping conditions are met, such as |∇(Λ_n)| < δ or n > n_{Max}, end the training; otherwise, set n ← n + 1 and go to step 2.

Finally, given a new observation sequence x, the most probable assignment is defined as y* = \arg\max_y p(y|x), which can be calculated by the Viterbi recursion. From the formulas above for the marginal distribution, the computational complexity is O(TL^2NG), where T is the length of a sequence, L is the number of labels, N is the number of training examples, and G the number of gradient computations. Due to the size of large training sets, parallelization is needed to improve performance. As reported in [72], a team from Tohoku University has implemented parallel training of CRFs. The main idea of the parallel algorithm is to divide the training dataset into P sub-datasets. A main process gathers the values calculated by the slave processes in order to obtain a global value of the gradient of l(Λ); after applying the optimization algorithm and updating l(Λ) for all processes, each process moves to the next step. Their parallel algorithm can be described as follows [72]:

1. Generate features with initial weights Λ = [λ_1, λ_2, ...] and let each process load its own partition D_i.

2. The root process broadcasts Λ to all parallel processes.

3. Each process P_i computes the local log-likelihood l_i and the local gradient vector [∂l/∂λ_1, ∂l/∂λ_2, ...]_i on D_i.

4. The root process gathers and sums all l_i and [∂l/∂λ_1, ∂l/∂λ_2, ...]_i by computing
l_{global} = \sum_i l_i, \qquad \left[ \frac{\partial l}{\partial λ_1}, \frac{\partial l}{\partial λ_2}, ... \right]_{global} = \sum_i \left[ \frac{\partial l}{\partial λ_1}, \frac{\partial l}{\partial λ_2}, ... \right]_i

to obtain the global log-likelihood and gradient.
5. The root process performs an L-BFGS optimization search to update the new feature weights Λ.

6. If iterations < m then go to step 2; stop otherwise.

Their test environment is a Cray XT3 system (an MPI system) with 180 AMD Opteron 2.4GHz processors and 8GB RAM each [72]. The results of the NP chunking and chunking tasks on the CoNLL2000-L dataset show that, on the one hand, CRF models have less prediction error than other models such as the SVMs of Kudo & Matsumoto; compared to the previous best system, their model reduces the error by 22.93% on NP chunking. On the other hand, also on CoNLL2000-L, the training time of a single process is over 61 times the time taken by 45 parallel processes. Similarly, they also ran a cross-validation test on the Wall Street Journal data set, which took 1h21' on 45 processors while it was estimated to take 56h on one processor; in this last experiment, the speedup ratio is 41.5. In addition, the speedup ratio is nearly linear in the number of parallel processes, as predicted.
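A sketch of the data-parallel training loop of [72]: each simulated process computes a local log-likelihood and gradient on its partition, the values are summed as the root's reduce would do, and the weights are updated. The per-shard computation here is a logistic-regression stand-in for the CRF forward-backward routine, and a plain gradient-ascent step replaces the L-BFGS search; both are assumptions made to keep the sketch short and runnable.

```python
# Data-parallel gradient aggregation in the style of steps 1-6 above.
import numpy as np

rng = np.random.default_rng(8)
N, D, P, steps, lr, sigma2 = 1200, 10, 4, 100, 0.1, 10.0
X = rng.normal(size=(N, D))
true_w = rng.normal(size=D)
y = (X @ true_w + 0.1 * rng.normal(size=N) > 0).astype(float)

def local_loglik_and_grad(w, Xs, ys):
    # stand-in for the per-shard CRF computation (forward-backward recursions)
    z = Xs @ w
    p = 1.0 / (1.0 + np.exp(-z))
    ll = np.sum(ys * np.log(p + 1e-12) + (1 - ys) * np.log(1 - p + 1e-12))
    grad = Xs.T @ (ys - p)
    return ll, grad

w = np.zeros(D)                                  # step 1: initial weights Lambda
shards = np.array_split(np.arange(N), P)
for step in range(steps):
    # step 2: the root broadcasts w; steps 3-4: local values are computed and summed
    ll_global, grad_global = 0.0, np.zeros(D)
    for idx in shards:
        ll, grad = local_loglik_and_grad(w, X[idx], y[idx])
        ll_global += ll                          # reduce of the log-likelihoods
        grad_global += grad                      # reduce of the gradient vectors
    grad_global -= w / sigma2                    # Gaussian prior / regularization term
    w += lr * grad_global / N                    # step 5: (gradient-ascent) update

print("final regularized log-likelihood:", round(ll_global, 2))
```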
4 An Overall Vision

In this section, we summarize the parallel data mining techniques described above in an overall vision. We draw analogies between the parallelizations of the different data mining algorithms and analyze their complexity, characteristics and parallelism. This offers high-level support for creating scalable data mining implementations in an effective and efficient way.
4.1 Performance Analysis

Table 1 shows the time complexity analysis for the nine algorithms reviewed above. In general, we assume that the input training data set has N instances of dimension D, the test data set has only one instance, and there are P processors for both parallel training and testing. For the CRF model, we assume the length of the given test sequence is L. In clustering and classification tasks, the number of expected classes or labels is K. For iterative algorithms, we assume I steps are needed until convergence. In collective communications such as Broadcast or Scatter, T is the transmission time for the model parameters, which accounts for a log(P) factor. The experimental results showed that many parallelized algorithms can achieve almost linear speedup when the number of machines is small, which agrees well with the theoretical analysis. However, when the number of machines increases beyond a threshold, the speedup slows down or even decreases. Many factors limit the speedup of a parallel algorithm; in the data mining scenario, two deserve particular attention: load balancing and communication overhead. Speedup is generally limited by the speed of the slowest node. Writing an algorithm that evenly distributes its workload across all the processors is known as load balancing. An unparallelizable serial component may be present within the parallel algorithm due to computational dependencies.
Table 1: Time complexity analysis in training and testing

Algorithm      Training (sequential)       Training (parallel)             Testing (sequential)   Testing (parallel)
k-NN           n/a                         n/a                             O(DN)                  O(DN/P + T)
Decision Tree  O(DN log N + N log² N)      O(D(N/P) log(N/P) + N log N)    O(K)                   n/a
Naïve Bayes    O(DN + DK)                  O(DN/P + DK)                    O(KD)                  O(K(D/P + T))
k-means        O(INDK)                     O(IDK(N/P + T))                 O(KD)                  O(K(D/P + T))
PageRank       O(IN²)                      O(I(N²/P + T))                  n/a                    n/a
EM             O(INDK)                     O(IDK(N/P + T))                 O(KD)                  O(K(D/P + T))
SVM            O(KN³)                      O(K²(N/P + T))                  O(N)                   O(N/P + T)
LDA            O(IDNK)                     O(IDK(N/P + T))                 O(KD)                  O(K(D/P + T))
CRF            O(INK²)                     O(INK²/P)                       O(LK²)                 O(LK²/P)
Such a serial component would allow only one processor to work on it while the other processors remain idle, as we saw in training step 3 of the mixture of SVMs. A similar bottleneck occurs if the data set does not have uniform density and the workload on each processor is not evenly distributed. Take the sparse connection matrix in the PageRank algorithm as an example: if one processor is assigned empty rows that require no computation, it nonetheless has to wait for the others to finish before it can enter the next iteration. Load balancing heuristics for SVM, CRF, and PageRank were proposed in [45], [72], and [42]. Communication overhead refers both to the absolute time spent in communication between machines and to the fraction of the total execution time that communication occupies. As pointed out in [30], since improvements in CPU performance outpace improvements in I/O and communication performance, communication cost increasingly dominates a parallel algorithm. To reduce communication overhead, parallel algorithm designers make the grain size as large as possible or avoid communication whenever possible. Several data mining works [86, 58] have listed improving disk I/O as their next target.
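As a rough illustration of these two limits (our own sketch, not taken from the cited papers), the snippet below combines Amdahl's law for an unparallelizable serial fraction with the O(N/P + T) cost pattern of Table 1; the concrete numbers are arbitrary and only meant to show how the speedup flattens as P grows.

```python
# Illustrative only: how a serial fraction and a per-iteration communication
# term T bound the speedup predicted by the Table 1 cost models.
def amdahl_speedup(p, serial_fraction):
    """Upper bound on speedup when a fraction of the work cannot be parallelized."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / p)

def table1_speedup(p, n, t_comm):
    """Speedup of a cost model O(N) vs O(N/P + T), e.g. one k-means iteration."""
    return n / (n / p + t_comm)

for p in (4, 16, 64, 256):
    print(p,
          round(amdahl_speedup(p, serial_fraction=0.05), 1),
          round(table1_speedup(p, n=1_000_000, t_comm=20_000), 1))
```

Increasing P eventually stops helping once N/P becomes comparable to the communication term T, which matches the qualitative flattening of the speedup curves described above.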
4.2 Parallelism Comparison

From all the parallelizations shown in Section 3, we observe that most of them use data parallelism (e.g. k-NN, Naïve Bayes, k-means, EM, ADLDA), while few use task parallelism (e.g. Decision Tree, HDLDA) or hybrid parallelism (e.g. Cascade SVM). Table 2 summarizes the feasible parallelism for all algorithms in Section 3. Since most work in data mining focuses on performing operations on a data set, it is not at all surprising that data parallelism is most commonly used. Data parallelism emphasizes the distributed nature of the data. It is achieved by distributing the training set among the processors, where each processor is responsible for a distinct set of examples. In the context of data mining, an example often has several dimensions representing its features, thus the data set is typically organized as a matrix. The distribution of the data can therefore be performed in two different ways: horizontally or vertically.
Table 2: Feasible parallelism in data mining algorithms. A filled dot indicates that the given parallelism is a feasible scheme for the algorithm, while a hollow dot means such a parallelization is elusive and hard to design. The table covers horizontal and vertical data parallelism, task parallelism, and hybrid parallelism for k-NN, Decision Tree, Naïve Bayes, k-means, PageRank, EM, SVM, LDA, and CRF.
Figure 5: Data parallelism: horizontal (left) and vertical (right)

As shown in Figure 5, horizontal distribution refers to the cases where different database records reside in different places, while vertical distribution refers to the cases where all the values of different features reside in different places. Among the algorithms mentioned in Section 3, k-NN, Decision Tree, Naïve Bayes, and k-means can be parallelized in a vertical manner. Typically, if the model assumes that features are independent of each other, or if the number of dimensions is greater than the number of examples (i.e., D ≫ N), then vertical distribution can be applied. However, if the features are not independent, as we saw in many hierarchical probabilistic models (e.g. LDA), splitting the data set by features is not a proper approach. On the other hand, horizontal distribution is the usual tactic in parallelization, since most probabilistic models assume that the observations are independent and identically distributed (i.i.d.). When the data are dependent on each other, or when computing the local estimator involves global parameters, this approach only yields an approximate solution, since the model parameters are estimated on local subsets of the data. Clearly, 10 probabilistic models estimated from local subsets of size N/10 would produce a lower-quality estimate than a single sequential model estimated from a data set of size N. A typical example was given in Section 3.8, where Gibbs sampling draws topic assignments from an approximate full conditional distribution P(z_{di} | z_{¬di}^{|r}, w^{|r}, α, β). Therefore, such a model is weaker in predictive power and lacks theoretical grounding. A more principled way to model
parallel processes is to build them directly into the probabilistic model, which makes the estimation in the parallelized model theoretically equivalent to the serialized model, or at least pseudo-sequential [71, 69]. Task parallelism, as opposed to data parallelism, focuses on distributing execution processes across different parallel computing nodes. The construction of decision trees introduced in Section 3.2 follows a task-parallelism approach: it can be viewed as dynamically distributing the decision nodes among the processors for further expansion. This approach generally suffers from poor load balancing, since it is hard to divide the work into tasks of equal complexity for the different processors. Data parallelism and task parallelism complement each other and can be used together to handle large-scale training.
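To make the horizontal/vertical distinction of Figure 5 concrete, here is a small numpy sketch (our own illustration, not taken from any cited implementation) that splits an N × D example matrix across P workers either by rows (examples) or by columns (features).

```python
import numpy as np

def partition(X, n_workers, scheme="horizontal"):
    """Split an (N, D) data matrix for data parallelism.

    horizontal: each worker receives a subset of the examples (rows),
                suitable when examples are treated as i.i.d.
    vertical:   each worker receives a subset of the features (columns),
                suitable when features are treated independently or D >> N.
    """
    axis = 0 if scheme == "horizontal" else 1
    return np.array_split(X, n_workers, axis=axis)

X = np.random.rand(1000, 40)          # N = 1000 examples, D = 40 features
rows = partition(X, 4, "horizontal")  # 4 chunks of shape (250, 40)
cols = partition(X, 4, "vertical")    # 4 chunks of shape (1000, 10)
```

Horizontal chunks keep every feature of a subset of examples, which matches the i.i.d. assumption; vertical chunks keep every example of a subset of features, which is only safe when the model treats features independently.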
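For the task-parallel pattern, the following sketch dynamically schedules decision-node expansions onto a pool of workers; split_node and the node objects are hypothetical placeholders, not the API of any cited system. The point is only that here the work queue, rather than the data, is what gets distributed, and idle time arises from the uneven cost of expanding individual nodes.

```python
from concurrent.futures import ProcessPoolExecutor, as_completed

def expand_tree(root, split_node, n_workers=4):
    """Task parallelism: each pending decision node is an independent task.

    split_node(node) -> list of child nodes (empty for a leaf); it is a
    user-supplied placeholder standing in for the actual split computation.
    """
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        pending = {pool.submit(split_node, root)}
        while pending:
            done = next(as_completed(pending))   # take whichever node finished first
            pending.remove(done)
            for child in done.result():          # dynamically schedule newly opened nodes
                pending.add(pool.submit(split_node, child))
```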
5 Conclusion

Data mining is a broad area that integrates techniques from several fields including machine learning, statistics, pattern recognition, artificial intelligence, and database systems. For the analysis of large volumes of data, speeding up and scaling up data mining implementations through parallel and distributed computing emerges as an effective solution. In this paper, we surveyed research on the use of parallel and distributed techniques for mining large-scale data. We motivated this field of research, gave formal definitions of the terms used herein, and presented a brief overview of several state-of-the-art data mining algorithms. We introduced their properties, applications to specific problems, parallel implementations, and experimental results. Most data mining algorithms can be parallelized by building a local model on each processor and then combining these models into a global model using collective communication. We discussed the theoretical complexity and the bottlenecks in parallelization, and gave a general view of the forms of parallelism used in this paper. We believe the ideas discussed and the references provided can inspire the interested reader to further studies in this field.
Acknowledgement We thank Prof. Dr. Amitava Gupta and Dr. Thomas Stibor for helpful comments and constructive criticism on a previous draft.
References [1] R. Agrawal, S. Ghosh, T. Imielinski, B. Iyer, and A. Swami. An interval classifier for database mining applications. In Proceedings of the International Conference on Very Large Data Bases, pages 560–560. Citeseer, 1992. 6 [2] S. Allesina and M. Pascual. Googling Food Webs: Can an Eigenvector Measure Species’ Importance for Coextinctions? 2009. 12
[3] G.S. Almasi and A. Gottlieb. Highly parallel computing. Benjamin/Cummings Pub. Co., 1994. 3 [4] S. Arya, D.M. Mount, N.S. Netanyahu, R. Silverman, and A.Y. Wu. An optimal algorithm for approximate nearest neighbor searching fixed dimensions. Journal of the ACM (JACM), 45(6):891–923, 1998. 5 [5] A. Asuncion, P. Smyth, and M. Welling. Asynchronous distributed learning of topic models. Advances in Neural Information Processing Systems, 20:20, 2008. 17 [6] Satish Balay, Kris Buschelman, Victor Eijkhout, William D. Gropp, Dinesh Kaushik, Matthew G. Knepley, Lois Curfman McInnes, Barry F. Smith, and Hong Zhang. PETSc users manual. Technical Report ANL-95/11 - Revision 3.0.0, Argonne National Laboratory, 2008. 12 [7] Satish Balay, Kris Buschelman, William D. Gropp, Dinesh Kaushik, Matthew G. Knepley, Lois Curfman McInnes, Barry F. Smith, and Hong Zhang. PETSc Web page, 2009. http://www.mcs.anl.gov/petsc. 12 [8] Satish Balay, William D. Gropp, Lois Curfman McInnes, and Barry F. Smith. Efficient management of parallelism in object oriented numerical software libraries. In E. Arge, A. M. Bruaset, and H. P. Langtangen, editors, Modern Software Tools in Scientific Computing, pages 163–202. Birkh¨auser Press, 1997. 12 [9] R. Baraglia, D. Laforenza, S. Orlando, P. Palmerini, and R. Perego. Implementation issues in the design of I/O intensive data mining applications on clusters of workstations. Parallel and distributed processing: 15 IPDPS 2000 workshops, Cancun, Mexico, May 1-5, 2000: proceedings, page 350, 2000. 10 [10] M. Berthold and DJ Hand. Intelligent data analysis: an introduction. Springer Verlag, 2003. 3 [11] D.P. Bertsekas. Nonlinear Programming. Athena Scientific, 2nd edition, 1999. 20 [12] D.M. Blei, A.Y. Ng, and M.I. Jordan. Latent dirichlet allocation. The Journal of Machine Learning Research, 3:993–1022, 2003. 16 [13] B.E. Boser, I.M. Guyon, and V.N. Vapnik. A training algorithm for optimal margin classifiers. In Proceedings of the fifth annual workshop on Computational learning theory, pages 144–152. ACM New York, NY, USA, 1992. 13, 14 [14] S. Brin and L. Page. The anatomy of a large-scale hypertextual Web search engine. Computer networks and ISDN systems, 30(1-7):107–117, 1998. 12 [15] A.Z. Broder, R. Lempel, F. Maghoul, and J. Pedersen. Efficient pagerank approximation via graph aggregation. Information Retrieval, 9(2):123–138, 2006. 13
[16] D.R. Butenhof. Programming with POSIX threads. Addison-Wesley Longman Publishing Co., Inc. Boston, MA, USA, 1997. 3 [17] B. Catanzaro, N. Sundaram, and K. Keutzer. Fast support vector machine training and classification on graphics processors. In Proceedings of the 25th international conference on Machine learning, pages 104–111. ACM, 2008. 16 [18] C. Chang, R. Ferreira, A. Sussman, and J. Saltz. Infrastructure for building parallel database systems for multi-dimensional data. In Proceedings of the Second Merged IPPS/SPDP (13th International Parallel Processing Symposium & 10th Symposium on Parallel and Distributed Processing). 6 [19] E.Y. Chang, K. Zhu, H. Wang, H. Bai, J. Li, Z. Qiu, and H. Cui. Psvm: Parallelizing support vector machines on distributed computers. Advances in Neural Information Processing Systems, 20, 2007. 16 [20] M.S. Chen, J. Han, and P.S. Yu. Data Mining: An Overview from a Database Perspective. IEEE Transactions on Knowledge and Data Engineering, 8(6):866– 883, 1996. 3 [21] W.Y. Chen, J. Luan, H. Bai, Y. Wang, and E.Y. Chang. Collaborative filtering for orkut communities: discovery of user latent behavior. In Proceedings of the 18th international conference on World wide web, pages 681–690. ACM New York, NY, USA, 2009. 18 [22] C.T. Chu, S.K. Kim, Y.A. Lin, Y.Y. Yu, G. Bradski, A.Y. Ng, and K. Olukotun. Map-reduce for machine learning on multicore. In Advances in Neural Information Processing Systems 19: Proceedings of the 2006 Conference, page 281. The MIT Press, 2007. 8, 10, 12 [23] R. Collobert, S. Bengio, and Y. Bengio. A parallel mixture of SVMs for very large scale problems. Neural computation, 14(5):1105–1114, 2002. 14 [24] M. Connor and P. Kumar. Parallel construction of k-nearest neighbour graphs for point clouds. In Eurographics Symposium on Point-Based Graphics, 2008. 6 [25] C. Cortes and V. Vapnik. Support-vector networks. Machine learning, 20(3):273– 297, 1995. 13 [26] L. Dagum and R. Menon. Open MP: An Industry-Standard API for SharedMemory Programming. IEEE Computational Science and Engineering, 5(1):46– 55, 1998. 3 [27] J. Darlington, YK Guo, J. Sutiwaraphun, and H.W. To. Parallel induction algorithms for data mining. Lecture Notes in Computer Science, 1280:437–446, 1997. 7 [28] J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. 4
[29] R.P. Dellavalle, L.M. Schilling, M.A. Rodriguez, H. Van de Sompel, and J. Bollen. Refining dermatology journal impact factors using PageRank. Journal of the American Academy of Dermatology, 57(1):116–119, 2007. 12 [30] JW Demmel, L. Grigori, M. Hoemmen, and J. Langou. Communication-avoiding parallel and sequential QR and LU factorizations. Submitted to SIAM Journal of Scientific Computing, 2008. 22 [31] A.P. Dempster, N.M. Laird, D.B. Rubin, et al. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B (Methodological), 39(1):1–38, 1977. 10 [32] I.S. Dhillon and D.S. Modha. A data-clustering algorithm on distributed memory multiprocessors. Lecture Notes in Computer Science, 1759:245–260, 2000. 9, 10 [33] W. Ding, S. Yu, Q. Wang, J. Yu, and Q. Guo. A Novel Naive Bayesian Text Classifier. In Proceedings of the 2008 International Symposiums on Information Processing-Volume 00, pages 78–82. IEEE Computer Society, 2008. 8 [34] P. Domingos and M. Pazzani. On the optimality of the simple Bayesian classifier under zero-one loss. Machine learning, 29(2):103–130, 1997. 8 [35] R.C. Dubes and A.K. Jain. Algorithms for clustering data, 1988. 9 [36] A. Esuli and F. Sebastiani. PageRanking WordNet synsets: An application to opinion mining. In ANNUAL MEETING-ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, volume 45, page 424, 2007. 12 [37] E. Fix and JL Hodges Jr. Discriminatory analysis. Nonparametric discrimination: Consistency properties. International Statistical Review/Revue Internationale de Statistique, 57(3):238–247, 1989. 8 [38] Message Passing Interface Forum. MPI: A message-passing interface standard. International Journal of Supercomputer Applications, 8:159–416, 1994. 4 [39] W.J. Frawley, G. Piatetsky-Shapiro, and C.J. Matheus. Knowledge discovery in databases: An overview. Ai Magazine, 13(3):57–70, 1992. 2 [40] A.A. Freitas and S.H. Lavington. Mining very large databases with parallel processing. Springer, 1998. 7 [41] V. Garcia, E. Debreuve, and M. Barlaud. Fast k nearest neighbor search using gpu. In CVPR Workshop on Computer Vision on GPU, pages 1–7, 2008. 6 [42] D. Gleich, L. Zhukov, and P. Berkhin. Fast parallel PageRank: A linear system approach. Yahoo! Research Technical Report YRL-2004-038, available via http://research. yahoo. com/publication/YRL-2004-038. pdf, 2004. 13, 22 [43] L. Glimcher and G. Agrawal. Parallelizing EM clustering algorithm on a cluster of SMPs. Lecture notes in computer science, pages 372–380, 2004. 11
[44] M. Goebel. A survey of data mining and knowledge discovery software tools. ACM SIGKDD Explorations Newsletter, 1(1):20–33, 1999. 3 [45] H.P. Graf, E. Cosatto, L. Bottou, I. Dourdanovic, and V. Vapnik. Parallel support vector machines: The cascade svm. Advances in neural information processing systems, 17(521-528):2, 2005. 15, 22 [46] R.M. Gray and D.L. Neuhoff. Quantization. IEEE transactions on information theory, 44(6):2325–2383, 1998. 9 [47] T.L. Griffiths and M. Steyvers. Finding scientific topics. Proceedings of the National Academy of Sciences, 101(Suppl 1):5228, 2004. 16 [48] T.L. Griffiths, M. Steyvers, D.M. Blei, and J.B. Tenenbaum. Integrating topics and syntax. Advances in neural information processing systems, 17:537–544, 2005. 16 [49] S.R. Gunn. Support vector machines for classification and regression. ISIS technical report, 14, 1998. 13 [50] T. Hastie, R. Tibshirani, J. Friedman, and J. Franklin. The elements of statistical learning: data mining, inference and prediction. The Mathematical Intelligencer, 27(2):83–85, 2005. 3 [51] R.V. Hogg, A.T. Craig, and J. McKean. Introduction to mathematical statistics. 1978. 10 [52] J.R.M. Hosking, E.P.D. Pednault, and M. Sudan. A statistical perspective on data mining. Future Generations in Computer Systems, 13(2):117–134, 1997. 3 [53] T. Iwata, T. Yamada, and N. Ueda. Probabilistic latent semantic visualization: topic model for visualizing documents. In Proceeding of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 363– 371. ACM, 2008. 16 [54] H. Jiawei and M. Kamber. Data mining: concepts and techniques. San Francisco, CA, itd: Morgan Kaufmann, 2001. 5 [55] R. Jin and G. Agrawal. A middleware for developing parallel data mining implementations. In Proceedings of the first SIAM conference on Data Mining. Citeseer, 2001. 6, 11 [56] T. Joachims. Making large-scale support vector machine learning practical, Advances in kernel methods: support vector learning, 1999. 14 [57] M.N. Joshi. Parallel K-Means Algorithm on Distributed Memory Multiprocessors. Computer, 2003. 9 [58] C. Kruengkrai and C. Jaruskulchai. A parallel learning algorithm for text classification. In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 201–206. ACM New York, NY, USA, 2002. 11, 22 28
[59] R. Kufrin. Decision trees on parallel processors. Parallel Processing for Artificial Intelligence 3. Elsevier Science, pages 279–306, 1995. 7 [60] J. Lafferty, A. McCallum, and F. Pereira. Conditional random felds: Probabilistic models for segmenting and labeling sequence data. Proceedings of the Eighteenth International Conference on Machine Learning, 1:282–289, 2001. 19 [61] J.A. Lee and M. Verleysen. Nonlinear dimensionality reduction. Springer Verlag, 2007. 3 [62] B. Manaskasemsak and A. Rungsawang. Parallel PageRank computation on a gigabit PC cluster. In Proceedings of the 18th International Conference on Advanced Information Networking and Applications, 2004. 13 [63] S. Manavski and G. Valle. CUDA compatible GPU cards as efficient hardware accelerators for Smith-Waterman sequence alignment. BMC bioinformatics, 9(Suppl 2):S10, 2008. 5 [64] Andrew McCallum and Wei Li. Early results for named entity recognition with conditional random fields, feature induction and web-enhanced lexicons. In In Seventh Conference on Natural Language Learning (CoNLL), 2003. 19 [65] G.J. McLachlan and T. Krishnan. The EM algorithm and extensions. Wiley New York, 1997. 10 [66] D. Mimno and A. McCallum. Expertise modeling for matching papers with reviewers. In Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, page 509. ACM, 2007. 16 [67] T.M. Mitchell. Machine learning. WCB. Mac Graw Hill, page 368, 1997. 2 [68] R. Nallapati, W. Cohen, and J. Lafferty. Parallelized variational EM for latent Dirichlet allocation: An experimental evaluation of speed and scalability. ICDMW, 7:349–354. 18 [69] D. Newman, A. Asuncion, P. Smyth, and M. Welling. Distributed inference for latent dirichlet allocation. Advances in Neural Information Processing Systems, 20:1081–1088, 2007. 17, 24 [70] K. Nigam, A.K. McCallum, S. Thrun, and T. Mitchell. Text classification from labeled and unlabeled documents using EM. Machine learning, 39(2):103–134, 2000. 10 [71] J. Ocenasek, J. Schwarz, and M. Pelikan. Design of multithreaded estimation of distribution algorithms. Lecture Notes in Computer Science, pages 1247–1258, 2003. 24 [72] X.H. Phan, L.M. Nguyen, Y. Inoguchi, and S. Horiguchi. High-Performance Training of Conditional Random Fields for Large-Scale Applications of Labeling Sequence Data. IEICE TRANSACTIONS on Information and Systems, E90D(1):13–21, 2007. 20, 21, 22 29
[73] J.C. Platt. Fast training of support vector machines using sequential minimal optimization. 1999. 14 [74] J.R. Quinlan. C4. 5: programs for machine learning. Morgan Kaufmann, 2003. 6 [75] Matthew Richardson and Pedro Domingos. Markov logic networks. Machine Learning, 62:107–136, 2006. 19 [76] B. Scholkopf and A.J. Smola. Learning with kernels. Citeseer, 2002. 14 [77] J. Shafer, R. Agrawal, and M. Mehta. SPRINT: A scalable parallel classifier for data mining. In Proceedings of the International Conference on Very Large Data Bases, pages 544–555. Citeseer, 1996. 7 [78] H. Shan, J.P. Singh, L. Oliker, and R. Biswas. A comparison of three programming models for adaptive applications on the Origin2000. Journal of Parallel and Distributed Computing, 62(2):241–266, 2002. 3 [79] A. Srivastava, E.H. Han, V. Kumar, and V. Singh. Parallel formulations of decision-tree classification algorithms. Data Mining and Knowledge Discovery, 3(3):237–261, 1999. 7 [80] K. Stoffel and A. Belkoniene. Parallel k/h-Means Clustering for Large Data Sets. Lecture notes in computer science, pages 1451–1454, 1999. 9, 10 [81] C. Sutton. Conditional probabilistic context-free grammars. Masters thesis, 2004. 19 [82] B. Taskar, P. Abbeel, and Koller D. Discriminative probabilistic models for relational data. In In Eighteenth Conference on Uncertainty in Artificial Intelligence (UAI02), 2002. 19 [83] V.N. Vapnik. The nature of statistical learning theory. Springer Verlag, 2000. 14 [84] G. Vasiliadis, S. Antonatos, M. Polychronakis, E.P. Markatos, and S. Ioannidis. Gnort: High performance network intrusion detection using graphics processors. In Proceedings of RAID, volume 5230, pages 116–134. Springer. 5 [85] X. Wang and A. McCallum. Topics over time: a non-markov continuous-time model of topical trends. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, page 433. ACM, 2006. 16 [86] Y. Wang, H. Bai, M. Stanton, W.Y. Chen, and E.Y. Chang. PLDA: Parallel Latent Dirichlet Allocation for Large-Scale Applications. AAIM, June, 2009. 18, 22 [87] B. Wilkinson and M. Allen. Parallel programming: techniques and applications using networked workstations and parallel computers. Prentice Hall, 1998. 3 [88] S. Xu and J. Zhang. A parallel hybrid web document clustering algorithm and its performance study. The Journal of Supercomputing, 30(2):117–131, 2004. 9 [89] H. Zhang. The optimality of naive Bayes. A A, 1(2):3. 8