Billion-Particle SIMD-Friendly Two-Point Correlation on Large-Scale HPC Cluster Systems

Jatin Chhugani†, Changkyu Kim†, Hemant Shukla⋆, Jongsoo Park†, Pradeep Dubey†, John Shalf⋆ and Horst D. Simon⋆

†Parallel Computing Lab, Intel Corporation    ⋆Lawrence Berkeley National Laboratory

Abstract—Two-point Correlation Function (TPCF) is widely used in astronomy to characterize the distribution of matter/energy in the Universe, and helps derive the physics that can trace back to the creation of the universe. However, it is prohibitively slow for current-sized datasets, and will remain a critical bottleneck as dataset sizes grow to billions of particles and more, which makes TPCF a compelling benchmark application for future exa-scale architectures. State-of-the-art TPCF implementations do not map well to the underlying SIMD hardware, and also suffer from load imbalance at large core counts. In this paper, we present a novel SIMD-friendly histogram update algorithm that exploits the spatial locality of histogram updates to achieve near-linear SIMD scaling. We also present a load-balancing scheme that combines a domain-specific initial static division of work with dynamic task migration across nodes to effectively balance computation across nodes. Using the Zin supercomputer at Lawrence Livermore National Laboratory (25,600 cores of Intel Xeon E5-2670, each with 256-bit SIMD), we achieve 90% parallel efficiency and 96% SIMD efficiency, and perform TPCF computation on a 1.7 billion particle dataset in 5.3 hours (at least 35× faster than previous approaches). In terms of cost per performance (measured in flops/$), we achieve at least an order-of-magnitude (11.1×) higher flops/$ as compared to the best known results [1]. Consequently, we now have line-of-sight to the processing power needed for correlation computation on billion+ particle telescopic data.

I. INTRODUCTION

Correlation analysis is a widely used tool in a variety of fields, including genetics, geology, etc. In the field of astronomy, the Two-Point Correlation Function (TPCF) helps derive the parameters and the underlying physics from the observed clustering patterns in large galaxy surveys, which can trace back to the conditions at the very beginning of the creation of the Universe. The current sizes of observed datasets are on the order of hundreds of millions of particles, are rapidly increasing, and are expected to cross tens of billions of particles soon [2], [3]. However, processing this data to compute the correlation analysis is lagging far behind. For example, TPCF computation on a 1 billion particle dataset would typically consume more than a few exaflops of computation, and take more than 50 years with a single-threaded scalar code. Hence, computing TPCF in a reasonable amount of time on massive datasets requires large-scale HPC machines and, with the trend of growing sizes, is a tailor-made application for exa-scale machines.

One of the biggest challenges in mapping TPCF to large node clusters is load-balancing across the tens of thousands of cores spread across thousands of nodes. TPCF implementations typically involve constructing acceleration data structures like kd-trees, and traversing the resultant trees to compute the correlation amongst particles of the tree-nodes. The computation involved varies dramatically across tree-nodes, and cannot be predicted a priori. Thus, it is challenging to evenly distribute work across cores, and implementations typically under-utilize the computational resources [1], even for as few as 1K cores. Hence there is a need to develop load-balanced algorithms for TPCF computation that fully utilize the underlying resources.

Another challenge for large-scale cluster computation is the total expenditure, which is increasingly dominated by the electricity costs related to energy consumption [4], [5]. This is projected to get even more expensive with the increasing number of nodes in the future. Thus, chip manufacturers have been increasing single-node compute density by increasing the number of cores and, more noticeably, the SIMD width. SIMD computation is an energy-efficient way of increasing single-node computation, and SIMD units have in fact been getting wider: from 128-bit in SSE architectures and 256-bit in AVX [6] to 512-bit in the Intel Xeon Phi [7], [8]. Hence, it is crucial to design algorithms that can fully utilize the SIMD execution units. A large fraction of TPCF run-time is spent in performing histogram updates, which do not map well to SIMD architectures [9], and current implementations achieve negligible SIMD scaling. Thus, the lack of SIMD scaling poses a serious challenge to scaling TPCF on current and upcoming computing platforms.

Contributions: We present the first SIMD-, thread-, and node-level parallel-friendly algorithm for scaling TPCF computation on datasets with billions of particles on Petaflop+ high-performance computing cluster systems with tens of thousands of SIMD cores. Our key contributions are:
• Novel SIMD-friendly histogram update algorithm that exploits the spatial locality in updating histogram bins to achieve near-linear SIMD speedup.
• Load-balanced computation comprising an initial static division of work, followed by low-overhead dynamic migration of work across nodes to scale TPCF to tens of thousands of SIMD cores.
• Communication overhead reduction by overlapping computation with communication to reduce the impact on run-time and improve scaling. Additionally, we also reduce the amount of data communicated between nodes, thereby reducing the energy consumption and increasing the energy efficiency of the implementation.
• Performing TPCF computation on more than a billion particles on a large-scale Petaflop+ HPC cluster in a reasonable amount of time (a few hours; details below). To the best of our knowledge, this is the first paper that reports TPCF performance on this scale of dataset and hardware.

We evaluate our performance on a 1600-node Intel Xeon E5-2670 (Sandy Bridge) cluster with a total of 25,600 cores, each with a peak compute of 41.6 GFLOPS/core, for a total theoretical peak compute of 1.06 PFLOPS. On our largest dataset with 1.7 × 10^9 particles, we achieve a SIMD scaling of 7.7× (max. 8×), a thread-level scaling of 14.9× (max. 16×) and a node-level scaling of 1550× (max. 1600×), for a performance efficiency of 86.8%. Our SIMD-friendly algorithm and the dynamic load-balancing scheme, coupled with communication/computation overlap, improve the performance by a minimum of 35× as compared to previous approaches. This reduction in run-time directly translates to a corresponding energy reduction, which is further boosted by the reduction in inter-node communication. On our input dataset with 1.7 billion particles, TPCF computation takes 5.3 hours¹ on the above cluster. Coupling our algorithmic novelty with an implementation that maximally exploits all micro-architectural dimensions of modern multi-core/many-core processors, we have achieved significantly more than an order-of-magnitude efficiency improvement for TPCF computation. This has the potential to bring an unprecedented level of interactivity to researchers in this field.

¹ The number of random sets generated (nR) equals the typically used value of 100.

II. BACKGROUND

A. Motivation

N-point correlation statistics are used in a wide range of domain sciences. In the field of astrophysics and cosmology, correlation statistics are a powerful tool and have been studied in depth. For example, correlation statistics are used to study the growth of initial density fluctuations into structures of galaxies. In recent decades, various experiments have validated that of the total matter-energy density of the universe, only 4% comprises everything that we observe as galaxies, stars, and planets. The remainder of the universe is made of something termed dark energy (73%) and dark matter (23%) [10]. Dark energy is a hypothetical form of energy that permeates all of space and tends to accelerate the expansion of the universe [11]. The work that led to this discovery of the accelerating expansion of the universe was awarded the Nobel Prize in Physics in 2011 [12].

Despite various independent confirmations of the estimates, the nature of these physical components is largely unknown. Consequently, there are multiple ongoing efforts worldwide to conduct very large (sky coverage) and deep (faint objects) sky surveys. The large surveys are used to glean various cosmological parameters to constrain the models. A class of cross- and auto-correlation functions helps derive parameters and the underlying physics from the observed clustering patterns in large galaxy surveys. The Fourier transform of correlation functions yields the power spectrum of the matter density field. The decomposition of the field into Fourier components helps identify physical processes that contributed to the density perturbations at different scales, unfolding the interplay of dark matter and matter in the early universe and the effects of acceleration due to dark energy relatively later. Studying the large-scale distribution (clustering due to gravity) patterns of matter density at various stages of cosmic evolution sheds light on the underlying principles that can trace back to the conditions at the very beginning of the creation of the universe.

The size of the datasets captured has been growing at a rapid pace. From relatively small datasets with around 10M galaxies in 2011, the datasets have already crossed a billion galaxies and are expected to cross tens of billions of galaxies by the end of the next year [2]. This represents an increase of three orders of magnitude in less than three years. The recently announced Square Kilometer Array (SKA) radio telescope project is expected to produce two times the current internet traffic rates (1 exabyte) by the next decade [13]. The LSST data challenge [14] and the IBM² 'Big Data Challenge' [15] are some examples that further strengthen the need to develop efficient high-performance tools. In fact, correlation computation (and its variants) has been identified as one of the Top-3 problems for the data generated by next-generation telescopes [16]. Current TPCF computation rates are too slow and hence limit the size of datasets that scientists typically run on. For example, a 1.7-billion-particle dataset using current algorithms would take more than a month³ on a Petaflop cluster, which is clearly infeasible for analyzing results on real/simulated data. Hence, it is imperative to develop algorithms that speed up the processing times for such datasets to a more tractable range (around a few hours) and scale with future processor trends (larger number of nodes, cores and increasing SIMD widths) to be able to efficiently process the increasing dataset sizes and to put us on the path of discovering answers to the ultimate question – 'How was the universe formed and how will it end?'

² Other names and brands may be claimed as the property of others.
³ Projected using performance numbers by Dolence and Brunner [1]. Results have been estimated based on internal Intel analysis and are provided for informational purposes only. Any difference in system hardware or software design or configuration may affect actual performance.

B. Problem Definition

Notation: We use the following notation throughout the paper:

D : Set of input data particles (point-masses) in 3D space.
R : Set of random data particles (point-masses) in 3D space.
nR : Number of random sets generated.
N : Number of input data particles.
Dim : Dimension of the box (cube) that contains all the particles.
H : Number of histogram bins.
rmin : Minimum distance (expressed as a % of Dim) that is binned.
rmax : Maximum distance (expressed as a % of Dim) that is binned.
DD(r) : Histogram of distances between the input particles.
RRj(r) : Histogram of distances within Rj (j ∈ [1 .. nR]).
DRj(r) : Histogram of distances between D and Rj (j ∈ [1 .. nR]).
ω(r) : Two-Point Correlation Function (defined in Eqn. 1).
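For concreteness, these quantities map naturally onto a small parameter record; the sketch below is purely illustrative (the struct and field names are assumptions, not data structures from the paper):

```cpp
// Illustrative container for the TPCF run parameters defined above
// (names are assumptions, not the paper's actual data structures).
#include <cstdint>

struct TPCFParams {
    uint64_t N;     // number of input data particles
    int      nR;    // number of random sets (typically 100)
    int      H;     // number of histogram bins
    double   Dim;   // edge length of the bounding cube
    double   rmin;  // minimum binned distance, expressed as a fraction of Dim
    double   rmax;  // maximum binned distance, expressed as a fraction of Dim
};
```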

Although there exist different estimators for the two-point correlation function (TPCF) [17], [18], they are all histogrammed functions of the distances, denoted by DD(r), RRj(r), and DRj(r) (defined above). Similar to other works [19], [20], [1], we use the Landy-Szalay estimator [18], ω(r), which is defined as:

\[
\omega(r) \;=\; 1 \;+\; \frac{n_R \cdot DD(r) \;-\; 2\sum_{j=1}^{n_R} DR_j(r)}{\sum_{j=1}^{n_R} RR_j(r)} \tag{1}
\]
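Once the three sets of histograms are available, evaluating Eqn. 1 is straightforward. The sketch below is an illustrative (not the authors') implementation that assumes DD, DR and RR have already been accumulated, with DR and RR holding one H-bin histogram per random set:

```cpp
// Minimal sketch of the Landy-Szalay estimator of Eqn. 1, given the histograms.
#include <vector>
#include <cstdint>

std::vector<double> landy_szalay(const std::vector<uint64_t>& DD,
                                 const std::vector<std::vector<uint64_t>>& DR,
                                 const std::vector<std::vector<uint64_t>>& RR)
{
    const size_t H  = DD.size();          // number of histogram bins
    const size_t nR = RR.size();          // number of random sets
    std::vector<double> omega(H, 0.0);

    for (size_t r = 0; r < H; ++r) {
        double sumDR = 0.0, sumRR = 0.0;
        for (size_t j = 0; j < nR; ++j) { // accumulate over the nR random sets
            sumDR += static_cast<double>(DR[j][r]);
            sumRR += static_cast<double>(RR[j][r]);
        }
        // omega(r) = 1 + (nR * DD(r) - 2 * sum_j DR_j(r)) / sum_j RR_j(r)
        omega[r] = 1.0 + (static_cast<double>(nR) * DD[r] - 2.0 * sumDR) / sumRR;
    }
    return omega;
}
```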

Fig. 1: Pseudo-code for the dual-tree traversal and histogram update.

    Dual_Tree_Traversal_and_Histogram_Update(n1, n2) {
      (rmin_n1n2, rmax_n1n2) = Compute_Min_Max_Dist(n1, n2);   // rmin^{n1,n2}, rmax^{n1,n2}
      if ((rmin > rmax_n1n2) || (rmin_n1n2 > rmax)) return;
      (min_bin, max_bin) = Compute_Min_Max_Bin_ID(rmin_n1n2, rmax_n1n2);
      if (min_bin == max_bin) {
        Hist[min_bin] += |n1| * |n2|;
        return;
      } else if (!leaf(n1) && !leaf(n2)) {
        // Case 1: recurse on the four cross pairs:
        // (n1.left, n2.left), (n1.right, n2.left), (n1.left, n2.right), (n1.right, n2.right)
        ...
      } else if (leaf(n1) && !leaf(n2)) {
        // Case 2: recurse on the two cross pairs: (n1, n2.left), (n1, n2.right)
        ...
      } else if (!leaf(n1) && leaf(n2)) {
        // Case 3: recurse on the two cross pairs: (n1.left, n2), (n1.right, n2)
        ...
      } else {
        // Case 4: both n1 and n2 are leaf nodes; per-particle pruning and
        // distance computation (see text)
        for (n1_i = 0; n1_i < |n1|; n1_i++)
          ...
      }
    }

Fig. 1 shows the pseudo-code for traversing a pair of kd-tree nodes (n1, n2). Compute_Min_Max_Dist computes the minimum and maximum distances between the two nodes, rmin^{n1,n2} and rmax^{n1,n2}. If (rmin^{n1,n2} > rmax) or (rmax^{n1,n2} < rmin), then none of the pairs of particles from n1 and n2 contribute to the histogram, and there is no further need to traverse down the trees. Otherwise, this is followed by executing Compute_Min_Max_Bin_ID, which computes the range of histogram bins that overlap with [rmin^{n1,n2} .. rmax^{n1,n2}]. The resultant minimum and maximum histogram bins are referred to as min_bin and max_bin, respectively. If min_bin equals max_bin (i.e., all the pairs fall in the same bin), the corresponding histogram index (min_bin) is incremented by the product of the respective numbers of particles in the two sub-trees, and no further traversal is required. Else, one of the following four cases is executed:

Case 1: Both n1 and n2 are non-leaf nodes. The four cross pairs of nodes formed by the immediate children of n1 and n2 are recursively traversed down.
Case 2: n1 is a leaf node and n2 is a non-leaf node. The two cross pairs of nodes formed by n1 and the immediate children of n2 are traversed down.
Case 3: n2 is a leaf node and n1 is a non-leaf node. Symmetric to Case 2.
Case 4: Both n1 and n2 are leaf nodes. For each particle n1_i in n1, the minimum distance (rmin^{n1_i,n2}) and maximum distance (rmax^{n1_i,n2}) to n2 are computed. If these fall within one bin, the corresponding histogram index is incremented by the number of particles in n2. Else, the actual distances between n1_i and each particle in n2 are computed, and the appropriate histogram bins are incremented (Compute_Dist_Update_Histogram). This process is carried out for all particles in n1.

DD(r) and RRj(r) are also computed in a similar fashion (using the appropriate trees). Typically, these trees are built with axis-aligned bounding boxes [23].
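The helper Compute_Min_Max_Bin_ID maps a distance range to the histogram bins it overlaps. A minimal sketch is shown below, assuming bin boundaries spaced uniformly in logarithmic space between rmin and rmax (as the paper states in Sec. IV-A); the function name, signature, and clamping behavior are illustrative assumptions rather than the paper's actual code:

```cpp
// Hedged sketch of a Compute_Min_Max_Bin_ID-style helper: given the minimum and
// maximum node-pair distances, return the range of histogram bins they overlap.
// Assumes H logarithmically spaced bins between rmin and rmax; illustrative only.
#include <cmath>
#include <algorithm>
#include <utility>

std::pair<int, int> compute_min_max_bin_id(double d_min, double d_max,
                                           double rmin, double rmax, int H)
{
    const double log_lo = std::log(rmin);
    const double scale  = H / (std::log(rmax) - log_lo);  // bins per unit log-distance

    auto bin_of = [&](double d) {
        d = std::min(std::max(d, rmin), rmax);             // clamp into the binned range
        int b = static_cast<int>((std::log(d) - log_lo) * scale);
        return std::min(std::max(b, 0), H - 1);
    };
    return { bin_of(d_min), bin_of(d_max) };               // (min_bin, max_bin)
}
```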

III. CHALLENGES

As far as the performance of TPCF on modern SIMD multi-core processors in a multi-node cluster setting is concerned, the following two factors need to be efficiently optimized for:

Fig. 2: (a) Zoom-in to a region of the input dataset. (b) Visualization of the kd-tree built around the input particles. Only a few tree-nodes are shown for clarity. Note the varying size of the nodes, which captures the non-uniform density of particles in each neighborhood; this non-uniformity is prevalent throughout realistic datasets and poses a challenge for load-balanced computation across tens of thousands of cores.

A. Exploiting SIMD Computation

The SIMD widths of modern processors have been continuously growing. The SSE architecture has a 128-bit wide SIMD that can operate simultaneously on 4 floating-point values, while current AVX-based [6] processors have doubled it to 256 bits. Intel's Xeon Phi processors [8] increase it by a further 2× to 512 bits. GPUs have a logical 1024-bit SIMD, with physical SIMD widths of 256 bits on the NVIDIA GTX 200 series, increasing to 512 bits on the Fermi architecture [25]. Hence it is critical to scale with SIMD in order to fully utilize the underlying computing resources. TPCF computation consists of two basic computational kernels: computing distances between points and updating the histogram. A large fraction of the run-time is spent in performing histogram updates, which do not map well to SIMD architectures [9]. This is primarily due to the possibility of multiple lanes within the same SIMD register mapping to the same histogram bin, which requires atomic SIMD updates, something that does not exist on current computing platforms. Most of the recent work on exploiting SIMD on CPUs for histogram updates [26], [27] has resulted in a very small speedup (less than 1.1–1.2×). In the case of correlation computation (TPCF), we exploit the spatial locality in updating histogram bins to devise a novel SIMD-friendly algorithm for efficiently computing bin IDs and performing histogram updates. The techniques developed here are applicable to other histogram-based applications with similar access patterns (details in Sec. IV-A). We are not aware of any previous work in this area.
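To make the difficulty concrete, the scalar inner loop essentially looks like the following illustrative sketch (not the paper's code): the bin index is data-dependent, so several SIMD lanes may map to the same bin, and a straightforward vector store of incremented counts would lose updates.

```cpp
// Illustrative scalar kernel: distance computation followed by a data-dependent
// histogram update. The update hist[bin]++ is a scatter with possible
// intra-register conflicts, which is what prevents a direct SIMD mapping.
#include <cstdint>
#include <cmath>

void histogram_scalar(const float* x1, const float* y1, const float* z1, int cnt1,
                      const float* x2, const float* y2, const float* z2, int cnt2,
                      uint64_t* hist, float rmin, float rmax, int H)
{
    const float log_lo = std::log(rmin);
    const float scale  = H / (std::log(rmax) - log_lo);   // log-spaced bins

    for (int i = 0; i < cnt1; ++i) {
        for (int j = 0; j < cnt2; ++j) {
            float dx = x1[i] - x2[j], dy = y1[i] - y2[j], dz = z1[i] - z2[j];
            float d  = std::sqrt(dx * dx + dy * dy + dz * dz);
            if (d < rmin || d >= rmax) continue;
            int bin = static_cast<int>((std::log(d) - log_lo) * scale);
            ++hist[bin];   // data-dependent scatter: two SIMD lanes may hit the same bin
        }
    }
}
```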

B. Scaling to tens-of-thousands of cores across thousands of nodes

TPCF computation involves traversal of kd-trees: for a given tree-node, computing the neighboring tree-nodes and the correlation between particles in the corresponding pair of nodes. As shown in Fig. 1, the work done per pair is a function of the distance between the nodes, the product of the number of particles in each node, etc. Furthermore, this work can vary drastically across the nodes of the tree. Such tree-traversal based algorithms pose a challenge for scaling due to the irregular amount of work, which cannot be predicted a priori (see Fig. 2).

As far as cores on a single node are concerned, there exist systems [28], [29] which employ a dynamic task queueing model [30] based on task stealing for load-balancing across them. Due to the shared on-chip resources, the resultant cost of task stealing is less than a few thousand cycles [31], [32]; it adds very little overhead and scales well in practice. However, there are relatively fewer systems [33], [34] for scaling such tree-traversal code across thousands of nodes, which is what we focus on in this paper. This is because the task stealing now occurs across shared-nothing computational nodes, and involves large node-to-node latency and complicated protocols to guarantee termination. The resultant task stealing overheads are now orders of magnitude larger and need to be carefully optimized for. We perform the work division between the nodes using a two-phase algorithm: a low-overhead, domain-specific initial static subdivision of work across the nodes, followed by dynamic scheduling involving task stealing (or migration) across nodes such that all the computational resources are kept busy during the complete execution (Sec. V). This helps us achieve scaling both within the cores of a node and across nodes in the cluster. In addition, the time spent in data communication also adds to the total execution time, and can potentially reduce scaling. We propose a hybrid scheme that overlaps computation with communication to reduce the overhead of data communication and obtain near-linear speedups with as many as 25,600 cores across 1,600 nodes.
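As a point of reference for the single-node case, the sketch below illustrates the dynamic task queueing model cited above ([30]): worker threads pull kd-tree node-pair tasks from a shared atomic counter, so expensive pairs do not stall the other cores. It is an assumption-level illustration of intra-node scheduling only; the cross-node static division and task migration of Sec. V are not shown.

```cpp
// Minimal single-node sketch of dynamic task queueing over kd-tree node-pair tasks.
#include <atomic>
#include <thread>
#include <vector>

struct NodePairTask { int n1, n2; };            // indices of two kd-tree nodes (illustrative)

// Stand-in for the dual-tree traversal of one node pair (Fig. 1); a real
// implementation would recurse and update per-thread histograms.
void process_pair(const NodePairTask& t) { (void)t; /* ... */ }

void run_node_local(const std::vector<NodePairTask>& tasks, int num_threads)
{
    std::atomic<size_t> next{0};                // shared work counter

    auto worker = [&]() {
        for (;;) {
            size_t i = next.fetch_add(1, std::memory_order_relaxed);
            if (i >= tasks.size()) break;       // no work left
            process_pair(tasks[i]);             // cost varies wildly per task
        }
    };

    std::vector<std::thread> pool;
    for (int t = 0; t < num_threads; ++t) pool.emplace_back(worker);
    for (auto& th : pool) th.join();
}
```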

IV. OUR APPROACH

A. SIMD-Friendly Histogram Computation

Let |ni| denote the number of particles in node ni, and K denote the SIMD width of the architecture (8-wide AVX [6] for CPUs). For each pair of kd-tree leaf nodes (n1 and n2, respectively), if [rmin^{n1,n2} .. rmax^{n1,n2}] overlaps [rmin .. rmax] in more than one bin, then for each particle in n1 whose extent with n2 overlaps more than one bin, we compute its distance to all particles in n2 and update the relevant histogram bins. Let n1' denote the set of particles in n1 whose extent with n2 overlaps more than one bin.

Consider the first part of the computation, the distance computation. One way of SIMDifying it is to perform the computation for each particle in n1' simultaneously with K particles in n2, until all particles have been considered. However, this would require gather operations to pack the X, Y, and Z coordinates of the particles in n2 into consecutive locations in the registers. To avoid the repeated gather operations, we store particles in a Structure-Of-Arrays (SOA) layout [35]: we store the X coordinates of all particles contiguously, followed by the Y coordinates, and so on. This allows us to simply load the X, Y, and Z coordinates of K particles into three registers before the distance computation. To avoid the expensive square-root operation, we actually compute the square of the distance, and compare it against the squares of the bin boundaries.

As the next step, we need to update the histogram. As described in Sec. II-B, bin boundaries are spaced uniformly in logarithmic space, thereby forming a geometric progression. Let d_0, d_0·γ, ..., d_0·γ^H denote the bin boundaries (d_0 = rmin and d_0·γ^H = rmax). The fraction of the volume covered by the i-th bin (i ∈ [1 .. H]) is ~ (γ^3 − 1)/γ^{3(H−i+1)}. Consider a typical scenario with rmax = 10% divided into H = 10 bins. The last four bins cover 99.6% of the total volume (74.9%, 18.8%, 4.7%, and 1.2%, starting from the last bin). After pruning cases where all pairs between kd-tree nodes fall within a single bin (since they do not incur any computation), a large fraction of the remaining time is expected to be spent in cases where the range of bins for n1' and n2 overlaps a small number of bins (typically 2–4). This observation is also confirmed by the results on real datasets (Sec. VI-C). We develop a SIMD algorithm to perform histogram updates for such cases (without using gather/scatter) as described below.

Consider the case of [rmin^{n1,n2} .. rmax^{n1,n2}] overlapping two bins, and denote the bin boundaries as r_l, r_m, and r_h. Instead of an explicit histogram, we maintain K counters in a SIMD register, one for each lane. We denote this register as reg_count. In addition, we maintain an auxiliary register (reg_aux) that contains the value r_m^2 splatted across the K lanes. After K distance (square) computations, we compare the results with reg_aux and increment the lanes of reg_count that passed the comparison test (masked add). After all |n1'| × |n2| comparisons, we obtain the total number of pairs that passed the test by summing up the K lanes in reg_count (denoted as count).
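The following sketch shows how the two-bin case described above can be expressed with AVX intrinsics: SOA loads, squared distances, a splatted squared boundary in reg_aux, and a masked add into per-lane counters in reg_count. It is a minimal illustration consistent with the description, not the authors' actual code; the padding of n2 to a multiple of 8 and the use of float counters are assumptions.

```cpp
// Two-bin histogram update via per-lane counters and a masked add (AVX sketch).
// Premise (two-bin case): every pair already falls in [r_l, r_h), so a single
// compare against r_m^2 decides the bin. cnt2 is assumed a multiple of 8.
#include <immintrin.h>
#include <cstddef>
#include <cstdint>

uint64_t count_lower_bin(const float* x1, const float* y1, const float* z1, size_t cnt1,
                         const float* x2, const float* y2, const float* z2, size_t cnt2,
                         float r_mid)
{
    const __m256 reg_aux = _mm256_set1_ps(r_mid * r_mid);  // squared middle boundary, splatted
    const __m256 ones    = _mm256_set1_ps(1.0f);
    uint64_t total = 0;

    for (size_t i = 0; i < cnt1; ++i) {                    // particles of n1'
        const __m256 px = _mm256_set1_ps(x1[i]);
        const __m256 py = _mm256_set1_ps(y1[i]);
        const __m256 pz = _mm256_set1_ps(z1[i]);
        __m256 reg_count = _mm256_setzero_ps();            // K = 8 per-lane counters

        for (size_t j = 0; j < cnt2; j += 8) {             // SOA loads of 8 particles of n2
            __m256 dx = _mm256_sub_ps(px, _mm256_loadu_ps(x2 + j));
            __m256 dy = _mm256_sub_ps(py, _mm256_loadu_ps(y2 + j));
            __m256 dz = _mm256_sub_ps(pz, _mm256_loadu_ps(z2 + j));
            // Squared distance: bin boundaries are squared instead of taking a sqrt.
            __m256 d2 = _mm256_add_ps(_mm256_mul_ps(dx, dx),
                          _mm256_add_ps(_mm256_mul_ps(dy, dy), _mm256_mul_ps(dz, dz)));
            // Masked add: lanes with d2 < r_mid^2 get +1.0, the others +0.0.
            __m256 lt = _mm256_cmp_ps(d2, reg_aux, _CMP_LT_OQ);
            reg_count = _mm256_add_ps(reg_count, _mm256_and_ps(lt, ones));
        }
        // Horizontal sum of the 8 lanes for this particle of n1'.
        alignas(32) float lanes[8];
        _mm256_store_ps(lanes, reg_count);
        for (int l = 0; l < 8; ++l) total += static_cast<uint64_t>(lanes[l]);
    }
    return total;  // upper-bin count = (cnt1 * cnt2) - returned value
}
```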
