J Shanghai Univ (Engl Ed), 2011, 15(5): 437–444
Digital Object Identifier (DOI): 10.1007/s11741-011-0765-2
Blocking optimized SIMD tree search on modern processors
ZHANG Zhuo, ZHENG Yan-heng, LU Yu-fan, SHEN Wen-feng, XU Wei-min
School of Computer Engineering and Science, Shanghai University, Shanghai 200072, P. R. China
©Shanghai University and Springer-Verlag Berlin Heidelberg 2011
Abstract  Tree search is a widely used fundamental algorithm. Modern processors provide tremendous computing power by integrating multiple cores, each with a vector processing unit. This paper reviews studies on exploiting the single instruction multiple data (SIMD) capacity of processors to improve the performance of tree search, and proposes several improvements to the reported SIMD tree search algorithms. Based on the blocking tree structure, blocking for memory alignment and dynamic blocking prefetch are proposed to reduce the overhead of memory access. Furthermore, as a form of non-linear loop unrolling, search branch unwinding shows that the number of branches can exceed the data width of the SIMD instructions in an SIMD search algorithm. The experiments suggest that the blocking optimized SIMD tree search algorithm can achieve a response speed 1.6 times faster than the un-optimized algorithm.

Keywords  single instruction multiple data (SIMD), tree search, binary search, streaming SIMD extensions (SSE), Cell Broadband Engine (BE)
Introduction

Tree search is a widely used fundamental algorithm in computer science. As one of the most important operations in database systems, the in-memory structured index[1] is based on tree search. Currently, many massive data processing applications, such as data mining, network monitoring, financial analysis[1], and search engine indexing[2], can benefit from performance-optimized tree search algorithms. According to classical algorithm theory, search algorithms are divided into linear search, sort-based search, tree-based search and hash-based search. For an ordered data set, binary search has the minimum time complexity among the sort-based algorithms[3]. Binary search can be transformed into tree-based search on a binary search tree. In this paper, we focus on tree-based search algorithms. In fact, binary search (including tree-based search) cannot give full play to the features of modern processors. Binary search performs only one comparison per iteration, which fails to exploit the single instruction multiple data (SIMD) capacity of processors[4]. Furthermore, the performance of modern processors largely depends on the efficiency of the cache, and a tree search algorithm typically incurs long main memory access latency due to irregular and unpredictable data access during the tree traversal[1]. This also hurts the efficiency of SIMD operations.

Modern processors provide tremendous computing power by integrating multiple cores, each with an SIMD (vector) processing unit. A typical homogeneous multi-core processor, the Intel Core i7 (Nehalem microarchitecture), features 16 128-bit SIMD registers per core and can execute four 32-bit operations simultaneously. Likewise, a heterogeneous multi-core processor, the Cell Broadband Engine, features 128 128-bit vector registers in each synergistic processor element (SPE) and also executes four 32-bit operations simultaneously. The new generation of Intel processors supporting the AVX instruction set (Sandy Bridge microarchitecture) now supports a 256-bit SIMD data width, which may be extended to 512-bit in the future. SIMD instructions are particularly well-suited to accelerating computation-intensive applications[4]. Many studies have shown that data-intensive applications, such as sorting[6], hash-based search[7], tree-based search[1,4] and database operations[8], can also benefit from SIMD instructions.
Received Mar. 10, 2011; Revised June 13, 2011
Project supported by the Shanghai Leading Academic Discipline Project (Grant No. J50103), and the Graduate Student Innovation Foundation of Shanghai University (Grant No. SHUCX112167)
Corresponding author XU Wei-min, Prof., E-mail: wmxu@staff.shu.edu.cn
Several researchers have carried out work on exploiting SIMD instructions to improve the performance of binary search and tree search. Zhou, et al.[9] employed SIMD instructions to improve binary search performance on small data sets. Schlegel, et al.[4] proposed the k-Ary search algorithm based on SIMD binary search. The main idea of k-Ary search is to use SIMD instructions to compare k elements simultaneously, dividing the search tree into (k+1) parts, so that the number of iterations drops to $\log_k n$; the performance advantage grows with $n$. The k-Ary search algorithm also avoids non-continuous memory access inside each tree level through a linearized tree structure. Similar results came from the study of parallel search on GPUs by Kaldewey, et al.[10] However, k-Ary search still involves long memory access latency: When the search tree is relatively large, a parent node and its child nodes might not lie in consecutive memory, resulting in cache misses when traversing to the lower levels. Kim, et al.[1] improved the k-Ary search algorithm on CPUs and GPUs and presented the FAST algorithm, mainly aiming to reduce memory access latency. The FAST algorithm stores the linearized search tree in hierarchical blocks, so that all the sub-trees corresponding to blocks are stored in adjacent memory; the multi-level blocking can thus be adapted to the storage hierarchy of modern processors. Usually, the hierarchical division corresponds to the SIMD data width, the L1 cache line length, and the memory page size. The FAST work also studied thread-level and data-level parallelism to improve throughput for batch queries.

The purpose of this paper is to propose the blocking optimized SIMD tree search algorithm based on the above studies, and to exploit the efficiency of modern processors in memory access and computation. Based on the previous research, we further study the data structure of the blocking tree. The data structure can be improved to meet the memory alignment requirement of the SIMD loading operation in each iteration, avoiding the overhead of unaligned memory access. The FAST algorithm uses hierarchical blocking to adapt to the processors' cache structure, but more hierarchical levels lead to additional complexity of the data structure. Dynamic blocking prefetch is equivalent to providing a specific software cache algorithm; with hardware support, it is more flexible and more fully optimized than multi-level hierarchical blocking. Furthermore, as a form of non-linear loop unrolling, search branch unwinding shows that the number of branches can exceed the data width of the SIMD instructions in an SIMD search algorithm. This paper mainly studies how to improve the response speed of search algorithms by exploiting instruction-level parallelism; throughput optimization is beyond the scope of our work. The experiments suggest that the blocking
optimized SIMD tree search algorithm can achieve a response speed 1.6 times faster than the un-optimized algorithm, and 2.2 times faster than generic binary search, on an Intel Core i7 920 processor.

All of the variant tree search algorithms accelerated on modern processors are based on binary search, so we first provide a brief overview of it. Binary search is a bifurcated divide-and-conquer search algorithm. Assume that the input to the search algorithm is a sorted array of keys $A = (a_1, \cdots, a_n)$, where $a_i \leqslant a_j$ for $1 \leqslant i \leqslant j \leqslant n$, and a search key $s$. The algorithm picks the median element as the separator key in each iteration. The separator divides the search space into two equally-sized parts, called partitions: The left partition contains the keys smaller than the separator key and the right partition contains the larger ones. Since $A$ is sorted, the median key is found at the center position of the array, i.e., $a_{\lceil n/2 \rceil}$ is chosen in the first iteration. The separator and the search key are then compared. The search terminates if the two keys are equal, i.e., the desired key has been found. Otherwise, the search repeats on either the smaller or the larger partition until the key is found. The algorithm terminates and reports $s \notin A$ if the selected partition is empty. It follows that the algorithm performs $h = \lceil \log_2(n+1) \rceil$ iterations in the worst case.
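For reference, this iterative procedure can be written in a few lines of C (a minimal sketch; the function name and interface are ours, not from the paper):

    #include <stddef.h>

    /* Return the index of key in the sorted array a[0..n-1], or -1 if absent.
       Performs at most ceil(log2(n+1)) iterations, as derived above. */
    static long binary_search(const int *a, size_t n, int key)
    {
        size_t lo = 0, hi = n;                 /* search space is [lo, hi) */
        while (lo < hi) {
            size_t mid = lo + (hi - lo) / 2;   /* separator (median) key */
            if (a[mid] == key)
                return (long)mid;              /* desired key found */
            if (a[mid] < key)
                lo = mid + 1;                  /* continue in right partition */
            else
                hi = mid;                      /* continue in left partition */
        }
        return -1;                             /* s is not in A */
    }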
1 Data structure of blocking tree

The k-Ary search algorithm takes a k-ary tree as its data structure instead of a binary tree, while the FAST algorithm uses a blocking binary tree to adapt to different kinds of SIMD processors (CPUs and GPUs) by flexibly stretching the width of the branches. In addition, this method helps optimize memory access. The FAST algorithm designs a 3-level blocking tree; based on it, we illustrate a more general definition and the properties of the hierarchical blocking tree, i.e., the m-level blocking search tree. Figure 1 shows a hierarchical blocking tree. $N$ is the number of input nodes, and $m$ is the specified classification number.
Fig.1  Three-level blocking tree
$d_i$ is the depth of the level-$i$ sub-tree, with $d_0 = 1$, meaning the sub-tree depth is 1 when the hierarchical blocking tree degenerates to a binary tree; in general $d_i = k d_{i-1}$, $i \in [1, m]$, where $k > 1$ is a natural number. $d_N$ is the depth of the whole search tree. $N_i$ is the number of keys that fit into a block of level $i$, $S_i$ is the block storage size of level $i$, and $O_i$ is the block out-degree (number of branches) of level $i$. Besides, $d_N = k d_m$, $k \geqslant 1$: The depth of each level must be an integer multiple of that of the former one. According to the properties of the binary tree, it is easy to find $N_i = 2^{d_i} - 1$, $i \in [1, m]$ (in particular, if $i = 0$, then $N_0 = 1$), and to obtain the depth of the blocking tree $d_N$. The depth of a binary tree containing $N$ nodes is $d_N' = \lfloor \log_2 N \rfloor + 1$, and $d_N = \lceil d_N'/d_m \rceil d_m$ to meet the requirement that $d_N$ be an integer multiple of $d_m$.

$S_i$ is the storage space needed to store the $N_i$ nodes of level $i$. If all the nodes are stored continuously, then $S_i = N_i$. If not, $S_i$ is set according to the alignment requirement under the constraint $S_i \geqslant (N_i/N_{i-1}) S_{i-1}$; when the nodes of level $i$ are stored continuously, $S_i = (N_i/N_{i-1}) S_{i-1}$. For example, assume that $S_0 = N_0 = 1$, $d_1 = 2$ and $d_2 = 4$; then $S_1 = 4 \geqslant (N_1/N_0) S_0 = 3$, and if we store the nodes of the 2nd level continuously, $S_2 = (N_2/N_1) S_1 = (15/3) \cdot 4 = 20$. In other words, if the nodes are stored non-continuously within a level, we obtain $S_i$ by recurrence from $S_{i-1}$ instead of calculating it from $N_i$ directly. We introduce non-continuous storage in order to illustrate the optimization for memory alignment later in this paper.

The number of branches (out-degree) of a sub-tree block with depth $d_i$, denoted $O_i$, is $O_i = 2^{d_i} = N_i + 1$. The relationship between the blocking binary tree and the k-Ary tree is that a sub-tree block with depth $d_i$ forms the node of an $O_i$-Ary tree, nested level by level. The difference is that the "k" of the k-Ary tree corresponding to a blocking binary tree must be an integer power of 2, i.e., 4 branches, 8 branches, 16 branches, etc. In particular, when $m = 0$: $S_0 = N_0 = 1$, $d_0 = d_m = 1$, $O_i = 2$, $d_N = \lfloor \log_2 N \rfloor + 1$, and the blocking tree equates to the sequential storage structure of a binary tree. Thus, the sequential storage structure of the binary tree is a special form of the blocking storage structure. A blocking search tree can be constructed from an ordered array of size $N$: Filling the nodes with the array's data in the order of an in-order traversal generates a complete blocking tree.
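The construction step can be illustrated for the degenerate $m = 0$ case (sequential heap-order storage), where the same in-order filling idea applies; a minimal sketch in C, with names of our own choosing:

    #include <stddef.h>

    /* Fill tree[] (root at index 0, children of node i at 2i+1 and 2i+2)
       from sorted[] by in-order traversal: an in-order walk of the
       heap-ordered layout visits nodes in ascending key order, so taking
       sorted elements one by one yields a valid search tree. */
    static void inorder_fill(const int *sorted, int *tree, size_t n,
                             size_t node, size_t *next)
    {
        if (node >= n)
            return;
        inorder_fill(sorted, tree, n, 2 * node + 1, next);  /* left subtree */
        tree[node] = sorted[(*next)++];                     /* place this node */
        inorder_fill(sorted, tree, n, 2 * node + 2, next);  /* right subtree */
    }

    /* Usage: size_t next = 0; inorder_fill(sorted, tree, n, 0, &next); */

For $m > 0$ the same traversal applies, with the heap index replaced by the tuple-based node number introduced in Subsection 1.1.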
1.1 Address calculation
Address calculation is very convenient when the binary tree is stored in the continuous structure:
The node number $i$ is just the address offset. When the binary tree is stored in the blocking structure, the address offset must be calculated from the node number. Here we define the node number in an $m$-level blocking tree as a tuple of dimension $m+1$, $(B_m, B_{m-1}, \cdots, B_i, \cdots, B_1, B_0)$, where each $B_i$ starts from 0. The address offset formula is

$E_{\text{offset}} = \sum_{i=0}^{m} B_i S_i = B_m S_m + B_{m-1} S_{m-1} + \cdots + B_1 S_1 + B_0 S_0.$

In particular, when $m = 0$ and $S_0 = N_0 = 1$, this reduces to the familiar address formula of the binary tree, $E_{\text{offset}} = B_0$.

Based on the conclusions above, we now discuss the calculation method for top-down traversal of the hierarchical blocking tree. Assuming a node of the $m$-level blocking tree is represented by the tuple $(B_m, B_{m-1}, \cdots, B_i, \cdots, B_1, B_0)$, the pseudo-code is as follows:

    begin
        if visiting left child then
            B_0 ← 2B_0 + 1
        else
            B_0 ← 2B_0 + 2
        end if
        i ← 0
        while i < m and B_i ≥ N_{i+1}/N_i do
            B_{i+1} ← B_{i+1} O_{i+1} + (B_i − N_{i+1}/N_i) + 1
            B_i ← 0
            i ← i + 1
        end while
    end

Here $B_i - N_{i+1}/N_i$ is the order number of the chosen child branch, from 0 to $O_{i+1} - 1$; since $N_1/N_0 = N_1$, for $i = 0$ the carry fires exactly when $B_0$ steps out of its level-1 block. The result is the updated tuple $(B_m, B_{m-1}, \cdots, B_i, \cdots, B_1, B_0)$.
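In C, one step of this tuple update can be sketched as follows (our reconstruction of the pseudo-code above; the array names and the precomputed N[i], O[i] tables are assumptions):

    /* Update the node-number tuple B[0..m] after descending to the left
       (go_right = 0) or right (go_right = 1) child.  N[i] is the number of
       keys per level-i block (N[0] = 1) and O[i] the block out-degree. */
    static void step_down(int *B, int m, const int *N, const int *O,
                          int go_right)
    {
        B[0] = 2 * B[0] + (go_right ? 2 : 1);   /* move inside the level-1 block */
        for (int i = 0; i < m && B[i] >= N[i + 1] / N[i]; i++) {
            /* B[i] - N[i+1]/N[i] is the branch taken out of the level-(i+1) block */
            B[i + 1] = B[i + 1] * O[i + 1] + (B[i] - N[i + 1] / N[i]) + 1;
            B[i] = 0;                           /* restart at the child block's root */
        }
    }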
1.2 Node number calculation
The node number calculation is needed on two occasions in the tree search algorithm: One is when a node matches successfully, and the other is when the algorithm has to choose which branch to traverse. The FAST algorithm calculates the node number using a lookup table, which has its own shortcoming: Checking the table takes an indirect addressing operation, which deteriorates performance as the width of the SIMD search branches grows. The k-Ary search algorithm[4] also calculates node numbers using SIMD instructions. However, due to the difference between the data structures of the k-Ary tree and the blocking tree, their data alignments in SIMD registers differ as well. Here we focus on the situation in the blocking tree. Figure 2 illustrates the SIMD operations in the blocking
tree structure. Suppose 3 nodes are operated on by one SIMD instruction each time, i.e., $N_1 = 3$, $S_1 = 4$. Combining the results of the SIMD comparison operation produces a 4-bit mask: mask $= (r_3, r_2, r_1, r_0)$, with $r_0$ the lowest bit ($r_i = 0$ means logic false and $r_i = 1$ means logic true).
Fig.2  Node arrangement in SIMD registers
When the node matching operation succeeds, i.e., at least one node equals the search key, the formula to pick out the matching node is

$i = \text{trailing\_zero\_count}(\text{mask}) = \text{popcount}(\text{not}(\text{mask}) \text{ and } (\text{mask} - 1)),$

where trailing_zero_count() counts the number of trailing zeros. Because the mask aligns from the lowest bit $r_0$, the number of 0s below the lowest 1 bit gives the serial number of the matching node; if several nodes match the key, the lowest ranked one is picked. As the second half of the formula shows, trailing_zero_count() can be implemented by popcount(), which counts the 1s in a binary number and is supported by a dedicated instruction on many modern processors.

Branch choosing, which is implemented by either popcount() or leading_zero_count() (counting leading 0s)[4] in the k-Ary search algorithm, is more complicated. For the discussion below, the child nodes are numbered from left to right, starting from 0. In a k-Ary tree, each node holds $k-1$ elements $E_1, E_2, \cdots, E_{k-1}$ with $E_1 \leqslant E_2 \leqslant \cdots \leqslant E_{k-1}$ by the definition of the k-Ary tree. Under such circumstances, mask $= 00\cdots0$ indicates key $\leqslant E_1 \leqslant E_2 \leqslant \cdots \leqslant E_{k-1}$, thus $i = 0$; mask $= 0\cdots01$ indicates $E_1 \leqslant$ key $\leqslant E_2 \leqslant \cdots \leqslant E_{k-1}$, thus $i = 1$; and so on. In other words, the 1s in the mask appear one by one as the child number increases, and once a higher bit becomes 1, all lower bits are 1 as well. Due to this property, both popcount() and leading_zero_count() can calculate the node number, provided the order of the nodes satisfies the partial order relation above.

Nevertheless, in the blocking structure, the order of the nodes aligned in the SIMD registers does not satisfy this partial order relation (when the block has 3 levels or more). Thus
the leading_zero_count() approach becomes invalid, while popcount() still holds. Here is the demonstration:

(i) Assume $M$ nodes constitute one block of the binary tree.

(ii) Let child branch $i$ of this block be numbered $i$, and let $S_i$ denote the set of all keys that satisfy the condition for picking child $i$.

(iii) If $i = 0$, the key is smaller than all $M$ nodes, so the number of block nodes smaller than the key is 0.

(iv) If $i = 1$, obviously $S_0 <$ key. With $N_0$ the parent node of $S_0$ and $S_1$, we have $S_0 < N_0 <$ key $< N_t$, where $N_t$ stands for all the other $M - 1$ block nodes. Thus the number of the $M$ nodes smaller than the key is 1.

(v) When $i = j + 1$, let $N_j$ be the common ancestor node of $S_i$ and $S_j$, with $S_j$ on $N_j$'s left branch and $S_i$ on the other. Then $N_s \leqslant S_j < N_j <$ key $< N_t$, where $N_s$ stands for the ancestor nodes collected as the branch number goes from 0 to $j$, and $N_t$ for all the remaining block nodes. Thus the number of the $M$ nodes smaller than the key is $i$.

(vi) Combining (iii), (iv) and (v), the number of the $M$ nodes smaller than the key equals $i$ for all $i \in [0, M]$. Proved.

Therefore, the child node number formula for branch choosing is $i = \text{popcount}(\text{mask})$.
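Both node-number formulas translate directly into two short helpers (a sketch assuming SSE4.2 for _mm_popcnt_u32; the function names are ours):

    #include <nmmintrin.h>   /* SSE4.2: _mm_popcnt_u32 */

    /* Index of the matching node: trailing-zero count of a non-zero mask,
       implemented with popcount as in the formula above. */
    static inline int match_index(unsigned mask)
    {
        return _mm_popcnt_u32(~mask & (mask - 1));
    }

    /* Child-branch number for descending: by the proof above, the number
       of block keys smaller than the search key equals popcount(mask). */
    static inline int branch_index(unsigned mask)
    {
        return _mm_popcnt_u32(mask);
    }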
2 Memory access optimization

2.1 Blocking for memory alignment
When the blocking tree is stored continuously, $N_i = 2^{d_i} - 1$ is not an integer power of 2, so the derived offset $E_{\text{offset}} = \sum_{i=0}^{m} B_i N_i$ is in general not a multiple of an integer power of 2, while SIMD instructions operate under a memory alignment requirement whose unit must be an integer power of 2. According to previous research, storing the nodes continuously in their natural order leads to unaligned read addresses, which cost more than aligned ones. Usually, 128-bit wide SIMD instructions require the data input to be 16-byte aligned[11]. Presume the node data width to be 32-bit and $d_1 = 2$; then $S_1 = N_1 = 3$, which means the block addresses increase in the pattern 0, 3, 6, 9, $\cdots$, which cannot satisfy the requirement that the data input be 16-byte aligned.

According to the properties of the blocking tree mentioned above, we can set $S_i > N_i$. In the example above, if we set $S_1 = 4 > N_1$, the addresses increase in the pattern 0, 4, 8, 12, $\cdots$, meeting the requirement that the data
must be 16-byte aligned. Memory aligned blocking causes extra space cost; the space utilization ratio is given by $U = N_i/S_i = N_i/(N_i + 1)$. The space utilization ratio is $U = 100\%$ if the nodes of each level are stored continuously; $U = 3/4 = 75\%$ when $S_1 = 4$, $N_1 = 3$; and $U = 7/8 = 87.5\%$ when $S_1 = 8$, $N_1 = 7$. With the increase of $N_i$, the space utilization ratio approaches 100%, so increasing $d_1$ helps amplify the space utilization.
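In code, the aligned layout amounts to padding each block to $S_1$ slots and allocating the whole tree on a 16-byte boundary (a sketch with hypothetical helper names; error handling omitted):

    #include <stddef.h>
    #include <emmintrin.h>   /* SSE2: __m128i, _mm_load_si128; _mm_malloc */

    enum { N1 = 3, S1 = 4 };  /* 3 keys per block, padded to 4 x 32 bit = 16 B */

    /* One padding slot per block raises block offsets to 0, 4, 8, 12, ...,
       so every SIMD load below is a legal aligned load. */
    static int *alloc_blocked_tree(size_t num_blocks)
    {
        return (int *)_mm_malloc(num_blocks * S1 * sizeof(int), 16);
    }

    /* Aligned load of one block's keys into a SIMD register. */
    static inline __m128i load_block(const int *tree, size_t block)
    {
        return _mm_load_si128((const __m128i *)(tree + block * S1));
    }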
2.2 Dynamic blocking prefetch
The FAST algorithm also uses multi-level blocking to adapt to the cache structure of processors. On the other hand, too many levels produce new restrictions: In a blocking search tree, the requirement that the depth of each level be an integer multiple of that of the former one causes squandering of storage space and imbalance of the search tree. For a 16 MB 32-bit integer data set, for example, the height is h = 22 if the integers are stored in a binary tree, but h = 40 when the data set is stored in a blocking tree with $d_m = 20$. In the worst case, each search performs 18 more iterations than the former, a 90% increase.

According to this analysis, the first-level blocks with depth $d_1$ meet the data loading requirement of SIMD instructions, while all the other $m-1$ level(s) are designed to adapt to the processors' cache features. Hence it is not necessary to put such blocking into the storage structure if we can map it to the cache prefetch mechanism of the processors, i.e., use software prefetch instead of a hardware-oriented layout.

First, regardless of the processor type, we discuss a general software prefetching method called dynamic blocking prefetch. Assume the depth of the first-level blocks is $d_1$. $D_{\text{Block}}$ denotes the depth of the prefetch blocks, which is a virtual block size and has no effect on the storage structure of the blocking search tree. Generally we can presume $D_{\text{Block}} = d_m$, where $D_{\text{Block}}$ is an integer multiple of $d_1$. $D_{\text{Lead}}$ indicates that a block of depth $D_{\text{Block}}$ will be prefetched $D_{\text{Lead}}$ layers in advance. Since the first $D_{\text{Lead}}$ of the $D_{\text{Block}}$ layers of a block were already loaded with the previous block, the algorithm only needs to transfer the remaining $D_{\text{Buffer}}$ layers of data, where $D_{\text{Block}} = D_{\text{Lead}} + D_{\text{Buffer}}$ and both $D_{\text{Lead}}$ and $D_{\text{Buffer}}$ are integer multiples of $d_1$. In order to cascade this dynamic blocking structure, the parameters must satisfy $D_{\text{Lead}} \leqslant D_{\text{Buffer}}$ (see Fig.3). When the algorithm starts, $D_{\text{Block}}$ layers of data are prefetched into the cache. The next $D_{\text{Buffer}}$ layers are prefetched when the algorithm reaches the $(D_{\text{Buffer}} - D_{\text{Lead}})$-th layer, and $D_{\text{Buffer}}/d_1$ transfer requests are issued for each prefetch. The maximum data length of a transfer request is

$L = ((2^{D_{\text{Block}}} - 1) - (2^{D_{\text{Block}} - d_1} - 1)) \frac{S_1}{N_1} = (2^{d_1} - 1) 2^{D_{\text{Block}} - d_1} \frac{S_1}{N_1}.$
Fig.3  Dynamic blocking prefetch
The SSE instruction set supports software prefetch via the PREFETCH instruction. However, the prefetched data length is restricted to one 64-byte cache line[11], which makes only limited dynamic blocking prefetch functionality available. For example, for the 32-bit integer data type, when $d_1 = 2$, $N_1 = 3$, $S_1 = 4$, the maximum prefetch capability is $D_{\text{Lead}} = d_1 = 2$ and $D_{\text{Buffer}} = d_1 = 2$, i.e., $D_{\text{Block}} = 4$. The maximum data length of a transfer request is $L = (2^{d_1} - 1) 2^{D_{\text{Block}} - d_1} \frac{S_1}{N_1} = (2^2 - 1) \cdot 2^{4-2} \cdot \frac{4}{3} = 16$, i.e., only 16 elements of 32-bit integers, 64 bytes in total. If $d_1 = 3$, prefetching the next group of nodes (layer 3 to layer 6) requires a data length of 256 bytes, which is not supported by the current SSE instruction set.
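Within the 64-byte limit, a prefetch of a larger block group has to be issued line by line; a sketch of such a helper (our naming, with the byte length derived from the formula above as an assumption):

    #include <stddef.h>
    #include <xmmintrin.h>   /* SSE: _mm_prefetch, _MM_HINT_T0 */

    /* Prefetch the next D_Buffer layers of a block group, issuing one
       PREFETCH per 64-byte cache line, the per-instruction limit noted
       above. */
    static inline void prefetch_block(const int *addr, size_t bytes)
    {
        for (size_t off = 0; off < bytes; off += 64)
            _mm_prefetch((const char *)addr + off, _MM_HINT_T0);
    }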
3 Search branch unwinding

In previous studies, the search traverses to the next layer after one SIMD comparison operation, which means that the number of nodes in a first-level block, $N_1$, does not exceed the data width $K$ of the SIMD instruction, i.e., the maximum width of the search branches is $K + 1$. For example, the k-Ary algorithm computes 4 nodes simultaneously, generating 5 branches, while in the FAST algorithm 3 nodes are computed at a time, generating 4 branches. This method is the most intuitive one, with the lowest time complexity. However, in each iteration the comparison operations take up only a part of the total overhead; the other parts include the address calculations and the loop overhead. Moreover, the overhead of the SIMD operations is low compared with the other two. Performing the SIMD comparison operation twice in each iteration and then aggregating the results equates to extending the width of the search branches. With one comparison per iteration, $d_1 = 2$, $N_1 = 3$, and the branch width is 4; with two comparisons, $d_1 = 3$, $N_1 = 7$, and the width is 8, equivalent to an 8-Ary tree search. The original SIMD part has 1 loading instruction, 2 comparison instructions and 2 mask instructions; it gains 1 loading instruction, 2 comparison instructions, 2 mask instructions and 4 logic instructions after broadening
the branch width. As a result, $1 - \frac{\log_8 N}{\log_4 N} = 33.3\%$ of the loop iterations are cut, while the overhead of address calculation in each iteration remains the same. The experimental results show that the performance is enhanced by 31%, in keeping with the decrease in loop count. The optimization of extending the width of the search branches is essentially a method of loop unrolling. Compared with general loop unrolling, it is a non-linear loop unrolling, in which the ratio of the saved number of iterations is less than that of the increased computation inside the loop. Therefore, this optimization method is called search branch unwinding. Figure 4 shows the code snippet for SIMD tree search optimized with branch unwinding, implemented with SSE.

Fig.4  Code snippet for SIMD tree search optimized with branch unwinding
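Since the original figure cannot be reproduced here, the following is a minimal sketch of what one unwound iteration may look like with SSE (our reconstruction under the stated parameters $d_1 = 3$, $N_1 = 7$, $S_1 = 8$, not the paper's exact Fig.4 code):

    #include <nmmintrin.h>   /* SSE2 compares + SSE4.2 popcount */

    /* One iteration with the branch width unwound to 8: the block's 7 keys
       occupy two 16-byte groups (7 keys + 1 padding slot).  Two
       compare/movemask groups are aggregated into one 8-bit mask, and
       popcount picks among the 8 branches. */
    static inline int unwound_step(const int *block, int key, int *eq_mask)
    {
        __m128i k  = _mm_set1_epi32(key);
        __m128i n0 = _mm_load_si128((const __m128i *)block);        /* keys 0..3 */
        __m128i n1 = _mm_load_si128((const __m128i *)(block + 4));  /* keys 4..6 + pad */

        int gt0 = _mm_movemask_ps(_mm_castsi128_ps(_mm_cmpgt_epi32(k, n0)));
        int gt1 = _mm_movemask_ps(_mm_castsi128_ps(_mm_cmpgt_epi32(k, n1)));
        int eq0 = _mm_movemask_ps(_mm_castsi128_ps(_mm_cmpeq_epi32(k, n0)));
        int eq1 = _mm_movemask_ps(_mm_castsi128_ps(_mm_cmpeq_epi32(k, n1)));

        *eq_mask = (eq0 | (eq1 << 4)) & 0x7F;   /* aggregated match bits     */
        int gt   = (gt0 | (gt1 << 4)) & 0x7F;   /* drop the padding-slot bit */
        return _mm_popcnt_u32((unsigned)gt);    /* branch number in [0, 7]   */
    }

Note that the instruction count matches the text above: 2 loads, 4 compares, 4 mask extractions, plus the or/shift/and logic operations for aggregation.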
4 Implementation of SIMD instructions

Modern processors usually integrate multiple cores, each with a 128-bit (or wider) SIMD processing unit. Current mainstream x86 architecture processors, including Intel Core i7, Intel Xeon, AMD Phenom, etc., all support the streaming SIMD extensions (SSE) instruction set. SSE supports 128-bit SIMD operations with 8 SIMD registers (16 in 64-bit mode); four operations on 32-bit data can be executed simultaneously by one 128-bit SIMD instruction. SSE was first introduced in the Pentium III series processors and was subsequently expanded to SSE2, SSE3, SSSE3, and SSE4. The Cell BE architecture designs a single-chip multiprocessor consisting of one or more power processor elements (PPEs) and multiple high-performance synergistic processor elements (SPEs)[12]. The SPE is a pure SIMD processor, i.e., it supports SIMD instructions only, and provides 128 registers, each supporting 128-bit SIMD operations. The memory model of an SPE is also different: Instead of caches, each SPE
provides a local store of 256 kB of memory, which is populated by the programmer using asynchronous DMA requests. Table 1 shows the SIMD instructions required on the x86 SSE and Cell BE platforms to implement the blocking optimized SIMD tree search algorithm for the 32-bit integer data type, and their correspondence.

Table 1  SIMD instructions required on two platforms

                               SSE series             Cell BE SPE
    Memory operations          _mm_set1_epi32()       spu_splats()
                               _mm_load_si128()       (loading controlled by compiler)
    Comparison operations      _mm_cmpeq_epi32()      spu_cmpeq()
                               _mm_cmpgt_epi32()      spu_cmpgt()
                               _mm_movemask_ps()      spu_gather()
    Node number calculation    _mm_popcnt_u32()*      spu_cntb()

    Note: * Available on SSE4.2 or later.
On the SSE platform, the popcount() operation has different implementations depending on the SSE version. On platforms supporting SSE4.2, the POPCNT instruction is used to achieve the best performance[13]; a lookup table can be used instead of this particular instruction if SSE4.2 is not supported. SSE4.2 is supported by Intel Core i7 and later processors, and AMD processors since the Barcelona microarchitecture offer an equivalent POPCNT instruction.
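The lookup-table fallback mentioned above can be as simple as an 8-bit table, which is sufficient for the small masks produced by the movemask instructions (a sketch; the table and function names are ours):

    static unsigned char popcnt8[256];

    /* Fill the table once at startup: popcnt8[i] = popcnt8[i/2] + (i & 1). */
    static void init_popcnt8(void)
    {
        for (int i = 1; i < 256; i++)
            popcnt8[i] = popcnt8[i >> 1] + (unsigned char)(i & 1);
    }

    /* Table-driven popcount for masks of at most 8 bits. */
    static inline int popcount8(unsigned mask)
    {
        return popcnt8[mask & 0xFF];
    }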
5 Experiments

5.1 Experimental setup
We compared various implementations of search algorithms on the experimental platform, including non-SIMD tree search, generic SIMD tree search, SIMD tree search optimized by dynamic blocking prefetch, and SIMD tree search optimized by search branch unwinding. The experimental platform is based on the Intel Core i7 processor. All algorithms were implemented in C/C++ using SIMD intrinsics. The benchmark programs were compiled with GCC and run on 64-bit Linux. The data type used in the experiments is the 32-bit integer, and data set sizes from $2^8$ elements (approximately 1 kB) to $2^{28}$ elements (approximately 1 GB) were tested. The performance is measured by the number of processor cycles required to find a given search key, averaged over 100 000 repetitions.
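For the cycle measurement, a typical approach on x86 reads the time-stamp counter around the repeated searches (a sketch using GCC's __rdtsc; the search function pointer is a placeholder):

    #include <stddef.h>
    #include <stdint.h>
    #include <x86intrin.h>   /* __rdtsc */

    #define REPS 100000      /* repetitions, as in the setup above */

    /* Average cycles per search over REPS runs of the same query. */
    static uint64_t cycles_per_search(long (*search)(const int *, size_t, int),
                                      const int *a, size_t n, int key)
    {
        uint64_t t0 = __rdtsc();
        for (int r = 0; r < REPS; r++)
            (void)search(a, n, key);
        return (__rdtsc() - t0) / REPS;
    }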
5.2 Intel Core i7
The following set of experiments was performed on the Intel Core i7 920 processor. The Intel Core i7 920 is a 2.66 GHz processor based on the Nehalem microarchitecture, featuring 4 cores, 8 logical threads, an 8 MB L3 cache and the Intel QuickPath interconnect. Figure 5 shows the experimental results comparing the 5 algorithms on the Core i7 platform. Binary search is implemented by the bsearch() function of the standard C library. Non-SIMD tree search performs better than binary search when the data set is smaller than 64 MB; when the data set is larger than 64 MB, the overhead of non-SIMD tree search exceeds that of binary search due to the low cache hit rate. Compared with binary search, un-optimized SIMD tree search achieves a 2.89 times speedup when the data set size is 16 kB; as the data set grows, the speedup drops to 1.36 times for a 512 MB data set due to the lower cache hit rate. Because of the limitation of hardware support, dynamic blocking prefetch and search branch unwinding cannot be used at the same time on the Core i7 platform. The optimization of dynamic blocking prefetch with the parameters $d_1 = 2$ and $D_{\text{Block}} = 4$ achieves a speedup of 1.37 times over un-optimized SIMD tree search when the size is 512 MB, and the speedup increases with the growth of the data set. With the parameters $d_1 = 3$, $N_1 = 7$, the optimization of search branch unwinding achieves a 1.64 times performance improvement over the un-optimized algorithm and a 2.23 times improvement over binary search.
Fig.5  Results of comparing various search algorithms on Core i7 processor (lower is better)
6 Conclusions

This paper proposes the blocking optimized SIMD tree search algorithm based on previous studies to exploit the efficiency of modern processors in memory access and computation. Based on the blocking tree structure, blocking for memory alignment and dynamic blocking prefetch are proposed to reduce the overhead of memory access. Furthermore, as a form of non-linear loop unrolling, search branch unwinding shows that the number of branches can exceed the data width of the SIMD instructions in an SIMD search algorithm. The experiments suggest that the blocking optimized SIMD tree search algorithm can achieve a response speed 1.6 times faster than the un-optimized algorithm, and 2.2 times faster than generic binary search, on an Intel Core i7 920 processor.
References

[1] Kim C, Chhugani J, Satish N, Sedlar E, Nguyen A D, Kaldewey T, Lee V W, Brandt S A, Dubey P. FAST: Fast architecture sensitive tree search on modern CPUs and GPUs [C]// Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, Indianapolis, USA. 2010: 339–350.

[2] Yang Y, Wang Y. Dictionary mechanism for Chinese word segmentation: Initial Bopomofo of second-character Hash mechanism [J]. Computer Engineering and Design, 2010, 31(6): 1369–1375.

[3] Knuth D E. The art of computer programming, volume III: Sorting and searching [M]. Boston: Addison-Wesley, 1973.

[4] Schlegel B, Gemulla R, Lehner W. k-Ary search on modern processors [C]// Proceedings of the 5th International Workshop on Data Management on New Hardware, Providence, Rhode Island. 2009: 52–60.

[5] Lin Hai-bo, Xie Hai-bo, Shao Ling, Wang Yuan-hong. Cell BE processor programming guide [M]. Beijing: Publishing House of Electronics Industry, 2008 (in Chinese).

[6] Gedik B, Bordawekar R R, Yu P S. CellSort: High performance sorting on the Cell processor [C]// Proceedings of the 33rd International Conference on Very Large Data Bases, Vienna, Austria. 2007: 1286–1297.

[7] Ross K A. Efficient hash probes on modern processors [C]// Proceedings of the IEEE 23rd International Conference on Data Engineering, Istanbul, Turkey. 2007: 1297–1301.

[8] Kim C, Sedlar E, Chhugani J, Kaldewey T, Nguyen A, Di Blas A, Lee V, Satish N, Dubey P. Sort vs. hash revisited: Fast join implementation on multi-core CPUs [J]. Proceedings of the VLDB Endowment, 2009, 2(2): 1378–1389.

[9] Zhou J, Ross K A. Implementing database operations using SIMD instructions [C]// Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data, New York, USA. 2002: 145–156.

[10] Kaldewey T, Hagen J, Di Blas A, Sedlar E. Parallel search on video cards [C]// The First USENIX Workshop on Hot Topics in Parallelism, Berkeley, CA. 2009.

[11] Gerber R, Bik A, Smith K, Tian X. The software optimization cookbook: High-performance recipes for IA-32 platforms [M]. 2nd ed. Hillsboro: Intel Press, 2006.

[12] IBM, Sony, Toshiba. SDK for multicore acceleration, programming tutorial [EB/OL]. Version 3.1. (2008-10-24) [2011-04-30]. http://public.dhe.ibm.com/software/dw/cell/CBE_Programming_Tutorial_v3.1.pdf.

[13] Intel Corporation. Intel SSE4 programming reference [EB/OL]. (2007-07-12) [2011-04-30]. http://software.intel.com/file/18187/.

(Editor HONG Ou)