Exploiting CPU bit parallel operations to improve efficiency in search

Pablo San Segundo, Diego Rodríguez-Losada, Ramón Galán, Fernando Matía, Agustín Jiménez Intelligent Control Group, Universidad Politécnica de Madrid [email protected]

Abstract

It is the authors' belief that the ability of processors to compute bit parallel operations deserves to exist as an optimization discipline in its own right, rather than as a state-of-the-art technique. This paper is a step forward in this direction, analysing a number of key issues related to bit model design and implementation for search problems. Building efficient search algorithms optimized at bit level is a difficult task. It requires not only a good implementation but also an adequate model that makes bitwise operations meaningful, and only by addressing design and implementation globally can an improvement in efficiency be obtained. Moreover, we prove in this paper that for problems in NP the improvement can conceivably be exponential in the size of the CPU register. This global approach has been used to implement the fastest complete algorithm, to the best of our knowledge, for the maximum clique problem, an important NP-hard combinatorial problem from the graph domain. Our solver clearly outperforms other existing algorithms on hard instances, in some cases by nearly an order of magnitude.

1. Introduction

Optimizing applications to exploit CPU bit level parallelism is, at the moment, more of an art than a systematic optimization technique. Working at bit level is nothing new: the C++ STL1, for example, provides a bitset container as a data type. However, improving overall efficiency through bit-masking operations is hard, because it is not just an implementation problem: it reaches back and conditions analysis and design. In many cases the opposite approach is taken: a problem is solved, computation times do not meet requirements, and only

1 STL: Standard Template Library

then is the idea of bit parallel optimization suggested (if at all). Another factor which has influenced the lack of interest in this field is a widespread a priori opinion in the scientific community that the benefits of CPU parallelism at bit level are offset by the overhead needed to manage information relative to single bits in the compact bit array. We believe, however, that bit parallel optimization is not just an implementation trick but deserves to exist as an independent discipline. This paper can be considered a continuation of the research done in [10], [11] and is a step forward in pursuit of this goal.

The first part of this paper is concerned with theoretical aspects of bit model design in relation to search problems. To this end, a specific class Π of search problems is defined formally and the discussion centers on problem instances in Π, proving for a concrete instance that it is theoretically possible to obtain exponential benefits in time by employing bit models. The second part of the paper applies these techniques to implement the most efficient complete algorithm at the moment for the maximum clique problem (MCP), BB-MCP.

MCP is a classical combinatorial problem from the graph domain known to be NP-hard [5] which can be formulated informally as follows: given an undirected graph G, find the largest set of vertices in G such that every pair of them is adjacent. Graphs where this property holds are said to be complete and are commonly called cliques. MCP is the optimization counterpart of the k-clique problem, which is to find a clique of k vertices in a given graph. The k-clique problem has been known to be NP-complete ever since Richard Karp included it in his famous original 21 problems [7]. Clique study has several important practical applications such as computer vision, signal processing, global localization [12], security systems and reasoning in networks. An interesting read on clique applications can be found in [2].

2. Complexity for bitboard models

To establish a formal comparison in time complexity between bit and non bit models it is necessary to have a domain framework. At the beginning of this section we define a generic class Π of search problems. We then proceed to analyze the effects on overall complexity of a bit model implementation compared to a classical one, and address the issue of establishing a theoretical threshold for the improvement in time. We have chosen our generic problem class Π so that a bit optimized search algorithm for any of its instances is intuitively expected to perform well. This loss of generality, however, does not prevent Π from including an NP-complete problem, which makes our analysis directly applicable to search problems in NP.

Our formal model is based on computer vision relational structures, but seen from a search perspective. Let D be a generic search domain of cardinality n, and let all knowledge concerning a particular state Si in D be represented by a set of binary relations Ri between pairs of elements. Ri is therefore any subset of D×D. Let S be the set of all possible states in D. Let Si ∈ S be the state of the world at a given moment in time; according to the previous definitions, Ri must hold. A transition from any state in S to a state Sj can now be defined informally as the changes in the world which assert Rj.

Let the domain D previously defined be the generic search domain for all instances in Π. We further assume, for the sake of simplicity, that problems in Π are well defined, that is, they can be formulated by a tuple ⟨So, G, O⟩, where So ∈ S is the initial state, G ⊆ S is a set of goal states (possibly one), and O is a set of specified rules (usually called operators) which depend on the concrete problem instance. Solving a problem instance in Π means finding a path (or all paths), i.e. a set of transitions, to reach any state in G starting from So according to the operators. Finally, we assume all worlds to be deterministic, so only one state is possible at any given moment in time.

2.1. An adequate bit model

How should one choose an adequate bit model for instances of our previously defined class Π? The intuition is that CPU bit parallelism should be used at least to improve the computation of the most frequent operations that take place during search, which are undoubtedly those involved in state transitions. A less obvious but equally important computation is checking whether the current state is one of the goal states in G. For problems in NP, this last computation is expected to be in P. We formalize these ideas in what we have called the NP-strategy:

NP-STRATEGY. Given a problem instance P ∈ Π, we define informally an adequate bit model for P as any bit interpretation such that state transitions and the checking of candidate solutions can be computed by a small number of bitwise operations. To break ties we recommend that checking of candidate solutions be achieved by as few bit parallel operations as possible.

If we apply the NP-strategy to Π, a possible bit interpretation for all instances is a mapping such that, for all x, y ∈ D, there is a corresponding bit bxy = 1 iff xRiy holds. The smallest adequate bit model for every member of Π requires n² bits, which is the maximum cardinality of any relation set for a given state; the proof is trivial. It is theoretically possible that further additional bits could reduce the number of bit parallel operations needed for candidate solution checking in particular problem instances (e.g. a bit bxyz = 1 extending binary relations to triads (x, y, z)), but this cannot be extended to every instance in Π.

The implementation of the previous bit model requires a two dimensional bit data structure. Let W_SIZE be a generic data type which represents the widest unsigned integer that the language compiler used can handle. A typical C language type declaration for our bit model would then be

    W_SIZE bit_model[|D|][|D|];

A generic transition or candidate solution checking procedure implemented using bit_model can only benefit from bit parallelism in one of the two dimensions.

2.2. Candidate solution checking

Consider a simple procedure which checks whether a particular state Si is a goal state. In a non bit model, any algorithm must check every relation in Ri independently, for any problem instance belonging to Π. In a bit model, however, the intuition is that there could be a time reduction of wsize (the size of the CPU registers) when checking relations between pairs of elements. More formally, let BG be a bit encoding of the goal state G and let BS be a bit encoding of the current state. Checking whether BG = BS can be done by matching wsize bits at a time for every element, so time complexity is n·(n/wsize), which is still in O(n²) but where the wsize reduction can make a huge difference in real life applications.

2.3. Reasoning with relations

A basic issue when considering the time complexity of a search problem instance belonging to our class Π is reasoning with relations between elements. In a conventional encoding, accessing all relations of a specific element simply means reading a one dimensional data structure, with a minor additional overhead to manage a loop. This complexity does not depend on the density of relations.

Things differ in a bitboard model. Since we are operating at bit level, basic reading operations need to be implemented very carefully, as they can have a very negative impact on overall efficiency. In this section we discuss two such low level operators: 1) finding the next meaningful bit, known as LSB2 (which can be employed to find the next element related to a given one), and 2) finding the number of meaningful bits in a bit vector, known as population count or simply popcount (PC), which can be used to count the number of relations in our model. Our implementation of both operators works repeatedly over chunks of wsize bits until a global solution is found. For each chunk we use precompiled knowledge of all possible 16 bit combinations, so we expect roughly a 16-fold reduction in time compared to a conventional encoding. In both cases additional overhead is needed for matching control, to ensure a systematic search of the bit sets.

Operator LSB is more problematic than operator PC. In our implementation it actually only works well for domains with a low density of relations. Repeated use of the LSB operator to find the first k elements related to a given one turns out to be quite inefficient for relation densities above 0.1. Figure 1 shows computing times in microseconds to find the first 100 related elements to a given one, out of a randomly generated population of 5,000, for different densities of relations; legend BB stands for the bit model and NBB for the conventional model. Final times result from adding up 1000 runs on a Pentium 4 at 2.7 GHz running on a Windows kernel.

[Figure 1. Computing times for finding the first 100 related elements to a given one, for a randomly generated population of 5,000, summed over 1000 runs on a P4 2.7 GHz CPU running on a Windows kernel. x-axis: density (.1 to .9); y-axis: time (µs, 0 to 25000); series: NBB (conventional model), BB (bit model).]

2 LSB: typical bit level operator which finds the least significant 1-bit in a bit vector

This setback, although striking, is not the end of the story. The fact is that looking for the first k related nodes to a given one may not be that important during search. For example, a simple node selection strategy might only need to find the least significant bit. An important idea to keep in mind is that, for bit parallel optimization to work, the semantics of major procedural knowledge should concern relations over sets of elements rather than single elements whenever possible.

2.4. Breaking the wsize barrier

The intuition behind bit parallel optimization would lead one to think that, however good the implementation of our bit model is, it cannot be expected to improve the overall time complexity of a generic problem instance in Π by more than a constant wsize factor compared with the corresponding implementation of a classical model. This raises a fundamental issue: is there a conceivable meaningful procedure for problem instances in Π which could be computed exponentially better under a bit model implementation? In other words, is it possible to break the wsize barrier and get a linear or even an exponential benefit? We prove in this subsection that this is indeed so, by giving an example.

Consider a problem instance in Π with k all-or-nothing properties Pk defined for all elements of the domain. We consider a bit model for the properties such that Pk = {bk1, bk2, …, bkn}, bkj ∈ {0, 1}, where bit bkx is set to one iff the k-th property holds for element x in the domain. We then partition our domain into exactly wsize subsets, the first subset made up of the first block of wsize elements, the second subset of the second block, and so on. Finally, we define knowledge K as the number of those subsets which have the same distribution of all k properties Pk; that is, we want to find k-tuples of subsets where the properties Pk hold for every member.

To compute all subsets matching on properties P1 and P2, a classical implementation needs to make |P1|·|P2|/wsize comparisons, which is in n²/wsize since by definition |Pk| = n, ∀k. For a triad of properties the computation is in n³/wsize², and in n^k/wsize^(k−1) for the full k properties. In the case of a bit model, comparisons become bitwise AND operations which can be done wsize at a time, so the overall computation of K takes just n^k/wsize^(2(k−1)), a wsize^(k−1) reduction. If we consider a scenario where K is evaluated for n properties, we obtain an exponential reduction in n, of wsize^(n−1). This completes the proof.

We have employed a similar scheme to compute knowledge efficiently for the maximum clique problem (MCP). The result is BB-MCP, to the best of our knowledge the most efficient exact general purpose MCP algorithm. Results of our experiments are shown in section 4.

2.5. Bit models for the NP class

In the previous subsection we proved that it is conceivable to implement a very efficient search procedure for instances of our generic class Π. We now proceed to show the importance of our analysis by proving the existence of at least one NP-complete problem from the graph domain which belongs to Π. Since, by the definition of NP-completeness, all other NP problems can be reduced to it, our research has been directed at finding meaningful concrete knowledge for that domain, in particular for the maximum clique problem and the corresponding existence problem, k-clique. This gives us a practical tool for bit parallel optimization of all problems in the NP class, namely reduction to a bit optimized maximum clique algorithm, an important step forward in pursuit of a formal framework for this optimization technique.

Given an undirected graph G made up of vertices V and edges E, G = {V, E}, two vertices are said to be adjacent if they are connected by an edge. A clique is a subgraph of G in which every two vertices are adjacent. The k-clique problem is to find a clique of k vertices and is known to be NP-complete [5][7].

Theorem: The k-clique problem belongs to Π.

Proof: The problem domain entities are the members of the vertex set V. Let there be a relation between a pair of domain elements in Π iff there exists a corresponding edge between them in G. A possible initial state is any vertex. The goal states can be implicitly defined as any subgraph induced over G which contains a clique of size k. The problem thus formulated clearly belongs to our class Π and is equivalent to the k-clique problem.

3. A bit model for the maximum clique problem

In the previous section we showed the theoretical importance of building an efficient bit optimized algorithm for the maximum clique problem (MCP). MCP is the optimization counterpart of the k-clique problem from the graph domain, which is known to be NP-complete [5] and belongs to our domain framework Π.

In our previous work [11], a fast, complete and well known MCP algorithm [6] was optimized for bit parallel operations and compared with a classical implementation of the same algorithm. Computational results were very encouraging, to say the least. Further work on the reformulation of knowledge in the domain has led us to a new algorithm, BB-MCP, which clearly outperforms previously existing solvers (to the best of our knowledge), in particular the recent MCR [13].

3.1. Preliminaries

Our specific problem instance is concerned with a simple undirected graph G = (V, E) with a finite set of vertices V and a set E of pairs of vertices (x, y) called edges. Two vertices are said to be adjacent if they are connected by an edge. A clique is a subgraph of G in which every two vertices are adjacent. Given a subset of vertices W ⊆ V, the subgraph over G induced by W, written G(W), contains exactly those edges of E with both endpoints in W. The maximum clique problem (MCP) consists in finding the largest possible clique in a given graph. Standard notation for the basic concepts used throughout this section includes Γ(v), the subgraph induced over G by all vertices adjacent to v, and ω(G) or ω(V), the number of vertices of a maximum clique of the graph. Usually the elements of the vertex set V are ordered; we denote by vi the i-th vertex of the set.

3.2. Existing algorithms

The main paradigms for efficient MCP solvers are backtracking, as a way to compute systematic search, and branch and bound. Early work includes [3], which finds all the cliques of a graph. More recent algorithms for the maximum clique problem which use some form of branch and bound are [8] and [9], as well as the recent MCR [13], which clearly outperforms the rest on average.

    Initialization: U = vertex set of graph, size = 0
    function clique(U, size)
      Step 1: Leaf node (|U| = 0)
        1. if (max_size < size) then max_size = size
