Diploma Thesis

Simultaneous Scheduling, Binding and Routing for Processor-Like Reconfigurable Architectures

Janina Alexandra Brenner

Supervisor: Prof. Dr. Sándor Fekete
Institut für Mathematische Optimierung
Carl-Friedrich-Gauß-Fakultät für Mathematik und Informatik
Technische Universität Braunschweig

Braunschweig, October 25, 2005

Statutory Declaration

I hereby declare in lieu of an oath that I have written this thesis independently and using only the resources stated.

Braunschweig, October 25, 2005

Contents

1 Introduction

2 Problem Characterization
  2.1 Problem Setting
    2.1.1 Architecture
    2.1.2 Instances
    2.1.3 Mapping
  2.2 Introductory Graph Theory
  2.3 The Directed Subgraph Homeomorphism Problem

3 Computational Complexity
  3.1 A Logic Engine for Proving NP-Completeness
    3.1.1 The Gadget
    3.1.2 The Encoding
  3.2 Application of the Logic Engine
    3.2.1 Base
    3.2.2 Poles
    3.2.3 Flags

4 Special Cases of Data-Flow Graphs
  4.1 NP-Completeness for Trees
    4.1.1 Flags
    4.1.2 Poles
    4.1.3 Base
  4.2 Polynomial Cases
    4.2.1 Paths and Semi-Paths
    4.2.2 Semi-Cycles

5 Exact Solution
  5.1 A “Good” ILP Formulation for Scheduling Problems
  5.2 ILP Constraints
    5.2.1 Variables
    5.2.2 Assignment Constraints
    5.2.3 Timing Constraints
    5.2.4 Storing Constraints
    5.2.5 Resource Constraints
    5.2.6 Location Constraints
    5.2.7 Usage Constraints
  5.3 Two Objective Functions
    5.3.1 Minimize Time
    5.3.2 Minimize Storage
  5.4 Examples and Running Time
    5.4.1 A Notion on Time Bounds
    5.4.2 The Impact of the Objective Function
    5.4.3 A Real Problem Instance

6 Approximation and Heuristics
  6.1 A |V|^{3/2}-Approximation
  6.2 Heuristic Algorithm
    6.2.1 Operation and PE Priorities
    6.2.2 Marking System
    6.2.3 Proceeding
    6.2.4 Evaluation

7 Conclusion

Appendix
  List of Figures
  List of Tables

Bibliography

Chapter 1

Introduction

Today, the market offers a wide range of processors. They vary from hard-wired multi-purpose processors as used in personal computers to microprocessors that are specifically designed to perform a single task. While software-programmable multi-purpose processors are flexible but often inefficient, Application Specific Integrated Circuits (ASICs) solve one specific task efficiently but lack flexibility. Because they combine the advantages of the two, reconfigurable devices like Field Programmable Gate Arrays (FPGAs) have attracted considerable attention. They solve computationally involved applications fast and can be employed for various tasks.

In recent research, however, an even more flexible type of reconfigurable processor has been developed. While FPGAs can be customized for the execution of distinct applications, the new devices are reconfigured continually during run-time. This so-called processor-like reconfigurability [9] is accomplished by extremely fast reconfigurations in less than a clock cycle. It combines the advantages of flexibility and time to market of hard-wired processors with the short execution time and low energy consumption of specialized microprocessors. Processor-like reconfigurable devices promise to outperform even FPGAs in compute-intensive areas like visual computing and mobile communication.

In this thesis, we are concerned with optimizing the performance of coarse-grained processor-like reconfigurable devices. Their efficiency depends on an expedient mapping of applications to the underlying architecture. The architecture consists of a network of processing elements (PEs) that are arranged on an orthogonal grid. PEs can be dynamically configured to execute one of a given set of elementary algebraic or logic operations in each clock cycle. On an abstract level, the application to be executed is modeled as a directed graph, also called a “data-flow” graph.
The vertices represent operations, and arcs stand for dependencies between these operations. A feasible binding assigns a PE and an execution time to every operation while complying with the data dependencies. Traditionally, the mapping is done in two steps. First, every operation is assigned to a clock cycle by a scheduling algorithm. Second, a feasible mapping is sought such that routings between dependent operations can be realized. However, performance can only be optimized if both scheduling and routing are considered at the same time. This proves to be a mathematical optimization problem.

Our aim is to study the problem of Simultaneously Scheduling, Binding and Routing applications to a specific type of processor-like reconfigurable architecture. We base our research on a model proposed by Oppold, Schweizer, Kuhn, and Rosenstiel in the Configurable Reconfigurable Core (CRC) project [11]. The CRC model is a very general model that allows for large-scale optimization while corresponding to a real architecture.

The thesis is organized as follows: Chapter 2 gives a detailed description of the architecture model, followed by a brief introduction to graph theory that allows us to formulate the problem in a mathematical way, namely as a special case of the Directed Subgraph Homeomorphism problem. In Chapter 3, we prove that the Simultaneous Scheduling, Binding and Routing problem is NP-complete by a reduction from a variant of 3-Satisfiability. Chapter 4 discusses some special cases of data-flow graphs. We show that the problem remains NP-hard even for the class of directed trees. The practical part of the thesis starts with Chapter 5. We present an Integer Linear Program (ILP) formulation that gives time-optimal solutions for all input applications. We also address energy consumption by proposing an alternative objective function that minimizes storage of intermediate results. In Chapter 6, we describe a |V|^{3/2}-approximation algorithm for the time-minimization problem. We also propose and benchmark a promising new heuristic that solves practical instances nearly optimally within seconds.

Chapter 2

Problem Characterization

Figure 2.1: PEs with communication arcs

In this chapter, we define the Simultaneous Scheduling, Binding and Routing (SSBR) problem for processor-like reconfigurable architectures as considered in this work. Since a variety of architectures incorporating processor-like reconfigurability already exists, we specify the problem setting through a detailed description in Section 2.1. Section 2.2 offers a brief introduction to graph theory as it is used in this work. The SSBR problem can be described as a special case of a graph-theoretical problem called the Directed Subgraph Homeomorphism problem. A definition of this problem and the respective modeling is given in Section 2.3.

2.1 Problem Setting

Dynamically reconfigurable architectures include the Dynamically Reconfigurable Processor Architecture (DRP) by NEC, eXtreme Processor Platform (XPP) by PACT, the DAP/DNA architecture by IPFlex, reconfigurable DSP (rDSP) by Morpho Technologies, and the Reconfigurable Communications Architecture (RCA) by Intel [12]. However, this thesis is based on the Configurable Reconfigurable Core (CRC) model that was developed by Oppold, Schweizer, Kuhn, and Rosenstiel at the University of Tuebingen [9]. The CRC model is a very general variant that allows for large-scale optimization while still being implementable as a real architecture. The setup is described in the following subsections.

2.1.1 Architecture

The CRC model basically consists of a set of processing elements (PEs), each of which carries a functional unit and a memory cell (see Figure 2.2). PEs can process data that is received from other PEs through interconnect wires. We consider coarse-grained reconfigurable architectures, where processing elements can execute single operations such as ∗, +, -, AND, OR, and ==. The reconfigurability of the architecture implies that all operations can be done by all PEs, i.e. PEs have no specialization.

In this model, it is assumed that all operations require the same processing time. We impose clock cycles on the whole process, such that each PE may execute one operation in each of these time steps. Since some of the operations, particularly multiplications, are much more costly than others, this means that PEs may sit idle for some time after having processed a simpler operation. However, recent research tries to balance running times by power differentiation [9].

Figure 2.2: A simple representation of a processing element (functional unit, register set, memory cell)

PEs are arranged in a 2-dimensional orthogonal grid and are connected through an interconnect network. The size of the grid is variable or may be given with the input application. In practice, the length and width vary between 2 and 20. Without loss of generality, we will assume that the PEs lie on integer coordinates in the plane, such that the distance between two neighboring PEs is normalized to 1.

In this thesis, we consider an interconnect network that provides connections between horizontally and vertically neighboring PEs. In other words, PEs are connected to each other iff their Manhattan distance equals 1. See Figure 2.1 for an illustration. Information can be exchanged between directly connected PEs in the time between two clock cycles. Data may be transmitted in both directions at the same time, which is why we draw arcs in both directions. However, communication between non-neighboring PEs takes several time steps, because the information has to be buffered on the PEs lying “on the way”.

During the same time, between the processing steps, reconfiguration of the PEs is performed. Due to recently developed techniques, this takes only about 2 ns, which is much less than the time required to process an operation [9]. The fast reconfiguration enables a single PE to execute, e.g., a multiplication in one step and a logical comparison in the next step, which is not possible even in FPGAs.

2.1.2 Instances

An instance of the SSBR problem is an application consisting of a number of operations and certain interdependencies. Applications can generally be taken from any area, but the most impressive improvements compared to hard-wired processors are made for compute-intensive applications. Some interesting applications come, for instance, from the area of visual computing. We hope that this work contributes to determining exactly for which types of applications the reconfigurable architecture presented here is best suited.

In our model, all operations are equivalent in the sense that each of them can be processed by any PE in any clock cycle. Together with the assumption that all operations require the same processing time, this removes any distinction between different types of operations. The important properties of an operation are its dependencies on other operations. They determine whether an operation must be processed before or after other operations and whether it needs to be executed on a PE “close” to other operations because it uses their results. As a consequence, we neglect the specific operation types and simply view an input application as a directed graph (see Definition 2.1). Figure 2.3 shows the directed graph corresponding to the calculation of ((a + b) − c) ∗ a.

Figure 2.3: An input application and the corresponding directed graph

The vertices of the directed graph correspond to the operations, or “jobs”, of the application. An edge from a node representing some operation i to that representing operation j means that j “depends” on i, that is:

• Operation j may only be processed after the execution of operation i, and
• The result of operation i needs to be “routed” to operation j to serve as an input value.

Due to the representation as a directed graph, an instance of the SSBR problem is also referred to as a “data-flow” graph. Note that in this model, we do not consider input or output values that have to be placed at specific PEs. Hence, even operations without incoming or outgoing edges can be treated as usual.
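As an illustration, the data-flow graph of ((a + b) − c) ∗ a from Figure 2.3 can be written down as a vertex list and an edge list. The following Python sketch is not part of the thesis; it assumes that the input values a, b, c appear as source vertices, and the vertex names and helper functions are hypothetical.

```python
# Minimal sketch (illustrative only): the data-flow graph of
# ((a + b) - c) * a as a simple directed graph D = (V, E).
# Assumption: the input values a, b, c are modeled as source vertices.
vertices = ["a", "b", "c", "add", "sub", "mul"]
edges = [
    ("a", "add"), ("b", "add"),    # a + b
    ("add", "sub"), ("c", "sub"),  # (a + b) - c
    ("sub", "mul"), ("a", "mul"),  # ((a + b) - c) * a
]


def predecessors(v, edges):
    """Operations whose results must be routed to v."""
    return sorted(u for (u, w) in edges if w == v)


def successors(v, edges):
    """Operations that depend on the result of v."""
    return sorted(w for (u, w) in edges if u == v)
```

The edge list is all that the model retains of the application; the operation types themselves play no role.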

2.1.3 Mapping

A solution to the SSBR problem is a feasible mapping of an application to the architecture. Such a mapping, or “binding”, assigns each operation to a PE and a clock cycle for its execution. In order for a mapping to be feasible, the dependencies between operations have to be taken into account. An edge from operation i to j implies that the two above requirements have to be met.


The first property is easy to ensure: the clock cycle to which j is scheduled needs to be later than the one to which i is scheduled. For the second, the “routing” condition, the underlying interconnection network is significant. Results can only be routed over existing connection wires. If the two dependent operations are assigned to neighboring PEs in consecutive clock cycles, the result can be transmitted between the two processing steps. Operation j can then directly reuse the result of operation i as an input value.

However, there are cases in which the two operations are either processed by PEs that are not neighbors, or are assigned to time steps that are not directly consecutive. This means that the result of operation i needs to be transmitted over more than one arc and therefore be “buffered” on some PE(s). In this case, the routing process may require several time steps. Due to the technical realization of our model, a PE over which a result is routed is considered to be occupied with buffering during the respective processing steps. Thus, no other operation can be executed or stored on this PE at the same time. The routing aspect is therefore significant in the solving process of an SSBR instance.

From a practical point of view, it is of course important to find not only a feasible but a “good” mapping. The most important property of a mapping is the number of clock cycles needed to execute the input application. Hence we often want to minimize time while the architecture size is fixed. However, we may also be allowed to vary the architecture dimensions in order to find a shorter mapping. Another aspect of a good binding is the amount of energy consumed by its execution. In the course of this work, we address different objectives. Nevertheless, the main emphasis is on time minimization.
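The two requirements on a dependency (i, j) translate into a simple necessary condition: j must be scheduled strictly later than i, and since a result travels at most one PE per time step, the Manhattan distance between the two PEs cannot exceed the number of time steps between them. The sketch below checks only this condition. It is an illustration under stated assumptions, not the method of the thesis: the representation of a binding as a dictionary of (x, y, t) triples and the function names are hypothetical, and conflicts between routes that compete for the same buffering PE are deliberately not checked.

```python
# Sketch of a necessary (not sufficient) feasibility test for a binding.
# Assumption: `binding` maps each operation to a triple (x, y, t).


def manhattan(p, q):
    """Manhattan distance between two PE coordinates."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


def violated_edges(edges, binding):
    """Return the dependencies (i, j) that cannot be routed at all.

    Operation j must start strictly later than i, and the two PEs must be
    at most t_j - t_i grid steps apart. PEs that are occupied by buffering
    other results are NOT taken into account here.
    """
    bad = []
    for i, j in edges:
        xi, yi, ti = binding[i]
        xj, yj, tj = binding[j]
        if tj <= ti or manhattan((xi, yi), (xj, yj)) > tj - ti:
            bad.append((i, j))
    return bad


edges = [("add", "sub"), ("sub", "mul")]
ok = {"add": (0, 0, 1), "sub": (1, 0, 2), "mul": (1, 1, 3)}
too_far = {"add": (0, 0, 1), "sub": (2, 0, 2), "mul": (2, 0, 3)}
```

In the second binding, the result of `add` would have to cross two wires within a single time step, so the dependency (`add`, `sub`) is reported.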

2.2 Introductory Graph Theory

In the course of this work, we use some mathematical terminology, mainly from the field of graph theory. In order to avoid misunderstandings due to inconsistencies in the literature, the most important definitions and notations are given here.

2.1 Definition: A finite directed graph D = (V, E) consists of a pair of finite sets (V, E) and an incidence function ψ : E → V^2. Another common name for a directed graph is digraph. An element v ∈ V is called node or vertex, and an element e ∈ E with ψ(e) = (u, v) is called (directed) edge from u to v and is often denoted as e = (u, v). If e = (u, v) ∈ E, then e is incident to u and v, and u is a (direct) predecessor of v and v is a (direct) successor of u. A graph is called simple if it does not contain loops (edges with e = (v, v)) or parallel edges (two edges with e_1 = e_2 = (u, v)).

2.2 Definition: Every directed graph D = (V, E) has an underlying undirected graph U = (V, E) consisting of the same sets (V, E) and an incidence function ψ′ that assigns to each e ∈ E an (unordered) set of two elements of V, such that ψ(e) = (u, v) ⇒ ψ′(e) = {u, v}.

Edges in an undirected graph do not point from one node to another, but simply connect two vertices. In this work, all graphs are simple finite directed graphs unless stated otherwise, even though they are often referred to simply as directed graphs or graphs.

2.3 Definition: A path from v_1 to v_{l+1} is a sequence v_1, e_1, v_2, e_2, ..., e_l, v_{l+1} of l ≥ 1 edges and l + 1 distinct vertices, such that edge e_k connects the vertices v_k and v_{k+1}.


In the case of a directed graph, each edge e_k is required to point from v_k to v_{k+1}; otherwise we speak of a semi-path. The length of a path is defined as the number of edges it contains. A longest path of a graph is also called a critical path (CP).

2.4 Definition: A cycle is a path for which v_1 = v_{l+1}, that is, all vertices are distinct except for the first and the last. As above, in the directed case each edge e_k must point from v_k to v_{k+1}; otherwise we speak of a semi-cycle. A directed graph is called acyclic if it does not contain directed cycles. A graph that does not contain any cycles or semi-cycles is called a tree.

2.5 Definition: A topological ordering of a directed graph D = (V, E) is a permutation of its nodes such that for every edge (u, v) ∈ E, u is listed before v.

2.6 Definition: A subgraph of a (directed) graph D = (V, E) is a graph S = (V′, E′) with V′ ⊆ V and E′ ⊆ {e = (u, v) ∈ E | u, v ∈ V′}. A subgraph is said to be induced by the set of vertices V′ if E′ = {e = (u, v) ∈ E | u, v ∈ V′}.
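Definitions 2.3 and 2.5 can be made concrete with a short sketch that computes a topological ordering and the length of a critical path of an acyclic digraph. The code uses Kahn's algorithm, which is a standard technique and not taken from the thesis; the function names and the example graph (the expression ((a + b) − c) ∗ a with inputs as source vertices) are assumptions for illustration.

```python
from collections import defaultdict, deque


def topological_order(vertices, edges):
    """Kahn's algorithm; raises ValueError if a directed cycle exists."""
    succ = defaultdict(list)
    indeg = {v: 0 for v in vertices}
    for u, v in edges:
        succ[u].append(v)
        indeg[v] += 1
    queue = deque(v for v in vertices if indeg[v] == 0)
    order = []
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in succ[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    if len(order) != len(vertices):
        raise ValueError("graph contains a directed cycle")
    return order


def critical_path_length(vertices, edges):
    """Number of edges on a longest directed path of an acyclic digraph."""
    succ = defaultdict(list)
    for u, v in edges:
        succ[u].append(v)
    dist = {v: 0 for v in vertices}  # longest path ending in each vertex
    for u in topological_order(vertices, edges):
        for v in succ[u]:
            dist[v] = max(dist[v], dist[u] + 1)
    return max(dist.values(), default=0)


vertices = ["a", "b", "c", "add", "sub", "mul"]
edges = [("a", "add"), ("b", "add"), ("add", "sub"),
         ("c", "sub"), ("sub", "mul"), ("a", "mul")]
```

Because every dependency forces a strictly later clock cycle, a data-flow graph whose critical path has l edges needs at least l + 1 time steps.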

2.3 The Directed Subgraph Homeomorphism Problem

From a graph-theoretical point of view, the Simultaneous Scheduling, Binding and Routing problem is a special case of the so-called Directed Subgraph Homeomorphism problem, which is defined below.

2.7 Definition: Directed Subgraph Homeomorphism problem (DSH). Given a (large) digraph H, determine whether or not a given directed “pattern graph” G is homeomorphic to a subgraph of H. That is, determine whether there exists a function φ : G → H that maps every node of G to a node of H and each directed arc (a, b) ∈ G to a directed path from φ(a) to φ(b) in H, such that paths may only intersect at their starting or endpoints.

In our case, the application given in the form of a data-flow graph constitutes the pattern graph G. The larger “host” graph H in which we try to “find” the pattern graph is not explicitly given in an SSBR instance. It is a special grid graph whose size is determined by the size of the architecture to which the application is to be bound, and by the number of time steps that we allow. Except for its size, the host graph is the same for every instance of the SSBR problem. That is why we speak of a special case of the Directed Subgraph Homeomorphism problem. As opposed to the case in which the smaller “pattern graph” is fixed, special cases in which the host graph is fixed have rarely been studied so far.

2.8 Proposition: The problem of Simultaneously Scheduling, Binding and Routing to processor-like architectures as defined in Section 2.1 is a special case of the Directed Subgraph Homeomorphism problem if the architecture size and the time frame are fixed.

Proof: As described in Section 2.1, a solution to the SSBR problem consists of two attributes for each node (or job) in G: the coordinates of the PE on which it is to be executed, and the corresponding schedule step. This is equivalent to saying that each job has to be mapped to a node of a 3-dimensional grid with base N × M (the size of the PE architecture) and height S (the number of time steps allowed). This grid forms the vertex set of the host graph H in the corresponding DSH problem (see Figure 2.4). Hence, the vertex image φ(V(G)) represents the PE and time assignment for all jobs.

Figure 2.4: The node structure of H: a PE lattice (base N × M) on S time layers

In order to establish the routings between interdependent operations (i.e. the transmission of results), we need to mark the paths that connect the corresponding processing nodes. Recall that data can be transmitted from one PE to any of its neighbors in the time between two processing steps, or can be saved in the memory cell of the same PE. We model these transmissions by adding directed edges to H from each node to the node directly above it in the next time layer and to the direct neighbors of the latter. Figure 2.5 shows the resulting out-edges of a grid node in H. Note that all edges point from one time layer to the directly following time layer.

Figure 2.5: The out-edges of a grid node (from layer 1 to layer 2)

Thus, we have defined the host graph H. In order for a solution of the respective DSH instance to solve the SSBR problem, it remains to show that the routing of results is translated correctly. Consider an edge (i, j) in the data-flow graph. The homeomorphism image of (i, j) in H is a path from the image of i to that of j which does not intersect any other image path, except possibly at its starting point or endpoint. In terms of the binding problem, this coincides with the fact that every node on the image path is “occupied” with buffering the result of i. Since edges in H only point from a node representing a certain PE to those representing PEs reachable from that PE in the next time layer, the image path constitutes a feasible routing. □

Figure 2.6 shows an example of the described regular grid graph H for N = 6, M = 3 and S = 4. Due to the representation as a subgraph homeomorphism problem, we will also speak of a solution to the SSBR problem as an “embedding” of the data-flow graph (or job graph) into the grid H.
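The construction of the host graph H can be sketched in a few lines. The code below is an illustration under stated assumptions rather than the implementation used in the thesis: nodes are triples (x, y, t) with 0-based coordinates (so the first time layer is t = 0), and the function names are hypothetical.

```python
def out_edges(x, y, t, N, M, S):
    """Heads of the out-edges of grid node (x, y, t) in an N x M x S host graph."""
    if t + 1 >= S:
        return []  # nodes on the last time layer have no out-edges
    targets = [(x, y)]  # the value stays in the memory cell of the same PE
    for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        nx, ny = x + dx, y + dy
        if 0 <= nx < N and 0 <= ny < M:  # neighbor at Manhattan distance 1
            targets.append((nx, ny))
    return [(nx, ny, t + 1) for nx, ny in targets]


def host_graph(N, M, S):
    """Return the node list and the directed edge list of H."""
    nodes = [(x, y, t) for t in range(S) for y in range(M) for x in range(N)]
    edges = [(v, w) for v in nodes for w in out_edges(*v, N, M, S)]
    return nodes, edges


nodes, edges = host_graph(6, 3, 4)
```

For the 6 × 3 × 4 example, this yields 72 nodes; a corner node has three out-edges and an interior node has five.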

Figure 2.6: The host graph H of size 6 × 3 × 4 (layers 1 to 4)

For the sake of completeness, it remains to mention the related concept of topological minors. Although problems studied in that context are rather decision problems (i.e. whether or not a graph is a topological minor of another graph), the relation between the two graphs which is studied is the same:

G is a topological minor of H ⇔ there exists a subgraph homeomorphism from G to H.

Chapter 3

Computational Complexity

In this chapter, we show that the Simultaneous Scheduling, Binding and Routing problem as defined in Chapter 2 belongs to the class of NP-complete problems. As we know from Section 2.3, we can interpret the problem as a special case of the Directed Subgraph Homeomorphism (DSH) problem if we fix the size of the architecture and the time frame we are allowed to use. Although the DSH problem constitutes the associated decision problem, we need to be able to solve it in order to cope with the optimization problem: if we want to minimize the number of time steps that allow for an embedding by a polynomial-time algorithm, we must be capable of deciding in polynomial time whether or not we can succeed in a given number of time steps. Thus, if this is an NP-complete problem already, then the time minimization becomes NP-complete as well.

The Directed Subgraph Homeomorphism problem was shown to be NP-complete in general by Fortune, Hopcroft and Wyllie [6], even if the “pattern graph” is fixed and therefore not counted towards the input size. This result does not imply that the problem remains NP-hard for the special regular host graph considered here, so we present a separate proof. In fact, in Chapter 4, we even show that the problem remains NP-complete if we restrict the data-flow graphs to directed trees.

3.1 A Logic Engine for Proving NP-Completeness

In order to prove hardness, we employ a technique involving a so-called “logic engine”, which we present in this section. The logic engine is a general tool for establishing NP-hardness of graph representations. It was first used by Bhatt and Cosmadakis [2] to show that it is an NP-complete problem to determine whether a tree of maximum degree 4 can be embedded in a planar orthogonal grid, where tree vertices must be positioned at grid vertices and tree edges must occupy unit-length grid edges. A short and accessible description, including an extension called the “wobbly logic engine” that allows for an even wider use of the logic engine, is given in [5].

The logic engine offers a reduction from the Not-All-Equal-3-SAT (NAE-3SAT) problem, whose instances consist of c 3-element clauses over v variables. An instance is solvable iff there is a truth assignment such that every clause contains at least one true and one false literal. Determining whether such a truth assignment exists is known to be NP-complete [7].

For every instance of the NAE-3SAT problem, a gadget graph can be built such that the graph can be represented in a feasible way if and only if the NAE-3SAT instance is solvable. A solution to either problem can then be transformed into a solution of the other in polynomial time. Consequently, if it were not hard to find an embedding of the gadget, the associated NAE-3SAT problem would also be easy to solve, a contradiction.
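To make the NAE-3SAT condition concrete, the following sketch checks whether an assignment gives every clause at least one true and one false literal, and searches for such an assignment by brute force. It is an illustration only: the encoding of literals as signed integers (+i for X_i, −i for its complement) and the function names are assumptions, and the exhaustive search is exponential in v, so it is meant for tiny instances.

```python
from itertools import product


def nae_satisfied(clauses, assignment):
    """True iff every clause has at least one true and one false literal.

    A literal is a nonzero int: +i stands for X_i, -i for its complement.
    `assignment` maps each variable index i to True or False.
    """
    for clause in clauses:
        values = [assignment[abs(lit)] == (lit > 0) for lit in clause]
        if all(values) or not any(values):
            return False
    return True


def solve_nae3sat(clauses, v):
    """Brute force over all 2^v assignments; returns one solution or None."""
    for bits in product([False, True], repeat=v):
        assignment = {i + 1: bits[i] for i in range(v)}
        if nae_satisfied(clauses, assignment):
            return assignment
    return None
```

Note that the condition is symmetric: if an assignment is a solution, so is the assignment obtained by negating every variable.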

3.1.1 The Gadget

In its most general sense, the logic engine consists of a bar to which v pairs of rigid “flag poles” are attached. The flag poles are arranged in two lines, so that one pole from every pair belongs to the first line, and the second pole belongs to the second line. This construction has to be completely rigid except for the possibility of exchanging the two flag poles that form a pair (see Figure 3.1). Attached to the flag poles are small “flags” which can point in a direction parallel to the base. The positions of the flags will encode the NAE-3SAT clauses as described in Section 3.1.2. The poles are fixed to the base at a very small distance from each other, enforcing that two flags attached to neighboring poles at the same height overlap if they point towards each other. The task is then to arrange the flag poles and flags in such a way that no flags overlap.

Figure 3.1: The basic logic engine (base, flag poles X_i and X̄_i, flags)

3.1.2 The Encoding

The encoding works as follows: Each pair of flag poles represents one variable of the NAE-3SAT instance, where one pole (e.g. in the front line) stands for the true literal X_i, and the partner pole (in the back line) stands for its complement X̄_i. Each clause C_k (for k = 1, ..., c) is encoded by flags in a row of the same height, according to the following rules (for all i = 1, ..., v):

• If variable X_i is not contained in clause C_k (neither X_i nor X̄_i), a flag is positioned on both poles corresponding to the variable at height k.
• If X_i is contained in the clause, then a flag is attached to the pole corresponding to its complement X̄_i, and a space is left at height k on the pole belonging to X_i.
• If X̄_i is contained in the clause, a flag is attached to the pole corresponding to X_i, leaving a space at height k on the pole belonging to X̄_i.
• Finally, flags are added to the four poles at the ends of both lines, one on every height level, in order to ensure that the previous “encoding” flags must always stay between two poles.
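The first three rules above can be sketched as a small table-building routine. This is an illustrative sketch and not code from the thesis: literals are encoded as signed integers (+i for X_i, −i for its complement), height levels are 0-based list indices, the function name is hypothetical, and the extra flags on the four end poles (the fourth rule) are omitted. It also assumes that no clause contains both a variable and its complement.

```python
def flag_table(clauses, v):
    """Build the flag layout of the logic engine.

    flags[(i, True)][k]  is True iff the pole of X_i carries a flag at level k.
    flags[(i, False)][k] is True iff the pole of its complement carries one.
    """
    flags = {(i, pos): [] for i in range(1, v + 1) for pos in (True, False)}
    for clause in clauses:
        for i in range(1, v + 1):
            if i in clause:        # X_i in C_k: space on the X_i pole
                on_true, on_comp = False, True
            elif -i in clause:     # complement in C_k: space on its pole
                on_true, on_comp = True, False
            else:                  # variable absent: flags on both poles
                on_true, on_comp = True, True
            flags[(i, True)].append(on_true)
            flags[(i, False)].append(on_comp)
    return flags


flags = flag_table([(1, -2, 3)], v=4)
```

For the single clause (X_1, X̄_2, X_3) over four variables, exactly three of the eight pole positions at that level remain without a flag.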


Note that since there are only 3-element clauses, exactly three flags are left out on every height level, distributed among the front and the back line of poles.

Once the logic engine is built, poles can be flipped back and forth (exchanging them with their “partner pole”) and flags may be flipped to the left and right. The goal is to find a configuration in which no two flags overlap by facing each other at the same height from neighboring flag poles, i.e. a solution to the geometric problem.

If we interpret a variable as set to TRUE if the pole standing for its true literal is placed in the front line, and FALSE otherwise, then it is easy to see that finding a feasible configuration for the logic engine also means finding a valid truth assignment for the NAE-3SAT instance: between the v flag poles in each line, there are exactly v − 1 spaces for flags on each height level. As a consequence, on each line and level, at least one pole must have a flag left out. Otherwise, we would have to distribute v flags over v − 1 spaces, a contradiction. For the truth assignment, this means that at least one literal (X or X̄, whichever is contained in the clause) per level (= clause) must be set TRUE in order to have enough space in the front line, and at least one literal must be set FALSE in order to have enough room in the back line. Altogether, we obtain the following proposition.

3.1 Proposition: Each feasible layout of the logic engine defines a solution to the NAE-3SAT instance, and a satisfying assignment of the variables of the NAE-3SAT instance can easily be used to construct a feasible layout of the logic engine. Accordingly, either both problems can be solved (within approximately the same amount of time), or they are both infeasible. Thus, NP-completeness is established for the geometric problem if it lies in NP.

Proof: We have already argued that one problem is solvable if and only if the other is. It is easy to see that from the solution of either problem instance, we can construct a solution to the other in polynomial time: if we have a solution to the geometric problem, we set variables TRUE iff the corresponding poles are positioned in the front row to get a feasible truth assignment. In the other direction, assume that we have a feasible variable assignment for the NAE-3SAT instance. If we order the flag poles accordingly, we know that there is enough space in each row to fit in all flags. To find a feasible flag arrangement, begin from the leftmost pole in each row and height level, and direct flags to the left when possible. Obviously, this requires only polynomial time.

As for the last statement, [13] proves that the NAE-3SAT problem is NP-complete. We have given a polynomial-time reduction; under the assumption that the geometric problem lies in NP, NP-completeness follows. □
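The counting argument can be checked mechanically. A space on a height level sits on the pole of each literal contained in the clause, and that pole stands in the front line exactly when the literal is true under the interpretation above. The sketch below counts the spaces per line and tests the "at least one space per line and level" condition; it is an illustration, not the construction of the thesis. Literals are signed integers (+i for X_i, −i for its complement), and the function names are hypothetical.

```python
def gaps_per_line(clause, assignment):
    """Return (front_gaps, back_gaps) on the height level of `clause`.

    The pole of a literal stands in the front line iff the literal is true,
    and the clause leaves a space exactly on the poles of its literals.
    """
    front = sum(1 for lit in clause if assignment[abs(lit)] == (lit > 0))
    return front, len(clause) - front


def layout_has_room(clauses, assignment):
    """Counting condition: at least one space per line on every level."""
    return all(min(gaps_per_line(c, assignment)) >= 1 for c in clauses)


assignment = {1: True, 2: True, 3: False}
```

With this assignment, the clause (X_1, X_2, X_3) leaves two spaces in the front line and one in the back line, whereas (X_1, X_2, X̄_3) leaves none in the back line, which mirrors the fact that all three of its literals are true.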

3.2 Application of the Logic Engine

In this section, we adopt the concept of the logic engine for proving NP-completeness of the Simultaneous Scheduling, Binding and Routing problem. In this case, the geometric problem considered is the Directed Subgraph Homeomorphism Problem with embedding into the regular graph H. Therefore, we will need to construct a data-flow graph G that has, when embedded, the characteristics of the logic engine defined above:

• A rigid long base,
• Pairs of flag poles that are also rigid except for the possibility to exchange back and forward poles (pairwise), and
• Flags whose only valid arrangement is to either the left or the right side of a pole.

In order to avoid too much freedom for the embedding, we will construct G in such a way that every node lies on a longest path. We will then attempt to embed the graph in a grid of the same depth, so that it is clear beforehand to which time layer each node will be bound. Let us assume for now that the other two dimensions of H are large enough for the embedding of any data-flow graph we discuss here. Despite these simplifications, it is still NP-complete to decide whether an embedding of G is possible!

First of all, we need some kind of basic module that always takes the same shape when embedded into the graph H, like the one shown in Figure 3.2.

Figure 3.2: A rigid basic module – the data-flow graph Q

3.2 Definition: The data-flow graph Q consists of two layers of four nodes each, labeled e.g. A, B, C, D and A′, B′, C′, D′, respectively, and the directed edges

$$E_Q = \{(A, A'), (A, B'), (A, D'), (B, A'), (B, B'), (B, C'), (C, B'), (C, C'), (C, D'), (D, A'), (D, C'), (D, D')\}.$$

In the sequel, we will call the data-flow graph Q or one of its embeddings a “box”. Define the nodes labeled {A, B, C, D} as belonging to layer 1, and those labeled {A′, B′, C′, D′} as belonging to layer 2. Furthermore, say that a node Y is a (direct) successor of another node X if there is a directed edge (X, Y) between them. X is then said to be a (direct) predecessor of Y.

3.3 Lemma: If the “box” graph Q has to be implemented in two time steps, Q can only be embedded in one single way, apart from permutation of nodes. If the embedding of the nodes in layer 1 is fixed, the assignment for all nodes is thereby determined.

Figure 3.3: An embedding of the “box” graph Q

Proof: According to the definition of Q, every pair {X1, X2} of nodes in layer 1 has two common successors in layer 2. Since Q is to be embedded in two time steps, there is no possibility of routing over a distance greater than 1. Thus, there must be two PEs at Manhattan distance ≤ 1 from both X1 and X2. From this it follows that (i) X1 and X2 have to lie at a distance ≤ 2 from each other and (ii) they cannot lie at the endpoints of an axis-parallel line of length two (else there would only be one common PE at distance ≤ 1 from each). Thus, the nodes of layer 1 can pairwise only lie in two relative positions to one another: (i) in horizontally or vertically neighboring PEs or (ii) in diagonally adjacent PEs at Manhattan distance 2.

Establishing this for all pairs of nodes in layer 1, four PEs with the relative coordinates (n, m), (n, m + 1), (n + 1, m), (n + 1, m + 1) (the corners of a square with side length 1) must be used: Since there are no four PEs that all lie at a distance of 1 from each other, at least two nodes have to be embedded on diagonally adjacent PEs at a distance of 2. When this is done, the only two PEs that lie in one of the two admissible relative positions to both of them are the remaining two corners of the square indicated by the first two nodes.

Then, the position of each node in layer 2 is defined exactly by its three predecessors, on the same PE square in the next time level. Every set S of three nodes in layer 1 builds an “L” and has exactly one common reachable PE in the angle of the L. This PE is the only possible position for the node that has the set S as its predecessors. □

3.4 Remark: The host graph H into which we embed is highly regular – if we assume it to be large enough in all dimensions, it does not change locally by reflecting at a vertical axis-parallel plane or by rotating around a vertical axis by 90°. Therefore, it is always possible to embed a graph in several ways simply by reflecting, rotating or shifting the embedding. We will regard the additional embeddings hereby obtained as trivial extra embeddings, and therefore call an embedding unique if it is fixed up to these. This means it is fixed as soon as at least two nodes on the same height level are fixed to two nodes in H. W.l.o.g., we will look at our gadget and its components from a specific side, namely the one that we see when the base is spread in front of us from left to right. This makes it easier to describe relative positions.
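Lemma 3.3 can also be checked by exhaustive search on a small window of the grid. The following Python sketch assumes, as in the proof, that a node can only be reached from a predecessor on a PE at Manhattan distance ≤ 1 and that nodes of the same layer occupy distinct PEs. The window size of 5 × 5 and all function names are illustrative choices; the window is large enough because feasible partners of A must lie within distance 2 of it.

```python
from itertools import permutations, product

# Edge set E_Q of Definition 3.2; primed nodes carry a trailing apostrophe.
E_Q = [("A", "A'"), ("A", "B'"), ("A", "D'"), ("B", "A'"), ("B", "B'"), ("B", "C'"),
       ("C", "B'"), ("C", "C'"), ("C", "D'"), ("D", "A'"), ("D", "C'"), ("D", "D'")]
LAYER2 = ["A'", "B'", "C'", "D'"]
PREDS = {s: [x for (x, y) in E_Q if y == s] for s in LAYER2}


def reach(pe):
    """PEs at Manhattan distance <= 1 from pe (assumed one-step reachability)."""
    x, y = pe
    return {(x, y), (x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)}


def layer2_embeddings(pos):
    """All injective placements of the layer-2 nodes for fixed layer-1 positions."""
    candidates = [set.intersection(*(reach(pos[p]) for p in PREDS[s])) for s in LAYER2]
    return [dict(zip(LAYER2, choice)) for choice in product(*candidates)
            if len(set(choice)) == len(LAYER2)]


def is_unit_square(pes):
    """True iff the four PEs are the corners of a square with side length 1."""
    xs = [x for x, _ in pes]
    ys = [y for _, y in pes]
    return len(set(pes)) == 4 and max(xs) - min(xs) == 1 and max(ys) - min(ys) == 1


def feasible_layer1_placements(size=5):
    """Fix A in the center of a size x size window and try all positions of B, C, D."""
    center = (size // 2, size // 2)
    others = [(x, y) for x in range(size) for y in range(size) if (x, y) != center]
    found = {}
    for b, c, d in permutations(others, 3):
        pos = {"A": center, "B": b, "C": c, "D": d}
        embeddings = layer2_embeddings(pos)
        if embeddings:
            found[(center, b, c, d)] = embeddings
    return found


placements = feasible_layer1_placements()
print(len(placements))
print(all(is_unit_square(p) for p in placements))
print(all(len(e) == 1 for e in placements.values()))
```

Every placement found this way uses the corners of a unit square, and each of them admits exactly one placement of the layer-2 nodes; the different placements only differ by the choice of the square around A and by the permutation of B, C and D on it.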

3.2.1 Base

Let v be the number of variables in the NAE-3SAT instance. We build the base of our logic engine by joining a row of 3v − 2 “boxes” one next to another, so that the right side of the first box is at the same time the left side of box number two, and so on (see Figure 3.4).

3.5 Lemma: Let b ∈ N. If the “base graph” B = B_b = (V_B, E_B) with

$$V_B = \{1, 2, \dots, 2b+2\} \cup \{1', 2', \dots, (2b+2)'\}$$

and

$$\begin{aligned}
E_B = {} & \{(1, 1'), (2, 2'), \dots, (2b+2, (2b+2)')\} \\
& \cup \{(1, 3'), (2, 4'), (3, 5'), \dots, (2b, (2b+2)')\} \\
& \cup \{(3, 1'), (4, 2'), (5, 3'), \dots, (2b+2, (2b)')\} \\
& \cup \{(1, 2'), (3, 4'), (5, 6'), \dots, (2b+1, (2b+2)')\} \\
& \cup \{(2, 1'), (4, 3'), (6, 5'), \dots, (2b+2, (2b+1)')\}
\end{aligned}$$

has to be implemented in two time steps, B can only be embedded in one single way, looking like a row of boxes. The embedding is unique even with regard to node permutation in the case of b ≥ 2, that is, when at least two boxes are concatenated.
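To make the edge set of Lemma 3.5 more tangible, the following Python sketch generates E_B for a given b and checks that each group of four consecutive index values induces a copy of the box graph Q. The representation of node i as (i, 0) and of node i′ as (i, 1), the relabeling A = 2k − 1, B = 2k, C = 2k + 2, D = 2k + 1 for the k-th box, and all function names are assumptions made for this illustration.

```python
# Edge set E_Q of the box graph Q (Definition 3.2), written without primes:
# (X, Y) stands for the edge from X in layer 1 to Y' in layer 2.
E_Q = {("A", "A"), ("A", "B"), ("A", "D"), ("B", "A"), ("B", "B"), ("B", "C"),
       ("C", "B"), ("C", "C"), ("C", "D"), ("D", "A"), ("D", "C"), ("D", "D")}


def base_graph(b):
    """Edge set E_B of the base graph B_b; node i is (i, 0), node i' is (i, 1)."""
    n = 2 * b + 2
    edges = set()
    edges |= {((i, 0), (i, 1)) for i in range(1, n + 1)}         # (i, i')
    edges |= {((i, 0), (i + 2, 1)) for i in range(1, n - 1)}     # (i, (i+2)'), i = 1..2b
    edges |= {((i, 0), (i - 2, 1)) for i in range(3, n + 1)}     # (i, (i-2)'), i = 3..2b+2
    edges |= {((i, 0), (i + 1, 1)) for i in range(1, n, 2)}      # odd i:  (i, (i+1)')
    edges |= {((i, 0), (i - 1, 1)) for i in range(2, n + 1, 2)}  # even i: (i, (i-1)')
    return edges


def induced_box(edges, k):
    """Edges of B_b between the eight nodes with indices 2k-1, ..., 2k+2."""
    indices = {2 * k - 1, 2 * k, 2 * k + 1, 2 * k + 2}
    return {(u, w) for (u, w) in edges if u[0] in indices and w[0] in indices}


def expected_box(k):
    """Copy of Q on the k-th box under the (assumed) relabeling given above."""
    name = {"A": 2 * k - 1, "B": 2 * k, "C": 2 * k + 2, "D": 2 * k + 1}
    return {((name[x], 0), (name[y], 1)) for (x, y) in E_Q}


v = 2                      # number of variables, chosen for illustration
b = 3 * v - 2              # number of base boxes used in Section 3.2.1
E_B = base_graph(b)
print(len(E_B) == 8 * b + 4)
print(all(induced_box(E_B, k) == expected_box(k) for k in range(1, b + 1)))
```

Consecutive boxes share the four nodes with indices 2k + 1 and 2k + 2, which is what the inductive step of the proof exploits.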

Figure 3.4: The base of the logic engine

Proof: The proof is by induction over the number b of boxes. First, let b = 1. Then the statement follows from Lemma 3.3. See Figure 3.3 to check that an embedding with nodes 3 and 4 on two neighboring PEs and nodes 3′ and 4′ on the same PEs one time step later is possible. This is necessary for the second box to be attached, because we know from Lemma 3.3 that the base vertices of the second box have to lie on a square of side length 1 (which may not be the same square as for the first box, of course).

Now assume that the lemma is true for some fixed b ≥ 1. Then the last box b has the four end nodes 2b + 1, 2b + 2, (2b + 1)′ and (2b + 2)′, which are bound to some coordinates (x, y, z), (x, y + 1, z), (x, y, z + 1) and (x, y + 1, z + 1), respectively. Going from b to b + 1, the subgraph of B induced by the “end” nodes listed above together with the new nodes and edges builds a box. Since four of the box nodes are already fixed, particularly in a feasible way such that the box can be completed, there is exactly one embedding for the last four nodes, that is, on the coordinates (x + 1, y, z), (x + 1, y + 1, z), (x + 1, y, z + 1), (x + 1, y + 1, z + 1). □

3.6 Definition: We define the data-flow graph F consisting of four nodes labeled A, B, A′, B′, and the directed edges E_F = {(A, A′), (A, B′), (B, A′), (B, B′)}. We call F or one of its embeddings a back-to-front or left-to-right “flip”, depending on its orientation in the embedding.

Figure 3.5: The “flip” graph F

3.7 Lemma: Let F be a flip graph as defined above, and let the vertices on the lower layer A, B be embedded on two neighboring PEs. Then, if F has to be implemented in two time steps, the nodes A′ and B ′ have to use the same two PEs in the following time step, with arbitrary distribution among the two.


Proof: The nodes of H lying directly on top of A and B are the only ones reachable from both A and B in one step. Since both A′ and B′ depend on both nodes in the first layer, the allocation is arbitrary between the two. □

3.8 Remark: We call a flip turned “off” if nodes labeled with the same literal are placed on the same PE, otherwise it is turned “on”.
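The two states of a flip from Lemma 3.7 and Remark 3.8 can be enumerated directly. The sketch below assumes one-step reachability within Manhattan distance ≤ 1, as in the proof of Lemma 3.3, and reads “labeled with the same literal” as A and A′ (or B and B′) sharing a PE; this reading and the function names are assumptions.

```python
from itertools import permutations


def reach(pe):
    """PEs at Manhattan distance <= 1 from pe (assumed one-step reachability)."""
    x, y = pe
    return {(x, y), (x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)}


def flip_embeddings(pe_a, pe_b):
    """All placements of A' and B' on distinct PEs reachable from both A and B."""
    common = reach(pe_a) & reach(pe_b)
    return [{"A'": a2, "B'": b2} for a2, b2 in permutations(sorted(common), 2)]


def flip_state(pe_a, embedding):
    """'off' if A' sits on the PE of A (and hence B' on the PE of B), else 'on'."""
    return "off" if embedding["A'"] == pe_a else "on"


pe_a, pe_b = (0, 0), (1, 0)          # two neighboring PEs, chosen for illustration
states = [flip_state(pe_a, e) for e in flip_embeddings(pe_a, pe_b)]
print(states)
```

For two neighboring PEs, the only commonly reachable PEs are the two PEs themselves, so exactly the two states “off” and “on” remain.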

3.2.2 Poles

The poles to both sides of the base are constructed in the following way:

• First, we attach boxes to the top of the 1st, 4th, 7th, . . . , and (3v − 2)nd base boxes as foundations for flag pole pairs. We add edges from the free upper nodes of the base to the top of the boxes to both sides of them (see Figure 3.7) so that each node lies on a longest path.
• Second, we extend every foundation box with two back-to-front flips, one placed above its left side and one above its right side.
• On top of this “double flip”, we place another box with a line of three boxes on its top and connection edges as above.

Figure 3.6 illustrates the foundation of a pole pair on an extract of the base. From now on, boxes will be represented by opaque cubes for better visibility.

Figure 3.6: An extract of the base with the foundation of a pole pair

3.9 Lemma: Assume that the data-flow graph O outlined in Figure 3.6 has to be embedded into H in six time steps, and that the embedding of the nodes in the first two layers is given. Then there are exactly two distinct feasible node assignments for O, and one can be obtained from the other by reflecting its upper three layers at the plane parallel to the x-z-plane that passes through the center of the two flips.

Proof: Since the bottom layer of the foundation box is fixed to the base, it follows from Lemma 3.3 that there is only one feasible embedding for it. The additional connection edges lie between fixed nodes, so they are fixed as well.

Note that the parts of the data-flow graph O above and below the double flip are identical except for their time orientation. Because the grid H does not change by inverting its edge orientations, we can conclude that the upper part of the graph is as rigid as the lower one, for which we have proven this above. The only flexible parts of the graph are the two back-to-front flips in the middle. Since A′ and B′ have to lie next to each other as well as A and B because of the rigid structures to which they belong, either both flips have to be turned on or both have to be turned off. The latter yields an embedding as in Figure 3.6, the former reflects the upper half as described in the lemma. □

While the double flips underneath allow the exchange of two coupled poles, the three lined-up boxes in each foundation serve to hold the pairs of flag poles. On every first and last box of such a line, we attach a flag pole consisting of a pile of boxes interspersed with a number of double flips. For the encoding of the given NAE-3SAT instance, as described in Section 3.1.2, we need one flag level for each clause. Thus, we build each pole by piling up c identical modules which can each hold one encoding flag. The resulting gadget including the base and flag poles for an NAE-3SAT instance with two clauses on four variables can be seen in Figure 3.7.

Figure 3.7: The gadget without flags

It is clear from the construction of the flag poles that their nodes all have to lie on the same PE square, since they only consist of piled up boxes and double flips. Therefore, the condition of rigidness of the flag poles is fulfilled.


3.2.3 Flags

Each flag pole consists of c modules that hold either no flag, one flag, or two flags (one outer flag and one encoding flag). The modules are defined as follows:

3.10 Definition: We define the data-flow graph P_f with f = 0 as the graph consisting of twenty nodes labeled e.g. A, B, C, D, A′, B′, C′, D′, . . . , A^IV, B^IV, C^IV, D^IV where

• {A, B, A′, B′} and {C, D, C′, D′} build two left-to-right flips and
• {A′, B′, C′, D′, A′′, B′′, C′′, D′′}, {A′′, B′′, C′′, D′′, A′′′, B′′′, C′′′, D′′′}, and {A′′′, B′′′, C′′′, D′′′, A^IV, B^IV, C^IV, D^IV} build three boxes.

For f = 1 we add four additional nodes {E′′, F′′, E′′′, F′′′} and edges so that {A′′, D′′, E′′, F′′, A′′′, D′′′, E′′′, F′′′} build a box, plus the connection edges {(A′, F′′), (D′, E′′), (F′′′, A^IV), (E′′′, D^IV)}.

For f = 2 we add to that four nodes {G′′, H′′, G′′′, H′′′} and edges so that {B′′, G′′, H′′, C′′, B′′′, G′′′, H′′′, C′′′} build a box, plus the connection edges {(B′, G′′), (C′, H′′), (G′′′, B^IV), (H′′′, C^IV)}.

The resulting “pole parts” P_f with no flag (f = 0), one flag (f = 1), or two flags (f = 2) can be seen in Figure 3.8.

Figure 3.8: The pole parts P_f with no flag, one flag, or two flags

3.11 Lemma: Assume that the data-flow graph P_f is to be embedded into H in 5 steps, and the root vertices A, B, C, D are fixed to four PEs building a square (n, m), (n, m + 1), (n + 1, m), (n + 1, m + 1) in a certain time layer, with A and B in the front row, that is, having y-coordinate m. Then, the time layer for the embedding of each node is determined, A^IV, B^IV, C^IV, D^IV have to use the same PE square, and A^IV and B^IV will also be in the front row. In case of f ≥ 1, there are exactly two feasible embeddings for the E and F nodes, either to the PEs with coordinates (n − 1, m), (n − 1, m + 1) or to (n + 2, m), (n + 2, m + 1). For f = 2, the G and H nodes must use the other two PEs, respectively.

Proof: All nodes of P_f lie on a longest path of length 5, and we have only 5 time layers to use, so the scheduling of all nodes is fixed. Because in all steps, both nodes named A and B (with any number of bars) depend on both A and B nodes in the layer before, these two are forced to always share the same two PEs. The same is true for the C and D nodes. Now, however the two flips of P_f are set, there is at most one feasible embedding for the remaining part of P_f, because it consists merely of boxes whose layer one nodes are inductively fixed, plus possibly boxes that are merged to their sides. In order for the additional “side” boxes to be embedded, it is necessary that the A and D nodes of the last 4 layers (and therefore the B and C nodes as well) lie next to each other. By turning the flips on and off, this can be established with A and D to either the left or the right side of the square, but not to the front or back, because of the assumption forcing the A and B nodes to both stay in the front row. Thus, whatever the starting configuration of A, B, C and D is, there are always two (or even four in case of f = 0) possible embeddings which are determined by the setting of the flips. □

Thus, each flag can be directed individually to point to the left or right, as claimed. Now that we have all the necessary modules, we can compose a logic engine G for every NAE-3SAT instance with c clauses on v variables. We simply merge a “base” graph B_{3v−2} with v flag pole foundations O and 2v flag poles each consisting of c pole parts P_f. The pole parts are distributed according to the rules described in Section 3.1.2, with the number of flags defining the index f of any P_f allocated on a specific flag pole and height level. An embedding of the resulting gadget for the instance (X_1 ∨ X_2 ∨ X_4) ∧ (X̄_1 ∨ X̄_3 ∨ X_4) is shown in Figure 3.9.

Figure 3.9: The complete gadget corresponding to the NAE-3SAT instance (X_1 ∨ X_2 ∨ X_4) ∧ (X̄_1 ∨ X̄_3 ∨ X_4)


3.12 Theorem: The Directed Subgraph Homeomorphism Problem remains NP-complete if we restrict the host graph to a regular grid H as defined in Section 2.3.

Proof: First, we show that the problem belongs to the class NP. To verify a solution, we only need to check for every edge (A, B) of the pattern graph that its image in H is a path from the image of A to that of B, without intersections between different paths except possibly at the starting and endpoints. This can easily be done by a polynomial-time algorithm.

From the preceding lemmas it follows that G has the required structure of a logic engine as described in Section 3.1: The base is completely rigid within itself, as well as the foundations of the flag pole pairs, except for their required ability to exchange forward and backward poles – for each pair independently. Lemma 3.11 proves that flags can only be situated between neighboring poles in a flag pole row, except for the “outer” flags, and that it is always possible for a flag to be pointed to either side, if there is no other flag in the way: Since the flag poles only have one row of PEs between them, it is impossible to embed two flags between the same two poles on the same height level. Thus, Proposition 3.1 can be applied and we have NP-completeness. □

3.13 Remark: We have proven that the Simultaneous Scheduling, Binding and Routing problem is NP-complete even if we do not allow routings between non-neighboring PEs, i.e. if we forbid the storing of results.
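The membership check from the first part of the proof of Theorem 3.12 might look as follows in Python. Since the definition of H from Section 2.3 is not repeated here, the sketch assumes that H contains an edge from (x, y, t) to (x2, y2, t + 1) whenever the two PEs are identical or neighboring; the function `is_host_edge` encodes exactly this assumption and would have to be replaced for a different host graph. All names are hypothetical.

```python
from typing import Dict, Hashable, List, Tuple

Cell = Tuple[int, int, int]              # (x, y, time layer)
PatternEdge = Tuple[Hashable, Hashable]


def is_host_edge(u: Cell, w: Cell) -> bool:
    """ASSUMPTION about H: one time step forward, Manhattan distance <= 1 in the plane."""
    return w[2] == u[2] + 1 and abs(u[0] - w[0]) + abs(u[1] - w[1]) <= 1


def verify_embedding(pattern_edges: List[PatternEdge],
                     node_image: Dict[Hashable, Cell],
                     edge_path: Dict[PatternEdge, List[Cell]]) -> bool:
    """Check a claimed embedding in time linear in the total path length."""
    images = list(node_image.values())
    if len(set(images)) != len(images):
        return False                                   # two pattern nodes share a cell
    node_cells = set(images)
    inner_cells = set()                                # interior cells of all paths so far
    for a, b in pattern_edges:
        path = edge_path.get((a, b))
        if not path or path[0] != node_image[a] or path[-1] != node_image[b]:
            return False                               # path missing or wrong endpoints
        if any(not is_host_edge(u, w) for u, w in zip(path, path[1:])):
            return False                               # path leaves the host graph
        for cell in path[1:-1]:
            if cell in node_cells or cell in inner_cells:
                return False                           # paths intersect away from endpoints
            inner_cells.add(cell)
    return True


# Usage: the flip graph F of Definition 3.6 on two neighboring PEs.
flip_edges = [("A", "A'"), ("A", "B'"), ("B", "A'"), ("B", "B'")]
image = {"A": (0, 0, 0), "B": (1, 0, 0), "A'": (0, 0, 1), "B'": (1, 0, 1)}
paths = {(s, t): [image[s], image[t]] for s, t in flip_edges}
print(verify_embedding(flip_edges, image, paths))
```

Each path cell is inspected a constant number of times, so the running time is polynomial in the size of the certificate, as required for membership in NP.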

Chapter 4

Special Cases of Data-Flow Graphs

After having proved NP-completeness for the general Simultaneous Scheduling, Binding and Routing problem, we now have a look at some special cases of data-flow graphs. We show in Section 4.1 that the problem does not become significantly easier if we restrict the data-flow graphs to the class of trees. In fact, it stays NP-complete even then. In Section 4.2, we point out some easy cases for which a polynomial-time algorithm can be found.

4.1 NP-Completeness for Trees

Even though the class of trees is a much smaller and more manageable class of graphs, we prove in this section that the SSBR problem stays NP-complete if we only allow data-flow graphs that are directed trees. One could imagine that the routing problem becomes easier if we do not have any (undirected) cycles. A possible strategy for this case could be to route operations as far away from each other as possible, since paths that have the same source never need to be rejoined later on. However, the following results show that there is still no polynomial-time algorithm unless P = NP. Our proof uses the same technique as the general proof demonstrated in Chapter 3. Again, we build a logic engine with a rigid base, pairs of “flag poles” and “flags” that may point to the left or right. Still, the construction is markedly different. This time, the poles do not “stand” in a vertical direction, i.e. along the time axis, but “lie” stretched out in the plane. A time depth is only obtained from the construction depth that is needed to hold the different parts together.

4.1.1 Flags

The first data-flow graph that we define is the basic flag pole part that can hold zero, one, or two flags.

4.1 Definition: Define the data-flow graph L as the graph shown in Figure 4.1. We will also call L or one of its embeddings a “lozenge”.

Figure 4.1: The data-flow graph L

4.2 Lemma: Assume that the data-flow graph L has to be embedded into the grid H in four steps. Then, the usage of nodes in H is determined for all fourth-layer nodes of L by the placement of its root r. The fourth layer of the embedding then looks like a lozenge whose center is the PE to which r is bound.

Proof: L contains 25 nodes in layer 4, which are all connected to its root r by a path of length 3. Thus, these nodes can only be placed on PEs at a distance of ≤ 3 from the PE r is bound to. Since there are exactly 25 PEs that fulfill this condition, all of them have to be used. See Figure 4.2 for a feasible embedding. □
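The count of 25 PEs used in the proof of Lemma 4.2 is easy to reproduce. The short Python sketch below lists all PEs within Manhattan distance r of the origin and compares their number with the closed form 2r² + 2r + 1; the function name is a hypothetical choice.

```python
def pes_within(radius):
    """All PEs (x, y) at Manhattan distance <= radius from the PE (0, 0)."""
    return [(x, y)
            for x in range(-radius, radius + 1)
            for y in range(-radius, radius + 1)
            if abs(x) + abs(y) <= radius]


lozenge = pes_within(3)
print(len(lozenge))                      # number of PEs reachable within three steps
print(all(len(pes_within(r)) == 2 * r * r + 2 * r + 1 for r in range(10)))
```

For r = 3 the lozenge contains 1 + 4 + 8 + 12 = 25 PEs, so the 25 fourth-layer nodes of L leave no PE of the lozenge unused.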

Manhattan distance