Vertex Ordering Algorithms for Automatic Differentiation of Computer Codes

Emmanuel M. Tadjouddine
University of Aberdeen, King's College, Department of Computing Science, Aberdeen AB24 3UE, Scotland, UK
Email: [email protected]

In the context of Automatic Differentiation of functions represented by computer code via the vertex elimination approach first advocated by Griewank and Reese (Automatic Differentiation of Algorithms, SIAM, 1991, pp. 126-135), we present two approximate algorithms based on the linearised computational graph of the input code. The first is a statement reordering algorithm aiming to tune the AD-generated code so as to maximise its performance on modern superscalar processors. The second aims to detect interface contractions, introduced by Bischof and Haghighat (Computational Differentiation: Techniques, Applications and Tools, SIAM, 1996, pp. 83-94), in order to enable exploitation of the structure of the input code in the differentiation process. Performance data is also presented.

Keywords: Automatic Differentiation, vertex elimination, code reordering, interface contraction, graph partitioning.

Received April 2, 2007; revised November 29, 2007
1. INTRODUCTION
Many scientific applications require the first derivatives (at least) of a function $f : x \in \mathbb{R}^n \mapsto y \in \mathbb{R}^m$ represented by computer code. Finite-differencing is a popular way of calculating derivatives. It is easy to implement but incurs truncation errors that may affect the robustness of algorithms that use derivatives. An alternative is to use the algorithms and tools of Automatic Differentiation (AD) [1, 2], by which derivatives of a function represented by a computer program are computed without the truncation errors associated with finite-differencing. This technology has been applied to many engineering applications, see for example [3, 4, 5]. AD can be viewed as a program transformation in which the original code's statements that calculate real valued variables are augmented with additional statements to calculate their derivatives. Typically, the original program is broken into a code list, a sequence of elementary operations that define variables $v_i$ of the program in terms of previously defined $v_j$'s and elementary functions $\phi_i$ of the programming language:
$$v_i = \phi_i\bigl(\{v_j\}_{j \prec i}\bigr), \qquad i = 1, \ldots, p + m. \qquad (1)$$
Following Griewank [2], the $n$ inputs $x = (x_1, \ldots, x_n)$ are renamed as $v_{1-n}, \ldots, v_0$, while $v_{p+1}, \ldots, v_{p+m}$ are renamed to be the output $y = (y_1, \ldots, y_m)$. The notation $j \prec i$ means $v_i$ depends on $v_j$. Assuming the $\phi_i$ have first derivatives, differentiating (1) gives the linear relations
$$dv_i = \sum_{j \prec i} c_{ij}\, dv_j, \qquad i = 1, \ldots, p + m, \qquad (2)$$
where $c_{ij} = \partial\phi_i/\partial v_j$, between the differential (or gradient) variables $dv_i = \bigl(\frac{\partial v_i}{\partial x_1}, \ldots, \frac{\partial v_i}{\partial x_n}\bigr)$. Computing the derivative $\nabla f$ can be carried out using an elimination algorithm, which consists in eliminating the intermediate $dv_i$ ($i = 1, \ldots, p$) from (2) to express the output differentials $dy_i = dv_{p+i}$ in terms of the input ones $dx_j = dv_{j-n}$:
$$dy_i = \sum_{j=1}^{n} J_{ij}\, dx_j, \qquad i = 1, \ldots, m. \qquad (3)$$
Then $J = (J_{ij})$ is the required $m \times n$ Jacobian matrix $J = \nabla f = \partial y/\partial x$. Moreover, Automatic Differentiation mainly provides two standard algorithms for calculating derivatives: the forward mode and the reverse mode. The forward mode propagates directional derivatives along the control flow of the program. The cost (in terms of floating point operations) of evaluating $J$ is bounded from above by a small constant factor $c$ as follows, see [2] for more details:
$$\mathrm{Cost}(J) \le c \cdot n \cdot \mathrm{Cost}(f) \qquad (4)$$
The reverse mode first computes the function, then calculates the sensitivities of the output variables with respect to the intermediate and input variables in the reverse order to their calculation in the function. The sensitivities of the outputs to the input variables give the desired derivatives. Likewise, the cost of calculating $J$ is bounded from above as follows:
$$\mathrm{Cost}(J) \le c \cdot m \cdot \mathrm{Cost}(f) \qquad (5)$$
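To make the two modes concrete, the following minimal sketch (ours, not the paper's tooling; Python is used purely for illustration) propagates a directional derivative through the small code list that reappears in Figure 2 of Section 3.2. Evaluating it for each of the n unit directions builds the Jacobian column by column, which is the behaviour whose cost is bounded by (4); the reverse mode would instead propagate adjoint sensitivities backwards, one output at a time, leading to the bound (5).

```python
import math

# Forward-mode sketch on a small code list: every statement is paired with a
# statement for its directional derivative, so one call returns f(x) and J*xdot.
def f_with_dot(x, xdot):
    v1m, v0 = x          # v_{-1} = x1, v_0 = x2
    dv1m, dv0 = xdot
    v1 = math.sin(v0);   dv1 = math.cos(v0) * dv0
    v2 = v1 - v1m;       dv2 = dv1 - dv1m
    v3 = v1m * v2;       dv3 = dv1m * v2 + v1m * dv2
    v4 = math.sqrt(v1);  dv4 = dv1 / (2.0 * math.sqrt(v1))
    return (v3, v4), (dv3, dv4)

x = (1.0, 2.0)
jac_columns = [f_with_dot(x, e)[1] for e in ((1.0, 0.0), (0.0, 1.0))]  # n = 2 sweeps
```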
There is a close relationship between the elimination algorithm and the two AD standard algorithms [6]. The forward mode AD corresponds to taking the pivots in forward order and solving the system of equations (2) by forward substitution. Similarly, the reverse mode AD corresponds to choosing the pivots in reverse order and solving the transposed linear system using backward substitution. The main advantage of the elimination approach is that it permits us to exploit the sparsity of the computation. Denoting by n̂ < n [m̂ < m] the maximum number of nonzeros in any row [column] of J, the cost of calculating J is bounded above by c · n̂ · Cost(f) [c · m̂ · Cost(f)] by using a forward [reverse] pivot ordering [7]. One can mix the forward and reverse modes by looking for narrow interfaces and then partitioning the computational graph into subgraphs, each with a number of inputs [outputs] smaller than that of the overall code. This is called interface contraction [8]. The contributions of this paper are the following: i) The paper discusses the opportunities for integrating the statement reordering algorithm outlined in [9] into finding elimination sequences that reduce floating point operations (FLOPs) and memory traffic in order to compute the Jacobian of the function. ii) It presents a novel algorithm to find a minimum interface contraction enabling us to mix forward and reverse AD elimination. iii) Runtime results showed that Jacobian codes obtained by interface contraction and statement reordering are up to 11% faster than that of the nearest rival using the vertex elimination approach and up to 39 times faster than that produced by the conventional AD forward mode as implemented, for example, in Tapenade [10]. The remainder of this paper is organised as follows. Section 2 discusses the elimination approach to calculating Jacobians. Section 3 presents a heuristic to reorder statements of the AD-generated code so as to improve its performance and discusses its implications for producing cache-sensitive elimination orderings. Section 4 presents a scheme, called interface contraction, that enables the structure of the input code to be exploited via a 'divide and conquer' strategy. Section 5 discusses runtime results from three test cases. Finally, Section 6 concludes.

2. AUTOMATIC COMPUTATION OF JACOBIANS
Equations (2) form a sparse linear system of $p + m$ equations in $N = n+p+m$ unknowns with an $N \times N$ lower triangular matrix $C = (c_{i,j})_{1 \le i,j \le N}$, wherein the entries $c_{ij}$ are defined as
$$c_{i,j} = \begin{cases} \partial\phi_i/\partial v_j & n+1 \le i \le N \text{ and } 1 \le j \le N \\ 0 & 1 \le i \le n \text{ and } 1 \le j \le N. \end{cases}$$
Writing $dx_i = \bigl(\frac{\partial x_i}{\partial x_1}, \ldots, \frac{\partial x_i}{\partial x_n}\bigr)$, the linear system (2) can be rewritten in matrix form as
$$(C - I_N)\, dx = \begin{bmatrix} -I_n \\ 0_{(p+m) \times n} \end{bmatrix}, \qquad (6)$$
where $I_k$ denotes the $k \times k$ identity matrix. The lower triangular matrix $C - I_N$ is called the extended Jacobian. Adopting the standard notation [2, p. 161], the strictly lower triangular $N \times N$ matrix $C$ can be written in block form (with $L$ strictly lower triangular),
$$C = \begin{bmatrix} 0 & 0 & 0 \\ B & L & 0 \\ R & T & 0 \end{bmatrix}, \qquad (7)$$
where the block rows and columns correspond, in order, to the $n$ inputs, the $p$ intermediates and the $m$ outputs.
The Jacobian J of the function f is then the Schur complement of R in C−IN and can be calculated using some form of Gaussian elimination [2] in which the intermediate variables dv1, . . . , dvp are eliminated one-by-one in some arbitrary order, the pivot sequence. An alternative view of the Jacobian accumulation process is in terms of the computational graph G=(V, E) of the program, where V, the set of vertices, is the integers {1−n, . . . , p + m} and E, the set of edges, consists of those (j, i) for which j ≺ i. G is a DAG (Directed Acyclic Graph) whose edges correspond to the nonzero positions (i, j) of the matrix C. Labelling each edge (j, i) with its cij gives the linearised computational graph, which is an alternative representation of the matrix C. Gaussian elimination is then equivalent to vertex elimination on the linearised computational graph, as follows. For each intermediate vertex k, following a chosen pivot sequence, for each predecessor j ≺ k and each successor i ≻ k, one either adds the value of cik ckj to the label cij of edge (j, i) if this edge exists already, or (the fill-in case) creates a new edge (j, i) labelled with this value. Vertex k and all its incident edges are then deleted. At the end, the graph has become bipartite — its only edges join an input vertex to an output one, and the labels on them are the nonzeros of the Jacobian J. The matrix viewpoint can be found in [2, 6], the graph viewpoint in [2, 11], and the generalisation of vertex elimination into edge or face eliminations in [11, 12]. Assume the DAG of Figure 1a) represents the linearised computational graph of a function f such that y = f(x), x ∈ R^3 and y ∈ R^2. We wish to calculate the Jacobian ∇f using the DAG of Figure 1a).
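As a concrete illustration of this matrix viewpoint (a sketch with made-up block values, not taken from the paper), eliminating the intermediate rows of the extended Jacobian amounts to forming the Schur complement, which for the block form (7) works out to J = R + T (I − L)^{-1} B:

```python
import numpy as np

# Illustrative blocks for n = 3 inputs, p = 2 intermediates, m = 2 outputs.
n, p, m = 3, 2, 2
B = np.array([[1.0, 2.0, 0.0],
              [0.0, 0.5, 3.0]])      # d(intermediates)/d(inputs)
L = np.array([[0.0, 0.0],
              [4.0, 0.0]])           # intermediate-to-intermediate, strictly lower
R = np.zeros((m, n))                 # direct input-to-output edges (none here)
T = np.array([[2.0, 0.0],
              [0.0, 1.5]])           # d(outputs)/d(intermediates)

# Schur complement of the intermediate block: the m x n Jacobian.
J = R + T @ np.linalg.solve(np.eye(p) - L, B)
```

Vertex elimination performs the same computation incrementally on the linearised computational graph, one pivot at a time, which is how the sparsity of the computation can be exploited.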
FIGURE 1. An example of vertex elimination on a computational graph
[Figure 1 panels: a) the original computational graph with input vertices 1–3, intermediate vertices 4–6 and output vertices 7, 8; b) the graph after elimination of vertex 4; c) after elimination of vertex 5; d) after elimination of vertex 6, when only the bipartite edges labelled c71, c72, c73, c81, c82, c83 remain.]
2.1. Jacobian accumulation

The Jacobian ∇f = ∂(y1, y2)/∂(x1, x2, x3) can be calculated by eliminating intermediate vertices from the graph until it becomes bipartite, see [2, 7, 11]. For example vertex 4 may be eliminated by creating new edges c51 = c54 ∗ c41, c52 = c54 ∗ c42, c53 = c54 ∗ c43 and then deleting vertex 4 and all its adjacent edges. The elimination of vertex 4 took 3 multiplications. We might then eliminate vertex 5 and finally vertex 6; this is a forward elimination as illustrated in Figures 1b)–d). We eventually obtain the Jacobian entries c71, c72, c73, c81, c82, c83. The overall elimination takes 3+3+6=12 multiplications. However, if we choose to eliminate 5, 6, 4 in that order, the Jacobian accumulation will take 1+2+6=9 multiplications. Another elimination sequence may give a different multiplications count. In fact, the elimination sequence determines the number of multiplications required by the Jacobian accumulation.

2.2. Elimination sequences

There are as many vertex elimination sequences as there are permutations of the intermediate vertices. We choose a sequence using heuristics from sparse matrix technology aimed at reducing fill-in, since minimising fill-in is NP-complete [13]. Over the past four decades several heuristics aimed at producing low-fill orderings have been investigated. These algorithms have the desired effect of reducing the FLOPs count as well. The most widely used are nested dissection [14, 15] and minimum degree: the latter, originating with the Markowitz method [16], is for example studied in [17]. Nested dissection, first proposed in [15], is a recursive algorithm which starts by finding a separator. A separator is a set of vertices that, when removed, partition the graph into two or more components, each composed of vertices that represent almost a constant factor of the total number of vertices and whose elimination does not create fill-in in any of the other components. Then the vertices of each component are ordered, followed by the vertices in the separator. Unlike nested dissection, which examines the entire graph before reordering it, the minimum degree or Markowitz-like algorithms tend to perform local optimisations. At each elimination step, such a method selects a vertex with minimum cost or degree, eliminates it and looks for the next vertex with the smallest cost in the new graph. The Markowitz cost for a vertex v is defined as the product of the number of its predecessors (pred) and the number of its successors (succ). This cost,
$$\mathrm{mark}(v) = |\mathrm{pred}(v)| \cdot |\mathrm{succ}(v)|, \qquad (8)$$
represents the number of symbolic multiplications required to eliminate the intermediate vertex v. The number of symbolic multiplications N× required by the Jacobian accumulation is then
$$N_\times = \sum_{i=1}^{p} \mathrm{mark}(v_i). \qquad (9)$$
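As an illustration of the elimination rule described above and of the Markowitz cost (8), the following sketch (ours, in Python, not the Matlab tool of Section 5; the dict-of-dicts graph representation and the numeric edge labels are assumptions made for the example) eliminates intermediate vertices in a given pivot order and counts the multiplications of equation (9).

```python
# Edges are stored as pred[i][j] = cij, the label of edge (j, i); succ is the
# reverse adjacency. Eliminating vertex k applies cij += cik * ckj over all
# paths j -> k -> i, at a cost of |pred(k)| * |succ(k)| multiplications.

def eliminate(pred, succ, k):
    """Eliminate intermediate vertex k in place; return its Markowitz cost."""
    cost = len(pred[k]) * len(succ[k])
    for i in list(succ[k]):                 # successors of k
        for j, ckj in pred[k].items():      # predecessors of k
            pred[i][j] = pred[i].get(j, 0.0) + pred[i][k] * ckj  # fill-in if new
            succ[j].add(i)
        del pred[i][k]                      # drop edge (k, i)
    for j in pred[k]:
        succ[j].discard(k)
    del pred[k], succ[k]
    return cost

def accumulate_jacobian(pred, succ, pivot_sequence):
    """Eliminate the intermediates in the given order; return N_x of (9)."""
    return sum(eliminate(pred, succ, k) for k in pivot_sequence)

# The graph of Figure 1a) with arbitrary numeric labels on its edges.
pred = {1: {}, 2: {}, 3: {}, 4: {1: 0.5, 2: 1.2, 3: -0.7},
        5: {4: 2.0}, 6: {5: 0.3}, 7: {6: 1.1}, 8: {6: -0.4}}
succ = {1: {4}, 2: {4}, 3: {4}, 4: {5}, 5: {6}, 6: {7, 8}, 7: set(), 8: set()}
print(accumulate_jacobian(pred, succ, [5, 6, 4]))   # 1 + 2 + 6 = 9 multiplications
```

Running it with the pivot sequence 5, 6, 4 reproduces the 9 multiplications noted in Section 2.1, while the forward sequence 4, 5, 6 gives 12.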
To efficiently accumulate the Jacobian, we can decide to minimise the quantity N× even though this is not sufficient to guarantee runtime performance of the Jacobian code on modern superscalar machines. In [9], it is shown that memory accesses need to be taken into account to optimise runtime performance. This can be carried out using statement reordering techniques from software pipelining [18] so as to keep the processor's pipelines busy, see Section 3 and the reference [9] for detailed analysis. Note that although the optimal Jacobian accumulation problem is closely related to the optimal fill-in problem, we have observed that the two are different problems. To illustrate this observation, consider the graph of Figure 1a); if we add the transitive edges (1,5), (2,5), (3,5), (1,6), (2,6), (3,6), (1,7), (1,8), (2,7), (2,8), (3,7), (3,8), (4,7), and (4,8) into it, a forward elimination will give 18 multiplications and 18 additions (a total of 36 FLOPs) with a zero fill-in whereas the elimination sequence 5, 6, 4 will give 18 multiplications and 17 additions (a total of 35 FLOPs) while creating the fill-in c64. This means a minimum fill-in does not necessarily imply minimum FLOPs. Moreover, the complex interdependence of the set of variables within, for instance, the nonlinear core of Partial Differential Equations (PDE) solvers (e.g.,
finite volume flux calculations) combined with the speed of modern floating-point processors indicates that throughput is governed more by data movement than the number of FLOPs, see [6, 19]. Current heuristics [15, 16, 17, 20] aimed at reducing fill-in have the desired effects of reducing the number of FLOPs as well. To increase the efficiency of AD by the elimination approach, we need to reduce the memory traffic by making efficient use of the processor's registers and cache.

3. CACHE SENSITIVE ELIMINATION ORDERINGS
Cache sensitive elimination orderings are here viewed as orderings that reduce not only the number of FLOPs but the amount of memory accesses as well. The idea is to devise a Markowitz-like algorithm by adding to the Markowitz cost of equation (8), a quantity that will account for the memory traffic. In this section, we propose to choose that quantity to be a ranking function based on the assumption that operations with more successors and which are located in a longer path should be eliminated first. The reason is that such operations are likely to execute with a minimum delay and they affect more future operations in the evaluation process. A detailed study of this code reordering scheme presented in this section can be found in [9]. In previous studies [9, 19], we have basically used fillreducing elimination heuristics followed by statement reordering schemes. A closer look at the assembler from the AD generated codes showed that codes with the lowest FLOPs count were not necessarily the fastest. Moreover, reordering statements sometimes yielded improved runtime. A Statement Reordering Algorithm (SRA) based on depth first traversal of the Data Dependence Graph (DDG) [21] of the computer code has been studied in [6, 19]. The idea was that, for each statement s, the SRA tries to place the statements on which s depends, close to s. It was hoped this would speed up the code by letting the compiler use the registers better since, in our test cases, cache misses were shown not to be a problem [19]. The benefits were inconsistent, probably because this does not account for the instruction level parallelism of modern cache-based machines and the latencies of certain instructions. We are looking for statement orderings that encourage the compiler to exploit instruction level parallelism and use registers better, by a SRA that gives priority to certain statements via a ranking function. The compiler’s instruction scheduling and register allocation work on a dependency graph G′′ whose vertices are machine code instructions. We have no knowledge of the graph G′′ , so we work on the derivative code’s DDG G′ . This relies on the premiss that our ‘preliminary’ optimisation will help the compiler generate optimised machine code. In the next sections, we shall make use of the instruction
scheduling approach used, for instance, in [22], and the ranking function ideas of [23, 24, 25, 26] to reorder the statements of the AD-generated code.

3.1. The derivative code and its evaluation
The derivative code includes original statements vi of f's code as well as statements to compute the local derivatives cij that are the nonzeros of the extended Jacobian C before elimination. But the bulk of it is elimination statements. As originally defined in [7], these take the basic forms cij := cik ckj or cij := cij + cik ckj, and typically a cij position is 'hit' (updated) more than once, needing non-trivial renaming of variables to put it into the needed single-assignment form. This means every variable is assigned exactly once. Furthermore, the resulting code is unnecessarily long and can be rewritten in a compact way. Though not strictly necessary, we assume the vertex elimination process has been rewritten in a single-assignment form, e.g. by the LU-based approach of [27]. That is, that each cij occurring in it is given its final value in a single statement, of the form either $c_{ij} := c^0_{ij} + \sum_{k \in K} c_{ik} c_{kj}$ if updating an original elementary derivative, or $c_{ij} := \sum_{k \in K} c_{ik} c_{kj}$ if creating fill-in. Here K is a set of indices related to i, j and to the elimination order used, and c0ij stands for an expression that computes the elementary derivative, ∂φi/∂vj. The result is that the derivative code is automatically in single-assignment form. Its graph G′ = (V′, ≺′) — where V′ is the set of statements and ≺′ the data dependence relationship — is a DAG. A schedule π for G′ assigns to each statement (denoted s in this section) a start-time in clock-cycle units, respecting data dependencies. It is viewed as a one-to-one mapping of V′ to the integers {0, 1, 2, ...}. Write t(s) for the execution time of s. Then to respect dependencies it is necessary and sufficient that
$$s_1 \prec s_2 \;\Rightarrow\; \pi(s_1) + t(s_1) \le \pi(s_2) \qquad (10)$$
for s1 and s2 in V′. The completion time of π is
$$T(\pi) = \max_{s \in V'} \{\pi(s) + t(s)\}. \qquad (11)$$
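For concreteness, conditions (10) and (11) can be checked mechanically. The helper below is a hypothetical utility (not part of the paper's tooling), where preds gives the DDG predecessors of each statement and t its execution time:

```python
def completion_time(pi, t, preds):
    """Assert the dependence constraint (10) and return T(pi) as defined in (11)."""
    assert all(pi[w] + t[w] <= pi[s] for s in preds for w in preds[s])
    return max(pi[s] + t[s] for s in pi)
```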
Our aim is to find π to minimise the quantity T(π) subject to (10). This optimisation problem is NP-hard [22, 28].

3.2. The DDG of the derivative code
The classical way of constructing the derivative code’s DDG would be to parse the code, build up an abstract representation and deduce the dependencies between all statements, see for instance [21, 28]. Since derivative code is generated from function-code, its DDG can be constructed more easily by using data that is available during code generation. We omit details here. Consider
the code fragment and its computational graph of Figure 2. Its computational graph can be augmented as represented in Figure 3 with the extra edges produced by eliminating the intermediates v1 and v2 in that order.

FIGURE 2. A code fragment and its computational graph

v−1 = x1
v0 = x2
v1 = sin(v0)
v2 = v1 − v−1
v3 = v−1 v2
v4 = √v1

[Graph of Figure 2: vertices s−1, s0 (inputs), s1, s2 (intermediates), s3, s4 (outputs) with edge labels c10, c21, c2,−1, c32, c3,−1, c41.]

FIGURE 3. Graph augmentation process: eliminated vertices are kept and fill-in is represented by dashed arrows

[Graph of Figure 3: the graph of Figure 2 with the additional fill-in edges labelled c20, c30, c40.]

Denoting by F the set of newly created edges (fill-in) from the DAG G = (V, E) and P the set of paths joining i and j, an edge (j, i) of the above augmented graph is labelled with a partial derivative using the chain rule as follows:
$$c_{ij} = \begin{cases} c^0_{ij} + \sum_{p \in P} \prod_{(l,k) \in p} c_{lk} & \text{if } (j, i) \in E \\ \sum_{p \in P} \prod_{(l,k) \in p} c_{lk} & \text{if } (j, i) \in F \end{cases} \qquad (12)$$

FIGURE 4. The derivative code from the DAG of Figure 3

v−1 = x1
v0 = x2
c21 = 1
c2,−1 = −1
v1 = sin(v0)
c10 = cos(v0)
v2 = v1 − v−1
c41 = 1/(2√v1)
v3 = v−1 v2
c32 = v−1
v4 = √v1
c20 = c21 c10
c3,−1 = v2 + c32 c2,−1
c30 = c32 c20
c40 = c41 c10

Figure 4 shows derivative code from the input code shown on the left of Figure 2, with a somewhat arbitrary order of the statements respecting dependencies. Note one statement, the computation of c3,−1, that combines computation of an elementary derivative with elimination. There are different ways of ordering the statements of the derivative code of Figure 4 including depth-first or breadth-first traversals. In Section 3.3, we propose a statement ordering that combines the depth-first traversal property and the instruction level parallelism of pipelined processors via a ranking function.

3.3. A statement reordering algorithm
Our statement reordering is adapted from greedy list scheduling algorithms as investigated in [25, 29]. We first preprocess the DDG G′ to compute a ranking function that defines the relative priority of each vertex. Then, a modification of the labelling algorithm of [29, 30] is used to iteratively schedule the vertices of the graph G′. Our ranking function uses the assumption that operations with more successors and which are located in a longer path should be given priority. Such operations are likely to execute with a minimum delay and to affect more operations in the rest of the schedule. We use the functions level(v) and depth(v), defined to be the length of the longest path from v to an input (minimal) vertex and to an output (maximal) vertex respectively. The procedure level(v) is defined by

1. for each input (minimal) vertex v, level(v) := 0;
2. for each other v ∈ V′, level(v) := 1 + max{level(w), for all w ≺ v}.

depth(v) is defined in the obvious dual way. For a vertex v ∈ V′ we define the ranking function by
$$\mathrm{rank}(v) = a \cdot \mathrm{depth}(v) + b \cdot \mathrm{succ}(v), \qquad (13)$$
where succ(v) is the number of successors of v and a, b are weights chosen on the basis of experiment. For b = 0, we recover the SRA using depth-first traversal as in [6, 19]. By combining the values depth(v) and succ(v), we aim to trade off between exploiting instruction level parallelism of modern processors and minimising register pressure. The preprocessing phase of our algorithm is as follows.

1. Compute the levels and depths of the vertices of G′.
2. Compute the ranks of the vertices as in equation (13).
The iterative phase of the algorithm schedules the vertices of G′ in decreasing order of rank. It constructs the mapping π defined in Section 3.1 by combining the rank of a vertex and its readiness using the following rule:

Rule 1. A vertex v is ready to be scheduled if it has no predecessor or if all its predecessors have already completed.

This ensures data on which the vertex v depends is available when v is scheduled. Ties between vertices are broken using the following rule:

Rule 2.
• Among vertices of the same rank choose those with the minimum level.
• Among those of the same level, pick the first.

The core of the scheduling procedure is as follows:

1. Schedule first an input vertex v with the highest rank (break ties using Rule 2). That is, set time τ := 0 and π(v) := τ.
2. For τ > 0, let v be the last vertex that was scheduled at times < τ.
   (a) Extract from the set of so far unscheduled vertices the set A as follows: S(v) := {w : w ≻ v}. If S(v) is nonempty, set B(v) := {u : level(u) < max{level(w) for w ∈ S(v)}} and A := S(v) ∪ B(v); otherwise A := the set of remaining vertices.
   (b) Extract from A the set of vertices R that are ready to be scheduled.
   (c) If R is empty, do nothing. Otherwise, choose from R a vertex w with maximum rank (break ties by Rule 2), and set π(w) := τ.
   (d) Set τ := τ + 1.
3. Repeat step 2 until all vertices are scheduled.
We can easily check that this algorithm determines a statement ordering π that satisfies (10), thus preserving the data dependencies between vertices of the graph. In [9], we gave a small example wherein this greedy list scheduling algorithm takes the optimal number of cycles to complete whereas the depth-first traversal algorithm takes two extra cycles to complete on an idealised processor. The complexity of this labelling algorithm, which is similar to that of [29, 30] for a DAG with k vertices and e edges, was initially proved to be O(k 2 ) [29, 30] and can be implemented in O(k + e) as shown in [31].
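A compact sketch of the preprocessing and iterative phases described above is given below (ours, not the authors' implementation; it assumes unit execution times and simplifies step 2(a) by choosing among all ready vertices rather than restricting attention to the successors of the last scheduled vertex):

```python
# The DDG is assumed to be given as preds[v] / succs[v] adjacencies together
# with a topological order of its vertices.

def levels_and_depths(preds, succs, topo_order):
    level = {v: 0 for v in topo_order}      # longest path from an input vertex
    depth = {v: 0 for v in topo_order}      # longest path to an output vertex
    for v in topo_order:
        if preds[v]:
            level[v] = 1 + max(level[w] for w in preds[v])
    for v in reversed(topo_order):
        if succs[v]:
            depth[v] = 1 + max(depth[w] for w in succs[v])
    return level, depth

def schedule(preds, succs, topo_order, a=1.0, b=1.0):
    level, depth = levels_and_depths(preds, succs, topo_order)
    rank = {v: a * depth[v] + b * len(succs[v]) for v in topo_order}  # equation (13)
    unscheduled, done, pi, tau = set(topo_order), set(), {}, 0
    while unscheduled:
        # Rule 1: ready vertices have all their predecessors completed.
        ready = [v for v in unscheduled if all(w in done for w in preds[v])]
        if ready:
            # Highest rank first; ties broken by the lowest level (Rule 2).
            v = max(ready, key=lambda u: (rank[u], -level[u]))
            pi[v] = tau
            done.add(v)                      # unit execution time assumed
            unscheduled.remove(v)
        tau += 1
    return pi
```

In the paper's formulation, setting b = 0 recovers the depth-first SRA of [6, 19], as noted above.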
Note that this reordering scheme can be used for any numerical code for which we can build its data dependency graph. In the specific case of AD vertex elimination algorithms, we argue that we can find elimination sequences based on a Markowitz-style algorithm that attempts to minimise the Markowitz cost of equation (8) and to maximise the ranking function of equation (13) in an incomplete data dependence graph. The incomplete DDG is set to the input code's computational graph G at the beginning and is augmented at each elimination step. The cost of eliminating a vertex v can be formulated as follows, wherein α, β, γ are nonnegative weights chosen by experiment:
$$\mathrm{Cost}(v) = \alpha \cdot \mathrm{rank}(v) - \bigl(\beta \cdot \mathrm{depth}(v) + \gamma \cdot \mathrm{succ}(v)\bigr) \qquad (14)$$
Note this is an alternative to finding an elimination sequence based solely on minimising the FLOPs count followed by a statement reordering algorithm. The idea is to produce elimination sequences that trade off between minimising the FLOPs count and the number of memory accesses in a single objective. To further improve the AD-generated code, we can use insights into the input code in the differentiation process. Insights may be known a priori or, in some cases, they may be detected by analysis of the input code. An example is interface contraction [8], which illustrates a divide and conquer strategy resulting in improved performance of the generated code.

4. INTERFACE CONTRACTION
Interface contraction also termed as vertex cut [32] was introduced in [8] and consists in looking for narrow interfaces and then partitioning the computational graph into subgraphs with a number of inputs [outputs] smaller than that of the overall code’s. It is seen as a way of exploiting the program structure in order to generate efficient derivative code. Equations (4) and (5) show that generally, it is better to use the forward mode AD when the number of inputs n is smaller than the number of outputs m and the reverse mode otherwise. In other words, a forward elimination is better for a computational graph or subgraphs of the computational graph with a number of input vertices smaller than that of output vertices. Exploiting interface contraction provides a hierarchical differentiation process and can lead to a substantial runtime reduction of the derivative code [8]. To date, AD tools do not provide an automated way of finding an interface contraction. In [33], a graph drawing system is used to recognise an interface contraction of a computational graph whilst stressing the benefits of identifying and exploiting such bottlenecks in the derivative calculation. The results of [8] have been achieved by manually transforming the code. Here, we devise an approximate algorithm to automatically identify an interface contraction and then partition
the computational graph so as to hierarchically pre-accumulate local Jacobians and combine them for the overall Jacobian evaluation, see for example [34].

4.1. A motivating example
To motivate the benefits of exploiting interface contraction, consider the graph example given in [8] and shown in Figure 5. Eliminating the five intermediate vertices of the graph G of Figure 5 in forward or reverse order gives 2+2+6+6+6 = 22 multiplications for the Jacobian accumulation. If we use the elimination sequence 6, 4, 5, 7, 8, then the Jacobian accumulation takes 24 multiplications. However, if we can identify the separator S = {6}, we can then partition G into two subgraphs G1 = (V1 , E1 ), G2 = (V2 , E2 ) with V1 = {1, 2, 3, 4, 5} and V2 = {7, 8, 9, 10, 11}. This partition allows us to eliminate the intermediates 4, 5 in G1 , the intermediates 7, 8 in G2 and eventually the vertex 6 in the separator. This elimination process computes the Jacobian in 17 multiplications. FIGURE 5. A DAG exhibiting an interface contraction
[Figure 5: a DAG with input vertices 1, 2, 3, intermediate vertices 4–8 and output vertices 9, 10, 11; vertex 6 is the only vertex joining the lower subgraph {1, ..., 5} to the upper subgraph {7, ..., 11}.]
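Reusing the eliminate/accumulate_jacobian sketch of Section 2, the multiplication counts quoted above can be reproduced on the graph of Figure 5. The edge set below is inferred from the worked example and the stated counts; in particular, the exact connections of vertices 7 and 8 to the output vertices are an assumption consistent with those counts.

```python
# Assumed edges of Figure 5: inputs 1-3, intermediates 4-8, outputs 9-11.
edges = [(1, 4), (2, 4), (2, 5), (3, 5), (4, 6), (5, 6),
         (6, 7), (6, 8), (7, 9), (7, 10), (8, 10), (8, 11)]

def build(edges, vertices):
    pred = {v: {} for v in vertices}
    succ = {v: set() for v in vertices}
    for j, i in edges:
        pred[i][j] = 1.0                 # dummy edge labels
        succ[j].add(i)
    return pred, succ

V = range(1, 12)
print(accumulate_jacobian(*build(edges, V), [4, 5, 6, 7, 8]))  # forward order: 22
print(accumulate_jacobian(*build(edges, V), [4, 5, 7, 8, 6]))  # partition first: 17
```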
This partition has worked just as the nested dissection algorithm [15]. However, interface contraction is not constrained by the need for a balanced separator as required by graph partitioning algorithms for parallel computations. The good news is that such partitioning algorithms for sparse linear algebra or parallel computations can be adapted to find near-optimal partitions for the interface contraction problem in order to efficiently accumulate Jacobians in the AD context.

4.2. Finding vertex separators
In graph terms, interface contraction seeks to partition the computational graph into subgraphs using separators. Some graphs have small vertex separators e.g., n-vertex trees have an optimal separator composed of the root of the tree. Others have larger vertex separators e.g., an n-vertex complete graph Kn has a minimal separator of size Ω(n). Minimal separators are important in parallel computations [35], sparse LU factorisation and many graph algorithms based on the divide and conquer paradigm [20, 36].
Generally, finding minimal separators is NP-hard [35, 20]. However, there are approximation algorithms that find vertex separators whose sizes are close to those of the optimal separators of the input graph [37]. Such approximation algorithms can be modified so as to produce imbalanced separators by specifying a percentage of imbalance between the partitions to be produced, see for example [38]. We devise an approximate algorithm to produce a 'best' separator enabling the interface contraction strategy. We define a best separator to be a separator composed of vertices that are mutually independent and therefore its size cannot be reduced any further. Two vertices v1 and v2 are mutually independent if there is no path connecting v1 and v2. In the literature, there are at least two types of algorithms that can find vertex separators:

• find a small edge-separator and then refine it to a vertex separator using a vertex cover scheme [37, 38, 36];
• directly build a vertex separator in the partitioning process [20].
Vertex separators are more useful for fill-reducing heuristics [39], to which interface contraction is related. Our approach is to directly compute an initial separator of the computational graph and then refine it using an iterative scheme. The initial separator is similar to that of [39] but uses a kind of breadth-first search algorithm [40] in a levelised directed graph G = (V, E) instead of the minimum degree algorithm used in [39]. It also uses the notion of neighbouring vertices N(v) for a given vertex v. A neighbouring vertex of a vertex v is a direct successor or predecessor of v. Our algorithm is flexible, allowing us to produce unbalanced separators for a given fraction ε. Denoting by C̄ = V − C the complement of C and by A the set of visited but untagged vertices, our procedure for the initial separator can be described as follows:

1. Compute the levels of the directed graph using the procedure outlined in Section 3.3, choose an input vertex xi and set A := ∅; C := ∅.
2. Update A := (A − {xi}) ∪ N(xi). Choose v ∈ A among the vertices of the lowest level and tag it. Repeat this process with the new vertex v so as to form a subgraph C of size max{|C| : |C| < ε|V|}.
3. Set the initial separator S := {u ∈ C : ∃v ∈ C̄ with (u, v) ∈ E}.
4. Set the initial partition P := C ∪ C̄.
Applied to the graph of Figure 5, with ǫ = 2/3, we start with empty sets C and A and a vertex at the lowest level, say 1. We tag and add it to C. We compute A = ∅ ∪ N (1) = {4}. We choose 4, tag and add it to C. Again, we compute A = ∅ ∪ N (4) = {2, 6}. We choose the vertex of lowest level 2, tag and add it to C. We update A = {6} ∪ N (2) = {5, 6}. We then
choose 5 since it has the lowest level and we continue the process. We stop when C = {1, 2, 3, 4, 5, 6, 7} since |C| < 2/3·|V| = 22/3. At the end, we have the initial partition C = {1, 2, 3, 4, 5, 6, 7} and C̄ = {8, 9, 10, 11} and the initial separator is S = {6, 7}. The initial separator is then iteratively refined to produce a best separator as well as a best partition by using the following procedure:

1. Set BestPartition := C ∪ C̄.
2. Set BestSeparator := S.
3. If BestSeparator is not minimal then,
   (a) Among vertices of BestSeparator, select a vertex v of highest level which is adjacent to one or more vertices of BestSeparator, since BestSeparator must eventually contain mutually independent vertices.
   (b) If v ∈ C then set C := C − {v} and C̄ := C̄ ∪ {v}; otherwise, set C := C ∪ {v} and C̄ := C̄ − {v}.
   (c) Set BestSeparator := S − {v}.
   (d) Set BestPartition := C ∪ C̄.
4. Repeat step 3 until BestSeparator is minimal.
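The two phases can be sketched as follows (an illustrative Python fragment, not the Matlab implementation of Section 5, using the same dict-based graph representation as the earlier sketches; levels are assumed to have been computed as in Section 3.3, and the separator produced by the first phase is assumed to lie inside C, so that refinement only ever moves vertices out of C):

```python
def initial_separator(preds, succs, level, start, eps):
    """Grow C from an input vertex, always tagging the untagged neighbour of
    lowest level, while |C| stays below eps * |V|; return C and its boundary S."""
    limit = eps * len(preds)
    C, A, v = set(), set(), start
    while len(C) + 1 < limit:
        C.add(v)
        A = (A - {v}) | ((set(preds[v]) | set(succs[v])) - C)   # neighbours N(v)
        if not A:
            break
        v = min(A, key=lambda u: level[u])
    S = {u for u in C if any(w not in C for w in succs[u])}      # boundary of C
    return C, S

def refine(preds, succs, level, C, S):
    """Shrink S by moving its highest-level vertex adjacent to another separator
    vertex across the partition, until S contains mutually independent vertices."""
    def adjacent_in_S(v):
        return any(w in S for w in set(preds[v]) | set(succs[v]))
    while any(adjacent_in_S(v) for v in S):
        v = max((u for u in S if adjacent_in_S(u)), key=lambda u: level[u])
        S.remove(v)
        C.discard(v)        # v joins the complement of C
    return C, S
```

On the graph of Figure 5 with ε = 2/3 and start vertex 1, this sketch reproduces the initial partition C = {1, ..., 7} with S = {6, 7} and the refined separator S = {6} of the worked example.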
Applied to the initial separator S = {6, 7} obtained previously, we choose 7 as it is adjacent to 6 and has the highest level. We remove it from the set C and the separator S, which becomes {6} and therefore minimal. At the end we obtain the best partition P = {1, 2, 3, 4, 5, 6} ∪ {7, 8, 9, 10, 11} and the best separator is therefore S = {6}. Our algorithm for computing a best separator allows us to partition medium sized computational graphs composed of some hundreds of vertices. However, the computational graph can be larger and even sparser for real-life applications. Such graphs can be partitioned using multilevel partitioning schemes [20, 35, 41] or recursive bisection [42]. For the interface contraction problem, we propose a simple recursive bisection type algorithm.

4.3. Recursive bisection scheme
A bisection of a graph G = (V, E) divides its vertices V into two disjoint subsets Vu, Vd of equal size. An approximate bisection relaxes the constraint of producing subsets of the same size. Our initial separator procedure is an approximate bisection. For a given percentage ε, our recursive bisection scheme can be described as follows:

1. Partition G into Vu, Vd.
2. If |Vu| > 2ε, partition the subgraph composed of vertices of Vu.
3. If |Vd| > 2ε, partition the subgraph composed of vertices of Vd.

TABLE 1. Platforms

a) Processors
Platform   Processor      CPU speed   L2-Cache
AMD        AMD Opteron    2.2 GHz     1 MB
Intel      Pentium D      2.99 GHz    1 MB

b) Compilers
Platform   OS             Compiler           Options
AMD        Linux/Centos   G95 version 0.91   -O -r8
Intel      Windows XP     G95 version 0.91   -O -r8
In this scheme, the partitioning is carried out using the approximate procedure outlined in Section 4.2. As shown in [42] this approximate recursive bisection algorithm can produce a p-partition of the input graph for a cost that is within a factor O(log p) of the cost of the optimal p-partition.
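A direct transcription of this scheme (with partition_fn standing in for the approximate bisection of Section 4.2, a hypothetical callable returning the two parts, and ε treated as an absolute size threshold as in the description above):

```python
def recursive_bisect(vertices, partition_fn, eps):
    """Recursively split any part larger than 2 * eps using approximate bisection."""
    Vu, Vd = partition_fn(vertices)
    parts = []
    for part in (Vu, Vd):
        if len(part) > 2 * eps:
            parts.extend(recursive_bisect(part, partition_fn, eps))
        else:
            parts.append(part)
    return parts
```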
5. EXPERIMENTS
We would like to compare the performance of Jacobian code produced by the interface contraction approach with that produced by other vertex elimination techniques, conventional AD tools such as Tapenade [10], and finite-differencing (FD). This is carried out for three test problems on two different platforms (processor + compiler) described in Table 1. The processors we used have a memory hierarchy with the time required for loading data into the limited arithmetic registers increasing from the level-1 cache to the main memory. Optimising compilers usually seek to maximise performance by minimising redundant calculations as well as the memory traffic between registers and cache. Memory traffic may be increased by stalls arising when an instruction calls for a value not yet loaded into registers, causing the processor to wait. The AMD and Intel processors in our study feature out-of-order execution, in which the processor maintains a queue of arithmetic operations so that if the one at the head of the queue stalls, it can switch to an operation in the queue that is able to execute. More details on such issues may be found in [18].

5.1. Implementation outline
A Jacobian code generation tool using the LU-based vertex elimination approach of [27] was written in Matlab. The tool takes as input the computational graph of the function f and a specified pivot ordering to produce the derivative code, see [27] for more details. The pivot ordering is obtained by finding a feasible vertex separator as described in Section 4,
which is smaller in size than the number of input and output vertices, then ordering intermediate vertices of the subgraph with the inputs [outputs] in reverse [forward] order respectively and finally adding the vertices in the separator. Other pivot orderings are: forward order, reverse order, the order given by the Markowitz algorithm, and Naumann's [11] VLR variant on Markowitz. Moreover, the tool performs certain optimisations, e.g., constant value propagation, simplifications avoiding expressions of the form cij − cij, cij/cij, and cij ∗ 0, and elimination of copies of the form cij = ckl or cij = −ckl using an alias table. This implementation has benefited from the support for sparse matrices within Matlab. It was intended for modest sized functions with computational graphs containing hundreds of vertices. To enable rapid and extensive testing, we wrote another Matlab tool that allows us to generate a "random function" f by selecting its numbers n, m and p of inputs, outputs and intermediate variables respectively, plus a random number seed for reproducibility purposes. Each code list statement does one arithmetic operation +, −, ∗, or /, selected using the Matlab random number generator with probabilities reflecting typical numerical code. The tool outputs (i) the computational graph for use by the other tool that generates Jacobian accumulation code, (ii) a Fortran file that computes just f, and (iii) a Fortran file that computes the local derivatives, into which Jacobian accumulation code is inserted to make a complete subroutine.

5.2. Tests
Using our Matlab tool, we generated three random codes which we used as our examples to compare the interface contraction scheme with other vertex elimination methods, the forward vector mode AD [10] as implemented in Tapenade as well as the one-sided finite-differencing.

RC1. The 50 × 50 Jacobian of a function randomly generated by our Matlab tool by setting n=50, m=50 and p=150. This has 394 nonzero entries out of a possible 2500. The corresponding linearised computational graph of the input is composed of 250 vertices and 399 edges. The calculated size of vertex separator is 34.

RC2. The 150 × 85 Jacobian of a function randomly generated by our Matlab tool by setting n=85, m=150 and p=800. This has 3275 nonzero entries out of a possible 12,750. The corresponding linearised computational graph of the input function is composed of 1035 vertices and 1898 edges. The size of the vertex separator is 14.

RC3. The 50 × 85 Jacobian of a function randomly generated by our Matlab tool by setting
n=85, m=50 and p=1500. This has 2560 nonzero entries out of a possible 4250. The corresponding linearised computational graph of the input function is composed of 1635 vertices and 3098 edges. The size of the vertex separator is 27. All these Jacobians are not particularly sparse but their calculations involve sparse computations since their linearised computational graphs are very sparse. For each technique, we give (i) W (∇f )/W (f ), the ratio of the nominal number of floating-point operations within the generated Jacobian code to those in the function code, (ii) the corresponding ratios of CPU times for each platform. The nominal works in floatingpoint operations W (f ) and W (∇f ) were obtained using a PERL script to count the number of times that *, + , - and / operations appear in the source code. Consequently, the operations counts W (∇(f )) and W (f ) are over-estimates of the number of floating-point operations performed since they do not take account of optimizations performed by the compiler. We assume the compiler performs constant value propagation and evaluation [18, p. 32] to avoid unnecessary arithmetic operations. For each test problem, a driver program was written to execute and time different Jacobian evaluation techniques. To check that the Jacobians were calculated correctly, we compared each to one produced by Tapenade’s forward vector mode. Each set of values for the input variables was generated by using the Fortran intrinsic routine random number. To improve consistency of timings, a number Nd of sets of random input data was generated for a given problem. The code for each technique was run Nd times and the total time was divided by Nd to give the CPU time estimate for that code. The whole process was repeated ten times to get runtime averages. For each table, the first two rows are for Tapenade’s forward vector mode and one-sided finite differencing. The remaining rows are for various versions of VE (Vertex Elimination) methods. For VE, the mnemonics F, R, Mark, VLR denote ways of choosing the pivot order: forward order, reverse order, the order given by the Markowitz algorithm, and Naumann’s [11] VLR variant on Markowitz. IC stands for interface contraction. To give an idea of the effect of a statement reordering (SR), we have reordered the Jacobian code obtained by interface contraction as follows. The linearised computational graph of the original code is split into three components C1 ; S; C2 related respectively to the subgraph containing the inputs, the separator, and the subgraph with the outputs. It is then eliminated so as to produce the derivative code in the order C1 ; dC1 ; S; C2 ; dC2 ; dS instead of C1 ; S; C2 ; dC1 ; dC2 ; dS used in the VE-IC method. In here, the notation d means ’derivative of’.
The Tapenade forward vector mode differentiates each line of the input code by propagating directional derivatives. It treats the jth column of the Jacobian J as the directional derivative (ej · ∇)f where ej is the jth unit vector, and computes these in a loop one column at a time. The unit vectors are the columns of the n × n identity matrix In, which is input to the process as a "seed matrix" (for a general n × q seed matrix S, the output is J ∗ S: this can be exploited, e.g., by Jacobian compression methods). The forward vector mode also uses optimization techniques such as common subexpression elimination. However, it may perform useless calculations such as multiplying by or adding to zero. Consequently, it does not account for the sparsity of the calculation. Results on FLOP and runtime ratios are shown in Table 2, Table 3, and Table 4. For completeness, Table 5 exhibits the statistics for the original codes considered so far.

TABLE 2. FLOPs and CPU time ratios on the RC1 problem

technique    W(∇f)/W(f)    CPU(∇f)/CPU(f)
                            AMD      Intel
TAPENADE       62.50        95.33    95.51
FD             51.75        68.47    72.02
VE-F            3.11        19.86    20.11
VE-R            2.83        20.93    20.15
VE-IC           2.80        18.85    18.93
VE-IC-SR        2.80        18.12    18.07
VE-Mark         2.89        19.76    20.08
VE-VLR          3.09        19.81    20.10

TABLE 3. FLOPs and CPU time ratios on the RC2 problem

technique    W(∇f)/W(f)    CPU(∇f)/CPU(f)
                            AMD      Intel
TAPENADE      129.20       465.89   255.15
FD             86.35       129.06   138.52
VE-F            6.55        75.17    51.65
VE-R            5.49        78.75    53.92
VE-IC           4.88        72.27    47.64
VE-IC-SR        4.88        70.42    46.75
VE-Mark         6.84        79.94    53.11
VE-VLR          6.92        78.64    52.65

TABLE 4. FLOPs and CPU time ratios on the RC3 problem

technique    W(∇f)/W(f)    CPU(∇f)/CPU(f)
                            AMD      Intel
TAPENADE      171.32       878.11   605.83
FD             86.22        77.47    93.45
VE-F            9.99        32.61    19.33
VE-R            4.77        27.73    18.58
VE-IC           3.99        23.27    17.19
VE-IC-SR        3.99        22.52    16.05
VE-Mark         8.08        25.45    18.16
VE-VLR          7.52        25.38    18.33

TABLE 5. Nominal floating-point operations counts W(f), number of lines of code (l.o.c.) and CPU times for test problem functions

Problem    W(f) (# FLOPs)    # l.o.c.    Function CPU time (µs)
                                          AMD     Intel
RC1            200              200       0.62    1.29
RC2            950              950       1.72    2.61
RC3           1550             1550       2.90    4.97
Our first observation is that runtime performance depends on the platform. We also observe that the VE-based techniques performed dramatically fewer FLOPs than Tapenade. They are consistently faster than finite-differencing which, it is well known, does not provide accurate results. The runtime ratios for Tapenade's forward vector mode are higher than may be expected. We think this is due to the fact that Tapenade's vector mode is applied to an input code, which is essentially a code list. Consequently, a lot of intermediate array variables representing derivative objects are created and each line of the input code is differentiated without accounting for sparsity that may arise. This can increase the memory requirement and affect the cache usage by the derivative code and in some cases, such as in the RC3 test wherein there are 1500 intermediate variables in the input code, the memory traffic can dominate the Jacobian computation. To back up this argument with some evidence, we rewrote the RC1 code so that each line of code contains at least two FLOPs, hence reducing the number of its intermediate variables from 150 to 80; then we differentiated it using Tapenade's forward vector mode. The size of the object file (as obtained by the Unix command size) was reduced from 28.32KB to 14.65KB. More importantly, the runtime ratio dropped from 95.33 in Table 2 to 60.12; this represents an improvement of about 37%. A deeper investigation of the performance of Tapenade's forward vector mode, using a performance monitor or inspecting the generated assembler code, is beyond the scope of this paper. Note that in general, assignments of a computer code contain on average three or four FLOPs and Tapenade's vector mode should show a better performance in comparison to that shown by the code-list programs in this work, see for example [43] wherein Tapenade is used to compute the gradient of an objective function in an optimisation problem. Furthermore, VE methods, in contrast to Tapenade's vector mode, are able to exploit the sparsity of the computation at compilation time. For the test cases we have considered, the best VE method produced a Jacobian code that is up to 39 times faster than that produced by Tapenade's vector mode. Also, the Jacobian codes obtained by interface contraction outperformed those produced by other VE methods. The method VE-IC-SR consistently improved the
performance of the VE-IC method. This is probably caused by improved cache utilisation that reduces the memory traffic, since the two methods produce codes with the same FLOPs count. The VE-IC-SR method produced codes that run up to 11% faster than those obtained by the nearest rival using the VE approach. This improvement can be explained by the fact that interface contraction exploits the graph structure in addition to exploiting the sparsity of the computation as the other VE methods do.

6. CONCLUDING REMARKS
In this paper, we have reviewed current graph techniques used for Automatic Differentiation of functions represented as computer programs with a focus on the application of vertex ordering algorithms. We have outlined two approximate algorithms designed to enhance the performance of the AD-generated code by vertex elimination: i) A reordering scheme to help compilers achieve a fair trade-off between instruction level parallelism and register usage. ii) An interface contraction algorithm that partitions the computational graph into components suitable for the forward or reverse AD algorithms, hence exploiting the structure of the input code to be differentiated. Runtime results showed Jacobian codes obtained by interface contraction and statement reordering are up to 11% faster compared to that of the nearest rival using the vertex elimination approach and up to 39 times faster compared to that produced by the conventional AD forward vector mode. We have also discussed the idea of finding elimination sequences that minimise the FLOPs count and the number of memory accesses in a single objective function in lieu of using fill-reducing heuristics and then reordering the AD-generated code's statements. Future work includes formalising this approach for a generic processor and evaluating it for different platforms. An interesting question is to empirically find out the parameters involved in setting the objective function so as to generate 'near-optimal' elimination sequences in order to produce even faster derivative codes.

ACKNOWLEDGEMENTS

This work was partly done when the author was supported by EPSRC under grant GR/R21882 at Cranfield University. The author would like to thank Dr Shaun A. Forth and Dr John D. Pryce for their contributions in the statement reordering scheme presented in this paper. The author is also grateful to the anonymous reviewers for comments on earlier drafts of this paper.

REFERENCES

[1] Rall, L. B. (1981) Automatic Differentiation: Techniques and Applications, Lecture Notes in Computer
Science, 120. Springer-Verlag, Berlin. [2] Griewank, A. (2000) Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation, Frontiers in Appl. Math., 19. SIAM, Philadelphia, PA. [3] Tadjouddine, M., Forth, S. A., and Qin, N. (2005) Elimination AD applied to Jacobian assembly for an implicit compressible CFD solver. Int. J. for Num. Methods in Fluids, 47, 1315 – 1321. [4] Bischof, C. H., B¨ ucker, H. M., Lang, B., and Rasch, A. (2003) Solving large-scale optimization problems with EFCOSS. Advances in Engineering Software, 34, 633– 639. [5] Hasco¨et, L., V´ azquez, M., and Dervieux, A. (2003) Automatic differentiation for optimum design, applied to sonic boom reduction. Proceedings of ICCSA’2003, Montreal, Canada, 18-21 May, Lecture Notes in Computer Science, 2668. Springer, Berlin. [6] Forth, S. A., Tadjouddine, M., Pryce, J. D., and Reid, J. K. (2004) Jacobian code generated by source transformation and vertex elimination can be as efficient as hand-coding. ACM Transactions on Mathematical Software, 30, 266–299. [7] Griewank, A. and Reese, S. (1991) On the calculation of Jacobian matrices by the Markowitz rule. In Griewank, A. and Corliss, G. F. (eds.), Automatic Differentiation of Algorithms: Theory, Implementation, and Application, pp. 126–135. SIAM, Philadelphia, Penn. [8] Bischof, C. H. and Haghighat, M. R. (1996) Hierarchical approaches to automatic differentiation. In Berz, M., Bischof, C., Corliss, G., and Griewank, A. (eds.), Computational Differentiation: Techniques, Applications, and Tools, pp. 83–94. SIAM, Philadelphia, Penn. [9] Tadjouddine, M., Bodman, F., Pryce, J. D., and Forth, S. A. (2005) Improving the performance of the vertex elimination algorithm for derivative calculation. In B¨ ucker, M., Corliss, G., Hovland, P., Naumann, U., and Norris, B. (eds.), Automatic Differentiation: Applications, Theory, and Implementations, Lecture Notes in Computational Science and Engineering, 50, pp. 111–120. Springer, Berlin, Germany. [10] TR 0300 (2004) TAPENADE 2.1 user’s guide. INRIA Sophia Antipolis. 2004, Route des Lucioles, 09902 Sophia Antipolis, France. [11] Naumann, U. (1999) Efficient Calculation of Jacobian Matrices by Optimized Application of the Chain Rule to Computational Graphs. PhD thesis Technical University of Dresden. [12] Naumann, U. (2004) Optimal accumulation of Jacobian matrices by elimination methods on the dual computational graph. Mathematical Programming, 99, 399–421. [13] Yannakakis, M. (1981) Computing the minimum fill-in is NP-complete. SIAM J. Alg. Disc. Meth., 2, 77–79. [14] Bornstein, C. F., Maggs, B. M., and Miller, G. L. (1999) Tradeoffs between parallelism and fill in nested dissection. SPAA ’99: Proceedings of the eleventh annual ACM symposium on Parallel algorithms and architectures, Saint Malo, France, 27-30 June, pp. 191– 200. ACM Press, New York. [15] George, J. and Liu, J. (1978) An automatic nested dissection algorithm for irregular finite element problems. SIAM Journal of Numerical Analysis, 15, 345–363.
[16] Markowitz, H. (1957) The elimination form of the inverse and its application. Management Science, 3, 257–269. [17] Amestoy, P., Davis, T., and Duff, I. (1996) An approximate minimum degree ordering algorithm. SIAM J. Matrix Anal. Applic., 17, 886–905. [18] Goedecker, S. and Hoisie, A. (2001) Performance Optimization of Numerically Intensive Codes. SIAM, Philadelphia. [19] Tadjouddine, M., Forth, S. A., Pryce, J. D., and Reid, J. K. (2002) Performance issues for vertex elimination methods in computing Jacobians using automatic differentiation. Proceedings of the Second International Conference on Computational Science, Amsterdam, The Netherlands, 21-24 April, LNCS, 2330, pp. 1077–1086. Springer, Berlin. [20] Gupta, A. (1997) Fast and effective algorithms for graph partitioning and sparse-matrix ordering. IBM J. Res. Dev., 41, 171–183. [21] Chapman, B. and Zima, H. (1991) Supercompilers for Parallel and Vector Computers. Addison-Wesley Publishing Company, Boston, USA. [22] TR 698 (1995) Combining Register Allocation and Instruction Scheduling: (Technical Summary). Courant Institute, New York University. New York, US. [23] GIT-CC-01-15 (2001) Cache Sensitive Instruction Scheduling. Center for Research in Embedded Systems and Technologies. Georgia Institute of Technology, Atlanta, Georgia. [24] Hennessy, J. and Gross, T. (1983) Postpass code optimization of pipeline constraints. ACM TOPLAS, 5, 422–448. [25] Palem, K. and Simons, B. (1993) Scheduling timecritical instructions on RISC machines. ACM TOPLAS, 15, 632–658. [26] Leung, A., Palem, K. V., and Ungureanu, C. (1997) Run-time versus compile-time instruction scheduling in superscalar (RISC) processors: Performance and tradeoff. Journal of Parallel and Distributed Computing, 45, 13–28. [27] AUCS/TR0702 (2007) Fast AD Jacobians by compact LU factorization. University of Aberdeen, Computing Sciences Departement. Aberdeen AB243UE, UK. [28] Muchnick, S. S. (1997) Advanced Compiler Design and Implementation. Morgan Kaufmann Publishers, San Francisco, USA. [29] Bernstein, D. and Gertner, I. (1989) Scheduling expressions on a pipelined processor with a maximum delay of one cycle. ACM TOPLAS, 11, 57–66. [30] Cofman, E. and Graham, R. (1972) Optimal scheduling for two-processor systems. Acta Informatica, 1, 200– 213. [31] Gabow, H. N. and Tarjan, R. E. (1983) A lineartime algorithm for a special case of disjoint set union. ACM Symposium on Theory of Computing (STOC ’83), Boston, MA, May 25-27, pp. 246–251. ACM Press, Baltimore, USA. [32] Iri, M. (1991) History of automatic differentiation and rounding estimation. In Griewank, A. and Corliss, G. F. (eds.), Automatic Differentiation of Algorithms: Theory, Implementation, and Application, pp. 1–16. SIAM, Philadelphia, Penn.
[33] B¨ ucker, H. M. (2005) Looking for narrow interfaces in automatic differentiation using graph drawing. Future Gener. Comput. Syst., 21, 1418–1425. [34] Tadjouddine, M., Forth, S. A., and Pryce, J. D. (2003) Hierarchical automatic differentiation by vertex elimination and source transformation. Proceedings of ICCSA 2003, Montreal, Canada, 18-21 May, Lecture Notes in Computer Science, 2668, pp. 115–124. Springer, Berlin. [35] Hendrickson, B. and Leland, R. (1995) A multilevel algorithm for partitioning graphs. Supercomputing ’95 proceedings of the 1995 ACM/IEEE Supercomputing Conference, San Diego, California, 3-8 December 28. ACM Press, New York. [36] RR-1264-01 (2001) SCOTCH 3.4 User’s guide. LaBRI, Universit´e de Bordeaux. 351 Cours de la Lib´eration, 33405 Talence, France. [37] Karypis, G. and Kumar, V. (1998) Multilevel k-way partitioning scheme for irregular graphs. Journal of Parallel and Distributed Computing, 48, 96–129. [38] UC-405 (1999) CHACO: Algorithms and Software for Partitioning Meshes. Computation, Computers, and Math Center, Sandia National Laboratories. Albuquerque, NM, US. [39] Liu, J. W. H. (1989) A graph partitioning algorithm by node separators. ACM Trans. Math. Softw., 15, 198– 219. [40] Knuth, D. E. (1997) The Art of Computer Programming, Volume 1: Fundamental Algorithms. AddisonWesley, Reading, Massachusetts. [41] Karypis, G. and Kumar, V. (1998) Multilevel algorithms for multi-constraint graph partitioning. Supercomputing ’98 proceedings of the 1998 ACM/ IEEE conference on Supercomputing, San Jose, CA, 7–13 November, pp. 1–13. IEEE Computer Society, Washington, DC. [42] Simon, H. D. and Teng, S.-H. (1997) How good is recursive bisection? SIAM J. on Scientific Computing, 18, 1436–1445. [43] Tadjouddine, M., Forth, S., and Qin, N. (2005) Automatic differentiation of a time-dependent CFD solver for optimisation of a synthetic jet. Proceedings of the Int. Conf. on Numerical Analysis and Applied Maths, Rhodes, Greece, 16-20 September, pp. 514–517. Wiley-VCH, Berlin.