Parallel and Distributed Vision Algorithms Using Dual Decomposition

Petter Strandmark, Fredrik Kahl, Thomas Schoenemann
Centre for Mathematical Sciences, Lund University, Sweden
Abstract

We investigate dual decomposition approaches for optimization problems arising in low-level vision. Dual decomposition can be used to parallelize existing algorithms, to reduce memory requirements and to obtain approximate solutions of hard problems. An extensive set of experiments is performed for a variety of application problems, including graph cut segmentation, curvature regularization and, more generally, the optimization of MRFs. We demonstrate that the technique can be useful on desktop computers, graphical processing units and supercomputer clusters. To facilitate further research, an implementation of the decomposition methods is made publicly available.

Keywords: graph cuts, dual decomposition, parallel, MRF, MPI, GPU

1. Introduction

Many problems in low-level computer vision can be formulated as labeling problems using Markov Random Fields (MRFs). Examples include image segmentation, image restoration, dense stereo estimation and shape estimation [1-5]. When the number of labels is equal to 2, the problem can sometimes be formulated as a maximum flow or a minimum cut problem [6]. For problems with three or more labels, α-expansion [7] and other approximation methods [8-10] are available.

Email addresses:
[email protected] (Petter Strandmark),
[email protected] (Fredrik Kahl),
[email protected] (Thomas Schoenemann) URL: http://maths.lth.se/matematiklth/personal/petter (Petter Strandmark)
Preprint submitted to CVIU April 14, 2010, revised May 20, 2011 and accepted June 3, 2011.
In this paper we will study how the technique of dual decomposition [11] applies to these and related optimization methods. We will see that it is a valuable tool for parallelizing existing algorithms so that they run faster on modern multi-core processors. We will also see that prohibitively large memory requirements can be made tractable on both desktop computers and supercomputer clusters. Examples of optimization problems with huge memory requirements are segmentation problems in three or more dimensions and curvature regularization problems. Finally, we will see that dual decomposition also provides an interesting way of implementing algorithms on parallel hardware, such as graphical processing units (GPUs).

1.1. Outline of the paper

In the next section, a discussion of related work is given, followed by a short introduction to dual decomposition. The subsequent three sections apply the decomposition technique to three different problem domains. An overview of these three main sections is given below.

Decomposition of graph cuts. The first contribution of this paper is a parallel implementation of the graph cuts method by Boykov and Kolmogorov (BK) introduced in [2]. The approach has a number of advantages compared to other parallelization methods:

1. The BK-method has been shown to have superior performance compared to competing methods for a number of applications [2], most notably sparse 2D graphs and moderately sized 3D graphs.

2. It is possible to reuse the search trees in the BK-method [12], which makes the dual decomposition approach attractive. Further, we show that the dual function can be optimized using integer arithmetic. This makes the dual optimization problem easier to solve.

3. Perhaps most importantly, we demonstrate good empirical performance with significant speed-ups compared to single-threaded computations, both on multi-core platforms and multi-computer networks.

Naturally, there are also some disadvantages:

1. There is no theoretical guarantee that the parallelization will be faster for every problem instance. Already for the BK-method no polynomial time guarantee is known, and we cannot give one for the number of iterations, either. In practice this matters little.
2. Our current implementation is only effective for graphs for which the BK-method is effective. The underlying dual decomposition principle can, however, be applied in combination with any graph cut algorithm.

The material in this section is based on our earlier work [13].

Decomposition of general MRFs. In Section 5 we broaden the perspective from submodular two-label problems to general optimization problems with unary and binary terms. That is, we now also address multi-label problems and allow arbitrary binary potentials. The computed solutions are now suboptimal and, in fact, the addressed optimization problems are typically NP-hard. The proposed dual decomposition scheme is based on subproblems that can be solved via dynamic programming. This allows massive parallelism, in particular GPU implementations. In contrast to previous approaches to solving graph labeling problems on a GPU, cf. Section 2.1, our method works for non-trivial amounts of regularization.

Decomposition of curvature linear programs. In the final part of the paper we discuss the problem of optimizing curvature regularity in region-based image segmentation. The recent work [14] has shown that this problem can be formulated as an integer linear program, and that its linear programming relaxation is very tight and sometimes even finds the global optimum. In this paper we show how the run-times and memory requirements of this method can be significantly reduced by applying a dual decomposition scheme.

2. Related work

2.1. Graph cuts

Our work builds on two trends: the ubiquity of maximum flow computations in computer vision and the tendency of modern microprocessor manufacturers to increase the number of cores in mass-market processors. Together they imply that an efficient way of parallelizing maximum flow algorithms would be of great use to the community. Due to a result of Goldschlager et al. [15], there is little hope of finding a general algorithm for parallel maximum flow with guaranteed performance gains. However, the graphs encountered in computer vision problems are often sparse, with far fewer edges than the maximum n² − n in a graph with n vertices. The susceptibility to parallelization depends on the structure and costs of the graph. There are essentially three types of approaches used in computer vision for solving the maximum flow/minimum cut problem:
Augmenting paths. The most popular method, due to its computational efficiency for 2D problems and moderately sized 3D problems with low connectivity (i.e., sparse graphs), is the BK-method using augmenting paths [2]. However, since augmenting path algorithms use non-local operations, they have not been considered viable candidates for parallelization. One way of making multiple threads cooperate is to divide the graph into disjoint parts. This is the approach taken by Liu et al. [16], in which the graph is split, solved and then split differently in an iterative fashion until no augmenting paths can be found. The key observation is that the search trees of the subgraphs can be merged relatively fast. The more recent work [17] splits the graph into many pieces which multiple threads solve and merge until only one piece remains and all augmenting paths have been found. In this paper the graph is also split into multiple pieces, but our approach differs in that we do not require a shared-memory model, which makes distributed computation possible.

Push-relabel. The push-relabel algorithm [18] is well suited for parallelization. The implementation by Delong and Boykov [19] has been tested on up to 8 processors with good results. There have been attempts to implement this method on a GPU, the latest being CUDA cuts [20, 21], but our tests of the (freely available) implementation only gave the correct result for graphs with low regularization. Another attempt is [22], which performs all experiments on generated images with a very low amount of regularization. Solving such graphs essentially reduces to trivial thresholding of the data term. The earliest reference we found was [23], which does not report any speed-up compared to sequential push-relabel.

Convex optimization. Another approach to parallel graph cuts is to formulate the problem as a linear program. Under the assumption that all edges are bidirectional, the problem can then be reformulated as an ℓ1 minimization problem. The work in [24] attempts to solve this problem with Newton iterations, using the conjugate gradient method with a suitable preconditioner. Matrix-vector multiplications can be highly parallelized, but this approach has not proven to be significantly faster than the single-threaded algorithm in [2] for any type of graph, even though [24] used a GPU in their implementation. Convex optimization on a GPU has also been used to solve continuous versions of graph cuts, e.g. [25]. However, the primary advantage of continuous cuts has been to reduce metrication errors due to discretization.
Graph cuts is also a popular method for multi-label problems using, e.g., iterated α-expansion moves. Such local optimization methods can naturally be parallelized by performing two different moves in parallel and then trying to fuse the solutions, as done in [26].

2.2. GPU optimization

Graphical processing units (GPUs) have been used extensively to accelerate the often very computationally expensive tasks in low-level vision. Segmentation with total variation models has successfully been implemented and can be performed in real time [27, 28]. Computing the optical flow between two images [29] and stereo estimation have also benefited greatly from the parallelization offered by a multi-processor GPU. Solving discrete labeling problems such as minimum cuts and its generalizations on a GPU has proven to be harder. Often the speed-up over algorithms that run on a CPU is not great, or the method works well only for a restricted amount of regularization. This work is a step in the direction of fast discrete energy minimization algorithms on a GPU.

3. Dual decomposition

We achieve parallelism by dual decomposition [11], a general technique in optimization for splitting a large problem into two or more smaller, more manageable ones. The technique of decomposing problems with dual variables was introduced by Everett [30] and it has been used in many different contexts, e.g. in control [31]. The application within computer vision that bears the most resemblance to this paper is the work of Komodakis et al. [8], where dual decomposition is used for computing approximate solutions to general MRF problems. Consider the following optimization problem:

    inf_{x∈X} E(x),                                                      (P)
where X and E are arbitrary. This problem might be hard to solve. Sometimes the function E can be split up into two parts: E(x) = E_1(x) + E_2(x), where the functions E_1 and E_2 are much easier to minimize. We will see later that this is the case with many energy functions associated with MRFs. Now we consider the equivalent optimization problem

    inf_{x,y∈X}  E_1(x) + E_2(y)   subject to  x = y.                    (1)
The dual function [11, 32] of this optimization problem is

    d(λ) = inf_{x,y∈X} ( E_1(x) + E_2(y) + λᵀ(x − y) )
         = inf_{x∈X} ( E_1(x) + λᵀx ) + inf_{y∈X} ( E_2(y) − λᵀy ).      (2)
From the last two rows we see that evaluating the dual function is equivalent to solving two independent minimization problems. If minimizing E_1 and E_2 is much easier than minimizing E, the dual function can be evaluated quite efficiently. Since the value of the dual function d(λ) is a lower bound on the solution to (P) for every λ, it is of great interest to compute the optimal such bound, i.e. to solve the dual problem:

    sup_λ d(λ).                                                          (D)
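As a toy illustration of evaluating d, the two independent minimizations in (2) can be carried out by brute force for a small binary problem. The example below is ours, not from the paper; the Python energies E1 and E2 (standing for E_1 and E_2) are made up:

    import itertools

    # Made-up energies over binary vectors x in {0,1}^3.
    def E1(x):
        return x[0] - 2.0 * x[1] + 0.5 * x[2] + 2.0 * abs(x[0] - x[1])

    def E2(x):
        return -x[0] + 1.5 * x[1] - 0.5 * x[2] + abs(x[1] - x[2])

    def dual(lam):
        # d(lam): two independent brute-force minimizations, cf. equation (2).
        points = list(itertools.product((0, 1), repeat=3))
        dot = lambda a, b: sum(ai * bi for ai, bi in zip(a, b))
        v1 = min(E1(x) + dot(lam, x) for x in points)
        v2 = min(E2(y) - dot(lam, y) for y in points)
        return v1 + v2

    # Every value of d is a lower bound on inf_x E1(x) + E2(x):
    primal = min(E1(x) + E2(x) for x in itertools.product((0, 1), repeat=3))
    assert dual((0.0, 0.0, 0.0)) <= primal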
The dual function is concave, since it is the infimum over a set of concave (affine) functions of λ. Furthermore, it is also easy to find a supergradient to d, as stated by the following lemma [11]:

Lemma 1. Given a λ_0, let x* be the optimal solution to d(λ_0) = min_x ( f(x) + λ_0ᵀ g(x) ). Then g(x*) is a supergradient to d at λ_0.

Proof. For any λ, we have

    d(λ) ≤ f(x*) + λᵀ g(x*)
         = f(x*) + λ_0ᵀ g(x*) + (λ − λ_0)ᵀ g(x*)
         = min_x ( f(x) + λ_0ᵀ g(x) ) + (λ − λ_0)ᵀ g(x*)                 (3)
         = d(λ_0) + (λ − λ_0)ᵀ g(x*),

which is the definition of a supergradient.

Maximizing the dual function d can then be done with subgradient methods as described in [11]. Given a λ, one takes a step in the direction of a supergradient ∇d(λ) = x − y.
    Start with λ = 0
    repeat
        Update x and y by evaluating d(λ)
        λ ← λ + τ(x − y) for some step size τ
    until x = y

Here, in each iteration k, a step size τ needs to be chosen. The simplest way of doing this while still ensuring convergence is to pick τ = T/k, with T a constant [11]. During the maximization of the dual function, we have access to a lower bound d(λ) on the optimal energy, and also to two feasible solutions x and y of the original problem. An upper bound on the optimal energy is then given by min(E(x), E(y)). We can compute the relative duality gap, which gives an indication of how close we are to the optimal energy:

    r = ( min(E(x), E(y)) − d(λ) ) / min(E(x), E(y)).                    (4)
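The scheme above can be stated compactly in code. The following Python sketch is our illustration; solve_E1 and solve_E2 are problem-specific stubs returning the minimizers of the two subproblems in (2) and are assumptions, not part of the paper:

    def dual_ascent(solve_E1, solve_E2, E, n, T=1.0, max_iter=1000):
        # Supergradient ascent on d with the diminishing step size tau = T / k.
        lam = [0.0] * n
        for k in range(1, max_iter + 1):
            x = solve_E1(lam)   # argmin_x E1(x) + <lam, x>
            y = solve_E2(lam)   # argmin_y E2(y) - <lam, y>
            if x == y:          # supergradient is zero: lam is optimal
                return x
            tau = T / k
            lam = [l + tau * (xi - yi) for l, xi, yi in zip(lam, x, y)]
        # No agreement within the budget: return the better feasible solution;
        # the relative gap r of (4) measures how far from optimal it can be.
        return min((x, y), key=E)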
If X is a convex set and E a convex function, the optimal duality gap is zero. However, in this paper we consider non-convex sets consisting of a finite set of points with integral coordinates. In our first application we are nonetheless guaranteed an optimal gap of zero, as it corresponds to solving an integer linear program whose linear programming relaxation represents the convex hull of the feasible integral solutions.

3.1. Decomposition into multiple parts

Naturally, dual decomposition can be used to decompose E into more than two parts:

    E(x) = E_1(x^(1)) + . . . + E_m(x^(m)).                              (5)

The decomposed problem analogous to (1) then becomes a minimization of the sum of the m functions, subject to constraints x^(i) = x^(j) for all i, j. Each of these constraints is associated with a dual variable.

4. Decomposition of graph cuts

Our first application of dual decomposition is a parallel algorithm for computing the minimum cut in a grid graph. We start this section by describing how the graph is split and how the dual variables enter the two subgraphs. This is followed by extensive experiments for graphs in 2, 3 and 4
dimensions, both multi-threaded and distributed across many computational nodes in a supercomputer.

4.1. Graph cuts as a linear program

Finding the maximum flow, or, by duality, the minimum cut in a graph can be formulated as a linear program. Let G = (V, c) be a graph where V = {s, t} ∪ {1, 2, . . . , n} consists of the source, the sink and the remaining vertices, and c denotes the non-negative edge costs. A cut is a partition S, T of V such that s ∈ S and t ∈ T. The minimum cut problem is to find the partition for which the sum of the costs of all edges between the two sets is minimal. It can be formulated as

    minimize_x    Σ_{i,j∈V} c_{i,j} x_{i,j}
    subject to    x_{i,j} + x_i − x_j ≥ 0,   i, j ∈ V,                   (6)
                  x_s = 0,   x_t = 1,   x ≥ 0.
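As a concrete (toy) illustration, the LP can be handed directly to a general-purpose solver. This is our addition, not the paper's approach, since (6) is solved far more efficiently by max-flow algorithms; the graph data below is made up. The constraint x_{i,j} + x_i − x_j ≥ 0 becomes −x_{i,j} − x_i + x_j ≤ 0, and x_s = 0, x_t = 1 are enforced through variable bounds:

    import numpy as np
    from scipy.optimize import linprog

    # Tiny example graph: vertices s=0, t=1, plus 2 and 3; directed edges with costs.
    edges = [(0, 2, 3.0), (2, 3, 1.0), (3, 1, 2.0), (0, 3, 1.0), (2, 1, 4.0)]
    n_v = 4
    n_e = len(edges)

    # Variable layout: [x_0, ..., x_{n_v-1}, x_e0, ..., x_e{n_e-1}].
    c = np.concatenate([np.zeros(n_v), [w for (_, _, w) in edges]])

    # One inequality -x_i + x_j - x_ij <= 0 per edge (i, j).
    A = np.zeros((n_e, n_v + n_e))
    for k, (i, j, _) in enumerate(edges):
        A[k, i] = -1.0
        A[k, j] = 1.0
        A[k, n_v + k] = -1.0
    b = np.zeros(n_e)

    bounds = [(0, None)] * (n_v + n_e)
    bounds[0] = (0, 0)   # x_s = 0
    bounds[1] = (1, 1)   # x_t = 1

    res = linprog(c, A_ub=A, b_ub=b, bounds=bounds)
    print(res.fun)  # minimum cut value (4.0 for this example)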
The variable x_i indicates whether vertex i is part of S or T (x_i = 0 or 1, respectively) and x_{i,j} indicates whether the edge (i, j) is cut or not. The variables are not constrained to be 0 or 1, but there always exists one such solution, according to the duality between maximum flow and minimum cut. We write D_V for the convex set defined by the constraints in (6).

4.2. Splitting the graph

Now pick two sets M and N such that M ∪ N = V and {s, t} ⊂ M ∩ N. We assume that when i ∈ M \ N and j ∈ N \ M, c_{i,j} = c_{j,i} = 0. That is, every edge is within M or N, or within both. See Fig. 1. We now observe that the objective function in (6) can be rewritten as:

    Σ_{i,j∈V} c_{i,j} x_{i,j} = Σ_{i,j∈M} c_{i,j} x_{i,j} + Σ_{i,j∈N} c_{i,j} x_{i,j} − Σ_{i,j∈M∩N} c_{i,j} x_{i,j}.      (7)
Define

    E_M(x) = Σ_{i,j∈M} c_{i,j} x_{i,j} − (1/2) Σ_{i,j∈M∩N} c_{i,j} x_{i,j},
                                                                             (8)
    E_N(y) = Σ_{i,j∈N} c_{i,j} y_{i,j} − (1/2) Σ_{i,j∈M∩N} c_{i,j} y_{i,j}.
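A small numeric check, ours and with made-up edge data, that the half-weighting in (8) makes E_M + E_N reproduce the original objective (7):

    # Made-up example: directed edge costs keyed by (i, j); s = 0, t = 1.
    c = {(0, 2): 3.0, (2, 3): 1.0, (3, 1): 2.0, (0, 3): 1.0, (2, 1): 4.0}
    M = {0, 1, 2, 3}
    N = {0, 1, 3}
    x = {e: 1.0 for e in c}   # any edge labeling satisfies the identity

    def within(S):
        return [e for e in c if e[0] in S and e[1] in S]

    overlap = within(M & N)   # overlap edges are counted in both energies

    E_M = sum(c[e] * x[e] for e in within(M)) - 0.5 * sum(c[e] * x[e] for e in overlap)
    E_N = sum(c[e] * x[e] for e in within(N)) - 0.5 * sum(c[e] * x[e] for e in overlap)

    total = sum(c[e] * x[e] for e in c)
    assert abs(E_M + E_N - total) < 1e-9   # equation (7) with the 1/2 weights of (8)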
[Figure 1: The graph decomposition into sets M and N. (a) Convention for this figure: numbers inside the nodes indicate s/t connections, positive for s and negative for t. (b) Original graph. (c) Subproblems with vertices in M and N, respectively. The pairwise energies in M ∩ N are part of both E_M and E_N and have to be weighted by 1/2. Four dual variables λ_1, . . . , λ_4 are introduced as s/t connections.]

This leads to the following equivalent linear program:

    minimize_{x∈D_M, y∈D_N}   E_M(x) + E_N(y)
    subject to                x_i = y_i,   i ∈ M ∩ N.                    (9)

Here x is the variable belonging to the set M (left in Fig. 1c) and y belongs to N. The two variables x and y are constrained to be equal in the overlap. The Lagrange dual function of this optimization problem is:

    d(λ) = min_{x∈D_M, y∈D_N} ( E_M(x) + E_N(y) + Σ_{i∈M∩N} λ_i (x_i − y_i) )
         = min_{x∈D_M} ( E_M(x) + Σ_{i∈M∩N} λ_i x_i )
           + min_{y∈D_N} ( E_N(y) − Σ_{i∈M∩N} λ_i y_i ).                 (10)
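Each minimization in (10) is again a minimum cut problem in which the λ-terms enter as extra s/t connections, exactly as in Fig. 1c. The sketch below shows one such subproblem solve using the PyMaxflow library as a stand-in for the paper's BK-based implementation; the library choice and all parameter names are our assumptions:

    import maxflow  # PyMaxflow; using this library here is our assumption

    def solve_subproblem(n, t_links, edges, lam, sign):
        # Minimize E_sub(x) + sign * sum_i lam[i] * x_i with one max-flow call.
        # t_links: {i: (cap_source, cap_sink)}, edges: [(i, j, cap)],
        # lam: {i: value} on the overlap. All names are illustrative.
        g = maxflow.Graph[float]()
        nodes = g.add_nodes(n)
        for i, (cs, ct) in t_links.items():
            g.add_tedge(nodes[i], cs, ct)
        for i, v in lam.items():
            v = sign * v
            if v >= 0:
                g.add_tedge(nodes[i], v, 0.0)    # pay v when x_i = 1 (sink side)
            else:
                g.add_tedge(nodes[i], 0.0, -v)   # pay -v when x_i = 0 (source side)
        for i, j, cap in edges:
            g.add_edge(nodes[i], nodes[j], cap, cap)
        g.maxflow()
        return [int(g.get_segment(nodes[i])) for i in range(n)]

    # d(lam) combines the M-subproblem (sign=+1) and the N-subproblem (sign=-1).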
We now see that evaluating the dual function d amounts to solving two independent minimum cut problems. The extra unary terms λ_i x_i are shown in Fig. 1c. Let x*, y* be the solution to (9) and let λ* maximize d. Because strong duality holds, we have d(λ*) = E_M(x*) + E_N(y*) [11]. Each subproblem may in general have multiple solutions, so to obtain a unique solution we always set our optimal x*_i and y*_i equal to 1, wherever possible.

Splitting a graph into more than two components can be achieved with the same approach. The energy functions analogous to (8) might then contain terms weighted by 1/4 and 1/8, depending on the geometry of the split. See Fig. 2.

[Figure 2: Splitting a graph into several components. (a) 2 × 2. (b) 2 × 2 × 2. The blue, green and red parts are weighted by 1/2, 1/4 and 1/8, respectively.]

4.3. Implementation

Solving the original problem (9) amounts to finding the maximum value of the dual function. It follows from Lemma 1 that x_i − y_i, for i ∈ M ∩ N, is a supergradient to d. In order to maximize d, the iterative scheme described in Section 3 can be used. This scheme requires the dual function to be evaluated many times. To make this efficient we reuse the search trees as described in [12]. Only a small part of the cost coefficients is changed between iterations and our experiments show that the subsequent max-flow computations can be completed within microseconds, see Table 1.

The step size τ needs to be chosen in each iteration. One possible choice is τ = 1/k, where k is the current iteration number. For this particular application, we have found that this scheme and others appearing in the literature [8, 11] are a bit too conservative for our purposes. Instead of using a single step length τ, we associate each vertex in the overlap with its own step length τ_i. This is because different parts of the graph behave in different ways. In each iteration, we ideally want to choose λ_i so that x_i = y_i; therefore, if x_i − y_i changed sign, then the step length was too large and we should move in the opposite direction with a reduced step length.

    foreach i ∈ M ∩ N do
        if x_i − y_i ≠ 0 then
            λ_i ← λ_i + τ_i (x_i − y_i)
            if x_i − y_i ≠ previous difference then
                τ_i ← τ_i / 2
            end
        end
    end

To handle cases like the one shown in Fig. 8, we also increase the step length if nothing happens between iterations. Empirical tests show that keeping an individual step length improves convergence speed for all graphs
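A compact rendering of this per-vertex rule follows; it is our sketch of the pseudocode above, and the dict-based layout and the name update_duals are illustrative:

    def update_duals(lam, tau, x, y, prev_diff):
        # One pass of the per-vertex step rule; lam, tau and prev_diff are
        # dicts keyed by the overlap vertices, x and y the current labelings.
        for i in lam:
            diff = x[i] - y[i]
            if diff != 0:
                lam[i] += tau[i] * diff
                if diff != prev_diff[i]:   # sign flipped: step was too large
                    tau[i] /= 2.0
            # (the paper also increases tau_i when nothing changes between
            # iterations, to handle stalling)
            prev_diff[i] = diff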