GPU-Efficient Recursive Filtering and Summed-Area Tables

Diego Nehab¹   André Maximo¹   Rodolfo S. Lima²   Hugues Hoppe³

¹IMPA   ²Digitok   ³Microsoft Research
Abstract

Image processing operations like blurring, inverse convolution, and summed-area tables are often computed efficiently as a sequence of 1D recursive filters. While much research has explored parallel recursive filtering, prior techniques do not optimize across the entire filter sequence. Typically, a separate filter (or often a causal-anticausal filter pair) is required in each dimension. Computing these filter passes independently results in significant traffic to global memory, creating a bottleneck in GPU systems. We present a new algorithmic framework for parallel evaluation. It partitions the image into 2D blocks, with a small band of additional data buffered along each block perimeter. We show that these perimeter bands are sufficient to accumulate the effects of the successive filters. A remarkable result is that the image data is read only twice and written just once, independent of image size, and thus total memory bandwidth is reduced even compared to the traditional serial algorithm. We demonstrate significant speedups in GPU computation.
1 Introduction
Linear filtering (i.e. convolution) is commonly used to blur, sharpen, or downsample images. A direct implementation evaluating a filter of support d on an h×w image has cost O(hwd). For filters with a wide impulse response, the Fast Fourier Transform reduces the cost to O(hw log hw), regardless of filter support. Often, similar results can be obtained with a recursive filter, in which the computation reuses prior outputs, e.g. y_i = x_i − ½ y_{i−1} (a minimal serial sketch of such a filter appears after the list below). Such feedback allows for an infinite impulse response (IIR), i.e. an effectively large filter support, at reduced cost O(hwr), where the number r of recursive feedbacks (a.k.a. the filter order) is small relative to d. Recursive filters are a key computational tool in several applications:

• Low-pass filtering. Filters like Gaussian kernels are well approximated by a pair of low-order causal and anticausal recursive filters [e.g. Deriche 1992; van Vliet et al. 1998].

• Inverse convolution. If an image X is the result of convolving an image V with a compactly supported filter F, i.e. X = V ∗ F, the original image can be recovered as V = X ∗ F⁻¹. Although the inverse filter F⁻¹ generally has infinite support, it can be expressed exactly as a sequence of low-order recursive filters.

• Summed-area tables. Such tables store the sum of all pixel values above and to the left of each pixel [Crow 1984]. They have many uses in graphics and vision. On a GPU, summed-area tables are typically computed with prefix sums over all columns then all rows of the image [Hensley et al. 2005]. Crucially, a prefix sum is a special case of a 1D first-order recursive filter.
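To make the cost argument concrete, here is a minimal serial sketch (our own illustration, not the paper's GPU code) of the first-order recursive filter y_i = x_i − ½ y_{i−1} applied to one row; the helper name recursive_filter_row is hypothetical.

```c
#include <stddef.h>

/* First-order causal recursive filter: y[i] = x[i] - 0.5*y[i-1],
 * with zero initial condition.  One multiply-add per sample yields
 * an infinite impulse response, whereas direct convolution with a
 * filter of support d would cost O(d) per sample. */
void recursive_filter_row(const float *x, float *y, size_t w)
{
    float prev = 0.0f;
    for (size_t i = 0; i < w; ++i) {
        y[i] = x[i] - 0.5f * prev;   /* feedback on the previous output */
        prev = y[i];
    }
}
```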
These applications all have in common the fact that they invoke a sequence of recursive filters. First, a 2D operation is decomposed into separate 1D filters. Second, except for the case of summed-area tables, one usually desires a centered and well-shaped impulse response function, and this requires the combination of a causal and anticausal filter pair in each dimension.

As an example, consider the problem of finding the coefficient image V such that reconstruction with the bicubic B-spline interpolates the pixel values of a given image X. (Such a preprocess is required in techniques for high-quality image filtering [Blu et al. 1999, 2001].) The coefficients and image are related by the convolution X = V ∗ ⅙[1 4 1] ∗ ⅙[1 4 1]ᵀ. The inverse convolution can be decomposed into four successive 1D recursive filters:

    y_{i,j} = 6 x_{i,j} + α y_{i−1,j},      (causal in i)
    z_{i,j} = −α y_{i,j} + α z_{i+1,j},     (anticausal in i)
    u_{i,j} = 6 z_{i,j} + α u_{i,j−1},      (causal in j)
    v_{i,j} = −α u_{i,j} + α v_{i,j+1},     (anticausal in j)

with α = √3 − 2.
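For reference, the following serial sketch (our own code, assuming zero boundary conditions and a row-major image) applies these four passes in place; the paper's contribution is how to evaluate such a sequence efficiently in parallel, not this naive loop nest.

```c
#include <math.h>

/* Serial in-place sketch of the four 1D recursive passes above.
 * img is h x w, row-major; samples outside the image are treated
 * as zero (zero prologues and epilogues). */
void bspline3_prefilter(float *img, int h, int w)
{
    const float alpha = sqrtf(3.0f) - 2.0f;

    for (int j = 0; j < w; ++j) {
        img[j] *= 6.0f;                                   /* causal in i */
        for (int i = 1; i < h; ++i)
            img[i*w + j] = 6.0f*img[i*w + j] + alpha*img[(i-1)*w + j];
        img[(h-1)*w + j] *= -alpha;                       /* anticausal in i */
        for (int i = h - 2; i >= 0; --i)
            img[i*w + j] = -alpha*img[i*w + j] + alpha*img[(i+1)*w + j];
    }
    for (int i = 0; i < h; ++i) {
        img[i*w] *= 6.0f;                                 /* causal in j */
        for (int j = 1; j < w; ++j)
            img[i*w + j] = 6.0f*img[i*w + j] + alpha*img[i*w + j - 1];
        img[i*w + w - 1] *= -alpha;                       /* anticausal in j */
        for (int j = w - 2; j >= 0; --j)
            img[i*w + j] = -alpha*img[i*w + j] + alpha*img[i*w + j + 1];
    }
}
```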
Although the parallelization of such recursive filters is challenging due to the many feedback data dependencies, efficient algorithms have been developed, as reviewed in section 2. However, an important element not considered in previous work is optimization over the entire sequence of recursive filters. If each filter is applied independently, as is common, the entire image must be read from global memory and then written back each time. Due to the low computational intensity of recursive filters, this memory traffic becomes a clear bottleneck. Because the data access pattern is different in successive filters (down, up, right, left), the different computation passes cannot be directly fused.

Contribution   In this paper, we present a new algorithmic framework to reduce memory bandwidth by overlapping computation over the full sequence of recursive filters, across both multiple dimensions and causal-anticausal pairs. The resulting algorithm scales efficiently to images of arbitrary size.
Our approach partitions the image into 2D blocks of size b×b. This blocking serves several purposes. Inter-block parallelism provides sufficient independent tasks to hide memory latency. The blocks fit in on-chip memory for fast intra-block computations. And the square blocks enable efficient propagation of data in both dimensions.

We associate a set of narrow rectangular buffers with each block's perimeter. These buffers' widths equal the filter order r. Intuitively, the goal of these buffers is to measure the filter response of each block in isolation (i.e. given zero boundary conditions), as well as to efficiently propagate the boundary conditions across the blocks given these measured responses. The surprising result is that we are able to accumulate, with a single block-parallel pass over the input image, all the necessary responses to the filter sequence (down, up, right, left), so as to efficiently initialize boundary conditions for a second block-parallel pass that produces the final solution. Thus, the source image is read only twice, and written just once. A careful presentation requires algebraic derivations, and is detailed in sections 4–5.

Our framework also benefits the computation of summed-area tables (section 6). Even though in that setting there are no anticausal filters, the reduction in bandwidth due solely to the overlapped processing of the two dimensions still provides performance gains.
2 Related work
Prefix sums and scans   A prefix sum, y_i = x_i + y_{i−1}, is a simple case of a first-order recursive filter. A scan generalizes the recurrence using an arbitrary binary associative operator. Parallel prefix sums and scans are important building blocks for numerous algorithms [Iverson 1962; Stone 1971; Blelloch 1989; Sengupta et al. 2007]. Recent work has explored the efficient GPU implementation of scan operations using optimization strategies such as hierarchical reduction, bandwidth reduction through redundant computation, and serial evaluation within shared memory [Sengupta et al. 2007; Dotsenko et al. 2008; Merrill and Grimshaw 2009]. An optimized implementation comes with the CUDPP library [2011].
Tridiagonal systems   The forward- and back-substitution passes associated with solving banded linear systems by LU-decomposition are related to recursive filters, particularly for Toeplitz matrices, although the resulting L and U matrices have nonuniform entries, at least near the domain boundaries. Forward- and back-substitution passes can be implemented with parallel scan operations [Blelloch 1990]. On GPUs, fast tridiagonal solvers process rows and columns independently with cyclic reduction [Hockney and Jesshope 1981], using either pixel-shader passes over global memory [Kass et al. 2006], or within shared memory using a single kernel [Zhang et al. 2010; Göddeke and Strzodka 2011]. The use of shared memory imposes a limit on the maximum image size. Recent work explores overcoming this limitation [Lamas-Rodrígues et al. 2011]. Our blocking strategy is able to overlap causal and anticausal computations to run efficiently on images of arbitrary size. We hope it can be applied to general tridiagonal systems in future work.

Summed-area tables   The typical approach for efficiently generating summed-area tables on the GPU is to compute parallel prefix sums over all columns, and then over all rows of the result. These prefix sums can be computed with recursive doubling using multiple passes of a pixel shader [Hensley et al. 2005]. As previously mentioned, recent prefix-sum algorithms reduce bandwidth by using GPU shared memory [Harris et al. 2008; Hensley 2010]. In fact, summed-area tables can be computed with little effort using the high-level parallel primitives of the CUDPP library and publicly available matrix transposition routines.

Recursive filtering   Recursive filters generalize prefix sums by considering a weighted combination of prior outputs, as indicated in equation (1). This generalization can in fact be implemented as a scan operation with appropriately redefined basic operators [Blelloch 1990]. For recursive filtering of images on the GPU, Ruijters and Thévenaz [2010] exploit parallelism across rows and columns. To expose parallelism within each row and each column, we build on the work of Sung and Mitra [1986, 1992]. The idea is to partition the input into blocks and decompose the computation over each block into two parts: one based only on the block data and assuming zero initial conditions, and the other based only on initial conditions and assuming zero block data (see section 4). We then overlap computation between successive filter passes. As noted by Sung and Mitra, there is also related work on pipelined circuit evaluation of recursive filters [e.g. Parhi and Messerschmitt 1989], but such approaches are more expensive on general-purpose computers.

3 Problem definition

Causal recursive filters of order r are characterized by a set of r feedback coefficients a_k in the following manner. Given a prologue vector p ∈ R^r (i.e., the initial conditions) and an input vector x ∈ R^h, of any size h, the filter F : R^r × R^h → R^h produces the output vector y = F(p, x) with the same size as the input x:

    y_i = x_i − ∑_{k=1}^{r} a_k y_{i−k},   i ∈ {0, …, h−1},   (1)

where on the right-hand side of (1) we set as initial conditions

    y_{−k} = p_{r−k},   k ∈ {1, …, r}.   (2)

The definition of an anticausal filter R : R^h × R^r → R^h of order r is analogous. Given input vector y ∈ R^h and epilogue vector e ∈ R^r, the output vector z = R(y, e) is defined by

    z_i = y_i − ∑_{k=1}^{r} a′_k z_{i+k},   i ∈ {0, …, h−1},   (3)

where on the right-hand side of (3) we set as initial conditions

    z_{h+k−1} = e_{k−1},   k ∈ {1, …, r}.   (4)
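A direct serial transcription of definitions (1)–(4) reads as follows; this is an illustration in plain C with our own function names, not the paper's GPU kernels.

```c
/* Causal filter y = F(p, x) of order r, per (1)-(2).
 * p has r entries (the prologue); x and y have h entries;
 * a[k-1] holds the feedback coefficient a_k. */
void filter_F(const float *a, int r,
              const float *p, const float *x, float *y, int h)
{
    for (int i = 0; i < h; ++i) {
        float acc = x[i];
        for (int k = 1; k <= r; ++k) {
            /* y[i-k] falls back to the prologue when i-k < 0:
             * y_{-k} = p_{r-k}, per (2). */
            float prev = (i - k >= 0) ? y[i - k] : p[r - (k - i)];
            acc -= a[k - 1] * prev;
        }
        y[i] = acc;
    }
}

/* Anticausal filter z = R(y, e) of order r, per (3)-(4).
 * e has r entries (the epilogue); ap[k-1] holds a'_k. */
void filter_R(const float *ap, int r,
              const float *y, const float *e, float *z, int h)
{
    for (int i = h - 1; i >= 0; --i) {
        float acc = y[i];
        for (int k = 1; k <= r; ++k) {
            /* z[i+k] falls back to the epilogue when i+k > h-1:
             * z_{h+k-1} = e_{k-1}, per (4). */
            float next = (i + k < h) ? z[i + k] : e[i + k - h];
            acc -= ap[k - 1] * next;
        }
        z[i] = acc;
    }
}
```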
Causal and anticausal filters frequently appear in pairs. In such cases, we are interested in computing

    z = R(y, e) = R(F(p, x), e).   (5)

In image processing applications, it is often desirable to apply filters independently to each row and to each column of an image (i.e., in a separable way). To process all image columns, we extend the definition of F to operate independently on all corresponding columns of an r×w prologue matrix P and an h×w input matrix X. The extended filter is described by F : R^{r×w} × R^{h×w} → R^{h×w}. For row processing, we define a new extended filter F^T that operates independently on corresponding rows of an h×w input matrix X and an h×r prologue matrix P′. This transposed filter is described by F^T : R^{h×r} × R^{h×w} → R^{h×w}. Anticausal filters can also be extended to operate independently over multiple columns, R : R^{h×w} × R^{r×w} → R^{h×w}, and over multiple rows, R^T : R^{h×w} × R^{h×r} → R^{h×w}.

With these definitions, we are ready to formulate the problem of applying the full sequence of four recursive filters (down, up, right, left) to an image. As shown in figure 1, given an input matrix X and the prologues (P, P′) and epilogues (E, E′) that surround its columns and rows, we wish to compute matrix V, where

    Y = F(P, X),      Z = R(Y, E),
    U = F^T(P′, Z),   V = R^T(U, E′).   (6)

In particular, we want to take advantage of the massive parallelism offered by modern GPUs.

Figure 1: The problem is to compute for an image X a sequence of four 1D recursive filters (down, up, right, left), with boundary conditions given by prologue P, P′ and epilogue E, E′ matrices.
Design goals on GPUs   Modern GPUs are composed of p streaming multiprocessors (SMs), each containing c computational cores. For example, our test hardware (an NVIDIA GTX 480) contains p = 15 SMs, each with c = 32 cores. Because the cores in each SM execute instructions in lockstep, divergence in execution paths should be minimized for maximum performance. SMs execute independently from each other, but typically run the same program, called a kernel, operating over distinct parts of the input.

Although bandwidth to global memory is higher than on a CPU, access must be coalesced for maximum performance. Ideally, all cores should access aligned, consecutive memory addresses. Furthermore, hiding memory latency requires a large number of independent tasks to be available for scheduling. Each SM contains a small amount of fast memory (48KB in our case) shared among its cores. This memory is typically used as a managed cache to reduce the number of expensive accesses to global memory.

Mapping recursive filtering to GPU architectures presents several challenges. One is the dependency chain in the computation, which reduces the number of tasks that can be executed independently. The other is the high bandwidth-to-FLOP ratio, particularly in low-order filters. Without a large number of schedulable tasks to hide memory latency, processing cores become idle waiting for memory transactions to complete, and performance suffers. Given these considerations, the main design goal of our work is to increase the amount of parallelism without increasing the total memory bandwidth required to complete the computation.
4 Prior parallelization strategies revisited
In large images, the independence of column and row processing offers a significant level of parallelism. With that in mind, here is the approach followed by Ruijters and Thévenaz [2010]:

Algorithm RT
RT.1 In parallel for each column in X, apply F sequentially according to (1). Store Y.
RT.2 In parallel for each column in Y, apply R sequentially according to (3). Store Z.
RT.3 In parallel for each row in Z, apply F^T sequentially according to (1). Store U.
RT.4 In parallel for each row in U, apply R^T sequentially according to (3). Store V.

Algorithm RT uses bandwidth proportional to 8hw, and completes in O(hw/cp · 4r) steps. The number of tasks that can be performed in parallel is limited by h or w. In a GPU with 15 SMs and 32 cores per SM, we need at least 480 rows and columns to keep all cores busy. Much larger images are required for effective latency hiding.

Roadmap for improvements   A common approach to improving performance is to restructure the computation so memory accesses are coalesced (during row as well as column processing). This is one of the reasons we employ a blocking strategy (section 4.1). We introduce inter-block parallelism by adapting the method of Sung and Mitra [1986] to modern GPU architectures (section 4.3). The inter-block parallelism also lets us fuse computations, thereby reducing traffic to global memory (section 4.4).

The fundamental contributions of our work begin in section 5.1, where we develop an overlapping strategy for merging causal and anticausal filtering passes. The idea culminates in section 5.2, where we present an overlapped algorithm that merges row and column processing as well. Finally, in section 6 we specialize it to compute summed-area tables. The resulting reduction in memory bandwidth, combined with the high degree of parallelism, enables our overlapped algorithms to outperform prior approaches.

4.1 Block notation

We partition the image into blocks of b×b elements, where b matches the number of threads executed in lockstep (the warp size) within an SM, i.e., b = c = 32 throughout our work. As shown in figure 2(a), we define operators to select a block and its boundary data in adjacent blocks. B_{m,n}(X) identifies in matrix X the b×b data block with index (m, n). P_{m−1,n}(X) extracts the r×b submatrix in the same shape and position as the column-prologue to block B_{m,n}(X). Similarly, E_{m+1,n}(X) denotes the r×b submatrix in the position of the column-epilogue to block B_{m,n}(X). For row processing, transposed operators P^T_{m,n−1}(X) and E^T_{m,n+1}(X) are defined accordingly. (Note that P^T(·) is not the transpose of P(·), but has the same shape.)

Figure 2: (a) 2D block notation showing a block and its boundary data from adjacent blocks. (b) 1D block notation, showing vectors as rows for layout reasons.

It will often be useful to select prologue- and epilogue-shaped submatrices from a b×b block. To that end, we define a tail operation T (for prologues) and a head operation H (for epilogues). Transposed counterparts T^T and H^T are defined analogously:

    P_{m,n}(X) = T(B_{m,n}(X)),       E_{m,n}(X) = H(B_{m,n}(X)),
    P^T_{m,n}(X) = T^T(B_{m,n}(X)),   E^T_{m,n}(X) = H^T(B_{m,n}(X)).   (7)

For simplicity, when describing unidimensional operations, we will drop the vertical block index n, as shown in figure 2(b). Given these definitions, the blocked version of the problem formulation in (6) is to compute all output blocks B_{m,n}(V), where

    B_{m,n}(Y) = F(P_{m−1,n}(Y), B_{m,n}(X)),
    B_{m,n}(Z) = R(B_{m,n}(Y), E_{m+1,n}(Z)),
    B_{m,n}(U) = F^T(P^T_{m,n−1}(U), B_{m,n}(Z)),
    B_{m,n}(V) = R^T(B_{m,n}(U), E^T_{m,n+1}(V)).   (8)

All algorithms described from now on employ blocking to guarantee coalesced memory access patterns. Typically, each stage loads an entire block of data to shared memory, performs computation in shared memory, and finally writes the data back to global memory.
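As a concrete reading of this notation (our own sketch, assuming a row-major single-channel image), the block selector B_{m,n} and the tail/head operators T and H reduce to simple index arithmetic:

```c
/* Index helpers for the blocked layout (row-major image of width
 * `width`, square blocks of size b, filter order r).  Block (m,n)
 * covers rows m*b..m*b+b-1 and columns n*b..n*b+b-1. */

/* Copy block B_{m,n}(X) into a b x b buffer. */
void load_block(const float *X, int width, int b,
                int m, int n, float *blk)
{
    for (int i = 0; i < b; ++i)
        for (int j = 0; j < b; ++j)
            blk[i*b + j] = X[(m*b + i)*width + (n*b + j)];
}

/* Tail operation T: the last r rows of a b x b block
 * (the prologue-shaped submatrix P_{m,n}). */
void tail(const float *blk, int b, int r, float *t)
{
    for (int i = 0; i < r; ++i)
        for (int j = 0; j < b; ++j)
            t[i*b + j] = blk[(b - r + i)*b + j];
}

/* Head operation H: the first r rows of a b x b block
 * (the epilogue-shaped submatrix E_{m,n}). */
void head(const float *blk, int b, int r, float *hbuf)
{
    for (int i = 0; i < r; ++i)
        for (int j = 0; j < b; ++j)
            hbuf[i*b + j] = blk[i*b + j];
}
```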
4.2 Important properties

We first introduce key properties of recursive filtering that will significantly simplify the presentation. The first property of interest is superposition, and it comes from linearity. The effects of the input and the prologue (or epilogue) on the output can be computed independently and later added together:

    F(p, x) = F(0, x) + F(p, 0),   (9)
    R(y, e) = R(y, 0) + R(0, e).   (10)

Notice that the first term F(0, x) does not depend on the prologue, whereas the second term F(p, 0) does not depend on the input. Similar remarks hold for R. We will use superposition extensively to split dependency chains in the computation.

Because filtering operations are linear, we can express them as matrix products. This will help reveal useful associative relations. For any d, let I_d denote the identity matrix of size d×d. Then,

    F(p, 0) = F(I_r, 0) p = A_P^F p,      A_P^F = F(I_r, 0),
    R(0, e) = R(0, I_r) e = A_E^R e,      A_E^R = R(0, I_r),
    F(0, x) = F(0, I_b) x = A_B^F x,      A_B^F = F(0, I_b),   and
    R(y, 0) = R(I_b, 0) y = A_B^R y,      A_B^R = R(I_b, 0).   (11)

Here, A_B^F and A_B^R are b×b matrices, A_P^F is b×r, and A_E^R is b×r. Notice that whereas the columns of A_B^F and A_B^R are shifted impulse responses (which allows for efficient storage), the columns of A_P^F and A_E^R are generally not shifts of each other. Row processing is analogous, except for transposition. This is the justification for our notation choice:

    F^T(p^T, 0) = p^T (A_P^F)^T,     F^T(0, x^T) = x^T (A_B^F)^T,
    R^T(0, e^T) = e^T (A_E^R)^T,     R^T(y^T, 0) = y^T (A_B^R)^T.   (12)

A trivial property of matrix product lets us write

    T(A p) = T(A) p   and   H(A e) = H(A) e.   (13)

Finally, as shown in appendix A,

    T(A_P^F) = A_F^b   and   H(A_E^R) = A_R^b,   (14)

where A_F^b, A_R^b are precomputed r×r matrices that depend only on the feedback coefficients a_k, a′_k of filters F and R, respectively.
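Because A_P^F = F(I_r, 0), the matrices of (14) can also be precomputed numerically, by filtering the columns of I_r across a zero block and keeping the tail. The sketch below does exactly that, reusing the hypothetical filter_F helper from section 3 and assuming small r and b.

```c
void filter_F(const float *a, int r, const float *p,
              const float *x, float *y, int h);   /* sketch from section 3 */

/* Precompute A_P^F = F(I_r, 0) (b x r) and A_F^b = T(A_P^F) (r x r),
 * per (11) and (14).  Assumes r <= 8 and b <= 1024. */
void precompute_AFP(const float *a, int r, int b,
                    float *AFP /* b*r */, float *AFb /* r*r */)
{
    float p[8], x[1024], y[1024];

    for (int i = 0; i < b; ++i) x[i] = 0.0f;        /* zero input block */
    for (int c = 0; c < r; ++c) {                   /* c-th column of I_r */
        for (int k = 0; k < r; ++k) p[k] = (k == c) ? 1.0f : 0.0f;
        filter_F(a, r, p, x, y, b);                 /* column c of A_P^F */
        for (int i = 0; i < b; ++i) AFP[i*r + c] = y[i];
        for (int i = 0; i < r; ++i)                 /* tail rows -> A_F^b */
            AFb[i*r + c] = y[b - r + i];
    }
}
```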
4.3 Inter-block parallelism

We start the description in one dimension for simplicity. When attempting to perform block computations independently, we are faced with the difficulty that each output block

    B_m(y) = F(P_{m−1}(y), B_m(x))   (15)

depends on the prologue P_{m−1}(y), i.e., on the tail end of the previous output block, T(B_{m−1}(y)). This creates a dependency chain that we must work around. By the superposition property (9), we can decompose (15) as

    B_m(y) = F(0, B_m(x)) + F(P_{m−1}(y), 0).   (16)

The first term can be computed independently for each block (inter-block parallelism) because it assumes a zero prologue. We use symbol ȳ to denote this incomplete causal output:

    B_m(ȳ) = F(0, B_m(x)),   (17)

where each block is filtered as if its prologue was null.

The second term of (16) is more challenging due to the dependency chain. Nevertheless, as soon as the dependency is satisfied, each element in a block can be computed independently (intra-block parallelism). To see this, apply property (11):

    F(P_{m−1}(y), 0) = F(I_r, 0) P_{m−1}(y)   (18)
                     = A_P^F P_{m−1}(y),   (19)

and recall that A_P^F depends only on the feedback coefficients.

Sung and Mitra [1986] perform unidimensional recursive filtering on a CRAY X-MP computer. Their algorithm works in two stages. First, each processor works on a separate block, computing B_m(ȳ) sequentially. Second, the blocks are corrected sequentially, according to (16) and (19), with each block's elements updated in parallel across processors. Although this algorithm requires twice the bandwidth and computation of the sequential algorithm, the amount of available parallelism is increased by a factor h/b (i.e., the number of blocks) in the first stage, and by a factor b in the second stage.

Inspired by the bandwidth reduction strategy described by Dotsenko et al. [2008] in the context of scan primitives, we proceed differently. Notice that the dependency chain only involves prologues. Using properties (11), (13), and (14), we can make this apparent:

    P_m(y) = T(B_m(y)) = T(F(P_{m−1}(y), B_m(x)))   (20)
           = T(F(0, B_m(x))) + T(F(P_{m−1}(y), 0))   (21)
           = T(B_m(ȳ)) + T(F(I_r, 0)) P_{m−1}(y)   (22)
           = P_m(ȳ) + T(A_P^F) P_{m−1}(y)   (23)
           = P_m(ȳ) + A_F^b P_{m−1}(y).   (24)

Therefore, in contrast to the work of Sung and Mitra [1986], we refrain from storing the entire blocks B_m(ȳ). Instead, we store only the incomplete prologues P_m(ȳ). Equation (24) then allows us to sequentially compute one complete prologue from the previous by means of multiplication by the r×r matrix A_F^b and an r-vector addition with the incomplete P_m(ȳ). The resulting algorithm is:

Algorithm 1
1.1 In parallel for all m, compute and store each P_m(ȳ).
1.2 Sequentially for each m, compute and store the P_m(y) according to (24) and using the previously computed P_m(ȳ).
1.3 In parallel for all m, compute and store output block B_m(y) using (15) and the previously computed P_{m−1}(y).

The algorithm does some redundant work by computing the recursive filter within each block in both stages 1.1 and 1.3. However, the reduction in memory bandwidth due to not writing the block in stage 1.1 provides a net benefit. Although stage 1.2 is still sequential, it typically executes faster than the remaining stages. This is because we focus on small filter orders r, with block size b large in comparison. This causes stage 1.2 to account for a small fraction of the total I/O and computation. Although we have not experienced a need for it, a hierarchical parallelization strategy is possible, as outlined in section 7.

The derivation for the anticausal counterpart to algorithm 1 follows closely along the same lines and is omitted for brevity. We include the anticausal versions of equations (15), (17), and (24) for completeness. Assuming z = R(y, e):

    B_m(z) = R(B_m(y), E_{m+1}(z)),   (25)
    B_m(z̄) = R(B_m(y), 0),   (26)
    E_m(z) = E_m(z̄) + A_R^b E_{m+1}(z).   (27)

Row operations are analogous and are also omitted.

When applying causal and anticausal filter pairs to process all columns and rows in an image, the four successive applications of algorithm 1 require O(hw/cp (8r + 4·(1/b)(r²+r))) steps to complete. The hw/b tasks that can be performed independently provide the scheduler with sufficient freedom to hide memory access latency. The repeated passes over the input increase memory bandwidth to (12 + 16r/b)hw, which is significantly more than the 8hw required by algorithm RT. Fortunately, we can reduce bandwidth by employing the kernel fusion technique.
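The following serial sketch emulates algorithm 1 in one dimension (again with hypothetical helper names); the loops marked "in parallel" become block-parallel GPU kernels, and A_F^b is assumed to have been precomputed, e.g. with the precompute_AFP sketch above.

```c
void filter_F(const float *a, int r, const float *p,
              const float *x, float *y, int h);   /* sketch from section 3 */

/* Serial sketch of algorithm 1 (1D, causal, order r <= 8, b <= 1024).
 * x and y hold nblocks*b samples; AFb is the precomputed r x r matrix
 * of (24); Pbar and P hold one r-vector per block. */
void algorithm1(const float *a, const float *AFb, int r, int b,
                int nblocks, const float *x, float *y,
                float *Pbar, float *P)
{
    float zero[8] = {0};
    float tmp[1024];

    /* 1.1: in parallel for all m, compute and store P_m(ybar). */
    for (int m = 0; m < nblocks; ++m) {
        filter_F(a, r, zero, x + m*b, tmp, b);       /* B_m(ybar), eq (17) */
        for (int k = 0; k < r; ++k)
            Pbar[m*r + k] = tmp[b - r + k];          /* keep only the tail */
    }
    /* 1.2: sequentially for each m, complete the prologues with (24):
     * P_m(y) = P_m(ybar) + A_F^b P_{m-1}(y). */
    for (int m = 0; m < nblocks; ++m) {
        const float *prev = (m == 0) ? zero : P + (m - 1)*r;
        for (int i = 0; i < r; ++i) {
            float acc = Pbar[m*r + i];
            for (int j = 0; j < r; ++j) acc += AFb[i*r + j] * prev[j];
            P[m*r + i] = acc;
        }
    }
    /* 1.3: in parallel for all m, produce the output blocks with (15). */
    for (int m = 0; m < nblocks; ++m) {
        const float *prev = (m == 0) ? zero : P + (m - 1)*r;
        filter_F(a, r, prev, x + m*b, y + m*b, b);
    }
}
```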
4.4 Kernel fusion
As is known in the literature [Kirk and Hwu 2010], kernel fusion is an optimization that consists of directly using the output of one kernel as the input for the next without going through global memory. The fused kernel executes the code of both original kernels, keeping intermediate results in shared memory. Using algorithm 1 for all filters, we fuse the last stage of F with the first stage of R. We also fuse the last stage of F^T with the first stage of R^T. Combined, the fusion of causal and anticausal processing lowers the required bandwidth to (10 + 16r/b)hw. Row and column processing are also fused. The last stage of R and the first stage of F^T are combined to further reduce bandwidth usage to (9 + 16r/b)hw. The resulting algorithm is presented below:

Algorithm 2
2.1 In parallel for all m and n, compute and store the P_{m,n}(Ȳ).
2.2 Sequentially for each m, but in parallel for each n, compute and store the P_{m,n}(Y) according to (24) and using the previously computed P_{m,n}(Ȳ).
2.3 In parallel for all m and n, compute and store B_{m,n}(Y) using the previously computed P_{m−1,n}(Y). Then immediately compute and store the E_{m,n}(Z̄).
2.4 Sequentially for each m, but in parallel for each n, compute and store the E_{m,n}(Z) according to (27) and using the previously computed E_{m,n}(Z̄).
2.5 In parallel for all m and n, compute and store B_{m,n}(Z) using the previously computed E_{m+1,n}(Z). Then immediately compute and store the P^T_{m,n}(Ū).
2.6 Sequentially for each n, but in parallel for each m, compute and store the P^T_{m,n}(U).
2.7 In parallel for all m and n, compute and store B_{m,n}(U) using the previously computed P^T_{m,n−1}(U). Then compute and store the E^T_{m,n}(V̄).
2.8 Sequentially for each n, but in parallel for each m, compute and store the E^T_{m,n}(V).
2.9 In parallel for all m and n, compute and store B_{m,n}(V) using the previously computed E^T_{m,n+1}(V).
The combination of coalesced memory accesses, increased parallelism, and kernel fusion makes algorithm 2 substantially faster than algorithm RT (see section 8).

We could aggressively reduce I/O even further by refraining from storing any of the intermediary results Y, Z, and U at stages 2.3, 2.5, and 2.7, respectively. These can be recomputed from X and the appropriate prologues/epilogues when needed. Bandwidth would decrease to (6 + 22r/b)hw. (Notice that this is even less than the sequential algorithm RT!) Unfortunately, the repeated computation of Y, Z, and U increases the total number of steps to O(hw/cp (14r + 4·(1/b)(r²+r))), and offsets the gains from the bandwidth reduction. Even though the tradeoff is not advantageous for our test hardware, future GPU generations may tip the balance in the other direction.

Conceptually, buffers P_{m,n}(Ȳ), E_{m,n}(Z̄), P^T_{m,n}(Ū), E^T_{m,n}(V̄) can be thought of as narrow bands around the perimeter of each block B_{m,n}(X). However, these buffers are stored compactly in separate data arrays to maintain efficient memory access.

5 Overlapping

Our main contribution is the development of an overlapping technique which achieves further bandwidth reductions without any substantial increase in computational cost.

5.1 Causal-anticausal overlapping

One source of inefficiency in algorithm 2 is that we wait for the complete causal output block B_{m,n}(Y) in stage 2.3 before we obtain the incomplete anticausal epilogues E_{m,n}(Z̄). The important insight is that it is possible to work instead with twice-incomplete anticausal epilogues E_{m,n}(Ẑ), computed directly from the incomplete causal output block B_{m,n}(Ȳ). Even though these are obtained before the B_{m,n}(Y) are available, we are still able to compute the complete epilogues E_{m,n}(Z) from them. We name this strategy for obtaining prologues P_{m,n}(Y) and epilogues E_{m,n}(Z) with one fewer pass over the input causal-anticausal overlapping.

The trick is to apply properties (9) and (10) to split the dependency chains of anticausal epilogues. Proceeding in one dimension for simplicity:

    E_m(z) = H(R(B_m(y), E_{m+1}(z)))   (28)
           = H(R(F(P_{m−1}(y), B_m(x)), E_{m+1}(z)))   (29)
           = H(R(0, E_{m+1}(z))) + H(R(F(P_{m−1}(y), B_m(x)), 0))   (30)
           = H(R(0, E_{m+1}(z))) + H(R(F(P_{m−1}(y), 0), 0)) + H(R(F(0, B_m(x)), 0)).   (31)

We can simplify further using (11), (13) and (14) to reach

    E_m(z) = H(R(0, E_{m+1}(z))) + H(R(I_b, 0) F(I_r, 0) P_{m−1}(y)) + H(R(F(0, B_m(x)), 0))   (32)
           = H(R(0, I_r)) E_{m+1}(z) + H(R(I_b, 0) F(I_r, 0)) P_{m−1}(y) + H(R(F(0, B_m(x)), 0))   (33)
           = A_R^b E_{m+1}(z) + H(A_B^R) A_P^F P_{m−1}(y) + E_m(ẑ),   (34)

where the twice-incomplete ẑ is such that

    B_m(ẑ) = R(F(0, B_m(x)), 0)   (35)
           = R(ȳ, 0).   (36)

Notice that the r×r matrix H(A_B^R) A_P^F used in (34) can be precomputed. Furthermore, each twice-incomplete epilogue E_m(ẑ) depends only on the corresponding input block B_m(x), and therefore they can all be computed in parallel already in the first pass. As a byproduct of that same pass, we can compute and store the P_m(ȳ) that will be needed to obtain the P_m(y). With the P_m(y), we can compute all E_m(z) sequentially with equation (34). The resulting one-dimensional algorithm is as follows:

Algorithm 3
3.1 In parallel for all m, compute and store P_m(ȳ) and E_m(ẑ).
3.2 Sequentially for each m, compute and store the P_m(y) according to (24) and using the previously computed P_m(ȳ).
3.3 Sequentially for each m, compute and store E_m(z) according to (34) using the previously computed P_{m−1}(y) and E_m(ẑ).
3.4 In parallel for all m, compute each causal output block B_m(y) using (15) and the previously computed P_{m−1}(y). Then compute and store each anticausal output block B_m(z) using (25) and the previously computed E_{m+1}(z).

Using algorithm 3 for both row and column processing and fusing the two stages leads to:

Algorithm 4
4.1 In parallel for all m and n, compute and store the P_{m,n}(Ȳ) and E_{m,n}(Ẑ).
4.2 Sequentially for each m, but in parallel for each n, compute and store the P_{m,n}(Y) according to (24) and using the previously computed P_{m,n}(Ȳ).
4.3 Sequentially for each m, but in parallel for each n, compute and store the E_{m,n}(Z) according to (34) using the previously computed P_{m−1,n}(Y) and E_{m,n}(Ẑ).
4.4 In parallel for all m and n, compute B_{m,n}(Y) using the previously computed P_{m−1,n}(Y). Then compute and store the B_{m,n}(Z) using the previously computed E_{m+1,n}(Z). Finally, compute and store both P^T_{m,n}(Ū) and E^T_{m,n}(V̂).
4.5 Sequentially for each n, but in parallel for each m, compute and store the P^T_{m,n}(U) from P^T_{m,n}(Ū).
4.6 Sequentially for each n, but in parallel for each m, compute and store each E^T_{m,n}(V) using the previously computed P^T_{m,n−1}(U) and E^T_{m,n}(V̂).
4.7 In parallel for all m and n, compute B_{m,n}(U) using the previously computed P^T_{m,n−1}(U) and B_{m,n}(Z). Then compute and store the B_{m,n}(V) using the previously computed E^T_{m,n+1}(V).

As expected, algorithm 4 is faster than algorithm 2 (see section 8). The advantage stems from a similar O(hw/cp (8r + 6·(1/b)(r²+r))) number of steps combined with lower (5 + 18r/b)hw bandwidth.
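To make the overlapping concrete, here is a serial sketch of algorithm 3 restricted to a first-order pair (r = 1) with zero overall boundary conditions; it reuses the hypothetical filter_F/filter_R helpers from section 3 and measures the scalar constants of (24) and (34) operationally rather than analytically. On the GPU, the two loops marked as parallel are the two block-parallel passes.

```c
#include <stdlib.h>

void filter_F(const float *a, int r, const float *p,
              const float *x, float *y, int h);   /* sketches from section 3 */
void filter_R(const float *ap, int r, const float *y,
              const float *e, float *z, int h);

/* Serial sketch of algorithm 3 for a first-order pair (r = 1):
 * y_i = x_i - a*y_{i-1},  z_i = y_i - ap*z_{i+1}.
 * x and z hold nb*b samples; zero boundary conditions overall. */
void algorithm3_order1(float a, float ap, int b, int nb,
                       const float *x, float *z)
{
    float zero1 = 0.0f, one = 1.0f;
    float *ybar = malloc(b * sizeof *ybar), *tmp = malloc(b * sizeof *tmp);
    float *zblk = calloc(b, sizeof *zblk);
    float *Pbar = malloc(nb * sizeof *Pbar), *P = malloc(nb * sizeof *P);
    float *Ehat = malloc(nb * sizeof *Ehat), *E = malloc(nb * sizeof *E);

    /* The 1x1 "matrices" of (24) and (34), measured operationally. */
    filter_F(&a, 1, &one, zblk, ybar, b);            /* A_P^F = F(1, 0)     */
    float AFb = ybar[b-1];                           /* T(A_P^F), eq (14)   */
    filter_R(&ap, 1, ybar, &zero1, tmp, b);          /* A_B^R A_P^F         */
    float cFP = tmp[0];                              /* H(A_B^R) A_P^F      */
    filter_R(&ap, 1, zblk, &one, tmp, b);            /* A_E^R = R(0, 1)     */
    float ARb = tmp[0];                              /* H(A_E^R), eq (14)   */

    /* Pass 1 (parallel over m): incomplete P_m(ybar), twice-incomplete
     * E_m(zhat), straight from the input blocks (stage 3.1). */
    for (int m = 0; m < nb; ++m) {
        filter_F(&a, 1, &zero1, x + m*b, ybar, b);   /* B_m(ybar)           */
        Pbar[m] = ybar[b-1];                         /* tail                */
        filter_R(&ap, 1, ybar, &zero1, tmp, b);      /* B_m(zhat), eq (36)  */
        Ehat[m] = tmp[0];                            /* head                */
    }
    /* Sequential completion: (24) forward, then (34) backward. */
    for (int m = 0; m < nb; ++m)
        P[m] = Pbar[m] + AFb * (m ? P[m-1] : 0.0f);
    for (int m = nb - 1; m >= 0; --m)
        E[m] = ARb * (m + 1 < nb ? E[m+1] : 0.0f)
             + cFP * (m ? P[m-1] : 0.0f) + Ehat[m];

    /* Pass 2 (parallel over m): full causal then anticausal output. */
    for (int m = 0; m < nb; ++m) {
        float p = m ? P[m-1] : 0.0f;
        float e = (m + 1 < nb) ? E[m+1] : 0.0f;
        filter_F(&a, 1, &p, x + m*b, ybar, b);       /* eq (15) */
        filter_R(&ap, 1, ybar, &e, z + m*b, b);      /* eq (25) */
    }
    free(ybar); free(tmp); free(zblk);
    free(Pbar); free(P); free(Ehat); free(E);
}
```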
5.2 Row-column causal-anticausal overlapping

There is still one source of inefficiency: we wait until the complete block B_{m,n}(Z) is available in stage 4.4 before computing the incomplete P^T_{m,n}(Ū) and twice-incomplete E^T_{m,n}(V̂). Fortunately, we can overlap row and column computations and work with thrice-incomplete transposed prologues and four-times-incomplete transposed epilogues obtained directly during stage 4.1. From these, we can compute the complete P^T_{m,n}(U) and E^T_{m,n}(V) without going over the input again.

The derivation of the procedure for completing the transposed prologues and epilogues is somewhat tedious, and can be found in full in appendix B. Below is the formula for completing thrice-incomplete transposed prologues:

    P^T_{m,n}(U) = P^T_{m,n−1}(U) (A_F^b)^T
                 + A_E^R E_{m+1,n}(Z) T(A_B^F)^T
                 + (A_B^R A_P^F) P_{m−1,n}(Y) T(A_B^F)^T
                 + P^T_{m,n}(Ǔ),   (37)

where the b×r matrices A_E^R and A_B^R A_P^F, as well as T(A_B^F), can all be precomputed. The thrice-incomplete Ǔ satisfies

    B_{m,n}(Ǔ) = F^T(0, B_{m,n}(Ẑ)).   (38)

To complete the four-times-incomplete transposed epilogues of V:

    E^T_{m,n}(V) = E^T_{m,n+1}(V) (A_R^b)^T
                 + P^T_{m,n−1}(U) (H(A_B^R) A_P^F)^T
                 + A_E^R E_{m+1,n}(Z) (H(A_B^R) A_B^F)^T
                 + (A_B^R A_P^F) P_{m−1,n}(Y) (H(A_B^R) A_B^F)^T
                 + E^T_{m,n}(Ṽ).   (39)

Here, the r×r matrix H(A_B^R) A_P^F is the same as in (34), and the b×r matrix A_B^R A_P^F is the same as in (37). The new r×b matrix H(A_B^R) A_B^F is also precomputed. Finally, the four-times-incomplete Ṽ is

    B_{m,n}(Ṽ) = R^T(B_{m,n}(Ǔ), 0).   (40)

The fully overlapped algorithm is:

Algorithm 5
5.1 In parallel for all m and n, compute and store each P_{m,n}(Ȳ), E_{m,n}(Ẑ), P^T_{m,n}(Ǔ), and E^T_{m,n}(Ṽ).
5.2 In parallel for all n, sequentially for each m, compute and store the P_{m,n}(Y) according to (24) and using the previously computed P_{m,n}(Ȳ).
5.3 In parallel for all n, sequentially for each m, compute and store E_{m,n}(Z) according to (34) and using the previously computed P_{m−1,n}(Y) and E_{m,n}(Ẑ).
5.4 In parallel for all m, sequentially for each n, compute and store P^T_{m,n}(U) according to (37) and using the previously computed P^T_{m,n}(Ǔ), P_{m−1,n}(Y), and E_{m+1,n}(Z).
5.5 In parallel for all m, sequentially for each n, compute and store E^T_{m,n}(V) according to (39), using the previously computed E^T_{m,n}(Ṽ), P^T_{m,n−1}(U), P_{m−1,n}(Y), and E_{m+1,n}(Z).
5.6 In parallel for all m and n, successively compute B_{m,n}(Y), B_{m,n}(Z), B_{m,n}(U), and B_{m,n}(V) according to (8) and using the previously computed P_{m−1,n}(Y), E_{m+1,n}(Z), P^T_{m,n−1}(U), and E^T_{m,n+1}(V). Store B_{m,n}(V).

The number of steps required by the fully overlapped algorithm 5 is only O(hw/cp (8r + (1/b)(18r²+10r))). In our usage scenario, the filter order is much smaller than the block size. The step complexity is therefore about twice that of the sequential algorithm. On the other hand, algorithm 5 uses only (3 + 22r/b)hw bandwidth. This is less than half of what is needed by the sequential algorithm. In fact, considering that the lower bound on the required amount of traffic to global memory is 2hw (assuming the entire image fits in shared memory), this is a remarkable result. It is not surprising that algorithm 5 is our fastest, at least in first-order filters (see section 8).

6 Overlapped summed-area tables

A summed-area table is obtained using prefix sums over columns and rows, and the prefix-sum filter S is a special case of a first-order causal recursive filter (with feedback weight a_1 = −1). We can therefore directly apply the idea of overlapping to optimize the computation of summed-area tables. In blocked form, the problem is to obtain output V from input X, where the blocks satisfy the relations

    B_{m,n}(V) = S^T(P^T_{m,n−1}(V), B_{m,n}(Y))   and
    B_{m,n}(Y) = S(P_{m−1,n}(Y), B_{m,n}(X)).   (41)

With the framework we developed, overlapping the computation of S and S^T is easy. In the first stage, we compute incomplete output blocks B_{m,n}(Ȳ) and B_{m,n}(V̂) directly from the input:

    B_{m,n}(Ȳ) = S(0, B_{m,n}(X))   and   (42)
    B_{m,n}(V̂) = S^T(0, B_{m,n}(Ȳ)).   (43)

We store only the incomplete prologues P_{m,n}(Ȳ) and P^T_{m,n}(V̂). Then we complete them using:

    P_{m,n}(Y) = P_{m−1,n}(Y) + P_{m,n}(Ȳ)   and   (44)
    P^T_{m,n}(V) = P^T_{m,n−1}(V) + s(P_{m−1,n}(Y)) + P^T_{m,n}(V̂).   (45)

Scalar s(P_{m−1,n}(Y)) in (45) denotes the sum of all entries in vector P_{m−1,n}(Y). The simplified notation means the scalar should be added to all entries of P^T_{m,n}(V). The resulting algorithm is depicted in figure 3 and described below:

Figure 3: Overlapped summed-area table computation according to algorithm SAT. Stage S.1 reads the input (in gray) then computes and stores incomplete prologues P_{m,n}(Ȳ) (in red) and P^T_{m,n}(V̂) (in blue). Stage S.2 completes prologues P_{m,n}(Y) and computes scalars s(P_{m−1,n}(Y)) (in yellow). Stage S.3 completes prologues P^T_{m,n}(V). Finally, stage S.4 reads the input and completed prologues, then computes and stores the final summed-area table.

Algorithm SAT
S.1 In parallel for all m and n, compute and store the P_{m,n}(Ȳ) and P^T_{m,n}(V̂) according to (42) and (43).
S.2 Sequentially for each m, but in parallel for each n, compute and store the P_{m,n}(Y) according to (44) and using the previously computed P_{m,n}(Ȳ). Compute and store s(P_{m,n}(Y)).
S.3 Sequentially for each n, but in parallel for each m, compute and store the P^T_{m,n}(V) according to (45) using the previously computed P_{m−1,n}(Y), P^T_{m,n}(V̂), and s(P_{m,n}(Y)).
S.4 In parallel for all m and n, compute B_{m,n}(Y), then compute and store B_{m,n}(V) according to (41) and using the previously computed P_{m,n}(Y) and P^T_{m,n}(V).

Our overlapped algorithm SAT uses only (3 + 8/b + 2/b²)hw of bandwidth and requires O(hw/cp (4 + 3/b)) steps for completion. (More precisely, the step count is O(hw/cp (4 + (b+1)/b² + 2/b)), but (b+1)/b ≈ 1.) In contrast, using a stock implementation of prefix scan to process rows, followed by matrix transposition, followed by column processing (as in [Harris et al. 2008]) requires at least 8hw bandwidth. Even our transposition-free, fused, bandwidth-saving variant of the multi-wave approach of Hensley [2010] requires at least (4 + 9/b)hw bandwidth. As shown in section 8, the bandwidth reduction leads to the higher performance of the overlapped algorithm SAT.
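As a concrete serial rendering of algorithm SAT (our own sketch; the real implementation uses CUDA kernels and shared memory), the four stages look as follows, assuming the image sides are multiples of the block size b:

```c
#include <stdlib.h>

/* Serial sketch of the overlapped summed-area table (algorithm SAT).
 * X is the row-major h x w input; V receives the summed-area table. */
void sat_overlapped(const float *X, float *V, int h, int w, int b)
{
    int mb = h / b, nb = w / b;
    float *Py = calloc((size_t)mb*nb*b, sizeof *Py);  /* P_{m,n}(Y)    */
    float *Pv = calloc((size_t)mb*nb*b, sizeof *Pv);  /* P^T_{m,n}(V)  */
    float *s  = calloc((size_t)mb*nb,   sizeof *s);   /* s(P_{m,n}(Y)) */
    float *blk = malloc((size_t)b*b * sizeof *blk);

    /* S.1: incomplete prologues, straight from the input blocks. */
    for (int m = 0; m < mb; ++m)
    for (int n = 0; n < nb; ++n) {
        float *py = Py + (size_t)(m*nb + n)*b;   /* block column sums     */
        float *pv = Pv + (size_t)(m*nb + n)*b;   /* prefix of row sums    */
        float run = 0.0f;
        for (int i = 0; i < b; ++i) {
            float rowsum = 0.0f;
            for (int j = 0; j < b; ++j) {
                float v = X[(size_t)(m*b + i)*w + (n*b + j)];
                py[j] += v;  rowsum += v;
            }
            run += rowsum;  pv[i] = run;
        }
    }
    /* S.2: complete P_{m,n}(Y) with (44), sequentially in m; store s. */
    for (int n = 0; n < nb; ++n)
        for (int m = 0; m < mb; ++m) {
            float *py = Py + (size_t)(m*nb + n)*b, sum = 0.0f;
            for (int j = 0; j < b; ++j) {
                if (m > 0) py[j] += Py[(size_t)((m-1)*nb + n)*b + j];
                sum += py[j];
            }
            s[m*nb + n] = sum;
        }
    /* S.3: complete P^T_{m,n}(V) with (45), sequentially in n. */
    for (int m = 0; m < mb; ++m)
        for (int n = 0; n < nb; ++n) {
            float *pv = Pv + (size_t)(m*nb + n)*b;
            float add = (m > 0) ? s[(m-1)*nb + n] : 0.0f;
            for (int i = 0; i < b; ++i) {
                pv[i] += add;
                if (n > 0) pv[i] += Pv[(size_t)(m*nb + (n-1))*b + i];
            }
        }
    /* S.4: per block, column prefix seeded by P_{m-1,n}(Y), then row
     * prefix seeded by P^T_{m,n-1}(V), as in (41). */
    for (int m = 0; m < mb; ++m)
    for (int n = 0; n < nb; ++n) {
        for (int j = 0; j < b; ++j) {                     /* B_{m,n}(Y) */
            float acc = (m > 0) ? Py[(size_t)((m-1)*nb + n)*b + j] : 0.0f;
            for (int i = 0; i < b; ++i) {
                acc += X[(size_t)(m*b + i)*w + (n*b + j)];
                blk[i*b + j] = acc;
            }
        }
        for (int i = 0; i < b; ++i) {                     /* B_{m,n}(V) */
            float acc = (n > 0) ? Pv[(size_t)(m*nb + (n-1))*b + i] : 0.0f;
            for (int j = 0; j < b; ++j) {
                acc += blk[i*b + j];
                V[(size_t)(m*b + i)*w + (n*b + j)] = acc;
            }
        }
    }
    free(blk); free(Py); free(Pv); free(s);
}
```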
7 Additional considerations
Feedforward in recursive filters   A more general formulation for causal recursive filters would include an additional a_0 coefficient as well as a feedforward filter component with coefficients b_k:

    y_i = (1/a_0) ( ∑_{k=0}^{q} b_k x_{i−k} − ∑_{k=1}^{r} a_k y_{i−k} ).   (46)

The added complication due to the feedforward coefficients is small compared to the basic challenge of the feedback coefficients, because the feedforward computation does not involve any dependency chains. One implementation approach is to simply pre-convolve the input with the feedforward component and then apply the appropriate feedback-only filter. To reduce memory bandwidth, it is best to fuse the feedforward computation with the parallelized feedback filter. This is relatively straightforward.
Flexibility with higher-order filters   Higher-order causal or anticausal recursive filters can be implemented using different schemes: direct, cascaded, or in parallel. A direct implementation uses equations (1) or (3). Alternatively, we can decompose a higher-order filter into an equivalent set of first- and second-order filters, to be applied in series one after the other (i.e., cascaded) or independently and then added together (i.e., in parallel). See [Oppenheim and Schafer 2010, section 6.3] for an extensive discussion.
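The equivalence between realizations is easy to check numerically. The self-contained example below (our own code) filters an impulse with a cascade of two first-order feedback filters and with the corresponding direct second-order filter; under the convention of equation (1), cascading feedback coefficients a and b is equivalent to the direct filter with a_1 = a + b and a_2 = ab, so the two outputs match.

```c
#include <stdio.h>

static void first_order(float a, const float *x, float *y, int n)
{
    float prev = 0.0f;
    for (int i = 0; i < n; ++i) { y[i] = x[i] - a*prev; prev = y[i]; }
}

static void second_order(float a1, float a2, const float *x, float *y, int n)
{
    float p1 = 0.0f, p2 = 0.0f;              /* y[i-1], y[i-2] */
    for (int i = 0; i < n; ++i) {
        y[i] = x[i] - a1*p1 - a2*p2;
        p2 = p1; p1 = y[i];
    }
}

int main(void)
{
    enum { N = 16 };
    float x[N] = { 1.0f }, t[N], yc[N], yd[N];   /* impulse input */
    float a = 0.25f, b = -0.5f;

    first_order(a, x, t, N);                     /* cascaded realization */
    first_order(b, t, yc, N);
    second_order(a + b, a*b, x, yd, N);          /* direct realization   */

    for (int i = 0; i < N; ++i)                  /* outputs should match */
        printf("%2d  cascade % .6f   direct % .6f\n", i, yc[i], yd[i]);
    return 0;
}
```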
Although the different schemes are mathematically equivalent, in practice we may prefer one to the other due to bandwidth-to-FLOP ratio, implementation simplicity, opportunities for code reuse, and numerical precision. (Precision differences are less of a concern in floating-point [Oppenheim and Schafer 2010, section 6.9].) Our overlapping strategy is compatible with all these schemes. The tools we provide allow the implementation of overlapped recursive filters of any order, in a way that maps well to modern GPUs. Overlapping can be used to directly implement a high-order filter, or alternatively to implement the first- and second-order filters used as building blocks for a cascaded or parallel realization.

Hierarchical reduction of prologues and epilogues   In our algorithms, incomplete prologues and epilogues are completed sequentially (i.e., without inter-block parallelism). It is possible to perform such computations in a hierarchical fashion, thereby increasing parallelism and potentially speeding up the process. Completion equations such as (24) have the same structure as a recursive filter. This is clear when we consider the formulation in (52) of appendix A, where we process the input in groups of r elements at a time. Taking stage 5.2 as an example, we could group b prologues P_{ib,n}(Ȳ) to P_{(i+1)b,n}(Ȳ) into input blocks and recursively apply a variant of algorithm 1 to compute all P_{m,n}(Y). Step 2 of this new computation is b times smaller still, consisting of only w/b² prologues. We can repeat the recursion until we reach a problem size small enough to be solved sequentially, the results of which can be propagated back up to complete the original filtering operation.

We skipped this optimization in our work due to diminishing returns. For example, the amount of time spent by algorithm 5 on stages 5.2–5.5 for a 4096² image amounts to only about 17% of the total. This means that, even if we could run these stages instantaneously, performance would improve by only 20%.
Recursive doubling for intra-block parallelism   In our image-processing applications, the processing of each two-dimensional block exposes sufficient parallelism to occupy all cores in each SM. When processing one-dimensional input with algorithms 1 or 3, however, it may be necessary to increase intra-block parallelism.
Recursive doubling [Stone 1973] is a well-known strategy for first-order recursive filter parallelization that we can use to perform intra-block computations. The idea maps well to GPU architectures, and is related to the tree-reduction optimization employed by efficient one-dimensional parallel scan algorithms [Sengupta et al. 2007; Dotsenko et al. 2008; Merrill and Grimshaw 2009]. Using a block size b that matches the number of processing cores c, the idea is to break the computation into steps in which each entry is modified by a different core. Using recursive doubling, computation of b elements completes in O(log₂ b) steps. The extension of recursive doubling to higher-order recursive filters has been described by Kooge and Stone [1973]. The key idea is to group input and output elements into r-vectors and consider equation (1) in the matrix form of (52) in appendix A. Since the algebraic structure of this form is the same as that of a first-order filter, the same recursive doubling structure can be reused.
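A minimal serial emulation of the idea for a first-order filter y_i = x_i − a y_{i−1} is shown below (our own sketch); each round of the outer loop corresponds to one lock-step parallel step over the block, and the inner updates only read indices that the current round has not yet modified, mimicking the parallel semantics.

```c
#include <stdlib.h>

/* Serial emulation of recursive doubling [Stone 1973] for the
 * first-order recurrence y_i = x_i - a*y_{i-1} with y_{-1} = 0.
 * y initially holds the input x; on return it holds the output. */
void recursive_doubling_order1(float a, float *y, int n)
{
    float *w = malloc(n * sizeof *w);
    for (int i = 0; i < n; ++i) w[i] = -a;      /* feedback weight  */

    for (int d = 1; d < n; d *= 2) {            /* O(log2 n) rounds */
        /* "In parallel": iterate downward so y[i-d], w[i-d] still
         * hold their start-of-round values when read. */
        for (int i = n - 1; i >= d; --i) {
            y[i] += w[i] * y[i - d];
            w[i] *= w[i - d];
        }
    }
    free(w);
}
```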
8 Results

Table 1 summarizes the main characteristics of all the algorithms that we evaluated, in terms of the number of required steps, the progression in parallelism, and the reduction in memory bandwidth. Our test hardware consisted of an NVIDIA GTX 480 with 1.5GB of RAM (480 CUDA cores, p = 15 SMs, c = 32 cores/SM). All algorithms were implemented in C for CUDA, under CUDA 4.0. All our experiments ran on single-channel 32-bit floating-point images. Image sizes ranged from 64² to 4096² pixels, in 64-pixel increments. Measurements were repeated 100 times to reduce variation. Note that small images are solved in sequence, not in batches that could be processed independently for added parallelism and performance.

Table 1: Properties of the presented algorithms, for row and column processing of an h×w image with causal and anticausal recursive filters of order r, assuming block size b, and p SMs with c cores each. For each algorithm, we show an estimate of the number of steps required, the maximum number of parallel independent threads, and the required memory bandwidth.

Alg. | Step complexity                   | Max. # of threads | Bandwidth
RT   | (hw/cp) 4r                        | h, w              | 8hw
2    | (hw/cp)(8r + 4(1/b)(r²+r))        | hw/b              | (9 + 16r/b)hw
4    | (hw/cp)(8r + 6(1/b)(r²+r))        | hw/b              | (5 + 18r/b)hw
5    | (hw/cp)(8r + (1/b)(18r²+10r))     | hw/b              | (3 + 22r/b)hw
SAT  | (hw/cp)(4 + 3/b)                  | hw/b              | (3 + 8/b + 2/b²)hw

First-order filters   As an example of a combined row-column causal-anticausal first-order filter, we solve the bicubic B-spline interpolation problem (see also figure 7(top)). Algorithm RT is the original implementation by Ruijters and Thévenaz [2010] (available on-line). Algorithm 2 adds blocking for memory coalescing, inter-block parallelism, and kernel fusion. (Algorithms 1 and 3 work on 1D input and are omitted from our tests.) Algorithm 4 employs overlapped causal-anticausal processing and fused row-column processing. Finally, algorithm 5 is fully overlapped. Performance numbers in figure 4(top) show the progression in throughput described throughout the text. As an example, the throughput of algorithm 5 solving the bicubic B-spline interpolation problem for 1024² images is 4.7GiP/s (gibi-pixels per second). Each image is transformed in just 0.21ms, or equivalently at more than 4800fps. The algorithm appears to be compute-bound. Eliminating computation and keeping data movements, algorithm 5 attains 13GiP/s (152GB/s) on large images, whereas with computation it reaches 6GiP/s (72GB/s).

Figure 4: Throughput of the various algorithms for row-column causal-anticausal recursive filtering. (Top plot) First-order filter (e.g. bicubic B-spline interpolation). (Bottom plot) Second-order filter (e.g. biquintic B-spline interpolation).

Second-order filters   The second-order, causal-anticausal, row-column separable recursive filter used in our tests solves the biquintic B-spline interpolation problem. Figure 4(bottom) compares three alternative structures: 2×5¹ is a cascaded implementation using two fused fully-overlapped passes of first-order algorithm 5, 4² is a direct second-order implementation of algorithm 4, and 5² is a direct fully-overlapped second-order implementation of algorithm 5. Our implementation of 4² is the fastest, despite using more bandwidth than 5². The higher complexity of the second-order equations slows down stages 5.4 and 5.5 substantially. We believe this is an optimization issue that may be resolved with a future hardware, compiler, or implementation. Until then, the best alternative is to use the simpler and faster 4². It runs at 3.1GiP/s for 1024² images, processing each image in less than 0.32ms, or equivalently at more than 3200fps.
Precision   A useful measure of numerical precision in the solution of a linear system such as the bicubic B-spline interpolation problem is the relative residual. Using random input images with entries in the interval [0, 1], the relative residual was less than 2 × 10⁻⁷ for all algorithms and for all image sizes.
Summed-area tables   Our overlapped summed-area table algorithm was compared with the algorithm of Harris et al. [2008] implemented in the CUDPP library [2011], and with the multi-wave method of Hensley [2010]. We also compare against a version of Hensley's method improved by two new optimizations: fusion of processing across rows and columns, and storage of just "carries" (e.g. P_{m,n}(Ȳ)) between intermediate stages to reduce bandwidth. As expected, the results in figure 5 confirm that our specialized overlapped summed-area table algorithm outperforms the others.
Figure 5: Throughput of summed-area table computation, using our overlapped algorithm, the multi-wave algorithm of Hensley [2010], a revisited version of it, and the CUDPP code of Harris et al. [2008].
Figure 6: Throughput of 2D Gaussian filtering, using our overlapped 3rd-order recursive implementation, direct convolution with d taps (σ = d/6), and the Fast Fourier Transform.
Recursive Gaussian filters   As mentioned in section 1, Gaussian filters of wide support are well approximated by recursive filters (see also figure 7(bottom)). To that end, we implemented the third-order approximation (N = 3) of van Vliet et al. [1998] with cascaded first- and second-order filters. The first-order component uses algorithm 5, and is fused with the second-order component, which uses algorithm 4². Figure 6 compares performance of the overlapped recursive implementation with convolution in the frequency domain using the CUFFT library [2007], and with highly optimized direct convolution (separable, in shared memory) [Podlozhnyuk 2007]. Because computation in the direct convolution is proportional to the number of filter taps d, we show results for filter kernels with d = 17, 33, 65 taps. Although the direct approach is fastest for small filter windows, both the recursive filter and CUFFT are faster for larger windows. For reasonably sized images, overlapped recursive filtering is fastest. Moreover, its advantage over CUFFT should continue to increase on larger images due to its lower, linear-time complexity.
9 Conclusions
We describe an efficient algorithmic framework that reduces memory bandwidth over a sequence of recursive filters. It splits the input into blocks that are processed in parallel by modern GPU architectures, and overlaps the causal, anticausal, row, and column filter processing. The full sequence of filter passes requires reading the input image only twice, and writing it only once. The reduced number of accesses to global memory leads to substantial performance gains. We demonstrate practicality in several scenarios: solving bicubic and biquintic B-spline interpolation problems, computing summed-area tables, and performing Gaussian filtering.

Future work   Although we have focused on 2D processing, our framework of overlapping successive filters should extend to volumes (or even higher dimensions) using analogous derivations. However, one practical difficulty is the limited shared memory (e.g. 48KB) available in current GPUs. Making volumetric blocks of size b³ fit within shared memory would require significantly smaller block sizes, and this would decrease the efficiency of the algorithm.
Another area for future work is to adapt our algorithms to modern multicore CPUs (rather than GPUs). Although row and column parallelism is already sufficient for the fewer CPU threads, the memory bandwidth reduction may still be beneficial. In particular, our blocking structure may allow more accesses to be served by the faster L1 cache (or by the L2 cache in the case of volumetric processing). Whether this would lead to performance gains remains to be seen. The reduced bandwidth may also be beneficial on architectures like the Cell microprocessor, in which memory transfers must be managed explicitly. Finally, we would like to investigate whether the idea of overlapped computation can be used to optimize the solution of general narrowband-diagonal linear systems.
10 Acknowledgements
This work has been funded in part by a post-doctoral scholarship from CNPq and by an INST grant from FAPERJ.
Figure 7: Effect of successive recursive filtering passes on a test image (panels: Down, +Up, +Right, +Left). (Top row) Cubic B-spline interpolation filter, which acts as a high-pass. (Bottom row) Gaussian filter with σ = 5, which acts as a low-pass. Insets show the combined filter impulse responses (not to scale). Our overlapped algorithm computes all four of these successive filtering passes using just two traversals of the input image.

References

BLELLOCH, G. E. 1989. Scans as primitive parallel operations. IEEE Transactions on Computers, 38(11).
BLELLOCH, G. E. 1990. Prefix sums and their applications. Technical Report CMU-CS-90-190, Carnegie Mellon University.
BLU, T., THÉVENAZ, P., and UNSER, M. 1999. Generalized interpolation: Higher quality at no additional cost. In IEEE International Conference on Image Processing, volume 3, 667–671.
BLU, T., THÉVENAZ, P., and UNSER, M. 2001. MOMS: Maximal-order interpolation of minimal support. IEEE Transactions on Image Processing, 10(7):1069–1080.
CROW, F. 1984. Summed-area tables for texture mapping. In Computer Graphics (Proceedings of ACM SIGGRAPH 84), volume 18, ACM, 207–212.
CUDPP library. 2011. URL http://code.google.com/p/cudpp/.
CUFFT library. 2007. URL http://developer.nvidia.com/cuda-toolkit. NVIDIA Corporation.
DERICHE, R. 1992. Recursively implementing the Gaussian and its derivatives. In Proceedings of the 2nd Conference on Image Processing, 263–267.
DOTSENKO, Y., GOVINDARAJU, N. K., SLOAN, P.-P., BOYD, C., and MANFERDELLI, J. 2008. Fast scan algorithms on graphics processors. In Proceedings of the 22nd Annual International Conference on Supercomputing, 205–213.
GÖDDEKE, D. and STRZODKA, R. 2011. Cyclic reduction tridiagonal solvers on GPUs applied to mixed-precision multigrid. IEEE Transactions on Parallel and Distributed Systems, 22(1):22–32.
HARRIS, M., SENGUPTA, S., and OWENS, J. D. 2008. Parallel prefix sum (scan) with CUDA. In GPU Gems 3, chapter 39.
HENSLEY, J. 2010. Advanced rendering technique with DirectX 11: High-quality depth of field. Gamefest 2010 talk.
HENSLEY, J., SCHEUERMANN, T., COOMBE, G., SINGH, M., and LASTRA, A. 2005. Fast summed-area table generation and its applications. Computer Graphics Forum, 24(3):547–555.
HOCKNEY, R. W. and JESSHOPE, C. R. 1981. Parallel Computers: Architecture, Programming and Algorithms. Adam Hilger.
IVERSON, K. E. 1962. A Programming Language. Wiley.
KASS, M., LEFOHN, A., and OWENS, J. D. 2006. Interactive depth of field using simulated diffusion on a GPU. Technical Report #06-01, Pixar Animation Studios.
KIRK, D. B. and HWU, W. W. 2010. Programming Massively Parallel Processors. Morgan Kaufmann.
KOOGE, P. M. and STONE, H. S. 1973. A parallel algorithm for the efficient solution of a general class of recurrence equations. IEEE Transactions on Computers, C-22(8):786–793.
LAMAS-RODRÍGUES, J., HERAS, D. B., BÓO, M., and ARGÜELLO, F. 2011. Tridiagonal solvers internal report. Technical report, University of Santiago de Compostela.
MERRILL, D. and GRIMSHAW, A. 2009. Parallel scan for stream architectures. Technical Report CS2009-14, University of Virginia.
OPPENHEIM, A. V. and SCHAFER, R. W. 2010. Discrete-Time Signal Processing. Prentice Hall, 3rd edition.
PARHI, K. and MESSERSCHMITT, D. 1989. Pipeline interleaving and parallelism in recursive digital filters–Part II: Pipelined incremental block filtering. IEEE Transactions on Acoustics, Speech and Signal Processing, 37(7):1099–1117.
PODLOZHNYUK, V. 2007. Image convolution with CUDA. NVIDIA whitepaper.
RUIJTERS, D. and THÉVENAZ, P. 2010. GPU prefilter for accurate cubic B-spline interpolation. The Computer Journal.
SENGUPTA, S., HARRIS, M., ZHANG, Y., and OWENS, J. D. 2007. Scan primitives for GPU computing. In Proceedings of Graphics Hardware, 97–106.
STONE, H. S. 1971. Parallel processing with the perfect shuffle. IEEE Transactions on Computers, C-20(2):153–161.
STONE, H. S. 1973. An efficient parallel algorithm for the solution of a tridiagonal linear system of equations. Journal of the ACM, 20(1):27–38.
SUNG, W. and MITRA, S. 1986. Efficient multi-processor implementation of recursive digital filters. In IEEE International Conference on Acoustics, Speech and Signal Processing, 257–260.
SUNG, W. and MITRA, S. 1992. Multiprocessor implementation of digital filtering algorithms using a parallel block processing method. IEEE Transactions on Parallel and Distributed Systems, 3(1):110–120.
VAN VLIET, L. J., YOUNG, I. T., and VERBEEK, P. W. 1998. Recursive Gaussian derivative filters. In Proceedings of the 14th International Conference on Pattern Recognition, 509–514 (v. 1).
ZHANG, Y., COHEN, J., and OWENS, J. D. 2010. Fast tridiagonal solvers on the GPU. In Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming.

A Derivation of T(A_P^F) = A_F^b

In this appendix we derive the r×r matrix T(A_P^F) that is used to transfer a prologue vector across a zero-valued block in the recursive filter computation of (24).
When dealing with a higher-order recursive filter F, it is convenient to group input and output elements into r-vectors, so that F is expressed in a simplified matrix form. To that end, let

    ẏ_i = (y_{i−r}, …, y_{i−2}, y_{i−1})^T,   ẋ_i = (x_{i−r}, …, x_{i−2}, x_{i−1})^T,   and   ẍ_i = (0, …, 0, x_{i−1})^T.   (47)

It is easy to verify that if

    ẏ_0 = p,   then   ẏ_i = ẍ_i + A_F ẏ_{i−1},   (48)

where

    A_F = [  0      1      ⋯    0      0
             ⋮      ⋮      ⋱    ⋮      ⋮
             0      0      ⋯    1      0
             0      0      ⋯    0      1
            −a_r  −a_{r−1}  ⋯  −a_2  −a_1 ].   (49)

A nicer expression involves ẋ_i rather than ẍ_i. Iterating (48):

    ẏ_i = ẍ_i + A_F ẏ_{i−1} = ẍ_i + A_F ẍ_{i−1} + A_F² ẏ_{i−2} = ⋯   (50)
        = ( ∑_{k=0}^{r−1} A_F^k ẍ_{i−k} ) + A_F^r ẏ_{i−r}   (51)
        = Ā_F ẋ_i + A_F^r ẏ_{i−r}.   (52)

Column k of matrix Ā_F is simply the last column of A_F^{r−k−1}, for k ∈ {0, 1, …, r−1}. Under this formulation, r new output elements are produced and r elements of input are consumed. For example, given a second-order filter:

    A_F = [  0     1
            −a_2  −a_1 ],   (53)

    Ā_F = [  1    0
            −a_1  1 ],   and   A_F² = [ −a_2      −a_1
                                         a_1 a_2   a_1² − a_2 ].   (54)

Equivalent derivations appear in the works of Kooge and Stone [1973] and Blelloch [1990]. Anticausal filters can also be expressed in terms of r-vectors in a form similar to (52), with corresponding matrices Ā_R and A_R.

Recall that

    T(F(p, 0)) = T(F(I_r, 0) p) = T(A_P^F p) = T(A_P^F) p.   (55)

When x_i is null, equations (48) reduce to

    ẏ_0 = p,   ẏ_i = A_F ẏ_{i−1} = A_F^i p.   (56)

Since the last r output elements of the b-wide block are given by

    T(F(p, 0)) = ẏ_b,   (57)

we have

    T(A_P^F) p = A_F^b p   ⇒   T(A_P^F) = A_F^b.   (58)

B Derivation of full overlapping

Consider the transposed prologues of U. From the definition and repeated application of properties (9) and (10):

    P^T_{m,n}(U) = T^T(F^T(P^T_{m,n−1}(U), 0))
                 + T^T(F^T(0, R(0, E_{m+1,n}(Z))))
                 + T^T(F^T(0, R(F(P_{m−1,n}(Y), 0), 0)))
                 + T^T(F^T(0, R(F(0, B_{m,n}(X)), 0))).   (59)

Now applying (11), (13) and (14):

    P^T_{m,n}(U) = P^T_{m,n−1}(U) T^T(F^T(I_r, 0))
                 + R(0, I_r) E_{m+1,n}(Z) T^T(F^T(0, I_b))
                 + R(I_b, 0) F(I_r, 0) P_{m−1,n}(Y) T^T(F^T(0, I_b))
                 + T^T(F^T(0, R(F(0, B_{m,n}(X)), 0)))   (60)
               = P^T_{m,n−1}(U) (A_F^b)^T
                 + A_E^R E_{m+1,n}(Z) T(A_B^F)^T
                 + (A_B^R A_P^F) P_{m−1,n}(Y) T(A_B^F)^T
                 + P^T_{m,n}(Ǔ),   (61)

where Ǔ is such that

    B_{m,n}(Ǔ) = F^T(0, R(F(0, B_{m,n}(X)), 0))   (62)
               = F^T(0, B_{m,n}(Ẑ)).   (63)

In (61) we have exploited matrix-product associativity to select the best computation ordering.

Following a similar derivation for the transposed epilogues of V:

    E^T_{m,n}(V) = H^T(R^T(0, E^T_{m,n+1}(V)))
                 + H^T(R^T(F^T(P^T_{m,n−1}(U), 0), 0))
                 + H^T(R^T(F^T(0, R(0, E_{m+1,n}(Z))), 0))
                 + H^T(R^T(F^T(0, R(F(P_{m−1,n}(Y), 0), 0)), 0))
                 + H^T(R^T(F^T(0, R(F(0, B_{m,n}(X)), 0)), 0))   (64)
               = E^T_{m,n+1}(V) (A_R^b)^T
                 + P^T_{m,n−1}(U) (H(A_B^R) A_P^F)^T   (65)
                 + A_E^R E_{m+1,n}(Z) (H(A_B^R) A_B^F)^T
                 + (A_B^R A_P^F) P_{m−1,n}(Y) (H(A_B^R) A_B^F)^T
                 + E^T_{m,n}(Ṽ),   (66)

where

    B_{m,n}(Ṽ) = R^T(F^T(0, R(F(0, B_{m,n}(X)), 0)), 0) = R^T(B_{m,n}(Ǔ), 0).   (67)