A Cellular Automaton for Ultra-Fast Watershed Transform on GPU
Claude Kauffmann, CHUM Notre-Dame Hospital
[email protected]
Abstract
In this paper we describe a cellular automaton (CA) that performs the watershed transform on N-D images. Our method is based on image integration via the Ford-Bellman shortest-paths algorithm. Because CA algorithms are local by nature, we show that they are well suited to massively parallel processors and can therefore be implemented efficiently on low-cost consumer graphics processing units (GPUs).
1. Introduction
The watershed transform is one of the most popular methods for image segmentation. It was originally proposed in [2][7] and later improved in [14]. The watershed method can be formulated in a general framework called image labeling, in which each pixel is assigned a label from a finite set. The intuitive idea underlying this method comes from geography: when a landscape or topographic relief is flooded with water, watersheds are the dividing lines between the basins of attraction of rain falling over the region [16]. An alternative approach is to imagine the landscape being immersed in a lake, with holes pierced in its local minima. Basins (also called catchment basins) fill up with water starting at these local minima, and dams are built at the points where water coming from different basins would meet. The process stops when the water level has reached the highest peak in the landscape. As a result, the landscape is partitioned into regions or basins separated by dams, called watershed lines or simply watersheds [13].

This intuitive concept leaves room for various formalizations, watershed definitions, algorithms and implementations, but in practice they fall into two classes: one based on the specification of a recursive algorithm by Vincent and Soille [9], and another based on distance functions by Meyer [10]. Moreover, watershed methods are usually based on sequential algorithms, although serious efforts have been made during the last decade to find parallel implementation strategies [15][11]. Unfortunately, whatever the technique or architecture used, there is always a stage of the watershed transform that remains a global operation, so only modest speedups are to be expected from a parallel implementation [13].

This paper presents an efficient parallel implementation of the watershed transform based on a cellular automaton (CA) that computes Ford-Bellman shortest paths. Because CA algorithms are local by nature, they are well suited to massively parallel processors and can therefore be implemented efficiently on low-cost consumer graphics processing units (GPUs).

978-1-4244-2175-6/08/$25.00 ©2008 IEEE
Nicolas Piche, Objects Research Systems
[email protected]
2. Our contribution
In the following, we consider an N-D gray-scale image as a graph G = (V, E), which consists of a set V of vertices (or nodes) and a set of edges E ⊆ V × V. The edge e_pq spans two neighboring vertices p and q, and the cost of this edge is denoted by w_pq. The graph can be defined either as a weighted graph G = (V, E, w) or as a valued graph G = (V, E, f), where w : E → ℝ is a weight function defined on the edges and f : V → ℝ is a valued function defined on the vertices.
2.1. Dijkstra’s algorithm
Dijkstra’s shortest-paths algorithm [3] is the most popular method to compute the shortest path between two vertices s and v, or between a start vertex s and all other vertices in a graph. In Algorithm 1, we consider an undirected weighted graph G = (V, E, w), with w : E → [...]

[...] λ(q) + w_pq then
8. λ(p) ← λ(q) + w_pq
9. label(p) ← label(q)
10. T ← T − {p}; if T is not empty go to step 5

Dijkstra’s algorithm is a heap-based method. The entire set T must either be kept sorted or be searched repeatedly to find the vertex p with minimum λ(p) at each step. A standard implementation uses a priority queue ordered by λ(p), with the lowest-cost vertex at the top. The algorithm converges to an optimal path because the search proceeds by expanding the lowest-cost vertices first: optimal wavefronts work their way out through the search space, and an optimal decision at each step produces a globally optimal solution. From each minimum an optimal wavefront is started, labeled by the index of the minimum it started from, and the distance is initialized with the value of the minimum. If a wavefront i reaches a node p after propagating over a distance ℓ, and ℓ is less than λ(p), the value ℓ is stored in λ(p) and label(p) is set to i. If p is reached by another wavefront that has propagated over the same distance ℓ but originated from a different minimum, label(p) can either remain unchanged or be set to a special label wshed, designating p as a watershed-line pixel. Priority queues are generally very difficult to parallelize, so we looked for other implementation strategies to compute the topographical distances.
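The priority-queue strategy described above can be sketched as follows. This is a minimal illustration in Python, not the authors' implementation and not a reconstruction of Algorithm 1: the adjacency-dictionary encoding, the function name dijkstra_labels and the toy graph are assumptions made for the example. Ties between wavefronts keep the label that arrived first, so no wshed label is produced.

```python
import heapq

def dijkstra_labels(adj, seeds):
    """Multi-source Dijkstra with label propagation.

    adj   : dict vertex -> list of (neighbor, weight), weights >= 0
    seeds : dict seed vertex -> (initial distance, label)
    Returns the (dist, label) dictionaries.
    """
    dist = {v: float("inf") for v in adj}
    label = {v: 0 for v in adj}
    heap = []
    for s, (d0, lab) in seeds.items():
        dist[s] = d0
        label[s] = lab
        heapq.heappush(heap, (d0, s))
    done = set()
    while heap:
        d, q = heapq.heappop(heap)      # lowest-cost vertex first
        if q in done:
            continue                    # stale queue entry
        done.add(q)
        for p, w in adj[q]:
            if dist[p] > d + w:         # wavefront from q improves p
                dist[p] = d + w
                label[p] = label[q]
                heapq.heappush(heap, (dist[p], p))
    return dist, label

# Path graph a - b - c - d with one seed at each end (hypothetical data)
adj = {
    "a": [("b", 1.0)],
    "b": [("a", 1.0), ("c", 4.0)],
    "c": [("b", 4.0), ("d", 1.0)],
    "d": [("c", 1.0)],
}
dist, label = dijkstra_labels(adj, {"a": (0.0, 1), "d": (0.0, 2)})
```

The single shared heap is the part that serializes the computation: every step must pop the global minimum before any other vertex can be processed.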
2.2. Ford-Bellman’s algorithm
A different approach to finding the distance of all vertices v from a given vertex s is Ford-Bellman’s algorithm [6][1], described in Algorithm 2. This 50-year-old algorithm gives a concise, generalized expression of the cost-minimization problem. Its most crucial and least intuitive property is that, even though the vertices can be processed in any order (even a random one), the algorithm ends up producing the lowest-cost distance from every vertex to the start vertex. A proof of this statement can be found in [...]
Algorithm 2: Ford-Bellman’s algorithm
1. λ(s) ← 0 and for every v ≠ s, λ(v) ← ∞
2. as long as there is an edge p → q such that λ(p) > λ(q) + w_pq then
3.   λ(p) ← λ(q) + w_pq
4.   label(p) ← label(q)

This important aspect of Ford-Bellman’s algorithm caught our attention because it allows an efficient parallel implementation that can be achieved by cellular automata, as presented in the following section.
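The order-independence property can be checked on a small example. The sketch below is a minimal Python illustration under stated assumptions: the function name ford_bellman, the edge-list encoding and the toy graph are hypothetical, edge weights are taken to be non-negative, and the label bookkeeping of step 4 is omitted because there is a single source.

```python
import random

def ford_bellman(vertices, edges, source, rng=None):
    """Relax edges in arbitrary order until no edge can be improved.

    edges : list of (p, q, w), meaning lambda(p) may be lowered
            to lambda(q) + w (one entry per direction)
    rng   : optional random.Random used to shuffle the processing order
    """
    lam = {v: float("inf") for v in vertices}
    lam[source] = 0.0
    order = list(edges)
    changed = True
    while changed:
        changed = False
        if rng is not None:
            rng.shuffle(order)          # any order is allowed
        for p, q, w in order:
            if lam[p] > lam[q] + w:
                lam[p] = lam[q] + w
                changed = True
    return lam

vertices = ["s", "a", "b", "c"]
undirected = [("s", "a", 2.0), ("s", "b", 5.0), ("a", "b", 1.0),
              ("b", "c", 2.0), ("a", "c", 6.0)]
# An undirected edge is stored once per direction
edges = undirected + [(q, p, w) for p, q, w in undirected]

results = [ford_bellman(vertices, edges, "s", random.Random(seed))
           for seed in range(5)]
same = all(r == results[0] for r in results)
```

Each shuffled run visits the edges in a different order, yet all runs end with the same distances, which is what makes a synchronous, per-cell formulation possible.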
2.3. From cellular automata to watershed
A cellular automaton (CA) is a collection of cells arranged in an N-D lattice, such that each cell changes state as a function of time according to a defined set of rules that involves the states of neighboring cells. That is, the state of a cell at a given time depends only on its own state and the states of its neighbors at the previous time step. All cells of the lattice are updated synchronously, so the state of the entire lattice advances in discrete time steps. For a 2D image, we can use the Moore or the von Neumann neighborhood [12]. In 3D, the extensions of the von Neumann and Moore neighborhoods give 6 and 26 neighbors respectively. Following this definition, a CA can easily be applied to an N-D image (lattice) represented by a graph G, where the cells are the pixels (or voxels) of the image and the vertices of the graph. We can then write the CA rule that computes, at each time step t, Ford-Bellman’s algorithm as:

$$\lambda^{t+1}(p) = \min\Big[\lambda^{t}(p),\ \min_{q \in N_p}\big(\lambda^{t}(q) + w_{pq}\big)\Big] \qquad (2)$$

where N_p is the neighborhood of p. Equation (2) shows that the Ford-Bellman CA rule can be written in a very concise form. The principle of the CA is then to apply transition rule (2) synchronously to all cells and to iterate as long as any cell changes its state.

For the final implementation of the watershed-CA rule we consider a valued graph G = (V, E, f), where f : V → ℝ is a valued function defined on the vertices. With this approach the algorithm becomes more general, since the watershed transform is computed on an input lattice that can be a gray-scale image or any function applied to it.
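As a small worked instance of rule (2), consider a hypothetical chain of three cells p_1, p_2, p_3 with assumed weights w_12 = 2 and w_23 = 1 and source s = p_1. Writing the state as the triple (λ(p_1), λ(p_2), λ(p_3)), the synchronous updates give:

```latex
% Chain p_1 - p_2 - p_3 with w_{12} = 2, w_{23} = 1, source s = p_1
\begin{align*}
\lambda^{0} &= (0,\ \infty,\ \infty) \\
\lambda^{1}(p_2) &= \min\big[\infty,\ \min(0 + 2,\ \infty + 1)\big] = 2
  &&\Rightarrow\ \lambda^{1} = (0,\ 2,\ \infty) \\
\lambda^{2}(p_3) &= \min\big[\infty,\ 2 + 1\big] = 3
  &&\Rightarrow\ \lambda^{2} = (0,\ 2,\ 3) \\
\lambda^{3} &= \lambda^{2}
  &&\text{(no cell changes, so the iteration stops)}
\end{align*}
```

The distance information advances by one cell per time step, so the number of iterations grows with the length of the longest optimal path rather than with the number of cells.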
Algorithm 3: Watershed Automata evolution rule
1. For all minima s_i with i ∈ [1, k]:
2.   λ(s_i) ← 0 and for all v ≠ s_i, λ(v) ← ∞
3.   label(s_i) ← i and for all v ≠ s_i, label(v) ← 0
4. For all p ∈ V:
5.   U^t = min_{q ∈ N_p} {λ^t(q) + f(p)}
6.   λ^{t+1}(p) = min[λ^t(p), U^t]
7.   label^{t+1}(p) = label{min[λ^t(p), U^t]}
8. End for
Our CA algorithm presented above does not produce watershed lines. Every pixel is merged into some basin, so that the set of basins tessellates the image plane. This is a consequence of the local condition of the CA algorithm, which is very advantageous for a parallel implementation of the watershed transform [13]. Unlike other parallel algorithms, our CA algorithm is deterministic: the result does not depend on the order in which the pixels are processed during execution. This follows from a fundamental property of CA, namely that all cells of the lattice are updated synchronously at each time step.
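The evolution rule of Algorithm 3 can be sketched on the CPU as follows. This is a minimal pure-Python illustration under stated assumptions, not the authors' GPU code: the names watershed_ca and neighbors4, the 4-connected (von Neumann) neighborhood, the toy cost image and the tie-breaking by neighbor enumeration order are all choices made for the example. The per-cell update is iterated until no cell changes, and double buffering emulates the synchronous update.

```python
INF = float("inf")

def neighbors4(r, c, rows, cols):
    """Yield the von Neumann (4-connected) neighbors inside the grid."""
    for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        rr, cc = r + dr, c + dc
        if 0 <= rr < rows and 0 <= cc < cols:
            yield rr, cc

def watershed_ca(f, seeds):
    """f: 2D list of non-negative costs; seeds: dict (row, col) -> label >= 1."""
    rows, cols = len(f), len(f[0])
    lam = [[INF] * cols for _ in range(rows)]
    lab = [[0] * cols for _ in range(rows)]
    for (r, c), i in seeds.items():
        lam[r][c] = 0.0
        lab[r][c] = i
    steps = 0
    changed = True
    while changed:
        changed = False
        # Double buffering: read state t, write state t+1
        new_lam = [row[:] for row in lam]
        new_lab = [row[:] for row in lab]
        for r in range(rows):
            for c in range(cols):
                # U^t = min over neighbors q of lambda^t(q) + f(p)
                best, best_q = INF, None
                for q in neighbors4(r, c, rows, cols):
                    u = lam[q[0]][q[1]] + f[r][c]
                    if u < best:
                        best, best_q = u, q
                if best < lam[r][c]:
                    new_lam[r][c] = best
                    new_lab[r][c] = lab[best_q[0]][best_q[1]]
                    changed = True
        lam, lab = new_lam, new_lab
        steps += 1
    return lam, lab, steps

# Toy cost image with a ridge (value 9) between two seeds
f = [[1, 2, 9, 3, 1] for _ in range(3)]
lam, lab, steps = watershed_ca(f, {(1, 0): 1, (1, 4): 2})
```

Every cell ends up with a non-zero label, so the two basins tessellate the grid and no separate watershed line is produced. Because each cell reads only the previous state of its neighbors, the two inner loops could be replaced by one independent computation per cell on parallel hardware.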
3. Experimental results
3.1. Simple 2D examples
We applied our CA-Watershed transform to two simple 2D images. The first is an MRI image of a kidney along a sagittal plane (Fig.2) and the second is the popular image of Lena (Fig.3). Algorithm 3 was applied to these images using the valued functions f_A = |∇(I)| and f_B = |∇(G ∗ I)|, where G is a Gaussian smoothing function. The watershed results are shown as mosaic images in which each labeled region is filled with the mean value of the pixels inside it. The examples in Fig.2 and Fig.3 show that the watershed regions give a regular partitioning of the image and a coherent, smooth representation of boundaries.

3.2. Computing efficiency
The computing efficiency is evaluated by comparing the computation times of our GPU-CA-Watershed with those of a C++ implementation of the Tobogganing algorithm described in [8][5]. The tests were performed on a PC with the following configuration: Intel Xeon, 3 GHz, 2MB RAM, Radeon X1950 Pro graphics card (512MB). The initial 3D image used for testing is a CT-scan acquisition of size 512x512x720. The CPU and GPU watershed algorithms were applied to box-shaped parts of this 3D data set. Tab.1 summarizes the computing times as a function of image size. For the largest images, the CPU watershed could not be computed because of memory allocation errors. Fig.1 and Tab.1 show that the GPU CA-Watershed runs up to about 2.5 times faster than the Tobogganing watershed on the larger volumes, while the CPU implementation is faster on the smallest one (64x64x64).

Table 1. Computation times (ms) of the watershed transform for the CPU and GPU algorithms.

Image size     CPU Tobogganing   GPU CA-Watershed
64x64x64                   156                600
128x128x128               1437               1177
200x200x200               5828               3087
256x256x256              14281               5739
300x300x300              20703              10468
350x350x350              34578              15414
400x400x400              52641              21426
450x450x450                  –              43204
512x512x512                  –             173320

Figure 1. CPU and GPU-Watershed computing time for different image sizes.
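The speedups implied by Table 1 can be checked with a few lines of Python. The timings below are copied from the table; the variable and function names are only illustrative, and None marks the sizes for which no CPU timing is reported.

```python
# (image side length, CPU Tobogganing ms, GPU CA-Watershed ms)
table1 = [
    (64, 156, 600),
    (128, 1437, 1177),
    (200, 5828, 3087),
    (256, 14281, 5739),
    (300, 20703, 10468),
    (350, 34578, 15414),
    (400, 52641, 21426),
    (450, None, 43204),
    (512, None, 173320),
]

def speedups(rows):
    """Return {side: cpu_ms / gpu_ms} for rows where both timings exist."""
    return {n: cpu / gpu for n, cpu, gpu in rows if cpu is not None}

ratios = speedups(table1)
for n, s in ratios.items():
    print(f"{n}^3: {s:.2f}x")
```

A ratio below 1 means the CPU implementation is faster; this only happens for the 64x64x64 volume, and the largest ratios stay just under 2.5.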
4. Conclusion
A CA-based watershed algorithm has been presented. The performance of our un-optimized version was benchmarked against a Matlab flooding-style algorithm and against an optimized tobogganing implementation. Neither of these two popular CPU algorithms matches the performance of our algorithm, and the local nature of CA makes our method straightforward to implement. Some time is unavoidably spent transferring data to and from the video card, which means that for very small datasets, of roughly 64x64x64 voxels or less, CPU algorithms are faster than the GPU implementation. In the context of medical imaging, however, this limitation is irrelevant because datasets are typically much larger than 64x64x64. In this article we have demonstrated that an ultra-fast watershed transform is possible, even on large datasets, with the use of parallel programming on the GPU. We are confident that our CA-GPU approach to the image segmentation problem is very promising and that it opens new avenues for future applications in many diverse domains.
Figure 2. MRI image of a kidney along the sagittal plane (top-left), kidney image filtered by a Gaussian (top-right), and the respective watershed mosaic images (bottom).
Figure 3. Image of Lena (left), watershed mosaic computed on the gradient of smoothed Lena (right).

References
[1] R. Bellman. On a routing problem. Quarterly of Applied Mathematics, XVI(1):87–90, 1958.
[2] H. Digabel and C. Lantuejoul. Iterative algorithms. In Actes du Second Symposium Europeen d’Analyse Quantitative des Microstructures en Sciences des Materiaux, Biologie et Medecine, Caen, pages 85–99, October 1977.
[3] E. W. Dijkstra. A note on two problems in connexion with graphs. Numerische Mathematik, 1(1):269–271, December 1959.
[4] S. Even. Graph Algorithms. Pitman, London, 1979.
[5] J. Fairfield. Toboggan contrast enhancement for contrast segmentation. In ICPR90, volume 1, pages 712–716, Atlantic City, NJ, June 1990.
[6] L. R. Ford. Network flow theory. Technical report P-923, XVI(1):87–90, 1958.
[7] C. Lantuejoul. La squelettisation et son application aux mesures topologiques des mosaiques polycristallines. PhD thesis, Ecole des Mines, Paris, 1978.
[8] Y.-C. Lin, Y.-P. Tsai, Y.-P. Hung, and Z.-C. Shih. Comparison between immersion-based and toboggan-based watershed image segmentation. IEEE Transactions on Image Processing, 15(3):632–640, 2006.
[9] L. Vincent and P. Soille. Watersheds in digital spaces: An efficient algorithm based on immersion simulations. IEEE Trans. on PAMI, 13(6):583–598, 1991.
[10] F. Meyer. Topographic distance and watershed lines. Signal Process., 38(1):113–125, 1994.
[11] D. Noguet. A massively parallel implementation of the watershed based on cellular automata. In Proceedings of the IEEE International Conference on Application-Specific Systems, Architectures and Processors, pages 42–52, 1997.
[12] G. X. Ritter and J. N. Wilson. Handbook of Computer Vision Algorithms in Image Algebra. CRC Press, Inc., Boca Raton, FL, USA, 1996.
[13] J. B. T. M. Roerdink and A. Meijster. The watershed transform: Definitions, algorithms and parallelization strategies. Fundamenta Informaticae, 41, 2000.
[14] S. Beucher and C. Lantuejoul. Use of watersheds in contour detection. In International Workshop on Image Processing, Rennes, France, September 1979.
[15] S. Eom, V. Shin, and B. Ahn. Cellular watersheds: A parallel implementation of the watershed transform on the CNN universal machine. IEICE Transactions on Information and Systems, E90-D(4):791–794, 2007.
[16] J. Serra. Image Analysis and Mathematical Morphology. Academic Press, New York, USA, 1982.