Image Processing Theory, Tools & Applications

Accelerating 3D Medical Image Segmentation with High Performance Computing

Przemysław Lenkiewicz, IT-Networks & Multimedia Group, Department of Computer Science, University of Beira Interior, Portugal, and Microsoft Portugal
Manuela Pereira, Mário Freire, IT-Networks & Multimedia Group, Department of Computer Science, University of Beira Interior, Portugal
José Fernandes, Microsoft Portugal
Abstract
Digital processing of medical images has helped physicians and patients during past years by allowing examination and diagnosis on a very precise level. Nowadays, possibly the greatest support it can offer to modern healthcare is the use of High Performance Computing architectures to handle the huge amounts of data that can be collected by modern acquisition devices. This paper presents a parallel processing implementation of an image segmentation algorithm that operates on a computer cluster equipped with 10 processing units. Thanks to a well-organized distribution of the workload, we manage to significantly shorten the execution time of the developed algorithm and reach a performance gain very close to linear.
Keywords: High Performance Computing, image segmentation, deformable models.
1. Introduction

There are numerous motivations for using digital image processing in medical applications. They include in particular improvement in the interpretation of examined data, full or nearly full automation of performed tasks, better precision and repeatability of obtained results, and the possibility of exploring new imaging modalities, leading to new anatomical or functional insights. One of the most noticeable trends in the current development of biomedical image processing techniques is the transition to 3-dimensional data. The natural reason for this is the desire to exploit as much information as possible, improving the ability to draw precise conclusions. However, 3D models contain too much information to be perceived by the human eye at a single instant of time; it is necessary, for example, to rotate the model to perform a full examination. Computer applications therefore allow processing the image in dimensions that cannot be perceived by the human eye. One difficulty of dealing with images in 3D is the increased amount of data for the image processing
algorithms. This is further amplified by the constant improvement in the field of medical acquisition devices. Shorter acquisition times offer the opportunity to perform denser scanning (including more scan images per examination) and better sensors allow the registration of images at higher resolutions, enabling the perception of smaller details. Both improvements result in a major growth of the data to process. The current progress in the field of computing performance leads us to believe that parallelization is one of the best solutions to improve the effectiveness and shorten the execution times of computation- and data-intensive algorithms. This can be described in general as distributing the workload necessary to complete a given task among a number of processing units, which execute their respective operations or process some parts of the data while keeping some level of awareness about the remaining processing units and communicating with them. Efforts to distribute the work among a number of processing units have already been made, including an early work presented in [1] by Higgins and Swift. The authors noticed the growing amount of data generated by radiological imaging devices and proposed the idea of using a distributed processing environment in order to decrease the execution times of medical algorithms. The scenario they worked with assumed a combination of several heterogeneous processing and communication devices operating together under the control of a massive processing virtual machine, referred to as a metacomputer [2]. In [3] Salomon and Heitz used a MIMD parallel processing computer to perform matching of 3D medical images, reducing the execution time of some medical tasks from the level of several hours to around 30 minutes using 32 processing units. Mayer and Meinzer in [4] constructed a client/server application that provides high computational power to handle medical tasks. Their solution was based on a distributed system granting access to its resources over a network, performing time-consuming operations on dedicated machines and returning the desired results over the network back to the users. The above mentioned solutions, similarly to a lot of work currently under research in this field, assume
using some kind of specialized system to perform the demanding operations. The algorithmic work is often focused on applying the medical image processing algorithms to a very specific hardware/software scenario, thus seriously limiting the portability and general usability of the developed solution. In this paper we propose a different approach and discuss a solution that is able to provide its functionality without such a strict definition of hardware/software requirements. The paper is organized as follows: section 2 describes the task of image segmentation using a solution called deformable models and our approach to solving this task in a parallel manner, section 3 presents the results that we have obtained along with their justification, and section 4 presents the conclusions and goals for our future work.
2. The Active Volumes algorithm

The segmentation algorithm implemented in this research is based on a solution called Topological Active Volumes, proposed by Barreira and Penedo in [5]. This solution draws heavily on the idea of Deformable Models, which is currently gaining a lot of interest in research and in real medical applications, serving as a medium between low-level computer vision and high-level geometric object representation. The strength of these models arises from their ability to segment, match, and track images of anatomic structures by exploiting constraints derived from the image data together with a priori knowledge about the location, size, and shape of these structures. Deformable models are capable of accommodating the often significant variability of biological structures over time and across different individuals. Furthermore, deformable models support interaction mechanisms that allow medical scientists and practitioners to bring their expertise to bear on the model-based image interpretation task when necessary. The idea was introduced by Kass, Witkin and Terzopoulos in 1988 with their work on deformable 2D contours, called "snakes" [6]. The basic idea assumed placing a geometrical shape within the image and deforming it until it reaches a shape and position that segment the object of interest. To achieve this, the shape was controlled by two types of energies: an internal one, responsible for maintaining the geometrical features of the shape, like continuity and curvature; and an external one, attracting the shape towards image features. Later many authors proposed their own representation models, improvements and significant changes to the original idea, some of the ones worth mentioning being the use of finite element models, subdivision curves and analytical models. The Topological Active Volumes solution includes a different idea for the shape representation than the original one of the deformable models. As a result the entire segmentation process is performed in a slightly
different manner. It assumes that instead of evolving a contour, the image is covered with a discrete mesh, like the one presented in Figure 1. The nodes of this mesh cover the whole image and each node has the ability to move within a predefined neighborhood, thus evolving the shape of the whole mesh. Based on the boundary information all nodes are classified into two categories, internal and external. The former model the inner topology of the object while the latter behave more like the active contours, trying to fit to the edges of the object in the image. As we can see, the segmentation process is very similar to the general assumptions of the deformable models: the nodes of the mesh are moved by the image and internal forces until they assume positions segmenting the shape of interest. The Topological Active Volumes solution extends this idea to operate in a 3D environment, using a set of slice images combined to form one 3D model.
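To make the mesh representation concrete, the following minimal sketch (our illustration, not code from [5]) builds such a regular grid of node coordinates over a stack of slice images; the function name create_mesh and the cell-size parameter are assumptions made for this example.

```python
import numpy as np

def create_mesh(num_slices, height, width, cell=25):
    """Regular 3D grid of (z, y, x) node coordinates covering a stack of slices.

    Every node starts on a uniform lattice with `cell` pixels between
    in-plane neighbors; during segmentation each node may move within a
    small neighborhood, deforming the mesh.
    """
    zs = np.arange(num_slices)
    ys = np.arange(0, height, cell)
    xs = np.arange(0, width, cell)
    grid = np.stack(np.meshgrid(zs, ys, xs, indexing="ij"), axis=-1)
    return grid.reshape(-1, 3)  # one (z, y, x) row per mesh node

# For example, a 22-slice 512x512 scan with 25-pixel cells gives roughly
# a 20x20 node layer per slice, similar to the setup described in Section 3.
nodes = create_mesh(22, 512, 512, cell=25)
```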
2.1. Our algorithm and energy formulation

Our implementation of the segmentation algorithm is based on the above assumptions, with our own definitions of the internal and external energies. Furthermore, in order to simplify and potentially speed up the segmentation process, the nodes of our mesh are not divided into internal and external ones. Instead, the formulation of the mesh energy is performed in a way that assures the desired behavior of all the nodes: attracting the external ones to the contour and distributing the internal ones evenly inside the object of interest. This is obtained only with a proper adjustment of the balance between the external and internal energies. The results presented in the next section prove that this simplification was a successful choice, as the segmentation results are precise and the execution times are shorter than in the original formulation. The Active Volume is defined parametrically as v(r, s, t) = (x(r, s, t), y(r, s, t), z(r, s, t)), where (r, s, t) ∈ [0, 1] × [0, 1] × [0, 1].
Figure 1. Example of a 5x5 Active Net. The black and gray points mark the external and internal nodes respectively.
The state of the model is governed by an energy function defined as follows:

$$E(v) = \int_0^1 \int_0^1 \int_0^1 \left[ E_{int}\big(v(r,s,t)\big) + E_{ext}\big(v(r,s,t)\big) \right] dr \, ds \, dt$$

where Eint and Eext are the internal and the external energy of the mesh. In order to calculate the energy, the parameter domain [0, 1]×[0, 1]×[0, 1] is discretized as a regular grid defined by the internode spacing (k, l, m) [5] and for each node of the grid the energy is calculated using the contributing forces. The internal energy is defined as follows:

$$E_{int} = \delta \sum_{n} \left| \bar{d} - \bar{d}_n \right| + \varepsilon \sum_{n} \arctan\!\left(\bar{\theta}_n\right) + \zeta \sum_{n} d_n$$

The three sum expressions represent the mesh continuity, curvature and center point gravity forces, respectively. Furthermore, $\bar{d}$ represents the average distance between all the nodes and $\bar{d}_n$ is the average distance between the node n and its neighbors. The parameter $\bar{\theta}_n$ is the average angle between the given node and its neighbors, calculated, for each pair of lines defined by the node of interest and two of its neighbors, from the tangent of the angle between the two lines.
Finally, dn is the distance of the given node from the center point of the mesh, calculated as follows:

$$d_n = \sqrt{\left(x_n - \frac{w}{2}\right)^2 + \left(y_n - \frac{h}{2}\right)^2}$$

with w and h representing the x and y dimensions of the mesh. The external energy is defined as follows:

$$E_{ext}(v) = \alpha \left(1 - I(v)\right) + \beta \left(1 - E(v)\right) + \gamma \left(1 - G(v)\right)$$
with I(v) representing the intensity values of the image, G(v) being the image gradient values and E(v) the values from the edge detector. The abovementioned forces can be described as follows:
• Continuity – this force encourages the nodes to maintain certain distances between each other, described with a parameter. If the parameter is set to a number larger than the initial distance between the nodes, the mesh will show a "growing" behavior, whereas with the parameter set to a smaller value, the mesh will "shrink".
• Curvature – taking into consideration the position of a given node with respect to its neighbors, this force encourages the nodes to maintain a regular grid in terms of angles between the particular grid lines.
• Center point gravity – the points of the mesh are also attracted to positions near the center of the image. Experiments have shown that including this simple force helped to speed up the segmentation process and to maintain a steady progression of the points towards their desired locations.
• Image intensity – using the image intensity values, the nodes of the mesh are attracted to either the bright or dark points of the image. This allows detecting regions of interest in the image and keeping the nodes of the mesh in their area.
• Image gradient – using the Sobel operator [8] to calculate the first derivative of the image, it is possible to estimate the strength and the direction of the image gradient at any point of the image. This force shows very useful capabilities of attracting the nodes towards the nearest area of intensity changes.
• Edge detector – the original image is processed with the Canny edge detection algorithm [9] and the result is used to attract the mesh nodes to their final locations.
Each of the above forces is weighted with a parameter describing how strong its influence on the segmentation process is.
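As a rough illustration of how such a weighted energy can be evaluated for a single node, the sketch below combines the terms described above. It is a minimal sketch, not the exact expressions used in our implementation: the weight values are placeholders, the image terms are assumed to be precomputed 3D arrays normalized to [0, 1], and the curvature term is computed here as a plain average angle between neighbor directions.

```python
import numpy as np

# Placeholder weights for intensity, edge detector, gradient,
# continuity, curvature and center point gravity, respectively.
ALPHA, BETA, GAMMA = 0.7, 0.8, 0.05
DELTA, EPSILON, ZETA = 0.001, 0.001, 0.01

def external_energy(pos, intensity, edges, gradient):
    """External energy at voxel position (z, y, x); lower values attract the node."""
    z, y, x = pos
    return (ALPHA * (1.0 - intensity[z, y, x])
            + BETA * (1.0 - edges[z, y, x])
            + GAMMA * (1.0 - gradient[z, y, x]))

def internal_energy(pos, neighbor_positions, mean_internode_dist, mesh_center):
    """Internal energy: continuity + curvature + center point gravity terms."""
    pos = np.asarray(pos, dtype=float)
    neigh = np.asarray(neighbor_positions, dtype=float)
    # Continuity: deviation of this node's average neighbor distance from the mesh average.
    dists = np.linalg.norm(neigh - pos, axis=1)
    continuity = abs(mean_internode_dist - dists.mean())
    # Curvature: average angle between the lines joining the node to pairs of its neighbors.
    angles = []
    for i in range(len(neigh)):
        for j in range(i + 1, len(neigh)):
            a, b = neigh[i] - pos, neigh[j] - pos
            cosang = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)
            angles.append(np.arccos(np.clip(cosang, -1.0, 1.0)))
    curvature = float(np.mean(angles)) if angles else 0.0
    # Center point gravity: distance of the node from the center of the mesh.
    center_gravity = np.linalg.norm(pos - np.asarray(mesh_center, dtype=float))
    return DELTA * continuity + EPSILON * curvature + ZETA * center_gravity
```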
2.2. Parallel Active Nets

The parallel implementation of the segmentation algorithm has been performed using MS MPI [10], an implementation of the Message Passing Interface developed by Indiana University in collaboration with Microsoft. Our algorithm has been designed from the start with the intention of executing it in a parallel environment. For this reason we have concentrated on the features that are able to provide good scalability along with the growth of available resources. Those features are:
• Low need for inter-process communication during the segmentation sequence.
• Ability to equally share the workload between any number of processing units.
The general idea of the parallelization is presented in Figure 2. In the first step the 3D mesh is created in a way that its nodes cover the entire set of slice images and are spaced uniformly. This task is performed by only one of the processes participating in the computation, the so-called root process. Next, the root process verifies how many units are available to perform the computations and assigns each of the processes an equal (or nearly equal) number of nodes to operate on. Each of the participating processes then works with its own portion of the mesh, performing the segmentation steps. The distribution of the nodes is done as presented in Figure 3. One layer of the mesh is divided among the participating processes; each one receives a range of nodes to work with. That range is maintained also for the remaining layers. We can describe this as cutting out a slice of the mesh for each of the participating processes. After receiving the range of nodes to operate on, the processes start to perform the segmentation steps. Each node is placed in nine locations corresponding to its nearest neighborhood and current position. The energy value in those locations is calculated using the energy function. Next, the position corresponding to the smallest energy value is chosen as the new location for the given node. The process is repeated for the entire mesh.
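The following minimal sketch illustrates the two operations just described: splitting the nodes of a layer into contiguous, nearly equal ranges (one per process) and moving a single node greedily among its nine candidate locations. The names node_range, greedy_move and total_energy are ours, introduced only for illustration; total_energy stands for the combined internal and external energy described in section 2.1.

```python
def node_range(num_nodes, num_procs, rank):
    """Contiguous, (nearly) equal share of node indices for a given process rank."""
    base, extra = divmod(num_nodes, num_procs)
    start = rank * base + min(rank, extra)
    return range(start, start + base + (1 if rank < extra else 0))

# The node is tried in its current position and its 8 in-slice neighbors,
# i.e. nine candidate locations in total, and the lowest-energy one is kept.
CANDIDATE_OFFSETS = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]

def greedy_move(node_pos, total_energy):
    z, y, x = node_pos
    candidates = [(z, y + dy, x + dx) for dy, dx in CANDIDATE_OFFSETS]
    return min(candidates, key=total_energy)
```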
[Figure 2 flowchart: the root process distributes the mesh nodes on the image and assigns a group of nodes to each of the available processes; every process then, for each of its own nodes, places the node in its nearest neighborhood, calculates the energy for each of the locations and chooses the location with the smallest energy as the new one; the processes synchronize, the root process gathers the local information, distributes the energy data among all the processes and checks how many of the nodes have changed their location; the segmentation ends when the end condition is reached.]
Figure 2. Flowchart representing the steps of our parallel implementation of the Topological Active Volumes segmentation algorithm.

As can be seen, the optimization is constructed based on a greedy algorithm approach [11]. When a given process finishes processing its own part of the mesh, it halts execution and waits until all the other participating processes have reached the same step of the algorithm. This is performed by calling the barrier function from the MPI interface and serves as a synchronization operation; this way we are sure that the whole mesh has been processed and the algorithm can move to the next step, which is information exchange. At this point it is important to notice that the segmentation performed in each of the participating processes is in large part independent from all the other ones and thus the communication performed in this step can be very limited. The energy of the mesh at each node is calculated mainly with local information and the only piece of information exchanged is the current average distance between the nodes of the mesh. Experiments presented in the following part of the document prove that this level of communication is sufficient and does not introduce any unwanted behavior into the segmentation process. When all the nodes of the mesh have been processed, the root process checks how many nodes have changed their position during the last loop and based on that number the decision about finishing the algorithm's execution is taken. The data about the mesh nodes' positions is stored by each process in a local array. Each process operates on only a part of the array and this part has to be shared with all the other processes at the end of the segmentation. The root process again assumes control, as can be seen in Figure 2, and calls all the processes to send it their parts of the array. It receives them in the form of a two-dimensional array of size m*n, where m is the number
of processes participating in the segmentation and n is the number of nodes assigned to each process. Using this information it then constructs a single array with the information about all the nodes of the mesh, which is our segmentation outcome. One more thing worth noticing is that the nodes assigned to a single process are not restricted from moving away from the initial zone controlled by the given process. The assignment of mesh nodes to a given process is the same during the whole segmentation procedure, regardless of the current position of the node. This allows the mesh to deform freely, without any restrictions imposed by the parallelization.
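As a rough sketch of this per-iteration exchange, the fragment below shows how the barrier synchronization, the sharing of the average internode distance and the gathering of the position arrays at the root could look with the mpi4py bindings. It is an illustration under our own assumptions (function and variable names, the use of mpi4py rather than the MS MPI/C# environment of our implementation, and a simple zero-movement stop test standing in for the actual end condition).

```python
from mpi4py import MPI

def exchange_step(comm, local_positions, local_dist_sum, local_pairs, moved):
    """One synchronization step after every process has updated its own nodes."""
    rank = comm.Get_rank()
    comm.Barrier()  # wait until every process has finished its part of the mesh

    # The only quantity shared during segmentation: the current average
    # distance between mesh nodes, rebuilt from per-process partial sums.
    dist_sum = comm.allreduce(local_dist_sum, op=MPI.SUM)
    pairs = comm.allreduce(local_pairs, op=MPI.SUM)
    avg_distance = dist_sum / max(pairs, 1)

    # The root gathers the m x n position data (m processes, n nodes each),
    # counts how many nodes moved and broadcasts the stop decision.
    all_positions = comm.gather(local_positions, root=0)
    total_moved = comm.reduce(moved, op=MPI.SUM, root=0)
    stop = comm.bcast(total_moved == 0 if rank == 0 else None, root=0)
    return avg_distance, all_positions, stop
```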
3. Results

The segmentation algorithm has been executed on a 5-node computer cluster constructed with dual-core Pentium D machines, which results in 10 available processing units. Each of them worked at 3 GHz and was equipped with 1 GB of memory. The algorithm was executed on a 22-slice CT brain scan [12], each of the slices being a 512x512 image. 12 out of the 22 slices are presented in Figure 4. The mesh was initialized with a cell size of 25x25 pixels, thus resulting in a 20x20-node mesh size. The set of parameters is presented in Table 1. The given parameters describe respectively: α – image intensity energy, β – edge detector energy, γ – gradient image energy, δ – continuity energy, ε – curvature energy, ζ – center point gravity. This selection represents a strong influence of the image intensity and edge detector energies. As said before, the former serves to attract and keep the nodes of the mesh at points of high intensity, as they represent the features of interest. When a given node finds itself in proximity of such a point of interest, the second force, namely the edge detector energy, acts as a 'pin', keeping it attached in that location using its own significant weight parameter. The gradient energy multiplier has a relatively low value compared to the previous ones. The reason is that, unlike the solutions based on contour evolution, the Active Volume has the nodes of the mesh distributed across the whole image and thus we do not need to attract them from large distances to approach the shape. Setting the gradient energy to a low value allowed us to attract the nodes towards the nearest edge of the segmented object, but at the same time not to interfere with the integrity of the whole mesh.

Table 1. Parameters of the segmentation algorithm used for the purposes of this document.
α: 0.7, β: 0.8, γ: 0.05, δ: 0.001, ε: 0.001, ζ: 0.01
Experiments with the gradient energy multiplier set to a higher value resulted in a decrease in the regularity of the mesh. The two internal energy forces, continuity and curvature, are both set to low values, as their job is mainly to secure the model's structure from unwanted growing and bending. The third internal force, the center point gravity, has a slightly higher value, as it helps to maintain a stable and fast progression of the nodes towards their desired locations. The experiments included executing the segmentation algorithm with the number of available processing units varying in the range from 1 to 10. Each experiment was performed 10 times and the average of the execution times was calculated; the individual experiments presented very similar values of the execution time. The execution times are depicted in Figure 5. The horizontal axis represents the number of processing units participating in the segmentation process and the vertical axis shows the time. The black line shows the averaged values of our experiments and the gray line represents the ideal scenario of workload parallelization, namely a linear growth of performance (half of the execution time for 2 processing units, a quarter of the execution time for 4 units, and so on). As can be seen, our implementation shows a growth in performance very close to the ideal scenario, only missing it by less than two seconds in each experiment. Figure 5 also shows the performance gain obtained with the growth of processing units (the reference time of a single-process segmentation scenario divided by the multi-process execution time).
[Figure 5, left panel data: average execution times of 86.42, 44.56, 23.20, 15.85, 12.18 and 10.06 seconds for 1, 2, 4, 6, 8 and 10 processing units, respectively.]
Figure 3. Example of mesh node distribution for a simple case of 3 processing units.
Again, the gray line shows the ideal scenario of linear growth of performance along with the resources and the black line shows our experiments. Also, comparing the results obtained with one processing unit to the ones presented in the paper by Barreira and Penedo [5], we can see that our implementation of the segmentation procedure shows a major performance improvement, from around 20 minutes to 86 seconds (in this particular segmentation scenario). The machines executing the segmentation in [5] were running at 1.8 GHz, while in our experiments it was 3 GHz. The scenario used in that publication assumed 80 images of 256x256 resolution, whereas our experiments included 22 images of 512x512 resolution.
Figure 4. 8 out of 22 slices of the CT scan used for our experiments.
Figure 5. Diagrams representing the execution times of our experiments (left) and the performance gain (right) of the parallel segmentation algorithm. The black series line represents the average of the results obtained in our experiments and the gray series line represents a hypothetical scenario of ideal, linear performance growth.
As we can see, the scenarios are slightly different, but they both represent very similar sizes of volumes to segment and the difference in processor speed should not be reflected so strongly in the execution times. Thus we believe that our implementation showed the desired results. It is however important to note that the version from Barreira and Penedo has some capabilities that are superior to our implementation, namely the possibility to detect unwanted connections between mesh nodes and thus to change the original topology of the shape. This could explain the big difference in execution times. Our algorithm did not have that feature implemented, which means that it would fail in a scenario where this would be necessary. However, in this particular example this feature was not required and our version of the algorithm provided results of the same quality in a much shorter time. Finally, Figure 6 presents the difference of the execution times compared to the scenario of linear performance growth. This difference represents the overhead imposed by communication and uneven distribution of the input data. As can be seen, relative to the execution times the overhead shows very low values in the first scenarios and maintains an acceptable level in the later ones (from about 3% of the execution time for 2 processing units up to about 14% of the execution time for 10 processing units). This proves that the parallelization is done well and that the available resources are utilized effectively. Also, the duration of the operations performed in a client-server mode, when the root process assumes control over the segmentation and the remaining processes are awaiting the next steps, is relatively low and does not significantly affect the overall execution time. It might seem strange that the difference values do not grow or decrease monotonically. This can however be explained by noticing that the communication overhead depends highly on the equal distribution of the workload, and this in turn depends on whether the number of all nodes in the mesh is divisible without a remainder by the number of processing units. For a certain scenario with a smaller number of processes the distribution can be more equal than for another scenario with a bigger number of processes. The segmentation results themselves are presented in Figure 8. Starting from the top-left we can see the initialization of the mesh, followed by the segmentation progression. The bottom-right image presents the result after 130 iterations of the algorithm. As can be seen, the segmentation is smooth and precise, and small features of the brain have been successfully detected. To obtain a reference image sample we have performed the segmentation on a single processing unit, eliminating any communication and synchronization steps from the code. The images obtained with the growing number of processing units have been compared to the reference segmentation using a pixel-by-pixel comparison method. The similarity of the reference images and the results obtained with 10 processing units has been on the level of 99.58%. The differences between the images are limited to a small number of nodes repositioned by just a few pixels. Examples of such differences are presented in Figure 7.
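The pixel-by-pixel comparison mentioned above is, in essence, the fraction of identical pixels between the reference result and the parallel result. A possible reading of it, under our assumption that both results are stored as arrays of equal shape, is sketched below; the function name is ours.

```python
import numpy as np

def similarity_ratio(reference, result):
    """Fraction of identical pixels between two segmentation images of equal shape."""
    reference = np.asarray(reference)
    result = np.asarray(result)
    return float(np.mean(reference == result))  # e.g. 0.9958 for a 99.58% match
```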
[Figure 6 data: parallelization overhead of 1.35, 1.60, 1.45, 1.37 and 1.42 seconds for 2, 4, 6, 8 and 10 processing units, respectively.]
Figure 6. Differences between the execution times of our experiments and the ideal scenario of linear performance growth for different numbers of processing units.
Figure 7. Examples of differences in results obtained with a single processing unit (gray lines) and 10 processing units (bright lines). The image represents a slice of the final volume, corresponding to one slice of the input scan. The gray mesh covers the white mesh in most places, meaning that in those places the segmentation results are exactly the same.
The justification for those differences in the results lies in the characteristics of the internal energy of the model. With each calculation it requires approximating some values, like the average distance between the nodes, the distance from the center of the image or the angles between the mesh lines. When the segmentation process is parallelized, these calculations are performed simultaneously by more than one processing unit. Each of them, at a given point in time, holds up-to-date information only about the group of nodes that it is currently processing; we can say that the information about all the remaining nodes is overdue by one iteration of the algorithm. Because of that, and because the energy values corresponding to two different node locations sometimes differ only very slightly, the selection of the optimal node location can sometimes be different in scenarios with various numbers of processing units. However, we do not consider this a significant degradation of the results, as the differences are very small and hard to find manually.

Figure 8. Consecutive steps of the 3D brain segmentation. The result (bottom right) represents the shape obtained after 130 iterations of the algorithm.
4. Conclusions and future work

Experiments with our version of the Active Nets algorithm show promising results regarding the parallelization of such a computationally demanding task as segmentation. We have shown that introducing a number of processing units instead of just one allows decreasing the algorithm's execution times significantly, coming quite close to the ideal scenario of linear growth in performance. This is obtained with equal workload distribution and with a construction of the algorithm that requires minimal communication between the participating processes during the segmentation. The results of the segmentation performed with the parallel algorithm match the results of the serial algorithm with a ratio around 99.58%. This is a major success, as our goal was to perform parallel segmentation with results identical to those obtained if the process were performed in a serial manner. Comparing our results with others proposed by different authors in this field, we can see that we have obtained a similar or better performance gain. In [13] Benkner and Dimitrov proposed a parallel implementation of a medical image reconstruction system using a combination of MPI and OpenMP technologies. Our solution showed nearly linear performance growth with small numbers of processing units and reached a 7 times shorter execution time with 8 processing units (87.5% of the ideal linear speedup) and an 8 times shorter execution time with 10 processing units (80% of the ideal linear speedup). Results presented in [13] match our results in their best scenarios and do not outperform them in any scenario. Solutions of more general purpose, like the one proposed in [14], show generally significantly lower performance gain compared to our method. It is also worth noticing that the execution environment for our solution is very flexible. The MPI interface that provided the functions for job parallelization and inter-process communication is an open standard with numerous freely available implementations for various programming languages and operating systems. As for the hardware requirements, it is important to notice that our solution functions on highly available x86 architecture processing units, which communicate with each other using nothing more than a network connection. For modern healthcare solutions this can mean that, using a small computer cluster or a multi-core machine to handle medical tasks, it would be possible to obtain the results of examinations in much shorter times. This can lead to the possibility of using very high precision examinations or very fast, nearly real-time examination methods. It is also very promising because of the high portability of this idea to everyday usage in hospitals and medical examination facilities: multi-core computers are becoming a de facto standard in today's computer market and nearly every home desktop computer is nowadays equipped with a dual-core processor, with 8-core processors expected to become a standard in the next two-year period. A single machine like this, utilized properly, can provide very significant computational power, and combining only several such machines into a small computer cluster can deliver a very powerful solution for the most complex and time-consuming medical tasks. The authors are especially interested in the task of medical image segmentation. The tasks for our future work include further development of the segmentation algorithm, broadening its capabilities to
operate on 4D images, and utilizing effective optimization techniques, such as genetic algorithms [15].
5. References

[1] W. E. Higgins and R. D. Swift, "Distributed system for processing 3D medical images," Computers in Biology and Medicine, vol. 27, pp. 97-115, 1997.
[2] A. E. Klietz, A. V. Malevsky, and K. Chin-Purcell, "Mix-and-match high performance computing," IEEE Potentials, vol. 13, pp. 6-10, 1994.
[3] M. Salomon, F. Heitz, G. R. Perrin, and J. P. Armspach, "A massively parallel approach to deformable matching of 3D medical images via stochastic differential equations," Parallel Computing, vol. 31, pp. 45-71, 2005.
[4] A. Mayer and H. P. Meinzer, "High performance medical image processing in client/server-environments," Computer Methods and Programs in Biomedicine, vol. 58, pp. 207-217, 1999.
[5] N. Barreira, M. G. Penedo, C. Mariño, and F. M. Ansia, "Topological Active Volumes," in Computer Analysis of Images and Patterns, 2003, pp. 337-344.
[6] M. Kass, A. Witkin, and D. Terzopoulos, "Snakes: Active contour models," International Journal of Computer Vision, vol. 1, pp. 321-331, 1988.
[7] Y. Tsumiyama, K. Sakaue, and K. Yamamoto, "Active Net: Active Net Model for Region Extraction," Information Processing Society of Japan, vol. 39, pp. 491-492, 1989.
[8] P. E. Danielsson and O. Seger, "Generalized and separable Sobel operators," in Machine Vision for Three-Dimensional Scenes, 1990.
[9] J. Canny, "A computational approach to edge detection," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 8, pp. 679-698, 1986.
[10] J. Willcock, A. Lumsdaine, and A. Robison, "Using MPI with C# and the Common Language Infrastructure," Concurrency and Computation: Practice and Experience, vol. 17, pp. 895-917, 2005.
[11] D. Williams and M. Shah, "A fast algorithm for active contours and curvature estimation," CVGIP: Image Understanding, vol. 55, pp. 14-26, 1992.
[12] M. Häggström, "CT brain scan of Mikael Häggström," Radiology Department of the Uppsala University Hospital, 2007.
[13] S. Benkner, A. Dimitrov, G. Engelbrecht, R. Schmidt, and N. Terziev, "A Service-Oriented Framework for Parallel Medical Image Reconstruction," in Computational Science - ICCS 2003, 2003, pp. 691-691.
[14] K. Liakos, A. Burger, and R. Baldock, "Distributed Processing of Large BioMedical 3D Images," in High Performance Computing for Computational Science - VECPAR 2004, 2005, pp. 142-155.
[15] D. E. Goldberg, Genetic Algorithms in Search, Optimization, and Machine Learning. Addison-Wesley Professional, 1989.