Proceedings of the 13th International Conference on Pattern Recognition (ICPR'96), August 25-30, 1996, Technical University of Vienna, Austria
Data Distribution Concepts for Parallel Image Processing

Michael Nölle and Gerald Schreiber
Technische Informatik I, University of Technology Hamburg-Harburg, 21071 Hamburg, Germany
email: [email protected]

This research was partially supported by the Deutsche Forschungsgemeinschaft (DFG) within the project "Compatible mapping of image processing and pattern recognition algorithm onto massively parallel computer architectures."
Abstract

Data distributions have gained considerable interest in the field of data parallel programming. In most cases they are key factors for the efficiency of the implementation. In this paper we analyze data distributions suited for parallel image processing and those defined by some of today's more popular parallel languages (HPF, Vienna Fortran, pC++) and libraries (ScaLAPACK). The majority of them belong to the class of bit permutations. These permutations can be realized efficiently on networks that are based on shuffle permutations. As a result we propose to widen the scope of data distributions tolerated by parallel languages and libraries towards classes of distributions. For the large class of so-called normal algorithms we demonstrate that it is possible to implement library functions that can handle a large subclass of distributions, thereby avoiding redistributions. At the application level of programming, data distributions are to be handled analogously to data types.

1 Introduction

Parallel distributed memory systems are increasingly used as scientific computing platforms since they provide high performance at relatively low hardware cost. The broad use of these systems depends mainly on the availability of software and the ease of generating solutions for new applications. For the exploitation of the high performance potential of distributed memory multiprocessor systems, two programming paradigms, functional and data parallelism, are in the main focus of investigation and are widely used. Roughly speaking, the idea behind functional parallelism is to map a set of concurrent functions onto the processors of the parallel system. From this mapping the necessary propagation of input and output data and the schedule of execution can be deduced. Data parallelism takes the opposite approach: a mapping of data onto processors determines the locality of functions and the access to non-local data. Both approaches have their individual application domains and can often be used in combination.

In scientific computing, and especially in image processing, many problems require the handling of very large data objects. For many of these problems the data parallel approach is the most promising. Naturally, the choice of the data distribution is the crucial task. The impact of data distributions on parallel implementations may be described by three factors. Load balancing addresses the task of dividing the workload evenly among the processors. The minimization of communication requires optimizing the amount and the distance of data exchanged between different processors at runtime. By overlapping computation and communication it is often possible to hide the communication latencies (in case the hardware allows autonomous communication).

Since compiler technology is not yet advanced enough for fully automatic handling of data distributions, the user is required to supply additional information. In this context the use and development of new data parallel languages such as High Performance Fortran (HPF) [10], Vienna Fortran [6] and pC++ [3] has gained considerable attention. For dedicated application domains, portable libraries have been designed, such as ScaLAPACK [8] for linear algebra or PIPS [15] for image processing. The major aim of these systems is to ease the programming of parallel computers and to generate portable code. Since the quality of data distributions in terms of efficiency depends very strongly on the application under consideration and the (global) context in which they appear, it can be a difficult and tedious task to find good solutions. This attempt may fail altogether when only a limited number of distributions is provided by the compiler or library and these do not meet the requirements of a particular application.
In order to achieve high efficiencies it is therefore necessary to provide mechanisms that allow a flexible and application-dependent use of data distributions. Programming environments providing a well-defined class of data distributions allow flexible and efficient implementations of libraries and can thereby significantly reduce the amount of data redistribution [4].

The paper is organized as follows. Following some notational conventions, Section 2 introduces a general framework for the description and handling of a large class of regular data distributions. Section 3 describes an example that illustrates the consequences, for parallel implementations, of offering some flexibility with respect to data distributions. Section 4 gives an overview of the impact of the introduced concepts from the application's point of view. A more detailed discussion of the issues of this paper can be found in [14].
1.1 Notation

The set $\{0, \ldots, B-1\}$ is denoted $\mathbb{Z}_B$. Some of the operations required in this paper can conveniently be expressed in a string-oriented notation. We denote the set of strings $x = x_{n-1}, \ldots, x_0$ of length $n$ by $\mathbb{Z}_B^n$, where the $x_i$ are taken from $\mathbb{Z}_B$. The concatenation of two strings $a \in \mathbb{Z}_B^n$ and $b \in \mathbb{Z}_B^m$ is denoted by $a|b = a_{n-1}, \ldots, a_0, b_{m-1}, \ldots, b_0 \in \mathbb{Z}_B^{n+m}$. By interpreting length-$n$ strings of base-$B$ digits as a polyadic base-$B$ representation we obtain a natural correspondence between the elements of $\mathbb{Z}_B^n$ and $\mathbb{Z}_{B^n}$. Because there is no hazard of ambiguity, we shall henceforth cease to distinguish between $\mathbb{Z}_B^n$ and $\mathbb{Z}_{B^n}$; that is, we implicitly use the one-to-one mapping $x \in \mathbb{Z}_{B^n} \leftrightarrow x_{n-1}, \ldots, x_0 \in \mathbb{Z}_B^n$ with $x = \sum_{i=0}^{n-1} x_i B^i$.

In this paper we consider $m$-dimensional data objects $A[N_{m-1}, \ldots, N_0]$, where the $N_i$ denote the size of object $A$ in the $i$-th dimension. Thus $A$ contains $\prod_{i=0}^{m-1} N_i$ elements. For ease of notation, we only consider $N_i$ with $N_i = 2^{n_i}$ here. We use round brackets to denote the coordinates of elements; e.g., by $A(2, 5, 1)$ we denote the element with coordinates $(2, 5, 1)$ of a three-dimensional data object.
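As a concrete illustration of this string notation (our own sketch, not part of the original paper), the following Python fragment converts between base-$B$ digit strings and integers and implements the concatenation $a|b$; the function names are hypothetical.

```python
def to_digits(x, B, n):
    """Digit string x_{n-1}, ..., x_0 (most significant first) of x in base B."""
    return [(x // B**i) % B for i in range(n - 1, -1, -1)]

def to_int(digits, B):
    """Inverse mapping: x = sum_i x_i * B^i."""
    x = 0
    for d in digits:          # most significant digit first
        x = x * B + d
    return x

def concat(a_digits, b_digits, B):
    """String concatenation a|b, viewed as an integer in Z_{B^(n+m)}."""
    return to_int(a_digits + b_digits, B)

# Example: for B = 2, concatenating a = (1,0) and b = (1,1) yields the
# 4-bit string 1011, i.e. the integer 11.
assert concat(to_digits(2, 2, 2), to_digits(3, 2, 2), 2) == 11
```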
2 Data distributions derived from bit permutations
In general, the distribution of an $m$-dimensional data object $A[N_{m-1}, \ldots, N_0]$ onto the set of processors of a parallel system is given by a mapping function assigning a processor id to each element. For the purposes of this paper, the range of the mapping function, i.e. the set of processor ids, is assumed to be a $d$-dimensional array. This assumption is made mainly for the discussion of HPF's data distributions. For the class of data distributions we shall propose in the course of this paper, we will only require a sequential enumeration of the processors. The above assumption imposes no limitations on the structure of the underlying network, since a sequential enumeration of the processors is always possible. Thus we consider mapping functions
$$\pi : \mathbb{Z}_{N_{m-1}} \times \cdots \times \mathbb{Z}_{N_0} \rightarrow \mathbb{Z}_{P_{d-1}} \times \cdots \times \mathbb{Z}_{P_0},$$
where $P[P_{d-1}, \ldots, P_0]$ is the set of processor ids of a $P_{d-1} \times \cdots \times P_0$ processor array. However, of all possible data distributions, only a small subset is of practical interest.
2.1 Data distributions defined by HPF

In this section we familiarize the reader with the data distributions specified by HPF [10] and point out some of its weaknesses. Some proposals for extensions to HPF are mentioned. In HPF exactly $d$ dimensions of a data object must be distributed among the processors of a $d$-dimensional grid. If the dimension of the data object exceeds the dimension of the processor grid, the excess dimensions are distributed contiguously (see below for the definition). The distribution of data objects of fewer dimensions than the processor grid is not possible at all. For simplicity, we only consider data objects and processor grids of identical dimension. Consider a data object $A[N_{d-1}, \ldots, N_0]$ and the array of processors $P[P_{d-1}, \ldots, P_0]$. The mapping $\pi : \mathbb{Z}_{N_{d-1}} \times \cdots \times \mathbb{Z}_{N_0} \rightarrow \mathbb{Z}_{P_{d-1}} \times \cdots \times \mathbb{Z}_{P_0}$ is separated into $d$ mappings $\pi_i : \mathbb{Z}_{N_i} \rightarrow \mathbb{Z}_{P_i}$ with $\pi = (\pi_{d-1}, \ldots, \pi_0)$ and $i \in \mathbb{Z}_d$. The mappings $\pi_i$ can be either BLOCK ($\pi_i(a) = \lfloor a P_i / N_i \rfloor$), CYCLIC ($\pi_i(a) = a \bmod P_i$), or contiguous, denoted *, ($\pi_i(a) = 0$). Thus, in essence, $2^d$ different types of data distributions are possible. By use of templates and alignments it is possible to realize more complex data distributions.
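A minimal sketch (ours, not code from the paper) of the per-dimension HPF mappings defined above; `owner_coords` is a hypothetical helper that combines them into the processor coordinates of a single element.

```python
def block(a, N, P):
    """HPF BLOCK: index a of a dimension of size N goes to processor floor(a*P/N)."""
    return (a * P) // N

def cyclic(a, N, P):
    """HPF CYCLIC: indices are dealt out round-robin over the P processors."""
    return a % P

def contiguous(a, N, P):
    """'*': the dimension is not distributed at all."""
    return 0

def owner_coords(coords, sizes, procs, mappings):
    """Processor coordinates of element A(coords) under per-dimension mappings pi_i."""
    return tuple(m(a, N, P) for m, a, N, P in zip(mappings, coords, sizes, procs))

# Example: an 8x8 image on a 2x4 processor grid, rows BLOCK, columns CYCLIC.
print(owner_coords((5, 6), (8, 8), (2, 4), (block, cyclic)))   # -> (1, 2)
```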
By aligning a data object to a much larger template it is possible to distribute the data object to a subset of the processors. The separate mapping of the data object dimensions onto the processor array dimensions can be relaxed by aligning the data object to a template (or another data object) of the same size but different shape.

The use of alignments to relax the restrictions on data distributions seems unnecessarily complicated. Moreover, irregular data distributions and the distribution of subobjects are not possible at all. These shortcomings have inspired researchers to propose extensions to the HPF standard (see [7] and the literature mentioned there). For the purpose of load balancing, for example, many suggestions have been made: the GENERAL BLOCK distribution, the RESHAPE directive, or arbitrary user-defined distributions. This would eventually allow HPF's templates to be dispensed with.

There have been numerous other approaches to define parallel libraries or language extensions involving data distributions. We only mention two here, which seem to have become a kind of standard. The linear algebra library ScaLAPACK [8] is a distributed memory version of the LAPACK software package for matrix computations. It utilizes subsets and combinations of these data distributions. pC++ is an object-oriented extension to the C++ programming language [3]. The data distributions allowed are identical to HPF's.
2.2 Data distributions for parallel image processing
In this subsection we describe the data distributions that we found to be of interest in the application field of parallel image processing. For low and medium level image processing tasks the aspect of load balancing can most often be solved by simply balancing the number of data elements. Thus the choice of the data distribution is mainly aimed at minimizing the communication distances and maximizing the possible overlap of computation and communication. In order to minimize communication distances, some knowledge of the underlying communication network is required. We therefore limit our discussion to target systems with a shuffle-based communication network. We say a network is shuffle-based if the set of nodes is given by $\mathbb{Z}_2^p$ and the communication distance of nodes $x = x_{p-1}, \ldots, x_0$ and $x' = x_{p-2}, \ldots, x_0, c$, $c = 0, 1$, is limited by a small constant, usually one or two. Two well-known examples of shuffle-based communication networks are the de Bruijn graph and the shuffle-exchange network [11]. These networks feature logarithmic diameter and constant node degree and thereby combine some of the advantages of the grid and the hypercube.

In image processing the data to be processed is usually gathered by a camera or read from a hard disk, both of which are sequential data sources in most cases. Thus the data distributions are induced by the procedure used for distributing a data object. For data distribution, we split the object by traversing a binary tree, which can easily be embedded into shuffle-based networks [15]. Upon traversing the binary tree, all nodes perform the simple task of splitting the data received from their parent node into halves and sending them to their two child nodes. From this scheme, we derive three basic data distributions. Consider the distribution of the data object $A[2^l, 2^m]$ onto the processors of $\mathbb{Z}_2^p$.
Shuffle: All nodes of the binary tree perform a horizontal split of the data received from their parent node.

Omega: In the even levels (the root node has level 0), the nodes of the binary tree perform a horizontal split; otherwise a vertical split is performed.

Block: In the first $q = \lceil p/2 \rceil$ levels, the nodes of the binary tree perform a horizontal split; in the remaining $p - q$ levels a vertical split is performed.

In addition, two more data distributions turned out to be of relevance in practice. The first, called Torus, is a permutation of the Shuffle distribution. The stripes of data are arranged following a Hamiltonian cycle of the de Bruijn graph [9]. Thereby the communication distance of neighboring stripes of data is minimized. Secondly, a distribution called 2d-Torus results from a permutation of the Block distribution, such that the communication distance in the four-neighborhood of each subimage is decreased. Details on this data distribution can be found in [1]. A small sketch of the resulting pixel-to-processor mappings is given below.
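The following sketch is our reading of the binary splitting scheme, with an assumed bit ordering; it is not code from the paper. It computes the processor id of a pixel $(r, c)$ of a $2^l \times 2^m$ image under the Shuffle, Block and Omega distributions.

```python
from math import ceil

def top_bits(value, total_bits, k):
    """The k most significant bits of a total_bits-wide index."""
    return value >> (total_bits - k)

def shuffle_owner(r, c, l, m, p):
    # p horizontal splits: stripe index = top p bits of the row index.
    return top_bits(r, l, p)

def block_owner(r, c, l, m, p):
    # first ceil(p/2) horizontal splits, then vertical splits.
    q = ceil(p / 2)
    return (top_bits(r, l, q) << (p - q)) | top_bits(c, m, p - q)

def omega_owner(r, c, l, m, p):
    # alternate horizontal (even levels) and vertical (odd levels) splits;
    # each level appends one further significant bit of r or c to the id.
    # Assumes enough row/column bits for the requested number of splits.
    pid, ri, ci = 0, 0, 0
    for level in range(p):
        if level % 2 == 0:
            ri += 1
            bit = top_bits(r, l, ri) & 1
        else:
            ci += 1
            bit = top_bits(c, m, ci) & 1
        pid = (pid << 1) | bit
    return pid

# Example: a 16x16 image (l = m = 4) distributed onto 2^4 processors.
print(shuffle_owner(13, 6, 4, 4, 4), block_owner(13, 6, 4, 4, 4), omega_owner(13, 6, 4, 4, 4))
```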
2.3 Bit permutations

In order to gain a compact notation, we describe a data distribution by means of a permutation of the elements of the data object. Note that this notation is somewhat different from the mappings of elements onto processors used so far. The key elements of our notation are to consider only one-dimensional data objects and a linearly numbered set of processors, to perform a permutation on the elements of the data object, and then to distribute the resulting vector according to a (plain) standard mapping function.

Any data object $A[N_{m-1}, \ldots, N_0]$, $N_i = 2^{n_i}$, can easily be given a one-dimensional ordering of its elements by assigning element $A(a_{m-1}, \ldots, a_0)$ the rank $a$:
$$a = a_{m-1} | \cdots | a_0 = \sum_{i=0}^{m-1} a_i \prod_{j=0}^{i-1} N_j.$$
With a slight abuse of notation we write $A(a)$ for element $A(a_{m-1}, \ldots, a_0)$ of $A[N_{m-1}, \ldots, N_0]$. With $n = n_{m-1} + \cdots + n_0$, the index space of $A$ is thus mapped onto $\mathbb{Z}_{2^n}$. The mapping of the set of processors $P[2^{p_{d-1}}, \ldots, 2^{p_0}]$ onto the set $\mathbb{Z}_{2^p}$, $p = p_{d-1} + \cdots + p_0$, is done analogously by concatenating the processor coordinates.

Now we proceed with the definition of a class $\Gamma_n$ of permutations of $\mathbb{Z}_2^n$. $\Gamma_n$ is a group and is called the group of bit permutations. Each permutation $\pi \in \Gamma_n$ corresponds to a permutation $\pi' \in \Sigma_n$, where $\Sigma_n$ denotes the symmetric group on $n$ elements, consisting of all $n!$ permutations; $\pi$ is the permutation that results when $\pi'$ is applied bitwise, i.e. for each $\pi' \in \Sigma_n$ there exists a $\pi \in \Gamma_n$ such that for all $x \in \mathbb{Z}_2^n$:
$$\pi(x) = \pi(x_{n-1}, \ldots, x_0) = x_{\pi'(n-1)}, \ldots, x_{\pi'(0)}, \qquad \Gamma_n = \{\pi : \pi' \in \Sigma_n\}.$$
After applying the permutation $\pi \in \Gamma$ to $A$, the data object $A$ is distributed onto the $P = 2^p$ processors such that $(\pi A)(x)$ is mapped onto processor $x_{n-1}, \ldots, x_{n-p}$. We say $A$ is $\pi$-distributed. The distribution of $\pi A$ is analogous to HPF's BLOCK distribution. It is easy to see that the data distributions defined by HPF [7], pC++ [3], ScaLAPACK [8] and the distributions for image processing described in Section 2.2 are all elements of the class $\Gamma$.
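A small Python illustration (ours, not from the paper) of a bit permutation $\pi$ induced by $\pi' \in \Sigma_n$, and of the resulting owner processor taken as the $p$ most significant bits of the permuted index:

```python
def bit_permute(x, sigma, n):
    """pi(x): bit i of the result is bit sigma(i) of x, for i = 0, ..., n-1."""
    y = 0
    for i in range(n):
        y |= ((x >> sigma[i]) & 1) << i
    return y

def owner(x, sigma, n, p):
    """Processor id of element x of a pi-distributed vector of 2^n elements
    on 2^p processors: the p most significant bits of pi(x)."""
    return bit_permute(x, sigma, n) >> (n - p)

# Example: n = 4 bits, 2^2 processors. The identity permutation reproduces a
# BLOCK distribution, while sigma = (2, 3, 0, 1) (swapping the upper and
# lower halves of the bits) yields a CYCLIC-like distribution.
identity = (0, 1, 2, 3)
swap     = (2, 3, 0, 1)
print([owner(x, identity, 4, 2) for x in range(16)])  # 0,0,0,0,1,1,1,1,...
print([owner(x, swap, 4, 2) for x in range(16)])      # 0,1,2,3,0,1,2,3,...
```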
2.4 Redistribution of data
In general, data redistributions in a parallel system can be performed by sorting algorithms: the sorting key is simply chosen as the id of the destination processor. In the case where the initial data permutation is unknown, redistribution is equivalent to sorting on a defined range of distinct numbers. In the case when some information on the permutation is available, it is often possible to find more specialized but faster algorithms for redistribution. The class $\Gamma_n$ of data permutations described in this paper has been investigated intensively in the literature. Some basic work on self-routing algorithms for this class was published by Nassimi and Sahni [12]. The algorithm was originally designed for $2n - 1$ stage Beneš networks and can be emulated within $2n - 1$ steps on a $2^n$ node de Bruijn network (in fact, the class $\Gamma$ is a subset of the permutations considered in [12]). A similar result was obtained in [16] for an even wider class, the so-called linear permutations. The algorithm of [12] can be refined, as the one that can be found in [13], such that at most $2p$ communication steps are sufficient to route $\Gamma_n$, $n \geq p$, on a $2^p$ node de Bruijn graph. In the case of arbitrary permutations the so-called Shuffle-Ring-Sort (SRS) algorithm, presented in [2], can be used. If $n \geq 2p$ the SRS requires $3p$ communication steps. However, for redistributions within the class $\Gamma$, it can easily be shown that a modified version of the SRS achieves the same efficiency as the algorithms mentioned above. Thus we derive an algorithm for data redistributions within the class $\Gamma$ that requires $2p$ communication steps on a de Bruijn graph and that can easily be extended if arbitrary permutations are required.
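As a simple, network-agnostic illustration of the sorting view of redistribution (ours; this is not the self-routing algorithm of [12]), the following sketch moves every element to its new owner by grouping on the destination processor id:

```python
from collections import defaultdict

def redistribute(local_data, new_owner, num_procs):
    """Simulated redistribution: local_data[q] holds (index, value) pairs on
    processor q under the old distribution; the result holds the same data
    grouped by the new owner. In a real system each group would be sent in
    one message to its destination processor."""
    outgoing = defaultdict(list)
    for q in range(num_procs):
        for index, value in local_data[q]:
            outgoing[new_owner(index)].append((index, value))
    return [sorted(outgoing[q]) for q in range(num_procs)]

# Example: 8 elements on 2 processors, moved from a BLOCK to a CYCLIC distribution.
block_owner  = lambda i: i >> 2     # top bit of the 3-bit index
cyclic_owner = lambda i: i & 1      # least significant bit
data = [[(i, chr(ord('a') + i)) for i in range(8) if block_owner(i) == q] for q in range(2)]
print(redistribute(data, cyclic_owner, 2))
```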
3 Example: Normal Algorithms

In order to emphasize the applicability and usefulness of different and variable data distributions, we take a closer look at the well-known Fast Fourier Transform (FFT), which belongs to the class of so-called normal algorithms [11]. Informally, an algorithm is said to be normal when it can be computed on a hypercube-interconnected parallel processing system in such a way that within each communication step all processing elements use the same hypercube dimension for the data exchange and the dimensions are used consecutively. A number of algorithms belong to the class of normal algorithms. Among them are the FFT, the Walsh transform, translation-invariant transforms for position-invariant feature extraction in gray-scale images, image restoration algorithms like the Viterbi algorithm, Batcher's bitonic sorting algorithm, and the modified shuffle-ring sorter (cf. Section 2.4), to mention just some [5]. A great variety of parallel implementations exists for these algorithms. Many of them may be derived from each other by the use of a proper data distribution function along with a remapping of computation localities. Moreover, algorithms having special properties with respect to their parallel implementation may be designed by the use of the corresponding assignment of data to processing elements.

Given an index space $\mathbb{Z}_2^n$ of $N = 2^n$ elements, the computation of the aforementioned algorithms can basically be described by a set of $nN$ functions $f_a^k$, $a = 0, \ldots, N-1$, $k = 1, \ldots, n$, operating on the entire input data object $A[N]$ in an iterative manner:
$$A^0 = A, \qquad A^k(a) = f_a^k\bigl(A^{k-1}(a),\, A^{k-1}(a \oplus 2^{k-1})\bigr), \tag{1}$$
where $\oplus$ denotes the binary exclusive-or operation. The corresponding data flow of Equation 1 results in the well-known $n$-dimensional base-2 butterfly graph.
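A sequential reference implementation of Eq. (1) (our own sketch, not code from the paper), instantiated with the Walsh transform butterfly, one of the normal algorithms listed above; the FFT of Eq. (3) would only exchange the stage function.

```python
def normal_algorithm(A, f):
    """Iterate Eq. (1): A^k(a) = f(k, a, A^{k-1}(a), A^{k-1}(a XOR 2^{k-1}))."""
    N = len(A)
    n = N.bit_length() - 1          # assumes N = 2^n
    for k in range(1, n + 1):
        A = [f(k, a, A[a], A[a ^ (1 << (k - 1))]) for a in range(N)]
    return A

def walsh_butterfly(k, a, x, x_partner):
    # Walsh(-Hadamard) stage function: sum if bit k-1 of a is 0, difference otherwise.
    return x + x_partner if (a >> (k - 1)) & 1 == 0 else x_partner - x

# Example: the Walsh transform of a unit impulse is the all-ones vector.
print(normal_algorithm([1, 0, 0, 0, 0, 0, 0, 0], walsh_butterfly))
```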
For any data distribution generated by a permutation $\pi \in \Gamma$ it can be shown (cf. [13] for the details) that we can refine Equation 1 by
$$B^0(b) = A^0(a), \qquad B^k(b) = f_b^{k'}\bigl(B^{k-1}(b),\, B^{k-1}(b \oplus 2^{k-1})\bigr), \tag{2}$$
with $b = \pi(a)$, such that $B^n = A^n$. Informally, the permutation $\pi'$ induced by $\pi$ has to be applied to the stage numbers of the original butterfly graph. Both computations differ only in the data distribution and the locality of functions. See Figure 1b for an example.

A standard way to implement Equation 1 on a parallel system with $P = 2^p$ processors is given when we use a BLOCK mapping of the entire data object $A^0$, i.e. we assign a vector $v_x^0$ of $N/P$ contiguous elements to each processor $x \in \mathbb{Z}_2^p$; $v_x^0$ consists of all elements $A^0(x|y)$, $y \in \mathbb{Z}_2^{n-p}$. The first $n - p$ layers of computation have to be computed locally on each processor. The other $p$ layers require a data exchange between the processors $x$ and $x' = x \oplus 2^{k'-1}$, $k' = 1, \ldots, p$.
Figure 1. (a) Implementation of Eq. 1 ($p = 2$, $n = 5$). (b) Implementation of Eq. 2 ($\pi'(4, 3, 2, 1, 0) = (3, 1, 2, 0, 4)$). (c) Implementation of Eq. 2 on an iterated de Bruijn graph. (Axes: data index versus stage $k$ resp. $k'$.)
If we split each $v_x^k$ into subvectors $v_{x|j|i}^k$ consisting of the elements $A^k(x|j|i|y')$, $i, j = 0, 1$, $y' \in \mathbb{Z}_2^{n-p-2}$, it is easy to verify that we may use a computation/communication overlapping technique: we first compute $v_{x|j|0}^{n-p}$, which may be communicated while we proceed with the computation of $v_{x|j|1}^{n-p}$. In Figure 1, different shades of circles illustrate the overlap of computation and communication. This technique may be applied at all stages where data exchange is necessary. Using this overlap technique will hide the communication costs as long as the time to compute $v_{x|j|1}^k$, $k \geq n - p$, is greater than the communication time for $v_{x|j|0}^k$. Therefore it is desirable to maximize the number of computations per communication. An easy way to achieve this is to choose an appropriate data distribution of the data object, as indicated in Figure 1b, and to determine the locality of function evaluation according to Eq. 2. Finally, we can choose the constant geometry approach, i.e. we simulate the butterfly data flow graph on an iterated de Bruijn graph as shown in Figure 1c. For the sake of brevity we omit the details here.

For any fixed set of functions $f_a^k$ we can construct the above algorithms. Let $\pi \in \Gamma$ be the bit-reverse data distribution. Then with $A^0 = \pi A$ the one-dimensional decimation-in-time FFT algorithm is given by:
$$f_a^k(a, a') = a + a' \exp\bigl(2\pi i\, (a_{k-1} \ldots a_0)\, 2^{-k}\bigr). \tag{3}$$
By using Equation 2 for our implementation we are able to handle a wide variety of data distributions.
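For reference, a sequential sketch (ours, using the textbook radix-2 decimation-in-time formulation on a bit-reverse-ordered input rather than the exact notation of Eq. 3):

```python
import cmath

def bit_reverse(a, n):
    """Reverse the n-bit representation of index a."""
    return int(format(a, f'0{n}b')[::-1], 2)

def fft_dit(x):
    """Radix-2 decimation-in-time FFT; a textbook stand-in for Eqs. (1)/(3)."""
    N = len(x)
    n = N.bit_length() - 1
    A = [x[bit_reverse(a, n)] for a in range(N)]    # bit-reversed input, cf. pi
    for k in range(1, n + 1):                       # the n butterfly stages
        half = 1 << (k - 1)
        w = cmath.exp(-2j * cmath.pi / (2 * half))  # principal 2^k-th root of unity
        for start in range(0, N, 2 * half):
            wj = 1.0
            for j in range(half):
                t = wj * A[start + j + half]
                A[start + j + half] = A[start + j] - t
                A[start + j] = A[start + j] + t
                wj *= w
    return A

# Example: the FFT of a unit impulse is the all-ones vector.
print(fft_dit([1, 0, 0, 0]))
```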
4 Data distributions from the application's point of view

Ideally, the main task of application programming should consist of merely composing the application from existing library functions. The library functions are taken to be parallelized and implemented by parallelization experts, as we demonstrated in the previous Section 3 with the example of the FFT. As far as possible, all issues of the parallel implementation should be hidden from the application programmer. In today's systems, however, the choice of the data distribution is left to the programmer [7, 8, 3]. Like the choice of the appropriate data type, the programmer needs to choose the data distribution according to the requirements of the application. Thus, at the level of application programming, data distributions are to be treated identically to data types. It is our belief that the same syntactical elements should be used both for specifying data types and distributions. It is possible to generate some data type conversion and redistribution commands automatically: for each image processing function contained in the library, it is known at compile time which distributions and data types it can handle, and, if necessary, the appropriate conversion functions can be called automatically. Nevertheless, it should be clear that the automatic invocation of data redistribution can lead to a significant loss in efficiency. Therefore it is good programming style to choose the distribution type explicitly, just as it is required to choose the type of the data elements according to the application. The main advantage of allowing a well structured class of data distributions is that, in many cases, it is possible to have implementations that are equally efficient on all or on a large number of distributions. Thereby conversions of the data distribution can be dramatically reduced, if not eliminated entirely. Due to space limitations we refer the reader to [15] for further information on the realization of the paradigm described above. A hypothetical sketch of this style of programming is given below.
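The following Python sketch is purely hypothetical (ours; it does not show the actual PIPS interface). It illustrates treating the distribution as part of an image's type, with any required redistribution made visible before a library call that supports only a certain class of distributions.

```python
class DistributedImage:
    """An image tagged with its current data distribution (e.g. 'Shuffle', 'Block')."""
    def __init__(self, pixels, distribution):
        self.pixels = pixels
        self.distribution = distribution

    def redistributed(self, target):
        # In a real system this would trigger inter-processor communication.
        print(f"redistributing {self.distribution} -> {target}")
        return DistributedImage(self.pixels, target)

def library_fft2d(image, accepted=("Shuffle", "Omega", "Block")):
    """A library function that can operate on a whole class of distributions
    (cf. Section 3); it only redistributes when the input lies outside that class."""
    if image.distribution not in accepted:
        image = image.redistributed("Shuffle")
    # ... the parallel FFT on the pi-distributed data would run here ...
    return image

img = DistributedImage(pixels=[[0.0] * 8 for _ in range(8)], distribution="2d-Torus")
out = library_fft2d(img)          # triggers one explicit, visible redistribution
```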
5 Conclusion
An analysis of the data distributions suited for parallel image processing and of those defined by some of today's more popular parallel languages and libraries reveals that the majority of them belong to the small and well-structured class of bit permutations $\Gamma$. We demonstrate that an important class of algorithms, the so-called normal algorithms, can tolerate all permutations of $\Gamma$ as input data permutation; the permutation of the output data is again an element of $\Gamma$. An explicit data redistribution, which might be required by some algorithms, can be performed efficiently for the class $\Gamma$. For wider classes of permutations, efficient data redistribution algorithms can be derived from sorting algorithms.

We propose to widen the scope of data distributions tolerated by parallel languages and libraries towards classes of distributions (e.g. the bit permutations $\Gamma$ or even wider classes). The amount of data redistribution can be reduced significantly, thereby increasing the efficiency, by implementing library functions that can operate on all or on large subclasses of distributions. Additionally, the algorithm used for data redistribution can be optimized for the class of distributions under consideration. At the application level of programming, data distributions can be treated analogously to data types. We propose to use the same syntactical elements for the declaration and casting of both data types and data distributions. With the concept of data distribution classes, the application programmer, who is not necessarily a parallelization expert, is enabled to write efficient programs without much overhead. However, for the time being, the application programmer cannot be relieved of the task of choosing the appropriate data distribution. This can be seen as a skill analogous to choosing numerically appropriate data types.
References
[1] T. Andreae, M. Nölle, and G. Schreiber. Cartesian Products of Graphs as Spanning Subgraphs of de Bruijn Graphs. In Proceedings of the 20th Workshop on Graph-Theoretic Concepts in Computer Science (WG94), Lecture Notes in Computer Science, Springer, June 1994.
[2] A. Bieniek, M. Nölle, and G. Schreiber. A regular N-node bounded degree network for sorting N^2 keys with optimal speedup. In S. Miguet and A. Montanvert, editors, Proc. of the 4th International Workshop on Parallel Image Analysis (IWPIA'95), pages 253-266, LIP-ENS, Lyon, France, Dec. 1995.
[3] F. Bodin, P. Beckman, D. Gannon, S. Narayana, and S. X. Yang. Distributed pC++: Basic ideas for an object parallel language. Scientific Programming, 2(3), 1993.
[4] M. Brunzema, H. Burmeister, and D. Gerogiannis. Parallelization of an Image Analysis Application: Problems, Results and a Solution Framework. In International Conference on Pattern Recognition, pages 406-411, Jerusalem, Oct. 1994.
[5] H. Burkhardt and X. Müller. On Invariant Sets of a Certain Class of Fast Translation-Invariant Transforms. IEEE Trans. on Acoustics, Speech, and Signal Processing, ASSP-28(5):517-523, Oct. 1980.
[6] B. Chapman, P. Mehrotra, and H. Zima. Programming in Vienna Fortran. Scientific Programming, 1(1):31-50, 1992.
[7] B. Chapman, P. Mehrotra, and H. Zima. Extending HPF for advanced data-parallel applications. IEEE Parallel and Distributed Technology, 2(3):15-27, 1994.
[8] J. Choi, J. Dongarra, R. Pozo, and D. Walker. ScaLAPACK: A Scalable Linear Algebra Library for Distributed Memory Concurrent Computers. In Proc. of the Fourth Symposium on the Frontiers of Massively Parallel Computation (McLean, Virginia), pages 120-127, Los Alamitos, California, 1992. IEEE Computer Society Press.
[9] N. de Bruijn. A combinatorial problem. In Proceedings of the Section of Sciences, pages 758-764, Koninklijke Nederlandse Akademie van Wetenschappen, June 1946.
[10] High Performance Fortran Forum. High Performance Fortran language specification, version 1.1. Technical report, Rice University, Houston, Texas, 1994.
[11] F. Leighton. Introduction to Parallel Algorithms and Architectures. Morgan Kaufmann Publishers, San Mateo, California, 1992.
[12] D. Nassimi and S. Sahni. A Self-Routing Benes Network and Parallel Permutation Algorithms. IEEE Trans. on Computers, C-30(5):332-340, May 1981.
[13] M. Nölle. Konzepte zur Entwicklung paralleler Algorithmen der digitalen Bildverarbeitung. PhD thesis, Technische Universität Hamburg-Harburg, Dec. 1994. Published as Fortschritt-Bericht, VDI-Reihe 10, Nr. 410, VDI-Verlag, 1996.
[14] M. Nölle and G. Schreiber. Data distribution concepts for parallel image processing. Internal Report 3/96, Technische Informatik I, TU Hamburg-Harburg, Jan. 1996.
[15] M. Nölle, G. Schreiber, and H. Schulz-Mirbach. PIPS - a general purpose Parallel Image Processing System. In W. Kropatsch and H. Bischof, editors, 16. DAGM-Symposium "Mustererkennung", pages 609-623, Wien, Sept. 1994. Reihe Informatik XPress, TU Wien.
[16] C. S. Raghavendra and R. V. Boppana. On Self-Routing in Benes and Shuffle-Exchange Networks. IEEE Trans. on Computers, 40(9):1057-1064, Sept. 1991.