On the Parallel Implementation of the Fast Wavelet Packet Transform on MIMD Distributed Memory Environments

Stefania Corsaro³, Luisa D'Amore²,³, and Almerico Murli¹,³

¹ University of Naples "Federico II"
² Second University of Naples, Caserta, Italy
³ Center for Research on Parallel Computing and Supercomputers (CPS), CNR, Complesso Monte S. Angelo, via Cintia, 80126 Naples, Italy
{corsaro, damore, murli}@matna2.dma.unina.it
Abstract. This work describes the design and implementation issues of the Fast Wavelet Packet Transform (FWPT), 1D and 2D, on parallel distributed memory multiprocessors. In particular, we describe two different approaches to the development of a parallel implementation of the FWPT of a matrix A. In section 2 we introduce some notation and definitions, in section 3 we describe the computational environment, and in section 4 we discuss the parallel implementation of the bidimensional FWPT. Finally, in section 4.3, we show a numerical experiment.
1 Introduction
Wavelets have generated tremendous interest in both theoretical and applied areas, especially over the past few years. In particular, the Fast Wavelet Transform (FWT) has become a very powerful computational tool in Image Processing. Most applications of wavelets deal with compression, transmission, synthesis and reconstruction of signals in one or more dimensions; moreover, wavelets have been applied to the numerical solution of Partial Differential and Integral Equations [4] and to the Approximation and Interpolation of data [6]. At present, several routines and mathematical software packages that compute the FWT and the FWPT (1D and 2D) are available, but there are only a few parallel implementations, especially for the FWPT (see for example [1] and [8]). Most real applications (for instance, Image Processing in astrophysics and in medical imaging) either involve a large amount of data or need solutions in a suitable "turnaround" time, so the only way to solve them effectively is to exploit the resources of advanced architectures. In particular, our interest focuses on Image Restoration problems described by Integral Equations of the first kind, which require the solution of ill-conditioned linear systems. In this case, a least squares solution is computed by using the Preconditioned Conjugate Gradient (PCG) method. Moreover, to compute a reasonable solution, we should use a regularization technique to smooth out the noise which perturbs the data: the computational kernel of this problem is the bidimensional FWPT.
2 Preliminaries
One of the main features of wavelets is "localization": wavelets make it possible to study a signal with finite energy, that is a function $f(t)$ belonging to $L^2(\mathbb{R}^+)$, localizing it both in time and in frequency. This property is particularly clear when looking at the wavelet functions in the context of a Multiresolution Analysis (MRA): projecting $f(t)$ onto a space of an MRA yields information about it that depends on the resolution of the space. The mapping that leads from the $m$-th resolution level to the $(m-1)$-th level, retaining the information that is lost in this process, is the FWT and, more generally, the FWPT [2,9]. Given two sequences $(h_k)_{k\in\mathbb{Z}}$ and $(g_k)_{k\in\mathbb{Z}}$, the filters of the wavelets, and the vector $c^m = (c^m_n)_{n\in\mathbb{Z}}$ of the coefficients of the projection of $f(t)$ onto the $m$-th resolution subspace of the MRA, let us define the FWT operator.

Definition 1. The FWT operator, $W$, is defined as follows:
$$ W : c^m \in l^2(\mathbb{Z}) \longrightarrow (c^{m-1}, d^{m-1}) \in l^2(\mathbb{Z}) \times l^2(\mathbb{Z}) $$
where $\mathbb{Z}$ is the set of integers, $l^2(\mathbb{Z}) = \{(c_k)_{k\in\mathbb{Z}} : c_k \in \mathbb{C}, \sum_k |c_k|^2 < \infty\}$, and
$$ c^{m-1}_n = \sum_{k\in\mathbb{Z}} h_{k-2n} c^m_k, \qquad d^{m-1}_n = \sum_{k\in\mathbb{Z}} g_{k-2n} c^m_k \tag{1} $$
where $h$ and $g$ are the low-pass and the high-pass filters respectively.

In matrix form, if $L = (\tilde{h}_{i,j} = h_{j-2i})$ is the low-pass operator and $H = (\tilde{g}_{i,j} = g_{j-2i})$ is the high-pass operator, relations (1) can be written as:
$$ \begin{pmatrix} c^{m-1} \\ d^{m-1} \end{pmatrix} = \begin{pmatrix} L \\ H \end{pmatrix} \cdot c^m \iff \begin{cases} c^{m-1} = L c^m \\ d^{m-1} = H c^m \end{cases} $$
The transformation of the vector $c^m$ using the filters $h_k$ retains the information about the low frequencies, while the filters $g_k$ "detect" the high frequencies: so the vector $d^{m-1}$ contains the details, that is, the information that is lost passing from resolution $m$ to resolution $m-1$. From a computational point of view, it is worth emphasizing that, if $l$ is the length of the two sequences $h_k$ and $g_k$, then the number of floating-point operations required for the computation of the FWT of a vector of length $N$ is $O(lN)$ [9]. Figure 2 shows that the vector $d^{m-1}$ is not transformed after the first step: if we use Wavelet Packets, the details are subdivided further, which also leads, from a computational point of view, to more efficient parallel algorithms.
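As an illustration only (not part of the original paper's software), the following is a minimal Python sketch of one FWT step as in relation (1), assuming periodic extension of the input vector; the function name and interface are placeholders. With the Haar filters $h = (1/\sqrt{2}, 1/\sqrt{2})$ and $g = (1/\sqrt{2}, -1/\sqrt{2})$ it reduces to pairwise averages and differences, and its cost is $O(lN)$ as stated above.

```python
import numpy as np

def fwt_step(c, h, g):
    """One FWT step (relation (1)):
    c^{m-1}_n = sum_k h_{k-2n} c^m_k,  d^{m-1}_n = sum_k g_{k-2n} c^m_k,
    with periodic extension of c assumed."""
    N = len(c)
    c_coarse = np.zeros(N // 2)
    d_detail = np.zeros(N // 2)
    for n in range(N // 2):
        for k in range(len(h)):
            c_coarse[n] += h[k] * c[(k + 2 * n) % N]  # low-pass branch
            d_detail[n] += g[k] * c[(k + 2 * n) % N]  # high-pass branch
    return c_coarse, d_detail
```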
Definition 2. The FWPT operator, $WP$, is defined as follows:
$$ WP : c^m \in l^2(\mathbb{Z}) \longrightarrow (c^{m-s}_0, \ldots, c^{m-s}_{2^s-1}) \in \left(l^2(\mathbb{Z})\right)^{2^s} $$
where $s$ is the number of transform steps and
$$ (c^{m-j}_l)_n = \begin{cases} \sum_{k\in\mathbb{Z}} h_{k-2n} \, (c^{m-j+1}_{l/2})_k & \text{if } l \text{ is even} \\ \sum_{k\in\mathbb{Z}} g_{k-2n} \, (c^{m-j+1}_{(l-1)/2})_k & \text{if } l \text{ is odd} \end{cases} \tag{2} $$
with $j = 1, \ldots, s$ and $l = 0, \ldots, 2^j - 1$.
In matrix form, if we define $H_0 := L$, $H_1 := H$ and
$$ F_i := \prod_{j=1}^{s} H_{\epsilon_j}, \qquad i = 0, \ldots, 2^s - 1, $$
where $\epsilon_j$ is the $j$-th binary digit of $i$, then relations (2) can be written, for $j = s$, as:
$$ \begin{pmatrix} c^{m-s}_0 \\ c^{m-s}_1 \\ \vdots \\ c^{m-s}_{2^s-1} \end{pmatrix} = \begin{pmatrix} F_0 \\ F_1 \\ \vdots \\ F_{2^s-1} \end{pmatrix} \cdot c^m $$
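Recursion (2) can be sketched in Python as follows (illustrative only, reusing the fwt_step routine sketched above): at each of the $s$ steps every current subband is split again, so that subband $2l$ is obtained from subband $l$ through the low-pass filter and subband $2l+1$ through the high-pass filter, yielding the $2^s$ vectors $c^{m-s}_0, \ldots, c^{m-s}_{2^s-1}$.

```python
def fwpt(c, h, g, s):
    """Wavelet packet decomposition of a vector (Definition 2): unlike
    the FWT, the detail vectors are split again at every step,
    producing 2**s subbands."""
    subbands = [c]
    for _ in range(s):
        next_level = []
        for v in subbands:
            low, high = fwt_step(v, h, g)   # split subband l into subbands 2l and 2l+1
            next_level.extend([low, high])
        subbands = next_level
    return subbands
```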
We define the FWPT of a matrix $A$ as well.

Definition 3. If
$$ Q_s := \prod_{i=0}^{s-1} \mathrm{diag}_{2^i}\begin{pmatrix} L \\ H \end{pmatrix} $$
where $\mathrm{diag}_{2^i}$ is the block diagonal matrix with $2^i$ diagonal blocks, then the FWPT in $s$ steps of a matrix $A$ is defined as $A^s = Q_s A Q_s^T$.

In particular, we will consider compressible matrices [2], since they allow one to obtain sparse representations of the operators.

Definition 4. A square matrix $A = (a_{i,j})$ of dimension $2^m$ is said to be compressible if two constants $M$ and $C$ exist such that $a_{i,j} = 0$ if $i = j$, and:
$$ |a_{i,j}| \le \frac{2^m C}{|i-j|}, \qquad \left| \sum_{k=0}^{M} (-1)^{M-k}\binom{M}{k} a_{i+k,j} + \sum_{k=0}^{M} (-1)^{M-k}\binom{M}{k} a_{i,j+k} \right| \le \frac{2^m C}{|i-j|^{M+1}} $$
otherwise.
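A minimal dense sketch of Definition 3 (illustrative only, and assuming $N = 2^m$): $Q_s$ is never formed explicitly; instead the $s$-step packet transform from the sketches above is applied first to every column of $A$ and then to every row of the intermediate result. The exact ordering of the subbands depends on the convention chosen for $Q_s$.

```python
import numpy as np

def fwpt_matrix(A, h, g, s):
    """2D FWPT (Definition 3): A^s = Q_s A Q_s^T, computed by applying
    the s-step packet transform to the columns of A and then to the
    rows of the intermediate result."""
    def transform(v):
        # concatenate the 2**s subbands into one vector of length len(v)
        return np.concatenate(fwpt(np.asarray(v, dtype=float), h, g, s))

    B = np.column_stack([transform(A[:, j]) for j in range(A.shape[1])])  # Q_s A
    return np.vstack([transform(B[i, :]) for i in range(B.shape[0])])     # (Q_s A) Q_s^T
```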
In figure 1 an example of a compressible matrix is shown.
Fig. 1. $A(i,j) = \frac{1}{i-j}$ if $i \neq j$, $A(i,j) = 0$ otherwise.
Fig. 2. Representation of the "cascade structure" of the FWT operator.
3 The Computational Environment
The parallel implementation is based on the Single Program Multiple Data (SPMD) programming model, that is, each processor executes the same algorithm on different data. Let nprocs be the number of processors and $P_r$, $P_c$ two integers such that $\mathrm{nprocs} \le P_r \cdot P_c$; we map the processors onto a logical bidimensional mesh with $P_r$ rows and $P_c$ columns and, if $0 \le i < \mathrm{nprocs}$, we denote by $P_i$ the processor whose id number is equal to $i$. More precisely, this map can be defined as follows:
Fig. 3. Representation of the cascade structure of the FWPT operator.
$$ F : i \in \{0, 1, \ldots, \mathrm{nprocs}-1\} \longrightarrow (r_i, c_i) \in \{0, 1, \ldots, P_r-1\} \times \{0, 1, \ldots, P_c-1\} $$
where $r_i = \lfloor i / P_c \rfloor$ and $c_i = \mathrm{mod}(i, P_c)$. If $P_r = 1$ or $P_c = 1$, we have a ring interconnection among the processors. One of the main difficulties in the development of parallel implementations of FWPT applications is the choice of an appropriate data distribution strategy ensuring that the meaningful elements of the compressed matrices, mainly located in a square block whose dimension depends on the number $s$ of FWPT steps, are distributed over all the processors, in order to guarantee a good work load balancing. Here we refer to a theoretical model of parallel computer such as the one used in the ScaLAPACK library [5], the high-performance library designed to solve linear algebra problems on distributed memory multiprocessors. According to ScaLAPACK conventions, we use a block cyclic distribution of the compressed matrix $A^s_B$, so that each processor has a "dense" block of $A^s_B$. The block cyclic data distribution is parameterized by the four numbers $P_r$, $P_c$, $r$, $c$, where $P_r \times P_c$ is the process template and $r \times c$ is the block size. The generic element of global indices $(m, n)$ of a matrix $A$ is stored in position $(i, j)$ of block $(b, d)$ in processor $(p, q)$, where
$$ (p, q) = \left( \left\lfloor \frac{m}{r} \right\rfloor \bmod P_r, \ \left\lfloor \frac{n}{c} \right\rfloor \bmod P_c \right), \quad (b, d) = \left( \left\lfloor \frac{\lfloor m/r \rfloor}{P_r} \right\rfloor, \ \left\lfloor \frac{\lfloor n/c \rfloor}{P_c} \right\rfloor \right), \quad (i, j) = (m \bmod r, \ n \bmod c) \tag{3} $$
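A small Python sketch of the two maps just introduced (function names are illustrative): grid_coords implements the processor-grid map $F$ in row-major order, and block_cyclic implements relation (3).

```python
def grid_coords(i, Pc):
    """Map processor id i onto the Pr x Pc logical mesh (row-major)."""
    return i // Pc, i % Pc

def block_cyclic(m, n, r, c, Pr, Pc):
    """Relation (3): global element (m, n) is stored at local position
    (i, j) of block (b, d) on processor (p, q)."""
    p, q = (m // r) % Pr, (n // c) % Pc      # owning processor
    b, d = (m // r) // Pr, (n // c) // Pc    # block coordinates on that processor
    i, j = m % r, n % c                      # position within the block
    return (p, q), (b, d), (i, j)
```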
In the following section we describe the main idea of the parallel implementation of the FWPT of a matrix A; two different strategies of parallelization are discussed, the former based on a domain decomposition strategy, the latter on the parallelization of the sequence of floating-point operations.
Fig. 4. On the left, a 9 × 9 matrix partitioned into 2 × 2 blocks; on the right, the matrix is mapped onto a 2 × 3 process grid.
4 The FWPT Parallel Implementation
We describe two different strategies of parallelization of steps 1 and 2 of the algorithm described in figure 5.

4.1 First Strategy
Let $A$ be a compressible square matrix of dimension $N = 2^m$ and nprocs the number of processors. The matrix $A$ is distributed in column-block fashion, that is, if $MB = (N + \mathrm{nprocs} - 1)/\mathrm{nprocs}$ and $lp = (N + MB - 1)/MB - 1$, then the processors whose identification number belongs to the set $\{0, 1, \ldots, lp-1\}$ have $MB$ columns each; the processor whose identification number is $lp$ has $MB$ columns if $num = \mathrm{mod}(N, MB) = 0$, and $num$ columns otherwise. If $0 \le i < lp$, let us denote by $A_i = A(:, i \cdot MB : i \cdot MB + (MB-1))$ the block formed by the columns $i \cdot MB, \ldots, i \cdot MB + (MB-1)$; if $i = lp$,
$$ A_i = \begin{cases} A(:, i \cdot MB : i \cdot MB + (MB-1)) & \text{if } num = 0 \\ A(:, i \cdot MB : i \cdot MB + (num-1)) & \text{if } num \neq 0 \end{cases} $$
is the block formed by the columns $i \cdot MB, \ldots, i \cdot MB + (MB-1)$ if $num = 0$, or by the columns $i \cdot MB, \ldots, i \cdot MB + (num-1)$ if $num \neq 0$. If $P_i$ is the processor whose id number is equal to $i$, then $P_i$ holds the block $A_i$. This distribution can be derived from (3) by setting $P_r = \mathrm{nprocs}$, $P_c = 1$, $r = c = \lceil N/\mathrm{nprocs} \rceil$. Let us look at figure 6: steps 1. and 3. do not require communication among the processors, while step 2. does.
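A sketch of the ownership computation just described, assuming 0-based processor ids and column indices (names are illustrative, not taken from the paper's code):

```python
def column_block_range(i, N, nprocs):
    """First strategy: column-block distribution. Processor i owns columns
    i*MB, ..., i*MB + (MB-1), except possibly the last owning processor lp,
    which owns num = N mod MB columns when num != 0."""
    MB = (N + nprocs - 1) // nprocs          # columns per processor
    lp = (N + MB - 1) // MB - 1              # id of the last processor holding data
    num = N % MB
    if i > lp:
        return None                          # processor i holds no columns
    ncols = MB if (i < lp or num == 0) else num
    first = i * MB
    return first, first + ncols - 1          # inclusive column range
```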
Each processor exchanges with every other processor a block of dimension $k = N^2/\mathrm{nprocs}^2$, so the total amount of data exchanged by each processor over the $(\mathrm{nprocs}-1)$ communication steps is $O(k \cdot (\mathrm{nprocs}-1))$.
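As a purely illustrative instance of this cost estimate (the values are chosen here, not taken from the paper): for $N = 2^{10}$ and $\mathrm{nprocs} = 4$,
$$ k = \frac{N^2}{\mathrm{nprocs}^2} = \frac{2^{20}}{16} = 65536, \qquad k \cdot (\mathrm{nprocs} - 1) = 3 \cdot 65536 = 196608 $$
elements are exchanged in total by each processor.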
4.2 Second Strategy
Let $A$ be a compressible square matrix of dimension $N = 2^m$ and $\mathrm{nprocs} = 2^d$ the number of processors. The matrix $A$ is distributed in row-block fashion, that is, if $n = 2^{m-d}$, then $A_i = A(n \cdot i : n \cdot (i+1) - 1, :)$ is distributed to the processor with id number $i$, $0 \le i < \mathrm{nprocs}$. In figure 7 the second strategy is represented: only the first step, that is the computation of the FWPT of the columns of $A$, requires communication among the processors, since the columns are distributed and, therefore, each processor has a part of the vectors to be transformed. If $s$ is the number of FWPT steps, then for each $0 \le k < s$ we compute $2^k$ FWTs of sequences of length $2^{m-k}$: since these sequences could be distributed, we divide the processors into groups so that each sequence corresponds to a group, and only processors belonging to the same group must communicate. At each step $k$, where $0 \le k \le \min\{s-1, d-1\}$, each processor exchanges four blocks, two of dimension $nN$ and two of dimension $(2M-2)N$, over four steps of communication, where $M$ is the number of vanishing moments of the wavelets, and it is always $M$