Implementation of an Automatic Semi-Fluid Motion Analysis Algorithm on a Massively Parallel Computer
K. Palaniappan, Mohammad Faisal, Chandra Kambhamettu, A. Frederick Hasler†
Universities Space Research Association
†Laboratory for Atmospheres, Code 912
NASA/Goddard Space Flight Center, Greenbelt, MD 20771
E-mail:
[email protected], faisal @&loria,chandra@gloria, hasler@agnes
Abstract
The implementation of a parallel algorithm for estimating non-rigid motion vectors using a semi-fluid motion model applied to time-varying satellite imagery is described. Deformable motion tracking of non-rigid biological objects and remotely sensed objects such as clouds, atmospheric aerosols and gases, polar sea ice, or ocean currents are important application domains for the Semi-fluid Motion Analysis (SMA) algorithm. The focus of this paper is on the parallelization of the SMA algorithm for the MasPar MP-2 architecture. Implementation issues that were evaluated in order to make it feasible to explore dense semi-fluid motion estimates of rapid-scan time-varying geostationary satellite imagery of clouds and weather patterns are described. Cloud motion vectors from the SMA algorithm can be used to estimate the wind field, which would be useful in a variety of meteorological applications. Comparisons between the parallel and sequential implementations of the SMA algorithm, and with manual results, are briefly discussed.

1. Introduction
Automatic tracking of motion has a wide variety of applications ranging from the characterization of biological cells to global weather analysis. In previous research on motion analysis, objects in a scene are typically assumed to be undergoing rigid motions, which can be succinctly characterized by a translation vector and a rotation matrix. However, in natural scenes, including cloud tracking, non-rigid deformable motion predominates. Various continuous non-rigid motion types such as articulated, quasi-rigid, isometric, homothetic, conformal and elastic motion are described in [9]. The complexity of motion models ranges from rigid motion to full fluid motion, with rigid motion being the simplest to characterize and turbulent fluid motion being the most complex, with almost infinite degrees of freedom. Developing models for tracking non-rigid objects and groups of particles exhibiting deformable motion is a challenging problem due to the large number of degrees of freedom and the lack of comprehensive general non-rigid motion models. An analysis of general continuous non-rigid motion was presented in [8] and subsequently extended to semi-fluid motion in [12]. The semi-fluid motion model does not impose a global rigidity and continuity constraint on the surface as assumed by rigid motion models and usual optical flow methods. Estimation and segmentation of optical flow fields for multiple moving objects under the rigid motion assumption have been well studied [7], and a parallel implementation, on the MasPar MP-2, of the Horn and Schunck algorithm for estimating optical flow is described in [2].

Semi-fluid motion can be viewed as a combination of partially continuous non-rigid and partially fluid types of motion. The SMA algorithm allows for fluid motion of smaller surface patches with some global constraints similar to the continuous motion model. Such motion is exhibited frequently in nature, for example by multilayer and convective clouds, by ocean eddies and currents that maintain identifiable features in multispectral imagery, and by fission and fusion in biological microorganisms. The semi-fluid motion model proposed in [12] was successfully applied to estimating non-rigid cloud motion vectors for determining cloud height and winds. Using time-sequenced stereo images of clouds from two satellites in geosynchronous orbit, the semi-fluid motion model is used to estimate motion parameters for cloud tracking. Accurate measurements of cloud-top height distributions and winds are important for meteorological weather forecasting, analysis, modeling and assimilation [6] and for climate change studies.

The semi-fluid motion model can robustly track non-rigid cloud deformations and is also well-suited for tracking multi-layered clouds, since tracers in each layer are modeled as separate small surface patches with independent first order deformations.
© 1996 IEEE Proceedings of IPPS '96
1063-7133/96 $5.00
Authorized licensed use limited to: University of Missouri Libraries. Downloaded on July 31, 2009 at 20:38 from IEEE Xplore. Restrictions apply.
2. Semi-Fluid Motion Model
Acquiring synchronized multiview multispectral remote sensing data from geostationary orbit is an expensive and difficult process. The next generation of improved Geostationary Operational Environmental Satellites (GOES) recently became available to monitor global weather patterns. In the case of stereoscopic satellite imagery we refer to the two satellite views as the right and left time-varying image sequences, with a non-uniform time interval between two consecutive stereo pairs. Consider left and right stereo image pairs, $I_\ell(x, y, t)$ and $I_r(x, y, t)$, captured at $T$ different timesteps, $t \in \{t_0, t_1, \ldots, t_{T-1}\}$, where $I_*(t) = \{I_*(x, y, t),\ x = 0, 1, \ldots, N-1;\ y = 0, 1, \ldots, M-1\}$ is a sequence of multiview image arrays, each of size $M \times N$ pixels. Each stereo image pair $(I_\ell(t), I_r(t))$ is processed to estimate the corresponding surface map $z(t) = \{z(x, y, t),\ x = 0, 1, \ldots, N-1;\ y = 0, 1, \ldots, M-1\}$ of cloud-top heights at time $t$. The intensity and estimated surface maps $R = \{(I_\ell(t), z(t)),\ t = t_0, t_1, \ldots, t_{T-1}\}$ are currently used as input to the semi-fluid motion tracking algorithm when a stereo image sequence is available. Semi-fluid motion tracking can also be applied to a monocular or single satellite time sequence by treating the intensity data as a digital surface. In the latter case the interpretation of surface properties will differ between an intensity surface and a depth or z-surface, but the same algorithm can be applied to $R = \{I(t),\ t = t_0, t_1, \ldots, t_{T-1}\}$ to produce non-rigid motion correspondences between intensity pixels rather than surface pixels.
2.1. Stereo Analysis
The problem of estimating the disparity map is that of determining pixel correspondences between the two stereo images $I_\ell(t)$ and $I_r(t)$. The estimated disparity or depth maps can be transformed into surface maps $z(t)$ of cloud-top heights for time instant $t$ using satellite and sensor geometry information. Many techniques have been developed in the past to estimate disparity maps using remotely sensed stereo data for a variety of applications including cloud height measurement, digital terrain models, urban mapping and target identification. We have used an existing correlation-based Automatic Stereo Analysis (ASA) algorithm that has been parallelized for the MasPar MP-2 [12]. The ASA attempts to model aspects of the human visual system, particularly the multiresolution, hierarchical and coarse-to-fine based searching for identifying stereo correspondences. In the multiresolution approach the ASA uses the coarse disparity estimates to warp or transform one view into the other, thereby successively estimating smaller disparities at finer resolutions of the hierarchy. The disparity estimates at the coarse level will typically provide more reliable correspondence information but will lack detailed surface structure. The disparity estimates at finer levels are noisier but, within the coarse-to-fine approach, more accurate. In our implementation, the neighboring region of a pixel of interest is chosen as a square set of pixels centered on that pixel and defined as the stereo-analysis template. The size of this template determines the starting resolution level of stereo analysis, and image matching is done at several different resolutions, typically four levels, to produce the final dense disparity or depth maps.
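The coarse-to-fine idea can be illustrated with a small sketch. This is not the ASA implementation: the function names are hypothetical, a windowed sum of absolute differences stands in for the correlation measure, only horizontal (scan-line) disparities are searched, and instead of explicitly warping one view, each finer level simply searches a small range around the upsampled coarse estimate. Image sides are assumed to be divisible by 2^(levels-1).

```python
import numpy as np

def downsample(img):
    """Halve the resolution by averaging 2x2 blocks (sides must be even)."""
    h, w = img.shape
    return img.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def box_sum(a, half):
    """Sum of `a` over a (2*half+1)^2 window around each pixel (edge-padded)."""
    k = 2 * half + 1
    p = np.pad(a, half, mode="edge")
    c = np.pad(p.cumsum(0).cumsum(1), ((1, 0), (1, 0)))  # integral image
    return c[k:, k:] - c[:-k, k:] - c[k:, :-k] + c[:-k, :-k]

def refine(left, right, init, radius, half):
    """Per pixel, keep the disparity within init +/- radius with the lowest window cost."""
    h, w = left.shape
    cols = np.arange(w)[None, :].repeat(h, 0)
    rows = np.arange(h)[:, None].repeat(w, 1)
    best_cost = np.full((h, w), np.inf)
    best = init.copy()
    for step in range(-radius, radius + 1):
        d = init + step
        src = np.clip(cols + d, 0, w - 1)  # candidate column in the right image
        cost = box_sum(np.abs(left - right[rows, src]), half)
        better = cost < best_cost
        best_cost[better] = cost[better]
        best[better] = d[better]
    return best

def coarse_to_fine_disparity(left, right, levels=4, radius=2, half=2):
    """Estimate d such that left[y, x] matches right[y, x + d]."""
    pyramid = [(left.astype(float), right.astype(float))]
    for _ in range(levels - 1):
        l, r = pyramid[-1]
        pyramid.append((downsample(l), downsample(r)))
    disp = np.zeros(pyramid[-1][0].shape, dtype=int)
    for l, r in reversed(pyramid):          # coarsest level first
        if disp.shape != l.shape:
            # Upsample the coarse estimate and double it for the finer grid.
            disp = 2 * np.kron(disp, np.ones((2, 2), dtype=int))
        disp = refine(l, r, disp, radius, half)
    return disp
```

With a search radius of 2 at each of four levels, the sketch can follow disparities of roughly 2 x (8 + 4 + 2 + 1) = 30 pixels at full resolution while only ever testing five candidates per pixel per level, which is the practical appeal of the hierarchy.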
2.2. Motion Analysis
A novel approach to estimate non-rigid motion pixel correspondences using time-varying stereoscopic image pairs at consecutive time steps $t_m$ and $t_{m+1}$, for $m \in \{0, 1, \ldots, T-2\}$, is presented by Palaniappan, Kambhamettu, Hasler and Goldgof [12]. Once the surface maps, $z(x, y, t_m)$ and $z(x', y', t_{m+1})$, are estimated from the stereo pairs at $t_m$ and $t_{m+1}$, only one sequence of intensity images is used for computing the semi-fluid mapping. A more complex algorithm coupling both stereo images at both time steps is described in [10]. Typically, $I_\ell(t_m)$ and $I_\ell(t_{m+1})$ at times $t_m$ and $t_{m+1}$ respectively are used as the intensity images, since during stereo analysis the right images are rectified and warped to align them with the left images such that epipolar lines become parallel to scan lines. Referencing the pixel or surface element with Cartesian coordinates $(x, y)$ within image $I_\ell(t)$ as $I_\ell(x, y, t)$, and within surface $z(t)$ as $z(x, y, t)$, we are interested in estimating the after-motion location $z(x', y', t_{m+1})$ of surface element $z(x, y, t_m)$, using intensity information $I_\ell(x', y', t_{m+1})$ and $I_\ell(x, y, t_m)$ if required. Surface element $z(x', y', t_{m+1})$ is hypothesized to lie within a small bounded neighborhood of the surface element $z(x, y, t_m)$ before motion, where the surface correspondence neighborhood size is dependent upon the magnitude of the interval $t_{m+1} - t_m$ and the velocity of particles in the scene. For most applications, time-sequential stereo images and consequently surface maps are acquired at more or less regular intervals, so a fixed hypothesis neighborhood dependent upon the maximum particle velocity for all surface pairs $\{(z(t_m), z(t_{m+1})),\ m = 0, 1, \ldots, T-2\}$ can be used. To reduce computational complexity, the motion correspondence search neighborhood is chosen as a square set of $(2N_{zs}+1) \times (2N_{zs}+1)$ pixels centered around $z(x, y, t_{m+1})$.
The search or hypothesis neighborhood in the surface map is given as, $\eta_{zs}(x, y, t_{m+1}) = \{(x_s, y_s),\ x - N_{zs} \le x_s \le x + N_{zs},\ y - N_{zs} \le y_s \le y + N_{zs}\}$. For each surface element or pixel in the motion correspondence search area $\eta_{zs}$, the two steps described next are performed. Although the current implementation uses square template and search areas, rectangular areas can also be used and may lead to improved motion correspondence results.
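As a concrete picture of the hypothesis neighborhood, the following minimal sketch enumerates the candidate positions for one tracked pixel. The function name is hypothetical, and clipping the window at the image border is an assumption made here for illustration; the text does not say how border pixels are handled.

```python
def search_neighborhood(x, y, n_zs, width, height):
    """Hypothesis pixels (x_s, y_s) with |x_s - x| <= n_zs and |y_s - y| <= n_zs.

    Assumption: candidates falling outside the image are simply dropped.
    """
    xs = range(max(0, x - n_zs), min(width - 1, x + n_zs) + 1)
    ys = range(max(0, y - n_zs), min(height - 1, y + n_zs) + 1)
    return [(x_s, y_s) for y_s in ys for x_s in xs]

# An interior pixel with n_zs = 6 has (2 * 6 + 1) ** 2 candidate positions.
interior = search_neighborhood(100, 100, 6, 512, 512)
corner = search_neighborhood(0, 0, 6, 512, 512)
```

Each candidate in this list is one motion hypothesis for which the two steps that follow are carried out.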
Step 1: Select template mapping
A square set of $(2N_{zT}+1) \times (2N_{zT}+1)$ surface pixels centered around $z(x, y, t_m)$ is selected as the z-template neighborhood, $\eta_{zT}(x, y, t_m) = \{(x_s, y_s),\ x - N_{zT} \le x_s \le x + N_{zT},\ y - N_{zT} \le y_s \le y + N_{zT}\}$. The candidate template mapping for a small surface patch around a given surface pixel under the assumption of small local continuous deformation is defined as the set,

$$\{F_{cont}(x_s, y_s, t_m),\ (x_s, y_s) \in \eta_{zT}(x, y, t_m)\}, \qquad (1)$$

where,

$$F_{cont}(x, y, t_m) = (\hat{x}, \hat{y}, t_{m+1}), \qquad (2)$$

which maps pixels at $t_m$ from $\eta_{zT}(x, y, t_m)$ to the square neighborhood centered around the hypothesis $(\hat{x}, \hat{y})$ at $t_{m+1}$, with $\eta_{zT}(\hat{x}, \hat{y}, t_{m+1}) = \{(x_s, y_s),\ \hat{x} - N_{zT} \le x_s \le \hat{x} + N_{zT},\ \hat{y} - N_{zT} \le y_s \le \hat{y} + N_{zT}\}$ and for a template displacement of $(\hat{x}_0, \hat{y}_0)$. The mapping $F_{cont}(\circ)$ varies for each surface pixel since overlapping templates are used for evaluating motion correspondence parameters.

Step 2: Compute motion parameters to minimize error
Under the assumption of small continuous local surface deformations, the error associated with the motion correspondence hypothesis that surface element $z(x, y, t_m)$ corresponds to surface element $z(\hat{x}, \hat{y}, t_{m+1})$ after motion can be evaluated by measuring the difference between the observed and expected behavior of the surface normals, as follows,

$$\epsilon(x, y; \hat{x}, \hat{y}) = \sum_{(x_t, y_t) \in \eta_{zT}(x, y, t_m)} \left(\epsilon_1^2 + \epsilon_2^2\right), \qquad (3)$$

where $\epsilon_1$ and $\epsilon_2$ are given by equations (4) and (5).

[Equations (4) and (5), which define $\epsilon_1$ and $\epsilon_2$ in terms of the normal components, the motion parameters, the partial derivatives $\partial z/\partial x$ and $\partial z/\partial y$, and the coefficients $E$ and $G$, are not legible in this copy.]

The vectors $[n_i, n_j, n_k]$ and $[n'_i, n'_j, n'_k]$ represent the orthogonal components of the unit normal at the surface element $z(x, y, t_m)$ and the corresponding element $z(x', y', t_{m+1})$ after a non-rigid motion respectively, under the transformation $F_{cont}(x, y, t_m)$. In (4) and (5), $E = 1 + (\partial z/\partial x)^2$ and $G = 1 + (\partial z/\partial y)^2$ are coefficients of the first fundamental form, and $\{a_i, b_i, a_j, b_j, a_k, b_k\}$ are the unknown motion parameters of a small patch undergoing a local affine transformation that varies on a pixel by pixel basis [8], [12] and is described as,

$$x' = x + (a_i x + b_i y + x_0), \quad y' = y + (a_j x + b_j y + y_0), \quad z' = z + (a_k x + b_k y + z_0). \qquad (6)$$

The local transformation models the non-rigid neighborhood relationship before and after motion, with $(x_0, y_0, z_0)$ being the rigid translation component of the motion. Each $z(t_m)$ and $z(t_{m+1})$ pixel within the neighborhoods $\eta_{zT}(x, y, t_m)$ and $\eta_{zT}(\hat{x}, \hat{y}, t_{m+1})$ respectively is fitted with a continuous quadratic surface patch centered at that pixel. Least squares surface fitting using a surface-patch neighborhood of $(2N_z+1) \times (2N_z+1)$ pixels centered around the pixel of interest leads to solving a 6 x 6 linear system using the Gaussian-elimination method. These quadratic surface patches are then used to compute the unit normals in the surface maps at each pixel within $\eta_{zT}(x, y, t_m)$ and $\eta_{zT}(\hat{x}, \hat{y}, t_{m+1})$.

The error expression (3) can be minimized by differentiating with respect to the six unknown motion parameters and setting the six first partial derivatives to zero [8]. This leads to another system of linear equations, which is solved using Gaussian-elimination to obtain the six motion parameters $\{a_i, b_i, a_j, b_j, a_k, b_k\}$.

Once Step 1 and Step 2 are performed, the error $\epsilon(x, y; \hat{x}, \hat{y})$ can be computed for all candidate hypotheses in the search area $\eta_{zs}(x, y, t_{m+1})$. The estimated non-rigid correspondences between surface elements $z(x, y, t_m)$ and $z(x', y', t_{m+1})$ under a small continuous deformation are given by,

$$(x', y') = \arg\min_{(\hat{x}, \hat{y}) \in \eta_{zs}(x, y, t_{m+1})} \epsilon(x, y; \hat{x}, \hat{y}). \qquad (7)$$

2.3. Semi-Fluid Motion
The semi-fluid motion paradigm relaxes the local continuity constraint for a small surface patch. A new template mapping $F_{semi}(\circ)$ based on the intensity image is developed, instead of the $F_{cont}(\circ)$ transformation in (2). For each pixel $(x_a, y_a) \in \eta_{zT}(x, y, t_m)$, a square set of $(2N_{ss}+1) \times (2N_{ss}+1)$ semi-fluid search neighborhood pixels is selected as $\eta_{ss}(\hat{x}_a, \hat{y}_a, t_{m+1}) = \{(x_s, y_s),\ \hat{x}_a - N_{ss} \le x_s + \hat{x}_0 \le \hat{x}_a + N_{ss},\ \hat{y}_a - N_{ss} \le y_s + \hat{y}_0 \le \hat{y}_a + N_{ss}\}$ and centered around $I_\ell(\hat{x}_a, \hat{y}_a, t_{m+1}) = I_\ell(x_a + \hat{x}_0, y_a + \hat{y}_0, t_{m+1})$. The semi-fluid template mapping is given by the set of semi-fluid correspondences for each pixel within the template,

$$\{F_{semi}(x_a, y_a, t_m),\ (x_a, y_a) \in \eta_{zT}(x, y, t_m)\}, \qquad (8)$$
where,

$$F_{semi}(x_a, y_a, t_m) = \arg\min_{(\hat{x}_s, \hat{y}_s) \in \eta_{ss}(\hat{x}_a, \hat{y}_a, t_{m+1})} \epsilon_{semi}(x_a, y_a; \hat{x}_s, \hat{y}_s). \qquad (9)$$

Note that $F_{semi}(\circ)$ specifies a continuous mapping for a smaller neighborhood of size $(2N_{sT}+1) \times (2N_{sT}+1)$, just as $F_{cont}(\circ)$ specifies a continuous mapping for the entire template area of size $(2N_{zT}+1) \times (2N_{zT}+1)$. When $N_{ss} = 0$, $F_{semi}(\circ)$ reduces to the mapping $F_{cont}(\circ)$. The semi-fluid error $\epsilon_{semi}$ is defined in (10) and (11) using the discriminant of the intensity (rather than the depth or z-) surface before and after motion, which measures area changes of a small intensity surface patch.

[Equations (10) and (11), which define $\epsilon_{semi}$ over the semi-fluid template neighborhood $\eta_{sT}$ in terms of the change of the intensity surface discriminant from $D$ to $D'$, are not legible in this copy.]

This semi-fluid template mapping specifies motion correspondences for pixels from the neighborhood $\eta_{zT}(x, y, t_m)$ to a fragmented non-continuous neighborhood $\eta_{zT}(\hat{x}, \hat{y}, t_{m+1})$ that is more general than the continuous mapping defined in (2). The $(2N_{sT}+1) \times (2N_{sT}+1)$ pixel semi-fluid template neighborhood in (10) and (11) is given by $\eta_{sT}(\hat{x}_a, \hat{y}_a, t_{m+1}) = \{(x_t, y_t),\ \hat{x}_a - N_{sT} \le x_t \le \hat{x}_a + N_{sT},\ \hat{y}_a - N_{sT} \le y_t \le \hat{y}_a + N_{sT}\}$. The changes in the intensity surface discriminant from $D$ to $D'$ at pixel location $(x_a, y_a)$ are computed after fitting local surface patches as described in Step 2 of §2.2, but using the intensity image. In determining a new semi-fluid template mapping, the components of the surface element normal after motion $[n'_i, n'_j, n'_k]$ in (4) and (5) are associated with the surface element $z(x', y', t_{m+1})$ under the transformation $F_{semi}(x, y, t_m)$ instead of $F_{cont}(x, y, t_m)$.

3. Massively Parallel SIMD Computer
The estimation of dense semi-fluid motion fields is currently impractical on sequential computers, including the high performance Silicon Graphics Inc. Onyx 2/VTX R8000/90 with a peak performance of 360 megaflops (Mflops). The time to estimate one semi-fluid mapping grows exponentially with the size of the template. The SIMD model has been shown to be a highly cost effective architecture for scientific computing in a variety of Earth and Space Science applications including image processing [5]. The MasPar MP-2 can achieve a sustained performance that is 60% of the advertised peak performance of 6.3 gigaflops (GFlops) for single precision floating point operations [5], 2.4 GFlops for double precision and 68 billion integer instructions per second (BIPS).

A short description of the computational burden and amount of data that needs to be handled in computing a semi-fluid flow field is given, in order to appreciate the need for parallelization. Table 1 shows a typical set of neighborhood sizes that was used for processing a short hurricane stereo sequence as described in §5.1. The images were M x N = 512 x 512 pixels, and in the parallel implementation a dense motion field for 262144 pixels is estimated for each image pair, $I_\ell(t_m)$ to $I_\ell(t_{m+1})$. In order to find a non-rigid semi-fluid motion correspondence for each pixel, 13 x 13 = 169 Gaussian-eliminations are performed to solve for the motion parameters in (3), and then 169 error terms are evaluated using (3) to search for the minimum. To compute each error term, 121 x 121 = 14641 error terms of (4) and (5) are computed, which requires a semi-fluid template mapping (9) to be established for each of the 14641 pixels within the template. Estimating the semi-fluid template mapping for each pixel requires evaluating 3 x 3 = 9 error terms to obtain (9). In order to obtain each of these semi-fluid error terms, 5 x 5 = 25 parameters of (11) need to be computed for each pixel within the semi-fluid surface-patch neighborhood. Local surface patches are fit for each pixel in both the intensity and surface images at both time steps, $I_\ell(t_m)$, $I_\ell(t_{m+1})$, $z(t_m)$ and $z(t_{m+1})$, by solving a system of six linear equations, so over one million (4 x 512 x 512 = 1048576) separate Gaussian-eliminations are needed to estimate all of the local surface patch parameters. The complexity of the SMA algorithm, along with the eventual aim of handling large volumes (several gigabytes per hour) of realtime meteorological imagery, motivated a parallel implementation on the MasPar MP-2, which is architecturally identical to the Digital Equipment Corp. DEC mpp 12000 Sx/Model 200.

Table 1. Neighborhood sizes used for processing Hurricane Frederic stereo time sequence datasets of size (M x N = 512 x 512).

Neighborhood Type      Variable         Window Size in Pixels
Surface-fitting        $N_z = 2$        5 x 5
z-Search area          $N_{zs} = 6$     13 x 13
z-Template             $N_{zT} = 60$    121 x 121
Semi-fluid template    $N_{sT} = 2$     5 x 5

3.1. MasPar MP-2 Architecture
The MasPar MP-2 at NASA Goddard is a Single Instruction, Multiple Data (SIMD) massively parallel machine
maximally configured with 16384 processors arranged in a rectangular 8-way nearest neighbor mesh of size nyproc x nxproc = 128 x 128, operating under the control of an Array Control Unit. In SIMD or data parallel systems a single program instruction can execute simultaneously on all of the Processor Elements (PEs). Each PE on the MP-2 is a custom 32-bit RISC processor with a separate floating point unit, an 80 ns clock cycle (12.5 MHz), 40 user accessible (52 total) 32-bit registers (variables of type double and long long use two registers), and up to 256 kilobytes (KB) of PE (parallel data) memory (RAM) [1], [11]. The Goddard MP-2 is configured with 64 KB per PE for an aggregate total of one gigabyte (GB) of parallel data storage [5]. The hierarchical hardware organization of the processors, starting with two 4 x 4 PE clusters per chip, 32 chips per PE board and 16 boards for a total of 16K processors, has important implications for data mapping and inter-processor communication of pixel data. The PE memory to register load and store operations depend upon the size of the operand and have an average bandwidth of 22.4 GB/s for direct plural (parallel) data and 10.6 GB/s for indirect (pointer) plural data. The 2-D array of PEs is interconnected in an 8-way nearest neighbor X-net mesh as shown in Fig. 1. Direct communication using X-nets has an aggregate bandwidth of 23.0 GB/s using register to register transfers, which offers balanced access from CPU to neighboring CPU comparable to that between CPU and memory. PEs are not only mesh connected but can also communicate with each other through a multistage circuit-switched interconnection network known as the Global Router. The Goddard MasPar MP-2 has a three stage global crossbar router network. The router can sustain data transfers between distant processors, in terms of mesh connections, at 1.3 GB/s, based on four quadrants of 256 MB/s memory to I/O RAM channels controlled by the routers [3].
The X-net bandwidth is therefore about 18 times higher than that of router communication. Exploiting the X-net bandwidth was important to the successful implementation of the SMA algorithm. Because each PE has only 64 KB of RAM, disk I/O may be necessary. The Goddard MP-2 has two RAID-3 8-way striped MasPar Parallel Disk Arrays (MPDA) that deliver a sustained performance of over 30 MB/s across a 200 MB/s MPIOC channel [3]. The high throughput of the MPDA was exploited in running the SMA algorithm on a dense sequence of 490 frames of GOES-9 data.
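The operation counts and hardware figures quoted in this section can be checked with a few lines of arithmetic. The sketch below only restates numbers from the text. The variable names are illustrative, `n_ss = 1` is inferred from the "3 x 3 = 9" error terms mentioned above, and the final product is a naive upper count that assumes nothing is reused between overlapping templates.

```python
def side(n):
    """Side length in pixels of a square neighborhood with half-width n."""
    return 2 * n + 1

# Problem sizes (Table 1 and the text of Section 3).
M = N = 512
n_zs, n_zT, n_ss, n_sT = 6, 60, 1, 2

pixels = M * N                          # tracked pixels per image pair
hypotheses = side(n_zs) ** 2            # Gaussian-eliminations per pixel
template_pixels = side(n_zT) ** 2       # terms of (4), (5) per error term
semi_fluid_candidates = side(n_ss) ** 2 # error terms per template mapping (9)
patch_pixels = side(n_sT) ** 2          # parameters of (11) per error term
surface_fits = 4 * M * N                # two images and two surfaces

# Naive count of semi-fluid error evaluations per image pair (no reuse assumed).
naive_semi_error_terms = (pixels * hypotheses * template_pixels
                          * semi_fluid_candidates)

# Hardware figures (Section 3.1).
pes = 128 * 128
aggregate_bytes = pes * 64 * 1024       # 64 KB per PE
clock_mhz = 1e3 / 80                    # 80 ns cycle
xnet_over_router = 23.0 / 1.3           # GB/s over GB/s
sustained_gflops = 0.60 * 6.3           # 60% of single precision peak

print(f"{naive_semi_error_terms:.2e} naive semi-fluid error evaluations")
print(f"X-net / router bandwidth ratio: {xnet_over_router:.1f}")
```

The naive count is in the trillions per image pair, which is why the reuse of template mappings discussed later in the paper matters as much as the raw parallelism.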
Figure 1. Two-dimensional PE array-based indexing of processors, (ixproc, iyproc), for nyproc x nxproc = 128 x 128, with 8-way interconnection X-net mesh (toroidal connections not shown).

3.2. Mapping Data onto the PE Array
A typical image with dimensions M x N = 512 x 512 pixels cannot be stored on the MasPar MP-2 128 x 128 processor grid without storing several pixels per PE. This motivates deriving an efficient data mapping or folding of the regular 2-D image array, where spatial neighborhood relationships are important, onto the PE array for efficient exchange of the PE pixel data. The data folding can significantly affect the total run-time of the algorithm by introducing inter-processor communication delays. A 2-D hierarchical mapping of plural data onto the PE array [4], instead of a cut-and-stack data mapping, was chosen to minimize latency and inter-processor communication, since neighboring pixels are stored on neighboring processors. The SMA algorithm primarily needs access to pixels within a small (square) neighborhood centered around the (receiving) pixel of interest. A 2-D hierarchical data mapping reduces the total number of mesh transfers needed to fetch all pixels within a local neighborhood.

The set of available processors can be viewed or addressed as either a one- or two-dimensionally arranged array. For image analysis algorithms the two-dimensional arrangement is more natural. Each processor in the rectangular array of processors can be indexed using Cartesian coordinates (ixproc, iyproc), which are predefined plural variables in MPL, as shown in Fig. 1. In order to store an M x N image, the number of pixels stored per PE in a 2-D hierarchical format is,

$$\mathit{yvr} \times \mathit{xvr} = \left\lceil \frac{M}{\mathit{nyproc}} \right\rceil \times \left\lceil \frac{N}{\mathit{nxproc}} \right\rceil.$$

For example, to map a 512 x 512 image onto a 128 x 128 PE array would require storing 16 pixels per PE. A schematic representation of the 2-D hierarchical data mapping is shown in Fig. 2, with data layers being contiguous memory locations within a PE. The mapping of data pixels to PEs within the PE array and to memory locations within a PE is given by,

$$\mathit{iyproc} = y \ \mathrm{div}\ \mathit{yvr}, \quad \mathit{ixproc} = x \ \mathrm{div}\ \mathit{xvr}, \quad \mathit{mem} = (x \bmod \mathit{xvr}) + \mathit{xvr}\,(y \bmod \mathit{yvr}), \qquad (12)$$

where (x, y) is the Cartesian coordinate of the data pixel stored, (ixproc, iyproc) is the Cartesian coordinate of the
PE and mem is the memory location within that PE where the pixel data is stored. The 2-D hierarchical mapping is a one-to-one mapping with the inverse given by,

$$x = (\mathit{ixproc} \times \mathit{xvr}) + (\mathit{mem} \bmod \mathit{xvr}), \quad y = (\mathit{iyproc} \times \mathit{yvr}) + (\mathit{mem} \ \mathrm{div}\ \mathit{xvr}). \qquad (13)$$

Although satellite imagery used for cloud tracking will in general be rectangular 2-D arrays, we assume without any loss of generality that the image arrays are square (M = N).

Figure 2. Two-dimensional hierarchical data mapping for nyproc = nxproc = 2 and M x N = 4 x 4. (D = data element or pixel; L = data layer; solid lines: Processor Element (PE) boundaries; dashed lines: memory layer boundaries on the same PE.)

4. Parallel Implementation Issues
A sequential (un-optimized) version of the semi-fluid motion tracking algorithm was used as a baseline for checking the correctness of the parallel algorithm results, for testing, and for selecting neighborhood parameters to use in the parallel version. The parallel implementation was designed to track all pixels in the mem-th memory layer in parallel and then repeat the process for each layer. Two areas of optimization were identified and explored as part of the design of the parallel implementation: reducing computational complexity and reducing inter-processor communication delays.

4.1. Optimizing Computation
The template mapping in (9) is computed for all pixels in the template neighborhood of $I_\ell(t_m)$ and is computationally very expensive, as is apparent from (10) and (11). Since we track all pixels $I_\ell(x, y, t_m)$ within $I_\ell(t_m)$, the corresponding template neighborhoods overlap each other. To avoid recomputing the template mapping (9) for overlapping pixels within the template neighborhood, it is more efficient to pre-compute the template mapping for all pixels in $I_\ell(t_m)$, but from (10), (11) and (2) it is obvious that the template mapping is also a function of $(x_s, y_s)$. Since $(x_s, y_s) \in \eta_{zs}(x, y, t_{m+1})$ spans a neighborhood of $(2N_{zs}+1) \times (2N_{zs}+1)$ pixels, we pre-compute $(2N_{zs}+1) \times (2N_{zs}+1)$ template mappings for each pixel $I_\ell(x, y, t_m)$ within $I_\ell(t_m)$. A template mapping is computed for each pixel $(x_s, y_s)$ in the $(2N_{zs}+1) \times (2N_{zs}+1)$ neighborhood centered around $(x, y)$ and stored in the processor housing the pixel $(x, y)$ for which the template mappings are computed. Additional optimization is achieved by computing the $(2N_{zs}+1)^2$ template mappings for each pixel by first computing the error term in (10) for all pixels in a $(2N_{zs}+2N_{ss}+1) \times (2N_{zs}+2N_{ss}+1)$ neighborhood centered around the pixel being tracked, and then applying a $(2N_{ss}+1) \times (2N_{ss}+1)$ window centered on each pixel within the $(2N_{zs}+1) \times (2N_{zs}+1)$ neighborhood and performing the minimization given in (9).

Each template mapping (9) is dependent on the search neighborhood, $(2N_{zs}+1) \times (2N_{zs}+1)$, and the $n'_k$ component of the normal after motion as described in §2.2. In order to reduce the amount of redundant computation, each template mapping could be represented by storing the three normal components $[n'_i, n'_j, n'_k]$ for each possible motion correspondence within the search area. However, the minimization of (3) can be shown to be a function of only $(n'_i + n'_j)$ and $n'_k$, so two values per mapping suffice. Even so, the 64 KB of PE memory is a severe constraint at this point in the implementation of the SMA algorithm. For example, storing just two floating-point numbers for each precomputed template mapping, for a relatively small search area of 23 x 23 and with 16 pixel elements stored per PE, would still require 67.7 KB per PE, which exceeds the 64 KB available per PE (1.0 GB of data memory in aggregate). So the total space required to store the precomputed template mappings needs to be segmented or chunked.

The template mapping data cannot be segmented by memory layer, that is, by the layers of data pixels being tracked within a PE, because the sum in (3) runs over a neighborhood and so requires all layers to have their corresponding template mappings available. Instead, the key observation is that the template mapping data can be segmented by hypothesis or search area. The data chunks or segments are multiples of rows of the search or hypothesis neighborhood, with each row containing $(2N_{zs}+1)$ template mappings. Each segment can be independently computed and processed, by computing and storing the corresponding $(2N_{zs}+1)$ error terms in (3) for each of its rows. The segment can then be discarded and the next chunk computed to generate the remaining error terms. Once all the segments are processed, the equivalent minimization of (7) is complete.

4.2. Reducing Inter-Processor Communication
With typical neighborhood sizes as given in Table 1, the most inter-processor communication delays are introduced
4.3. PE Memory Requirements One of the bottlenecks while designing the parallel implementation was the memory constraint of 64 KB per PE for the MasPar. A great deal of effort was devoted to handle this constraint as efficiently as possible leading to the template mapping data segmentation scheme. For the purpose of this implementation we have chosen the same size for the fluid-template and surface-patch neighborhood i.e. N, = N,T. Defining each segment as 2 rows of the (2N,, 1) x (2N,, 1) pixel hypothesis neighborhood, our implementation requires a total of
+
+
[15 Z(2hr,, max{8(2N,
Figure 3.Snake-like read-out of data: shading shows data array to be read, dashed lines are pixel boundaries, and arrow shows the data read-out path.
+ l)] x + 1)'+
x[Z(2Nzs
ZWT
x ywr x 4
+
+ 1 + 2N,,) + 442(21\3,, + I)}
4 x xvr x ywr x (2N,,
+ 1) + 2N,,],
288
bytes of parallel data memory per processor.
during the accumulation of two matrices, just before using Gaussian-elimination to minimize (3). Since geometric parameters are only fetched from a neighborhoodof PES, using the mesh connections to transfer data will be faster than using the router. Two different schemes of accumulating geometric parameters from neighboring PES were explored and are described below.
5. Results GOES-8 and GOES-9 are the first of a new generation of weather satellites that were successfully launched and became operational in 1995 and 1996. Experiments were performed using GOES stereoscopic time sequences of Hurricane Frederic, Hurricane Luis, and a mid-afternoon thunderstorm over southern Florida. For Hurricane Luis and the Florida thunderstorm sequence, temporally dense rapidscan data with an approximate 1.5 minute time interval from only one satellite (GOES-9) was used. For these latter two sequences the intensity images were treated as digital surfaces in applying the non-rigid motion analysis algorithm since stereo imagery was not used. For Hurricane Luis the model Fcont(e)was used with a z-template of 11 x 11, and z-search of 9 x 9 to process a dense sequence of 490 frames. The MP-2 parallel SMA algorithm took approximately 6.0 min per pair of images resulting in a speed-up of over 150 when compared to the sequential version.
Ordered Memory-Queued Mesh Transfer Using Snake Read-Out Since the data is stored using the same two-dimensional hierarchical mapping as the images, each data to be fetched from the neighborhood can be viewed as an image of the same dimension M x M . Let's assume that each pixel in image I ( t ) needs to fetch (2N,* 1) x ( ~ N , T 1) neighborhood pixels of surface or disparity data array z ( t ) centered on that pixel. Surface data z ( t ) is shifted in a snake like fashion as shown in Fig. 3. Each shift is equivalent to one inter-processor X-net mesh shift of z ( t ) with the pixel popped from one end of the memory array and mem sequential shifts of ~ ( twithin ) the PE to accommodate the incoming mesh shifted pixel within the static memory array. The data read is ordered in the fashion the snake path is followed.
5.1. Hurricane Frederic Data
Stereoscopic visible channel cloud imagery of Hurricane Frederic, obtained using the GOES-6 (East) and GOES-7 (West) satellites on September 12, 1979, was used to test the parallel implementation of the SMA algorithm using the semi-fluid model. When the stereo imagery was acquired, GOES-6 and GOES-7 subtended an angle of about 135° with respect to the center of the Earth, providing a very large baseline for stereo analysis. Four time-sequential 512 x 512 pixel image pairs (T = 4) obtained at approximately 7.5 minute intervals were processed using the neighborhood sizes given in Table 1. Pixels in the center of the image span approximately 1 sq-km whereas pixels near the borders span
Unordered Variable PE Window Mesh Transfer Using Raster Scan Read-Out
In this approach, data is read one memory layer at a time. For each memory layer, a PE bounding box and a PE memory bounding box are established, marking the neighborhood pixels corresponding to that layer. The layer is then read out in raster scan fashion (snake read-out, though more efficient, cannot be used here since the bounding boxes are not necessarily square). This approach was found to be faster and was thus incorporated within the implementation.
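To show why the per-layer bounding boxes need not be square, here is a minimal Python sketch. It assumes a hierarchical mapping in which each PE stores a contiguous b x b block of pixels and a "memory layer" is one position (lr, lc) inside that block. The function names, the block size b = 4 and the pixel position are hypothetical, and border handling is ignored. It is an illustration of the idea, not the MPL code used on the MP-2.

```python
def _pe_range(center, n, layer, b):
    """Inclusive range of PE indices p with center-n <= p*b + layer <= center+n."""
    lo = -(-(center - n - layer) // b)   # ceiling division
    hi = (center + n - layer) // b       # floor division
    return lo, hi


def layer_boxes(r, c, n, b):
    """Map each memory layer (lr, lc) to its PE bounding box.

    The box (row_lo, row_hi, col_lo, col_hi) covers the PEs whose pixel at
    that layer lies inside the (2n+1) x (2n+1) neighborhood of pixel (r, c).
    """
    boxes = {}
    for lr in range(b):
        for lc in range(b):
            row_lo, row_hi = _pe_range(r, n, lr, b)
            col_lo, col_hi = _pe_range(c, n, lc, b)
            if row_lo <= row_hi and col_lo <= col_hi:
                boxes[(lr, lc)] = (row_lo, row_hi, col_lo, col_hi)
    return boxes


def raster_read(box):
    """List the PEs of one bounding box in raster scan order."""
    row_lo, row_hi, col_lo, col_hi = box
    return [(p, q)
            for p in range(row_lo, row_hi + 1)
            for q in range(col_lo, col_hi + 1)]


boxes = layer_boxes(r=100, c=100, n=7, b=4)   # 15 x 15 neighborhood
print(boxes[(0, 1)], len(raster_read(boxes[(0, 1)])))
```

In this example layer (0, 1) yields a 3 x 4 box of PEs, while the boxes over all layers together cover each of the 225 neighborhood pixels exactly once.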
Authorized licensed use limited to: University of Missouri Libraries. Downloaded on July 31, 2009 at 20:38 from IEEE Xplore. Restrictions apply.
Subroutine                     Time (sec)
Surface fit                    2.503216
Compute geometric variables    0.037088
Semi-fluid mapping             66.85848
Hypothesis matching            33403.162992
Total                          33472.561776
Table 2. Timing analysis for a single Hurricane Frederic image pair.
approximately 4 sq-km due to the larger field-of-view. The wind barbs show the manual estimate of cloud-top wind velocity and direction, which was obtained for 32 particles (pixels) for comparison with the SMA estimates. Manual cloud tracking was done by an expert meteorologist and the manual results were treated as the reference or true estimate. The parallel algorithm obtained the same result as the sequential implementation, with a root-mean-squared error of less than one pixel with respect to the manual estimates. Though all 512 x 512 pixels within the image were tracked using the parallel algorithm, only the 32 pixels (marked by 3 x 3 crosses) corresponding to the manually tracked wind barbs were compared and visualized in [12].

Table 2 gives the breakdown of the execution times for the parallel algorithm applied to the Hurricane Frederic dataset. The template mapping data was not segmented during this run i.e. 2 = 2N,, 1. The total time taken to process a pair of Hurricane Frederic images is 9.298 hours, compared to a projected time of 397.34 days using the sequential implementation, yielding an execution speedup of 1025, which is over three orders of magnitude. Figure 4 shows timing results for the sequential SMA algorithm running on an SGI R8000 90 MHz CPU (compiled with the -O3 flag) for various z-Template sizes ranging from 11 x 11 to 131 x 131. It can be used to estimate the time required to compute semi-fluid or continuous motion results by multiplying the per-pixel time by the number of pixels in the z-Search window and by the number of pixels in the image. Using Fig. 4 gives a slight underestimate of 313 days compared to 397 days, due to the nonlinear scalability factor in the timing dependence on the z-Search window parameter.
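The timing arithmetic in this paragraph can be reproduced with a few lines of Python. The sketch below uses only the figures quoted in the text and in Table 2. The helper names and the per-pixel time passed to `estimate_sequential_seconds` are hypothetical, since the values plotted in Fig. 4 are not reproduced here.

```python
SECONDS_PER_DAY = 86400.0


def estimate_sequential_seconds(per_pixel_seconds, search_side, image_side):
    """Linear estimate: per-pixel time x search-window pixels x image pixels.

    As noted in the text, this ignores the nonlinear dependence on the
    z-Search window size and therefore tends to underestimate.
    """
    return per_pixel_seconds * search_side ** 2 * image_side ** 2


def speedup(sequential_seconds, parallel_seconds):
    """Ratio of sequential to parallel run time."""
    return sequential_seconds / parallel_seconds


parallel_s = 33472.561776                  # Table 2 total, in seconds
sequential_s = 397.34 * SECONDS_PER_DAY    # projected sequential time

print(round(parallel_s / 3600, 3), "hours on the MP-2")
print(int(speedup(sequential_s, parallel_s)), "x speedup")

# Hypothetical per-pixel time of 1 ms with a 9 x 9 search window
# on a 512 x 512 image, converted to days.
print(estimate_sequential_seconds(1e-3, 9, 512) / SECONDS_PER_DAY, "days")
```

The first two printed values correspond to the 9.298 hours and the speedup of 1025 quoted above.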
Figure 4. Time to compute a single pixel correspondence for varying z-Template sizes.
Neighborhood Type    Variable    Window Size in Pixels
Search Area          N_zs = 7    15 x 15
Template             N_zT = 7    15 x 15
Surface-patch        N_s = 2     5 x 5

Table 3. Neighborhood sizes for processing GOES-9 datasets of size (M x N = 512 x 512).
5.2. GOES-9 Florida Thunderstorm Data
GOES-9 was the latest to be launched in the series of GOES satellites and will become operational in 1996. A sequence of 49 images of clouds over the Florida area was obtained during a test phase on July 2, 1995. These images have been processed to extract a 512 x 512 region of interest around Florida. The interval between the images is approximately 1 minute and stereo data was not available.

We redefine Ie(tm) = I(tm), where I(tm) is the image obtained at time tm, and approximate the disparity maps by D(tm) = I(tm). Since the temporal sampling is dense compared to the Hurricane Frederic dataset, the continuous template mapping of (2) was used rather than the semi-fluid model (9), using the window sizes given in Table 3. Figure 6 shows the results of the parallel SMA algorithm for four out of 48 time steps, with an average interval of about twenty minutes. All of the 512 x 512 pixels within the images were tracked, but for the purpose of visualization we show the results only for every 10th pixel and over cloudy regions.

In order to compare the time needed to compute semi-fluid motion with continuous motion analysis, as well as sequential with parallel implementations, Table 4 gives the execution times of the parallel implementation for one timestep. The total run-time is 12.854 minutes compared to 41.357 hours for the sequential implementation, a run-time gain factor of 193. This gain is much smaller than the gain of 1025 for the Frederic dataset because the semi-fluid template mapping of (9), where the parallel implementation was optimized most, is not needed for the continuous non-rigid motion model.
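The run-time figures quoted for this dataset (a parallel total of 12.854 minutes against 41.357 hours sequentially) can be cross-checked from the Table 4 total. The symbols T_par and T_seq are introduced here only for this check and do not appear elsewhere in the paper.

```latex
\begin{align*}
T_{\mathrm{par}} &= 771.218708\ \mathrm{s} = \frac{771.218708}{60}\ \mathrm{min} \approx 12.854\ \mathrm{min},\\
T_{\mathrm{seq}} &= 41.357\ \mathrm{h} = 41.357 \times 3600\ \mathrm{s} = 148885.2\ \mathrm{s},\\
\frac{T_{\mathrm{seq}}}{T_{\mathrm{par}}} &= \frac{148885.2}{771.218708} \approx 193.
\end{align*}
```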
Subroutine                                   Time (sec)
Surface fit & compute geometric variables    2.4609
Hypothesis matching                          768.7578
Total                                        771.218708

Table 4. Timing analysis for one timestep or pair of GOES-9 Florida thunderstorm images.

Figure 6. Cloud tracking results for GOES-9 Florida thunderstorm rapid scan imagery showing four timesteps: top left, 18:02:45 GMT; top right, 18:20:10; bottom left, 18:39:42; bottom right, 18:5X:06.

6. Conclusions
A parallel algorithm for semi-fluid motion analysis that can produce dense motion fields (at every pixel) for a time-varying sequence was implemented on the massively parallel MasPar MP-2 computer. The SMA algorithm can be applied to a pair of time-varying intensity data, surface or range data, or both intensity and surface data. The parallel implementation is over two orders of magnitude faster than the sequential version and makes the estimation of dense non-rigid motion fields more practical. A number of experiments using real GOES satellite imagery were done to successfully validate the parallel implementation. Future work involves using adaptive hierarchical non-square template and search windows, using multispectral information, coupling stereo and motion estimation, improving the accuracy of the estimated motion field by using robust estimation, relaxation labeling or regularization, and post-processing the motion field by using cloud classification.

7. Acknowledgements
Research funding was provided by the Applied Information Systems Research Program (NASA NRA-93-OSSA09) managed by Glenn Mucklow, in the Office of Space Science at NASA HQ. Albert E Shpuntoff from MasPar Computer Corporation provided useful advice regarding efficient ways of neighborhood pixel fetching. GOES-9 data sets were acquired with the help of Marit Jentoft-Nilsen using the real time ingest system set up by Dr. Dennis Chesters, the GOES project scientist at NASA/GSFC.

References
[1] MasPar MP-2 Parallel Application Language (MPL) User Guide. Software Version 3.2, Rev. A5, MasPar Computer Corporation, 1993.
[2] A. Branca, A. Distante, and H. R. P. Ellingworth. Parallel motion computing on the MasPar MP-2 machine. Proc. of the 9th Int. Parallel Processing Symposium, pages 712-716, 1995.
[3] T. A. El-Ghazawi. Characteristics of the MasPar parallel I/O system. Proc. 5th Symp. on Frontiers of Massively Parallel Computation, pages 265-272, 1995.
[4] M. Faisal, A. D. Lanterman, and D. L. Snyder. Implementation of a modified Richardson-Lucy method for image restoration on massively parallel computer to compensate for space-variant point spread of a charge-coupled device camera. Journal of the Optical Society of America A, 12(12):2593-2603, 1995.
[5] J. R. Fischer, L. E. Hamet, C. M. Mobany, J. A. Pedelty, et al. The practicality of SIMD for scientific computing. Proc. 5th Symp. on Frontiers of Massively Parallel Computation, pages 258-264, 1995.
[6] A. F. Hasler. Stereoscopic measurements. In P. K. Rao, S. J. Holms, R. K. Anderson, J. Winston, and P. Lehr, editors, Weather Satellites: Systems, Data and Environmental Applications, volume Section VII-3, pages 231-239. Amer. Meteor. Soc., Boston, MA, 1990.
[7] Y. Huang, K. Palaniappan, X. Zhuang, and J. Cavanaugh. Optic flow field segmentation and motion estimation using a robust genetic partitioning algorithm. IEEE Trans. Pattern Analysis and Machine Intelligence, 17(12):1177-1190, 1995.
[8] C. Kambhamettu. Nonrigid Motion Analysis Under Small Deformations. PhD thesis, University of South Florida, December 1994. Department of Computer Science and Engineering.
[9] C. Kambhamettu, D. B. Goldgof, D. Terzopoulos, and T. S. Huang. Nonrigid motion analysis. In T. Young, editor, Handbook of PRIP: Computer Vision, volume II, pages 405-430. Academic Press, San Diego, California, 1994.
[10] C. Kambhamettu, K. Palaniappan, and A. F. Hasler. Coupled, multi-resolution stereo and motion analysis. IEEE International Symposium on Computer Vision, pages 43-48, 1995.
[11] J. R. Nickolls. The Design of the MasPar MP-2: A cost effective massively parallel computer. MasPar Computer Corporation, Tech. Report, 1994.
[12] K. Palaniappan, C. Kambhamettu, A. F. Hasler, and D. B. Goldgof. Structure and semi-fluid motion analysis of stereoscopic satellite images for cloud tracking. Proc. of the International Conference on Computer Vision, pages 659-665, 1995.