Wavelet-Based Image Registration With JPEG2000 Compressed Imagery

Derrick S. Campbell(a) and William D. Reynolds, Jr.(b)

(a) ITT Corporation Space Systems Division, Rochester, New York, U.S.A.
(b) ITT Corporation Space Systems Division, Vienna, Virginia, U.S.A.
ABSTRACT

This paper describes a registration algorithm for aligning large frame imagery compressed with the JPEG2000 compression standard. The images are registered in the compressed domain using wavelet-based techniques. Unlike traditional approaches, our proposed method eliminates the need to reconstruct the full image prior to performing registration. The proposed method is highly scalable, allowing registration to be performed on selectable resolution levels, quality layers, and regions of interest. The hierarchical nature of the wavelet transform also allows a trade-off between registration accuracy and processing speed. We present results from our simulations to demonstrate the feasibility of the proposed technique in real-world scenarios with streaming sources. The wavelet-based approach maintains compatibility with JPEG2000 and enables additional features not offered by traditional approaches.

Keywords: JPEG2000, wavelets, remote sensing, motion estimation, registration, alignment
1. INTRODUCTION

Persistent monitoring of high-valued targets using various sensor platforms has been a growing requirement for military, national security, and law-enforcement operations. These sensor platforms are moving towards large frame, low frame rate cameras to provide the necessary area coverage. In addition, multi-camera array systems are being developed to increase the field of view (FOV). To handle the volume of imagery produced by these systems, advanced compression schemes (e.g., JPEG2000) offering superior compression, progressive image delivery, and region of interest access are currently being integrated into some of these systems.1

The need for real-time understanding and event prediction has increased the demand for robust image analysis algorithms, such as target detection, identification, and tracking. Image registration is often a pre-processing step for these analysis algorithms. Thus, efficient and robust image alignment methods still remain a challenge for real-time event understanding and prediction.

We present a registration algorithm for aligning large frame imagery compressed with the JPEG2000 compression standard. The images are registered in the compressed domain using wavelet-based techniques. We globally register the images using piecewise information from the wavelet coefficients of the individual tiles. Unlike traditional approaches, our proposed method eliminates the need to reconstruct the full image prior to registration. The proposed method is highly scalable, allowing registration to be performed on selectable resolution levels, quality layers, and regions of interest. The hierarchical nature of the wavelet transform allows for a trade-off between registration accuracy and processing speed.
We present the results from our simulations to demonstrate the feasibility of the proposed technique with imagery collected from a prototype persistent surveillance system.2 The wavelet-based approach maintains compatibility with JPEG2000 and enables additional features not offered by traditional approaches. The approach represents an integrated solution combining JPEG2000 compression, wavelet-based processing, and motion estimation for near real-time image-to-image registration.

Further author information:
Derrick S. Campbell: [email protected]
William D. Reynolds, Jr.: [email protected]
Visual Information Processing XVII, edited by Zia-ur Rahman, Stephen E. Reichenbach, Mark Allen Neifeld, Proc. of SPIE Vol. 6978, 69780D, (2008) · 0277-786X/08/$18 · doi: 10.1117/12.776908
Figure 1: Baseline block diagram of the JPEG2000 compression algorithm.

The paper is organized as follows. Section 2 describes the details of the JPEG2000 compression scheme that form the basis for the registration algorithm. Section 3 describes the motion model considered for the image-to-image transformation. Section 4 describes the details of the proposed registration algorithm. Section 5 presents the simulations and results of applying the registration algorithm to real-world imagery. Finally, Section 6 concludes with a summary and future work.
2. JPEG2000

JPEG2000 is the international standard for image compression developed by the International Organization for Standardization (ISO) Joint Photographic Experts Group (JPEG).3 The JPEG2000 standard improves upon the previous image compression standard, JPEG, which was based on the Discrete Cosine Transform (DCT) and Huffman encoding, by replacing these techniques with the Discrete Wavelet Transform (DWT) and embedded encoding. These new techniques give rise to improved compression performance, multiresolution representations, and progressive transmission. The standard is published jointly with the International Electrotechnical Commission (IEC) in 12 parts that cover a wide range of technologies, from baseline image compression to image security and wireless communication protocols. This effort focuses on components associated with Part 1, which encompasses the baseline compression scheme.

The baseline JPEG2000 algorithm is shown in Figure 1, highlighting the processing steps in both the encoding and decoding paths. In the encoding path, the input image is divided into individual tiles and the DWT is applied to each tile independently. The wavelet coefficients from each tile are quantized and entropy encoded to produce the resulting codestream. The decoding path reads the input codestream and performs the inverse of each operation. Observe that the wavelet coefficients are available at two access points, indicated in Figure 1. The first access point is located on the encoder side prior to quantization. The second access point is located on the decoder side after inverse quantization. These access points enable a common framework for performing both compression and advanced wavelet processing algorithms. The proposed registration algorithm leverages this state-of-the-art compression standard by integrating novel wavelet-based techniques to render aligned images for analysis.
The motivation for this approach is centered on providing an efficient solution for image dissemination and analysis for persistent surveillance applications. The wavelet representation also facilitates hierarchical processing, as opposed to traditional pixel-based processing built on legacy compression technology.
2.1. Tiling Scheme

Under the JPEG2000 standard, the input images are normally tiled and each tile is compressed independently. This tiling scheme is shown in Figure 2, along with the typical organization of the image data in the wavelet domain.
[Figure 2 graphic: left, tile-based wavelet representation (J2K compressed image frame); right, full frame wavelet representation (3 levels of wavelet decomposition).]
Figure 2: JPEG2000 Tiling Scheme. This example illustrates the tiling scheme (left side) for three levels of wavelet decomposition. The right side shows the result of rearranging the tiles to construct the full frame wavelet representation. This piecemeal rearrangement of the tiles is equivalent to applying the wavelet transform to the entire image.

Typical tile sizes range from 256 × 256 to 1024 × 1024. The advantages of reconstructing the full frame wavelet representation include registering the images independent of tile size and estimating a smooth motion field over the entire image. We exploit this inherent tile-based data structure to develop the proposed registration algorithm. The main step involves rearranging the codestream packets of the tiles to facilitate direct access to the subband information at a specific resolution. The JPEG2000 codestream is stored in the resolution-layer-component-position (RLCP) progression order.4 This progression mode allows a straightforward method for reconstructing the full wavelet subbands. The packets are arranged so that the tile information corresponding to adjacent regions of interest is co-located. Once the tiles are co-located, the subbands can be readily subjected to traditional image processing procedures.
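To make the rearrangement step concrete, the sketch below assembles per-tile subband arrays into a single full-frame subband with NumPy. It assumes the subband of interest has already been extracted from each tile's RLCP-ordered packets by a JPEG2000 decoder; the function and variable names are illustrative, not part of the standard.

```python
import numpy as np

def assemble_full_frame_subband(tile_subbands, grid_shape):
    """Rearrange per-tile subband arrays into one full-frame subband.

    tile_subbands: dict mapping (row, col) tile-grid positions to 2-D
    arrays holding one subband (e.g. the level-3 baseband) extracted
    from each tile. grid_shape: (rows, cols) of the tile grid.
    """
    rows, cols = grid_shape
    # Stack tiles row by row; equivalent to the piecemeal rearrangement
    # described above when all tiles share the same subband size.
    row_blocks = [
        np.hstack([tile_subbands[(r, c)] for c in range(cols)])
        for r in range(rows)
    ]
    return np.vstack(row_blocks)

# Toy example: a 2x2 tile grid whose basebands are 4x4 each,
# producing an 8x8 full-frame baseband.
tiles = {(r, c): np.full((4, 4), 10 * r + c) for r in range(2) for c in range(2)}
full = assemble_full_frame_subband(tiles, (2, 2))
print(full.shape)  # (8, 8)
```

Repeating this for every subband of a resolution level yields the full frame wavelet representation shown on the right side of Figure 2.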
3. MOTION MODEL

The sensor platform is an aerial persistent surveillance system where the sensor is distant from the scene. Given this target platform, we assume a 2-D planar scene such that the spatial transformation between adjacent frames can be modeled by a projective transformation (homography).5 Denote $\mathbf{x}_{i-1}$ and $\mathbf{x}_i$ as the $N \times 2$ matrices of matching tiepoints for the previous and current frames, respectively:

$$
\mathbf{x}_{i-1} =
\begin{bmatrix}
x_{i-1,1} & y_{i-1,1} \\
\vdots & \vdots \\
x_{i-1,N} & y_{i-1,N}
\end{bmatrix}
\quad \text{and} \quad
\mathbf{x}_{i} =
\begin{bmatrix}
x_{i,1} & y_{i,1} \\
\vdots & \vdots \\
x_{i,N} & y_{i,N}
\end{bmatrix},
\qquad i = 1, \ldots, F
\tag{1}
$$

where $N$ is the number of tiepoints and $F$ is the number of frames in the sequence. The projective transformation between the previous frame $I_{i-1}$ and the current frame $I_i$ can be expressed as

$$
\tilde{\mathbf{x}}_i = \mathbf{T}\,\tilde{\mathbf{x}}_{i-1}
\tag{2}
$$

where $\mathbf{T}$ is defined by the non-singular $3 \times 3$ matrix

$$
\mathbf{T} =
\begin{bmatrix}
T_{11} & T_{12} & T_{13} \\
T_{21} & T_{22} & T_{23} \\
T_{31} & T_{32} & T_{33}
\end{bmatrix}
\tag{3}
$$

and $\tilde{\mathbf{x}}_{i-1}$ and $\tilde{\mathbf{x}}_i$ are the homogeneous coordinate representations of $\mathbf{x}_{i-1}$ and $\mathbf{x}_i$, respectively.
Let $\boldsymbol{\theta} = [T_{11}, T_{12}, T_{13}, T_{21}, T_{22}, T_{23}, T_{31}, T_{32}]^{T}$ denote the unknown vector of parameters for the transformation in (3). Given the set of tiepoints in (1), the parameters of the homography can be estimated by solving the following equation

$$
\mathbf{G}\hat{\boldsymbol{\theta}} = \mathbf{b}
\tag{4}
$$

where the $2N \times 8$ matrix $\mathbf{G}$ and the $2N \times 1$ vector $\mathbf{b}$ are defined by

$$
\mathbf{G} =
\begin{bmatrix}
x_{i-1,1} & y_{i-1,1} & 1 & 0 & 0 & 0 & -x_{i-1,1}x_{i,1} & -y_{i-1,1}x_{i,1} \\
0 & 0 & 0 & x_{i-1,1} & y_{i-1,1} & 1 & -x_{i-1,1}y_{i,1} & -y_{i-1,1}y_{i,1} \\
x_{i-1,2} & y_{i-1,2} & 1 & 0 & 0 & 0 & -x_{i-1,2}x_{i,2} & -y_{i-1,2}x_{i,2} \\
0 & 0 & 0 & x_{i-1,2} & y_{i-1,2} & 1 & -x_{i-1,2}y_{i,2} & -y_{i-1,2}y_{i,2} \\
\vdots & & & & & & & \vdots \\
x_{i-1,N} & y_{i-1,N} & 1 & 0 & 0 & 0 & -x_{i-1,N}x_{i,N} & -y_{i-1,N}x_{i,N} \\
0 & 0 & 0 & x_{i-1,N} & y_{i-1,N} & 1 & -x_{i-1,N}y_{i,N} & -y_{i-1,N}y_{i,N}
\end{bmatrix};
\quad
\mathbf{b} =
\begin{bmatrix}
x_{i,1} \\ y_{i,1} \\ x_{i,2} \\ y_{i,2} \\ \vdots \\ x_{i,N} \\ y_{i,N}
\end{bmatrix}.
\tag{5}
$$
Since the homography is computed up to an arbitrary scale factor, we set $T_{33} = 1$, which reduces the number of free parameters to eight. The solution to (4) can be determined using least squares techniques and requires a minimum of four points ($N \geq 4$). Typically, more points are available and the equation is over-determined, which minimizes the effect of noisy tiepoints. The estimated parameters $\hat{\boldsymbol{\theta}}$ are used in (3) to align the images.
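A minimal sketch of this estimation step, assuming the tiepoints are already matched: it builds G and b as in Eqs. (4)-(5) and solves for the eight free parameters with NumPy's least squares routine. The function name is our own illustration.

```python
import numpy as np

def estimate_homography(pts_prev, pts_curr):
    """Least-squares solution of G @ theta = b for the 8 free
    parameters of the homography (T33 = 1), per Eqs. (4)-(5).

    pts_prev, pts_curr: (N, 2) arrays of matching tiepoints, N >= 4.
    """
    N = pts_prev.shape[0]
    G = np.zeros((2 * N, 8))
    b = np.zeros(2 * N)
    for n, ((x, y), (xp, yp)) in enumerate(zip(pts_prev, pts_curr)):
        # Row pair for tiepoint n, matching the rows of G in Eq. (5).
        G[2 * n]     = [x, y, 1, 0, 0, 0, -x * xp, -y * xp]
        G[2 * n + 1] = [0, 0, 0, x, y, 1, -x * yp, -y * yp]
        b[2 * n], b[2 * n + 1] = xp, yp
    theta, *_ = np.linalg.lstsq(G, b, rcond=None)
    return np.append(theta, 1.0).reshape(3, 3)  # T with T33 = 1

# Pure translation by (2, 3): the recovered T should reflect it.
prev = np.array([[0., 0.], [1., 0.], [0., 1.], [1., 1.]])
curr = prev + np.array([2., 3.])
T = estimate_homography(prev, curr)
print(np.round(T, 3))
```

With more than four tiepoints the same call returns the over-determined least squares fit, which damps the influence of noisy matches as noted above.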
4. WAVELET-BASED IMAGE REGISTRATION

Our approach to image registration combines the features of wavelet-based processing and JPEG2000 compression technology to align remotely sensed imagery. Figure 3 depicts the components of the registration algorithm, which include the discrete wavelet transform (DWT) of the image, multiresolution motion estimation (ME), a warping operation (Warp), an update procedure (Update), and feature extraction (FE). We also include an outlier rejection method to reduce the effects of false matches in the motion estimation procedure. This novel approach exploits the multiresolution nature of the wavelet transform and provides an efficient processing scheme for inter-frame motion estimation of large frame imagery compared to computationally expensive pixel-domain techniques. The registration problem is addressed from the decoder side since we assume that the images have been compressed prior to the registration algorithm. Therefore, we do not concentrate on the details of the compression parameters, except where necessary for clarity.
4.1. Wavelet Representation

The Discrete Wavelet Transform (DWT) is the fundamental building block used in the compression and registration algorithms. The DWT decomposes the input image into spatial-frequency components, retaining both the spatial and frequency information of the image content. This multiresolution image representation provides an effective means for localized feature extraction and analysis of the scene content. The advantages of the wavelet representation include exploiting the spatial-frequency content of the image data, exploiting the scale properties of the objects or features present in the image, and hierarchical processing of the image. Furthermore, registering the images in the wavelet domain eliminates the need to process the full-size image. Figure 3 illustrates the DWT of an image obtained from the JPEG2000 codestream up to three levels of decomposition or resolution levels. Each frame, $I_i$, is represented as a sequence of subband images:

$$
\left\{ A_i^3(k,l),\, W_{H,i}^3(k,l),\, W_{V,i}^3(k,l),\, W_{D,i}^3(k,l),\, \ldots,\, W_{H,i}^1(k,l),\, W_{V,i}^1(k,l),\, W_{D,i}^1(k,l) \right\}, \qquad i = 1, \ldots, F
\tag{6}
$$

where $A_i^3(k,l)$ denotes the lowpass or baseband subband and $W_{H,i}^j(k,l)$, $W_{V,i}^j(k,l)$, $W_{D,i}^j(k,l)$ denote the horizontal, vertical, and diagonal high-frequency subbands at resolution $j = 1, 2, 3$, respectively. The wavelet decomposition results in a natural pyramid structure that facilitates hierarchical motion estimation. The motion estimation step can be cast into this processing framework using the lowpass subband, $A^3(k,l)$, as an initial approximation and subsequent high-frequency subbands to improve the result at different resolutions.
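For illustration, the following sketch builds the subband sequence of Eq. (6) with a simple Haar analysis filter bank. Note this is a stand-in: JPEG2000 Part 1 specifies the 5/3 and 9/7 wavelets, not Haar, and subband labeling conventions vary.

```python
import numpy as np

def haar_dwt2(img):
    """One level of a 2-D Haar DWT (illustration only).

    Returns the lowpass subband A and the horizontal, vertical, and
    diagonal high-frequency subbands (WH, WV, WD)."""
    a = img[0::2, 0::2]; b = img[0::2, 1::2]
    c = img[1::2, 0::2]; d = img[1::2, 1::2]
    A  = (a + b + c + d) / 4.0   # baseband (local averages)
    WH = (a + b - c - d) / 4.0   # horizontal detail
    WV = (a - b + c - d) / 4.0   # vertical detail
    WD = (a - b - c + d) / 4.0   # diagonal detail
    return A, WH, WV, WD

def wavelet_pyramid(img, levels=3):
    """Recursive decomposition yielding the subband sequence of Eq. (6):
    [A^L, (WH^L, WV^L, WD^L), ..., (WH^1, WV^1, WD^1)]."""
    details = []
    A = img.astype(float)
    for _ in range(levels):
        A, WH, WV, WD = haar_dwt2(A)
        details.insert(0, (WH, WV, WD))  # coarsest level first
    return [A] + details

subbands = wavelet_pyramid(np.random.rand(64, 64), levels=3)
print(subbands[0].shape)  # baseband A^3: (8, 8)
```

The returned list mirrors the pyramid structure described above: the baseband at the coarsest level comes first, followed by the detail subbands from coarsest to finest.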
Figure 3: Wavelet-based Image Registration.
4.2. Multiresolution Motion Estimation

A multiresolution motion estimation (MRME) approach was adapted to exploit the inherent wavelet representation of the compressed imagery. The wavelet representation provides an efficient data structure for hierarchical processing. The processing benefits include handling large translational offsets and eliminating the need to decode the entire image during registration. A wavelet domain approach provides the mechanism to take advantage of the unique properties of the compressed imagery to improve the registration performance while reducing computational costs.

Various multiresolution motion estimation schemes using the wavelet transform have been reported in the literature.6–11 The general approach is to estimate the motion vectors at the lowest resolution. These motion vectors serve as initial estimates to the motion vectors at the next resolution level. The hierarchical block matching scheme uses a variable block size given by $2^j n_o \times 2^j n_o$, where $j$ is the wavelet decomposition level and $n_o$ is the initial block size. The variable block size results in a constant number of blocks at each level. The blocks are also located in the same spatial position, which ensures that the same object motion is tracked across resolutions (motion consistency). The most common matching criterion is the sum of absolute differences (SAD), defined as

$$
\mathrm{SAD}(u,v) = \sum_{q=1}^{n} \sum_{r=1}^{n} \left| I_{i-1}(q+u,\, r+v) - I_i(q,r) \right|
\tag{7}
$$

where $I_{i-1}$ is the previous frame and $I_i$ is the current frame. The SAD has the advantage of being computationally efficient, but is sensitive to intensity variations in the image. The estimated motion vector is given by the block that yields the minimum error

$$
(\hat{u}, \hat{v}) = \arg\min_{(u,v) \in \Omega} \left\{ \mathrm{SAD}(u,v) \right\}
\tag{8}
$$

where $\Omega$ defines the size of the search area.
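The matching step of Eqs. (7)-(8) can be sketched as a full search over the window Ω; the function and parameter names below are our own illustration.

```python
import numpy as np

def match_block(prev, curr, q, r, n=8, search=4):
    """Full-search block matching with the SAD criterion of Eqs. (7)-(8).

    Finds the displacement (u, v) of the n x n block at (q, r) in the
    current frame `curr` by minimising SAD over a +/- `search` window
    in the previous frame `prev`."""
    block = curr[q:q + n, r:r + n].astype(float)
    best, best_uv = np.inf, (0, 0)
    for u in range(-search, search + 1):
        for v in range(-search, search + 1):
            qq, rr = q + u, r + v
            if qq < 0 or rr < 0 or qq + n > prev.shape[0] or rr + n > prev.shape[1]:
                continue  # candidate falls outside the frame
            cand = prev[qq:qq + n, rr:rr + n].astype(float)
            sad = np.abs(cand - block).sum()  # Eq. (7)
            if sad < best:                    # Eq. (8): keep the minimum
                best, best_uv = sad, (u, v)
    return best_uv

# Synthetic check: shift a random frame by (2, 1) and recover the motion.
rng = np.random.default_rng(0)
prev = rng.random((32, 32))
curr = np.roll(prev, (-2, -1), axis=(0, 1))
print(match_block(prev, curr, 8, 8))  # -> (2, 1)
```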
Our approach differs from the above methods in the following ways. First, we use a sparse wavelet tree to perform the motion estimation. Unlike previous methods, the motion estimation is performed only in the lowpass subband at each level. While this approach reduces computations and processing time, it also minimizes the effect of the translation variance of the wavelet transform.12 The high-frequency subbands are used for feature extraction to adapt the motion estimation to the scene content. Second, we apply a pre-warp to one of the lowpass subbands prior to motion estimation.13 The pre-warp is based on a priori knowledge about the transformation between the adjacent frames. Since we may experience large translational motion, the pre-warp is designed to reduce the burden on the motion estimation algorithm. The pre-warp transformation is estimated from the metadata, if available. If the metadata is not available, the pre-warp transformation can be set to the identity matrix without loss of generality. Finally, we are concerned with the global motion between the frames and not the local motion of individual objects in the scene.

We present three variations of the MRME algorithm:

Algorithm I: Motion estimation is performed on a single resolution level ($j = 3, 2, 1$) using only the lowpass subband. The motion estimate is projected to the original image resolution and serves as the final estimate. No refinement of the motion estimate is performed. The motion estimate is given by

$$
\mathbf{T}^j = \hat{\mathbf{T}}^j \tilde{\mathbf{T}}^j
\tag{9}
$$

where $\tilde{\mathbf{T}}^j$ is the initial estimate, $\hat{\mathbf{T}}^j$ is the update estimate from the block matching, and $\mathbf{T}^j$ is the final motion estimate. If we assume no metadata is available, then $\tilde{\mathbf{T}}^j = \mathbf{I}$, the identity matrix. The final motion estimate becomes $\mathbf{T}^j = \hat{\mathbf{T}}^j$ and is scaled appropriately to the original image resolution.

Algorithm II: Motion estimation starts at the lowest resolution level ($j = 3$). Assuming $\tilde{\mathbf{T}}^3 = \mathbf{I}$, the final motion estimate is given by $\mathbf{T}^3 = \hat{\mathbf{T}}^3$. This estimate serves as the initial estimate for the motion estimate at the next higher resolution level L2. The motion estimate for L2 is given by

$$
\mathbf{T}^2 = \hat{\mathbf{T}}^2 \tilde{\mathbf{T}}^2
\tag{10}
$$

where $\tilde{\mathbf{T}}^2$ is the initial estimate and $\hat{\mathbf{T}}^2$ is the estimate from the block matching. Similarly, the motion estimate for L1 is given by

$$
\mathbf{T}^1 = \hat{\mathbf{T}}^1 \tilde{\mathbf{T}}^1
\tag{11}
$$

where $\tilde{\mathbf{T}}^1$ is the initial estimate derived from $\mathbf{T}^2$ and $\hat{\mathbf{T}}^1$ is the estimate from block matching. The final motion estimate, $\mathbf{T}$, is obtained by scaling (11) to the original image resolution. Note that $\tilde{\mathbf{T}}^j$ serves as the pre-warp for the lowpass subband and is applied prior to block matching.

Algorithm III: Incorporates a feature extraction technique into Algorithms I and II that adapts the motion estimation to the scene content. The technique is based on the localized energy of the wavelet coefficients in the high-frequency subbands across resolution and orientation. The energy of each block is computed and compared against a threshold prior to motion estimation. This scene-adaptive process reduces the number of blocks processed during motion estimation and minimizes motion estimation in smooth areas of the image.
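The coarse-to-fine cascade of Eqs. (9)-(11) in Algorithm II can be sketched as follows. Here `estimate_at_level` is a hypothetical callback standing in for the pre-warp plus block matching stage, and the per-level rescaling of Section 4.4 is omitted for brevity.

```python
import numpy as np

def coarse_to_fine(estimate_at_level, levels=(3, 2, 1)):
    """Sketch of Algorithm II: cascade the per-level homographies
    T^j = T_hat^j @ T_tilde^j of Eqs. (9)-(11).

    `estimate_at_level(j, T_tilde)` is a hypothetical callback that
    pre-warps the level-j lowpass subband with T_tilde, runs block
    matching, and returns the update T_hat^j."""
    T_tilde = np.eye(3)               # T_tilde^3 = I when no metadata
    for j in levels:
        T_hat = estimate_at_level(j, T_tilde)
        T = T_hat @ T_tilde           # Eq. (9): T^j = T_hat^j T_tilde^j
        T_tilde = T                   # serves as initial estimate at j-1
    return T                          # scale to full resolution afterwards

# Dummy callback: each level contributes a unit x-translation.
T = coarse_to_fine(lambda j, Tt: np.array([[1., 0., 1.],
                                           [0., 1., 0.],
                                           [0., 0., 1.]]))
print(T[0, 2])  # accumulated x-translation over three levels -> 3.0
```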
4.3. Homography Generation from Motion Estimation

Given the localized estimates performed by the block matching, a parametric motion model representing the overall global motion of the scene is desired. The general global motion can be represented by a homography, which captures the geometric transformation associated with the overall motion of the scene. The block matching algorithm provides local matches or estimates of the motion field over a small region of support. The local estimates are combined to estimate the global motion between the frames. This approach represents an alternative method of finding matching tiepoints, and differs slightly from the traditional approaches used in the MPEG family of compression standards.14 For each resolution level ($j$), a set of tiepoints is generated from the block matching algorithm. The top-left coordinate of each block represents the coordinate of its tiepoint. The set of tiepoints at the j-th resolution for
the previous and current frames is denoted by $\{\mathbf{x}_{i-1}^j\} \leftrightarrow \{\mathbf{x}_i^j\}$. The transformation at the j-th resolution level, $\hat{\mathbf{T}}^j$, is given by

$$
\tilde{\mathbf{x}}_i^j = \hat{\mathbf{T}}^j \tilde{\mathbf{x}}_{i-1}^j
\tag{12}
$$

where the least squares solution is given in (4). $\hat{\mathbf{T}}^j$ is combined with the initial estimate $\tilde{\mathbf{T}}^j$ to produce the final motion estimate in (9).
4.4. Update Procedure

The update procedure adjusts $\mathbf{T}^j$ according to the level of the wavelet transform. The final motion estimate at each level is adjusted by:

$$
\mathbf{T}^j =
\begin{bmatrix}
T_{11} & T_{12} & 2^{L-j}\, T_{13} \\
T_{21} & T_{22} & 2^{L-j}\, T_{23} \\
2^{-(L-j)}\, T_{31} & 2^{-(L-j)}\, T_{32} & T_{33}
\end{bmatrix}
\tag{13}
$$

where $L = 3$ and $j$ denotes the resolution level ($j = 1, 2, 3$). The estimate serves as the initial estimate for the next higher resolution level and is updated prior to the block matching procedure.
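Eq. (13) amounts to rescaling the translation and perspective terms of the homography while leaving the affine block unchanged; a small sketch (the function name is our own):

```python
import numpy as np

def update_homography(T, j, L=3):
    """Rescale a homography between resolution levels per Eq. (13):
    translations grow by 2^(L-j) while the perspective terms shrink
    by the same factor; the 2x2 affine block is unchanged."""
    s = 2.0 ** (L - j)
    Tj = T.astype(float).copy()
    Tj[0, 2] *= s        # T13: x-translation
    Tj[1, 2] *= s        # T23: y-translation
    Tj[2, 0] /= s        # T31: perspective term
    Tj[2, 1] /= s        # T32: perspective term
    return Tj

T = np.array([[1., 0., 4.], [0., 1., 2.], [0.1, 0.2, 1.]])
print(update_homography(T, j=1))  # translations x4, perspective terms /4
```

At the coarsest level (j = L) the scale factor is 1 and the estimate passes through unchanged.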
4.5. Feature Extraction

The goal of the feature extraction algorithm is to reduce the number of blocks processed during motion estimation. This enhancement step to the motion estimation algorithm requires direct access to the wavelet high-frequency subbands $W_H^j$, $W_V^j$, $W_D^j$ at each level. These subbands correspond to the horizontal, vertical, and diagonal edge information at each resolution level, respectively. We apply a localized feature extraction algorithm to analyze the local scene content of the image. The energy measure $E_B^j$ of each block is given by

$$
E_B^j = \sum_{k=1}^{N_B} \sum_{l=1}^{N_B} E^j(k,l),
\tag{14}
$$

where $N_B \times N_B$ denotes the size of the block and $E^j(k,l)$ is given by

$$
E^j(k,l) = \left[ W_H^j(k,l) \right]^2 + \left[ W_V^j(k,l) \right]^2 + \left[ W_D^j(k,l) \right]^2.
\tag{15}
$$

The blocks are eliminated based on the energy of the wavelet coefficients and are thresholded according to the binary decision rule:

$$
B =
\begin{cases}
1, & E_B^j \geq \tau \\
0, & E_B^j < \tau
\end{cases}
\tag{16}
$$
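The decision rule of Eqs. (14)-(16) can be sketched as follows; the block size $N_B$ and threshold $\tau$ are free parameters, and the function name is our own illustration.

```python
import numpy as np

def active_blocks(WH, WV, WD, NB=8, tau=1.0):
    """Binary block map of Eqs. (14)-(16): a block enters motion
    estimation only if the summed wavelet-coefficient energy of the
    three high-frequency subbands reaches the threshold tau."""
    # Eq. (15): per-coefficient energy across orientations.
    E = WH.astype(float) ** 2 + WV.astype(float) ** 2 + WD.astype(float) ** 2
    rows, cols = E.shape[0] // NB, E.shape[1] // NB
    B = np.zeros((rows, cols), dtype=int)
    for r in range(rows):
        for c in range(cols):
            # Eq. (14): block energy; Eq. (16): binary decision.
            EB = E[r * NB:(r + 1) * NB, c * NB:(c + 1) * NB].sum()
            B[r, c] = int(EB >= tau)
    return B

# Smooth subbands (all zeros) are skipped; only a textured block is kept.
WH = np.zeros((16, 16)); WV = np.zeros((16, 16)); WD = np.zeros((16, 16))
WH[:8, :8] = 0.5   # edge energy only in the top-left block
print(active_blocks(WH, WV, WD, NB=8, tau=1.0))
```

Only blocks flagged 1 are passed to the block matching stage, which skips motion estimation in smooth areas of the image as described above.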