Parallel Pipelined Fractal Image Compression using Quadtree Recomposition

DAVID JEFF JACKSON AND WAGDY MAHMOUD
Department of Electrical Engineering, 317 Houser Hall, The University of Alabama, Tuscaloosa, AL 35487, USA
Email: [email protected]

In this paper we present a model and experimental results for performing parallel fractal image compression using circulating pipeline computation and employing a new quadtree recomposition approach. A circular linear array of processors, utilized in a pipelined fashion, is employed. In this approach, a modification of the scheme given by Jackson and Blom, quadtree sub-blocks of an image are iteratively recombined into larger blocks for fractal coding. For complex images, this approach exhibits superior parallel runtime performance when compared to a classical quadtree decomposition scheme for fractal image compression, while maintaining high fidelity for reconstructed images. Quantitative results include parallel runtime comparisons with the decomposition approach for several images of varying complexity, and an evaluation of attained compression ratios and SNR for reconstructed images. Experimental results using an nCUBE-2 supercomputer are presented.

Received May 30, 1995; revised November 17, 1995
1. INTRODUCTION
Data compression has become an important issue in relation to information storage and transmission. This is especially true for databases consisting of a large number of detailed computer images. In recent years, many methods have been proposed for achieving high compression ratios for compressed image storage. A very promising compression technique, in terms of compression ratios and image quality, is fractal image compression. Another intrinsic advantage offered by the fractal method is a fast decoding time. Fractal image compression exploits natural affine redundancy present in typical images to achieve a high compression ratio in a lossy compression format. However, fractal based compression algorithms have high computational demands. Currently, fractal methods are employed for write-once-read-many archival media, such as CD-ROM, where encoding time is not as critical as high speed decoding and high quality reconstruction.

To obtain faster compression, or feasible compression times for very large images, a parallel algorithm may be employed which exploits the inherently parallel nature, from a data domain viewpoint, of the fractal transform process. The implemented parallel algorithm, employing a circulating pipeline of processors, is a modification of an algorithm developed by Jackson and Blom [1]. The existing parallel approaches to fractal compression [1, 2] employ quadtree decomposition (QD) and are similar to a scheme proposed by Fisher [3]. However, an analysis of the QD approach indicates that, like other decompositions used for fractal image compression, many of the calculations performed are not, ultimately, required. Therefore, a new quadtree recomposition (QR) approach is introduced which reduces parallel runtime by effectively eliminating many unnecessary calculations. This
approach is shown to significantly improve the parallel runtime required for the fractal compression of typical images of varying complexity, without significantly reducing the fidelity of the reconstructed images.

In Section 2, a basic summary of fractals and iterated function systems is presented. We also describe a general fractal image compression scheme and approaches to image partitioning. In Section 3 we give our choice of distortion measure, describe the transformations which comprise the iterated function system, and address the complexities of the matching transformation search and the storage of the iterated function system. In Section 4, parallel processing issues relevant to both the QD and QR algorithms are presented. A summary of the parallel QD approach and an analysis of the parallel runtime performance which justifies the new QR approach are given in Section 5. The QR scheme is introduced in Section 6, where parallel runtime comparisons for the QD and QR methods are also given. In Section 7 we present a performance analysis of each method in terms of execution time, attained compression ratios and signal-to-noise ratios (SNR) for reconstructed images. All test images are of size 256 × 256 pixels. In Section 8 we conclude and suggest areas in which additional research is required.

2. FRACTALS AND ITERATED FUNCTION SYSTEMS
2.1. Iterated function systems (IFS) and contractive mappings

The basis for fractal image compression is the construction of an iterated function system that approximates the original image. An IFS is a union of contractive transformations that maps a set to itself. For a transformation W to be contractive,
equation (1) must be satisfied. This equation states that the distance d(P1, P2) between any two points, P1 = (x1, y1) and P2 = (x2, y2), is diminished by applying the transformation W. A metric to measure distance, the standard Euclidean metric, in a two-dimensional space is given in (2). In practice, we will use a different measure, presented in Section 3.

$$d(W(P_1), W(P_2)) < d(P_1, P_2) \qquad (1)$$

$$d(P_1, P_2) = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2} \qquad (2)$$
An example of a one-dimensional contractive function is f(x) = 0.2x + 2. When iterating on f(x) (i.e. applying f(x) consecutively), the result always approaches 2.5, regardless of the initial value of x. The point 2.5 is referred to as a fixpoint for f(x).

The following IFS example [4] represents a simple introduction to the concept. Consider the map W, given in (6), as a collection, i.e. union, of the three linear, contractive transformations w1–w3 given in (3)–(5), which map the plane R² to itself. W is applied to sets which are collections of points in the plane. For a given set S, computing and taking the union of wi(S) for each i, we obtain a new set W(S). Thus, W is a map on the space of subsets of the plane.

$$w_1\begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} 0.5 & 0 \\ 0 & 0.5 \end{bmatrix}\begin{bmatrix} x \\ y \end{bmatrix} \qquad (3)$$

$$w_2\begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} 0.5 & 0 \\ 0 & 0.5 \end{bmatrix}\begin{bmatrix} x \\ y \end{bmatrix} + \begin{bmatrix} 0.5 \\ 0 \end{bmatrix} \qquad (4)$$

$$w_3\begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} 0.5 & 0 \\ 0 & 0.5 \end{bmatrix}\begin{bmatrix} x \\ y \end{bmatrix} + \begin{bmatrix} 0.25 \\ 0.5 \end{bmatrix} \qquad (5)$$

$$W(x, y) = w_1(x, y) \cup w_2(x, y) \cup w_3(x, y) \qquad (6)$$
Let S be the image of a ball, as shown in the initial image in Figure 1. If W is applied to S, the result of the first iteration is a triangle of balls. Next, W(S) is transformed by applying W again, thus yielding W(W(S)). This is shown in Figure 1 as the second iteration. Additional iterations are also shown. After iterating W on the order of 10 times, the image does not exhibit visible change, which is true when the resolution is limited (i.e. S is discrete). In the case of a continuous space, the image acquires added details of decreasing size. The resulting image is known as the Sierpinski triangle and is denoted an attractor for W. S is defined as an attractor of W if the equation S = W(S) is satisfied.

FIGURE 1. Results of applying the iterated function system, W.
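As an illustration, the iteration of W can be sketched in a few lines of code. The fragment below is ours, not part of the original coders; the initial point and the iteration count are arbitrary.

```python
# A minimal sketch of iterating the IFS W of equations (3)-(6) on a point set.
# Starting from any initial set, repeated application converges toward the
# Sierpinski triangle, the attractor of W.

def w1(x, y): return (0.5 * x, 0.5 * y)
def w2(x, y): return (0.5 * x + 0.5, 0.5 * y)
def w3(x, y): return (0.5 * x + 0.25, 0.5 * y + 0.5)

def apply_W(points):
    """One application of W: the union of w1, w2 and w3 over the point set."""
    return {w(x, y) for (x, y) in points for w in (w1, w2, w3)}

points = {(0.7, 0.3)}      # an arbitrary initial 'image'
for _ in range(10):        # roughly 10 iterations suffice at finite resolution
    points = apply_W(points)
print(len(points))         # up to 3**10 points sampling the attractor
```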
The Sierpinski triangle is a fractal, a complex image with self-similar details at different scales. Yet this complex image is formed from the union of only three linear transformations. An important concept is that the initial image does not play any role in the final result. Thus, any initial image will be transformed to the Sierpinski triangle after a number of iterations using W. Provided that a mapping is contractive and maps to all points in the set, the Contractive Mapping Fixed Point Theorem also proves that there will be one and only one attractor for W [3]. Thus, by applying an IFS consecutively, any initial image will converge to one and only one final attractor. Additional discussion of IFS theory is given by Barnsley and Hurd [5].

2.2. A general fractal image compression scheme

With fractal image compression, an image is encoded as the fixed point of a contractive mapping constructed piecewise. The image is partitioned into square blocks, denoted range blocks. The goal of the encoding is to determine, for each range block, a best matching block, denoted a domain block, under an affine mapping. The domain block should be larger in size than the range block to which it maps in order to fulfill contractivity requirements. The set of range–domain mappings forms an IFS that has the decoded image as its attractor. Furthermore, that attractor should exhibit less complexity than the original image. Complexity should here be loosely understood as the amount of storage needed to describe the object or the transformations. To store the compressed image, only the coefficients of each iterated function system must be stored. Given an image, S, the algorithm must produce a union of transformations w1, w2, ..., wn that satisfies (7).

$$S = W(S) = w_1(S) \cup w_2(S) \cup \ldots \cup w_n(S) \qquad (7)$$

Equation (7) suggests how the construction of an attractor can be achieved, i.e. by the partition of S into n pieces. A contractive mapping from S into each piece must then be determined. This problem can also be denoted the inverse problem of iterated transformation theory. It is important to notice that each mapping exploits self-similarities of different sizes present in most images. Thus, images with random content are not likely to be compressed very well, as only few similarities of different size are likely to exist.

Finding W so that S = W(S) holds exactly, with less complexity than the original image, is not likely [6]. However, as stated earlier, fractal image compression is a lossy compression algorithm. Therefore, the equality S = W(S) does not have to be fulfilled exactly. Instead, S can be approximated by W(S). Assuming a distortion measure d(S, S′) between the images S and S′ exists, the problem is to minimize d(S, W(S)) and the complexity of W. This can also be stated as minimizing the loss and maximizing the compression ratio. For all practical purposes, the choice is more a trade-off between compression ratio and distortion. When implementing the algorithm, another vital factor for the application is the compression time.
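The search implied by this inverse problem can be summarized as a loop over range and domain blocks. The sketch below is a hypothetical, sequential rendering of that search; best_affine_fit stands in for the least-squares fit of Section 3 and is not a routine from the original implementation.

```python
# Hypothetical top-level encoding loop for the inverse problem: for each range
# block, search the domain pool for the affine map minimizing the distortion.
# The domain pool is assumed to be non-empty.

def encode(range_blocks, domain_blocks, threshold, best_affine_fit):
    transforms = []
    for r in range_blocks:
        best = (float("inf"), None)
        for d in domain_blocks:
            err, params = best_affine_fit(d, r)   # distortion and coefficients
            if err < best[0]:
                best = (err, params)
            if err <= threshold:                  # accept the first good match
                break
        transforms.append(best[1])                # store only the coefficients
    return transforms
```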
Current research in fractal image compression concentrates on investigating different transformations and on improving search algorithms for matching transformations, to decrease compression time [7, 8].

2.3. Image partitioning schemes

An obvious partition scheme for S is to divide the image into a number of non-overlapping square range blocks, as implemented in the first algorithms developed [4, 6]. An image of size 256 × 256 is partitioned into squares of 8 × 8 pixels. As required by equation (7), the union of the set of range blocks covers the entire image. A corresponding set of domain blocks is similarly constructed, with the exceptions that the domain blocks are chosen to be twice the size of the range blocks (to ensure contractivity for the domain-to-range block transformation) and that the domain blocks are allowed to overlap. Overlapping increases the size of the domain block pool, thus increasing the probability of finding a good range–domain match. The partition of an image of size 256 × 256 into domain blocks of size 16 × 16 is shown in Figure 2. Notice the large number, 58081, of domain blocks. This large search space is the reason for the high computational costs associated with fractal image compression. To obtain the best match, the entire domain pool must be searched, with the best match chosen for each range block. In practice, the first domain block matching the range block within an acceptable distance measure is used.

This simple partition scheme, with a fixed block size, has limitations. For large range blocks, good matches with domain blocks become unlikely. To overcome this limitation, a quadtree partitioning scheme can be employed. A quadtree decomposition initially partitions the image into large range blocks, typically 16 × 16 pixels. Then, the best possible transformation for each block is found. This best transformation is compared with the original block using a distance metric. An acceptance threshold is set before the transformation. The transformation is accepted if the distance between the blocks is lower than the threshold. If the transformation is discarded, the range block is divided into four sub-blocks and the search for a best transformation for each sub-block is initiated. This partition scheme can be recursively continued for several levels (typically 2–4) until either all blocks are covered with an acceptable transformation or a certain minimum range block size is reached, for which the best matching transformation is used. The range block sizes for the quadtree partitioning employed here are 16 × 16, 8 × 8 and 4 × 4 pixels. The corresponding domain block sizes are 32 × 32, 16 × 16 and 8 × 8 pixels, respectively. Other schemes, using not only different partition sizes but also different partition geometries, have been implemented [4]. These schemes typically attempt to decompose the image in some content-dependent manner. Partition geometries which have been employed include horizontal–vertical (HV), triangular and polygonal decompositions.
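A sketch of this recursive decomposition follows, under the assumption of a find_best_match(x, y, size) helper returning an (error, transform) pair; it is illustrative rather than the nCUBE implementation.

```python
# Top-down quadtree decomposition: accept a block if its best transformation
# meets the MSE threshold, otherwise split it into four sub-blocks, down to a
# minimum block size at which the best available match is kept.

def qd_encode(x, y, size, min_size, threshold, find_best_match, out):
    err, transform = find_best_match(x, y, size)
    if err <= threshold or size == min_size:
        out.append((x, y, size, transform))
        return
    half = size // 2
    for dx in (0, half):
        for dy in (0, half):
            qd_encode(x + dx, y + dy, half, min_size, threshold,
                      find_best_match, out)

# e.g. transforms = []; qd_encode(0, 0, 16, 4, 10.0, find_best_match, transforms)
```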
3. CHOICE FOR IMAGE DISTORTION MEASURES AND TRANSFORMATIONS
3.1. MSE and SNR distortion measures

Several distortion measures were used throughout this research; the basis for these measures is the mean square error (MSE) metric. This metric expresses the difference, or distance, between two images or two range blocks. Difference and distance will here be used interchangeably. Assuming two images or image blocks, S and S′, possessing n pixels with intensities S1, S2, ..., Sn and S′1, S′2, ..., S′n, the distance between the blocks can be expressed as (8).

$$d(S, S') = \frac{1}{n}\sum_{i=1}^{n}(S_i - S_i')^2 \qquad (8)$$

The difference between two images with n pixels can also be expressed as an SNR, as given in (9).

$$\mathrm{SNR} = 10\log_{10}\left(\frac{dr(S)^2}{d(S, S')}\right) \qquad (9)$$

The term dr(S) in (9) denotes the dynamic range of S. Note that whenever SNR is used, S represents the original image and S′ the decompressed image. SNR and the MSE metric are image quality measurements. When employed, conclusions concerning perceived image quality should only be drawn with care and after careful visual inspection of the final result.
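For concreteness, (8) and (9) translate directly into code. The sketch below takes the dynamic range as max − min of the original image, which is one plausible reading of dr(S).

```python
import math

def mse(s, s_prime):
    """Equation (8): mean square error between two equal-length pixel lists."""
    n = len(s)
    return sum((a - b) ** 2 for a, b in zip(s, s_prime)) / n

def snr(s, s_prime):
    """Equation (9): SNR in dB; dr(S) is taken here as max(S) - min(S)."""
    dr = max(s) - min(s)
    return 10.0 * math.log10(dr ** 2 / mse(s, s_prime))
```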
FIGURE 2. Domain block partitions.

3.2. Choice of block transformations
The requirements for the IFS transformations include not only contractivity but also that the resulting size of each transformation be the size of the range block to which it maps. The transformation from a domain block to a range block is split into three steps: a geometric, a massic and an isometric step. The geometric transformation is spatial, transforming an arbitrary domain block into the size and
position of a specific range block. First, a 2B × 2B sized domain block is pixel averaged to form a B × B sized block for matching with a B × B sized range block. The contracted domain block is then translated to the position of the range block.

The massic transformation affects the individual pixel values by performing a linear transformation equivalent to a contrast scaling and a luminance shift (Fisher gives the latter as brightness). Denoting the contrast scaling s and the luminance shift o, the distance, R_MSE, between the transformed domain block, D, and a range block, R, possessing pixel intensities D1, D2, ..., Dn and R1, R2, ..., Rn, is given in (10). Notice n denotes the total number of pixels in a block, not the side length.

$$R_{\mathrm{MSE}} = \frac{1}{n}\sum_{i=1}^{n}(sD_i + o - R_i)^2 \qquad (10)$$

The contrast scaling, s, and the luminance shift, o, must be determined in advance to calculate the distance between D and R. The objective when determining s and o is to minimize R_MSE. This is true when the partial derivatives of R_MSE with respect to s and o are zero. Then, s can be found as (11) and o as (12).

$$s = \frac{n\sum_{i=1}^{n}D_iR_i - \sum_{i=1}^{n}D_i\sum_{i=1}^{n}R_i}{n\sum_{i=1}^{n}D_i^2 - \left(\sum_{i=1}^{n}D_i\right)^2} \qquad (11)$$

$$o = \frac{1}{n}\left(\sum_{i=1}^{n}R_i - s\sum_{i=1}^{n}D_i\right) \qquad (12)$$
Finally, by substituting (11) and (12) into (10), the distance R_MSE can be computed as (13).

$$R_{\mathrm{MSE}} = \frac{1}{n}\left[\sum_{i=1}^{n}R_i^2 + s\left(s\sum_{i=1}^{n}D_i^2 - 2\sum_{i=1}^{n}D_iR_i + 2o\sum_{i=1}^{n}D_i\right) + o\left(on - 2\sum_{i=1}^{n}R_i\right)\right] \qquad (13)$$

A careful inspection of (11), (12) and (13) shows that, for each comparison between a domain block and a range block, only one summation has to be performed, namely $\sum D_iR_i$. The remaining sums can be computed once in the initial part of the algorithm and stored for subsequent use. The distance metric given in (13) is used for the presented experimental data.
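The fit and distance computation of (11)–(13) can be expressed as in the sketch below; the per-block sums are assumed precomputed, so only the cross sum is evaluated per comparison, as noted above. The guard against a zero denominator (a flat domain block) is our addition.

```python
def block_sums(pixels):
    """Per-block sums used in (11)-(13); computed once per block and stored."""
    return sum(pixels), sum(v * v for v in pixels)

def fit_and_distance(D, R, sum_d, sum_d2, sum_r, sum_r2):
    """Equations (11)-(13): optimal s and o, and the resulting R_MSE."""
    n = len(D)
    sum_dr = sum(d * r for d, r in zip(D, R))       # the one per-comparison sum
    denom = n * sum_d2 - sum_d ** 2
    s = (n * sum_dr - sum_d * sum_r) / denom if denom else 0.0  # flat-block guard
    o = (sum_r - s * sum_d) / n
    r_mse = (sum_r2
             + s * (s * sum_d2 - 2.0 * sum_dr + 2.0 * o * sum_d)
             + o * (o * n - 2.0 * sum_r)) / n
    return s, o, r_mse
```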
The final transformation, the isometric, is a shuffling of pixels within the block in a deterministic way. The isometric transformation is used to increase the pool of domain blocks. Eight different isometric transformations are used, as suggested by Jacquin [6, 9]. Consider the block, D, with n = B² pixel values denoted D_{i,j}, where 1 ≤ i, j ≤ B. Then, the eight isometric transformations from D to D′ can be expressed as in Table 1.

TABLE 1. The eight isometric transformations

Function                               D → D′ transformation
Identity                               D′_{i,j} = D_{i,j}
Reflection about mid-vertical axis     D′_{i,j} = D_{i,B−j+1}
Reflection about mid-horizontal axis   D′_{i,j} = D_{B−i+1,j}
Reflection about first diagonal        D′_{i,j} = D_{j,i}
Reflection about second diagonal       D′_{i,j} = D_{B−j+1,B−i+1}
Rotation through 90°                   D′_{i,j} = D_{j,B−i+1}
Rotation through 180°                  D′_{i,j} = D_{B−i+1,B−j+1}
Rotation through −90°                  D′_{i,j} = D_{B−j+1,i}
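The eight isometries of Table 1 are the symmetries of the square and are conveniently expressed with array primitives; the numpy rendering below is illustrative, not the original implementation.

```python
import numpy as np

def isometries(D):
    """The eight isometric variants of a square block D, in Table 1 order."""
    return [
        D,                 # identity
        np.fliplr(D),      # reflection about the mid-vertical axis
        np.flipud(D),      # reflection about the mid-horizontal axis
        D.T,               # reflection about the first diagonal
        np.rot90(D, 2).T,  # reflection about the second diagonal
        np.rot90(D, 1),    # rotation through 90 degrees
        np.rot90(D, 2),    # rotation through 180 degrees
        np.rot90(D, 3),    # rotation through -90 degrees
    ]
```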
3.3. Matching transformation search

The first fractal compression schemes typically used brute-force search algorithms. That is, all possible comparisons were tried and the best matches were used [4]. To evaluate that solution in terms of execution complexity, the number of comparisons between a range block and a domain block as a function of the sidelength, Δ, of the image must be determined. Note that one domain-to-range block comparison can be performed in constant time. The sidelength of each range block is B. Thus, the sidelength of each domain block is 2B. The total number of domain blocks, dom, is given in (14) and the number of range blocks, rang, is given in (15).

$$\mathrm{dom} = (\Delta - 2B + 1)^2 = O(\Delta^2) \qquad (14)$$

$$\mathrm{rang} = \frac{\Delta^2}{B^2} \qquad (15)$$

Each range block must be compared with all domain blocks to determine the best match. Thus, the total number of comparisons has an O(Δ⁴) complexity. In practice, however, we accept the first match which is within some predetermined threshold. This implies that, at least for typical images, the average complexity is somewhat less than O(Δ⁴) and varies according to the size of the blocks being compared. Although image dependent, the probability of finding a good match for large blocks is typically low. Thus, the time required to find good matches at higher quadtree levels is greater than that for lower quadtree levels. A more detailed analysis of this is given in Section 7.

To decrease the execution time, most fractal compression algorithms employ a classification scheme, i.e. each domain block and range block is classified according to certain characteristics. Then, each range block is compared only with domain blocks possessing the same characteristics. A number of different approaches to decreasing complexity have been implemented. As this research is concerned with a parallel implementation of the fractal image compression algorithm, the block classification algorithm implemented is
fairly simple. This is not only to ease the implementation, but also to avoid potential load imbalance problems introduced by a classification scheme. The implemented scheme uses a simple classification in which all domain and range blocks are compared with a constant grayshade block. Each block is then classified as either a grayshade block or an edge block. If a range block is classified as a grayshade block, the luminance shift is set to the average grayscale of the block and the remaining transformation variables are set to zero. If a domain block is classified as a grayshade block, it is excluded from the pool of possible domain blocks used for matching range blocks. The comparisons are thus limited to edge-classified range and domain blocks.

3.4. IFS storage

Once all transformations have been determined, the transformation information is stored. To store each transformation, two different formats are used: one for edge blocks and one for grayshade blocks. These formats are shown in Figure 3. The first bit determines the block type: either grayshade or edge. In the case of a grayshade block, only the size of the decoded range block and the corresponding grayscale value have to be stored. The size field is two bits, allowing up to four different range block sizes. To store an edge block, additional information is required. The first field, containing a zero, indicates an edge block. This is followed by a size field identical to that of the grayshade format. The next two fields, xpos and ypos, hold the position of the domain block from which to transform. Their size is dependent on the image size. The flip field indicates which of the eight isometric transformations is employed. Finally, the contrast scaling and luminance shift are stored, both requiring eight bits. Higher compression ratios might be obtained by reducing the number of bits used to store the contrast and luminance. As few as 5 bits for the contrast scaling and 7 bits for the luminance shift have been implemented by Fisher [3, 4]. The contrast and luminance exhibit some structure. Thus, further compression can be obtained by entropy coding, as suggested by Fisher [4] and implemented by Frigaard et al. [7].
FIGURE 3. Storage format of transformations for a 256 × 256 image.
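The record layout of Figure 3 can be packed bitwise as sketched below. The 8-bit xpos/ypos widths are an assumption for a 256 × 256 image (the paper notes only that these fields depend on the image size), and s and o are assumed already quantized to unsigned 8-bit codes.

```python
# Bit-packing sketch for the Figure 3 formats. Field widths follow the text:
# 1-bit type (1 = grayshade, 0 = edge), 2-bit size, 3-bit flip, 8-bit contrast
# and luminance; xpos/ypos are assumed 8 bits each for a 256 x 256 image.

def pack_grayshade(size_code, gray):
    # type | size | grayscale  -> 11 bits
    return (1 << 10) | (size_code << 8) | gray

def pack_edge(size_code, xpos, ypos, flip, contrast, luminance):
    # type | size | xpos | ypos | flip | contrast | luminance  -> 38 bits
    word = (0 << 37) | (size_code << 35) | (xpos << 27) | (ypos << 19)
    return word | (flip << 16) | (contrast << 8) | luminance
```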
4. PARALLEL ISSUES FOR THE QD AND QR ALGORITHMS
4.1. The nCUBE parallel architecture

The parallel architecture utilized for this research is the nCUBE-2 parallel supercomputer. The nCUBE-2 is a medium grain size multiple instruction multiple data (MIMD) computer. Each processor executes a unique instruction stream and has a local memory. The nCUBE-2 supports a hypercube array of up to 8192 nodes. Each node consists of a processor with a communication unit and 1–64 Mbytes of memory, with a typical node memory of only 1 Mbyte. This memory constraint prohibits using the same data decomposition as previously implemented [2], where a full image is loaded at each node. An objective of the nCUBE-2 implementation is to be capable of compressing an image of size 1024 × 1024 with 256 graytones. Such an image has a size of 1 Mbyte, the same as the available space for most nCUBE-2 nodes. Thus, the image would require all available memory and leave no space for code and other data. This memory constraint, often exhibited in massively parallel systems, motivates the pipelined node-to-node communication employed.

4.2. Range block versus domain block distribution

The nCUBE-2 memory constraint requires the image data to be distributed between the nodes. Two solutions, both using a host/slave model, are considered. One solution requires distributing the range blocks evenly among the nodes and the other distributes the domain blocks evenly. Both schemes operate in a pipeline configuration, as shown in Figure 4.

The scheme for distributing the range blocks evenly between the slave nodes was discarded when a load balance problem was noticed. Specifically, when a match between a range block and a domain block is found, no reason exists to continue matching that range block with other domain blocks. Consider a distribution of range blocks on the slave nodes. Some nodes can have range blocks with a high complexity (i.e. they are difficult to match). Other nodes have range blocks for which a match is found immediately. Thus, the execution time for matching a domain block with the unmatched range blocks varies between the slave nodes. In a pipeline, the time between each advance in each stage depends on the slowest stage (i.e. the slowest node determines the speed of the pipeline). Thus, the unbalanced execution times between the nodes cause idle nodes. This imbalance is even more apparent when quadtree partitioning, either QD or QR, is used.

This load balance problem led to the use of the current scheme for distributing the domain blocks among the slave nodes. The utilized scheme, for both QD and QR algorithms, distributes the domain blocks to the slave nodes and creates a queue of unprocessed range blocks, on the host, as assignments to the slaves.
FIGURE 4. Circulating pipeline communication model for N nodes.
The host controls pipeline input and output from the slave nodes, except in the slave–slave communication steps. The slave nodes communicate in a circulating pipeline fashion in which tagged range blocks advance through pipeline stages. Assignments are transmitted, from the host, to idle nodes requesting an assignment. When a slave has received a range block from the host, the block enters the pipeline with a tag carrying the number of the entering node. The range block will circulate in the pipeline until all nodes have been visited or a range–domain match is found. Example events in the pipeline are shown in Figure 5. It is important to note that many such events may occur simultaneously within the pipeline.

5. THE PARALLEL QD ALGORITHM
5.1. The QD host program

The QD host program is loaded on node zero of the allocated cube executing the program. The host first loads the original image and creates a data structure in which each range block is marked as covered or uncovered. Initially, all range blocks are marked as uncovered. Then, the MSE threshold value and the minimum and maximum size of the range blocks are transmitted to all nodes. The image is then divided into domain blocks of an appropriate size according to the current quadtree level.
FIGURE 5. Examples of pipeline events.
These domain blocks are then distributed evenly between the slave nodes. Subsequently, the host waits either for slave nodes to request range block assignments or for the result of a comparison. If a message received by the host is a request, the data structure marking range blocks covered or uncovered is searched for an uncovered range block that has not been assigned to any slave. If an uncovered and unassigned range block is found, it is transmitted to the requesting node and marked as assigned. If all range blocks at the current quadtree level are assigned or covered, the request is removed from the incoming buffer and no action is taken. If a message indicating a successful match is received, or the current level is the last level of the quadtree decomposition, the transformation parameters are stored and the range block is marked as covered. Otherwise, the range block is decomposed into four new uncovered range blocks. Whenever a result is received, the number of results received is compared with the total number of assignments. If all results have been received, an end-of-level command is broadcast to all slave nodes. If more quadtree levels exist, the host transmits the new set of domain blocks and the slaves continue searching for appropriate transformations for all remaining range blocks.

5.2. The QD slave program

The QD slave process executes on all nodes except the one executing the host process. A slave process begins by receiving the MSE threshold value and the minimum and maximum size of the range blocks. Then, the slave waits for the host to transmit the domain blocks dedicated to the slave. As the domain blocks arrive, they are preprocessed. The preprocessing involves classifying each domain block and summing its pixel values. These sums are used in equations (11)–(13).

The slave checks the status of the communication buffer for possible assignments. If the input communication buffer is empty, a range block to process is requested from the host. Then, the slave waits for either a range block or an end-of-level message. The range block can arrive from another slave, as well as from the host. If a range block arrives from the host, it is preprocessed. If the range block is classified as a grayshade block, the average grayscale is determined and the result is returned to the host. Otherwise, the result of the summation is stored in a data structure containing information about that particular range block, including the pixel data. Furthermore, the number of the node that received the assignment is stored in the startnode field for further use. The data structure also contains the result of the current best transformation obtained.

After preprocessing the range block, it is processed as though it originated from a slave. This involves comparing the range block with the domain blocks classified as edge blocks. For every comparison, the match result is compared with the current best match found. Eventually, the fields for the best transformation are updated. If the match is better than the MSE threshold value, initially received from the host, the comparison is terminated and the result is transmitted to the host. If no match is found, the startnode field in the data structure is examined. If the field points to the next node in the pipeline, all domain blocks have been tried and the best found match is transmitted to the host. Otherwise, the range block is transmitted to the next node in the pipeline.
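The interaction of the host queue, the circulating range blocks and the per-node domain shares can be mimicked in a toy, single-process form. The sketch below uses an ordinary loop in place of nCUBE message passing and a match(d, r) stand-in for the range–domain comparison; it is not the parallel implementation itself.

```python
# Toy simulation of the circulating pipeline of Sections 4 and 5: each 'slave'
# holds a share of the domain pool, and a tagged range block visits slaves in
# ring order until a match within the threshold is found or all nodes are seen.

def pipeline_encode(range_blocks, domain_shares, match, threshold):
    n = len(domain_shares)
    results = []
    for tag, r in enumerate(range_blocks):
        best = (float("inf"), None)
        start = tag % n                       # node at which the block enters
        for hop in range(n):                  # circulate through all nodes
            node = (start + hop) % n
            for d in domain_shares[node]:
                err, params = match(d, r)
                if err < best[0]:
                    best = (err, params)
            if best[0] <= threshold:          # early exit on a good match
                break
        results.append((tag, best))
    return results
```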
5.3. QD parallel runtime performance

To analyze parallel runtime performance, three test images are utilized. The images are referred to as Boy, Lena and USA, and are shown in Figure 6. These test images represent three different degrees of complexity. The Boy image has the lowest complexity, with several larger smooth areas exhibiting little grayscale change. The Lena image is a medium complexity image, possessing minor smooth areas and having certain distinct features such as the shoulder-to-background transition. An additional justification for using the Lena image is its status as a reference image in the image compression literature. The last test image is a weather image, taken from the METEOSAT satellite, with few smooth areas and a complex cloud pattern producing large contrast differences.

FIGURE 6. Test images.

Figure 7, as given by Jackson and Blom [1], shows the parallel execution time, as a function of the number of nodes, for each of the three test images. Attainable speedup is seen to be a function of the image complexity. However, nearly linear speedup is observed for all test images. Execution time is also a function of the specified MSE tolerance, with the more complex images requiring a larger portion of the image to be processed at the lowest quadtree level. An example is shown in Figure 8. As seen, much of the Lena image is processed at the lowest level of the quadtree decomposition for low MSE tolerance levels.

FIGURE 7. QD execution time versus number of nodes (MSE = 10).

FIGURE 8. Illustration of QD partitions for Lena.

Efficiency for the QD algorithm, as measured by Jackson and Blom, is based on the time the algorithm spends performing range–domain block comparisons. Using this measure, the overall efficiency of the algorithm is seen to be high. An exception occurs when grayshade blocks are unevenly distributed between slave nodes. Recall that grayshade blocks, as part of the classification scheme, are not considered for range–domain block comparisons. Efficiency thus depends on how evenly the grayshade blocks are distributed among the nodes. Consider the ratio of edge blocks to the total number of blocks for a node. That ratio is proportional to the time required to compare a range block with the pool of domain blocks at that particular node. If that ratio differs from node to node, the slowest nodes become bottlenecks and thus determine the speed of the pipeline. Thus, an uneven distribution of the number of edge blocks results in idle nodes. This is confirmed by Figure 9, which shows the efficiency of the 15 slave nodes in a 16 node configuration. Clearly, the first two and last three nodes exhibit the lowest efficiency for the Boy image. Decomposing the Boy image vertically into 15 equal size pieces, the first two pieces and the last three are seen to have the most grayshade blocks. When these pieces are distributed to the nodes, the first two and last three nodes in the pipeline process their assigned range blocks faster, as fewer edge-classified domain blocks are allocated to those nodes. This leads to idle time at these nodes and lower efficiency, as confirmed by Figure 9. The Lena and USA images show smaller fluctuations, as their grayshade blocks are more evenly distributed.

FIGURE 9. QD efficiency versus node number for test images.

5.4. Limitations of the QD algorithm
Although the QD algorithm exhibits good speedup, a more detailed analysis illustrates that, when compressing complex images with a low distortion tolerance, many of the calculations performed are not, ultimately, required. For example, consider the quadtree decompositions of the Lena image shown in Figure 8. For an MSE threshold of 10, it is observed that many small range blocks comprise the reconstructed image. These blocks are formed by iteratively decomposing larger range blocks for which no successful range–domain match exists. In fact, many of the small range blocks do not, themselves, match within the specified MSE tolerance. Rather, the best range–domain match, exhibiting the lowest MSE difference, is used. Therefore, many calculations are performed which result in a failed range–domain match search.

If the efficiency of the algorithm is measured as the percentage of the image area covered by each QD level, the QD algorithm is observed to be rather inefficient. This can also be stated as the percentage of range–domain comparisons which fail at each QD level. Figure 10 gives the percent image area coverage versus MSE tolerance for the Lena image for each of the three QD levels. The total failure percentage is also given. Additional success/failure data for all test images is given in Table 2. From this data, it is observed that, even for simple images, many range–domain comparisons result in failed matches. This inefficiency, in the form of redundant (unused) range–domain block comparisons, may be minimized by reversing the order in which quadtree levels are considered. By beginning the range–domain comparisons with the smallest blocks, i.e. the lowest quadtree level, a larger percentage of the image may be 'covered' more quickly, thus resulting in a reduced parallel runtime. This bottom-up recomposition is the basis for the QR algorithm described in Section 6.
FIGURE 10. QD Lena area coverage per level versus MSE.
6. THE PARALLEL QR ALGORITHM
6.1. The QR host program

In the QR algorithm, processing begins from the lowest level in the decomposition, i.e. the image is initially partitioned into range blocks of small size. This decomposition is shown in Figure 11(a). As in QD, all range blocks are initially assumed to be uncovered. Elements of the data structure marking range blocks as covered/uncovered are assigned accordingly. The range–domain comparisons proceed in the same fashion as in the QD algorithm. As the QD and QR slave programs are practically identical, the QR slave program will not be discussed further. For the first QR level, transformation parameters are stored in a linked list for all successful range–domain comparisons. At this level, for failed range–domain comparisons, the optimal transformation parameters are stored and the corresponding range blocks are marked as covered. Thus, only those range blocks for which successful domain matches are found will be considered for recombination at the next QR level.
FIGURE 11. Example QR range block recomposition.
TABLE 2. Number of QD blocks per level for the three test images

               Boy Image            Lena Image            USA Image
MSE  Block   large medium small   large medium small   large medium small
3    edge       3     57   3662      0     94   3603      0      2   4002
     shade      1      3    130      0      3    105      0      0     86
5    edge       7    248   2468     10    192   3007      0      8   3953
     shade      7     41    248      0     14    105      0      2    103
10   edge      32    191    978     26    210   2392      0     22   3781
     shade     87     79    134      7     50    136      0     15    167
20   edge      39    191    525     36    232   1809      2     53   3492
     shade    111     81     83     20     64    207      3     26    208
40   edge      59    116    328     40    260   1307      7     75   3132
     shade    128     61     68     32    100    197      9     42    240
60   edge      66    107    224     49    263   1021      9     92   2890
     shade    134     50     44     40    107    171     14     55    250
80   edge      68     93    172     48    264    878     14     95   2673
     shade    141     43     36     51    109    142     17     67    279

At the next QR level, a block is considered uncovered if and only if its four constituent smaller blocks were uncovered. If any of the smaller blocks is covered, the entire larger block is considered covered. In Figure 11(b), groups of four blocks which may be combined are highlighted. Their subsequent recombination is also shown. It is important to note that only uncovered blocks
must be processed, i.e. compared with domain blocks, at this level. This significantly reduces the required comparisons at higher QR levels. The algorithm continues at subsequently higher levels of the quadtree recomposition until range blocks of a predetermined largest size are considered. This is shown in Figure 11(c). After the range–domain comparisons at the highest QR level are complete, the linked list storing the transformation parameters is traversed and redundant entries are deleted. For every successful entry stored in the linked list at QR level i + 1, there are four entries stored from QR level i that are deleted. The process continues until all pairwise levels are considered. Although we require four successful range–domain matches before blocks may be recombined, other scenarios wherein fewer than four blocks are combined are feasible. However, we find that, as the QR approach produces practically the same result as QD, the additional computations required to support other recombinations are not justified.
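The covered/uncovered bookkeeping for one recombination step reduces to a small predicate, sketched below; the representation of block origins is illustrative, not that of the nCUBE data structures.

```python
def next_level_blocks(uncovered, size):
    """uncovered: set of (x, y) origins of size x size range blocks whose
    range-domain match succeeded at the current QR level. Returns origins of
    the 2x-larger blocks whose four constituents all matched."""
    bigger = []
    step = 2 * size
    for x, y in uncovered:
        if x % step == 0 and y % step == 0:
            quad = {(x, y), (x + size, y), (x, y + size), (x + size, y + size)}
            if quad <= uncovered:             # all four children matched
                bigger.append((x, y))
    return bigger
```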
FIGURE 12. QR execution time versus number of nodes (MSE = 10).
6.2. QR parallel runtime performance
The Boy, Lena and USA images are again used as test cases to analyze parallel runtime. Figure 12 gives the execution time versus number of nodes for the QR algorithm. For each case, the MSE threshold is 10. As can be observed, the parallel runtime, as compared to the QD algorithm, is improved significantly. Similar observations can be made for other MSE tolerance values. QD and QR runtime data for additional MSE values for the three test images is given in Table 3. From the experimental data, the QR algorithm performs better than QD for all MSE values. Although the execution time difference between QR and QD is less for higher MSE values and for less complex images, an improvement is shown consistently for the QR algorithm. Efficiency is now, more appropriately, measured as the percentage of image area covered at each QR level. Figure 13 gives the percent image area coverage versus MSE tolerance for the Lena image for each of the three QR levels. Additional success/failure data for the test images is given in Table 4.
7. QD AND QR COMPARISONS
7.1. Execution time analysis

In analyzing the execution time required for searching a given quadtree level, we may consider that complexity is approximately proportional to the area, irrespective of the level in the quadtree. This would seem to be justified by equations (14) and (15). However, what is clearly important is the percentage of successful matches, and minimizing the computations which are likely to result in a failed range–domain match. Another consideration here is that the matching search terminates when an acceptable match is found.
Since the probability of finding a match at higher quadtree levels is lower for typical images, the search time for matching a given range block at higher quadtree levels is correspondingly greater. Thus the complexities of the quadtree levels are not as equivalent as an area analysis might indicate. For realistic images, the area covered by small edge range blocks is significantly greater than that for large edge blocks. Observing the data in Table 2 for the Boy image, the entry for MSE = 10 shows 978 small edge blocks and 32 large edge blocks. This represents an area coverage of 23.87 and 12.5%, respectively. A greater disparity in area coverage between large and small blocks, favoring the QR method, is observed for the more complex Lena and USA images.

TABLE 3. QD and QR parallel runtimes (16–64 processors) for various MSE values

                  Boy Image              Lena Image               USA Image
MSE  Method    16     32     64       16     32     64        16      32     64
3    QD      144.78  72.40  35.97   169.46  84.32  41.32    205.04  100.43  49.18
     QR       58.01  29.87  16.67    70.23  37.18  20.70     85.38   43.91  24.33
5    QD      113.36  56.15  28.38   146.77  74.37  36.56    200.35   97.69  48.19
     QR       44.70  23.63  13.39    59.46  31.18  18.03     81.89   42.45  23.35
10   QD       55.89  27.43  13.60   121.93  61.74  30.32    189.88   93.46  45.67
     QR       21.01  11.54   6.70    47.76  24.99  14.91     75.61   39.93  21.52
20   QD       37.84  18.91   5.99    98.27  49.82  19.09    173.49   85.86  33.89
     QR       19.07   9.79   7.74    39.90  21.14  12.88     66.17   35.05  18.66
40   QD       25.30  12.85   6.79    79.74  40.11  21.90    151.52   73.78  37.91
     QR       14.49   7.64   6.58    32.97  18.02  10.50     53.23   28.72  15.52
60   QD       21.61  10.75   5.92    68.83  36.01  18.99    134.51   66.19  33.83
     QR       14.10   7.93   7.10    30.99  16.90   9.76     45.81   25.11  13.96
80   QD       19.52   9.26   5.32    60.98  32.13  17.10    122.67   60.97  31.26
     QR       13.27   7.04   4.82    30.58  16.49   9.10     41.94   23.28  13.14

Let us consider further the complexities of the three-level quadtree. Assume the complexity associated with matchings for the first quadtree level is C (for either QD or QR). As stated earlier, this is not strictly true given the lower
probability of encoding at higher quadtree levels, but will suffice for our argument. Let us also eliminate shade blocks from consideration, as their encoding complexity is practically negligible compared to that of edge blocks. If the encoding consists of 10% large blocks, 20% medium blocks and 70% small blocks, then the total QD complexity is C + 0.9C + 0.7C = 2.6C: for the first level we must consider the entire image, and subsequent levels only consider the area not already covered. For QR, the complexity becomes C + 0.3C + 0.1C = 1.4C, which is nearly a 2-fold performance improvement. We note that the encoding times for QD and QR would be approximately the same if the distribution of large, medium and small edge blocks were equal. However, we do not find that to be the case with realistic images. If we consider the simplification with respect to shade blocks, and that the complexity of finding small range–domain matches is less than that of finding large range–domain matches, then the difference between QD and QR is more pronounced, as is observed in the experimental data.
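The arithmetic of this argument is easily checked; the 10%/20%/70% split is the illustrative figure used above.

```python
# Back-of-the-envelope check of the 2.6C versus 1.4C argument, assuming the
# matching cost at a level is proportional to the image area processed there.

large, medium, small = 0.10, 0.20, 0.70   # fraction of the encoding per level

# QD: the whole image at the large level, then only the area not yet covered.
qd = 1.0 + (1.0 - large) + (1.0 - large - medium)   # = 2.6
# QR: the whole image at the small level, then only recombination candidates.
qr = 1.0 + (large + medium) + large                 # = 1.4
print(qd, qr, qd / qr)                              # 2.6 1.4 ~1.86
```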
FIGURE 13. QR Lena area coverage per level versus MSE.

7.2. Compression ratios and SNR for the QD and QR algorithms
Important performance factors are the compression ratios obtained and the information loss introduced. The loss is referred to as distortion and is measured as a signal-to-noise ratio (SNR, as defined in equation (9)). The SNR, as a function of the compression ratio for the test images, is shown in Figure 14. Three sets of data apply to each image, obtained from the QD, QR and JPEG algorithms, respectively. The JPEG data is included for a simple comparison with other lossy methods. Similar results for quadtree fractal image compression and other lossy compression techniques (JPEG and wavelet) have been published by Fisher [4]. As predicted when choosing the images, the performance with respect to compression ratios varies considerably. However, little variation is observed between the QD and QR algorithms for a given image.
TABLE 4. Number of QR blocks per level for the three test images

               Boy Image            Lena Image            USA Image
MSE  Block   large medium small   large medium small   large medium small
3    edge       3     56   3669      0     94   3603      0      2   4002
     shade      0      5    135      0      3    105      0      0     86
5    edge       6    249   2471      9    193   3009      0      8   3953
     shade      7     43    249      0     16    107      0      2    103
10   edge      32    191    982     26    210   2392      0     22   3781
     shade     86     81    138      7     50    136      0     15    167
20   edge      37    192    530     36    232   1809      2     53   3492
     shade    111     86     86     20     64    207      3     26    208
40   edge      58    118    331     39    261   1309      3     86   3152
     shade    128     62     69     32    102    199      9     42    240
60   edge      64    108    230     48    262   1026      7     97   2902
     shade    134     55     46     40    110    174     14     55    250
80   edge      66     93    178     46    264    886     12     99   2689
     shade    141     49     38     51    114    146     17     67    279

The minor variation in
compression attained for each algorithm is emphasized by observing the similar quadtree partitionings in Figure 15 for the QD and QR encodings of the Boy image (MSE = 10). In this case, a single block, corresponding to the first QD level, is seen to be the only block encoded differently by the QD and QR algorithms. Similarity in QD versus QR partitionings can be observed for other images. Both compression algorithms perform well on the Boy image, with compression ratios ranging from 10 to 15 without visible distortion. The algorithms also perform well for the Lena image, although distortion becomes visible when the compression ratios exceed 10:1. When compressing the USA image, only modest compression ratios, not exceeding 4:1 without visible distortion, are attained for either algorithm. Note that the SNR for an image without visible distortion varies from one image to another. This implies that SNR measurements are not comparable from one image to another.

FIGURE 14. SNR versus compression ratio for QD, QR and JPEG.

FIGURE 15. QD and QR partitions for the Boy image (MSE = 10).

8. CONCLUSIONS

A very promising compression algorithm is fractal image compression. Fractal image compression exploits natural affine redundancy present in most images to achieve high compression ratios. However, the search for the redundancy, represented as contractive transformations, has high computational demands and exhibits an O(Δ⁴) complexity. Much research is directed toward the development of image classification algorithms to decrease the compression time. Several solutions have been suggested and implemented [4, 7]. The presented algorithm employs a quadtree recomposition technique wherein inefficiency, in the form of the redundant (unused) range–domain block comparisons present in the QD algorithm, may be minimized by reversing the order in which quadtree levels are considered. The QR algorithm employs a parallel host/slave model, wherein the slaves are connected in a circulating pipeline.
The image is distributed evenly among the slave nodes, with the host managing a queue of work assignments. The QR algorithm proves very efficient and is shown to provide consistently superior parallel runtime when compared to the QD algorithm.

Additional research is necessary to incorporate the QR technique into content-sensitive image partitioning techniques, such as HV partitioning. Furthermore, more elaborate image block classification schemes may be employed to further reduce the compression time. However, utilization of more complex block classification will require additional research into appropriate load distribution schemes to optimize execution time. Finally, we note again that the fractal algorithm is competitive with other lossy (DCT and wavelet) methods in terms of compression and reconstructed image quality. Compression is obviously better than that of lossless methods. Notably, additional research is necessary to make fractal compression competitive with other methods from an execution speed perspective. However, the current real strength of the fractal method is its fast decompression time. These characteristics make the fractal scheme well suited for write-once-read-many archival media, such as CD-ROM, wherein fast decoding and high image quality are important and encoding time is not as critical.

REFERENCES

[1] Jackson, D. J. and Blom, T. (1995) A parallel fractal image compression algorithm for hypercube multiprocessors. In Proc. of the 27th Southeastern Symp. on System Theory, Mississippi State, MS, pp. 274–278.
[2] Jackson, D. J. and Tinney, G. S. (1995) Performance analysis of distributed implementations of a fractal image compression algorithm. Concurrency: Practice and Experience. J. Wiley & Sons Inc., New York.
[3] Fisher, Y. (1995) Fractal Image Compression: Theory and Application. Springer-Verlag, Berlin.
[4] Fisher, Y. (1992) Fractal image compression. SIGGRAPH '92 Course Notes.
[5] Barnsley, M. F. and Hurd, L. P. (1993) Fractal Image Compression. AK Peters Ltd, Wellesley, MA.
[6] Jacquin, A. E. (1992) Image coding based on a fractal theory of iterated contractive image transformations. IEEE Trans. Image Process., 1, 18–30.
[7] Frigaard, C., Gade, J., Hemmingsen, T. T. and Sand, T. (1994) Image Compression Based on a Fractal Theory. Internal Report, Institute for Electronic Systems, Aalborg University, Denmark, pp. 1–10.
[8] Monro, D. M. (1993) Class of fractal transforms. Electron. Lett., 29, 362–363.
[9] Jacquin, A. E. (1993) Fractal image coding: a review. Proc. IEEE, 81, 1451–1465.
[10] Blom, T. (1995) Fractal Image Compression Using a Parallel Pipeline Computation Model. M.Sc. Thesis, The University of Alabama.
[11] Fisher, Y., Rogovin, D. and Shen, T. P. (1994) A comparison of fractal methods with DCT (JPEG) and wavelets (EPIC). In SPIE Proc. Neural and Stochastic Methods in Image and Signal Processing III, vol. 2304–16, San Diego, CA.
[12] Tinney, G. S. (1994) Parallel Fractal Image Compression. M.Sc. Thesis, The University of Alabama.