Real-time fractal image coder based on characteristic vector matching
S. Samavi¹, M. Habibi¹, N. Roshanbin¹, S. Shirani²
¹ Department of Electrical and Computer Engineering, Isfahan University of Technology, Isfahan, Iran
² Department of Electrical and Computer Engineering, McMaster University, Hamilton, ON, Canada
[email protected]
Abstract: In this paper a new hardware method for fast and low-power fractal coding is presented. The introduced method is based on binary classification of domain and range blocks. The technique not only reduces power consumption and increases processing speed, but also causes minimal degradation of the output compared to existing fractal techniques. To demonstrate the functionality of the proposed algorithm, the architecture was implemented on an FPGA chip. It was further shown that the proposed architecture reduces power consumption. The resulting compression ratio, PSNR, gate count, compression speed and power consumption are compared against existing designs. Applications of the proposed design in certain fields, such as coding of mass-volume databases, are also discussed.
Keywords: Fractal, image compression, VLSI, low power.

1 Introduction
With the continuous increase in the use of the internet and wireless devices, the importance of data compression is increasing too. The main advantage of compression is that it reduces data storage requirements. It also offers an attractive approach to reducing communication costs when transmitting high volumes of data over long-haul links, through higher effective utilization of the available bandwidth. The fractal coding algorithm has recently received great attention as a powerful image compression technique. Fractal image compression (FIC), when applied to certain types of images and database applications, leads to relatively high compression ratios [1]. Furthermore, the decoding of fractal-coded data is straightforward and fast. The major drawback of the coder is its low compression speed [2]. Recently fractal coding has been applied to other fields such as image restoration, medical image classification and watermarking [3, 4]. While there has been extensive research to improve compression speeds, there have been fewer attempts to find a solution compatible with VLSI circuits that results in an efficient parallel hardware implementation. Using dedicated hardware to perform the compression gives the opportunity to exploit the ample parallelism available in the coder algorithm. Fractal image coding is based on partitioned iterated function systems (PIFS). It builds on local self-similarities existing in an image. A fractal compression algorithm first partitions an original image into non-overlapping R×R blocks, called range blocks, and forms a domain pool containing all possibly overlapping D×D blocks, together with the 8 isometries formed from reflections and rotations, called domain blocks [1]. The size of each domain block should be larger than that of the range block.
For each range block, the best matching domain block must be found via an affine transformation $w_k$, given in equation (1):

$$
w_k \begin{bmatrix} x \\ y \\ z \end{bmatrix} =
\begin{bmatrix} c_k & d_k & 0 \\ e_k & f_k & 0 \\ 0 & 0 & s_k \end{bmatrix}
\begin{bmatrix} x \\ y \\ z \end{bmatrix} +
\begin{bmatrix} g_k \\ h_k \\ o_k \end{bmatrix}
\qquad (1)
$$
where $s_k$ controls the contrast and $o_k$ controls the brightness. $z = f(x, y)$ is the gray level of the pixel at position $(x, y)$, and $c_k, d_k, e_k, f_k, g_k, h_k$ encode one of the eight symmetries: identity, 90° clockwise rotation, 180° rotation, 270° clockwise rotation, x reflection, y reflection, y = x reflection and y = −x reflection. Obviously the match between a range and its corresponding domain is not perfect and an error exists. When the jth domain is tested against the kth range block, the matching error is calculated as

$$R(d_j, r_k) = \sum_{i=1}^{N} (s\,a_i + o - b_i)^2 \qquad (2)$$
where $a_i$ and $b_i$ are the pixels of the domain and range blocks, respectively. This error is minimized for specific values of $o$ and $s$, given by the following two equations:

$$s = \frac{N \sum_i a_i b_i - \left(\sum_i a_i\right)\left(\sum_i b_i\right)}{N \sum_i a_i^2 - \left(\sum_i a_i\right)^2} \qquad (3)$$

$$o = \frac{\sum_i b_i - s \sum_i a_i}{N} \qquad (4)$$
where $N$ is the number of pixels in the range block. The domain block is down-sampled so that only $N$ pixels are involved in the computations. Plugging $o$ and $s$ back into equation (2) yields the minimum RMS error:

$$R(d_j, r_k) = \frac{N \sum_i b_i^2 - \left(\sum_i b_i\right)^2}{N} - \frac{N \sum_i a_i^2 - \left(\sum_i a_i\right)^2}{N}\, s^2 \qquad (5)$$
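The per-pair computation in equations (2)–(5) can be sketched in a few lines of software; the function name and array representation are illustrative, not from the paper:

```python
import numpy as np

def match_error(domain, rng):
    """Least-squares contrast s and brightness o that map a down-sampled
    domain block onto a range block, plus the residual error of eq. (5).
    `domain` and `rng` are flat arrays of the same length N."""
    a = np.asarray(domain, dtype=np.float64)
    b = np.asarray(rng, dtype=np.float64)
    n = a.size
    sa, sb = a.sum(), b.sum()
    saa, sbb, sab = (a * a).sum(), (b * b).sum(), (a * b).sum()
    denom = n * saa - sa * sa
    s = (n * sab - sa * sb) / denom if denom else 0.0   # eq. (3)
    o = (sb - s * sa) / n                               # eq. (4)
    err = (n * sbb - sb * sb) / n - s * s * denom / n   # eq. (5)
    return s, o, err
```

For a range that is an exact affine copy of the domain (b = s·a + o) the residual is zero, matching a direct evaluation of equation (2).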
For a specific range block, a domain is a match if the domain or any of its eight symmetries produces the minimal error. The encoder must record the position of the best matched domain block and its transformation for each range block, so that the decoder can reconstruct the block. The size of a domain block is usually chosen to be 4 times that of a range block. To extend the number of available domains, they are arranged in an overlapped fashion. Figure 1 shows the arrangement of range and domain blocks in both overlapped and non-overlapped manners.
Figure 1. Arrangement of range and domain blocks; (a) without overlaps and (b) with overlaps
Obviously, the most time-consuming part of the process is the search for a matching domain block among the available domains (also called the domain pool). The algorithm should check every element of the domain pool against a specific range and select the one that produces the minimum error. Even more processing would be required if all possible rotation-flip transforms were also to be checked. The main bottleneck in FIC algorithms is the long search time needed to find matches between two sets of data (the range and domain pools). Some hardware solutions perform the search in parallel and require the presence of the whole range pool inside the chip [5] (domains can be extracted from the range pool), and thus require a vast amount of on-chip storage. This means that a redundant copy of the image must always exist inside the chip, and parallel access to such a huge amount of data makes this approach practically slow. A reasonable alternative is storing the range pool partially and performing the parallel search on a specific window of the image [6]. A number of research groups have used genetic algorithms for speed-up purposes [7, 8, 9]. Wu [7] uses a two-stage genetic algorithm in order to exploit spatial correlation for fractal coding. Yi-Ming Zhou [10] proposed a hybrid fractal image compression method based on an image feature and a special neural network (FNFC). Zhou's method is a two-part approach. First, an image feature is defined for the best-matching search, which is claimed to reduce the encoding time. Second, a special neural network is constructed to modify the mapping scheme for sub-blocks in which the pixel values fluctuate greatly. In a number of other research works, frequency characteristics of an image have been used through wavelet transforms or the DCT to achieve faster fractal coding [11-12]. Other solutions try to classify the range and domain pools into a number of groups. Classification is a scheme that many researchers have applied to the blocks of an image in order to restrain the search space and hence accelerate the coding process [13-14]. The search then needs to be performed only on a range and the domains that belong to the same class. Confining the coder to search a limited area in the vicinity of the range block is another field of interest [15-16]. When the whole image is not searched for the best match, the coding process is less sluggish. Most of the algorithms presented in the literature are complex routines which require software implementation. Obviously, hardware realization offers faster coding speed but requires a less complex algorithm to begin with. The hardware presented in [17] performs the original full search algorithm sequentially and is relatively slow. Some hardware solutions have been proposed to overcome the long search time required by the algorithm. The architecture presented in [18] uses kick-out conditions to bypass the search for blocks that do not satisfy certain criteria and thus increases search speed.
There have also been multiresolution search techniques reported for fast VLSI compression [19, 20]. The architecture presented in [20] performs the search in parallel using local communication links and requires the presence of the whole range pool inside the chip. In this paper a characteristic vector is used to classify the range and domain blocks. The straightforward nature of this classification makes it a good candidate for hardware implementation [21]. The algorithm is then implemented in hardware. A number of schemes are used to exploit the time overlap between different tasks, and parallel custom hardware units are employed in the search process. These hardware novelties, along with the simple classification routine, make the overall scheme an attractive one: high-speed fractal compression becomes possible with little image degradation compared to the full search technique. In section 2 the proposed binary matching classification method is explained, and the proposed hardware architecture is presented in section 3. The results from implementation of the proposed circuit on an FPGA are reported in section 4. Concluding remarks are presented in section 5.
2 The suggested classification technique
To facilitate a high-speed domain pool search, a new binary matching technique is proposed. The first step is to partition a block into a number of sub-blocks. Then a classifier vector (CV) is formed by assigning a single bit to each of the sub-blocks. This is performed on both range and domain blocks. The matching procedure is only performed on range and domain blocks that belong to the same class. That is, the best matching domain block is selected from a limited set of domains which have the same CV as the corresponding range. This technique is essentially a classification method which maps each range and domain to an n-bit CV. The resolution of a CV is its number of bits. If each sub-block is only one pixel, then the resolution of the CV is equal to that of the initial block. When the resolution of the classifier vector is high and the sub-blocks are small, there is a low probability of finding an exact binary match between a range block and a block from the domain pool. A low-resolution, sub-sampled binary image of each block, on the other hand, increases the number of domain blocks with the same CV as that of the range, which increases the probability of finding a match. Simulation results in Figure 2 show the average probability density of binary matches versus classifier vector resolution. When a block is partitioned into sub-blocks that are only one pixel in size, there is close to zero probability of finding a match for that block. As the size of the sub-blocks increases and their binary resolution decreases, the probability of finding a match increases. For a 256×256 image with 4-pixel domain overlap and 8×8 range blocks, a 64-bit (8×8) vector classifies a range block. Hence, the probability of finding a match is 20×20/2^64. Figure 3 illustrates a situation where a sub-block is only one pixel in size. A large number of classes diminishes the chance of finding a match.
If a block is represented by an 8-bit CV, as shown in Figure 4, then 2^8 different combinations are possible and there is an average chance of 20×20/256 of finding a match, which is quite acceptable. Therefore, we use the partitioning shown in Figure 4 for classification purposes.
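As a concrete sketch, the 8-bit CV of an 8×8 block can be computed as below. The 4×2 arrangement of 2×4-pixel sub-blocks is an assumption for illustration; the paper's exact partition geometry is the one shown in Figure 4:

```python
import numpy as np

def classifier_vector(block):
    """8-bit classifier vector (CV) for an 8x8 block: one bit per
    sub-block, 1 if the sub-block mean is >= the block mean, else 0.
    The 4x2 grid of 2x4-pixel sub-blocks is an assumed partition."""
    block = np.asarray(block, dtype=np.float64).reshape(8, 8)
    overall = block.mean()
    cv = 0
    for r in range(0, 8, 2):          # 4 sub-block rows, 2 pixels tall
        for c in range(0, 8, 4):      # 2 sub-block columns, 4 pixels wide
            sub = block[r:r + 2, c:c + 4]
            cv = (cv << 1) | int(sub.mean() >= overall)
    return cv                         # integer in [0, 255]
```

A uniform block produces the all-ones CV, while a block that is dark on top and bright on the bottom sets only the low-order bits.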
Figure 2. Probability of finding a match under different classifier vector (CV) resolutions
Figure 3. (a) The Lenna image and a sample block from the picture. Each sub-block is only one pixel. (b) Different formations of the 64-bit classifier vectors define different classes.
At this stage, for each desired range only a fraction of the entire domain pool needs to be searched, which dramatically accelerates the search process.
Figure 4. (a) The Lena image, a sample block and its 8-bit classifier vector (CV). (b) Classification of blocks.
To further extend the number of candidate domain blocks, the CVs of a domain and of all of its symmetries are treated as identical, and hence fall in the same class. A CV is an eight-bit number, and seven distinct one-bit logical rotations of it are possible; we group all eight of these binary numbers into one class. Figure 5 presents an example of the proposed classification process. With this modification, for every partitioned range block, the 8 possible rotational and flip patterns (symmetries) are considered as well. Reducing the resolution of the classification could increase the matching error, but placing the rotated blocks in the same class as the original block introduces no further error in the matching process. For an n-bit CV of a range or domain block, without rotational considerations, there are 2^n classes. For each desired range block, all domains that belong to the same class are accessed and fed into the processing unit to select the domain that introduces the least error. When the rotations of a block are grouped in the same class as the block, the total number of classes is divided by the number of possible symmetries (8 different transforms) and 2^(n−3) classes are formed. In addition, a 3-bit symmetry type is associated with each block. In this case, for each desired range block, all domains of the same class, regardless of their type, are selected. All of the selected domains, along with their types, are processed to find the domain which produces the least error. The partitioning process and classification vector generation are limited to a level suitable for parallel processing. With 32 classes for a 256×256 image with 4-pixel domain overlap and 8×8 range blocks, the average number of domains in the domain pool subset (the average number of matches found for each range) will be about 12, and hence eight parallel processing units are a satisfactory configuration to find the best match among the recognized candidates.
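The grouping of a CV and its one-bit rotations into one class can be sketched as follows. Using the numerically smallest rotation as the class representative is an assumed canonical choice; note that a few rotation-symmetric patterns (such as 0x00 or 0xAA) have orbits smaller than 8, so the count of 2^(n−3) classes is approximate:

```python
def rotate8(v, k):
    """Rotate an 8-bit value left by k bit positions."""
    k %= 8
    return ((v << k) | (v >> (8 - k))) & 0xFF

def class_of(cv):
    """Class representative and 3-bit symmetry type for an 8-bit CV:
    all eight one-bit rotations share one class, represented here by
    the smallest rotation; the rotation count is the symmetry type."""
    rotations = [rotate8(cv, k) for k in range(8)]
    rep = min(rotations)
    return rep, rotations.index(rep)
```

Two CVs that are rotations of one another map to the same representative and thus land in the same class, differing only in symmetry type.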
The architecture that performs the desired tasks and the complete fractal coding is described in the next sections.
Figure 5. (a) The Lena image, a sample block and its 8-bit CV; (b) classification of a CV and its 8 symmetries into one class.
3 The proposed architecture

3.1 Overview of the proposed architecture
The proposed fractal coder architecture consists of several main blocks, each dedicated to one of the required stages of the coding process. The overall flowchart of the design is shown in Figure 6. The proposed procedure consists of two phases. In the first phase a block of the image is accessed, pixel by pixel, from the memory where the image is stored. An accessed pixel belongs to both a range and a domain block, so classification can be performed simultaneously for a range and a domain. During the first phase the whole image is accessed once and all of the range and domain blocks are classified. The classifier vector (CV) and the symmetry type of each range and domain block are stored in a separate memory. The data structure used to store information for range blocks differs from that for domain blocks; the relevant data organizations are explained in the next sub-section. The second phase of the coder is responsible for the matching process. For each range block, a domain is selected from the candidate domain blocks which belong to the same class as the intended range. For implementation purposes the matching process is divided into two parts. The computation of Σai, Σai², Σbi and Σbi² can be performed while the pixels are being accessed, since these quantities accumulate as each pixel arrives. The computation of Σaibi involves the pixels of a number of domains as well as the pixels of one range, so Σaibi is computed in a parallel processing unit that compares one range block with eight different domain blocks. On the other hand, there are quantities that require a sequential processor. Hence the second part, which calculates o, s and the matching error, is a sequential structure that uses the output of the parallel processing unit.
After all of the blocks are classified and their class numbers and symmetry types stored in the appropriate memories, the matching process starts. The process starts with class #1. A number of range blocks may have this classification. A range is picked from this class, and all of the domains with the same classification are compared with it to find the best match. The procedure continues until all of the ranges in class #1 are exhausted. Then class #2 and its ranges are processed. This procedure continues until all of the classes are processed and a matching domain has been found for every range block of the image. The proposed architecture is designed to avoid redundant memory cycles, which are a source of power dissipation and also act as a speed bottleneck. Since the basic full search fractal coding algorithm is based on uncorrelated memory searches, many references to the memory are possible. In the case of an external memory module, this means high activity on the processor I/O pins and the bus, and essentially high power dissipation due to the I/O pad and bus capacitances. Furthermore, different memory types have different access times. Static RAMs inherently have lower access times, while the widely used dynamic RAMs have relatively high access times, resulting in degraded coding speed. Moreover, the high number of memory accesses required to search for the best match limits the coding speed, since most of the required computations are carried out faster than the memory access time. Hence, reducing the number of memory accesses is all the more essential when dynamic RAMs are employed.
[Figure 6 flowchart. Classification phase: the image is loaded into RAM; range i and domain j are accessed and classified, and the classification is stored into RAM, for i = 1 … N and j = 1 … M. Matching phase: for each class k = 1 … 32, the (up to 8) domains of class k are accessed and compared with each range m of that class, and the address of the best matched domain is stored; the process ends when all classes are exhausted.]
Figure 6. Block diagram of the proposed architecture.
In the presented architecture about 4 image access cycles (IAC) are required to accomplish the complete task. One IAC is the time required for the whole image to be accessed from its resident memory in a pixel-wise manner. Obviously, the number of IACs greatly affects both the power consumption and the processing speed. In comparison, the full search coding algorithm requires as many IACs as the number of ranges. The first IAC used in the proposed architecture is in the first stage, which extracts the classification table and symmetry types. The rest of the processing, performed in the second phase, requires about 2 IACs to load matching range and domain blocks into the parallel processing unit. The procedure is to load the internal cache of the parallel processing unit with the domain blocks of a specific class (limited to a maximum of 8 blocks). Even though the domains of a class may have different symmetries, they are rotated to the same symmetry while being loaded into the hardware.
3.2 The classifier
The function of the first stage of the proposed coder is to classify all of the range and domain blocks. The assignment of a class number, without any symmetry considerations, is performed by extracting a classifier vector (CV) for each block. This is done by partitioning each block into a number of sub-blocks. If the mean intensity of a sub-block is less than the mean intensity of the block, a “0” is assigned to that sub-block; otherwise a “1” is allocated to it. Since there are eight sub-blocks in a block, a CV has eight bits. A domain is 4 times bigger than a range. Since we want the domains to have the same classifier vectors as the ranges, we need down-sampling by a factor of 4. When a domain is compared with a range, the down-sampled version of the domain is used in the computations. Hence, for classification purposes a domain is first down-sampled and then partitioned. With the 8-bit CV obtained, the next step is to extract the class number and symmetry type corresponding to the extracted binary number. This step is realized by a lookup table: the 8-bit CV is applied to the table as its address, and the output of the table is a class number and a symmetry type. The content of the table is pre-computed and the table itself is realized in a ROM. The complete classification is performed in one IAC. With no domain overlap, the procedure can be relatively straightforward, using a specific order of access to the range and domain blocks. Figure 9 shows an example for non-overlapping domain blocks.
Figure 9. Sequence of range and domain access used for classification. The case shown is with no domain overlap.
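The factor-of-4 down-sampling that equalizes a domain's pixel count with a range's can be sketched as 2×2 averaging. The paper does not specify the down-sampling filter, so averaging is an assumption (it is the usual choice in fractal coders):

```python
import numpy as np

def downsample_domain(domain):
    """Down-sample a 16x16 domain block to 8x8 by averaging each 2x2
    neighborhood, so the domain carries as many pixels as an 8x8 range
    block and can be partitioned with the same CV scheme."""
    d = np.asarray(domain, dtype=np.float64).reshape(16, 16)
    # group rows and columns into (8, 2) pairs, then average each pair
    return d.reshape(8, 2, 8, 2).mean(axis=(1, 3))
```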
As the sequence in Figure 9 shows, range number 1 is accessed and classified first. Meanwhile, domain block number 1 is also being processed. With the completion of ranges 2, 3 and 4, the classification of domain block number 1 is also completed. If domain overlap exists, then when accessing ranges 1, 2, 3 and 4, not only domain number 1 but also parts of domains 2, 3 and 4 are simultaneously being accessed, processed and classified. Figure 10 shows an example of overlapped domains.
Figure 10. Concurrent classification of multiple overlapped domains using a cache.
In Figure 10, the row of domains at the top of the image is called DA; it overlaps the next row of domains, DB. Rows of ranges, such as RA and RB, do not overlap.
A cache is used to store the intermediate results for the blocks that are not yet completely processed. The ranges are accessed as before. Suppose each domain row contains m overlapped domains. Referring to Figure 10, by the time the last range in row RA has been accessed, the domains of row DA are all classified, m domains of row DB are under classification, and their intermediate results are available in the cache. When the ranges in row RB are being classified, the domains in row DB are processed one by one. Each domain that finishes processing frees up its cache block. Meanwhile, new intermediate results for the domains in row DC are stored in the cache. The maximum number of CVs required to be stored in the cache is m+2, where m is the total number of domains in a row. The policy for placing blocks into the cache is illustrated in Figure 11(a), where the number inside each domain indicates the cache block that the domain is written to. As can be seen, m+2 cache blocks are required to avoid data corruption during CV calculation of the overlapped domain blocks. Since each image block is partitioned into 8 sub-blocks and the mean intensity of each sub-block must be stored, for each image block we need to save eight 8-bit numbers plus the mean intensity of the whole block. Figure 11(b) shows the sequence of cache blocks that are accessed when classifying the blocks in the first row of the image. As the first row of domains is being processed, the intermediate results from the second row are stored in the cache at blocks 6, 7, 1, 2 and 3. Storing partial results from the second row into cache blocks 1, 2 and 3 causes no data corruption, since these blocks are written to only after the first row has finished all of its transactions with them.
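The round-robin cache assignment of Figure 11(a) reduces to a modular rule; the function below is an illustrative reading of that figure, not code from the paper:

```python
def cache_block(domain_index, m):
    """Cache block (1-based) for the domain with raster index
    `domain_index` (0-based), given m overlapped domains per row:
    m + 2 cache blocks are cycled through round-robin, so a block is
    reused only after the overlapping domains that need it are done."""
    return domain_index % (m + 2) + 1
```

With m = 5, the first domains land in cache blocks 1, 2, 3, 4, 5, 6, 7, 1, 2, …, reproducing the numbering shown in the figure.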
Figure 11. (a) Cache block assignment, (b) sequence of cache accesses.
Only the 5 most significant bits of each pixel are used in the mean intensity calculation of each sub-block. The mean intensity of eight pixels is obtained by adding the eight values and dividing the result by eight; adding only the 5 most significant bits of these numbers yields nearly the same eight-bit result. By doing so we reduce the size of the adders with minimal loss of precision. As mentioned, a range and the domains that belong to its class must be accessed. While the parallel processing unit is loading a block, a table search is performed to find the location of the next domain block to be fetched. Figure 12 shows two tables. Part (a) of Figure 12 presents the table for the domain blocks. This table is addressed by the class number; each class entry addresses at most eight domains along with their symmetry types. The second table is shown in Figure 12(b). Ranges listed in this table are indexed by range number. Associated with each range is an eight-bit identifier: 5 bits for its class and 3 bits for one of the eight possible symmetries. The search for a block with a specific class number is performed only on the 5-bit class identifier.
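The 5-MSB mean trick can be checked with a few lines of arithmetic; the helper names are illustrative:

```python
def approx_mean8(pixels):
    """Mean of eight 8-bit pixels approximated with their 5 MSBs only:
    dropping the 3 LSBs (p >> 3) folds the divide-by-8 into the
    truncation, so eight 5-bit addends produce an 8-bit result and the
    adders shrink accordingly."""
    assert len(pixels) == 8
    return sum(p >> 3 for p in pixels)

def exact_mean8(pixels):
    """Exact integer mean, for comparison."""
    return sum(pixels) // 8
```

The truncation under-estimates each pixel by at most 7/8, so the approximate mean is never more than 7 below the exact one.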
Fig. 12. Memory address and content for (a) domains classification table, (b) range classification table.
3.3 The parallel processing unit
The proposed architecture is built around the flowchart of Fig. 6. After all of the ranges and domains are classified, the hardware starts with class #1: all of the domains in class #1 are loaded into the appropriate caches. The maximum number of domains that are loaded is eight. These cache units are shown in Fig. 13 as 8×8-byte memory banks. While the domain pixels are being read and loaded into each bank, a dedicated accumulation unit, called the sigma generator unit, computes Σai and Σai², and these values are latched for each domain. There may be a number of ranges in class #1. When the loading of the domains is finished, the first range of class #1 is accessed pixel by pixel. Every accessed pixel of this range goes to the sigma generator unit to produce Σbi and Σbi², which are latched too. Another quantity that needs computation is Σaibi, the sum of the products of every range pixel and its corresponding domain pixel. Since this value has to be computed for all of the domains that are being compared with the range, parallel processing can be applied: as every pixel of the range is accessed, it goes into eight parallel multiplication units, as shown in Fig. 13. The other operand of each multiplier comes from one of the eight domains, and the outputs of the multipliers are accumulated to produce Σaibi. Eventually, all the required quantities Σai, Σai², Σbi, Σbi² and Σaibi are latched. The next step is to calculate o, s and the RMS error. These values are computed for one range and eight domains. Each set of data is loaded from the latches into the sequential processor through a multiplexer. The first computed values of o, s and RMS error are stored in a register; each subsequent set replaces the current values if its computed RMS error is lower than the existing one. This guarantees that the minimum RMS error is found.
Then o, s and the coordinates of the domain that produced this minimum RMS error are stored. While o, s and the RMS error of the first range of class #1 are being computed, the second range of class #1 is being loaded into the parallel processing unit, pixel by pixel. There is no need to load any other domains; the comparison is done between this new range and the existing domains. Therefore, there is no need to recompute Σai and Σai²; the hardware needs only to compute Σbi, Σbi² and Σaibi. This process continues until all of the ranges in class #1 are finished. The process then moves to the next class, and continues until all of the classes are finished.
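The class-by-class control flow described above can be modeled in a few lines of software; the data layout (dicts of id/block pairs) and the function names are assumptions for illustration:

```python
def matching_phase(ranges_by_class, domains_by_class, error_fn):
    """Software model of the matching phase of Figure 6: the (at most 8)
    domains of a class are loaded once, then every range of that class is
    compared against all of them and the best match is kept. Blocks are
    assumed already rotated to a common symmetry, as the hardware does on
    load. Returns {range_id: best_domain_id}."""
    best = {}
    for cls, ranges in ranges_by_class.items():
        # the hardware cache holds at most eight domains per class
        domains = domains_by_class.get(cls, [])[:8]
        for r_id, rng in ranges:
            candidates = [(error_fn(dom, rng), d_id) for d_id, dom in domains]
            if candidates:
                best[r_id] = min(candidates)[1]
    return best
```

Because the domains stay resident while all ranges of a class stream through, the domain-side sums need only be computed once per class, exactly as in the hardware.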
Fig. 13. Structure of the parallel processing unit and its connection with the sequential processor.
After all class numbers are processed, every range block has a domain block that approximates it once the appropriate symmetry type, offset and contrast parameters are applied. The symmetry type is extracted from the classifier results, while the contrast and offset are obtained from the parallel processing unit for the best matched domain block. The symmetry type, offset and contrast, together with the location of the best matched domain, are stored in memory for each range block. These parameters comprise the coded fractal image. In order to simplify the hardware design, we decided to reduce the size of the multipliers. The operands to be multiplied by the parallel processor are eight-bit numbers. These operands are converted to a floating-point-like format with a 4-bit mantissa and a 3-bit exponent. The four most significant bits starting at the leading 1 of an operand form the mantissa, and the position of these four bits determines the 3-bit exponent. For example, the operands 01111010 and 10011000 are treated as 15×2³ and 9×2⁴, respectively. The position of the most significant 1 is determined by a priority encoder, and a shifter circuit built from a group of multiplexers generates the four-bit mantissa.
After the operands are converted, they are multiplied by a 4×4 multiplier. The output of the multiplier is then shifted to compensate for the pre-shifting of the operands. It should be noted that the computation of o, s and the RMS error involves an eventual division by the number of pixels in a range block. This means that even with precise arithmetic in the multiplication, the results are eventually divided by a number such as 64, which is equivalent to a 6-bit right shift. Implementation of an 8-bit multiplier on an FPGA requires in excess of 1 Kgate, whereas the proposed multiplier circuit consumes only 0.1 Kgates. Furthermore, while the unpipelined 8×8 multiplier limits the clock speed to 60 MHz, the proposed structure extends the maximum clock frequency to 100 MHz.
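The operand conversion and reduced multiplier can be modeled as below; the paper's worked example (01111010 → 15×2³, 10011000 → 9×2⁴) is reproduced by this sketch:

```python
def to_float4(x):
    """Split an 8-bit operand into a 4-bit mantissa and a 3-bit exponent:
    the four bits starting at the most significant 1 form the mantissa
    (the job of the priority encoder and multiplexer shifter)."""
    if x < 16:               # leading 1 already within the low 4 bits
        return x, 0
    e = x.bit_length() - 4   # how far the 4-bit window is shifted up
    return x >> e, e

def approx_mul(x, y):
    """Approximate 8x8 multiply using only a 4x4 multiplier plus a
    post-shift, as in the proposed hardware."""
    mx, ex = to_float4(x)
    my, ey = to_float4(y)
    return (mx * my) << (ex + ey)
```

For the paper's example, approx_mul(0b01111010, 0b10011000) gives 15·9·2⁷ = 17280 against the exact product 18544, an error of about 7%.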
Figure 14. Architecture of the fast multiplier.
4 Implementation Results
The proposed architecture was described in VHDL in a structural format and simulated. The design was then implemented on a Xilinx Virtex FPGA using the Xilinx ISE 6 package. The implemented chip had an eight-bit data bus and a 17-bit address bus connected to a memory chip containing a 256×256 grayscale image, with each pixel 8 bits wide. The coded data are also stored in this memory. The chip was designed to run at a frequency of 100 MHz, and the access time of the static RAM was compatible with the clock frequency of the FPGA. The fractal compression completes in about 3 IACs (image access cycles), which equals 3×256×256×10 ns ≈ 0.002 s. This means that about 400 images can be compressed per second. Since the memory access rate is relatively low (3 IACs per image) compared to other methods, the power consumption is expected to be low, too. Several standard grayscale images were tested, and the coded results were fed to a software decoder to measure the error. Power consumption was measured using the XPower utility of the Xilinx package. Processing delay, PSNR and power consumption are presented in Table 1.
Table 1. Various measured parameters for the proposed architecture using different images.

Image      PSNR   Power    Compression ratio   Coding time
Lenna      25.3   240 µW   26                  8.3 ms
Fly_boat   25.5   210 µW   26                  8.3 ms
Coin       25.6   220 µW   26                  8.3 ms
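The PSNR values in Table 1 follow the usual definition for 8-bit images. A minimal sketch of that metric, assuming the images are given as flat lists of pixel values (an illustrative helper, not the paper's decoder):

```python
import math

def psnr(orig, recon, peak=255):
    """PSNR in dB between two equal-size 8-bit grayscale images,
    each given as a flat sequence of pixel values."""
    mse = sum((o - r) ** 2 for o, r in zip(orig, recon)) / len(orig)
    return float("inf") if mse == 0 else 10 * math.log10(peak ** 2 / mse)
```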
The summary and specifications of the proposed architecture and implemented design are reported in Table 2.

Table 2. Design specifications.

Simulation environment:      Xilinx ISE 6.4i
Xilinx chip:                 Virtex II (XC2V250)
Gate count:                  90 KGate
Chip frequency:              100 MHz
Compression frame rate:      400 frames/s
Input image:                 256×256 grayscale
Average power consumption:   250 µW per image
Compression ratio:           26
Average PSNR:                25.5
Different compression results obtained from the basic full search architecture and the proposed structure are shown in Figure 15.
Figure 15. Comparison between (a) the original images, (b) full search fractal coding, and (c) the proposed coder.
5 Conclusions
Fractal coding, a very powerful means of image compression, has the shortcoming of being extremely slow. The main bottleneck in fractal coding is the search process. In this paper a new classification scheme was proposed that speeds up the search process and was designed to be hardware realizable. A scheme was also devised to store the classified domain and range data so that the information can be easily retrieved by the hardware. Another aspect of the proposed algorithm and its hardware implementation was the minimization of image accesses. This minimization helped us boost the coding speed as well as reduce the power consumption. Once a group of domain blocks is brought into the hardware, they are compared with as many range blocks as possible. Furthermore, parallel operations
were performed whenever possible. Testing with a variety of standard images verified the correct operation of the hardware. While high compression ratios were attained, the quality of the decoded images was comparable to that of software implementations of the fractal algorithm.
References
[1] A. E. Jacquin, "Image coding based on a fractal theory of iterated contractive image transformations," IEEE Trans. Image Processing, Vol. 1, pp. 18-30, 1992.
[2] W. O. Cochran, J. C. Hart and P. J. Flynn, "Fractal volume compression," IEEE Trans. on Visualization and Computer Graphics, Vol. 2, No. 4, pp. 313-322, 1996.
[3] L. Bocchi, G. Coppini, J. Nori and G. Valli, "Detection of single and clustered microcalcifications in mammograms using fractals models and neural networks," Medical Engineering & Physics, Vol. 26, No. 4, pp. 303-312, May 2004.
[4] R. Ni, Q. Ruan and H. D. Cheng, "Secure semi-blind watermarking based on iteration mapping and image features," Pattern Recognition, Vol. 38, No. 3, pp. 357-368, March 2005.
[5] A. M. Ramirez, A. D. Sanchez, M. L. Aranda and J. Vega-Pineda, "Simple and fast fractal image compression for VLSI circuits," Proc. of the 3rd Intern. Symp. on Image and Signal Processing and Analysis, pp. 112-116, 2003.
[6] K. P. Acken, M. J. Irwin and R. M. Owens, "A parallel ASIC architecture for efficient fractal image coding," Journal of VLSI Signal Processing, Vol. 19, No. 1, pp. 97-113, 1998.
[7] M. S. Wu, J. H. Jeng and J. G. Hsieh, "Schema genetic algorithm for fractal image compression," Engineering Applications of Artificial Intelligence, Vol. 20, pp. 531-538, 2007.
[8] Y. Zheng, G. Liu and X. Niu, "An improved fractal image compression approach by using iterated function system and genetic algorithm," Computers & Mathematics with Applications, Vol. 51, No. 11, pp. 1727-1740, 2006.
[9] M. S. Wu, W. C. Teng, J. H. Jeng and J. G. Hsieh, "Spatial correlation genetic algorithm for fractal image compression," Chaos, Solitons & Fractals, Vol. 28, No. 2, pp. 497-510, 2006.
[10] Y. M. Zhou, C. Zhang and Z. K. Zhang, "Fast hybrid fractal image compression using an image feature and neural network," Chaos, Solitons & Fractals, 2006.
[11] J. Li and C.-C. Jay Kuo, "Image compression with a hybrid wavelet-fractal coder," IEEE Trans. Image Processing, Vol. 8, No. 6, pp. 868-873, 1999.
[12] D. J. Duh, J. H. Jeng and S. Y. Chen, "DCT based simple classification scheme for fractal image compression," Image and Vision Computing, Vol. 23, No. 13, pp. 1115-1121, 2005.
[13] Y. G. Wu, M. Z. Huang and Y. L. Wen, "Fractal image compression with variance and mean," Proc. of the IEEE International Conference on Multimedia and Expo, Vol. 1, pp. 353-356, July 2003.
[14] H. Yamauchi, Y. Takeuchi and M. Imai, "VLSI implementation of fractal image compression processor for moving pictures," Proc. of the IEEE Euromicro Conference, pp. 400-409, Sept. 2001.
[15] C. S. Tong and M. Pi, "Fast fractal image encoding based on adaptive search," IEEE Transactions on Image Processing, Vol. 10, pp. 1269-1277, Sept. 2001.
[16] R. Hamzaoui, D. Saupe and M. Hiller, "Fast code enhancement with local search for fractal image compression," Proc. of the IEEE International Conference on Image Processing, Vol. 2, pp. 156-159, Sept. 2000.
[17] S. K. Bhunia, S. K. Ghosh, P. Kumar and P. P. Das, "Design, simulation and synthesis of an ASIC for fractal image compression," Proc. of the 12th Intern. Conf. on VLSI Design, pp. 544-548, 1999.
[18] H. J. Liang and S. S. Wang, "Architectural design of fractal image coder based on kick-out condition," IEEE International Symposium on Circuits and Systems (ISCAS 2005), pp. 1118-1121, 2005.
[19] K. P. Acken, M. J. Irwin and R. M. Owens, "A parallel ASIC architecture for efficient fractal image coding," Journal of VLSI Signal Processing, Vol. 19, No. 1, pp. 97-113, 1998.
[20] S. Lee and H. Aso, "A parallel architecture for high speed image coding," Proc. of the International Symposium on Parallel Architectures, Algorithms and Networks, 1999.
[21] N. Rowshanbin, S. Samavi and S. Shirani, "Acceleration of fractal image compression using characteristic vector classification," Proc. of the IEEE CCECE, pp. 2026-2029, Canada, May 2006.