PARALLEL JPEG2000 ENCODING ON A BEOWULF CLUSTER *
António Serra, Luís Dias, Carlos Serrão, Miguel Dias and Paulo Trezentos ADETTI/ISCTE - Associação para o Desenvolvimento das Telecomunicações e Técnicas de Informática Ed. ISCTE, Av. das Forças Armadas, 1600-082 Lisboa, Portugal
ABSTRACT
This paper describes a methodology and computing architecture for the parallelization of an open source JPEG2000 (or J2K) digital image encoder, applicable to Beowulf-type clusters. It starts with an overview of the technical architecture chosen for the J2K image encoding cluster and justifies the choices that were made. The encoding algorithm is then presented, both in its original standalone version and in its parallel version, together with the modifications made to the original version to improve its performance on a high performance computing platform such as the Beowulf cluster. Several tests were performed on the encoder, and the results were assessed with standard evaluation metrics for parallel algorithms. Those results are presented and discussed, showing the appropriateness of our approach. The major focus of this paper is on the High Performance Computing aspects of the solution, reinforcing the idea that high-performance solutions may be obtained with relatively low cost hardware architectures. Finally, some conclusions are drawn and directions for future work are given.
KEYWORDS
Image processing and analysis, JPEG2000, parallel algorithms, Beowulf cluster.
1. INTRODUCTION
JPEG2000 (J2K) defines a new standard for coding digital images, with proven characteristics and flexibility superior to those attained with existing standards such as JPEG [1, 2]. Its high compression capabilities (as a rule of thumb, the quality/compression ratio has improved by about 25% over the previous JPEG) and good image quality performance (J2K supports both lossless and lossy compression) make it suitable for a number of different applications, ranging from web browsing to pictorial database storage, as well as remote sensing or medical imaging, where large amounts of visual data may be needed without loss of quality. Some of the outstanding characteristics of the standard, namely those regarding interactive and flexible access to all or part of a J2K image via a bit-stream oriented protocol established between the digital image server and a visualization/browsing client, are the following [2]:
• Progressive lossy to lossless transmission of the bit-stream and build-up of the image.
• Progressive transmission of the bit-stream, by:
  • Pixel accuracy or quality (controlled by signal to noise ratio and rate distortion metrics)
  • Pixel spatial resolution
• Independent transmission of each of the multi-components of each pixel
• Independent access to blocks or tiles that divide the image
• Random access to regions of interest (ROIs) defined in the image, which can be decompressed with less distortion than the rest of the image
This new standard is starting to be taken up by the market, with plug-in versions already available for popular image processing tools [4] and Internet browsers [5]. This class of algorithms is now also available to a wider medical system user base, following the approval of JPEG2000 as an accepted image compression option by the DICOM Working Group 4 (compression group) in November 2001¹.
Encoding raw visual data into the JPEG2000 format is a CPU-intensive task [6]. The time required for the encoding process can increase drastically when the volume of data reaches gigabyte dimensions. In this case, increased computing power is needed to encode images efficiently. Some available JPEG2000 encoders already exhibit some kind of parallelism, such as JJ2000 (http://jj2000.epfl.ch), which launches threads on a single processor to perform the arithmetic encoding process. Others, like Kakadu (http://www.kakadusoftware.com), are already extremely optimized and could be used as a basis for further performance improvements through parallelization.
The parallelisation work reported in this paper has focused on the code block encoding, which is one of the basic operations of the JPEG2000 standard encoding process and the most time-consuming one [6]. Although minimising the overall compression time was not the initial main goal of the presented work (we were more interested in maximising the relative speedup metric [7]), this elapsed time was also positively affected by the parallelisation of the arithmetic encoder algorithm, as will be shown. Another JPEG2000 encoding block that could have been processed in parallel is the wavelet transform [6]. This was not addressed in the current version of the work, since we have focused on implementing the parallel algorithm on a Beowulf cluster, an architecture well suited to problems such as the parallel implementation of the code block encoding, and not on a shared memory architecture, which is more appropriate for the parallel implementation of the wavelet transform block.
* Partially funded by the European Commission, project IST 28646, PRIAM
IADIS International Conference Applied Computing 2004
2. IMAGE ENCODING CLUSTER ARCHITECTURE
The JPEG2000 image encoding architecture is based on a Beowulf cluster [7] composed of four dual-processor SMP [10] nodes (eight processors in total). The nodes have the following hardware configuration:
• one dual Pentium III, 1133 MHz;
• one dual Athlon MP, 1333 MHz;
• two dual Xeon, 2400 MHz.
All the nodes are equipped with 1024 MB of RAM. A 100 Mbps Fast Ethernet network supports the communications between nodes. All nodes run the SuSE Linux 8.0 operating system and use PVM version 3.4.3 [8]. The adopted architecture is known as an SMP-based Beowulf cluster. The use of SMP nodes instead of single processor nodes is justified since they provide a better price/performance ratio when compared with other market solutions.
3. SEQUENTIAL JPEG2000 ENCODER
The original JPEG2000 encoder (known as J2000) used as the basis of this work is the open-source code developed within the scope of the IST PRIAM RTD project (IST-28646) [12]. This non-parallel encoder has an architecture divided into modules, as depicted below (Figure 1). Most of the computation time is spent in the T1 module, which deals with the code block encoding [6]. A more detailed description of the J2000 modules and source code can be found in [11]. The T1 module is called after the tile division of the original image, the multiple component transform and the discrete wavelet transform. Its main function is to compute the code block division for each of the components, resolutions and bands [13, 15]. When this processing ends, all the code blocks have been encoded. Since this process is performed on just one processor, it can become quite slow when large images are encoded.
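The code block division performed by T1 can be illustrated with a small sketch that splits one subband into a grid of code blocks. This is a simplification and not the J2000 source: the function name, the 64 x 64 nominal block size and the assumption that the subband starts at coordinate (0, 0) are ours.

```python
from typing import List, Tuple

Block = Tuple[int, int, int, int]  # (x0, y0, x1, y1), with x1 and y1 exclusive


def split_into_code_blocks(width: int, height: int,
                           cb_w: int = 64, cb_h: int = 64) -> List[Block]:
    """Partition a width x height subband into code blocks in raster order.

    Hypothetical helper: it assumes the subband origin is (0, 0), so the
    code block grid is aligned with the top-left corner of the subband.
    Blocks on the right and bottom edges may be smaller than nominal.
    """
    if width < 0 or height < 0 or cb_w <= 0 or cb_h <= 0:
        raise ValueError("invalid subband or code block dimensions")
    blocks = []
    for y0 in range(0, height, cb_h):
        for x0 in range(0, width, cb_w):
            blocks.append((x0, y0, min(x0 + cb_w, width), min(y0 + cb_h, height)))
    return blocks
```

For a 150 x 100 subband this yields a 3 x 2 grid of six blocks, the last of which covers only the 22 x 36 samples left at the bottom-right corner. In the encoder this division is repeated for every component, resolution and band, which is why the number of independent work items grows quickly with image size.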
1 Publication of Digital Imaging and Communications in Medicine (DICOM) Supplement 61.
[Figure 1: block diagram of the J2000 encoder modules J2K, TCD, MCT, DWT, T2, T1, MQC, PI and TGT.]
Figure 1. J2000 original architecture.
The following table (Table 1) briefly describes the role of each module in the J2000 encoder software. The modules were named according to the JPEG 2000 standard [15].
Table 1. J2000 Encoding Modules Description.
Module   Description
J2K      This is the API library. Encodes and decodes JPEG2000 codestreams.
TCD      Tile encoder/decoder. Image tile encoding and decoding.
MCT      Multiple component transform. Applies color transformation if required.
DWT      Discrete wavelet transform. Applies the discrete wavelet transform both in lossless and lossy modes.
T1       Tier 1 encoder/decoder. Code block encoding: entropy encoding and quantization.
MQC      MQ coder. JPEG2000 arithmetic encoder.
T2       Tier 2 coder/decoder. Rate allocation and packet composition.
PI       Packet iterator. To parse the packets.
TGT      Tag tree coder/decoder. Builds the encoding and decoding tag trees.
4. PARALLEL JPEG2000 ENCODER
The architecture of the parallel encoder is based on the original encoder architecture (Figure 1). The T1 module has been modified to handle data distribution and data gathering. These changes are described in depth in the next section. From the T1 module emerged a new module called T1P. This new module deals only with the code block encoding, the most time-consuming step of all (Figure 2), which is why it was selected for parallelization [13]. In fact, Taubman raises the idea that the EBCOT algorithm (which is the basis of JPEG2000 encoding) "introduces the possibility of highly parallel implementations where multiple codeblocks are encoded or decoded simultaneously" [13]. The T1P module (the parallel version of the T1 module) consists of a parent task and several child tasks running on different processors, which are described below. This problem is classified as embarrassingly parallel [14], because the structure of the problem clearly shows where parallelization can be applied (which simplifies the parallelization task) and because there is no communication between the child tasks, which run independently of each other. Simple and effective load balancing is achieved by a "pool of child tasks" scheme: the parent task distributes the domain data to be processed to the first available idle child task in the pool.
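The effect of the "pool of child tasks" scheme can be seen in a small simulation. The sketch below is not taken from the PJ2000 code: the function name and the per-batch costs are hypothetical, and communication time is ignored. It only models the rule that each batch goes to whichever child becomes idle first.

```python
import heapq
from typing import List, Tuple


def simulate_pool(batch_costs: List[float], n_children: int) -> Tuple[float, List[int]]:
    """Simulate the parent handing each batch to the first idle child.

    batch_costs: hypothetical encoding time of each batch, in seconds.
    Returns (makespan, owner), where owner[i] is the child that ran batch i.
    Communication time is ignored in this sketch.
    """
    if n_children < 1:
        raise ValueError("need at least one child task")
    # Min-heap of (time at which the child becomes idle, child id).
    idle_at = [(0.0, child) for child in range(n_children)]
    heapq.heapify(idle_at)
    owner = []
    makespan = 0.0
    for cost in batch_costs:
        free_time, child = heapq.heappop(idle_at)  # earliest idle child
        finish = free_time + cost
        owner.append(child)
        makespan = max(makespan, finish)
        heapq.heappush(idle_at, (finish, child))
    return makespan, owner
```

With batch costs of 4, 1, 1, 1 and 1 time units and two children, the pool finishes after 4 units: one child takes the expensive batch while the other works through the four cheap ones. A fixed alternating assignment of the same batches would need 6 units, because the child holding the expensive batch would also receive two of the cheap ones.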
4.1 Parent Task
The parent task deals with all aspects of the image encoding except the code block encoding, which, as previously mentioned, is parallelized. After all transforms have been applied and the tile division is complete, the next step is the code block division and the delivery of code blocks to the child tasks for encoding. Domain data, that is, data obtained by further partitioning the image tiles, is equally distributed by the parent task among the child tasks, which then perform the code block encoding. The JPEG2000 standard [15] defines how the image tiles should be divided into code blocks before they are passed to the arithmetic encoder. To reduce the number of messages from the parent task to the child tasks, a group of 32 code blocks is sent to each child task for arithmetic encoding. After the code block encoding finishes, the data must be gathered again so that the encoding process can continue on the parent task side. Finally, the JPEG2000 code stream is written and the image is encoded.
[Figure 2: chart titled "% of time spent in each JPEG2000 block"; vertical axis from 0% to 100%; horizontal axis "Image size in KBytes" with entries 30395, 9120, 37652 and 35575; legend: Tier 2 Encoding, Rate Allocation, Arithmetic Entropy Coder, Discrete Wavelet Transform, Multiple Component Transform, Image to tile conversion.]
Figure 2. Percentage of CPU time spent in each JPEG2000 block in sequential encoding.
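The grouping of code blocks into messages of 32, and the gathering step on the parent side, can be sketched as follows. The helper names are hypothetical and the code blocks are treated as opaque items; only the group size of 32 comes from the text.

```python
from typing import Dict, Iterator, List, Sequence, Tuple, TypeVar

T = TypeVar("T")

GROUP_SIZE = 32  # code blocks per message, as stated in Section 4.1


def make_groups(code_blocks: Sequence[T],
                group_size: int = GROUP_SIZE) -> Iterator[Tuple[int, List[T]]]:
    """Yield (group_index, blocks) pairs; the last group may be shorter."""
    if group_size <= 0:
        raise ValueError("group_size must be positive")
    for start in range(0, len(code_blocks), group_size):
        yield start // group_size, list(code_blocks[start:start + group_size])


def gather(results: Dict[int, list], n_groups: int) -> list:
    """Reassemble per-group results in the original code block order,
    regardless of the order in which the groups were returned."""
    ordered = []
    for index in range(n_groups):
        ordered.extend(results[index])
    return ordered
```

Carrying the group index with each message is one simple way to let the parent restore the original order when children finish at different times. With 100 code blocks, for example, this produces four messages instead of 100, the last one holding the remaining 4 blocks.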
4.2 Child Tasks
The child tasks are launched by the parent task using the PVM call "pvm_spawn" [8]. Each child task waits for messages from the parent task carrying image tile partitioning data. Each message carries a tag that identifies it and differentiates data messages from control messages. Once it has processed the data, each child task returns a message to the parent task containing the result (the code block encoding of a certain partition of an image tile) and then returns to its waiting state, where it remains until more data to be processed arrives from the parent task.
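The wait, process and reply cycle of a child task can be sketched with standard-library threads and queues standing in for PVM tasks and messages (where PVM code would use calls such as pvm_send and pvm_recv with a message tag). Everything named here is an assumption made for illustration: the tag values, the function names and the use of a shared inbox are not taken from the PJ2000 source.

```python
import queue
import threading

TAG_DATA = 1    # hypothetical tag values, not taken from the PJ2000 source
TAG_RESULT = 2
TAG_STOP = 3


def child_task(inbox: queue.Queue, outbox: queue.Queue, encode) -> None:
    """Wait for tagged messages; encode data messages, exit on a stop message."""
    while True:
        tag, payload = inbox.get()      # blocks until a message arrives
        if tag == TAG_STOP:             # control message: leave the loop
            break
        if tag == TAG_DATA:             # data message: (group_index, blocks)
            index, blocks = payload
            outbox.put((TAG_RESULT, (index, [encode(b) for b in blocks])))


def run_parent(groups, n_children, encode):
    """Start the children, hand out groups, gather results, stop the children."""
    inbox, outbox = queue.Queue(), queue.Queue()
    children = [threading.Thread(target=child_task, args=(inbox, outbox, encode))
                for _ in range(n_children)]
    for child in children:
        child.start()
    for index, blocks in enumerate(groups):
        inbox.put((TAG_DATA, (index, blocks)))
    results = {}
    for _ in range(len(groups)):
        _tag, (index, encoded) = outbox.get()
        results[index] = encoded
    for _ in children:
        inbox.put((TAG_STOP, None))     # one stop message per child
    for child in children:
        child.join()
    return [results[i] for i in range(len(groups))]
```

Because all children read from the same inbox, whichever child is idle picks up the next group, which mirrors the pool behaviour described in Section 4. The stop tag plays the role of a control message: it is the only way a child leaves its waiting state.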
5. PERFORMANCE ANALYSIS AND TESTS
The following sections describe the tests performed on the parallel JPEG2000 encoder, present the results obtained, and compare them with those of the non-parallel version.
5.1 Tests Definition
The parallel encoder tests were performed, as mentioned, on an SMP-based Beowulf cluster. The hardware details of the cluster were presented in Section 2. To provide a basis for comparison, tests were also run with the non-parallel version of the JPEG2000 encoder on a single node (using only one processor of one of the fastest SMP machines). The tests used images in the PPM format with different sizes and characteristics (see Table 2).
Table 2. Test images and sizes.
Image       Width x Height x Colors        File Size in KB
pante_o     3 990 x 3 221 x 24 bpp                  37 625
inst2       5 804 x 3 809 x 24 bpp                  64 768
infoterra   6 000 x 6 000 x 24 bpp                 105 469
senegal     24 200 x 12 700 x 24 bpp               900 411
toulouse    24 000 x 24 000 x 24 bpp             1 687 501
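The file sizes in Table 2 can be related to the image dimensions with a short calculation. The sketch below assumes uncompressed pixel data at 3 bytes per pixel (24 bpp), 1 KB = 1024 bytes, and a negligible file header; these assumptions and the helper name are ours.

```python
# (width, height, file size in KB as listed in Table 2)
TABLE2 = {
    "pante_o":   (3990, 3221, 37625),
    "inst2":     (5804, 3809, 64768),
    "infoterra": (6000, 6000, 105469),
    "senegal":   (24200, 12700, 900411),
    "toulouse":  (24000, 24000, 1687501),
}

BYTES_PER_PIXEL = 3  # 24 bits per pixel


def raw_size_kb(width: int, height: int,
                bytes_per_pixel: int = BYTES_PER_PIXEL) -> float:
    """Size of the uncompressed pixel data in KB (1 KB = 1024 bytes),
    excluding any file header."""
    return width * height * bytes_per_pixel / 1024


for name, (w, h, listed) in TABLE2.items():
    print(f"{name:10s} computed {raw_size_kb(w, h):12.1f} KB, listed {listed} KB")
```

Under these assumptions the estimate agrees with the listed size to within about 1 KB for inst2, infoterra, senegal and toulouse. For pante_o the estimate is close to 37 652 KB, whereas Table 2 lists 37 625.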
5.2 Theoretical Versus Practical Speedup
When discussing performance speedups, theoretical assumptions must be taken into account. They make it possible to evaluate the maximum theoretical speedup that can be obtained and to check it against our own measured results [6]. Amdahl's law [9] gives an upper bound on the achievable parallel speedup, assuming that perfect parallelism can be obtained for the concurrent sections of the code. It can be expressed as:
$$\text{speedup} = \frac{s + p}{s + p/N} \qquad (1)$$
where s is the execution time of the sequential part of the algorithm on a single processor, p is the execution time of the parallelizable part on a single processor, and N is the number of available processors. The execution time of a parallel program is the elapsed time from when the first processor starts executing on the problem to when the last processor completes execution. During execution, each processor is computing, communicating or idling. Hence the total execution time of the parallel program can be thought of as the sum of the computing, communicating and idling times over all the processors, divided by the number of processors.
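To get a feel for the bound, the sketch below evaluates Amdahl's law in its usual form, speedup = (s + p) / (s + p/N), for a hypothetical split of 10 s of sequential work and 90 s of parallelizable work. These timings are illustrative and are not measurements from this paper.

```python
def amdahl_speedup(s: float, p: float, n: int) -> float:
    """Upper bound on speedup: (s + p) / (s + p / n).

    s: time spent in the sequential part on one processor
    p: time spent in the parallelizable part on one processor
    n: number of processors
    """
    if n < 1 or s < 0 or p < 0 or s + p == 0:
        raise ValueError("need n >= 1 and non-negative times with s + p > 0")
    return (s + p) / (s + p / n)


# Hypothetical split: 10 s sequential, 90 s parallelizable.
for n in (1, 2, 4, 8):
    print(n, round(amdahl_speedup(10.0, 90.0, n), 2))
```

For this split the bound is 1.0, 1.82, 3.08 and 4.71 on 1, 2, 4 and 8 processors, and it can never exceed (s + p) / s = 10 however many processors are added. The smaller the sequential share, the closer the bound gets to N.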
5.3 Test Results
The next table (Table 3) presents the code block encoding times of the parallel encoder and compares them with those of the sequential version.
Table 3. Arithmetic encoding times in seconds.
                          Parallel (#CPUs)
Image       Sequential        2        3        4        5        6        7        8
pante_o          59.70    45.38    31.61    17.09    14.51    13.34    13.02    12.76
inst2            74.24    31.12    27.85    24.50    20.30    19.49    18.68    18.53
infoterra       160.19    61.94    51.61    45.88    38.23    35.87    34.34    33.45
senegal         686.18   474.97   284.57   229.90   194.68   186.91   176.43   178.71
toulouse       2386.14   942.50   816.36   703.88   580.67   551.75   528.06   512.96
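The speedup values plotted for the code block encoding can be recomputed from the timings in Table 3. The sketch below takes speedup as sequential time divided by parallel time and efficiency as speedup divided by the number of CPUs; the data are transcribed from the table, while the function and variable names are ours.

```python
# Times in seconds from Table 3: sequential time, then times for 2..8 CPUs.
TABLE3 = {
    "pante_o":   (59.70,   [45.38, 31.61, 17.09, 14.51, 13.34, 13.02, 12.76]),
    "inst2":     (74.24,   [31.12, 27.85, 24.50, 20.30, 19.49, 18.68, 18.53]),
    "infoterra": (160.19,  [61.94, 51.61, 45.88, 38.23, 35.87, 34.34, 33.45]),
    "senegal":   (686.18,  [474.97, 284.57, 229.90, 194.68, 186.91, 176.43, 178.71]),
    "toulouse":  (2386.14, [942.50, 816.36, 703.88, 580.67, 551.75, 528.06, 512.96]),
}


def speedup(t_seq: float, t_par: float) -> float:
    """Sequential time divided by parallel time."""
    return t_seq / t_par


def efficiency(t_seq: float, t_par: float, n_cpus: int) -> float:
    """Speedup divided by the number of CPUs used."""
    return speedup(t_seq, t_par) / n_cpus


for image, (t_seq, t_pars) in TABLE3.items():
    row = [f"{speedup(t_seq, t):.2f}" for t in t_pars]
    print(f"{image:10s}", " ".join(row))
```

For pante_o on 8 CPUs, for instance, this gives a speedup of about 4.68 and an efficiency of about 0.58.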
The relative speedup obtained by the parallel version as the number of CPUs increases is shown in Figure 3. For the code block encoding, we have observed that, for most images, the achieved speedup increases with the problem dimension (larger images), since the sequential percentage of the problem diminishes (see equation 1). The relative speedup obtained in the overall image compression is shown in Figure 4. Naturally, for the total JPEG2000 encoding, the relative speedup metric has lower values than for the block encoding module alone. In this case, the sequential percentage of the problem is larger, because the remaining modules of the JPEG2000 encoder were not parallelized. This difference can be seen by comparing Figures 3 and 4.
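The lower overall speedup follows from applying Amdahl's law to the whole encoder. As a rough worked example, assume that code block encoding accounts for a fraction f = 0.7 of the sequential encoding time and is accelerated by a factor k = 4.68, while everything else runs unchanged. Both values are hypothetical and chosen only for illustration; they are not measurements reported in this paper.

```latex
S_{\text{total}} = \frac{1}{(1 - f) + f/k}
                 = \frac{1}{0.3 + 0.7/4.68}
                 \approx \frac{1}{0.4496}
                 \approx 2.22
```

Under these assumptions, the unparallelized 30% of the work more than halves the speedup seen by the user, even though the code block encoder itself runs almost five times faster.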
[Figure 3: chart titled "Arithmetic Encoder"; vertical axis "Speedup" from 0 to 6; horizontal axis "#CPUs" from 2 to 8 CPUs; one series per image: inst2, pante_o, infoterra, senegal, toulouse.]
Figure 3. Relative speedup of the code block encoding process.
[Figure 4: chart titled "Image Encoding"; vertical axis "Relative Speedup" from 0 to 3; horizontal axis "#CPUs" from 2 to 8 CPUs; one series per image: inst2, pante_o, infoterra, senegal, toulouse.]
Figure 4. Relative speedup of the full JPEG2000 image encoding process.
6. CONCLUSION
The parallelization of a sequential algorithm (such as the code block encoding of JPEG2000) within a High Performance Computing solution (such as a Beowulf cluster architecture) always represents an important trade-off between the processing time of each of the computing nodes and the communication overhead. After studying the JPEG2000 encoding technique and the available system [11], it was possible to identify the code block level as one of the most promising points for parallelization, and that was therefore the strategy followed in the implementation. The performance tests have shown that the time needed to encode an image with the JPEG2000 technique can be significantly reduced with a parallel encoder. The achieved speedup is limited by an additional overhead, which can be explained by the communication between parent and child tasks. Also, the fact that the encoding is performed tile by tile introduces more waiting time for the parent task, which has to wait for results from the child tasks before it is able to move on to the next tile. From the tests performed on megabyte- and gigabyte-sized images, it is possible to conclude that this parallel version of the encoder offers significant performance gains over the non-parallel version. However, some optimizations should be made to the sequential version of the J2000 encoder to allow better performance in the parallel version. The result of this work will be available online as open source and will be referred to as PJ2000 (Parallel JPEG 2000 encoder).
REFERENCES
[1] Charilaos Christopoulos, Athanassios Skodras and Touradj Ebrahimi, "The JPEG2000 Still Image Coding System: An Overview", IEEE Transactions on Consumer Electronics, Vol. 46, No. 4, pp. 1103-1127, November 2000.
[2] N. Skodras, C. A. Christopoulos, and T. Ebrahimi, "JPEG2000: The Upcoming Still Image Compression Standard", Proceedings of the 11th Portuguese Conference on Pattern Recognition (RECPA00D 20; invited paper), Porto, Portugal, May 11th-12th, pp. 259-366, 2000.
[3] Information Technology - JPEG 2000 Image Coding System: Part IX Interactivity Tools, API's and Protocols (JPIP), ISO/IEC Committee Draft v1.0, ISO/IEC JTC1/SC29/WG1 internal document, March 2003 (ISO JPEG 2000 Part 9, CD).
[4] http://www.adobe.com/products/photoshop/cameraraw.html
[5] http://www.luratech.com/
[6] Peter Meerwald, Roland Norcen, Andreas Uhl, "Parallel JPEG2000 Image Coding on Multi-processors", International Parallel and Distributed Processing Symposium, April 15-19, 2002.
[7] Trezentos, P., "Projecto Vitara (Módulo1): Software de Comunicação entre Processadores - DIPC/LAM/PVM", http://vitara.adetti.iscte.pt, 1999.
[8] Al Geist, Adam Beguelin, Jack Dongarra, Weicheng Jiang, Robert Manchek, Vaidy Sunderam, "PVM: Parallel Virtual Machine – A Users' Guide and Tutorial for Networked Parallel Computing", 1994.
[9] G. Amdahl, "Validity of the Single-Processor Approach to Achieving Large-Scale Computing Capabilities", Proc. AFIPS Conf., 1967, p. 483.
[10] http://www.epcc.ed.ac.uk/HPCinfo/hware-smp.html
[11] http://www.openjpeg.org
[12] Technical Report "IST28646_THC_DR_R_D33", THC, Adetti, Telemis, UCL, October 2002.
[13] David S. Taubman, Michael W. Marcellin, "JPEG2000 Image Compression Fundamentals, Standards and Practice", Kluwer Academic Publishers, 2002.
[14] Geoffrey C. Fox, Roy D. Williams, Paul C. Messina, "Parallel Computing Works", Morgan Kaufmann Publishers, Inc., 1994.
[15] "JPEG 2000 Part 1 FDIS", ISO/IEC JTC 1/SC 29/WG 1, JPEG 2000 Editor Martin Boliek, Coeditors Charilaos Christopoulos and Eric Majani, December 2001.