Efficient JPEG2000 EBCOT Context Modeling for Massively Parallel Architectures

Jiří Matela, Vít Rušňák, Petr Holub
Masaryk University and CESNET z. s. p. o., Botanická 68a, 621 00 Brno, CZ

Abstract
Embedded Block Coding with Optimal Truncation (EBCOT) is the fundamental and computationally very demanding part of the compression process of the JPEG2000 image compression standard. In this paper, we present a reformulation of the context modeling of EBCOT that allows full parallelization on massively parallel architectures such as GPUs with their single instruction multiple threads architecture. We prove that the reformulation is equivalent to the EBCOT specification in the JPEG2000 standard. Behavior of the reformulated algorithm is demonstrated using the NVIDIA CUDA platform and compared to other state-of-the-art implementations.
I. Introduction
JPEG2000 [1] is an image compression standard aimed at providing not only compression performance superior to the current JPEG standard but also advanced capabilities demanded by applications in fields such as medical imaging [2], the film industry [3], or image archiving, both for batch processing and real-time applications. Its advanced features and superb compression performance result in higher computational demands, and thus JPEG2000 is subject to various acceleration approaches on different platforms: many-core CPUs, VLSI, and recently also graphics processing units (GPUs). Attracted by their raw computing power, a number of general-purpose GPU computing approaches have been implemented in recent years, including GLSL (http://www.opengl.org/documentation/glsl/), CUDA, and OpenCL (http://www.khronos.org/opencl/). Because of its flexibility and potential to utilize the power of GPUs, we have opted for CUDA [4], a massively parallel computing architecture designed by NVIDIA.

Modern GPU architectures are designed to run thousands of threads in parallel. In the context of CUDA, threads are grouped into thread blocks, where threads can cooperate using synchronization primitives, shared memory (fast but visible within a thread block only), and global memory (slower but visible to all threads). The common CUDA workflow is to copy data from RAM to the GPU global memory. All GPU threads can process the data directly in global memory, or, preferably, the data is partitioned and fetched into shared memory to provide higher throughput for more complex operations. It is also important that threads within the same warp follow the same execution path; otherwise, thread divergence is introduced and the divergent execution paths are serialized, worsening performance [5, Chapter 6.1]. This computing model is called Single Instruction Multiple Threads (SIMT).
The architecture specifics of GPUs call for algorithms that allow fine-grained parallelization with minimum instruction divergence and good use of the memory hierarchy, so that the available raw computing capacity is utilized effectively. A sufficiently high ratio of arithmetic operations to memory accesses is also needed in order to mask memory latency on the GPU. In our previous work [6], we optimized DWT transformations for GPUs and found that the major performance limitation lies in the EBCOT Tier-1 process, namely in context modeling. The same holds for CPU implementations (e.g., for OpenJPEG running on the testbed described in Section VI, context modeling takes 51% of CPU time, DWT 32%, and arithmetic encoding 10%). The coarse-grained approach to parallelization available in the JPEG2000 standard is not sufficient to utilize the potential of GPUs, as demonstrated by the other GPU implementations described in Section III. Therefore, we have focused on a reformulation of the EBCOT Tier-1 context modeler for fine-grained parallelization on massively parallel architectures in general, one that also has the properties required for efficient execution on GPUs. The resulting bit-parallel algorithm is described in Section IV and its equivalence to the JPEG2000 specification is proved in Section V. Performance of the proposed algorithm, based on its implementation in CUDA, is experimentally evaluated in Section VI. Concluding remarks and directions for future work are given in Section VII.

II. EBCOT Context Modeler in JPEG2000
EBCOT is a two-tiered coder. The input to Tier-1 is DWT-transformed image coefficients partitioned into so-called code-blocks. Each code-block is processed independently in Tier-1 using context modeling and arithmetic coding to form an embedded bit-stream representing the compressed code-block.
The context modeler analyzes the bit structure of the images and collects contextual information (CX), which is passed together with bit values (D) to the MQ-Coder, a context-adaptive binary arithmetic coder. The MQ-Coder codes bit values based on their context information. There are 19 different contexts defined, and for each of them the arithmetic coder maintains and consecutively adapts a probability estimate [7].

The context modeling module processes code-blocks bit-plane by bit-plane, starting from the most significant bit-plane. Each bit-plane is coded in three passes, but each bit is processed in exactly one pass, i.e., each pass scans through the entire bit-plane but processes only some of the bits. The decision whether to process a bit in the current pass is made based on the current state of the bit and the states of its neighbors. Note that the state information changes as the bits are processed; the process is defined sequentially with the prescribed scan pattern. According to the scan pattern, bit-planes of a code-block are divided into so-called stripes, four bits in height and as wide as the code-block. The scan starts from the top-left bit of the first stripe and continues down while a column of four bits is processed. The scan then progressively continues with the other columns of the stripe. After the first stripe is processed, the scan proceeds with the second stripe and continues until all stripes are processed. The scan pattern is illustrated in [8, p. 166]. The three passes are i) Significance Propagation Pass (SPP), ii) Magnitude Refinement Pass (MRP), and iii) Clean-Up Pass (CUP). Each pass encodes a bit using one or more of the following four bit-coding operations defined by the JPEG2000 standard:
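The stripe-oriented scan pattern described above can be sketched as a small generator (an illustrative sketch with our own naming, not code from the paper):

```python
def scan_order(width, height, stripe_height=4):
    """Yield (x, y) positions in the JPEG2000 stripe scan order:
    4-row stripes are visited top to bottom; within a stripe, columns
    are visited left to right, each column scanned top to bottom."""
    for stripe_top in range(0, height, stripe_height):
        for x in range(width):
            for y in range(stripe_top, min(stripe_top + stripe_height, height)):
                yield (x, y)

order = list(scan_order(2, 8))
# the first column of the first stripe comes first:
# (0,0), (0,1), (0,2), (0,3), then (1,0), (1,1), ...
```

The truncated `min(...)` bound handles code-blocks whose height is not a multiple of four, where the last stripe is shorter.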
Zero Coding (ZC), Run-Length Coding (RLC), Magnitude Refinement Coding (MRC), and Sign Coding (SC).

The state information consists of three state variables $\sigma$, $\sigma'$, $\eta$. The $\sigma$ and $\sigma'$ states are shared by all bits of each coefficient, indicating that the coefficient is significant (i.e., the first non-zero bit of the coefficient has already been processed) and that MRC coding has been applied, respectively. The $\eta$ state is not shared; it indicates that the bit has been processed in the SPP on the current bit-plane [8]. A bit is in a so-called preferred neighborhood (PN) if at least one of its 8 adjacent neighbors is significant, i.e., has $\sigma = 1$. All bits having $\sigma = 0$ and lying in the PN are coded in the SPP. The bits of the coefficients that became significant in previous bit-planes are coded in the second pass, the MRP; those bits have $\sigma = 1$ and $\eta = 0$. The rest of the bits in the current bit-plane are processed in the CUP, i.e., all bits having $\sigma = \eta = 0$ after the previous two passes.

III. Related Work
The JPEG2000 standard allows for code-block level parallelism, which is rather coarse-grained and thus enforces computation in global memory on the CUDA platform. Another option is the JPEG2000 causal mode [2, 9, 10], allowing for pass-level parallelism, which is, from a GPU point of view, also rather coarse-grained. The sequential nature of the context modeler requires processing of one code-block by a single thread only, yielding (a) not enough threads to utilize the massively parallel architecture of GPUs and (b) code divergence that introduces a further performance penalty. The code-block level parallelism has been used by CUJ2K [11], an open-source JPEG2000 for CUDA project. A design similar to CUJ2K has been proposed by Datla et al. in [12]. A recent review of image processing algorithm effectiveness in [13] also considers EBCOT Tier-1 hard to parallelize, resulting in only 17% utilization of the GPU.

IV. Parallel Context Modeler
Compared to the coarse-grained parallelism in the JPEG2000 standard, the proposed bit-parallel context modeling architecture allows for independent processing of all bits of a bit-plane as well as independent processing of all bit-planes. Our design bypasses the three coding passes (SPP, MRP, CUP) and the prescribed scan pattern, enabling direct concurrent coding by the bit-coding operations (ZC, MRC, RLC, SC), regardless of JPEG2000 causal mode.

For the purposes of the following explanation, we define a code-block as a two-dimensional sequence of coefficients $\gamma_{x,y}$ ($x = 1..m$, $y = 1..n$), $m$ and $n$ being the horizontal and vertical code-block dimensions, respectively. A binary representation of a coefficient $\gamma$ is a sequence $[\gamma^{P-1}, \gamma^{P-2}, \ldots, \gamma^1, \gamma^0]$, where $P$ is the image bit depth. The bit of the coefficient located at position $(x, y)$ in bit-plane $p$ is thus denoted $\gamma^p_{x,y}$. Bit-planes are processed in order from $P-1$ down to $0$. To bypass the passes and enable direct coding, we introduce two new state variables $\rho^p_{x,y}$ and $\tau^p_{x,y}$ as a replacement for the original states.

Definition 1. The state variable $\rho^p_{x,y}$ indicates that coefficient $\gamma_{x,y}$ became significant in either bit-plane $p$ or one of the previous bit-planes processed prior to bit-plane $p$. Thus $\rho^p_{x,y} = 1 \Leftrightarrow \bigvee_{p'=p}^{P-1} \gamma^{p'}_{x,y} = 1$, otherwise $\rho^p_{x,y} = 0$.
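Definition 1 amounts to a prefix-OR over the higher bit-planes of each coefficient, which can be evaluated independently per coefficient. A minimal illustrative sketch (our own naming, not the paper's code):

```python
def rho(magnitude: int, p: int) -> int:
    """rho^p = 1 iff some bit of the coefficient in planes p..P-1 is 1,
    i.e. the coefficient magnitude has a set bit at position >= p."""
    return 1 if (magnitude >> p) != 0 else 0

# a coefficient with magnitude 0b00110 becomes significant in plane 2
assert [rho(0b00110, p) for p in range(5)] == [1, 1, 1, 0, 0]
```

Because the test depends only on the coefficient's own magnitude, all $\rho$ values of a bit-plane can be computed concurrently, one thread per coefficient.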
Definition 2. We define the following seven surroundings for a location $(x, y)$: $\theta^A_{x,y}$, $\theta^5_{x,y}$, $\theta^4_{x,y}$, $\theta^3_{x,y}$, $\theta^{5'}_{x,y}$, $\theta^{4'}_{x,y}$, $\theta^{3'}_{x,y}$.

[Figure: each surrounding shown as a 3×3 stencil centered at (x, y); black squares belong to the set, white squares do not.]
Each surrounding is defined as a set of coordinate pairs, e.g., $\theta^5_{x,y} = \{(x+1, y-1), (x, y-1), (x-1, y-1), (x-1, y), (x-1, y+1)\}$. Surroundings $\theta^5_{x,y}$, $\theta^4_{x,y}$, $\theta^3_{x,y}$ represent the sets of neighbors of position $(x, y)$ that have already been processed according to the prescribed scan pattern in the serial algorithm; $\theta^A_{x,y}$ is the full 8-neighborhood, and the primed surroundings are the complementary, not-yet-processed neighbor sets ($\theta^{3'}_{x,y} = \theta^A_{x,y} \setminus \theta^5_{x,y}$, $\theta^{4'}_{x,y} = \theta^A_{x,y} \setminus \theta^4_{x,y}$, $\theta^{5'}_{x,y} = \theta^A_{x,y} \setminus \theta^3_{x,y}$).

Definition 3. The state variable $\tau^p_{x,y} = 1$ indicates that $\gamma_{x,y}$ is going to become significant during the SPP in the current bit-plane $p$. The value of $\tau^p_{x,y}$ is for each $\gamma^p_{x,y}$ inductively computed as follows:

1. $\tau^p_{x,y} = 1 \Leftrightarrow \rho^{p+1}_{x,y} = 0 \wedge \gamma^p_{x,y} = 1 \wedge \bigvee_{(i,j)\in\theta^A_{x,y}} (\rho^{p+1}_{i,j} = 1)$, otherwise $\tau^p_{x,y} = 0$.

2. In each further step, $\tau^p_{x,y}$ is set to 1 in one of the following three cases 2.I, 2.II, 2.III. These cases differ in the y-axis position of bit $\gamma^p_{x,y}$ and define the set of surrounding $\tau^p$ values that needs to be considered.

I. When $y \bmod 4 = 0$ (i.e., when processing bits in the first row of a stripe), values of $\tau^p_{i,j}$ with $(i,j) \in \theta^5_{x,y}$ are evaluated. Then
$\rho^{p+1}_{x,y} = 0 \wedge \gamma^p_{x,y} = 1 \wedge \left[ \bigvee_{(i,j)\in\theta^A_{x,y}} (\rho^{p+1}_{i,j} = 1) \vee \bigvee_{(i,j)\in\theta^5_{x,y}} (\tau^p_{i,j} = 1) \right]$
Note that state $\tau^p_{x+1,y-1}$ is evaluated because bit $\gamma^p_{x+1,y-1}$ was processed in the previous stripe in the original algorithm.

II. When $y \bmod 4 \in \{1, 2\}$ (i.e., the second and third row of a stripe), then
$\rho^{p+1}_{x,y} = 0 \wedge \gamma^p_{x,y} = 1 \wedge \left[ \bigvee_{(i,j)\in\theta^A_{x,y}} (\rho^{p+1}_{i,j} = 1) \vee \bigvee_{(i,j)\in\theta^4_{x,y}} (\tau^p_{i,j} = 1) \right]$

III. When $y \bmod 4 = 3$ (i.e., the fourth row of a stripe), then
$\rho^{p+1}_{x,y} = 0 \wedge \gamma^p_{x,y} = 1 \wedge \left[ \bigvee_{(i,j)\in\theta^A_{x,y}} (\rho^{p+1}_{i,j} = 1) \vee \bigvee_{(i,j)\in\theta^3_{x,y}} (\tau^p_{i,j} = 1) \right]$
Note that state $\tau^p_{x-1,y+1}$ is not evaluated because bit $\gamma^p_{x-1,y+1}$ lies in the next stripe and has not been processed yet in the original algorithm.

Note that at most one of $\rho^{p+1}_{x,y}$, $\tau^p_{x,y}$ is equal to 1: if the bit has already become significant in previous bit-planes ($\rho^{p+1}_{x,y} = 1$), it may not become significant in this bit-plane (thus $\tau^p_{x,y} = 0$), and vice versa.

Definition 4. We define an auxiliary state variable $\delta^p_{x,y} = 1$ indicating that a bit $\gamma^p_{x,y}$ is in the preferred neighborhood in the SPP. For a particular bit $\gamma^p_{x,y}$, $\delta^p_{x,y} = \bigvee_{(i,j)\in\theta^A_{x,y}} (\rho^{p+1}_{i,j} = 1) \vee \bigvee_{(i,j)\in\theta_{x,y}} (\tau^p_{i,j} = 1)$, where $\theta_{x,y} \in \{\theta^5_{x,y}, \theta^4_{x,y}, \theta^3_{x,y}\}$ is selected by the y-axis position of the bit $\gamma^p_{x,y}$ as in Def. 3.
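The position-dependent surroundings and the $\delta$ test of Def. 4 can be sketched as follows (an illustrative sketch with our own naming; coordinates outside the code-block are treated as insignificant):

```python
def theta_causal(x, y):
    """Already-processed neighbors of (x, y) under the stripe scan:
    theta^5 for the first stripe row, theta^4 for the two middle rows,
    theta^3 for the last row."""
    r = y % 4
    if r == 0:
        return [(x + 1, y - 1), (x, y - 1), (x - 1, y - 1), (x - 1, y), (x - 1, y + 1)]
    if r in (1, 2):
        return [(x, y - 1), (x - 1, y - 1), (x - 1, y), (x - 1, y + 1)]
    return [(x, y - 1), (x - 1, y - 1), (x - 1, y)]

def theta_all(x, y):
    """Full 8-neighborhood theta^A."""
    return [(x + dx, y + dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1) if dx or dy]

def delta(x, y, rho_next, tau):
    """delta^p = 1 iff some 8-neighbor has rho^{p+1} = 1 or some
    already-processed neighbor has tau^p = 1 (Def. 4)."""
    def get(state, i, j):
        return state[j][i] if 0 <= j < len(state) and 0 <= i < len(state[0]) else 0
    return int(any(get(rho_next, i, j) for i, j in theta_all(x, y))
               or any(get(tau, i, j) for i, j in theta_causal(x, y)))
```

Here `rho_next` and `tau` are row-major 2D arrays holding $\rho^{p+1}$ and $\tau^p$ for the whole bit-plane.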
To be able to code a bit-plane $p$ in parallel, the two new coding states $\rho^{p+1}_{x,y}$ and $\tau^p_{x,y}$ need to be precomputed before the actual coding of the bit-plane. Following Def. 1, values of $\rho^{p+1}_{x,y}$ are computed concurrently for each coefficient $\gamma_{x,y}$ by examining the values of the bits above the current bit $\gamma^p_{x,y}$. Then, using $\rho^{p+1}_{x,y}$ as an input, values of $\tau^p_{x,y}$ are inductively computed in parallel for each bit $\gamma^p_{x,y}$ following Def. 3. Note that step 2 of Def. 3 is repeated until there is no change in $\tau^p$, i.e., until all to-be-significant-in-SPP bits are discovered. Hereby, the significance propagation property of the original algorithm is preserved. Once both $\rho$ and $\tau$ state variables are computed, the coding pass and the coding operations for an arbitrary bit $\gamma^p_{x,y}$ can be decided.

SPP. A bit $\gamma^p_{x,y}$ is coded in the SPP if coefficient $\gamma_{x,y}$ is not significant and the bit is in the preferred neighborhood. Thus the bit is coded in the SPP by the original algorithm if $\sigma_{x,y} = 0 \wedge \bigvee_{(i,j)\in\theta^A_{x,y}} (\sigma_{i,j} = 1)$ right before the bit is coded in the order defined by the scan pattern. Our parallel algorithm codes a bit $\gamma^p_{x,y}$ in the SPP if it is not significant, i.e., $\rho^{p+1}_{x,y} = 0$, and if it is in the preferred neighborhood, i.e., $\delta^p_{x,y} = 1$. All bits belonging to the SPP are encoded using ZC, and non-zero bits are moreover encoded using SC. The equivalence of the information provided by $\sigma_{x,y}$ and $\rho^{p+1}_{x,y}$ prior to coding bit $\gamma^p_{x,y}$ is proved in Lemma 1 in Section V. The equivalence of the information provided by $\sigma$ and $\rho \oplus \tau$ for the whole bit-plane in the SPP is proved in Lemma 2.

MRP. A bit $\gamma^p_{x,y}$ is coded in the MRP if coefficient $\gamma_{x,y}$ became significant in one of the previous bit-planes. Thus the original algorithm codes a bit in the MRP if it is significant, i.e., $\sigma_{x,y} = 1$, and it was not coded in the SPP in this bit-plane, i.e., $\eta_{x,y} = 0$. The parallel algorithm codes a bit in the MRP if $\rho^{p+1}_{x,y} = 1$, which means that the bit became significant in one of the previous bit-planes (Def. 1) and thus was not coded in the SPP in the current bit-plane $p$. All bits in the MRP are coded using MRC. MRC processes a bit $\gamma^p_{x,y}$ based on the $\sigma$ state information of its neighbors and on the value of state $\sigma'_{x,y}$. Instead of $\sigma$, the parallel algorithm uses $\rho \oplus \tau$, which is proved to be equivalent to $\sigma$ for the SPP (Lemma 2); the equivalence also holds for the MRP because there is no change in the $\sigma$ state during the MRP.
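With $\rho^{p+1}$ and $\delta^p$ precomputed, the per-bit pass decision is a small, data-independent test, which is what makes it suitable for one-thread-per-bit execution. An illustrative sketch (our own naming, not the paper's code):

```python
def coding_pass(rho_next: int, delta: int) -> str:
    """Decide the pass of a bit from the precomputed states:
    SPP if the coefficient is not yet significant but the bit lies
    in the preferred neighborhood, MRP if it is already significant,
    CUP otherwise."""
    if rho_next == 0 and delta == 1:
        return "SPP"
    if rho_next == 1:
        return "MRP"
    return "CUP"

assert coding_pass(0, 1) == "SPP"
assert coding_pass(1, 0) == "MRP"
assert coding_pass(0, 0) == "CUP"
```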
The state $\sigma'$ is substituted by looking for the position of the first non-zero bit in previous bit-planes. Let $\gamma^s_{x,y}$ be the first non-zero bit and $\gamma^p_{x,y}$ the current bit; then $s - p > 1$ iff $\sigma' = 1$, because once the most significant non-zero bit is encoded in SPP/CUP, all bits below are coded using MRC in the MRP.

CUP. All the other bits are coded in the CUP. For the original algorithm this means the bit was not coded in the SPP ($\eta_{x,y} = 0$) and it is not significant, hence it cannot be coded in the MRP ($\sigma_{x,y} = 0$). The parallel algorithm verifies that the bit was not coded in the MRP ($\rho^{p+1}_{x,y} = 0$) and was not in the preferred neighborhood in the SPP, thus it cannot be coded in the SPP ($\delta^p_{x,y} = 0$). Bits belonging to the CUP are processed using ZC or RLC, and non-zero bits are moreover processed using SC. In the original algorithm, $\sigma$ states are used to decide whether to code a bit using ZC or RLC, and they are also used by the ZC and SC coding operations. The parallel algorithm replaces $\sigma$ by the $\rho \oplus \tau$ combination and additionally checks the bit values of previously processed neighbors (i.e., neighbors in $\theta^5$, $\theta^4$, or $\theta^3$, depending on the y-axis position). Here every previously processed non-zero bit $\gamma^p_{x,y}$ means that coefficient $\gamma_{x,y}$ is already significant, since it was coded either in the SPP or in the CUP (note that every non-zero bit coded in the SPP or CUP becomes significant right after it is coded), or it could be coded in the MRP, which means it became significant in one of the previous bit-planes.

Specifically for GPUs, to avoid execution path divergence, we propose successive execution of the coding operations, and we also propose a bit-to-thread mapping, i.e., thread blocks are of the same size as the code-blocks, and each bit-plane
is processed in four consecutive steps: MRC, RLC, ZC, SC. Note that each coding operation is executed for all bits of a bit-plane in parallel. The only constraint on bit-coding independence stems from the varying number of bits coded by the RLC operation: the RLC is defined to code one to four bits in a column, and predicting that number is virtually as expensive as the RLC coding itself. The only operation affected by this is ZC, so we choose to perform the RLC operations in the current bit-plane before ZC. Although the proposed design also allows for parallelism among bit-planes, we do not exploit it because of the limited shared memory size on contemporary GPUs. The described fine-grained parallel algorithm allows for processing individual bits in parallel threads, resulting in high utilization of the multiprocessors on the GPU. Depending on the chosen code-block size, the data may be processed entirely in the fast shared memory (because of shared memory size limitations, older NVIDIA GPUs are limited to 16×16 code-blocks, while the newer NVIDIA Fermi architecture allows for larger code-blocks).

V. Proof of Equivalence
The equivalence of the original algorithm and the proposed parallel algorithm is based on the equivalence of the information provided by the original state variables $\sigma$, $\sigma'$, and $\eta$ and by the newly defined $\rho$ and $\tau$. The proof is broken down into the equivalence of states $\sigma$ and $\rho$ in already processed bit-planes (Lemma 1) and the equivalence of $\sigma$ and $\rho \oplus \tau$ in the SPP on the current bit-plane (Lemma 2). The equivalence of the information provided by $\eta$ and $\rho$ was shown in the description of the MRP above. The replacement of the information from $\sigma'$ by $\gamma^{P-1} \ldots \gamma^{p+1}$ has also been described for the MRP.

Definition 5. A state variable $\sigma_{x,y}$ is set to 1 (i.e., coefficient $\gamma_{x,y}$ becomes significant) in the original algorithm right after the most significant non-zero bit $\gamma^p_{x,y}$ is processed in either the SPP or the CUP. A bit $\gamma^p_{x,y}$ becomes significant in the SPP iff, right before the bit is processed, it holds that $\sigma_{x,y} = 0 \wedge \gamma^p_{x,y} = 1 \wedge \bigvee_{(i,j)\in\theta^A_{x,y}} (\sigma_{i,j} = 1)$. Otherwise, if $\sigma_{x,y} = 0 \wedge \gamma^p_{x,y} = 1$, the bit becomes significant in the CUP.

Lemma 1. The value of $\sigma_{x,y}$ is equal to $\rho^p_{x,y}$ prior to bit $\gamma^{p-1}_{x,y}$ being processed.
1. $\sigma_{x,y} = 1 \Rightarrow \rho^{p+1}_{x,y} \oplus \tau^p_{x,y} = 1$
2. $\sigma_{x,y} = 1 \Leftarrow \rho^{p+1}_{x,y} \oplus \tau^p_{x,y} = 1$

Note that we assume the values of all state variables to be equal to zero for indices $x, y$ beyond the borders of a bit-plane.

Basis. Let $x = 0$, $y = 0$.

1. We want to show that if $\sigma_{0,0} = 1$ then $\rho^{p+1}_{0,0} \oplus \tau^p_{0,0} = 1$.
A. Let $\sigma_{0,0} = 1$ before $\gamma^p_{0,0}$ is processed. Then $\sigma_{0,0} = \rho^{p+1}_{0,0} = 1$ (Lemma 1). Following from Def. 3, $\tau^p_{0,0} = 0$ because $\rho^{p+1}_{0,0} = 1$.
B. Let $\sigma_{0,0} = 1$ right after, but not before, bit $\gamma^p_{0,0}$ is processed. Then coefficient $\gamma_{0,0}$ becomes significant in the current bit-plane $p$, and thus right before bit $\gamma^p_{0,0}$ is processed it holds that $\sigma_{0,0} = 0 \wedge \gamma^p_{0,0} = 1 \wedge [\sigma_{1,0} = 1 \vee \sigma_{1,1} = 1 \vee \sigma_{0,1} = 1]$. Then, following from Lemma 1, we can replace every $\sigma_{i,j}$ with $\rho^{p+1}_{i,j}$ for each unprocessed $\gamma^p_{i,j}$, hence $\rho^{p+1}_{0,0} = 0 \wedge \gamma^p_{0,0} = 1 \wedge [\rho^{p+1}_{1,0} = 1 \vee \rho^{p+1}_{1,1} = 1 \vee \rho^{p+1}_{0,1} = 1]$. Therefore $\tau^p_{0,0} = 1$ from the first step of Definition 3.

2. We want to show that if $\rho^{p+1}_{0,0} \oplus \tau^p_{0,0} = 1$ then $\sigma_{0,0} = 1$.
A. Let $\rho^{p+1}_{0,0} = 1 \wedge \tau^p_{0,0} = 0$. Then $\sigma_{0,0} = 1$ from Definition 1 and Lemma 1.
B. Let $\rho^{p+1}_{0,0} = 0 \wedge \tau^p_{0,0} = 1$. Then coefficient $\gamma_{0,0}$ becomes significant in the current bit-plane $p$, and from Definition 3 we have $\rho^{p+1}_{0,0} = 0 \wedge \gamma^p_{0,0} = 1 \wedge [\rho^{p+1}_{1,0} = 1 \vee \rho^{p+1}_{1,1} = 1 \vee \rho^{p+1}_{0,1} = 1]$. Then, following from Lemma 1, we can replace every $\rho^{p+1}_{i,j}$ with $\sigma_{i,j}$ for each unprocessed $\gamma^p_{i,j}$; hence, right before $\gamma^p_{0,0}$ is processed, it holds that $\sigma_{0,0} = 0 \wedge \gamma^p_{0,0} = 1 \wedge [\sigma_{1,0} = 1 \vee \sigma_{1,1} = 1 \vee \sigma_{0,1} = 1]$. Thus $\sigma_{0,0} = 1$ right after $\gamma^p_{0,0}$ is processed (Def. 5).

Induction step. We assume that $\sigma_{x',y'} = 1 \Leftrightarrow \rho^{p+1}_{x',y'} \oplus \tau^p_{x',y'} = 1$ for each $\gamma^p_{x',y'}$ processed before $\gamma^p_{x,y}$ according to the prescribed scan pattern.

1. Let $\sigma_{x,y} = 1$. Then
A. either coefficient $\gamma_{x,y}$ became significant in one of the previous bit-planes, hence $\rho^{p+1}_{x,y} = 1$ (Lemma 1),
B. or coefficient $\gamma_{x,y}$ is going to become significant in the current bit-plane $p$, and therefore one of the three $y \bmod 4$ cases arises:
I. $\rho^{p+1}_{x,y} = 0 \wedge \gamma^p_{x,y} = 1 \wedge \left[ \bigvee_{(i,j)\in\theta^5_{x,y}} (\sigma_{i,j} = 1) \vee \bigvee_{(i,j)\in\theta^{3'}_{x,y}} (\rho^{p+1}_{i,j} = 1) \right]$
II. $\rho^{p+1}_{x,y} = 0 \wedge \gamma^p_{x,y} = 1 \wedge \left[ \bigvee_{(i,j)\in\theta^4_{x,y}} (\sigma_{i,j} = 1) \vee \bigvee_{(i,j)\in\theta^{4'}_{x,y}} (\rho^{p+1}_{i,j} = 1) \right]$
III. $\rho^{p+1}_{x,y} = 0 \wedge \gamma^p_{x,y} = 1 \wedge \left[ \bigvee_{(i,j)\in\theta^3_{x,y}} (\sigma_{i,j} = 1) \vee \bigvee_{(i,j)\in\theta^{5'}_{x,y}} (\rho^{p+1}_{i,j} = 1) \right]$

Using the induction assumption, we expand each $\sigma_{i,j}$ to $\rho^{p+1}_{i,j} \oplus \tau^p_{i,j} = 1$, so in case 1.B.I we get
$\rho^{p+1}_{x,y} = 0 \wedge \gamma^p_{x,y} = 1 \wedge \left[ \bigvee_{(i,j)\in\theta^5_{x,y}} (\rho^{p+1}_{i,j} \oplus \tau^p_{i,j} = 1) \vee \bigvee_{(i,j)\in\theta^{3'}_{x,y}} (\rho^{p+1}_{i,j} = 1) \right]$
All terms $\rho^{p+1}_{i,j} \oplus \tau^p_{i,j}$ can be rewritten in the form of a simple disjunction because $\rho^{p+1}_{i,j}$ and $\tau^p_{i,j}$ cannot both be equal to 1 at the same time. Hence
$\rho^{p+1}_{x,y} = 0 \wedge \gamma^p_{x,y} = 1 \wedge \left[ \bigvee_{(i,j)\in\theta^5_{x,y}} (\rho^{p+1}_{i,j} = 1) \vee \bigvee_{(i,j)\in\theta^5_{x,y}} (\tau^p_{i,j} = 1) \vee \bigvee_{(i,j)\in\theta^{3'}_{x,y}} (\rho^{p+1}_{i,j} = 1) \right]$
Both $\rho^{p+1}$ disjunctions can be combined together with respect to the surroundings $\theta$. Hence
$\rho^{p+1}_{x,y} = 0 \wedge \gamma^p_{x,y} = 1 \wedge \left[ \bigvee_{(i,j)\in\theta^A_{x,y}} (\rho^{p+1}_{i,j} = 1) \vee \bigvee_{(i,j)\in\theta^5_{x,y}} (\tau^p_{i,j} = 1) \right]$
Therefore $\tau^p_{x,y} = 1$ from Definition 3. The proofs of cases 1.B.II and 1.B.III are analogous to 1.B.I.

2. Let $\rho^{p+1}_{x,y} \oplus \tau^p_{x,y} = 1$. Then
A. either $\rho^{p+1}_{x,y} = 1$, and thus coefficient $\gamma_{x,y}$ became significant in one of the previous bit-planes, hence $\sigma_{x,y} = 1$ (Lemma 1),
B. or coefficient $\gamma_{x,y}$ is going to become significant in the current bit-plane, hence one of the three $y \bmod 4$ cases arises:
I. $\rho^{p+1}_{x,y} = 0 \wedge \gamma^p_{x,y} = 1 \wedge \left[ \bigvee_{(i,j)\in\theta^5_{x,y}} (\rho^{p+1}_{i,j} \oplus \tau^p_{i,j} = 1) \vee \bigvee_{(i,j)\in\theta^{3'}_{x,y}} (\rho^{p+1}_{i,j} = 1) \right]$
II. $\rho^{p+1}_{x,y} = 0 \wedge \gamma^p_{x,y} = 1 \wedge \left[ \bigvee_{(i,j)\in\theta^4_{x,y}} (\rho^{p+1}_{i,j} \oplus \tau^p_{i,j} = 1) \vee \bigvee_{(i,j)\in\theta^{4'}_{x,y}} (\rho^{p+1}_{i,j} = 1) \right]$
III. $\rho^{p+1}_{x,y} = 0 \wedge \gamma^p_{x,y} = 1 \wedge \left[ \bigvee_{(i,j)\in\theta^3_{x,y}} (\rho^{p+1}_{i,j} \oplus \tau^p_{i,j} = 1) \vee \bigvee_{(i,j)\in\theta^{5'}_{x,y}} (\rho^{p+1}_{i,j} = 1) \right]$

Using the induction assumption, we expand each $\rho^{p+1}_{i,j} \oplus \tau^p_{i,j} = 1$ to $\sigma_{i,j} = 1$, so in case 2.B.I we get
$\rho^{p+1}_{x,y} = 0 \wedge \gamma^p_{x,y} = 1 \wedge \left[ \bigvee_{(i,j)\in\theta^5_{x,y}} (\sigma_{i,j} = 1) \vee \bigvee_{(i,j)\in\theta^{3'}_{x,y}} (\rho^{p+1}_{i,j} = 1) \right]$
Each $\rho^{p+1}_{i,j}$ can be rewritten to $\sigma_{i,j}$ because those two state variables have the same value before bit $\gamma^p_{x,y}$ is processed in the current bit-plane $p$ (Lemma 1). Hence
$\sigma_{x,y} = 0 \wedge \gamma^p_{x,y} = 1 \wedge \left[ \bigvee_{(i,j)\in\theta^5_{x,y}} (\sigma_{i,j} = 1) \vee \bigvee_{(i,j)\in\theta^{3'}_{x,y}} (\sigma_{i,j} = 1) \right]$
Surroundings $\theta^5_{x,y}$ and $\theta^{3'}_{x,y}$ can be folded, hence
$\sigma_{x,y} = 0 \wedge \gamma^p_{x,y} = 1 \wedge \bigvee_{(i,j)\in\theta^A_{x,y}} (\sigma_{i,j} = 1)$
Therefore, following Def. 5, state $\sigma_{x,y} = 1$ right after bit $\gamma^p_{x,y}$ is coded.
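The induction above can also be checked mechanically on small code-blocks: a direct simulation of the serial SPP significance propagation must mark exactly the same bits as the parallel fixed-point computation of τ from Def. 3. The following sketch is our own illustrative code (state variables beyond the borders are treated as zero; `rho` holds $\rho^{p+1}$):

```python
import random

def scan_order(width, height):
    """JPEG2000 stripe scan: 4-row stripes, column-major inside a stripe."""
    for top in range(0, height, 4):
        for x in range(width):
            for y in range(top, min(top + 4, height)):
                yield x, y

def serial_spp(bits, rho):
    """Bits becoming significant in the SPP under the serial scan (Def. 5)."""
    h, w = len(bits), len(bits[0])
    sigma = [row[:] for row in rho]          # significance from previous planes
    new = [[0] * w for _ in range(h)]
    for x, y in scan_order(w, h):
        if sigma[y][x]:
            continue
        pn = any(sigma[y + dy][x + dx]       # preferred neighborhood test
                 for dy in (-1, 0, 1) for dx in (-1, 0, 1)
                 if (dx or dy) and 0 <= y + dy < h and 0 <= x + dx < w)
        if pn and bits[y][x]:
            sigma[y][x] = new[y][x] = 1      # significant right after coding
    return new

def parallel_tau(bits, rho):
    """Fixed-point computation of tau following Def. 3."""
    h, w = len(bits), len(bits[0])

    def at(state, i, j):
        return state[j][i] if 0 <= i < w and 0 <= j < h else 0

    def causal(x, y):                        # theta^5 / theta^4 / theta^3
        r = y % 4
        if r == 0:
            return [(x+1, y-1), (x, y-1), (x-1, y-1), (x-1, y), (x-1, y+1)]
        if r in (1, 2):
            return [(x, y-1), (x-1, y-1), (x-1, y), (x-1, y+1)]
        return [(x, y-1), (x-1, y-1), (x-1, y)]

    tau = [[0] * w for _ in range(h)]
    changed = True
    while changed:                           # repeat step 2 of Def. 3 until stable
        changed = False
        for y in range(h):
            for x in range(w):
                if tau[y][x] or rho[y][x] or not bits[y][x]:
                    continue
                pn_rho = any(at(rho, x + dx, y + dy)
                             for dy in (-1, 0, 1) for dx in (-1, 0, 1) if dx or dy)
                pn_tau = any(at(tau, i, j) for i, j in causal(x, y))
                if pn_rho or pn_tau:
                    tau[y][x] = 1
                    changed = True
    return tau

# randomized cross-check of the two formulations
random.seed(0)
for _ in range(200):
    w, h = random.randint(1, 6), random.randint(1, 9)
    bits = [[random.randint(0, 1) for _ in range(w)] for _ in range(h)]
    rho = [[random.randint(0, 1) for _ in range(w)] for _ in range(h)]
    assert serial_spp(bits, rho) == parallel_tau(bits, rho)
```

The randomized loop exercises truncated stripes and border cases as well; any divergence between the two formulations would trip the assertion.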
VI. Experimental Results
Methodology. We set up two benchmark sets focused on the EBCOT Tier-1 processing speed only. Performance of bpcuda (our GPU implementation) was compared with two open-source CPU implementations (OpenJPEG 1.2, http://www.openjpeg.org; JasPer 1.900.1, http://www.ece.uvic.ca/~mdadams/jasper/), one commercial CPU implementation (Kakadu 2.2.3, http://www.kakadusoftware.com), and one GPU implementation (CUJ2K 1.1, http://cuj2k.sourceforge.net). All the CPU implementations were single-threaded.
The first benchmark set was aimed at comparing the processing speed of images with various content. We selected two extreme cases (white, white noise) and one standard image (Lenna), all sized 512×512 pixels. In the second benchmark set we focused on the dependency of processing time on image size; a portrait photograph sized 1280×720, 1920×1080, and 4096×2160 pixels was used. All the images were 8-bit gray-scale, preprocessed using a 3-level reversible DWT transformation. Benchmarks were run 30 times for each configuration. The hardware configuration was as follows: Intel Core i7 950 at 3.07 GHz, 6 GB RAM, GeForce GTX 285. The software stack comprised Linux kernel 2.6.28, NVIDIA drivers 256.53, CUDA Toolkit 3.1, and GCC 4.3.3.

Discussion. Table 1 summarizes the results for both benchmark sets. It can be seen that for a white 512×512 image, the CPU and GPU implementations perform similarly, except for Kakadu. For non-trivial images, and especially for larger images, the computation time prevails and the GPU implementations outperform the CPU ones. For the efficient bpcuda implementation, even processing of 512×512 non-trivial images is approximately 2× faster than the best CPU implementation.

              White         Lenna          White-Noise    1280×720      1920×1080      4096×2160
OpenJPEG      39.9 ± 2.9    128.9 ± 29.1   185.4 ± 4.9    364.8 ± 2.9   723.3 ± 1.7    2818.0 ± 7.8
JasPer        11.5 ± 2.3    80.6 ± 20.2    129.9 ± 3.3    164.0 ± 0.1   369.3 ± 16.6   1481.5 ± 1.3
Kakadu        1.2 ± 0.4     47.8 ± 3.9     61.8 ± 3.9     145.6 ± 4.9   309.7 ± 4.6    1093.1 ± 4.6
CUJ2K         14.1 ± 0.1    101.0 ± 0.2    98.2 ± 0.2     120.1 ± 0.3   258.6 ± 0.4    914.1 ± 0.8
bpcuda        12.4 ± 0.1    26.3 ± 0.1     30.2 ± 0.1     63.5 ± 0.3    137.2 ± 0.5    662.9 ± 0.3

Table 1: EBCOT Tier-1 processing time [ms]; lower time means better performance.

To gain further insight into the EBCOT Tier-1 components, the profiling results of bpcuda and of JasPer, the best-performing open-source CPU implementation and the reference implementation, are compared. The EBCOT Tier-1 is the most time-consuming part of the encoding chain on the CPU. In the case of JasPer processing the 1920×1080 image, the context modeler occupies 76% (280.7 ms) and the arithmetic coder 24% (88.6 ms) of the EBCOT Tier-1 time. When bpcuda processes the same image, the context modeler consumes only 17% (23.3 ms) and 83% (113.9 ms) is spent in the arithmetic coder. The overall speedup of bpcuda is thus degraded by the not-yet-optimized arithmetic coder; the speedup of the context modeler itself is 12× compared to JasPer.

When analyzing the efficiency of the bpcuda implementation in terms of GPU utilization, the implementation turns out to be compute-bound, as loading data from GPU global to GPU shared memory takes only ∼2% of the algorithm run time. The peak performance of the GTX 285 is 1 TFLOPS for concurrent MAD+MUL instructions, 710 GFLOPS for MAD, and 355 GFLOPS for other arithmetic instructions. The implementation of the bpcuda context modeler reaches ∼260 GFLOPS, which is 69% of the peak performance given that less than 6% of MAD instructions are used.
VII. Conclusions
We have presented a novel approach to reformulating the context modeler algorithm of the EBCOT Tier-1 process in JPEG2000 in a way that enables fine-grained parallelism on massively parallel architectures and allows for a scalable implementation on GPUs with the SIMT architecture, regardless of JPEG2000 causal mode. We have proved its correctness with respect to the specification in the JPEG2000 standard. The proposed algorithm has been evaluated using a CUDA implementation, showing a 1.4–6.1× speedup of the entire EBCOT Tier-1 over existing CPU and GPU implementations and a 12× speedup of the context modeler itself compared with, to our knowledge, the best open-source implementation. The acceleration is significant despite the not-yet-optimized MQ-Coder, on which we will focus in the near future. The proposed parallelization enables efficient offloading of JPEG2000 processing from CPUs that may not have sufficient processing capacity, especially in the case of desktop and mobile CPUs, and in real-time applications where the CPU is loaded by other tasks of the application, such as network traffic processing. Even with future generations of many-core server CPUs and threaded implementations of JPEG2000 on CPUs, applications may still benefit from distributing the processing between both the CPU and the GPU.

Acknowledgments
This work has been supported by research intents MŠM 0021622419 and MŠM 6383917201 and grant GD 102/09/H042. We would also like to thank Jana Tůmová for very helpful discussions.

References
[1] ISO/IEC 15444-1. JPEG2000 image coding system, Part 1: Core coding system.
[2] D. S. Taubman and M. W. Marcellin. JPEG2000: Image Compression Fundamentals, Standards, and Practice. Springer, 2002.
[3] M. W. Marcellin and A. Bilgin. JPEG2000 for digital cinema. SMPTE Motion Imaging Journal, 114(5-6):202–209, 2005.
[4] NVIDIA. NVIDIA CUDA Programming Guide 3.0. NVIDIA, 2010.
[5] D. Kirk and W. Wen-mei. Programming Massively Parallel Processors: A Hands-on Approach. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2010.
[6] J. Matela. GPU-based DWT acceleration for JPEG2000. In Annual Doctoral Workshop on Mathematical and Engineering Methods in Computer Science, pp. 136–143, 2009.
[7] C. Christopoulos, A. Skodras, et al. The JPEG2000 still image coding system: an overview. IEEE Transactions on Consumer Electronics, 46(4):1103–1127, Nov 2000.
[8] T. Acharya and P.-S. Tsai. JPEG2000 Standard for Image Compression: Concepts, Algorithms and VLSI Architectures. Wiley-Interscience, New York, 2004.
[9] Y. Li and M. Bayoumi. Three-level parallel high speed architecture for EBCOT in JPEG2000. In Intl. Conference on Acoustics, Speech, and Signal Processing, 2005.
[10] M. Dyer, D. Taubman, et al. Memory efficient pass-parallel architecture for JPEG2000 encoding. In 7th Intl. Symp. on Signal Processing and Its Applications, pp. 53–56, 2003.
[11] A. Weiß, M. Heide, et al. CUJ2K: a JPEG2000 encoder in CUDA, 2009.
[12] S. Datla and N. S. Gidijala. Parallelizing Motion JPEG2000 with CUDA. In Intl. Conf. on Computer and Electrical Engineering, pp. 630–634, 2009.
[13] I. K. Park, N. Singhal, et al. Design and performance evaluation of image processing algorithms on GPUs. 2010.