J Sign Process Syst DOI 10.1007/s11265-010-0491-8
Real-Time Tone-Mapping Processor with Integrated Photographic and Gradient Compression Using 0.13 μm Technology on an ARM SOC Platform

Ching-Te Chiu · Tsun-Hsien Wang · Wei-Ming Ke · Chen-Yu Chuang · Jhih-Siao Huang · Wei-Su Wong · Ren-Song Tsay · Cyuan-Jhe Wu
Received: 12 November 2009 / Revised: 16 April 2010 / Accepted: 23 April 2010 © Springer Science+Business Media, LLC 2010
Abstract Due to recent advances in high dynamic range (HDR) technologies, the ability to display HDR images or videos on conventional LCD devices has become more and more important. Many tone-mapping algorithms have been proposed to this end, the choice of which depends on display characteristics such as luminance range, contrast ratio and gamma correction. An ideal HDR tone-mapping processor should have a robust core functionality, high flexibility, and low area consumption, and therefore an ARM-core-based system-on-chip (SOC) platform with a HDR tone-mapping application-specific integrated circuit (ASIC) is suitable for such applications. In this paper, we present a systematic methodology for the development of a tone-mapping processor of optimized architecture using an ARM SOC platform, and illustrate the use of this novel HDR tone-mapping processor for both photographic and gradient compression. Optimization is achieved through four major steps: common module extraction, computation power enhancement, hardware/software partition, and cost function analysis. Based on the proposed scheme, we present an integrated photographic and gradient tone-mapping processor that can be configured for different applications. This newly-developed processor can process 1,024 × 768 images at 60 fps, runs at a 100 MHz clock and consumes a core area of 8.1 mm² under TSMC 0.13 μm technology, resulting in a 50% improvement in speed and area as compared with previously-described processors.

Keywords High Dynamic Range (HDR) · Gradient compression · Photographic tone-mapping · SOC platform · Real-time

C.-T. Chiu · T.-H. Wang · W.-M. Ke · C.-Y. Chuang · J.-S. Huang · W.-S. Wong · R.-S. Tsay · C.-J. Wu (B)
Department of Computer Science, National Tsing Hua University, Hsin-Chu, Taiwan, Republic of China
e-mail: [email protected]
1 Introduction

Real-scene luminance spans a dynamic range of about 10^8:1, while traditional displays can only exhibit a range of 100∼1,000:1, a limitation caused by the dynamic range of both capture sensors and display devices. Significant progress has been made in the development of HDR video sensors such as the Lars III (Silicon Vision), Autobrite (SMaL Camera Technologies), HDRC (IMS Chips), LM9628 (National), and the Digital Pixel System [1], and with this progress in HDR capture technologies, high dynamic range images and videos have become accessible. On the other
hand, due to the development of high dynamic range OLED, LED, LCD, and laser TV, the ability to display HDR images and videos on these devices has become important. Tone-mapping or tone reproduction is an image-processing technique that can be used to render high dynamic range images on display screens. Over the past few years, a considerable number of tone-mapping methods, commonly classified into "global" and "local" tone reproduction methods, have been proposed [2]. Global tone reproduction adopts the same mapping scheme for all pixels in an image. Tumblin and Rushmeier introduced the tone-mapping concept to computer graphics [3]; Chiu et al. used a spatially-varying exposure ramp over the image [4]; and Reinhard et al. presented the photographic tone reproduction method for digital images [5], which is simple and produces good results for a wide variety of images. Local tone reproduction operators take the local spatial context into account. Pattanaik et al. developed a technique by which patterns, luminance and color processing can be represented in human visual systems [6]. Fattal et al. proposed a gradient field compression method that achieves drastic dynamic range compression and successfully preserves fine detail [7]. However, gradient domain HDR compression consumes significant computation time. Simple global tone-mapping algorithms can be run in real time on modern CPUs [2]. For example, compressing a 1,600 × 1,200 image with an Apple iBook G3 (800 MHz) processor takes 0.96–3.7 s using different global tone-mapping schemes and 10–120 s with various local tone-mapping methods [2]. Modern graphic hardware is needed to perform local tone-mapping for real-time videos [8], and neither general-purpose CPUs nor graphic hardware are suitable for modern TV display applications. An ideal HDR tone-mapping processor should contain both global and local tone-mapping algorithms with tunable parameters for different kinds of displays. The choice of tone-mapping algorithm depends on device characteristics such as luminance range, contrast ratio and gamma correction. In addition to high flexibility, such a HDR tone-mapping processor must have the characteristics of robust core functionality and low area consumption. An ARM-core-based SOC platform with a HDR tone-mapping ASIC is suitable for such applications. In this paper, we present a systematic methodology for the development of a tone-mapping processor of optimized architecture on an ARM SOC platform, and illustrate the use of this HDR tone-mapping processor for both photographic and gradient compression. Optimization is achieved through four major steps: common
Figure 1 Logarithmic tone-mapping architecture.
module extraction, computation power enhancement, hardware/software partition, and cost function analysis. A modified photographic tone-mapping scheme that includes color reproduction and gamma correction is presented in this paper and a block-based gradient domain HDR compression scheme is described, which enhances the processing speed significantly. The rest of the paper is organized as follows: a generic tone-mapping processor is described in Section 2; the modified photographic tone-mapping method is described in Section 3; block-based gradient compression is discussed in Section 4; the integrated photographic and gradient tone-mapping processor and ESL design flow are described in Section 5; hardware and software analysis is presented in Section 6; hardware implementation and experimental results are given in Section 7; and a brief conclusion is presented in Section 8.
2 Generic Tone-Mapping Processor

In this section, the four main types of tone reproduction scheme are presented: logarithmic tone-mapping developed by Drago et al., photographic tone-mapping by Reinhard et al., bilateral tone-mapping by Durand et al., and gradient compression by Fattal et al. The architectures of these four tone-mapping schemes are presented and a generic tone-mapping processing flow is described. Drago et al. extended the logarithmic response curves to handle a wider dynamic range by adjusting the base of the logarithm according to the value of each pixel [9]; the architecture for this method is shown in Fig. 1. Reinhard et al. proposed the use of both global and local operators to perform tone-mapping [5]. The global operator makes use of the log-average luminance and S-shaped transfer functions to predominantly compress high luminance; the architecture for this scheme is shown in Fig. 2. Durand and Dorsey introduced the use of a bilateral filter to split a density image (logarithmic luminance map) into a HDR and a LDR layer [10]. The HDR
Figure 2 Reinhard’s photographic tone-mapping architecture.
Figure 4 Gradient tone-mapping architecture.
layer is compressed and recombined with the other layer, and the result is then exponentiated to form a LDR image, as shown in Fig. 3. Fattal et al. adaptively attenuated the gradient field of the density image using a multi-scale approach [7]. After derivation of the attenuated gradient field, the compressed density image is reconstructed as shown in Fig. 4. A generic tone-mapping processing flow can be described according to these four common types of tone reproduction scheme, as shown in Fig. 5. As tone-mapping is performed in the luminance domain, the luminance and chrominance components of a HDR image are separated first. The luminance is then compressed using miscellaneous tone-mapping algorithms to reduce the dynamic range while preserving the visual quality. The logarithm of luminance is adopted in several algorithms to reduce the computing complexity. If the logarithm operation is performed, then the exponent of the compressed luminance after tone-mapping processing is needed. Finally, the compressed luminance and chrominance information is combined to obtain a LDR image. In this paper, we focus on developing a general tone-mapping processor based on the photographic and gradient tone-mapping algorithms that includes consideration of both real-time implementation and image quality. From [2], the overriding success of the photographic tone reproduction method proposed by E. Reinhard is clear. Photographic tone reproduction is of lower complexity than other methods and produces
better quality results using HDR images. The gradient domain HDR compression method was selected as our tone-mapping scheme owing to its high compression and detail preservation characteristics. The functions of these methods are described in the following sections.
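To make the generic flow of Fig. 5 concrete, the following C sketch traces one pixel through the pipeline: luminance/chrominance separation, compression in the log-luminance domain, exponentiation, and recombination. The compress() curve is a hypothetical placeholder standing in for any of the four operators above; the luminance weights are those of Eq. 1 in the next section.

```c
#include <math.h>

/* Placeholder tone curve: a simple linear scaling in the log domain,
   standing in for any of the four operators of Section 2. */
static double compress(double log_lum) {
    return 0.5 * log_lum;
}

void tone_map_pixel(const double rgb_w[3], double rgb_d[3]) {
    /* 1. separate luminance from chrominance (Eq. 1 weights) */
    double Lw = 0.2654 * rgb_w[0] + 0.6704 * rgb_w[1] + 0.0642 * rgb_w[2];
    if (Lw <= 0.0) { rgb_d[0] = rgb_d[1] = rgb_d[2] = 0.0; return; }

    /* 2. compress in the log-luminance domain, then exponentiate */
    double Ld = exp(compress(log(Lw)));

    /* 3. recombine: scale each channel by the luminance ratio */
    for (int c = 0; c < 3; c++)
        rgb_d[c] = rgb_w[c] * (Ld / Lw);
}
```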
Figure 3 Bilateral tone-mapping architecture.
Figure 5 General tone-mapping processing flow.
3 Modified Photographic Tone-Mapping

A modified tone-mapping processor for HDR video based on Reinhard's photographic tone reproduction scheme is presented below. This method is suitable for real-time implementation. The luminance of each HDR video frame is calculated first from Eq. 1:

L_w = 0.2654 R_w + 0.6704 G_w + 0.0642 B_w    (1)
Figure 6 Modules of photographic tone-mapping.
where R_w, G_w and B_w are the scene-referred colors and L_w is the scene luminance. After the luminance of each video frame is obtained, the log-average luminance L_avg is calculated, as shown in Eq. 2. The log-average luminance is useful information for calculating the key of the scene.

L_avg = exp( (1/N) Σ log(δ + L_w) )    (2)

In Eq. 2, N indicates the number of pixels in the video frame, and δ is a very small value. For a video sequence, the temporal frames are highly correlated with successive video frames, and hence we modify Eq. 2 for HDR video frames, as shown in Eq. 3:

L_avg(t_k) = exp( (1/N) Σ_{i,j} log(δ + L_w(i, j, t_{k−1})) ),  for k ≥ 1    (3)

where L_avg(t_k) indicates the average scene luminance of the k-th HDR video frame and L_w(i, j, t_{k−1}) indicates the scene luminance of pixel (i, j) of the (k−1)-th HDR video frame. Next, the average or normal luminance can be mapped to a desired key controlled by the parameter α with a linear scaling. The scaled luminance L_s(i, j, t_k) is shown in Eq. 4:

L_s(i, j, t_k) = (α / L_avg(t_{k−1})) · L_w(i, j, t_k)    (4)
The parameter α affects the brightness of the video. For display considerations, we use Eq. 5 to map high luminance in a controlled way:

L_d(i, j, t_k) = L_s(i, j, t_k) · (1 + L_s(i, j, t_k) / L_white²(t_{k−1})) / (1 + L_s(i, j, t_k))    (5)

L_white is the luminance value of the white point. The maximum luminance value of the previous video frame is used as the white point for the current frame. The ratio of the color channels before and after compression is kept constant to maintain the color shift at
a minimum [2]. This can be achieved if the compressed image R_d, G_d, B_d values are computed as follows:

C_d = L_d · (C_w / L_w)    (6)

where C_w indicates R_w, G_w, B_w; C_w and L_w denote the color and luminance before HDR compression. C_d indicates R_d, G_d, B_d; C_d and L_d denote the color and luminance after HDR compression. In order to control the amount of saturation in the image, we apply an exponent γ to the ratio C_w/L_w in Eq. 6. This represents gamma correction per channel as follows:

C_d = L_d · (C_w / L_w)^γ    (7)

The exponent γ is a user parameter that takes values between 0 and 1. The modules of the photographic tone-mapping process are shown in Fig. 6. The "Scale_to_midtone" module computes an average of logarithm luminance for the entire frame, while the "TM_photographic" module uses the average logarithm luminance and white point as scaling parameters to map the luminance from the HDR image to the LDR image.
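As an illustration, the following C sketch strings Eqs. 1–7 together for one frame. It is a minimal software model, not the hardware module itself, and the parameter values (α, γ, δ) are hypothetical defaults; per the text, the log-average and white point come from the previous frame, so they are passed in and updated for use on the next frame.

```c
#include <math.h>

#define ALPHA 0.18    /* key value α (user parameter, assumed default) */
#define GAMMA 0.6     /* per-channel gamma γ in (0, 1] (assumed)       */
#define DELTA 1e-6    /* small δ guarding log(0)                       */

/* rgb_w/rgb_d are interleaved RGB buffers of npix pixels. Initialize
   *Lavg and *Lwhite to 1.0 before the first frame (hypothetical choice). */
void photographic_tm(const float *rgb_w, float *rgb_d,
                     int npix, double *Lavg, double *Lwhite)
{
    double sum_log = 0.0, max_L = DELTA;

    for (int p = 0; p < npix; p++) {
        const float *cw = rgb_w + 3 * p;
        float       *cd = rgb_d + 3 * p;

        /* Eq. 1: scene luminance */
        double Lw = 0.2654 * cw[0] + 0.6704 * cw[1] + 0.0642 * cw[2];
        sum_log += log(DELTA + Lw);
        if (Lw > max_L) max_L = Lw;

        /* Eq. 4: scale to the desired key with last frame's average */
        double Ls = ALPHA * Lw / *Lavg;

        /* Eq. 5: compress, burning out above last frame's white point */
        double Ld = Ls * (1.0 + Ls / (*Lwhite * *Lwhite)) / (1.0 + Ls);

        /* Eq. 7: keep the color ratios, with per-channel gamma */
        for (int c = 0; c < 3; c++)
            cd[c] = (float)(Ld * pow(cw[c] / (Lw + DELTA), GAMMA));
    }

    /* Eqs. 2-3 and white point, computed here for use on the next frame */
    *Lavg   = exp(sum_log / npix);
    *Lwhite = max_L;
}
```

Because the statistics of Eqs. 3 and 5 are taken from the previous frame, the whole frame is processed in a single pass, which is what makes the operator attractive for real-time hardware.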
4 Block-Based Gradient Compression

Gradient compression computes luminance in the logarithm domain because this results in a better approximation of the perceived brightness [7]. The logarithm of the luminance in the HDR image is denoted H(x,y). To avoid spatial distortions in the image, we change only the magnitudes of the gradients and keep their directions unaltered. This goal is achieved by applying a spatially variant attenuating function Φ:

G(x, y) = ∇H(x, y) · Φ(x, y)    (8)
Here, ∇H(x,y) is the gradient of the logarithm image and G(x,y) is the gradient image after attenuation. The
gradient is approximated by the forward difference values:

∇H(x, y) ≈ ( H(x+1, y) − H(x, y),  H(x, y+1) − H(x, y) )    (9)

HDR compression can be achieved by attenuating the gradients of each pixel in the image by the attenuation factor, which is obtained from the modified (central-difference) gradient:

∇H_h(x, y) = ( (H(x+1, y) − H(x−1, y)) / 2,  (H(x, y+1) − H(x, y−1)) / 2 )    (10)

The attenuation factor for each pixel is determined by the magnitude of the modified gradient, as per the following equation:

Φ(x, y) = (α / ‖∇H_h(x, y)‖) · (‖∇H_h(x, y)‖ / α)^β    (11)

The function has two parameters, α and β, which determine how the gradient of each pixel is attenuated. The relationship between gradient magnitude and attenuation factor is shown in Fig. 7a, b. Figure 7a shows the relationship between gradient magnitude and attenuation factor for different α with a fixed β. The gradient of a pixel is attenuated if its magnitude is larger than α, preserved if its magnitude is equal to α, and magnified if its magnitude is smaller than α. In order to achieve attenuation, the value of β must be smaller than one. Figure 7b shows that the attenuation factor curve is sharper if β is smaller. After attenuating the gradient, we compute the differential of the attenuated gradient (G_x, G_y) to get the divergence. The backward difference is used to obtain approximations of the divergence.
Figure 7 Relationship between gradient magnitude and attenuation factor.
divG ≈ G_x(x, y) − G_x(x−1, y) + G_y(x, y) − G_y(x, y−1)    (12)

According to [14], we can reconstruct the LDR image I by solving the following Poisson equation:

∇²I = divG    (13)

where

∇²I = ∂²I/∂x² + ∂²I/∂y²    (14)

and

divG = ∂G_x/∂x + ∂G_y/∂y    (15)

∇² is the Laplacian operator, and ∇²I is obtained from the following standard finite difference approximation:

∇²I ≈ I(x+1, y) + I(x−1, y) + I(x, y+1) + I(x, y−1) − 4I(x, y)    (16)
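A compact software model of Eqs. 9–12 is sketched below: it computes the attenuated gradient field of one block and its divergence, i.e., the right-hand side of the Poisson equation (13). The boundary clamping here is a simplification standing in for the block extension of Fig. 8 described below, and the maximum block size is an assumption of this sketch.

```c
#include <math.h>

#define AT(img, x, y)  (img)[(y) * N + (x)]
#define CLAMP(v)       ((v) < 0 ? 0 : ((v) >= N ? N - 1 : (v)))

/* H is an N×N log-luminance block (row-major, N <= 64 assumed);
   div_G receives the divergence of Eq. 12. alpha/beta are the
   user parameters of Eq. 11. */
void gradient_divergence(const double *H, double *div_G,
                         int N, double alpha, double beta)
{
    double Gx[64 * 64], Gy[64 * 64];   /* attenuated gradient field */

    for (int y = 0; y < N; y++)
        for (int x = 0; x < N; x++) {
            /* Eq. 10: central-difference (modified) gradient magnitude */
            double gx = (AT(H, CLAMP(x+1), y) - AT(H, CLAMP(x-1), y)) / 2.0;
            double gy = (AT(H, x, CLAMP(y+1)) - AT(H, x, CLAMP(y-1))) / 2.0;
            double mag = sqrt(gx * gx + gy * gy) + 1e-9;

            /* Eq. 11: attenuation factor (beta < 1 attenuates) */
            double phi = (alpha / mag) * pow(mag / alpha, beta);

            /* Eqs. 8-9: attenuate the forward-difference gradient */
            AT(Gx, x, y) = (AT(H, CLAMP(x+1), y) - AT(H, x, y)) * phi;
            AT(Gy, x, y) = (AT(H, x, CLAMP(y+1)) - AT(H, x, y)) * phi;
        }

    /* Eq. 12: backward-difference divergence */
    for (int y = 0; y < N; y++)
        for (int x = 0; x < N; x++)
            AT(div_G, x, y) = AT(Gx, x, y) - AT(Gx, CLAMP(x-1), y)
                            + AT(Gy, x, y) - AT(Gy, x, CLAMP(y-1));
}
```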
To solve the partial differential equation (PDE), we must assign boundary conditions and then find the integration constant in the general solutions. At the boundaries, we assume that the derivatives around the original image grid are zero. The numerical solution of the PDE can then be obtained through the Discrete Sine Transform [16]. Finally, the luminance is computed by the exponential operation, and the processed luminance is transformed to the RGB color image in a manner similar to Eq. 7.

Block-based gradient domain HDR compression divides a HDR frame into blocks of the same size, N × N. Each block is then extended by copying pixels from the four boundaries, as shown in Fig. 8, and the extended block size becomes (N + 2) × (N + 2). The extended block is used for the approximation of the divergence and the Laplacian operator. The gradient domain HDR compression scheme described above is then applied to each block; a sketch of this boundary extension is given after Fig. 9 below.

Figure 8 Extended block with boundary points.

Figure 9 shows the modules and data flow of block-based gradient domain compression (GDC) tone-mapping [12]. The data flow includes three main parts: the front-end, tone-mapping, and the back-end. The front-end performs the input format conversion. The "Frame to Block" module in the front-end divides the HDR image into blocks of size N × N to achieve real-time processing. The tone-mapping part includes logarithm luminance calculation, differentiation, attenuation, the Poisson solver for partial differential equations, exponential luminance, and luminance to color transformation. Among all operations in the tone-mapping section, the "Solve Poisson" module consumes the most computation power. The back-end part includes gamma correction, two kinds of normalization, and block to frame reconstruction. The front-end, back-end and some of the functions in the GDC are the same as those used in photographic tone-mapping methods. For example, the color to luminance transformation and logarithm luminance calculation can be shared by both tone-mapping algorithms.

Figure 9 Modules and data flow of GDC tone-mapping.
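The boundary extension of Fig. 8 can be sketched as below, assuming row-major blocks; the corner values of the (N + 2) × (N + 2) extension come from the replicated top and bottom rows.

```c
/* Pad an N×N block to (N+2)×(N+2) by replicating its four boundaries,
   so the differences of Eqs. 9-12 stay inside the extended block. */
void extend_block(const double *blk, double *ext, int N)
{
    int M = N + 2;

    /* copy the interior */
    for (int y = 0; y < N; y++)
        for (int x = 0; x < N; x++)
            ext[(y + 1) * M + (x + 1)] = blk[y * N + x];

    /* replicate left/right columns of the interior rows */
    for (int y = 1; y <= N; y++) {
        ext[y * M + 0]     = ext[y * M + 1];
        ext[y * M + M - 1] = ext[y * M + M - 2];
    }
    /* replicate full top and bottom rows (corners included) */
    for (int x = 0; x < M; x++) {
        ext[0 * M + x]       = ext[1 * M + x];
        ext[(M - 1) * M + x] = ext[(M - 2) * M + x];
    }
}
```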
5 Integrated Photographic and Gradient Compression

From the descriptions of the previous two architectures, it can be observed that several modules are used by both algorithms and so can be extracted and shared. These common modules include the front-end color to luminance transformation and the back-end luminance to color transformation. In addition, as most tone-mapping algorithms perform tone-mapping in the logarithm luminance domain, the logarithm and exponent calculations of the luminance can also be extracted. The integrated photographic and gradient compression process is shown in Fig. 10: the common modules are outlined with a solid line; the gradient-only modules with a dotted line; and the photographic-only modules with a dashed line. In the integrated design, a control signal is added to enable the choice between the two tone-mapping approaches.

5.1 ESL Design Flow and ARM SOC Platform

In this section, we introduce the Electronic System Level (ESL) design flow and the ARM SOC platform with the HDR tone-mapping ASIC. Figure 11 shows the ESL design flow, which emphasizes system integration before detailed hardware and software implementation. To define the system and module specifications, the design flow begins with identification of the system requirements according to the target applications. Based on these specifications, the system architecture can be built by modeling modules using SystemC and Transaction-Level Modeling (TLM). Using TLM, we can build proper abstraction models for the system rapidly to enable complex system analysis in a cost-effective way. Using the system analysis
Figure 10 Integrated photographic and gradient compression.
results, we can then perform the most suitable hardware and software partitioning. Development of the embedded software and the hardware can be performed in parallel and then co-verified. Based on the simulated model, the hardware and software developers can communicate with each other, such that co-verification is much easier than it is in a traditional design flow. We used the commercial tool CoWare Platform Architect to generate the ARM-based platform and analyze the system performance. Figure 12 shows the
basic architecture of the ARM SOC platform, which is built from an ARM926EJS core, a 64-bit AMBA AHB bus, 64 MB of SRAM, and the HDR tone-mapping ASIC block. The software part of HDR tone-mapping is executed on the ARM926EJS core, and the hardware part is implemented in the ASIC. In the ARM SOC platform used in our simulations, the hardware and software parts of the HDR process are co-simulated, so we do not need to make separate simulations of the hardware and software parts. The ARM926EJS core simulation module is obtained from ARM. The functions of the software part are implemented in C code, which we compile with the commercial tool ADS (ARM Developer Suite) into an ARM axf file that can be executed by the ARM core. The functions of the hardware part are implemented in SystemC code and executed in the ASIC block. Since the ASIC transfers data to the ARM926EJS core or the SRAM through the AHB bus, a wrapper is added between the ASIC and the AMBA AHB bus. The function of the wrapper is to convert data into the AHB bus protocol, or to extract data from the AHB bus for the ASIC. The SRAM is configured to store the original HDR images and the final tone-mapped LDR images. There are two different approaches to controlling the data flow in the SystemC implementation, event trigger and pipelining. These approaches are described below. Event trigger: Not every module is triggered by a clock. Instead, the input data received from the input port is monitored, and the module executes its function when the data change; otherwise, it does nothing.
Figure 11 ESL design flow.
Pipeline: To improve the tone-mapping performance, we divide the frame into blocks of size N × N and allow
Figure 12 ARM SOC platform architecture.
pipeline data to flow in units of blocks. The block size N can be any integer, and we set N to eight in the simulation. When N equals one, implementation of the pipeline occurs in the unit of a single pixel.
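As a rough illustration of this block-unit data flow, the sketch below walks a frame in N × N tiles and hands each tile to the next stage. process_block() is a hypothetical stand-in for the tone-mapping pipeline, and the frame dimensions are assumed to be multiples of N (true for 1,024 × 768 with N = 8).

```c
#define N 8   /* block size used in the simulation */

void process_block(const float *blk);   /* hypothetical next pipeline stage */

/* Stream a width×height luminance frame through the pipeline
   in units of N×N blocks, as the "Frame to Block" module does. */
void frame_to_block(const float *frame, int width, int height)
{
    float blk[N * N];
    for (int by = 0; by < height; by += N)
        for (int bx = 0; bx < width; bx += N) {
            for (int y = 0; y < N; y++)
                for (int x = 0; x < N; x++)
                    blk[y * N + x] = frame[(by + y) * width + (bx + x)];
            process_block(blk);          /* pipeline in units of blocks */
        }
}
```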
5.2 Function Load Analysis

In order to gain a better understanding of the computational load of an individual HDR module, we first perform a pure software function load analysis on the ARM core in two steps. First, we use the TLM technique to model each HDR functional module using SystemC; then, we perform the function load analysis using the commercial tool ADS on the ARM SOC platform. Table 1 shows the function load for each module in the photographic and gradient compression model. The first column gives the name of the function in the HDR software, and the second and third columns give the corresponding function load in the gradient compression and photographic tone-mapping schemes, respectively. From this information we can see that the Solve_Poisson module has the heaviest function load, followed by the Apply_Gamma module in the gradient compression scheme. As the photographic scheme uses a simple mapping curve for all pixels in the image, it has a much shorter computation time as compared with the gradient compression method. The relative percentage of shared blocks required for gradient compression is about ten to fifteen times lower than that required for photographic tone-mapping. For example, the percentage load of Colour_Processing increases from 1.48% to 16.2%; however, the percentage load of the Apply_Gamma process in photographic tone-mapping decreases rather than increases, due to the high cache miss rate when calculating the Apply_Gamma function in the gradient compression scheme. A cache miss forces the ARM to frequently load instructions from the memory, which requires up to 60% bus utilization. From the function load analysis, we see that the "Solve_Poisson", "Apply_Gamma", and "Scale_to_Midtone" modules consume most of the computation power. In the following section, the hardware/software partition and the cost function are discussed in an evaluation of the performance of this co-design.
Table 1 Software function load analysis for photographic and gradient compression.

Function name                    Load (G)   Load (P)
Compute_Luminance (S)            0.40%      3.99%
Log_space_Luminance (G)          1.70%      –
Differentiate (G)                0.18%      –
Scale_to_Midtone (P)             –          24.51%
Tonemap_Image (P)                –          8.2%
Attenuate (G)                    4.73%      –
Divergence (G)                   0.22%      –
Sub_boundary (G)                 0.04%      –
Solve_Poisson (G)                70.51%     –
Exponentiate_Luminance (G)       1.50%      –
Colour_Processing (S)            1.48%      16.2%
Normalize_image (0∼1) (S)        1.16%      17.16%
Apply_Gamma (S)                  16.90%     15.47%
Normalize_image (0∼255) (S)      1.16%      16.61%

G gradient, P photographic, S shared for both gradient and photographic compression
6 Hardware/Software Partition and Cost Function Analysis

6.1 Hardware/Software Partition

We performed a hardware and software partition analysis that can be used as a guideline for the implementation of a real-time HDR tone-mapping processor with a hardware/software co-design. If all of the modules in the integrated design are implemented in hardware, the penalties are the huge chip area required and the lack of flexibility. To overcome these drawbacks, we aimed to move some of the modules to software and let the ARM processor perform these functions instead. The question is, which modules should be moved? Because the main computation load is located in the tone-mapping process, we decided to move the data processing that takes place before and after tone-mapping to be handled by software. Based on this consideration, we constructed the following four models and evaluated their performance on the ARM SOC platform: pure-HW, front-end with partial ASIC, back-end with partial ASIC, and pure-SW.

• Pure-HW: All modules are implemented in ASIC; in other words, this is a pure hardware design.
• Front-end + partial ASIC: The functions of the front-end module are performed in the ARM core by the software and the rest of the modules are put into the ASIC. The front-end module includes input format conversion (RGBE to RGB) and color to luminance transformation (Compute_lum).
• Back-end + partial ASIC: The data processing required after HDR tone-mapping is performed in the ARM core by the software and the rest of the modules are put into the ASIC. The data processing in the back-end module includes two normalization functions and gamma correction.
• Pure-SW: All HDR tone-mapping functions are executed in the ARM core using software.

6.2 Cost Function Analysis

Table 2 shows the cycle count, software code size and ASIC area information for the four cases described above. Each column contains two parts: the value itself (for the cycle count, the real cycle count for gradient compression in millions) and the improvement ratio. For example, the cycle counts of the Pure-HW and Pure-SW models are 1 M and 68 M, respectively; the Pure-HW model therefore uses 67 M fewer cycles than the Pure-SW model, and so the improvement ratio is 98.53% (67/68). The same method is used to calculate the improvement ratios for software code size and ASIC area. In the ASIC area calculation, 20 mm² was used as the reference for comparison. We then used the improvement ratios of the cycle count, code size and ASIC area to generate the cost function metric shown in the final column of Table 2. The selection of weighting values depends on the system performance requirements. In this example, we used 0.5, 0.2 and 0.3 as the weight indexes for the cycle count, software code size and ASIC area, respectively:

Cost = 0.5 × Cycle count + 0.2 × Code size + 0.3 × ASIC area    (17)

Table 2 Cycle counts, code size, ASIC area, and metric for the four cases (M = 1,000,000, K = 1,000).

Model      Cycle count (M)   Code size (K bytes)   ASIC area (mm²)   Metric
Pure-SW    68 (0%)           20 (0%)               7.9 (0%)          30%
Back-End   12 (82.35%)       10.7 (46.5%)          3.6 (54.43%)      66.011%
Front-End  1.9 (97.2%)       13.8 (31%)            3.1 (60.76%)      70.052%
Pure-HW    1 (98.53%)        14 (30%)              2.6 (67.09%)      71.683%

We expect that the more functions are placed in the ARM core, the lower the performance metric, due to the low computation power of the ARM core. In addition, when functions are moved into the ARM core, the communication overhead may also increase. Although the Pure-SW model has no ASIC area cost, its performance is much lower than that of the other models. Depending on the hardware area, speed, and software code size requirements of the system, designers can select the best hardware and software partition.

7 Hardware Implementation and Experimental Results
Based on the cost function analysis, we implemented the HDR tone-mapping ASIC using the front-end architecture owing to considerations of flexibility and hardware cost. The front-end module includes the conversion of the HDR format into the RGB components and the color to luminance transformation. Currently there are three existing HDR file formats (HDR, TIFF, and EXR), and each of them can have different encoding schemes [2]. For example, the HDR format can be encoded using the RGBE or XYZE encoding [2]. The TIFF
format has three kinds of encoding, namely IEEE RGB, LogLuv24, and LogLuv32, and the EXR format is encoded by the half RGB [2]. Only the HDR format with RGBE encoding (one of the most commonly used HDR formats) is implemented in the Pure-HW architecture. This is because the hardware area needed to implement all six of the above HDR format conversions is much larger than that needed to implement the HDR format with RGBE encoding alone. However, this makes the Pure-HW architecture lack the flexibility to handle images with other HDR formats or encoding schemes. For the Front-End architecture, the HDR format conversion is handled in the ARM processor by software, so it has the flexibility to process images with different kinds of HDR formats. We applied the discrete sine transform (DST) and the digit-by-digit exponent operation to reduce the computation time, the implementation of which is described below.

7.1 The Logarithm and Exponent Architecture

We adopted a digit-by-digit algorithm that uses only shift registers and adders to reduce the hardware complexity. The digit-by-digit exponent operation is described below [15]. We calculate the exponent function by iteration and first consider the case where x is limited to the range [0, ln 2). The power x of the natural number e is set to y, i.e., y = e^x. We approximate the y value by building a set of data pairs (x_i, y_i), and set the initial value as (x_0 = x, y_0 = 1). Each x_i and y_i always satisfy Eq. 18:

y_i · e^{x_i} = e^x,  or  y_i = e^{−x_i} · e^x    (18)
x_i is updated by subtracting a constant k_i, where k_i = ln(b_i), as shown in Eq. 19:

x_{i+1} = x_i − k_i = x_i − ln(b_i)    (19)
Take b_i = 1 + s_i · 2^{−i}, where s_i ∈ {0, 1}. The values of ln(1 + 2^{−i}) for 0 ≤ i ≤ 15 are stored in a ROM for table lookup. If we adopt 16-bit precision, the ROM
Figure 13 Architecture of the exponential operation.
size is 16 × 16 bits. At each iteration, we compare x_i and ln(1 + 2^{−i}): if x_i ≥ ln(1 + 2^{−i}), we choose s_i = 1 and x_{i+1} = x_i − ln(b_i); otherwise, we choose s_i = 0 and x_{i+1} = x_i. When x_i reaches zero, y_i = e^{−x_i} · e^x = e^x. The final exponent is computed iteratively from Eq. 20:

y_{i+1} = e^{−x_{i+1}} · e^x = e^{−(x_i − k_i)} · e^x = e^{k_i} · y_i = b_i · y_i    (20)

As all b_i are either 1 or 1 + 2^{−i}, we can use a shift-and-add method to obtain b_i · y_i. Note that we only calculate e^x in this way when x is limited to the range [0, ln 2). For any arbitrary number x, we can take x · log₂e = I + f, or x = (I + f) · ln 2, where I and f are the integer and fractional parts of x · log₂e, respectively. Then

y = e^x = e^{(I + f) · ln 2} = e^{I · ln 2 + f · ln 2} = 2^I · e^{f · ln 2}    (21)

where 0 ≤ f · ln 2 < ln 2. Taking x_0 = f · ln 2, we can use the iteration method to calculate the exponential function, and the term 2^I can be implemented by a shift operation, reducing the hardware complexity for any arbitrary number. The architecture of the exponential operation is shown in Fig. 13. The logarithm operation y = log x is computed in a similar digit-by-digit fashion, and its architecture is similar to that of the exponential operation.
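The iteration of Eqs. 18–21 can be modeled in software as below. This is a behavioral C sketch of the hardware, not the RTL itself; it uses Q16.16 fixed point to mirror the 16-bit fractional precision and a 16-entry ln(1 + 2^{−i}) table, and the test in main() is illustrative only.

```c
#include <stdio.h>
#include <stdint.h>
#include <math.h>

#ifndef M_LN2
#define M_LN2   0.69314718055994530942
#endif
#ifndef M_LOG2E
#define M_LOG2E 1.44269504088896340736
#endif

#define FRAC_BITS 16
#define TO_FIX(x) ((int32_t)((x) * (1 << FRAC_BITS) + 0.5))
#define TO_DBL(x) ((double)(x) / (1 << FRAC_BITS))

static int32_t rom_ln[16];   /* ROM table: ln(1 + 2^-i), i = 0..15 */

static void init_rom(void) {
    for (int i = 0; i < 16; i++)
        rom_ln[i] = TO_FIX(log(1.0 + pow(2.0, -i)));
}

/* e^x for x in [0, ln 2): if x_i >= ln(1 + 2^-i), take s_i = 1,
   x_{i+1} = x_i - ln(b_i) (Eq. 19) and y_{i+1} = b_i * y_i (Eq. 20),
   where the multiply by b_i = 1 + 2^-i is a shift and an add. */
static int32_t exp_frac(int32_t x) {
    int32_t y = TO_FIX(1.0);
    for (int i = 0; i < 16; i++)
        if (x >= rom_ln[i]) {
            x -= rom_ln[i];
            y += y >> i;
        }
    return y;
}

/* Range reduction of Eq. 21: x*log2(e) = I + f, e^x = 2^I * e^(f*ln2). */
static double fix_exp(double x) {
    double  t = x * M_LOG2E;
    int     I = (int)floor(t);
    double  f = t - I;
    int32_t y = exp_frac(TO_FIX(f * M_LN2));
    return ldexp(TO_DBL(y), I);      /* 2^I is just a shift */
}

int main(void) {
    init_rom();
    for (double x = -2.0; x <= 2.0; x += 0.5)
        printf("x=%5.2f  fix_exp=%.5f  exp=%.5f\n", x, fix_exp(x), exp(x));
    return 0;
}
```

Note that the multiplication by b_i reduces to y += y >> i, which is exactly why the digit-by-digit scheme needs no multiplier.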
7.2 Fast Discrete Sine Transform

The discrete model of the Poisson equation (13) can be written as the following linear system:

T U = F    (22)
where vector U is an approximation of the reconstructed LDR image I(x, y). The matrix T is a '1 1 −4 1 1' tridiagonal matrix, and F is a right-hand-side matrix, which includes the boundary conditions. We can solve the discrete Poisson equation by the Discrete Sine Transform (DST) via the eigensystem [16].
Figure 14 Solving the Poisson equation by the DST.
The data flow of solving the DST-based Poisson equation is shown in Fig. 14. We apply the 2-D DST on the right-hand-side matrix F and call the result matrix B. The 2-D DST/IDST is performed through two 1-D DST/IDSTs and a matrix transposition. Then, we divide B_ij by the system eigenvalues to get A_ij, as follows:

A_ij = B_ij / ( (2 cos(iπ/N) − 2) + (2 cos(jπ/N) − 2) )    (23)
Note that the denominator in the equation above is the eigenvalue; j and i are the column and row indexes, respectively. Next, we apply the 2-D IDST to matrix A to obtain the reconstructed image in the logarithm domain. The definition and implementation of the 1-D DST are described below. Given v = (v_0, ..., v_{N−1}) ∈ R^N, we say that the vector w = (w_0, ..., w_{N−1})^T is the Discrete Sine Transform of v, where

w_k = Σ_{n=0}^{N−1} v_n sin( π(n + 1)(k + 1) / (N + 1) )    (24)
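The following C sketch models the data flow of Fig. 14 for one 8 × 8 block: a 2-D DST of F, a per-coefficient division by the eigenvalues, and a 2-D IDST. It is a software model only, assuming zero Dirichlet boundaries and 0-based indices (so the eigenvalue of Eq. 23 appears here with (i + 1)π/(N + 1)); the naive O(N²) transform stands in for the factorized pipeline of Fig. 15.

```c
#include <stdio.h>
#include <math.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

#define N 8   /* block size used in our design */

/* Naive 1-D DST of Eq. 24. 'scale' is 1 for the DST and
   2/(N+1) for the IDST. */
static void dst1(const double *v, double *w, double scale) {
    for (int k = 0; k < N; k++) {
        double s = 0.0;
        for (int n = 0; n < N; n++)
            s += v[n] * sin(M_PI * (n + 1) * (k + 1) / (N + 1));
        w[k] = scale * s;
    }
}

/* 2-D (I)DST = 1-D transform on rows, transpose, repeat (Fig. 14). */
static void dst2(double a[N][N], double scale) {
    double t[N][N], r[N];
    for (int i = 0; i < N; i++) {
        dst1(a[i], r, scale);
        for (int j = 0; j < N; j++) t[j][i] = r[j];
    }
    for (int i = 0; i < N; i++) {
        dst1(t[i], r, scale);
        for (int j = 0; j < N; j++) a[j][i] = r[j];
    }
}

/* Solve TU = F (Eq. 22) in place: B = DST2(F), divide by the
   eigenvalues (Eq. 23), U = IDST2(A). */
void poisson_dst(double F[N][N]) {
    dst2(F, 1.0);
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double lam = 2.0 * cos((i + 1) * M_PI / (N + 1)) - 2.0
                       + 2.0 * cos((j + 1) * M_PI / (N + 1)) - 2.0;
            F[i][j] /= lam;
        }
    dst2(F, 2.0 / (N + 1));
}

int main(void) {
    double U[N][N] = {0};
    U[3][4] = 1.0;                       /* a point source as F */
    poisson_dst(U);
    /* verify Eq. 16: the 5-point Laplacian of U reproduces F */
    double lap = U[3][5] + U[3][3] + U[2][4] + U[4][4] - 4 * U[3][4];
    printf("residual = %g\n", lap - 1.0);
    return 0;
}
```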
Figure 15 a Eight-point DST architecture with a four-stage pipeline, where 8/10 indicates an 8-pixel input and a 10-pixel output, and so on. b–e The architecture for every stage in the eight-point DST/IDST.
We can rewrite the 1-D DST above in matrix form as w = Sv, where S is the DST transform matrix. The IDST is the DST of Eq. 24 multiplied by 2/(N + 1). We designed a new hardware architecture to implement the N-point 1-D DST. This architecture is very simple and contains only a few additions and multiplications. w_k is separated into even and odd indexes, and the N-point DST can be decomposed into the sum of two N/2-point DSTs [17]. Figure 15a shows the eight-point DST architecture with a 4-stage pipeline, and Fig. 15b shows the architectures for stage 1, stage 2, stage 3 and stage 4. In Fig. 15b, C_b^a denotes cos((a/b)π), and S_b^a denotes sin((a/b)π).

7.3 Experimental Results

In the front-end architecture, the functions performed by the ARM processor are HDR format conversion and color to luminance transformation. In our simulation, the HDR format with RGBE encoding is used and the RGBE components are converted into the RGB color components. There are two kinds of operation modes, depending on the tone-mapping algorithm. When photographic tone-mapping is adopted, the image is processed pixel by pixel; when gradient tone-mapping is used, the image is processed window by window. The window size for gradient tone-mapping is 8 × 8.
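For reference, a sketch of the per-pixel front-end work run on the ARM core is shown below: the standard Radiance RGBE decoding followed by the color to luminance transformation of Eq. 1. The function name is ours for illustration, not a module name from the design.

```c
#include <math.h>

/* Decode one Radiance RGBE pixel and compute its scene luminance.
   The three mantissas share the 8-bit exponent e, scaled by
   2^(e - 128 - 8); e = 0 encodes black. */
void frontend_pixel(const unsigned char rgbe[4], float rgb[3], float *Lw)
{
    if (rgbe[3] == 0) {
        rgb[0] = rgb[1] = rgb[2] = 0.0f;
    } else {
        float f = ldexpf(1.0f, (int)rgbe[3] - (128 + 8));
        for (int c = 0; c < 3; c++)
            rgb[c] = rgbe[c] * f;
    }
    /* Eq. 1: color to luminance transformation (Compute_lum) */
    *Lw = 0.2654f * rgb[0] + 0.6704f * rgb[1] + 0.0642f * rgb[2];
}
```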
Figure 16 a–d The results obtained using decimal precisions of 8, 12, and 16 bits, and a floating-point realization, respectively.
Figure 17 Simulation results for gradient compression. Left, software; right, hardware.
In the ARM SOC platform, the ARM core reads data from the SRAM memory and executes the front-end functions. The luminance components are then sent to the wrapper through the AHB bus. The wrapper extracts the data from the AHB bus and sends the data to the ASIC, which performs either the photographic or the gradient tone-mapping. The final tone-mapped LDR data are sent back to the SRAM through the wrapper and the AHB bus, where they can be read out for verification or for LDR display. In the following, we focus on the ASIC implementation. The number of bits used for luminance in the hardware implementation affects the image quality and hardware cost, and consideration of both cost and quality prior to hardware realization is important. The value of the luminance includes integer and decimal parts. In our design, the targeted dynamic range of luminance is from 10^−3 to 10^5, so we choose 16 bits for the integer part. If the number of bits in the decimal part is 16, the quantization error can be reduced to 3 × 10^−5. According to the simulation results, an average PSNR of over 40 dB and a satisfactory image quality can be reached with a 16-bit realization of the decimal part.
Figure 19 The integrated photographic and gradient compression HDR processor layout.
As a result, a 32-bit word length (16 bits for the integer part and 16 bits for the decimal part) was used to implement the tone-mapping processor. Figure 16 shows the results obtained using different decimal precisions. We implemented the Poisson solver taking both speed and area into consideration. The data bus for each pixel used in the Poisson block is 24 bits wide, and the eigenvalues are stored in memory and obtained through a lookup table. A transpose buffer and controller were implemented to read pixels from memory in parallel for matrix transposition. Block-based gradient domain HDR compression with a block size of 8 × 8 and single-layer attenuation was simulated, and the results for the software and hardware implementations are shown in Fig. 17. The input was a HDR video frame of 720 × 480 in size, and the dynamic ranges of the RGB components of the HDR video frames were 600,000:1, 550,000:1, and 800,000:1, respectively. The fine details of the compressed LDR frames were well preserved; furthermore, there was no blocking effect in the reconstructed image. The software and hardware simulation results for photographic tone-mapping are shown in Fig. 18.
Table 3 The integrated HDR processor chip specification.
Figure 18 Simulation results for photographic compression. Left, software; right, hardware.
Name                     Integrated tone-mapping HDR processor
Technology               TSMC 0.13 μm
Chip size (core only)    2.85 × 2.85 mm²
Chip size (with bond)    3.74 × 3.74 mm²
Gate count               769.62 K
On-chip memory           1,612-byte SRAM / 192-byte ROM
Clock rate               100 MHz
Video size / frame rate  1,024 × 768 @ 60 fps
Power                    177.1478 mW
Table 4 Performance comparison.

                      This work          Graphic HW [8]   FPGA [11]     Photographic only [12]   Gradient only [13]
Frame size            1,024 × 768        512 × 512        1,024 × 768   720 × 480                720 × 480
fps                   60                 20               60            30                       30
Area (mm²)            8.1                200              841^a         4.18                     12
PSNR (dB) (memorial)  45^b/35^c (24 bit) RMS% (1.051)     34.9          45 (24 bit)              35 (24 bit)

^a 841 mm² includes the package size
^b 45 dB is for photographic
^c 35 dB is for gradient compression
The synthesis results show that this integrated photographic and gradient compression HDR tone-mapping processor can run at a 100 MHz clock and consumes a core area of 8.1 mm² under TSMC 0.13 μm technology. Figure 19 shows the chip layout and Table 3 shows the chip specification summary. Since this implemented HDR processor ASIC is fully functional, and since we have already performed the software/hardware partition analysis and software/hardware co-simulation on an ARM SOC platform, we do not plan to port the processor to an FPGA device. Table 4 presents a comparison of this design with other HDR tone-mapping hardware processors in terms of chip area, processing speed, and PSNR. The PSNR of the gradient synthesis result is lower than that of the photographic synthesis because we used the modified gradient algorithm instead of the original gradient algorithm. Even though a decrease in PSNR resulted, the results were acceptable and more suitable for real-time implementation. The chip area required by our design is much smaller than that required by graphic hardware and FPGA processors. At the same time, this processor achieves the highest processing speed combined with the flexibility to select different tone-mapping algorithms. Compared with a previously-described gradient processor, this design achieves a 50% improvement in terms of speed and area.
8 Conclusion

In this paper, we presented a systematic methodology for the development of a tone-mapping processor of optimized architecture implemented on an ARM SOC platform, and illustrated the design of an HDR tone-mapping processor that can handle both photographic and gradient compression. Optimization was achieved through four major steps: common module extraction, computation power enhancement, hardware/software partition, and cost function analysis. In this scheme, the chip can run at a 100 MHz clock and consumes a core area of
8.1 mm² under TSMC 0.13 μm technology, resulting in a 50% improvement in speed and area compared with previous results.
References

1. Mantiuk, R., Krawczyk, G., Myszkowski, K., & Seidel, H. P. (2004). Perception-motivated high dynamic range video encoding. ACM Transactions on Graphics, 23(3), 733–741.
2. Reinhard, E., Ward, G., Pattanaik, S., & Debevec, P. (2005). High dynamic range imaging. San Mateo: Morgan Kaufmann.
3. Tumblin, J., & Rushmeier, H. E. (1993). Tone reproduction for realistic images. IEEE Computer Graphics and Applications, 13(6), 42–48.
4. Chiu, K., Herf, M., Shirley, P., Swamy, S., Wang, C., & Zimmerman, K. (1993). Spatially nonuniform scaling functions for high contrast images. In Graphics Interface '93 (pp. 245–253).
5. Reinhard, E., Stark, M., Shirley, P., & Ferwerda, J. (2002). Photographic tone reproduction for digital images. In ACM SIGGRAPH (pp. 267–276).
6. Pattanaik, S. N., Ferwerda, J. A., Fairchild, M. D., & Greenberg, D. P. (1998). A multiscale model of adaptation and spatial vision for image display. In Proceedings of SIGGRAPH 98 (pp. 287–298).
7. Fattal, R., Lischinski, D., & Werman, M. (2002). Gradient domain high dynamic range compression. ACM Transactions on Graphics, 21(3), 249–256.
8. Goodnight, N., Wang, R., Woolley, C., & Humphreys, G. (2003). Interactive time-dependent tone mapping using programmable graphics hardware. In Rendering techniques, 14th Eurographics symposium on rendering (pp. 26–37).
9. Drago, F., Myszkowski, K., Annen, T., & Chiba, N. (2003). Adaptive logarithmic mapping for displaying high contrast scenes. Computer Graphics Forum, 22(3), 419–426.
10. Durand, F., & Dorsey, J. (2002). Fast bilateral filtering for the display of high-dynamic-range images. ACM Transactions on Graphics, 21(3), 257–266.
11. Hassan, F., & Carletta, J. E. (2007). A real-time FPGA-based architecture for a Reinhard-like tone mapping operator. In Proceedings of the 22nd ACM SIGGRAPH/EUROGRAPHICS symposium on graphics hardware (pp. 65–71).
12. Wang, T. H., Ke, W. M., Zwao, D. C., Chen, F. C., & Chiu, C. T. (2007). Design and implementation of a real-time global tone mapping processor for high dynamic range video. In IEEE ICIP 2007 (pp. 209–212).
13. Wang, T., Chiu, C. T., et al. (2007). Block-based gradient domain high dynamic range compression design for real-time implementations. In IEEE ICIP 2007 (pp. 561–564).
14. Kincaid, D. R. (2004). Celebrating fifty years of David M. Young's successive overrelaxation iterative method. In M. Feistauer, V. Dolejší, P. Knobloch, & K. Najzar (Eds.), Numerical mathematics and advanced applications (pp. 549–558). New York: Springer.
15. Kantabutra, V. (1996). On hardware for computing exponential and trigonometric functions. IEEE Transactions on Computers, 45(3), 328–339.
16. Wansundara, S. N. (2002). Solving the discrete Poisson equation using the fast Fourier transform technique. Memorial University of Newfoundland, St. John's.
17. Yao, J. C., & Hsu, C. Y. (1992). New approach for fast sine transform. Electronics Letters, 28(15), 1398–1399.
Ching-Te Chiu received the B.S. and M.S. degrees from National Taiwan University, Taipei, Taiwan, in 1986 and 1988, respectively, and the Ph.D. degree from the University of Maryland, College Park, Maryland, USA, in 1992, all in electrical engineering. She was an associate professor with National Chung Cheng University, Chia-Yi, Taiwan, from 1993 to 1994. From 1994 to 1996, she was a member of technical staff with AT&T, Murray Hill, New Jersey; with Lucent Technologies, Murray Hill, New Jersey, from 1996 to 2000; and with Agere Systems, Santa Clara, California, from 2000 to 2003. Since 2004, she has been with the Department of Computer Science and the Institute of Communications Engineering, National Tsing Hua University, Hsinchu, Taiwan. Her research interests are video and communication integrated circuit design. She has been working on high dynamic range tone-mapping processor chip design, high speed switch fabric IC design and SERDES interface design. Her previous chip designs include a high definition television video decoder, standard television demodulation, SONET/SDH mappers and framers, ATM core/edge switches, 10 Gbps IP router traffic management, and an FEC decoder.
Tsun-Hsien Wang received the B.S. and M.S. degrees from National Taiwan Ocean University, Keelung, Taiwan, R.O.C., in 1993 and 1995, respectively, both in electrical engineering. He is currently pursuing the Ph.D. degree at the Department of Electrical Engineering, National Tsing Hua University, Hsinchu, Taiwan, R.O.C. From 1997 to 2008, he worked at the Industrial Technology Research Institute, Taiwan, R.O.C. In 2008, he joined Novatek Microelectronics Corp., Ltd., Hsinchu Science Park, Taiwan. His research interests include high dynamic range image processing, VLSI algorithms and architectures for image and video signal processing, DRAM controller design, system-on-chip design technology, and related ASIC design.
Wei-Ming Ke received the B.S. and M.S. degrees in computer science from National Tsing Hua University, Hsinchu, Taiwan, in 2007 and 2009, respectively. His research involves high dynamic range (HDR) image synthesis, HDR tone mapping algorithms, image processing, and digital IC design. He is currently working at MStar Semiconductor Inc. as a hardware and system verification engineer.
Chen-Yu Chuang received the B.S. and M.S. degrees in computer science from National Tsing Hua University, Hsinchu, Taiwan, in 2007 and 2009, respectively. His research involves resource contention-aware scheduling and HDR tone mapping algorithms.
Jhih-Siao Huang received the B.S. degree in computer science from National Chiao Tung University, Hsinchu, Taiwan, R.O.C., in 2006, and the M.S. degree in computer science from National Tsing Hua University, Hsinchu, Taiwan, R.O.C., in 2008. His research interests are HDR tone mapping algorithms, image processing, and digital IC design. He is currently working at MStar Semiconductor Inc.
Wei-Su Wong received the B.S. degree in computer science from National Chiao Tung University, Hsinchu, Taiwan, R.O.C., in 2006, and the M.S. degree in computer science from National Tsing Hua University, Hsinchu, Taiwan, R.O.C., in 2008. His research interests are face detection, image processing, and digital IC design. He is currently working at MStar Semiconductor Inc.

Ren-Song Tsay, nicknamed Dr. Zero-Skew, is the inventor of the famous industry-standard zero-skew clock tree design method. He received his Ph.D. degree from UC Berkeley and worked for the IBM T. J. Watson Research Center before he started his successful Silicon Valley ventures. He was the person who designed the first commercially successful performance optimization physical design system (now in Synopsys), which is still the market leader. He then jointly founded Axis Systems (now merged with Cadence) and developed a breakthrough logic verification system using reconfigurable computer technology. After that, he helped a few start-up companies as a consultant or investor. Wishing to pass on his experiences to the younger generation, he is now teaching at National Tsing Hua University, Taiwan, his home country, on the subjects of high-tech entrepreneurship and system level design. Dr. Tsay is a devout Christian and a well-respected person for his integrity, insight and ingenuity.

Cyuan-Jhe Wu received the B.S. degree in computer science from National Tsing Hua University, Hsinchu, Taiwan, R.O.C., in 2009. He is currently working toward the M.S. degree in computer science at National Tsing Hua University, Hsinchu, Taiwan, R.O.C. His research interests are human face deformation and HDR tone mapping algorithms.