A METHODOLOGY TO DEPLOY APPLICATIONS ON THE DUAL-CORE OMAP PLATFORM

Saulo O. D. Luiz∗, Jayarama S. Santana∗, Genildo de M. Vasconcelos∗, Angelo Perkusich∗, Antônio M. N. Lima∗, Marcos R. A. Morais∗

∗Laboratory of Embedded Systems and Pervasive Computing, Academic Unit of Electrical Engineering, Center of Electrical Engineering and Informatics, Federal University of Campina Grande, 10105, 58.109-900 Campina Grande, PB, Brazil

Abstract: In this article, a methodology to develop applications for the OMAP 161x platform is introduced. This platform is composed of an ARM9 general purpose processor and a TMS320C55x fixed-point Digital Signal Processor (DSP). First, the mathematical formulation of the application is discussed and verified in a high level language using floating-point and fixed-point arithmetic. The verified solution is then implemented in the C language using floating-point arithmetic, and then using fixed-point arithmetic. This last implementation is analyzed using a simulator for the target DSP, in this case a C55x simulator. Then, a user interface application that runs on the ARM processor is developed. This application is integrated with the DSP application through an ARM-DSP inter-processor communication mechanism named DSP Gateway. These steps were applied to a case study, an adaptive bi-dimensional Wiener image filter. The introduced methodology allows students to acquire the knowledge necessary to develop applications targeted to the OMAP platform.

Keywords: OMAP161x platform, digital signal processing, inter-processor communication, DSP application development process.
1 Introduction
Embedded systems used in communications and multimedia, such as digital video cameras, DVD players/recorders and portable multimedia equipment, rely on computationally intensive digital signal processing algorithms that would be impracticable without today's high-performance, low-power Digital Signal Processors. The OMAP (Open Multimedia Application Platform) dual-core platform, which is composed of an ARM (Advanced RISC Machine) processor and a DSP (Digital Signal Processor), allows digital signal processing to be implemented on the low-power DSP side, while general purpose applications run on the ARM side.

The development of applications for the OMAP platform involves several challenges, such as communication between the processors, implementation of algorithms based on fixed-point arithmetic, memory usage limitations and specific knowledge of the hardware architecture. Thus, it is important to develop a procedure for the deployment of applications on the OMAP platform. Useful information about the OMAP 161x architecture can be found in Texas Instruments (2005). Specific information about the TMS320C55x digital signal processor architecture, software development tools and assembly language programming is available in the real-time DSP textbook by Kuo and Lee (2001).

An application deployment process for the OMAP platform is described in this article. For the development of the DSP-side application, the following software tools are used: MATLAB, Visual C++ and Code Composer Studio (CCS) for the C55x processors. The mathematical formulation of the digital signal processing application is developed first, followed by its implementation in a high level language (MATLAB) using floating-point and fixed-point arithmetic. As a preceding step to the fixed-point implementation in CCS, C programs are developed in Visual C++ using floating-point and fixed-point arithmetic. The ARM runs an Embedded Linux Operating System compiled for the OMAP platform. The ARM-side application, which provides a user interface and proper communication with the DSP side, is developed using the Scratchbox toolkit (Movial, 2005) for cross-compilation on a PC running Linux.
Figure 1: Overview of the ARM-DSP communication process (block diagram: on the ARM side, a Linux application and the device driver; on the DSP side, the DSP kernel with its interrupt handler and task API, and the DSP application; the Linux application writes the input to the DSP and reads the output from the DSP application through DSP Gateway)

The communication between the ARM and the DSP is made possible by DSP Gateway, which is composed of a Linux device driver on the ARM side and a DSP-side kernel library. Since version 2.6.6, the Linux kernel officially supports DSP Gateway for OMAP 15XX, 16XX, 1710, 5910 and 5912. An overview of the elements involved in the ARM-DSP communication is illustrated in figure 1, and the procedures for the deployment of an application to the OMAP platform are illustrated in figure 2.
Figure 2: Procedures for the deployment of an application to the OMAP platform

The following sections are organized as follows. Section 2: mathematical formulation of the application; section 3: algorithm design and verification in a high level language; section 4: implementation of the algorithm in C language using floating-point and fixed-point arithmetic; section 5: port of the code to the DSP; section 6: development of an application in the ARM-side and integration with the DSP-side using DSP Gateway.

2 Mathematical Formulation of the Application

As an illustration of the mathematical description of the application to be implemented, the case study of the adaptive Wiener filter is presented in this section. This filter is used to reduce the additive white Gaussian noise introduced during the acquisition of a digital bi-dimensional signal by the image sensors used in digital cameras. The Wiener digital filter is adaptive, i.e., the filtering of the image varies according to the signal's statistics (mean and variance) in a local neighborhood. The filter is based on the minimization of the squared error between the original image $f(n,m)$ and its estimate $\hat{f}(n,m)$, and the noise is assumed to be additive and white.

This filter is described mathematically by equations 1, 2 and 3 (Chan, 1984):

$$\mu(n,m) = \frac{1}{NM} \sum_{i,j \in \eta} a(i,j) \tag{1}$$

$$\sigma^2(n,m) = \frac{1}{NM} \sum_{i,j \in \eta} a^2(i,j) - [\mu(n,m)]^2 \tag{2}$$

$$\hat{f}(n,m) = \mu(n,m) + \frac{\max(0, \sigma^2(n,m) - \nu^2)}{\sigma^2(n,m)} \cdot (a(n,m) - \mu(n,m)) \tag{3}$$

where $\hat{f}(n,m)$ is the filter's output, $a(i,j)$ is the filter's input (the noisy image), $\nu^2$ is the variance of the noise, and $\mu(n,m)$ and $\sigma^2(n,m)$ are the local mean and local variance, respectively, in $\eta$, i.e., an N x M local neighborhood (window) around the pixel (n,m). When the noise is not known a priori, $\nu^2$ can be calculated as the mean of all local variances $\sigma^2(n,m)$.
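As a rough illustration of equations 1 to 3, the following C sketch computes the filter output for one interior pixel of a single 8-bit color component using floating-point arithmetic. The function name `wiener_pixel`, its parameters and the row-major image layout are assumptions made for this example and are not taken from the original implementation. Returning the local mean when the local variance is zero is likewise an assumed way of avoiding the division by zero in equation 3.

```c
#include <assert.h>

/* Hypothetical helper: Wiener filter output for the pixel at (row, col).
 * img is a row-major array of 8-bit samples with `width` samples per row.
 * The window must be odd and must lie entirely inside the image. */
float wiener_pixel(const unsigned char *img, int width, int row, int col,
                   int window, float noise_var)
{
    const int half = window / 2;
    const float count = (float)(window * window);
    float sum = 0.0f;
    float sq_sum = 0.0f;
    float mean, var, excess;
    int i, j;

    assert(window > 0 && window % 2 == 1);

    /* Accumulate a(i,j) and a^2(i,j) over the local neighborhood. */
    for (i = row - half; i <= row + half; ++i) {
        for (j = col - half; j <= col + half; ++j) {
            const float a = (float)img[i * width + j];
            sum += a;
            sq_sum += a * a;
        }
    }

    mean = sum / count;                  /* equation 1 */
    var = sq_sum / count - mean * mean;  /* equation 2 */

    /* Assumption: in a flat neighborhood the output is the local mean. */
    if (var <= 0.0f) {
        return mean;
    }

    /* Equation 3: the gain is max(0, var - noise_var) / var. */
    excess = var - noise_var;
    if (excess < 0.0f) {
        excess = 0.0f;
    }
    return mean + (excess / var) * ((float)img[row * width + col] - mean);
}
```

With a noise variance of zero the gain is one and the pixel is returned unchanged; as the noise variance approaches or exceeds the local variance, the output moves toward the local mean.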
3 Algorithm Design and Verification in a High Level Language

3.1 Floating-point arithmetic
The use of a high level language (in this case, MATLAB) makes the implementation and the algorithmic verification easier. Moreover, the use of floating-point arithmetic yields higher precision in the calculations.

The representation of a number in single precision floating-point, following the IEEE (Institute of Electrical and Electronics Engineers) notation, uses 32 bits. The first bit is used for the sign, the following 8 bits represent the exponent and the remaining 23 bits represent the mantissa. The mantissa is a real number in the range [1, 2) and it is stored as a binary number in negative powers of 2. The exponent is an integer in the range -126 to 127, stored with a bias of 127. The conversion of a number in floating-point notation to a real value can be done according to equation 4.

$$RealValue = (-1)^{Sign} \cdot Mantissa \cdot 2^{Exponent - Bias} \tag{4}$$

The advantage of the implementation using floating-point notation is that very small and very large numbers can be represented without much loss of precision. For example, the single precision floating-point notation can represent numbers in the range $1.175494351 \cdot 10^{-38}$ to $3.402823466 \cdot 10^{+38}$. Thus, special considerations regarding overflow, underflow and saturation are generally not necessary (W. Gan, 2006). For high precision applications, the double precision floating-point notation can be used, in which a number is represented using 64 bits (1 for the sign, 11 for the exponent and 52 for the mantissa).

For the implementation using floating-point, a 24-bit bitmap image was used as the input bi-dimensional signal. The 24-bit bitmap image format represents each pixel (picture element) in the RGB (Red, Green, Blue) format. Each color component is specified by 8 bits, thus each pixel is composed of 24 bits. Each color component is in the range 0 to 255 and the Wiener filter is applied individually to each color component.

3.2 Fixed-point arithmetic
The implementation in MATLAB using fixed-point arithmetic facilitates the simulation and verification of occurrences of overflow/underflow in the algorithm for a calculated Q notation. Concerning the manipulation of data in devices with fixed-point architecture, the following numeric representation should be used:

$$m \cdot 2^{-e} \tag{5}$$

where m is the mantissa and e is the exponent. To increase the precision of the calculations, the numbers are transformed to the Q format. This format is based on a pre-multiplication and post-division of the numbers involved in the operations.

The choice of the Q format representation involves a tradeoff between precision and memory use. With 16-bit data types, for example, memory usage is smaller, but the precision of the calculations may be compromised. On the other hand, data types with more bits, for example 32 or 40 bits, give much more precise calculations at the cost of higher memory utilization, which may be critical in some embedded applications.

Let x be a given floating-point number that is converted to fixed-point by a multiplication and saved in a variable. If the data type of this variable supports a maximum absolute value $N_{max}$, then:

$$N_{max} = 2^Q \cdot x \tag{6}$$

where $N_{max}$ is the variable's maximum absolute value, Q is the corresponding Q format representation and x is the maximum original number in floating-point. Applying the base-2 logarithm to equation 6 gives the maximum value of Q, which corresponds to the maximum precision without the occurrence of overflow, as expressed by equations 7 and 8.
Table 1: Labels for the variables used

Variable        Label
N = M           window
1/(NM)          invsqwindow
µ(n,m)          mean
σ²(n,m)         var
f̂(n,m)          output
$$\log_2 N_{max} = Q + \log_2 x \tag{7}$$

$$Q = \log_2 N_{max} - \log_2 x \tag{8}$$
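Equation 8 can be sketched in C as follows. The helper names `max_q_format` and `to_fixed` are hypothetical and are not part of the original implementation. Instead of evaluating logarithms, the sketch doubles x until the next doubling would exceed $N_{max}$, which yields the largest integer Q satisfying $x \cdot 2^Q \le N_{max}$, i.e., the value of equation 8 rounded down.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: largest integer Q such that x_max * 2^Q <= n_max.
 * This is equation 8 rounded down to an integer number of fractional bits. */
int max_q_format(double n_max, double x_max)
{
    int q = 0;
    double scaled = x_max;

    assert(x_max > 0.0 && x_max <= n_max);

    while (scaled * 2.0 <= n_max) {
        scaled *= 2.0;
        ++q;
    }
    return q;
}

/* Hypothetical helper: convert a non-negative value to the given Q format
 * by pre-multiplying by 2^q and truncating the fractional part. */
uint32_t to_fixed(double x, int q)
{
    double scaled;

    assert(x >= 0.0 && q >= 0 && q < 32);
    scaled = x * (double)((uint64_t)1 << q);
    assert(scaled < 4294967296.0); /* result must fit in 32 bits */
    return (uint32_t)scaled;
}
```

For instance, with $N_{max} = 32767$ (a signed 16-bit variable) and x = 0.75, the loop stops at Q = 15, and 0.75 is then stored as 24576.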
The conversion to the Q format should be applied to floating-point variables before any mathematical operation is performed on them, and the corresponding fixed-point arithmetic operations must then be used.

3.3 Case Study: Port of the Wiener Filter to Fixed-point
To port the Wiener filter algorithm from floating-point to fixed-point, we analyze equations 1, 2 and 3. The variables of these equations are labeled according to table 1. Considering the 16-bit architecture of the C55x DSP core, unsigned 16-bit variables are used, except for the results of intermediate operations, where unsigned 32-bit variables are used to gain precision in the fixed-point calculations. The output variable is a signed 32-bit variable.

The Wiener filter is analyzed using fixed-point arithmetic for windows of size 3x3 and 5x5. These sizes were selected in order to limit the range of the variables, so that this range can be identified a priori.

The worst case of the variable invsqwindow occurs when window is equal to 3, since $\frac{1}{NM} = \frac{1}{window \cdot window}$ is greatest when window = 3. Applying equation 8 to invsqwindow, with $N_{max} = 2^{16} - 1$ for an unsigned 16-bit variable, the maximum Q format representation that does not cause overflow is Q19. Thus, $invsqwindow_{max} = \frac{1}{3 \cdot 3} \cdot 2^{19} = 58254$. The sum variable is represented in Q0, since this avoids the divisions involved in normalizing the inputs a(i,j), which are in the range 0 to 255. All the operations involving invsqwindow and sum should be tested to verify that the representation of invsqwindow in Q19 does not cause overflow.

For the calculation of the mean (equation 1), invsqwindow and sum are multiplied and the result is stored in an auxiliary unsigned 32-bit variable sumAux. Therefore, the following condition must hold:

$$sum_{max} \cdot invsqwindow_{max} \le 2^{32} - 1$$

The maximum value of sum occurs when window = 5, therefore $sum_{max} = 255 \cdot (5 \cdot 5) = 6375$. It is verified that:

$$sum_{max} \cdot invsqwindow_{max} = 6375 \cdot 58254 = 371369250 \le 2^{32} - 1$$

So, the representation of invsqwindow in the Q19 format in the calculation of the mean (equation 1) will not cause overflow in the operation between invsqwindow and sum.

For the calculation of the variance (equation 2), invsqwindow and sqSum are first multiplied and stored in an auxiliary unsigned 32-bit variable sqSumAux, so:

$$sqSum_{max} \cdot invsqwindow_{max} \le 2^{32} - 1$$
The maximum value of sqSum is $255^2 \cdot (5 \cdot 5) = 1625625$, which occurs when window = 5. It is verified that:

$$sqSum_{max} \cdot invsqwindow_{max} = 1625625 \cdot 58254 = 94699158750 > 2^{32} - 1$$

Thus, an overflow is detected. To avoid it, the maximum permissible value of invsqwindow is calculated as follows:

$$invsqwindow_{max} = \frac{2^{32} - 1}{sqSum_{max}} = \frac{2^{32} - 1}{1625625} = 2642$$

Therefore, the Q format representation can be determined using equation 8, rounding the result down to an integer:

$$Q = \left\lfloor \log_2 2642 - \log_2 \frac{1}{3 \cdot 3} \right\rfloor = 14$$

Since the representation of invsqwindow in Q19 would cause overflow, the smaller of the two Q format representations of invsqwindow, Q14, is used. The variance (equation 2) is then calculated by subtracting the squared mean ($\mu^2$) from the result of the multiplication between invsqwindow and sqSum.

After determining the mean and the variance in an N x M local neighborhood, the filtered output is obtained through equation 3. Proceeding as described for the mean and the variance, a proper Q value can be determined for each variable used in the filter's output equation (3). The algorithm therefore presents a good tradeoff between the precision of the calculations and the memory usage, since 16-bit variables are used except for the intermediate calculations, where 32-bit variables are used.

4 Implementation of the Algorithm in C Language Using Floating-point and Fixed-point Arithmetic

In this case study, we translated the code written in MATLAB to C using Visual C++. The built-in MATLAB functions, such as the image acquisition and file I/O routines, were also written in C using floating-point arithmetic. This step is particularly important since the floating-point C implementation can be used as a link for the translation of the code to fixed-point C. Basically, the lines of code with floating-point operations are rewritten in fixed-point by means of the Q format representation (pre-multiplications, post-divisions, etc.). The implementation in fixed-point is necessary because the C55x DSP core of the OMAP 161x processor uses a fixed-point architecture. Moreover, developing the algorithm in C makes the port to the DSP easier (W. Gan and Tan, 2000). In this case we used the Code Composer Studio (CCS) provided by Texas Instruments.
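As a sketch of what a fixed-point C translation of the local mean and variance (equations 1 and 2) might look like, the code below represents 1/(NM) in Q14 and keeps every intermediate result within unsigned 32 bits. All names (`inv_sq_window_q14`, `local_stats_fixed`, `local_stats_q14`) are hypothetical, and reducing the mean to Q7 before squaring it is an assumption of this example, chosen so that the squared mean also fits in 32 bits. It is not presented as the authors' implementation.

```c
#include <assert.h>
#include <stdint.h>

#define Q_INV 14u /* Q format assumed for 1/(N*M) */

typedef struct {
    uint32_t mean_q14; /* local mean, Q14 */
    uint32_t var_q14;  /* local variance, Q14, clamped at zero */
} local_stats_q14;

/* 1/(window*window) in Q14, truncated: 1820 for 3x3 and 655 for 5x5. */
uint16_t inv_sq_window_q14(uint32_t window)
{
    return (uint16_t)((1u << Q_INV) / (window * window));
}

/* Local statistics for the pixel at (row, col) of a row-major 8-bit image.
 * The window must lie entirely inside the image. */
local_stats_q14 local_stats_fixed(const uint8_t *img, int width,
                                  int row, int col, int window)
{
    const int half = window / 2;
    uint32_t inv, sum = 0, sq_sum = 0;
    uint32_t ex2_q14, mean_q7, mean_sq_q14;
    local_stats_q14 s;
    int i, j;

    /* The 32-bit bounds below were only checked for these window sizes. */
    assert(window == 3 || window == 5);
    inv = inv_sq_window_q14((uint32_t)window);

    for (i = row - half; i <= row + half; ++i) {
        for (j = col - half; j <= col + half; ++j) {
            const uint32_t a = img[i * width + j];
            sum += a;        /* Q0, at most 6375 */
            sq_sum += a * a; /* Q0, at most 1625625 */
        }
    }

    s.mean_q14 = sum * inv;  /* Q0 * Q14 = Q14 */
    ex2_q14 = sq_sum * inv;  /* Q0 * Q14 = Q14, fits in 32 bits */

    /* Q7 * Q7 = Q14; the Q7 mean is below 2^15, so its square fits. */
    mean_q7 = s.mean_q14 >> 7;
    mean_sq_q14 = mean_q7 * mean_q7;

    /* Truncation can make the difference slightly negative: clamp it. */
    s.var_q14 = (ex2_q14 > mean_sq_q14) ? (ex2_q14 - mean_sq_q14) : 0u;
    return s;
}
```

Because 1/(NM) is truncated in Q14, a perfectly flat region yields a mean slightly below the true value and a small positive variance rather than exactly zero. Whether this bias is acceptable depends on the noise variance of the application.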
5 Port of the Code to the DSP
The DSP-side application is developed using Code Composer Studio (CCS) or the Linux DSP tools. In this case study, we opted for CCS, a tool provided by Texas Instruments for the development of applications for the DSP architecture. With this tool, it is possible to simulate DSP programs on a PC, which is useful for debugging the application before deploying it on the target device. The fixed-point C code developed in Visual C++ has to be adapted to CCS considering the DSP architecture and memory constraints.

It is also possible to emulate the application on the target hardware by connecting it to a PC through a JTAG (Joint Test Action Group) interface. This procedure allows the code execution to be profiled, enabling the identification of time-critical code sections that can be optimized using assembly code. Moreover, real constraints are taken into account, such as I/O and memory access times for different memory types. DSPs have internal memories, but these are limited in size. External memory can also be mapped to the DSP address space, but access to this memory is slower than access to internal memory. Appropriate memory configurations, such as the definition of segments (memory spaces) and the relocation of time-critical sections, can be made using the DSP/BIOS configuration, thus improving the performance of the applications.

6 Development of an Application in the ARM-side and Integration with the DSP-side Using DSP Gateway
The Embedded Linux Operating System compiled for the OMAP 161x processor runs on the ARM side. The ARM application is developed in C and cross-compiled on a PC using the Scratchbox toolkit on Linux. In many cases this application provides a user interface and assigns specific tasks to the DSP. The communication between the ARM-side and DSP-side applications is handled by the DSP Gateway.
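A minimal sketch of how an ARM-side application might hand one block of data to a DSP task and collect the result is shown here, assuming that the DSP task is exposed as a device node operated with the standard open(), write(), read() and close() calls. The function names are hypothetical and the device path is left as a parameter, because the actual node name and any rules on transfer sizes are defined by the DSP Gateway specification (Kobayashi, 2005) and are not reproduced here.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Write exactly len bytes, retrying after partial writes; 0 on success. */
int write_all(int fd, const void *buf, size_t len)
{
    const char *p = (const char *)buf;

    while (len > 0) {
        const ssize_t n = write(fd, p, len);
        if (n <= 0) {
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Read exactly len bytes, retrying after partial reads; 0 on success. */
int read_all(int fd, void *buf, size_t len)
{
    char *p = (char *)buf;

    while (len > 0) {
        const ssize_t n = read(fd, p, len);
        if (n <= 0) {
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Hypothetical round trip: open the task device, send the input block,
 * read back the output block, and close the device. Returns 0 or -1. */
int dsp_task_process(const char *dev_path,
                     const void *in, size_t in_len,
                     void *out, size_t out_len)
{
    int fd, rc;

    assert(dev_path != NULL && in != NULL && out != NULL);

    fd = open(dev_path, O_RDWR);
    if (fd < 0) {
        return -1;
    }
    rc = write_all(fd, in, in_len);
    if (rc == 0) {
        rc = read_all(fd, out, out_len);
    }
    if (close(fd) != 0) {
        rc = -1;
    }
    return rc;
}
```

Keeping the device path and the block sizes as parameters lets the same routine be exercised on a PC against an ordinary pipe or file before it is run on the target.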
Figure 3: Example of the ARM-DSP communication (sequence diagram between the ARM process, comprising the Linux application and the device driver, and the DSP, comprising the DSP kernel and the DSP application: open() dynamically creates and activates the DSP task, write() passes input data or a control command to the running task, read() returns the output data, and close() deactivates and dynamically deletes the DSP task)

The DSP Gateway is composed of a Linux device driver on the ARM side and a kernel library on the DSP side. The former uses standard Linux functions such as read()/write(), and the latter offers APIs for user tasks, accessed by Linux, as well as multi-task capabilities. Thus data can be exchanged between the ARM and the DSP. When a Linux user application accesses the DSP device driver, a command is generated and sent to the DSP. The DSP kernel receives the command and registers it in the queue of the corresponding DSP application, which processes the commands in the queue by calling the corresponding task function.

An example of the communication between the ARM and the DSP is illustrated in figure 3. The open() function executed on the ARM activates a DSP task. Data and commands are sent to the DSP by the write() function. The DSP task processes the data received from the ARM side, which obtains the outputs by issuing the read() function. Finally, the close() function deactivates the DSP task. More information regarding the DSP Gateway can be found in Kobayashi (2005).

7 Conclusions
In this article, all the steps involved in the design and implementation of algorithms for the dual-core OMAP platform are presented, with the adaptive Wiener filter used as a case study. The proposed steps make the implementation of fixed-point algorithms for dual-core processors easier. Beginning the development in a high level language makes the algorithmic verification using floating-point arithmetic easier and facilitates the simulation and detection of occurrences of overflow/underflow in the fixed-point code. The code implemented in the high level language is ported to floating-point C (Visual C++). This step serves as a link for the translation of the code to fixed-point C (Visual C++), in which floating-point operations are rewritten in fixed-point using the Q format representation. The fixed-point C code is close to the implementation on the target platform, which is achieved using a specific tool such as Code Composer Studio.

The ARM-side application is developed to provide a user interface and proper communication with the DSP side, which executes specific digital signal processing algorithms. This improves the overall performance of the dual-core platform and lowers battery consumption, an important feature for the embedded systems that employ the OMAP dual-core processor.

The methodology to deploy applications on the dual-core OMAP platform described in this article enabled undergraduate students to acquire better skills in the deployment of projects for the OMAP platform. As an outcome, projects such as Speex (2005) and FFmpeg (2006) were successfully developed.

Acknowledgements

The authors would like to thank all the members of the Laboratory of Embedded Systems and Pervasive Computing of the Federal University of Campina Grande for all the support in the development of the methodology discussed in this article.

References

Chan, P. (1984). One-Dimensional Processing for Adaptive Image Restoration, Research Laboratory of Electronics, Massachusetts Institute of Technology.

FFmpeg (2006). FFMPEG Multimedia System, http://ffmpeg.sourceforge.net/index.php.

Kobayashi, T. (2005). Linux DSP Gateway Specification. Revision 3.3, http://dspgateway.sourceforge.net/pub/index.php.

Kuo, S. M. and Lee, B. H. (2001). Real-Time Digital Signal Processing: Implementations, Applications and Experiments with the TMS320C55X, John Wiley & Sons, Ltd.

Movial (2005). Scratchbox, http://www.scratchbox.org/.

Speex (2005). Speex a free codec for free speech, http://dspgateway.sourceforge.net/pub/index.php.

Texas Instruments (2005). OMAP 1611, http://focus.ti.com/general/docs/wtbu/wtbuproductcontent.tsp?templateId=6123&navigationId=11993&contentId=4668.

W. Gan, S. K. (2006). Teaching DSP software development: From design to fixed-point implementations, IEEE Transactions on Education 49: 122–131.

W. Gan, Y. Chong, W. G. and Tan, W. (2000). Rapid prototyping system for teaching real-time digital signal processing, IEEE Transactions on Education 43: 19–24.