A LOW-POWER PARALLEL PROCESSOR IC FOR DIGITAL VIDEO CAMERAS

A.A. Abbo, R.P. Kleihorst, L. Sevat, P. Wielage, R. van Veen, M.J.R. Op de Beeck, A. van der Avoird
{Anteneh.Abbo, Richard.Kleihorst}@philips.com
Philips Research Laboratories, Prof. Holstlaan 4, NL-5656 AA Eindhoven, The Netherlands

ABSTRACT

We present a digital signal processor to be combined with a 30 frames per second VGA-format CMOS or CCD image sensor, or any other source of digital video data. The processor is fully programmable and therefore able to run a variety of algorithms ranging from image communication to machine vision. The IC comprises a parallel processor array and a special-purpose controller to achieve high computational performance (up to 5 GOPS) with very modest power consumption. This can go down to 30 mW for simple applications such as a digital camera for video conferencing. The chip has been realized in a 0.18 µm CMOS process and takes up an area of 22 mm².

1. INTRODUCTION

With the maturing of CMOS image sensor technology, single-chip imaging systems such as digital cameras are emerging [1]. An obvious advantage of the sensor-DSP combination is the possibility of using image sensors as plug-and-play modules without requiring much imaging knowledge. Applications like tele-conferencing, surveillance and electronic still-picture capture become versatile with the use of single-chip image capture modules.

Another advantage of sensor-DSP integration is energy efficiency. The lower number of system components not only reduces system cost, but also reduces the power dissipation associated with inter-chip communication. In addition, a smart image sensor can perform analysis and make decisions on chip, thereby reducing the amount of off-chip communication.

Although technically feasible, incorporating a high-performance DSP on the same die as the image sensor is mostly commercially infeasible, since the latter's technology lags behind with respect to efficient light capture on silicon.
We have devised the IC module as a step in that direction, to be embedded with the sensor when the technologies match. In this article, we discuss the image processing chip, which can be interfaced to image sensors to achieve the above-mentioned advantages. The architecture of this IC uses parallel processing to exploit the inherent nature of video processing algorithms and arrive at very low power dissipation [2, 3]. While the computational capacity of the IC can reach 5 GOPS, the power dissipation for simple applications such as video-conferencing algorithms can be as low as 30 mW. This fully programmable chip is realized in a 0.18 µm CMOS process and measures almost 22 mm².

2. ARCHITECTURE

The top-level architecture of the design is shown in Figure 1(a), which depicts the digital functional blocks and memory modules in the IC together with some of the main communication lines. There are two programmable processors: a Global Control Processor for line-based algorithms and a Linear Processor Array (LPA), with 320 identical processing elements, for pixel-based algorithms. The LPA uses 16 line memories for temporary storage of data. Four additional sequential line memories are used for serial-to-parallel conversion of the video input and parallel-to-serial conversion of the LPA output. The program memory is shared by both processors and can store up to 1024 instructions.

The non-programmable processors and controllers in the IC are the row and column address selectors, the I²C controller and the serial processor. These processors accept parameters from the Global Control Processor for sub-sampling, region-of-interest selection, mirroring, and adjusting the sensor gain, black level, gamma and video output format. In what follows we discuss some of the modules.

2.1. Input signal

The image or video input data is a VGA-size (640 × 480 pixels) frame (matrix) with up to 10-bit digitized signals at a maximum rate of 30 frames/second. Usually this signal originates from a sensor and is in the Kodak Bayer CFA pattern [4], but because of the programmability, other formats
can be accepted.

Figure 1: Architecture: (a) top-level block diagram; (b) linear processor array with memory interconnection.

2.2. Parallel Processor

The parallel processor array consists of 320 identical processing elements (PEs). The number of PEs is optimal with respect to the basic data format (2 × 2 pixels) of the Bayer pattern. 640 PEs would be inefficient with respect to silicon area, since half of the processors would remain idle. By means of left, middle and right communication channels, each processor can directly obtain data from six columns, which facilitates convolution algorithms. All processors execute identical instructions on their local data. This approach reduces power consumption compared to a sequential column processor because control and address decoding are performed only once and shared by all processing elements. An LPA is very suitable for image processing applications and VLSI integration [5, 6].

The architecture of the LPA is shown in Figure 1(b). Each processor contains an accumulator to store the last result, which can be used as an operand for the next instruction. This localized storage reduces the number of accesses to the line memory and the associated power dissipation. An adder and a multiplier have been implemented in the ALU, with which comparison, addition, subtraction, data scaling and multiply-accumulate functions are performed.

Each processor incorporates a flag. Based on this flag, conditional-pass instructions are possible, allowing data dependency in the algorithms. All 320 flags are connected to a global line which can be seen and reacted upon by the Global Control Processor. In this way, iterative processes with a certain end-condition can be run on the parallel processor. In contrast to the designs published in [2, 7], the LPA is not bit-serial but has a 10-bit word width, which increases the power efficiency [8].

The parallel processor can execute algorithms like Fixed Pattern Noise (FPN) reduction, defective pixel concealment, color reconstruction and color-domain transformation, pre-filtering, sub-sampling, template matching, segmentation, color keying, event recognition and even simple forms of image compression such as the lossy compression part of JPEG [9].
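The interplay of accumulators, flags, conditional-pass instructions and the global flag line can be illustrated with a small software model. This is only a sketch under stated assumptions: the class `LinearProcessorArray`, its method names and the halving example are hypothetical and are not the chip's instruction set; only the 10-bit word width, the per-PE accumulator and flag, and the global OR of the flags come from the text.

```python
from typing import List

WORD_MASK = (1 << 10) - 1  # 10-bit word width, as stated for the LPA


class LinearProcessorArray:
    """Toy SIMD model: every PE applies the same instruction to its own data.

    Method names are hypothetical; they are not the real instruction set.
    """

    def __init__(self, num_pes: int = 320) -> None:
        self.accu: List[int] = [0] * num_pes        # one accumulator per PE
        self.flag: List[bool] = [False] * num_pes   # one flag per PE

    def load(self, line: List[int]) -> None:
        """Load one value per PE from a line memory into the accumulators."""
        if len(line) != len(self.accu):
            raise ValueError("line length must equal the number of PEs")
        self.accu = [v & WORD_MASK for v in line]

    def compare_gt(self, limit: int) -> None:
        """Set each PE's flag when its accumulator exceeds the limit."""
        self.flag = [a > limit for a in self.accu]

    def cond_halve(self) -> None:
        """Conditional pass: only flagged PEs take the halved result."""
        self.accu = [a >> 1 if f else a for a, f in zip(self.accu, self.flag)]

    def global_or(self) -> bool:
        """The global line: true while at least one PE flag is set."""
        return any(self.flag)


def scale_below(lpa: LinearProcessorArray, limit: int) -> int:
    """Control-processor loop: iterate until no PE raises its flag."""
    steps = 0
    lpa.compare_gt(limit)
    while lpa.global_or():
        lpa.cond_halve()
        lpa.compare_gt(limit)
        steps += 1
    return steps
```

In this model the loop in `scale_below` plays the role of the Global Control Processor: it issues one instruction at a time to all PEs and uses the global OR of the flags as its end-condition.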
2.3. Line Memories

The 16 parallel line memories are made of 10 physical pseudo-multi-port static RAM modules, which enable single-cycle operand readout and result storage. Three read and three write operations are executed in one cycle (60 ns) to receive and dispatch data over the 3200-bit wide input port and an output port of the same width, respectively. This multi-port approach enables us to keep high-speed clocking (96 MHz) local to the memories while the system runs at a much lower clock rate of 16 MHz.

The 4 sequential I/O line memories are made using separate registers. This was done because of the complex addressing scheme: parallel write and sequential read for the 3 output memories, and parallel read and sequential write for the input memory.

Given the 16 line memories for temporary data storage and the 4 sequential I/O line memories, block-processing (N × N) operations for N ≤ 15 have been implemented. Figure 2 shows how color interpolation is done over a 3 × 3 block. The line-memory locations x, y, z hold Bayer data from three consecutive image lines. Depending on which pattern the operator matrix is centered on, we execute either equation (a) or (b). The computations are classified according to Even and Odd column tags. Using an index in the range [−2, 3], each processor can access data from neighbouring columns.
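The even/odd equations (a) and (b) of Figure 2 can be written out as a short sketch. It is an illustration, not the LPA program: the function name `interpolate_line`, the dictionary output and the decision to skip the two border columns are assumptions made here, and absolute column numbers replace the PE-relative indices of the figure (an even column corresponds to index 0, an odd column to index 1).

```python
from typing import Dict, List


def interpolate_line(x: List[float], y: List[float], z: List[float],
                     case: str) -> List[Dict[str, float]]:
    """Reconstruct the two missing colours for each interior column of line y.

    x, y, z are three consecutive Bayer lines; y is the centre line.
    case "a": y is a G R G R line (x and z are B G B G lines).
    case "b": y is a B G B G line (x and z are G R G R lines).
    Border columns are skipped (an assumption of this sketch).
    """
    if case not in ("a", "b"):
        raise ValueError("case must be 'a' or 'b'")
    out = []
    for c in range(1, len(y) - 1):
        vert = 0.50 * (x[c] + z[c])                                 # above, below
        horiz = 0.50 * (y[c - 1] + y[c + 1])                        # left, right
        cross = 0.25 * (x[c] + y[c - 1] + y[c + 1] + z[c])          # 4-neighbours
        diag = 0.25 * (x[c - 1] + x[c + 1] + z[c - 1] + z[c + 1])   # diagonals
        even = c % 2 == 0
        if case == "a":
            px = {"B": vert, "R": horiz} if even else {"G": cross, "B": diag}
        else:
            px = {"G": cross, "R": diag} if even else {"B": horiz, "R": vert}
        px["col"] = c
        out.append(px)
    return out
```

On a uniformly coloured test image every reconstructed value should equal the corresponding true colour, which is a convenient sanity check for the index pattern.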
[Figure 2 diagram: line memories x, y and z hold three consecutive Bayer lines (x and z: B G B G ...; y: G R G R ...). Processing elements PE4 to PE7 each address neighbouring columns with relative indices −2 to 3.]

(a) Even: B'[0] = 0.50 (x[0] + z[0])
          R'[0] = 0.50 (y[−1] + y[1])
    Odd:  G'[1] = 0.25 (x[1] + y[0] + y[2] + z[1])
          B'[1] = 0.25 (x[0] + x[2] + z[0] + z[2])

(b) Even: G'[0] = 0.25 (x[0] + y[−1] + y[1] + z[0])
          R'[0] = 0.25 (x[−1] + x[1] + z[−1] + z[1])
    Odd:  B'[1] = 0.50 (y[0] + y[2])
          R'[1] = 0.50 (x[1] + z[1])
Figure 2: 3 × 3 block-processing for color interpolation

2.4. Global Control Processor

The main task of the global control processor (GCP) is to synchronize operations across the entire chip. It updates the program counter, fetches and decodes instructions and passes them to the LPA. It can also do global computations for exposure-time control, white-balancing and the like by making use of the statistical image data, which is updated (by the serial processor) in internal registers. A number of logical and arithmetic instructions are available, including multiplication and features for nested and conditional jumps.

The custom-designed global processor architecture and single-memory coding style enable single-cycle fetching, decoding and execution of the 24-bit instruction, which is not feasible with other low-cost micro-controllers such as the MIPS PR1900 CPU [10]. Although such general-purpose processors come with tools for developing controller code in high-level descriptions, this advantage is lessened by the fact that the LPA code dominates the total program and still has to be written manually.

2.5. Serial Processor

The serial processor (SP) reads out data, column-wise, from three sequential line memories and formats it into popular digital formats such as 4:4:4, 4:2:0, 4:2:2, 4:0:0 and 4:1:1. The formatted data is sent off-chip using three ports, each 10 bits wide. A selected set of columns, part of the region of interest, or the entire row can be scanned left-to-right or vice versa, and possibly sub-sampled. The SP monitors the statistics of the image data, which can be read by the GCP.

3. POWER EFFICIENCY

The power efficiency of the design comes from the massively parallel implementation. Although it does not matter for the computational power whether calculations are performed in sequence or in parallel, power is considerably reduced by low-speed shared items such as control decoding and instruction
fetch. A major reduction in power consumption stems from the principle that the data is obtained in parallel from the line memories [8, 11]. The wide-word line memories are a factor of 3 more power efficient than comparable pixel memories.

The power consumption mainly depends on the image parameters, i.e., the number of rows per frame and the frame rate, and on the program that runs on the parallel processor. The power consumption has been calculated for 3 programs of increasing complexity and performance, assuming a full VGA-format image at a frame rate of 30 frames/second. The instruction counts and the corresponding power consumption figures are shown in Table 1; such figures are hard to reach with contemporary sequential processors. The power figure for the simple program decomposes as follows: LPA (1.6 mW), line memory (16.2 mW), sequential memory (3.9 mW), SP (1.0 mW), GCP (2.9 mW), program memory (1.6 mW), and 2.8 mW for interconnect dissipation.

The "simple video communication" program performs all tasks necessary, including noise reduction and defective pixel concealment, to arrive at a digital 4:2:2 formatted YUV video stream from the raw Bayer data of the sensor. This program reflects the performance usually found in consumer digital cameras. The sophisticated program performs the same tasks, now aiming at optimal image quality and resolution. Note that its higher power consumption is still very competitive compared to the power consumed by off-the-shelf DSPs running the same algorithm.

The template matching program is a hypothetical program that uses all of the program memory space and the full bandwidth (whereas the previous programs only use part of the bandwidth). Typical of this program is that it makes extensive use of the line memories, maximizing power consumption [8].
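The figures in this section can be cross-checked with a little arithmetic. The sketch below sums the quoted power decomposition and estimates throughput from the code-lines/row values of Table 1. The throughput model is an assumption made here, not a statement from the paper: it counts one operation per PE per instruction, treats every code line as an LPA instruction, and uses 480 active rows per frame. The names `POWER_MW` and `ops_per_second` are illustrative.

```python
# Power decomposition of the simple program, in mW, as quoted in the text.
POWER_MW = {
    "LPA": 1.6,
    "line memory": 16.2,
    "sequential memory": 3.9,
    "SP": 1.0,
    "GCP": 2.9,
    "program memory": 1.6,
    "interconnect": 2.8,
}

# Rounded to one decimal to hide floating-point noise in the sum.
total_mw = round(sum(POWER_MW.values()), 1)


def ops_per_second(code_lines_per_row: int, rows: int = 480,
                   fps: int = 30, num_pes: int = 320) -> int:
    """Assumed model: each code line is one operation on every PE."""
    return code_lines_per_row * rows * fps * num_pes


if __name__ == "__main__":
    print(f"simple program total: {total_mw} mW")
    for lines in (27, 150, 1024):
        print(f"{lines:5d} lines/row -> {ops_per_second(lines) / 1e6:.0f} MOPS")
```

Under these assumptions the three programs come out at roughly 124, 691 and 4719 MOPS, which is consistent with the rounded 125 MOPS, 700 MOPS and 5 GOPS entries of Table 1, and the decomposition adds up to the quoted 30 mW.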
4. PROGRAMMING

This multiple-processor IC is programmed in assembly language. The assembled code is downloaded into the program memory (up to 1024 instructions) via the I²C interface. The instructions (each 24 bits wide) activate either the LPA or the GCP, depending on a selection bit. Although this does not make full use of the available hardware (one of the processors is always idle), this mode of sequential programming was adopted for its simplicity. There is room to execute over 1000 instructions within the time needed to produce one line of video signal.

Several programs can be stored together in the program memory. For instance, a self-test program can be present and executed only under certain conditions. An example of a small program that produces video output is shown in Figure 3. Note that a symbolic program line such as "interpolate colors" is made up of a number of basic instructions.
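The instruction budget per video line follows from the clock rate and the line rate, and the selection bit can be pictured as a one-bit dispatch. The sketch below is hedged accordingly: it assumes one instruction per 16 MHz cycle and 480 lines per frame with no blanking, and the position of the selection bit (`SELECT_BIT = 23`, the top bit of the 24-bit word) is purely hypothetical, since the text does not give the instruction encoding.

```python
CLOCK_HZ = 16_000_000   # system clock rate stated in Section 2.3
SELECT_BIT = 23         # hypothetical position of the LPA/GCP selection bit


def cycles_per_line(fps: int = 30, lines_per_frame: int = 480) -> int:
    """Clock cycles available per video line (blanking ignored)."""
    return CLOCK_HZ // (fps * lines_per_frame)


def target_processor(instruction: int) -> str:
    """Route a 24-bit instruction word to the LPA or the GCP."""
    if not 0 <= instruction < (1 << 24):
        raise ValueError("instruction must fit in 24 bits")
    return "LPA" if (instruction >> SELECT_BIT) & 1 else "GCP"


if __name__ == "__main__":
    print(cycles_per_line())            # cycle budget per line
    print(target_processor(0x800000))   # selection bit set
    print(target_processor(0x000001))   # selection bit clear
```

With these assumptions the budget is 1111 cycles per line, in line with the statement that over 1000 instructions fit in one line time, so a program that fills the 1024-entry program memory can still run once per line.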
Table 1: Power consumption and performance figures for VGA at 30 frames/second

Program                              code-lines/row   performance   power
simple video communication           27               125 MOPS      30 mW
sophisticated video communication    150              700 MOPS      200 mW
maximal OCR performance              1024             5 GOPS        1.6 W

----------------------------------------------
; INITIALIZATION PART
set row addressing parameters.            (GCP)
estimate FPN parameters.                  (LPA)
set color transformation parameters.      (GCP)
set exposure time control parameters.     (GCP)
; GENERAL RUNNING MODE
FOR EVERY image DO                        (GCP)
  reset image statistics                  (GCP & SP)
  read ROI coordinates                    (GCP)
  start ADC                               (GCP)
  FOR EVERY row IN ROI DO                 (GCP)
    wait for ADC to finish                (GCP)
    obtain values from ADC                (LPA)
    correct for FPN                       (LPA)
    conceal defective pixels              (LPA)
    reduce noise                          (LPA)
    interpolate colors                    (LPA)
    do RGB to YUV transformation          (LPA)
    pre-filter U & V                      (LPA)
    copy result to output                 (LPA & SP)
    update image statistics               (SP)
    start ADC for next row                (GCP)
  NEXT row                                (GCP)
  read image statistics                   (GCP & SP)
  update exposure time parameters         (GCP)
  update color transf. parameters         (GCP)
NEXT image                                (GCP)
----------------------------------------------
Figure 3: Sample program for image communication

5. IC SPECIFICATIONS

The IC was processed in a standard 0.18 µm CMOS technology. Table 2 shows the IC specifications and the area requirements of the different functional blocks. The blocks that are not mentioned in the table have a negligible area.

Table 2: IC specifications & area figures

process                          CMOS18 (0.18 µm)
number of transistors            3M
clock frequency                  16 MHz
package                          120 pins
supply voltage                   1.8 V
internal memory bandwidth        107 Gbit/s
max input format                 VGA at 30 fps
output data accuracy             up to 10 bits
number of output channels        3
video output formats             4:4:4, 4:2:2, 4:2:0, 4:1:1, 4:0:0
Global & Serial Processors       1.25 mm²
Parallel Processor               5.86 mm²
Parallel Line Memory             2.94 mm²
Sequential Line Memory           4.50 mm²
Bonding Pads & Power Routing     7.23 mm²
Total Area                       21.8 mm²

6. REFERENCES

[1] D. Bursky, "Highly integrated image sensors cut system cost, complexity," Electronic Design, Oct. 28, 1999.
[2] J. Gealow and C. Sodini, "A pixel-parallel image processor using logic pitch-matched to dynamic memory," IEEE Journal of Solid-State Circuits, vol. 34, June 1999.
[3] R. Kleihorst, A. Abbo, A. van der Avoird, M. Op de Beeck, L. Sevat, P. Wielage, R. van Veen, and H. van Herten, "Xetal: A low-power high-performance smart camera processor," in ISCAS 2001, (Sydney, Australia), May 2001.
[4] J. Adams, "Design of practical color filter array interpolation algorithms for digital cameras," in Proceedings of SPIE, (Bellingham, WA, USA), pp. 117–125, SPIE, 1997.
[5] P. Jonker, "Why linear arrays are better image processors," in Proc. 12th IAPR Conf. on Pattern Recognition, (Jerusalem, Israel), pp. 334–338, 1994.
[6] D. W. Hammerstrom and D. P. Lulich, "Image processing using one-dimensional processor arrays," Proceedings of the IEEE, vol. 84, pp. 1005–1018, July 1996.
[7] H. Yamashita and C. Sodini, "A 128 × 128 CMOS imager with 4 × 128 bit-serial column-parallel PE array," in ISSCC 2001 Digest of Technical Papers, 2001.
[8] R. Manniesing, "Power analysis of a linear processor array," tech. rep., Delft University of Technology, Delft, The Netherlands, Sept. 1999.
[9] J. Hsieh, A. van der Avoird, R. Kleihorst, and T. Meng, "Transpose switch matrix memory for motion JPEG video compression on single chip digital CMOS camcorder," in ICIP 2000, (Vancouver, BC, Canada), Sept. 2000.
[10] MIPS Technologies Inc., "The MIPS CPU family." http://www.mips.com/, 2000.
[11] A. Chandrakasan and R. Brodersen, Low Power Digital CMOS Design. Norwell, MA, USA: Kluwer Academic Publishers, 1995.