
IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 43, NO. 6, JUNE 2008

A Programmable SIMD Vision Chip for Real-Time Vision Applications Wei Miao, Qingyu Lin, Wancheng Zhang, and Nan-Jian Wu

Abstract—A programmable vision chip for real-time vision applications is presented. The chip architecture is a combination of a SIMD processing element array and row-parallel processors, which can perform pixel-parallel and row-parallel operations at high speed. It implements the mathematical morphology method to carry out low-level and mid-level image processing and sends out image features for high-level image processing without an I/O bottleneck. The chip can perform many algorithms through software control. The simulated maximum frequency of the vision chip is 300 MHz at a 16 × 16-pixel resolution. It achieves a rate of 1000 frames per second in real-time vision tasks. A prototype chip with a 16 × 16 PE array was fabricated in a 0.18 µm standard CMOS process. It has a pixel size of 30 µm × 40 µm and 8.72 mW power consumption with a 1.8 V power supply. Experiments including the mathematical morphology method and a target tracking application demonstrate that the chip is fully functional and can be applied in real-time vision applications.

Index Terms—Image processing, machine vision, morphology, target tracking, vision chip.

I. INTRODUCTION

RESEARCHERS have been interested in vision chips for decades [1]. The vision chip integrates image sensors with processing elements (PEs) on one chip and performs real-time parallel image processing. It has the advantages of compact size, high speed, and low power consumption, so it can be widely applied in robot control, factory automation, and target tracking systems. Nevertheless, it is difficult for vision chips to perform the complex calculations and algorithms that are currently performed in DSPs or general-purpose computers. Making the vision chip compatible with DSPs or general-purpose computers in vision systems and enhancing its parallel-processing performance are challenges in the development of vision chips.

Image processing methods can be divided into three levels: low-, mid-, and high-level processing [2]. Low-level processing involves primary operations such as noise cancellation and image enhancement. Mid-level processing involves segmentation, description of regions, and classification of objects. High-level processing performs intelligent analysis and cognitive vision. The common features of low- and mid-level image processing are massively parallel processing and a large amount of image data to be processed. Therefore, vision chips are suitable for low- and mid-level image processing tasks.

Manuscript received February 8, 2007; revised March 18, 2008. This work was supported by the National Natural Science Foundation of China under Grant 90607007. The authors are with the State Key Laboratory for Superlattices and Microstructures, Institute of Semiconductors, Chinese Academy of Sciences, Beijing 100083, China (e-mail: [email protected]). Digital Object Identifier 10.1109/JSSC.2008.923621

Early on, many application-specific vision chips that performed certain low-level image processing operations were reported [3]–[5]. Later, programmable general-purpose vision chips were reported. Their architecture is similar to the single instruction multiple data (SIMD) massively parallel array [6], improved by integrating optical sensors. A vision chip with serial single-bit digital processing was developed in 1996 [7]. The S³PE, reported in 1999, is a general-purpose digital vision chip that processes 8-bit gray images [8]. Recently, another general-purpose vision chip, SCAMP, with analog PEs was developed [9]. The PVLSAR2.2, reported in 1999, can process gray-level digital images while realizing image processing with fewer than 50 transistors per PE by including analog features [10]. Some programmable vision chips, such as the chip of Brea et al. (2004) [11], are implemented with cellular neural networks (CNNs). Despite the various implementations of the PE array, these general-purpose vision chips mainly perform low-level image processing. Moreover, they output large amounts of image data and thus suffer from an I/O bottleneck, though limited global feature extraction can be found in some chips [7], [10]. MIP (1993) is a programmable chip dedicated to Mathematical Morphology (MM) that is able to perform mid-level image processing [12]. However, it still lacks global feature extraction, and it cannot output useful information for high-level image processing.

General-purpose vision chips should perform low- and mid-level image processing in sequence and should support high-speed output of image features. This not only improves the functionality of the vision chip but also overcomes the I/O bottleneck, making it easier to fulfill real-time image processing tasks in vision systems that include a general-purpose vision chip. We previously developed a high-speed target tracking vision chip that directly performs mid-level image processing and outputs the target position [13], [14].
Some vision chips that perform certain low- and mid-level image processing and output certain global features have been developed by other groups [15]–[18]; however, these are application-specific vision chips. In this paper, we present a general-purpose SIMD vision chip for real-time vision applications. The chip overcomes the difficulties of the early general-purpose vision chips in the field of real-time machine vision. It consists of both a pixel-parallel PE array and row-parallel processors, so it can carry out pixel-parallel and row-parallel operations and obtain global information from images. The chip performs low- and mid-level image processing using the mathematical morphology method as its major tool, and it outputs image features for high-level image processing. Many algorithms for real-time machine vision applications can be implemented in the chip through software control. The chip features high speed, low power consumption, and a small PE area.

0018-9200/$25.00 © 2008 IEEE

MIAO et al.: A PROGRAMMABLE SIMD VISION CHIP FOR REAL-TIME VISION APPLICATIONS


Fig. 1. Architecture of the SRVC.

This paper proceeds as follows. In Section II, we describe the architecture of our chip. In Section III, the VLSI implementation of the chip is presented. In Section IV, the experimental results of the prototype chip for image processing tasks, including a target tracking algorithm, are presented. In Section V, the performance of the chip is discussed. Finally, we draw conclusions in Section VI.

II. ARCHITECTURE OF THE CHIP

The presented vision chip is designed for automatic machine vision systems, such as factory automation and security systems. Because such systems work in a somewhat controlled environment and require fast response, the system specification for the vision chip stresses high-speed processing and high-speed data interfacing between the chip and other components. As a result, the proposed vision chip implements high-speed parallel logic operations on binary images, which is sufficient for many applications and expends fewer resources. The chip directly sends out useful signals and can cause automatic systems to respond immediately. For example, obtaining an object's position and range can cause an assembly system to immediately locate the object. The chip also sends out reduced data for features, represented as coordinates rather than as the whole image, to enhance information transfer speed. For example, the coordinates of the pixels on the object's edge can be sent out so that other object features can be obtained during later processing.

A. Architecture

The architecture of the proposed programmable SIMD Real-time Vision Chip (SRVC) is shown in Fig. 1. The core of the

vision chip is a mesh-connected N × N PE array. In the periphery of the PE array, there are an X processor, a Y processor, a PE data I/O module, a coordinate output control module, an on-chip controller, and 2N pMOS transistors. One pMOS transistor and N nMOS transistors in the PE array form an N-input pseudo-NMOS NOR gate. The nMOS transistors of the gate are contained respectively in the PEs located in one row or one column of the PE array. The on-chip controller manages the PE array and its peripheral circuits.

First, the vision chip acquires images through the sensors in the PE array. Second, low-level and mid-level image processing is carried out in the PE array. Finally, the chip outputs the image data or the features extracted from the image. The procedure is iterated at every frame.

The diagram of a PE is given in Fig. 2. It can be divided into two main parts. One part contains a photodiode (PD) and a 1-bit analog-to-digital converter (ADC). The photodiode integrates a photocurrent that is proportional to the incident light intensity. The PD voltage is quantized to a 1-bit digital signal using two threshold voltages, V_L and V_H. The second part is a 1-bit digital processor that consists of two registers, a 1-bit ALU, a 4-bit memory, and three multiplexers. The universal register Reg1 loads data from the 1-bit ADC, from the 4-bit memory, or from the four nearest PEs. The two operands of the ALU come from Reg1 and the memory. The result of the ALU is first sent to the register Reg2 and then stored in the memory. In addition to the two main parts, each PE contains two nMOS transistors (NM1 and NM2) that respectively belong to the two pseudo-NMOS NOR gates in the row and in the column that contain the PE. The output of Reg1 is connected to the gates of the two nMOS transistors as one input of each of the two N-input NOR gates.


Fig. 2. Diagram of the pixel element.


The PE array obtains an image with N × N pixels. Each PE is connected directly with its four nearest neighbor PEs. Eight N × N images exist in the PE array: the original analog image, the original binarized image, the temporary binary images in the registers, and the binary images in the memory, as shown in Fig. 3.

Fig. 3. Image layers in the PE array.

The PE array performs pixel-parallel image processing on one image or between two images. It first obtains the analog image (PIX) through the photodiodes. The 1-bit ADCs in the PE array convert PIX to the binary image (BIN). Then the binary image BIN is loaded into the universal registers Reg1. The image in Reg1 can shift within the Reg1 array, can be transferred directly into the memory, and can be combined with another image from the memory. The result image from the ALUs is buffered in the registers Reg2. The memory in the PE array stores four images, named M1, M2, M3, and M4, respectively.

In the periphery of the PE array, each N-input pseudo-NMOS NOR gate detects whether there are activated pixels in one row or one column of the image stored in the Reg1 array. An activated pixel is a pixel that has the value of logic 1 in a binary image. Together, the NOR gates obtain the projections of the image onto the X axis and the Y axis.

The PE data I/O module contains shift registers. It receives image data from the PE array in parallel and serially sends the data out of the chip. Conversely, it serially receives image data input and sends the data into the PE array in parallel. The module provides the PE array with another data input port to receive image patterns and to load data into the PE array.

The X processor contains N X processor units (XPUs) and a ROM that stores the X coordinates. The X processor receives the projection onto the X axis of the image held in the Reg1 registers of the PE array. The positions of the left and right edges of the projection are found by the XPUs, and the corresponding X coordinates are obtained from the ROM. The Y processor contains N Y processor units (YPUs) and a ROM that stores the Y coordinates. Its inputs are either the outputs of the NOR gates in the rows or the parallel shift-out data of the Reg1 registers in the PE array. The Y processor not only gives the Y coordinates of the top edge and the bottom edge of the projection onto the Y axis of the image in the Reg1 registers, but also obtains, in sequence, the Y coordinates of the activated pixels in a column of the image. The X coordinates and the Y coordinates from the X processor and the Y processor are transferred to the coordinate output control module, which controls the output of the coordinates.

B. Functions

The SRVC chip has two important characteristics. First, it performs image acquisition, low-level and mid-level image processing, and fast output of image features in one full procedure. Low- and mid-level image processing is mainly carried out by the mathematical morphology method. Second, it integrates pixel-parallel processing and row-parallel processing. These characteristics enable the SRVC to realize complicated algorithms. We introduce the major functions of the SRVC below.

1) Image Acquisition and Segmentation: N × N-pixel optoelectronic images are obtained by the PE array periodically. The default period is 1 ms, and this period can be adjusted. The subsequent 1-bit quantization of the image serves as the image segmentation step, so that objects of interest are retained in the binary image while other objects are removed as much as possible.

2) Pixel-Parallel Image Processing: The PE array can shift images and perform basic pixel-parallel binary logic operations between neighboring pixels or between two corresponding pixels of two images. Combining the basic operations can achieve binary Mathematical Morphology (MM). MM algorithms not only perform low-level image processing, such as morphological filtering, but also carry out mid-level image processing, such as extracting image objects and features. Erosion and dilation are basic MM operations; they use a small image object, called a structural element, to erode or dilate objects in an image. Examples of erosion and dilation are given in Fig. 4. Other MM operations are realized by combinations of erosion, dilation, and logic operations. The SRVC chip efficiently realizes erosion and dilation by pipelining Shift and AND (OR) operations using the Reg1 and Reg2 registers in the PEs. The procedure is regular for different structural elements, which largely simplifies the control sequence. For example, the erosion in Fig. 4(c) is performed by the series of operations shown in Fig. 4(e).
The number of clock cycles used to perform erosion or dilation in the SRVC is proportional to n, the number of activated pixels in the structural element.
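As an illustration of this Shift-and-AND(OR) scheme, here is a minimal software model of binary erosion and dilation over coordinate sets (the function names and data layout are illustrative, not the chip's instruction set):

```python
def erode(image, se, width, height):
    """Binary erosion: AND together copies of the image shifted by each
    activated structural-element offset (one Shift plus one AND per offset,
    mirroring the chip's Reg1/Reg2 pipeline)."""
    result = None
    for dx, dy in se:                      # se: offsets relative to the origin
        shifted = {(x - dx, y - dy) for (x, y) in image}
        result = shifted if result is None else result & shifted
    return {(x, y) for (x, y) in result
            if 0 <= x < width and 0 <= y < height}

def dilate(image, se, width, height):
    """Binary dilation: OR of shifted copies instead of AND."""
    result = set()
    for dx, dy in se:
        result |= {(x + dx, y + dy) for (x, y) in image}
    return {(x, y) for (x, y) in result
            if 0 <= x < width and 0 <= y < height}
```

Each loop iteration corresponds to one Shift followed by one AND (or OR) in the PE array, which is why the cycle count grows with the number of activated pixels in the structural element.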


Fig. 4. Example of erosion and dilation in mathematical morphology operations. (a) An image denoted A. (b) An image denoted B, in which the cross shows where the origin is. (c) The result of the erosion A ⊖ B. (d) The result of the dilation A ⊕ B. (e) Sequence of operations for the erosion in (c).

Using only erosion and dilation, it is difficult to realize advanced iterative MM algorithms such as region growing and convex hull, because it is necessary to know when to terminate the algorithms. The significant improvement of the SRVC in performing MM is that it efficiently supports iterative MM algorithms through a global function that detects a void image, defined as an image without activated pixels. In this way, iterative MM algorithms can be terminated automatically and in time. The MIP chip proposed in [12] performs basic pixel-parallel operations similar to those of our SRVC. Nevertheless, because the data must circulate among four registers, the control over logic operations is irregular and the number of clock cycles per operation increases with the number of memory bits. Moreover, MIP cannot detect a void image; therefore, it cannot perform many advanced MM algorithms.

3) Detecting a Void Image: Detecting whether an image is a void image is a useful function. Many iterative algorithms, such as region growing and skeletonization, require knowing whether two images are equal so as to terminate the algorithm. After subtraction is performed between two images, whether the two images are equal is determined by detecting whether the result image is a void image. The void-image detection function is realized by the NOR gates and the Y processor. If the Y processor detects a void image, the value of the output port "void" will be logic 1. An example is shown in Fig. 5(a). The PVLSAR2.2 [10] realizes such a global OR by an analog method that lacks precision and speed. The method proposed by Anders et al. in 1996 has a similar function [19]; however, that method is slow because it requires a propagation process in the PE array.

4) Extracting the Range of a Region and the Range's Center: The SRVC can quickly obtain the rectangular range of the only region in an image and that range's center. If there are several regions in an image, they must be separated first.
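The void test, and the image-equality check it enables for terminating iterative algorithms, can be modeled in a few lines (a behavioral sketch; `is_void` and `images_equal` are illustrative names, with images represented as lists of 0/1 rows):

```python
def is_void(image):
    # Models the row NOR gates feeding the Y processor: "void" is logic 1
    # iff no pixel in the image is activated.
    return 0 if any(any(row) for row in image) else 1

def images_equal(a, b):
    # Subtract the images pixel-wise (XOR for binary images), then test
    # whether the difference image is void.
    diff = [[p ^ q for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]
    return is_void(diff) == 1
```

An iterative MM loop would call `images_equal` on the images before and after each iteration and stop as soon as they match.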
Fig. 5(b) shows this global operation on the SRVC. The image in the Reg1 array is projected onto the X axis and the Y axis by the pseudo-NMOS NOR gates. The coordinates of the left edge x_L and the right edge x_R of the projection of the region on the X axis are extracted in the X processor. The coordinates of the top edge y_T and the bottom edge y_B of the projection of the region on the Y axis are extracted in the Y processor. The four edges indicate the range of the region and are sent to the coordinate output control module, where the range's center is

(x_C, y_C) = ((x_L + x_R)/2, (y_T + y_B)/2).  (1)

Fig. 5. Non-pixel-parallel operations. The input of the Y processor in (a) and (b) comes from the outputs of the NOR gates in the rows. (a) Diagram for detecting a void image; here void = 1. (Notice that in (b) void = 0.) (b) Extracting the range and the center of a region. (c) Diagram of extracting the coordinates (x, y) of activated pixels in an image. The Y processor can quickly generate the Y coordinates of the activated pixels in a column from the bottom to the top, one by one.

It needs only eight clock cycles to calculate the range of a region and the range's center, which is very fast. The clock frequency is determined by the NOR gates and the circuits in the X processor and the Y processor. It is not difficult to design fast circuits that reach a high frequency without a great resource cost. In some applications, such as target tracking, we usually need a single point to represent the position of an object. Compared with extracting the centroid by global summation, as used in other work [17], obtaining the target range's center with the proposed vision chip is easier and quicker, and the range's center has almost the same effect as the centroid for a regular object.

5) Extracting the Coordinates of Activated Pixels: The results of mid-level image processing are image features that consist of sparse activated pixels. These features must be quickly sent out of the chip in a certain format for further processing. A method has been developed to give the coordinates
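The projection-based range and center extraction of (1) can be sketched behaviorally as follows (plain Python; row 0 is the top of the image, and the integer division stands in for the chip's coordinate arithmetic):

```python
def range_and_center(image):
    # image: list of rows of 0/1 values, row 0 at the top.
    # Row/column ORs model the pseudo-NMOS NOR-gate projections.
    proj_x = [any(row[i] for row in image) for i in range(len(image[0]))]
    proj_y = [any(row) for row in image]
    x_l = proj_x.index(True)                          # left edge
    x_r = len(proj_x) - 1 - proj_x[::-1].index(True)  # right edge
    y_t = proj_y.index(True)                          # top edge
    y_b = len(proj_y) - 1 - proj_y[::-1].index(True)  # bottom edge
    center = ((x_l + x_r) // 2, (y_t + y_b) // 2)     # integer form of (1)
    return (x_l, x_r, y_t, y_b), center
```

In hardware the two edge searches run in the X and Y processors' opposite-direction search chains, which is why the whole operation fits in eight clock cycles.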


of activated pixels [19]. Its limitation is speed and complexity, especially when there are many activated pixels. The PVLSAR2.2 in [10] finds the position of the first non-zero line only. The SRVC quickly extracts the coordinates of the activated pixels in an image. Features in the image can therefore be output as the coordinates of activated pixels, so that information transfer from the SRVC to other digital processors is fast and free of the I/O bottleneck. Another reason to output the features as coordinates is that coordinates can be handled easily. For example, by parallel image processing, the object boundary is first obtained, and then many descriptors or representations, such as area, curvature, and chain code, can be obtained directly from the coordinates of the boundary.

Fig. 5(c) shows the procedure through which the SRVC obtains the coordinates of the activated pixels in the image in the Reg1 array, column by column. First, the data from the first column of the image is transferred into the Y processor. Then, the Y processor searches the activated pixels in the column one by one from the bottom to the top and, at the same time, generates the Y coordinates of the activated pixels. Next, after the coordinates of all activated pixels in the column have been generated, the PE array sends the image data of the next column to the Y processor. The Y processor then generates the coordinates of the activated pixels in the new column. This process is repeated until the Y coordinates of the activated pixels in the last column of the image have been generated. Meanwhile, the X coordinates are simply generated by a column counter in the coordinate output control module.

The vision chip reported in [16] has a similar function of searching for activated pixels and extracting their coordinates. It uses a row-parallel searching architecture with a 432 MHz clock frequency and uses buffers to store coordinates before they are sent out.
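A behavioral model of this column-by-column scan (a sketch in plain Python; with row 0 at the top, larger row indices are lower in the image, so the bottom-to-top order of Fig. 5(c) becomes a descending row scan):

```python
def extract_coordinates(image):
    # image: list of rows of 0/1 values, row 0 at the top.
    height, width = len(image), len(image[0])
    coords = []
    for x in range(width):                   # column counter -> X coordinate
        for y in range(height - 1, -1, -1):  # Y processor scans bottom to top
            if image[y][x]:
                coords.append((x, y))
    return coords
```

For sparse feature images the output list is short, which is what keeps the coordinate stream free of the I/O bottleneck.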
In comparison to the chip in [16], the architecture of our SRVC is more efficient because the searching circuits reside only in the Y processor, which balances the speed of extraction against the speed of sending out the coordinates. Thus, the clock frequency in the PE array can be much lower than that in the Y processor, no circuits for searching activated pixels are needed in the PE array, and no output buffer is used. As a result, a great amount of area and power is saved.

6) On-Chip Control: The on-chip controller receives instructions from outside the chip and decodes them. According to the instructions, the on-chip controller manages the operations of the PE array and the other modules. The detailed control over the operations of extracting the range of a region and the range's center, detecting a void image, and extracting the coordinates of activated pixels is realized by finite state machines, which are controlled by the instructions.

7) Time for Various Operations: We estimated the time required for various operations implemented by the SRVC; Table I gives the results. In an algorithm, the operations that cost the most processing time are the iterative MM operations such as region growing and skeletonization. If a rate of 1000 frames per second is required, it is estimated that more than ten iterative MM operations could be completed with a 20 MHz clock frequency in a 256 × 256 PE array. Extracting and sending out the coordinates of activated pixels does not cost much


TABLE I TIME REQUIRED FOR VARIOUS OPERATIONS

time in practice, because there are few activated pixels representing features in an image.

III. VLSI IMPLEMENTATION

The SRVC chip is implemented in a 0.18 µm CMOS technology with a 1.8 V power supply. The main circuit blocks are designed as follows.

A. Circuits in the Processing Element

Due to the requirement of 1000 fps performance, the integration time of a photodiode must be less than 1 ms for each frame. This calls for a high-sensitivity photodiode, so we use an N-well/P-sub SAB diode without salicide as the photodiode. SAB is the salicide-block mask layer, used in the standard salicided CMOS process to block salicide formation. Using this photodiode, we obtain a short integration time and a high relative spectral quantum efficiency. The test results and a discussion of the N-well/P-sub SAB diode were presented in our previous work [14].

The circuits of the PD and the 1-bit ADC in the PE are given in Fig. 6 [13]. A PD_ctrl signal controls the optical integration time in the PD. The PD reverse voltage depends on the integral of the optical signal and represents the brightness of the pixel. It is amplified and binarized by the 1-bit ADC. The output signal BIN is logic 1 if the amplified PD reverse voltage is between the two threshold voltages V_L and V_H (V_L < V_H); otherwise, it is logic 0.

The 4-bit memory in the PE is realized using four latches (M1, M2, M3, M4), because adopting latches results in a smaller area and fast write and read speeds.
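The window comparison performed by the 1-bit ADC can be described behaviorally as follows (a sketch; V_L and V_H here are illustrative labels for the two thresholds, and the numeric values in the usage are placeholders, not measured chip parameters):

```python
def one_bit_adc(v_pd, v_l, v_h):
    # BIN is logic 1 iff the amplified PD reverse voltage lies strictly
    # between the two threshold voltages (v_l < v_h); otherwise logic 0.
    return 1 if v_l < v_pd < v_h else 0
```

This window, rather than a single threshold, is what lets the quantization step double as a crude brightness-based segmentation.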


Fig. 7. Schematic diagram of a search chain that has eight search chain units (SCUs). The length of the search chain is 8.

Fig. 6. Circuits of the PD and the 1-bit ADC [13].

B. Design of the Pseudo-NMOS NOR Gates

The 2N pseudo-NMOS NOR gates can be implemented compatibly with the structure of the PE array to save area. The static power dissipation, the noise margin, and the delay t_d of the NOR gate are all acceptable. The delay t_d is defined as the time of the low-to-high transition on the output of the NOR gate. The maximum static power dissipation of all NOR gates and the delay t_d increase linearly with N; they reach 9.6 mW and 3.1 ns, respectively. The static power dissipation is at its maximum when the data in the Reg1 registers of the PE array are all logic 1; the normal static power dissipation is much less than this maximum. In addition, the pMOS transistors in the NOR gates are switched off by the signals POY and POX, shown in Fig. 1, when the NOR gates do not operate. The voltage level corresponding to logic 0 at the output of the NOR gate retains a sufficient noise margin.

C. Search Chain in the X Processor and the Y Processor

The search chain in the X processor and the Y processor finds the first logic 1 in a series of bits along a certain direction. The length of the search chain is defined as the number of bits being searched in the chain. An example of a search chain with a length of 8 is given in Fig. 7. The search chain consists of 8 search chain units (SCUs). From left to right, it looks for the position of the first logic 1 in the 8-bit parallel input SC_in. At the position of the first logic 1, the corresponding bit of the parallel output SC_out is logic 0, and all other bits are logic 1. The input signal SC_active controls the operation of the search chain: the chain operates only when SC_active is set high. Another output signal, SC_end, comes from the end of the search chain. If SC_active is high, SC_end will be high only when all parallel-input bits of the search chain are logic 0. SC_end is quite useful in some operations. For example, the output port "void" of the Y processor is directly connected to SC_end.

Searching circuits with a similar function, realized in dynamic logic, were reported in [15] and [16]. Compared with a dynamic logic circuit, the static logic circuit of our search chain has the advantages of easy implementation and tolerance to noise. The search time is principally determined by the delay of the transmission gates and is proportional to the length of the

Fig. 8. Schematic diagram of an improved search chain that has a two-stage structure. The length of the search chain is 8.

search chain. The longest delay is 12.12 ns if N is 128 and a buffer is inserted after every three transmission gates. This search time is much less than that of the searching circuits in [15] and [16]: the longest search time is 71 ns for 128 pixels per row in [15] and 30 ns in [16].

We designed an improved search chain that uses a multi-stage structure for higher speed. The multi-stage structure is similar to a carry-look-ahead adder. A two-stage example with a length of 8 is shown in Fig. 8. Stage 1 contains sub-chains that work in parallel, and stage 2 contains one sub-chain; the lengths of the sub-chains in the same stage should be equal because those sub-chains work in parallel. The search time is then greatly reduced, becoming proportional to the sum of the per-stage sub-chain lengths rather than to N. For example, in a two-stage search chain with N = 128, the search time is only 3.07 ns.

D. XPUs in the X Processor

The X processor contains N XPUs. The diagram of the i-th XPU, XPU[i], is shown in Fig. 9. It consists of two search chain units, SCU1[i] and SCU2[i], which belong to two search chains with opposite search directions. The input NOR_C[i] of the two SCUs is the output of the NOR gate located in the i-th column of the PE array. A multiplexer selects between the outputs of the two SCUs, and the selected output is sent to the ROM that stores the X coordinates.
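The ripple search chain's behavior can be modeled compactly (a sketch in plain Python; the gate-level SCU structure of Fig. 7 is abstracted into a left-to-right scan, and `search_chain` is an illustrative name):

```python
def search_chain(sc_in):
    # Ripple find-first-one: SC_out is 0 only at the position of the first
    # logic 1 (scanning from index 0); SC_end is 1 iff no input bit is set.
    found = False
    sc_out = []
    for bit in sc_in:
        sc_out.append(0 if (bit and not found) else 1)
        found = found or bool(bit)
    sc_end = 0 if found else 1
    return sc_out, sc_end
```

The multi-stage variant would run such scans on short sub-chains in parallel and then combine their SC_end flags, which is what reduces the delay from being proportional to N to being proportional to the sum of the sub-chain lengths.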


Fig. 9. Schematic diagram of XPU[i].

Fig. 11. Timing diagram for the operation of generating the Y coordinates of the activated pixels of an image in the PE array, column by column.

Fig. 10. Schematic diagram of YPU[j].

E. YPUs in the Y Processor

The diagram of the j-th YPU, YPU[j], is given in Fig. 10. Like an XPU, it consists of two search chain units, SCU1[j] and SCU2[j], which belong to two search chains with opposite search directions. It has an additional input PE_R[j], which is the data from the j-th row of the PE array, as well as NOR_R[j], which is the output of the NOR gate in the j-th row. If the bit in the register R1[j] of the j-th YPU is the first logic 1 found by search chain 2, formed by SCU2[j], that bit is set to 0 via the path through the AND gate. Hence, search chain 2 can continue to seek the next bit with the value 1. When the chip performs the operation of extracting the coordinates of activated pixels, search chain 2 works. During this process, Sel1 equals the signal SC_end of search chain 2, so the process continues automatically after initialization of the control signals in the Y processor. The signal SC_end also indicates the end of one column. The timing diagram of this function is shown in Fig. 11. The Y coordinates of activated pixels are generated every other clock cycle. The clock period is determined by the speed of the search chain.

IV. EXPERIMENTS ON THE PROTOTYPE CHIP

A. Prototype Chip

A prototype SRVC chip with a 16 × 16 PE array was designed and fabricated using the 0.18 µm CMOS process. The

Fig. 12. Microphotograph of the prototype chip.

TABLE II FEATURES OF THE PROTOTYPE CHIP

microphotograph of the chip is given in Fig. 12. Table II lists the features of the prototype chip.

A test board for the prototype chip was fabricated. The test board is controlled by an FPGA, in which a test bench program was developed. The test bench generates instructions and data for the tests and receives output results from the prototype chip. A software interface that communicates with the FPGA was


Fig. 13. Examples of algorithms using mathematical morphology performed in the prototype chip. The dot in the structural element (SE) indicates the origin. (a) Noise cancellation. (b) Region growing. (c) Boundary extraction. (d) Skeleton.

also developed. By software coding, various algorithms can be implemented in experiments. The clock frequency in the experiments was 20 MHz, limited by the connection between the test board and the FPGA. For the algorithms in the experiments, the typical frame rate is 1000 frames per second; for some simple algorithms, the frame rate can exceed 10 000 frames per second.

B. Algorithms Based on Mathematical Morphology

Fig. 13 gives the results of four algorithms, which demonstrate that the SRVC chip can realize mathematical morphology operations. The algorithms use the same structural element, shown in the bottom-right corner of Fig. 13. Fig. 13(a) shows the result of the noise cancellation algorithm. At the top is an image with a large object and some noise. The first step is to eliminate the noise dots outside the object region by opening, that is, an erosion followed by a dilation. This generates the image shown in the middle of Fig. 13(a). Next, the closing operation, a dilation followed by an erosion, is performed on the middle image, resulting in the noise-free image at the bottom. Fig. 13(b) gives the process of region growing based on dilation. A seed in the image at the top grows with reference to the region at the bottom of Fig. 13(a) and finally generates an equal region, shown in the image at the bottom of Fig. 13(b). Fig. 13(c) shows the outer boundary in the middle and the inner boundary at the bottom, corresponding to the region in the image at the top. Dilation and erosion are used to obtain the outer and inner boundaries, respectively. The image in the middle of Fig. 13(d) is the skeleton of the region in the image at the top. The skeleton is the union of the sub-skeletons generated during the algorithm. The region can be recovered from the sub-skeletons, as shown in the image at the bottom.

C. Application-Specific Functions

The function of detecting a void image is used in many algorithms.
For example, the region-growing and skeleton operations shown in Fig. 13 use this function to terminate the algorithms. In Fig. 14(b), the range and the range's center of the letter "A" were obtained. After the binary image was generated, the


Fig. 14. (a) The photo of an object taken by a digital camera. (b) The range and the center of the object extracted in the SRVC. (c) Outer boundary of the object and the Y coordinates transferred from the chip.

function extracting the range of a region and the range's center was used. In another experiment, the outer boundary of "A" was obtained. Using the global function that extracts the coordinates of activated pixels, the boundary of "A" was sent out of the chip in the form of coordinates. Fig. 14(c) shows the boundary image and the extracted Y coordinates of the activated pixels on the boundary.

D. Target Tracking

To demonstrate the ability of the SRVC in real-time vision applications, a simplified version of the target tracking algorithm from [13] was implemented in the prototype chip. The implementation covers all functions of the SRVC. It begins with image acquisition and 1-bit quantization, and then uses logic operations and the morphological method to carry out noise cancellation and self-windowing capture [17]. The position of the captured target is generated by the function extracting the range of a region and the range's center.

To handle collisions of the target with other objects, pixel-parallel operations and the void-image detection function are combined to perform collision detection and separation detection. If a collision is detected, the boundary of the target just before the collision is sent out of the chip to record the target's features. If a separation is detected, the boundaries of the separated regions are sent out of the chip so that features can be computed for target recognition. The boundaries are obtained by pixel-parallel operations together with the on-chip function extracting the coordinates of activated pixels. This target tracking algorithm illustrates how the SRVC proceeds through image acquisition, low- and mid-level image processing, and the generation of compact information for high-level image processing to accomplish complex applications in real-time vision.

The equipment for target tracking is shown in Fig. 15(a).
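The range-and-center reduction that drives the tracking loop can be modeled in software as a simple reduction over the activated pixels. On the chip this is performed by the row-parallel X and Y processors; the function name below is illustrative, not the chip's instruction set.

```python
# Software model of the SRVC's global feature extraction: reduce a binary
# image (a set of active (row, col) pixels) to the bounding box ("range")
# of its active pixels and the box's center.

def extract_range_and_center(pixels):
    """Return ((rmin, rmax, cmin, cmax), (row_center, col_center))."""
    rows = [r for r, _ in pixels]
    cols = [c for _, c in pixels]
    rng = (min(rows), max(rows), min(cols), max(cols))
    center = ((rng[0] + rng[1]) // 2, (rng[2] + rng[3]) // 2)
    return rng, center

# Example target region; the tracking loop feeds the center to the actuator.
target = {(2, 3), (2, 4), (3, 3), (3, 4), (4, 5)}
rng, center = extract_range_and_center(target)  # rng = (2, 4, 3, 5), center = (3, 4)
```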
A white target on a dark background moves back and forth horizontally in front of a camera that is fixed on an actuator. The distance between the camera and the target is 60 cm. The prototype chip is behind the camera; it continuously captures the target and extracts the target's position. According to the position of the target, the actuator adjusts its angles to locate


IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 43, NO. 6, JUNE 2008

TABLE III COMPARISON OF SRVC WITH OTHER CHIPS

Fig. 15. Example of the target tracking experiment. (a) The equipment for the experiment. (b) The motion of the target that moves horizontally back and forth.

the target in the center of the field of view. Thus, the target is tracked by the SRVC vision system. Fig. 15(b) gives the recorded motion of the target during the target tracking experiment.

V. FEATURES AND PERFORMANCE

Overall, the features of the SRVC reflect trade-offs made for its target applications. The regular PE structure facilitates scaling up the PE array size or the number of memory bits. The specialized periphery circuits for fast global feature extraction and exportation promise outstanding performance in many applications from a system-level view. A comparison of the SRVC with other digital SIMD bit-serial vision chips is listed in Table III. The large PE size of [8] limits its scalability in real applications, although each of its PEs has strong processing ability. PVLSAR2.2 [10] possesses impressive performance and a small PE size, but the stability of its PE and the complexity of its control complicate scaling up. MIP [12] has PE functions similar to those of the SRVC, but it also has complicated control and cannot extract global features.

The chip area of the SRVC is mostly determined by the PE array, which grows with the square of the array dimension N. Therefore, the area of the PE is an important parameter. The area of the PE in the prototype chip is 30 μm × 40 μm, which is moderate for a general-purpose digital PE. There is potential to reduce the PE area if some loss of performance is acceptable. With similar functions, the PE of the SRVC should have fewer transistors than the PE of MIP [12], because the SRVC adopts latches rather than the active-high flip-flops of MIP; furthermore, the PE of the SRVC has one more memory bit than the PE of MIP.

We performed a simulation of the PE array and the periphery circuits in EDA tools, extracting the analog circuit from the layout with parasitic parameters. The maximum clock frequency of the 16 × 16 PE array can reach 300 MHz. The clock frequency of the X processor and the Y processor is dominated by the search chain, whose delay increases as N increases. Using the improved search chain, the maximum frequency can still reach 300 MHz. The maximum frequency of the pseudo-NMOS NOR gates is lower; however, it could be increased by improving the circuit design and by finishing the NOR calculation in more than one clock cycle. Therefore, the speed of the NOR gates would not limit the overall speed of the SRVC. In conclusion, even for larger PE arrays, the simulated maximum clock frequency of the SRVC could be much higher than 20 MHz.

The power consumption of the prototype chip averages 8.72 mW at 20 MHz. It mostly comes from the PDs and the 1-bit ADCs in the PE array, which consume 8.09 mW.
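For orientation, the reported clock and frame rates imply a per-frame cycle budget, computed below under the simplifying assumption that processing fully occupies the clock.

```python
# Back-of-envelope check of the reported figures: the cycle budget per
# frame is the clock rate divided by the frame rate. At the 20 MHz test
# clock, 1000 frames/s leaves 20,000 cycles per frame; at the simulated
# 300 MHz maximum, the same frame rate leaves 300,000 cycles.

def cycles_per_frame(clock_hz, fps):
    return clock_hz // fps

budget_test = cycles_per_frame(20_000_000, 1000)    # 20,000 cycles/frame
budget_max = cycles_per_frame(300_000_000, 1000)    # 300,000 cycles/frame
```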
The sensor performance is not significantly improved compared with our earlier proposal [14]. The SRVC is not sensitive to noise, because its two-threshold segmentation method relies on a notable difference in lightness between objects and background. However, this also limits the application range of the chip: it cannot always handle the complexity of the real world, although it works well in artificial environments such as factory automation.

MIAO et al.: A PROGRAMMABLE SIMD VISION CHIP FOR REAL-TIME VISION APPLICATIONS

VI. CONCLUSION

A programmable vision chip, SRVC, for real-time vision applications was proposed. It consists of an N × N PE array, an X processor, a Y processor, a PE data I/O module, a coordinates output control module, an on-chip controller, and N-input pseudo-NMOS NOR gates. The chip architecture supports both pixel-parallel and row-parallel operations and performs low-level and mid-level image processing based on mathematical morphology. It can implement various algorithms through software control, and all of the functions required for real-time vision applications can be performed at high speed. A 16 × 16 prototype chip was fabricated in a 0.18 μm CMOS process. The experimental results demonstrate its functionality and its potential in real-time vision applications. The chip core area of 0.7 mm × 0.64 mm is small, and the power consumption of 8.72 mW is low. The SRVC chip can easily be scaled up in resolution or in bits of memory. It will find wide application in real-time vision, for example in factory automation, medical inspection, security, robotics, and target tracking control.



REFERENCES

[1] K. Aizawa, “Computational sensors—vision VLSI,” IEICE Trans. Inf. Syst., vol. E82-D, no. 3, pp. 580–588, 1999.
[2] R. C. Gonzalez and R. E. Woods, Digital Image Processing, 2nd ed. Upper Saddle River, NJ: Pearson Education, 2002, p. 2.
[3] J. G. Harris, C. Koch, and J. Luo, “A two-dimensional analog VLSI circuit for detecting discontinuities in early vision,” Science, vol. 246, pp. 1209–1211, Jun. 1990.
[4] H. Kobayashi, L. White, and A. A. Abidi, “An active resistor network for Gaussian filtering of images,” IEEE J. Solid-State Circuits, vol. 26, no. 5, pp. 738–748, May 1991.
[5] S. Y. Lin, M. H. Chen, and T. D. Chiueh, “Neuromorphic vision processing system,” Electron. Lett., vol. 33, no. 12, pp. 1039–1040, Jun. 1997.
[6] E. S. Gayles, T. P. Kelliher, R. M. Owens, and M. J. Irwin, “The design of the MGAP-2: A micro-grained massively parallel array,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 8, no. 6, pp. 709–716, Dec. 2000.
[7] J. E. Eklund, C. Svensson, and A. Astrom, “VLSI implementation of a focal plane image processor—a realization of the near-sensor image processing concept,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 4, no. 3, pp. 322–335, Sep. 1996.
[8] M. Ishikawa, K. Ogawa, T. Komuro, and I. Ishii, “A CMOS vision chip with SIMD processing element array for 1 ms image processing,” presented at the IEEE ISSCC, San Francisco, CA, 1999, TP 12.2.
[9] P. Dudek and P. J. Hicks, “A general-purpose processor-per-pixel analog SIMD vision chip,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 52, no. 1, pp. 13–20, Jan. 2005.
[10] F. Paillet, D. Mercier, and T. M. Bernard, “Second generation programmable artificial retina,” in Proc. 12th Annu. IEEE Int. ASIC/SOC Conf., Washington, DC, 1999, pp. 304–309.
[11] V. M. Brea, D. L. Vilariño, A. Paasio, and D. Cabello, “Design of the processing core of a mixed-signal CMOS DTCNN chip for pixel-level snakes,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 51, no. 5, pp. 997–1013, May 2004.
[12] W.-C. Fang, T. Shaw, J. Yu, B. Lau, and Y.-C. Lin, “Parallel morphological image processing with an opto-electronic VLSI array processor,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), 1993, vol. 1, pp. I-409–I-412.
[13] W. Miao, Q.-Y. Lin, and N.-J. Wu, “A novel vision chip for high-speed target tracking,” Jpn. J. Appl. Phys., vol. 46, no. 4B, Apr. 2007.
[14] Q.-Y. Lin, W. Miao, and N.-J. Wu, “A high-speed target tracking CMOS image sensor,” in Proc. IEEE Asian Solid-State Circuits Conf., Hangzhou, China, Nov. 2006, pp. 139–142.
[15] Y. Oike, M. Ikeda, and K. Asada, “A row-parallel position detector for high-speed 3-D camera based on light-section method,” IEICE Trans. Electron., vol. E86-C, no. 11, pp. 2320–2328, 2003.
[16] Y. Oike, M. Ikeda, and K. Asada, “A 375 × 365 1 k frames/s range-finding image sensor with 394.5 kHz access rate and 0.2 sub-pixel accuracy,” presented at the IEEE ISSCC, San Francisco, CA, 2004, TP 6.6.
[17] T. Komuro, I. Ishii, M. Ishikawa, and A. Yoshida, “A digital vision chip specialized for high-speed target tracking,” IEEE Trans. Electron Devices, vol. 50, no. 1, pp. 191–199, Jan. 2003.
[18] Y. Watanabe, T. Komuro, S. Kagami, and M. Ishikawa, “Vision chip architecture for simultaneous output of multi-target positions,” in Proc. SICE 2003 Annu. Conf., Fukui, Japan, 2003, vol. 2, pp. 1572–1575.
[19] A. Åström, R. Forchheimer, and J.-E. Eklund, “Global feature extraction operations for near-sensor image processing,” IEEE Trans. Image Process., vol. 5, no. 1, pp. 102–110, Jan. 1996.

Wei Miao was born on November 11, 1980, in Sichuan, China. He received the Bachelor degree in physics from Tsinghua University, Beijing, China, in 2002, and the Ph.D. degree in microelectronics and solid-state electronics from the Institute of Semiconductors, Chinese Academy of Sciences, Beijing, in 2007. He is currently an Architecture Design Engineer with OmniVision SDC, Shanghai, China. He has done research on topics related to mixed-signal VLSI, image processing, machine vision, and quantum analog computation.

Qingyu Lin was born in Kunming, China, in 1981. He received the B.S. degree in physics from Peking University, China, in 2003. Since 2004, he has been pursuing the Ph.D. degree at the State Key Laboratory for Superlattices and Microstructures, Institute of Semiconductors, Chinese Academy of Sciences, Beijing, China. His research interests are in the field of CMOS integrated optical sensors and on-chip image processing.

Wancheng Zhang was born in 1985. He received the B.S. degree in physics from Peking University, Beijing, China, in 2004. He is currently working toward the Ph.D. degree at the Institute of Semiconductors, Chinese Academy of Sciences, Beijing. His current research interests include novel nanoelectronic devices and circuits and digital circuit design.

Nan-Jian Wu was born in Zhejiang, China, on February 27, 1961. He received the B.S. degree in physics from Heilongjiang University, China, in 1982, the M.S. degree in electronic engineering from Jilin University, China, in 1985, and the Ph.D. degree in electronic engineering from the University of Electro-Communications, Chofu, Japan, in 1992. In 1992, he joined the Research Center for Interface Quantum Electronics and Faculty of Engineering, Hokkaido University, Sapporo, Japan, as a Research Associate. In 1998, he was an Associate Professor in the Department of Electro-Communications, University of Electro-Communications. Since 2000, he has been a Professor in the Institute of Semiconductors, Chinese Academy of Sciences. In 2005, he was a Visiting Professor at the Research Center for Integrated Quantum Electronics, Hokkaido University. His research is in the field of semiconductor quantum devices and circuits, and design of analog–digital mixed-signal LSI.
