A Programmable Vision Chip Based on Multiple Levels of Parallel Processors
Wancheng Zhang, Qiuyu Fu, and Nan-Jian Wu, Member, IEEE
Abstract—This paper proposes a novel programmable vision chip based on multiple levels of parallel processors. The chip integrates a CMOS image sensor, multiple levels of SIMD parallel processors, and an embedded microprocessor unit (MPU). The multiple levels of SIMD parallel processors consist of an array of SIMD processing elements (PEs) and a column of SIMD row processors (RPs). The PE array and the RPs provide pixel-level and row-level parallelism, respectively, and can be reconfigured to handle algorithms of different complexities and processing speeds. The PE array, RPs, and MPU execute low-, mid-, and high-level image processing algorithms, respectively, which efficiently increases the performance of the vision chip. The vision chip can flexibly satisfy the needs of different vision applications such as image pre-processing, complicated feature extraction, and high-speed image capture at over 1000 fps. A prototype chip with a 128 × 128 image sensor, 128 A/D converters, 32 8-bit RPs, and 32 × 128 PEs was fabricated in a 0.18 μm CMOS process. Applications including target tracking, pattern extraction, and image recognition are demonstrated.
Index Terms—CMOS sensor, image recognition, massive parallelism, SIMD, vision chip.
I. INTRODUCTION
THERE is a continuous demand for high-speed image processing and recognition [1]. Traditional machine vision systems are composed of an image sensor and a general-purpose processor. This approach has several limitations. First, the large amount of image data transferred induces a heavy I/O load and large power dissipation. Second, it is not easy to finish iterative image processing algorithms at high speed, even with high-performance general-purpose processors. The vision chip [2], [3] seeks to overcome these limitations by 1) integrating the image sensor and the processors into a single chip and outputting only extracted feature information, and 2) using parallel processing elements to accelerate processing. The vision chip has the advantages of small size, high processing speed, and low power consumption, and it has broad applications in factory automation, security monitoring, and robot control.
Manuscript received September 27, 2010; revised February 24, 2011; accepted May 11, 2011. Date of publication June 30, 2011; date of current version August 24, 2011. This paper was approved by Associate Editor Bevan Baas. This work was supported by the National Natural Science Foundation of China under Grant 60976023, the Chinese National High-Tech Research and Development Project under Grant 2008AA010703, and the special funds for Major State Basic Research Project 2011CB932902 of China. The authors are with the State Key Laboratory for Superlattices and Microstructures, Institute of Semiconductors, Chinese Academy of Sciences, Beijing 100083, China (e-mail: [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/JSSC.2011.2158024
A large number of vision chips have been developed [3]–[20]. Most of them include a two-dimensional (2D) array of processing elements (PEs) operating in single-instruction multiple-data (SIMD) fashion, where each PE corresponds to an image pixel and consists of a photodiode and a processing circuit. According to the circuit structure, vision chips can be categorized as analog or digital. The analog vision chip integrates analog image processing circuits with each sensor pixel [4]–[12]; its advantages are compact area and high speed. The digital vision chip has better flexibility and can handle much more complicated image processing algorithms [13]–[20]. According to the application field, vision chips can also be categorized as application-specific or general-purpose. Application-specific vision chips for motion detection [6], [13], target tracking [10], [14], image compression [15], morphology [9], and range finding [16] have been demonstrated; these chips have impressive performance but low flexibility. The general-purpose vision chip consists of massively parallel programmable PEs and can be used as an image processor with high speed and good flexibility [3], [5], [11], [17]–[20]. However, the performance of these vision chips is still limited, for several reasons. First, the SIMD PE array can efficiently finish low-level pixel-parallel operations, but it is hard-pressed to perform row-parallel and non-parallel operations. Recently, a multi-SIMD vision processor architecture was proposed [21], and its FPGA implementation demonstrated the potential advantage of integrating different levels of parallel processors into one chip. Second, previously reported vision chips pair one PE with each pixel, but the PE's area is 5–20 times larger than the sensor pixel's area [9], [20]. Since the chip area grows quickly with image size, the image that such a chip can capture is limited in size. Moreover, the fixed one-to-one PE–pixel mapping reduces the image processing flexibility.
This paper proposes a novel digital general-purpose programmable vision chip based on multiple levels of parallel processors. It consists of a CMOS image sensor, a SIMD PE array, a column of SIMD row processors (RPs), and a microprocessor unit (MPU). The image sensor outputs 8-bit grayscale image data to the processors in parallel. The PE array allows a configurable mapping between the captured image pixels and the PEs, and it performs pixel-parallel algorithms; the PE is specifically designed for vision applications and executes vision algorithms with high performance. The row processors perform row-parallel algorithms, and the embedded MPU performs the non-parallel processing algorithms and manages the chip operation. The vision chip can thus realize complicated image processing algorithms efficiently. A dedicated parallel programming language and compiler were developed.
The vision chip with application-specific software can flexibly satisfy the needs of different vision image processing applications, such as common vision chip applications, pattern recognition, and over-1000 fps image pre-processing.
The rest of the paper is organized as follows. Section II introduces the chip architecture. Section III presents the circuit components and the programming language. Section IV shows the VLSI implementation and chip performance. Finally, Section V concludes the paper and indicates future directions.
II. CHIP ARCHITECTURE
A. Multiple-Level Parallelism
Fig. 1. Brief view of the vision chip architecture.
Fig. 1 gives a brief view of the chip architecture. The chip consists of a CMOS image sensor, a SIMD PE array, a column of SIMD RPs, and an MPU. The image sensor acquires an image and outputs the raw image data column by column to the processors. The 2D array of SIMD PEs performs pixel-parallel image processing algorithms, while the 1D array of SIMD RPs performs row-parallel algorithms. The PE array and the RP array share their data memories, which reduces chip area [21]. The MPU fetches the processed data from the RP array and performs non-parallel image processing algorithms.
Note that the image size need not equal the PE array size. The area of a PE is much larger than that of a sensor pixel [20]; if the PE array size were equal to the image size, the chip area would grow quadratically with the image resolution, and the sensor resolution would be limited by the feasible chip area. Therefore, the PE array is usually designed to be smaller than the sensor array. The PE array has a reconfigurable mapping relationship with the sensor array, so it can handle operations on different image sizes and still achieve the same speed-up; this feature is discussed further in Section II-B.
This architecture gives consideration to both flexibility and parallelism in image processing. Image processing algorithms can be classified into low, mid, and high levels [22], and different levels require processors with different parallelism and complexity.
In the proposed chip architecture, the low-, mid-, and high-level image processing algorithms are handled compatibly by three kinds of processors: with PE-level parallelism, with row-level parallelism, and with no parallelism, respectively. Fig. 2 illustrates the parallel-processing methods based on the architecture of the vision chip.
Each PE is directly connected with its neighbor PEs in the PE array; it performs logic operations among the present digital signals of itself and its neighbors and obtains its result. The PE array can execute low-level algorithms, such as 2D filters and background reduction, in pixel-parallel fashion. These algorithms have parallelism on the order of the number of image pixels, and the PE array can speed up low-level image processing in proportion to its number of PEs. Mid-level algorithms include statistical calculation, column/row operations, and FFT/DCT; they have row-level parallelism, meaning that a row (or column) of pixels undergoes the same operations at the same time. The RP array performs the mid-level algorithms and, in theory, can speed them up in proportion to the number of RPs; the actual speed-up depends on the nature of the algorithm. Finally, high-level algorithms that require serial calculation are not the target of the PE array or the RP array; they are performed by the MPU.
A typical vision application like character recognition includes several stages. First, image pre-processing such as noise reduction, edge detection, and skeleton extraction is finished by the PE array. Second, the RP array performs statistical operations to generate vectors that represent the image features. Finally, the MPU gives the recognition result. If this whole process were done by a general-purpose processor, the low-level image pre-processing could consume a rather large percentage of the processing time [24]; for example, in the original PPED image feature extraction algorithm [25], discussed later, low-level image processing consumes 97% of all cycles. By using the PE array, the pre-processing speed can be greatly improved, so the chip architecture can achieve a fast overall image processing speed.
Moreover, in the architecture design of a vision chip, the trade-off between flexibility and parallelism must be considered. The PE circuit has the simplest control path: the fine-grained PE array has maximum parallelism and maximum memory-access bandwidth but the lowest flexibility [23]. The mid-grained RP array has moderate parallelism and flexibility, and an MPU has the lowest parallelism and maximum flexibility.
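To see why accelerating the low-level stage matters so much, a back-of-the-envelope Amdahl's-law estimate can be attached to the 97% pre-processing fraction quoted above for PPED. This worked example is ours, with the PE-array acceleration factor S left as a free parameter:

```latex
\text{overall speed-up} \;=\; \frac{1}{(1-f) + f/S}, \qquad f = 0.97 .
% For example, with S = 100: 1/(0.03 + 0.0097) \approx 25.
```

Even a modest pixel-parallel acceleration thus lifts the end-to-end throughput by an order of magnitude, whereas the serial portion alone would cap the gain at about 33×.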
Fig. 2. Different parallel processing methods, their representative image processing operations and corresponding speed-up.
According to the flexibility and parallelism of the target applications, we may dynamically reconfigure the PE array or change the number of PEs and RPs employed, to satisfy the needs of high-speed or high-quality image processing.
Fig. 3 presents the detailed architecture of the vision chip. The CMOS sensor array is accompanied by a column of programmable-gain amplifiers (PGAs) and a column of analog-to-digital converters (ADCs): a column of photodiode voltages is amplified by the PGAs and then converted to digital signals in parallel. The RP array consists of a column of RPs, each connected with its neighboring RPs so that data can be exchanged between RPs. The PE array contains PEs connected in a 2D mesh, so data can be exchanged freely over the PE plane. The PE has a structure similar to that of the MATRIX architecture [23]. Each PE has its own local data memory, and one RP shares its data memory with a row of PEs, so each RP's data memory comprises the memories of the whole PE row. During image acquisition, the pixel data of one image column are sampled and converted by the ADCs and stored in the registers of the RP array; the pixel data are then written into the RPs' data memories. In this way, a column of image data is stored in one column of PEs (or several columns of PEs), since the RP array shares its data memory with the PE array. This process is repeated until the whole image, or the part of the image that is of interest, is loaded into the PE array (or, equivalently, into the data memories of the RP array).
The image is then processed in pixel-parallel fashion by the PE array and in row-parallel fashion by the RP array. The amount of result data obtained by the parallel processors is much smaller than the raw image and can be fetched by the MPU through the data bus. The MPU runs high-level algorithms and adjusts the sensor parameters on demand. The instructions for the PE array and the RP array are stored in an on-chip program memory and are issued by a specific controller whose operation is managed by the MPU; in this way, the parallel processors can directly access their program memory and operate in parallel with the MPU.
In summary, the chip architecture integrates analog sensors and different levels of parallel processors into an SoC vision chip, and it optimizes the trade-off among image resolution, processing speed, and chip area. The architecture differs markedly from previous SIMD media processors [24]–[26]: it is specifically designed for vision chip applications. Several detailed architecture features are presented below.
B. Architecture Features
Fig. 3. Schematic of the chip architecture. It consists of the image sensor array, A/D converters, PE array, RP array, and MPU.
1) Fine-Grained PE Processor: The PE array is formed of fine-grained PEs, each consisting of a simple 1-bit ALU and data memories. Although the ALU is only 1 bit wide, the PE can finish complicated multiple-bit operations by using multiple operation cycles. This design is particularly efficient for vision chip applications, for several reasons.
First, the data in vision chip applications are mostly short fixed-point integers [1]; for example, 8-bit pixel data are widely used. In addition, vision chip algorithms contain a large number of pixel-parallel low-level operations [20]. Compared with conventional coarse-grained or 1D-parallel mid-grained processors [26]–[28], the 2D-parallel fine-grained PE array can be more efficient at pixel-parallel vision algorithms. Here we compare the OPS (operations per second) per unit area of the PEs against that of a mid-grained ALU. Suppose the mid-grained n-bit ALU is composed of a simple ALU unit and a dedicated multiplier. A 1-bit PE finishes an n-bit addition or logic operation in about n + 1 cycles (n bit-serial steps plus one cycle to clear the carry register), so m PEs finish m n-bit additions/subtractions in those same n + 1 cycles, whereas the n-bit ALU finishes the same m operations in m cycles. However, an n-bit adder alone occupies much more area than several 1-bit PEs: the data path of the 1-bit PE is short, and the 1-bit PE achieves a higher operating frequency with a small number of transistors. Take an 8-bit adder as an example. At the same technology node (0.18 μm) and under the same timing constraint (160 MHz), synthesis shows that the area of an 8-bit ALU is 8 times larger than that of eight 1-bit ALUs. When performing 8-bit addition/subtraction/logic operations, the PE takes 9 cycles per instruction, and its OPS/area is 800% higher than that of the mid-grained ALU.
On the other hand, the 1-bit PE finishes an n-bit multiplication in O(n²) cycles, while the mid-grained ALU finishes it in one cycle. When performing an 8 × 8-bit multiplication, the 1-bit PE needs 108 cycles, and its performance is 35% lower than that of the mid-grained ALU. (In our chip architecture, multiplication could in principle be finished more efficiently by the row processors; here we focus on the PE array only.) Therefore, the overall performance depends on the nature of the algorithm. We developed a dedicated compiler for the 2D PE array that converts multiple-bit algorithms into 1-bit PE operations. Using the compiler, we analyzed typical vision algorithms and concluded that, in most cases, the PE processor performance is significantly higher than that of the mid-grained processor; implementation results are shown in Section IV.
Second, some vision chip applications operate on data of even shorter length (for example, 4-bit data or the 1-bit black/white data type). When dealing with these data types, the PE array is more efficient still, because the cycle count of the PE is proportional to the data length, while the cycle count of a mid-grained processor is independent of it. For example, m PEs finish m 1-bit AND/OR operations in one cycle, while an n-bit ALU also spends a full cycle on what is only a 1-bit AND/OR operation; in this case the PE array performance is much higher. In the PPED algorithm [25] shown later, 22% of all operations deal with data shorter than 8 bits, and these operations are finished by the PEs in a smaller number of cycles.
Finally, the PE can store the image data of one or several pixels locally, and the PE array fetches the sensor image data directly, column by column. This feature facilitates iterative pixel-parallel image processing and reduces the memory load/store time.
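To make the cycle counts above concrete, the following is a minimal C model of the bit-serial addition performed by a 1-bit ALU with a carry register; it is a behavioral sketch under our own naming, not the chip's actual logic:

```c
#include <stdio.h>
#include <stdint.h>

/* Behavioral model of a 1-bit PE ALU adding two n-bit operands
 * bit-serially: one full-adder step per cycle plus one cycle to
 * clear the carry register, i.e., n + 1 cycles in total. */
static uint32_t bit_serial_add(uint32_t a, uint32_t b, int n, int *cycles)
{
    uint32_t sum = 0;
    int carry = 0;              /* the PE's 1-bit carry register */
    *cycles = 1;                /* one cycle to clear the carry register */
    for (int i = 0; i < n; i++) {
        int ai = (a >> i) & 1, bi = (b >> i) & 1;
        sum |= (uint32_t)(ai ^ bi ^ carry) << i;     /* 1-bit sum */
        carry = (ai & bi) | (ai & carry) | (bi & carry);
        (*cycles)++;            /* one ALU cycle per bit position */
    }
    return sum;
}

int main(void)
{
    int cycles;
    uint32_t s = bit_serial_add(0x5A, 0xC3, 8, &cycles);
    printf("sum = 0x%02X in %d cycles\n", s, cycles); /* 9 cycles for 8 bits */
    return 0;
}
```

For n = 8 the model reports 9 cycles, matching the 9 cycles per 8-bit instruction quoted above.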
Fig. 4. Configurable mapping relationships between the sensor image and the PE array: (a) 4-pixel/1-PE; (b) 1-pixel/1-PE; (c) 1-pixel/4-PE. The number of image frames that can be stored is set by the size of each PE's memory.
The whole PE array can read and write its local memories in a single cycle (two reads and one write per PE), so its equivalent memory bandwidth is many times higher than that of mid-grained processors. In summary, the PE processor speeds up vision algorithms and cannot be replaced by the row processors; its limitation is that it can only finish pixel-parallel algorithms. In our chip, the PE processor works together with the mid-grained RP processor, and this heterogeneous structure, along with the PE array features shown above, improves the SoC chip performance in vision applications.
2) Flexible Image-Pixel–PE Mapping Relationship: The PE array has a flexible mapping relationship with the image sensor. In most previously reported vision chips, the image pixel array and the PE array have the same size, and one pixel is mapped onto one PE. This kind of architecture restricts increases in image sensor resolution, because the area of the PE is much larger than that of the pixel [5], [20]; furthermore, it weakens the flexibility of the PE array, because the image size and the number of images stored in the PE array are fixed. In our chip, the interconnection and function of the PE array are flexible: the processing algorithm can store multiple pixels in one PE, one pixel in one PE, or one pixel in multiple PEs. This enables the PE array to efficiently handle different processing algorithms suited to different applications.
In common cases, each PE stores and processes the data of one pixel, as shown in Fig. 4(b); previously reported vision chips offer only this one-pixel-to-one-PE mapping. In this case, the number of 8-bit image frames the PE array can store is set by each PE's memory size. If the image pixel array is larger than the PE array, the image can be cut into slices: the MPU controls the image sensor to output the image slice by slice to the PE array, and the PE array processes the image slice by slice. This mapping relationship suits low-complexity algorithms like background reduction or simple 2D filtering, in which the operation on each pixel has no complicated relationship with the other pixels, and the required image size and stored frame count are moderate.
For algorithms like image segmentation and mathematical morphology, a large neighborhood of PE data is used iteratively to calculate the result. It is then convenient to store the whole image (or a larger image slice) in the PE array, to reduce the difficulty of adequately processing the edges of image slices.
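The three mappings of Fig. 4 can be pictured as a simple indexing scheme. The C sketch below is purely illustrative: the mode names, the vertical 4-pixel packing in the 4-pixel/1-PE case (consistent with Fig. 5), and the horizontal combining in the 1-pixel/4-PE case are our assumptions, not the chip's documented addressing.

```c
#include <stdio.h>

/* Illustrative pixel-to-PE index mapping for the three modes of Fig. 4.
 * A pixel at column x, row y maps to a PE at (pe_x, pe_y) plus, in
 * 4-pixel/1-PE mode, a local slot in that PE's memory. */
typedef enum { FOUR_PIX_PER_PE, ONE_PIX_PER_PE, ONE_PIX_PER_4PE } map_mode;

static void map_pixel(map_mode m, int x, int y,
                      int *pe_x, int *pe_y, int *local_slot)
{
    switch (m) {
    case FOUR_PIX_PER_PE:   /* 4 vertically adjacent pixels share one PE */
        *pe_x = x; *pe_y = y / 4; *local_slot = y % 4;
        break;
    case ONE_PIX_PER_PE:    /* classic one-to-one mapping */
        *pe_x = x; *pe_y = y; *local_slot = 0;
        break;
    case ONE_PIX_PER_4PE:   /* 4 PEs in a row form one 4-bit PE unit */
        *pe_x = x * 4; *pe_y = y; *local_slot = 0;
        break;
    }
}

int main(void)
{
    int px, py, slot;
    map_pixel(FOUR_PIX_PER_PE, 5, 3, &px, &py, &slot);
    printf("pixel (5,3) -> PE (%d,%d), slot %d\n", px, py, slot);
    return 0;
}
```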
Fig. 4(a) illustrates the case of storing the data of four pixels in one PE. The PE array then holds larger image frames (at one quarter of the frame count), and it takes four cycles for the PE array to finish a one-bit operation on the image. Although each PE stores the data of four neighboring pixels, the PE array can still finish pixel-parallel processing algorithms. Fig. 5 explains the operating principle of adding adjacent pixels in this case. In Fig. 5(a), each pixel operates with its bottom neighbor, in four steps. In the first three steps, the PE data are loaded from the PE's own memory and the pixel data are transferred intra-PE; for example, in the first step, pixel (1, 1) in PE (1, 1) operates with pixel (2, 1), stored in the same PE. In the last step, the PE data are loaded from the bottom neighbor PE, and the pixel data are transferred between different PEs. In Fig. 5(b), each pixel operates with its right neighbor, again in four steps; in every step, the PE data are loaded from the right neighbor PE. In this way, the PE array still achieves a uniform speed-up when processing a larger image, with each PE needing four times as many cycles to finish one pixel operation. Our dedicated compiler splits each image-pixel operation into multiple PE instructions to handle this mapping relationship.
For algorithms like motion detection and target tracking, usually only a small area of the image is of interest [13], [14], but very fast processing and the storage of multiple previous image frames may be required. In these cases, it is useful to let the PE array process only the area of interest and to store more image frames locally, reducing the I/O cost of external memory. The PE array can combine several PEs into a PE unit and use multiple PEs to store one pixel's data. As shown in Fig. 4(c), four PEs in a row are combined into one 4-bit PE, so that the PE array rapidly processes a smaller image and can store correspondingly more image frames.
Fig. 5. Schematics of PE operations when one PE stores 4 pixel data: (a) vertical pixel operation and (b) horizontal pixel operation.
Fig. 6. Methods to combine several PEs into a new one by using two carry flags: (a) one 1-bit PE operates with its left/right neighbors; (b) two PEs interact with their 2nd left/right neighbors and operate as one 2-bit-PE unit; (c) four PEs operate as one 4-bit-PE unit.
Fig. 6 illustrates the method of combining PEs. The PE function is altered by changing the connections of its input and output carries via the configuration bits cflag and cflag2 and the input multiplexer shown in Fig. 8. All PEs receive the same instructions, but different PEs can have different input-carry connections; by changing the two carry flag bits, the PE's input and output carries can take three different connection schemes. The carry flags are set at the initial configuration stage and are propagated through the PE array. In Fig. 6(a), the two carry flags are set to “00”, so the input carry of each PE is connected to its own output carry: each PE stores one pixel's data, each PE's output carry serves as its own input carry, and all PEs operate independently. In Fig. 6(b), the carry flags are set to “10”: the input carry of the second PE in the row is connected to the output carry of the first PE, and the input carry of the first PE is connected to the output carry of
the second PE. When performing addition, the carry output of the first PE serves as the input carry of the second PE, so the two 1-bit adders of the two PEs form a 2-bit adder. Therefore, two adjacent PEs can be viewed as a new “2-bit-PE” unit that stores one pixel's data and finishes a 2-bit operation per cycle.
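The carry chaining of Fig. 6(b) can be modeled with two 1-bit full adders in C; this is our sketch of the principle, not the PE netlist:

```c
#include <stdio.h>

/* Two 1-bit full adders chained through their carry bits, modeling two
 * PEs configured as one "2-bit-PE" unit (Fig. 6(b)): the output carry of
 * the first (LSB) PE feeds the input carry of the second (MSB) PE. */
static unsigned full_add(unsigned a, unsigned b, unsigned cin, unsigned *cout)
{
    *cout = (a & b) | (a & cin) | (b & cin);
    return a ^ b ^ cin;
}

static unsigned add2bit(unsigned a, unsigned b, unsigned *carry_out)
{
    unsigned c0, c1;
    unsigned s0 = full_add(a & 1, b & 1, 0, &c0);                /* first PE  */
    unsigned s1 = full_add((a >> 1) & 1, (b >> 1) & 1, c0, &c1); /* second PE */
    *carry_out = c1;  /* fed back to the first PE for the next 2-bit step */
    return (s1 << 1) | s0;
}

int main(void)
{
    unsigned c, s = add2bit(3, 2, &c);   /* 3 + 2 = 5 -> sum 01, carry 1 */
    printf("sum = %u, carry = %u\n", s, c);
    return 0;
}
```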
The PE array acts as an array of 2-bit PEs in this case, and each PE exchanges data with its second-left and second-right neighbors in the same row. Similarly, in Fig. 6(c), the output and input carries of four adjacent PEs are connected in series, and the four PEs form one “4-bit-PE” unit, so the PE array can be viewed as an array of 4-bit PEs. In this way, the PE array still achieves a uniform speed-up when processing a small image.
In summary, the flexible pixel–PE mapping relationship addresses the trade-off among algorithm complexity, processing speed, and memory size. The pixel–PE mapping is initialized according to the algorithm's requirements, and the PE array thereby achieves the same speed-up when handling images of different sizes.
3) On-Chip Sensor Feedback: The sensor array integration time, the PGA gain, and the ADC resolution are controlled by the MPU. The PE array allows quick extraction of image features, and it can work together with the MPU to enhance the dynamic range of the sensor; Fig. 7 shows the operating principle.
Fig. 7. Principle of on-chip sensor feedback. The PE array outputs image features, and the MPU controls the integration time and PGA gain to enhance image quality.
The PE array quickly detects global properties of the image, such as the maximum, minimum, and average pixel values, and reports them to the MPU; for example, the average pixel value can be calculated within 1 ms at a 100 MHz clock. If the pixel grayscale values are too low or too high on average, the MPU can 1) increase or decrease the integration time and 2) adjust the PGA gain, so that the sensor can work under different lighting conditions. The PE array can also synthesize multiple images taken with different exposure times; for example, it can store one image with a 0.5 ms integration time and another with a 0.1 s integration time and add the two to produce an image with better contrast and quality.
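A minimal sketch of this feedback loop in C, assuming hypothetical thresholds and step sizes (on the chip, the mean-brightness reduction is done by the PE/RP arrays; only the 0.5–4× PGA gain range is taken from the text):

```c
#include <stdio.h>
#include <stdint.h>
#include <stddef.h>

/* Illustrative MPU-side exposure feedback (Fig. 7). Thresholds, step
 * sizes, and all names here are our assumptions for illustration. */
typedef struct {
    uint32_t integration_us;   /* sensor integration time            */
    float    pga_gain;         /* programmable gain: 0.5, 1, 2, or 4 */
} sensor_cfg;

static uint8_t mean_brightness(const uint8_t *img, size_t n)
{
    uint64_t acc = 0;
    for (size_t i = 0; i < n; i++) acc += img[i];
    return (uint8_t)(acc / n);
}

static void adjust_exposure(sensor_cfg *cfg, uint8_t mean)
{
    const uint8_t lo = 64, hi = 192;   /* assumed target brightness band */
    if (mean < lo) {                   /* too dark: raise gain first     */
        if (cfg->pga_gain < 4.0f) cfg->pga_gain *= 2.0f;
        else cfg->integration_us += cfg->integration_us / 2;
    } else if (mean > hi) {            /* too bright: lower gain first   */
        if (cfg->pga_gain > 0.5f) cfg->pga_gain *= 0.5f;
        else cfg->integration_us -= cfg->integration_us / 4;
    }
}

int main(void)
{
    uint8_t img[8] = {10, 12, 9, 11, 14, 8, 13, 10};  /* dark test frame */
    sensor_cfg cfg = {500, 1.0f};
    adjust_exposure(&cfg, mean_brightness(img, 8));
    printf("integration = %u us, gain = %.1f\n", cfg.integration_us, cfg.pga_gain);
    return 0;
}
```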
Another application is selecting an area of interest. The MPU can first control the PE array to load the full-size image; then, by applying motion detection and/or edge extraction, it can detect the object skeleton in the image [20]. After that, the MPU can select the columns of interest and load only the selected part into the PE array in the next frame.
III. CIRCUIT DESIGN AND PROGRAMMING LANGUAGE
A. PE Circuit
Fig. 8 shows the PE circuit. Each PE consists of two pieces of SRAM and a 1-bit ALU. In each cycle, two bits, data1 and data2, are read from the left SRAM and the right SRAM, respectively. The data1 of each PE is connected to its 1st, 2nd, and 4th left neighbors, its 1st, 2nd, and 4th right neighbors, and its top and bottom neighbors; this connection scheme accelerates data transfer among the PEs, and the 2nd/4th connections can be used to combine several nearby PEs into a larger multiple-bit PE unit, as discussed in Section II-B. The first operand of the ALU is selected, via a multiplexer, from the data1 of the neighbor PEs and of the PE itself; the second operand is fetched from the right SRAM. The ALU output is written back to one of the SRAMs, and this read-modify-write of the data finishes in one cycle. A register stores the output carry of the ALU. The source of the ALU input carry is controlled by the registers cflag and cflag2: when cflag is 0, the input carry comes from the PE's own carry register; otherwise, it comes from a neighbor PE. By controlling the carry flags, PEs can be combined into a larger PE unit, as discussed in Section II.
The PE finishes arithmetic operations on two variables of arbitrary bit width by repeating 1-bit operations.
Fig. 8. Schematic of the PE. It consists of a 1-bit ALU and two memories. Each PE is connected with eight nearby neighbors. Two carry flags select the source of the input carry.
Fig. 9. (a) Flow of using the 1-bit PE to finish the addition of two 4-bit variables. (b) Conditional operations of the PE.
Fig. 9(a) illustrates the addition of two 4-bit variables. At the i-th cycle, the i-th bits of the two variables and the current carry produce the i-th sum bit and a new carry; in this way, multiple-bit operations are realized. The PE array executes a set of n-bit additions in n + 1 cycles: n cycles for the addition and one cycle for clearing the carry register.
Conditional operations are required in many applications; for example, one may want to update only the even or the odd rows of an image. Conditional operation of the PE is supported through a dedicated vflag register: the ALU output is written back to memory only when vflag is 1. Fig. 9(b) shows the function of vflag, with each PE interacting with its bottom neighbor. The operand of PE0 is transferred to PE1 and summed with PE1's local operand, and the result is stored in PE1, since the vflag of PE1 is 1. Similarly, the operand of PE1 is transferred to PE2 and summed, but the result is not stored, because the vflag of PE2 is 0. In this way, conditional operation is realized.
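The vflag-predicated write-back can be modeled in a few lines of C; the four-PE framing and values below are our illustration:

```c
#include <stdio.h>

#define NPE 4

/* Model of vflag-predicated write-back: every PE computes the same
 * operation (here, sum with the upper neighbor's operand), but only
 * PEs whose vflag is 1 commit the result to their local memory. */
int main(void)
{
    int mem[NPE]   = {1, 2, 3, 4};   /* each PE's local operand  */
    int vflag[NPE] = {0, 1, 0, 1};   /* per-PE write-enable flag */
    int result[NPE];

    for (int i = 1; i < NPE; i++)
        result[i] = mem[i - 1] + mem[i];   /* neighbor transfer + add */
    for (int i = 1; i < NPE; i++)
        if (vflag[i])                      /* conditional write-back  */
            mem[i] = result[i];

    for (int i = 0; i < NPE; i++) printf("%d ", mem[i]);
    printf("\n");                          /* prints: 1 3 3 7 */
    return 0;
}
```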
B. Memory Organization
Since each PE has its own local memory, the PE array contains a very large number of SRAM cells. The SRAM in the PE is designed to reduce area and to allow memory sharing between the PE array and the RP array. A 7-T SRAM cell is used to enable single-ended read and write of the memory; its schematic is shown in Fig. 10.
Fig. 10. Structure of the PE's 7-T SRAM cell.
Compared with a 6-T SRAM cell, a PMOS transistor is inserted between the two inverters. When WR is high, the feedback loop of the two inverters is disabled by the PMOS transistor, and the cell is written through the single-ended terminal din; when RD is high, the cell can be read. The ALU can read and write the memory simultaneously, provided that the read address and the write address differ.
Fig. 11 shows the memory interconnection, considering a 64-bit PE SRAM performing a read operation.
Fig. 11. Memory interconnection schematic of one row of PEs.
Eight address lines (controlled by Addr_PE_L1) run across the PE row and select 8 bits of data from each PE. These 8 bits are buffered and connected to two groups of pass-transistors. The first group is controlled by another eight lines (decoded from Addr_PE_L2) and selects one of the 8 bits as the data1 operand. The second group is controlled by lines decoded from Addr_ROW and selects one 8-bit datum from the whole row of PEs; the selected byte is connected to the row processor. In this way, the memories are shared by the PE array and the RP array. When the PE array is active, the PE address selects the PE operands, and the row of memories acts as a set of 64 × 1-bit PE memories, one per PE; when the RP array is active, the RP address selects the load data for the RP, and the same row of memories acts as one 8-bit-wide RP memory. All multiplexers here are realized by groups of pass-transistors to reduce circuit area, and the area increase due to the memory-sharing periphery circuits is not significant (approximately 5% of the total memory area).
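Functionally, the two views of the shared row memory can be sketched in C as follows (our own array layout and function naming, based on the address lines just described; no circuit timing is modeled):

```c
#include <stdio.h>
#include <stdint.h>

#define PES_PER_ROW 128
#define PE_MEM_BITS 64

/* One PE row's shared memory: 128 PEs x 64 bits each. The PE view reads
 * one bit per PE; the RP view reads one byte selected from the row. */
static uint8_t mem[PES_PER_ROW][PE_MEM_BITS / 8];

/* PE mode: Addr_PE_L1 picks a byte inside each PE, Addr_PE_L2 picks a
 * bit of that byte -- the data1 operand of the given PE. */
static int pe_read_bit(int pe, int addr_l1, int addr_l2)
{
    return (mem[pe][addr_l1] >> addr_l2) & 1;
}

/* RP mode: Addr_ROW picks which PE's byte is routed to the row processor. */
static uint8_t rp_read_byte(int addr_row, int addr_l1)
{
    return mem[addr_row][addr_l1];
}

int main(void)
{
    mem[5][3] = 0xA5;
    printf("PE view: bit = %d\n", pe_read_bit(5, 3, 0));   /* LSB of 0xA5 */
    printf("RP view: byte = 0x%02X\n", rp_read_byte(5, 3));
    return 0;
}
```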
C. Row Processor
Fig. 12 shows the schematic and function of a row processor: a simple 8-bit single-stage RISC processor whose data memory is provided by the SRAMs of a row of PEs. A more sophisticated processor structure could also be used on demand. The basic functions of the RP array are listed in Fig. 12: the RP can perform arithmetic operations, transfer data with its neighbors, and perform index addressing. All RPs are controlled by the same instructions but can use different memory addresses.
Fig. 12. Schematic and functions of the row processor.
Fig. 13. Schematics and functions of the APS sensor, CDS, PGA and ADC circuits.
This allows the RP array to finish row-parallel algorithms that cannot be finished by the PE array, like the image histogram and row-systolic operations [22]. In the chip, the RP array has two further functions: 1) registering the column pixel data output by the ADCs and 2) implementing data I/O with the MPU.
D. Analog Circuits
Fig. 13 shows the functions and schematics of a CMOS active pixel sensor (APS) and a row of its post-processing circuits, including the correlated double sampling (CDS), the PGA, and the ADC. The sensor pixel has a standard 3-T APS structure. The CDS is realized by controlling the opening of S1 and S2, and it reduces fixed-pattern pixel noise. The PGA consists simply of an operational amplifier and capacitors, to reduce area cost; its gain can be set to 0.5, 1, 2, or 4, and the PGA can be turned off by the MPU to save power. The ADC is a small-area single-slope comparator-type ADC [29]. During image loading, a ramp voltage is applied to the comparator and compared with the sampled pixel voltage while a digital counter increments every cycle; the counter updates the register value of an RP until the output of the comparator is 0.
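A behavioral C model of the single-slope conversion described above (the ramp step and full-scale voltage are our assumptions for an 8-bit range):

```c
#include <stdio.h>

/* Single-slope ADC model: a ramp rises one LSB per cycle while a counter
 * increments; the counter keeps updating the RP register until the
 * comparator trips (ramp >= input), so the register holds the code. */
static int single_slope_adc(double vin, double vref, int bits)
{
    int levels = 1 << bits;
    double lsb = vref / levels;
    int code = 0;
    for (int cycle = 0; cycle < levels; cycle++) {
        double ramp = cycle * lsb;
        if (ramp >= vin)      /* comparator output falls: stop updating */
            break;
        code = cycle + 1;     /* counter value latched into the RP */
    }
    return code;
}

int main(void)
{
    printf("code = %d\n", single_slope_adc(0.5, 1.0, 8)); /* prints 128 */
    return 0;
}
```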
E. Programming Language
To achieve high flexibility and to support the running of parallel processing algorithms on the processors, a parallel extended C-like language was designed and a dedicated compiler was developed. The compilation flow is shown in Fig. 14: the compiler first scans the code and separates the part to be executed on the MPU from the part to be executed on the PE array or the RP array; the MPU code is then compiled by a commercial C compiler, while the code for the parallel processors is processed by our dedicated compiler.
Compared with previous work on parallel programming languages and compilers for general processors [31]–[35] and for vision chips [36], our work has several distinctive features. First, the language is specifically designed for vision applications. Our goal is not to design a general parallel language, so we need not confront the usual difficulties of parallel languages: the basic data types can be viewed as image pixels and image rows, and the instructions for all pixels or rows are the same at any one time. The supported operations are mainly composed of image pixels operating with neighboring pixels, or image rows operating with adjacent rows. Second, the language has pixel-parallel variables (PE variables), row-parallel variables (RP variables), and non-parallel variables. Operations on the different variable types are compiled into instructions for the corresponding processors, and communication among the different variable types is handled automatically.
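Based on this description and on the account of Fig. 14 below, a fragment of the language might look like the following sketch. Only the PE_Var and RP_Var keywords and the neighbor-brace notation are documented; the exact assignment syntax and index convention are our guesses:

```c
/* Hypothetical fragment of the extended C-like parallel language. */
PE_Var a, b;          /* pixel-parallel variables on the 2D PE plane  */
RP_Var s;             /* row-parallel variable held in the RPs        */

b = a{-1,0} + a{1,0}; /* each pixel sums its left and right neighbors */
s = s{-1} + s{1};     /* each row combines with its adjacent rows     */
```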
Fig. 14. Code example of the parallel extended C-like parallel programming language.
Third, the compiler outputs instructions for 1-bit PEs (or for 2-bit/4-bit PE units if the pixel–PE mapping relationship is changed). The PE array finishes an n-bit operation within a number of cycles proportional to n, and the compiler converts each n-bit operation in the language into the corresponding PE instructions, as in the process shown in Fig. 5. The language has no fixed-length variables such as 8-bit integers: variables can have any length (1-bit, 3-bit, 9-bit) to reduce the cycle count and improve efficiency, with longer variables requiring proportionally more processing cycles. Last, the programming language directly expresses operations on the image (its pixels), not on the PEs, which reduces programming complexity: the user does not have to care about the different mapping relationships between the image and the PE array discussed in Section II-B.
A code example is given in Fig. 14. PE operations are realized through PE variables, declared with the PE_Var keyword; RP variables are declared with the RP_Var keyword. PE variables represent pixel-parallel data distributed over a 2D plane, and the braces after a variable fetch data from neighboring pixels; for example, one command adds the variable of each pixel to those of its left and right neighbors and stores the result in a new variable, performed in parallel on all PEs. The compiler supports common arithmetic operations and conditional operations of the PEs. For the RPs, array RP-variable declarations are supported, along with arithmetic operations and memory index addressing of the RP array; the left/right neighbor RP is likewise reached via the braces.
IV. VLSI IMPLEMENTATION
Fig. 15. Microphotograph of the prototype vision chip.
TABLE I CHIP SPECIFICATIONS
A prototype vision chip was fabricated in a 0.18 μm one-poly six-metal (1P6M) standard CMOS technology with 3.3 V and 1.8 V supply voltages. Fig. 15 shows the chip photograph and circuit blocks; the chip size is 5 × 2.6 mm². The chip consists of a 128 × 128 sensor array, 128 PGAs, 128 ADCs, 32 RPs, a 128 × 32 PE array, an 8051 MPU, an on-chip program memory, and periphery circuits. Table I summarizes the chip specifications.
In the sensor array, an n-well/p-sub SAB diode without salicide is used as a high-sensitivity photodiode to achieve a high frame rate (over 1000 fps) [14]. The single-slope ADC has 8-bit resolution. Each PE has a 64-bit left SRAM, an 8-bit right SRAM, and a 1-bit ALU. The PE layout is fully designed by hand; the area of the 7-T memory cell is 11 μm². The memory array occupies less area than a standard 6-T SRAM array, owing to the absence of decoders and sense amplifiers, though its power consumption is higher than 6-T SRAM due to increased leakage current. The memory sharing between the RP array and the PE array reduces the total chip area by 35% (4.6 mm²) compared with using separate memories for the PE array and the RP array; about 1000 pass transistors per row implement the memory and PE interconnections. Note that the shared memory occupies 65% of the PE array area. The arithmetic parts of the PE array, the RP array, and the MPU occupy 2.4 mm², 0.9 mm², and 0.3 mm², respectively; these areas are approximately proportional to their workloads. The critical paths of the PE array, the RP array, and the MPU are 5.2 ns, 6.5 ns, and 8.6 ns, respectively. The PE array working frequency is mainly limited by the clock-tree delay, and the maximum working frequency of the single-ended 7-T memory cell is 350 MHz. The maximum operating frequency of the whole chip is mainly limited by the 8051 MPU.
Fig. 16. Software tools and PCB development board of the vision chip.
Fig. 17. Experimental results of background subtraction. The bottom row shows the subtracted images.
The chip operates at 100 MHz, 115 MHz, and 75 MHz with 1.8 V, 2.0 V, and 1.6 V supply voltages, respectively. At 100 MHz, the total power dissipation is 350–450 mW, with the analog circuits contributing 30–40% of it. At a 100 MHz operating frequency, the chip's peak performance is 44 GOPS when performing 8-bit arithmetic operations.
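The peak figure is consistent with a simple estimate (this cross-check is ours, not the authors' derivation): with all 32 × 128 = 4096 PEs active and 9 cycles per 8-bit operation (Section III),

```latex
4096\ \text{PEs} \times \frac{10^{8}\ \text{cycles/s}}{9\ \text{cycles/op}}
  \approx 4.6 \times 10^{10}\ \text{OPS} \approx 45\ \text{GOPS},
```

close to the quoted 44 GOPS peak.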
Fig. 16 shows the software tools and the test board of the vision chip. The MPU in the chip controls the actuator at the bottom of the board to realize applications like target tracking. Thanks to the two levels of parallel processors, the chip can perform low-level, mid-level, and high-level vision algorithms at high speed.
Fig. 17 shows the experimental results of simple background subtraction: the background subtraction of the captured 128 × 128 image is finished in 48 cycles. The top images are the original images, acquired at 200 fps, and the bottom images are the subtracted ones.
As an example of a mid-level image processing algorithm, the flow of the 8-point radix-2 FFT algorithm is explained in Fig. 18. The algorithm has three “butterfly” levels, and Fig. 18 shows the operation steps of the first level. In the first step, the RP array adjusts the order of the inputs according to the butterfly diagram, moving the variable pairs of the FFT to adjacent PEs so that, in the next steps, the PE array can process these variables in parallel. In the next three steps, the multiplication, addition, and subtraction of the adjusted inputs and intermediate variables are performed in 2D-parallel fashion by the PE array, which accelerates the FFT; the multiplication consumes the largest number of cycles. The order of the outputs of this butterfly level is then adjusted again by the RP array for the next level.
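For reference, the dataflow of Fig. 18 corresponds to the standard radix-2 decimation-in-time FFT. The plain-C model below (our sketch, not the chip's microcode) performs the reordering pass that the RPs handle and the three butterfly levels that the PE array handles:

```c
#include <stdio.h>
#include <math.h>
#include <complex.h>

#define N 8  /* three butterfly levels: log2(8) = 3 */

/* Radix-2 DIT FFT model of Fig. 18: bit-reversal reordering (done by
 * the RPs on the chip) followed by butterfly levels whose multiply,
 * add, and subtract steps the PE array executes in parallel. */
static void fft8(double complex x[N])
{
    for (unsigned i = 0, j = 0; i < N; i++) {   /* bit-reversal reorder */
        if (i < j) { double complex t = x[i]; x[i] = x[j]; x[j] = t; }
        unsigned m = N >> 1;
        while (m && (j & m)) { j ^= m; m >>= 1; }
        j |= m;
    }
    for (int len = 2; len <= N; len <<= 1) {    /* butterfly levels */
        double complex w = cexp(-2.0 * I * M_PI / len);
        for (int i = 0; i < N; i += len) {
            double complex wk = 1.0;
            for (int k = 0; k < len / 2; k++) {
                double complex a = x[i + k];
                double complex b = wk * x[i + k + len / 2]; /* multiply */
                x[i + k]           = a + b;                 /* add      */
                x[i + k + len / 2] = a - b;                 /* subtract */
                wk *= w;
            }
        }
    }
}

int main(void)
{
    double complex x[N] = {1, 1, 1, 1, 0, 0, 0, 0};
    fft8(x);
    for (int i = 0; i < N; i++)
        printf("X[%d] = %+.3f %+.3fi\n", i, creal(x[i]), cimag(x[i]));
    return 0;
}
```

On the chip, each level's multiply/add/subtract runs across all PEs at once, whereas this scalar model loops over them.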
Fig. 18. Using both the PEs and the RPs to finish the first level of the 8-point radix-2 FFT algorithm.
Fig. 19. (a) Flow of the PPED pattern extraction algorithm. (b) Experimental results of the PPED algorithm applied to a human face. (c) Results of the PPED algorithm applied to a human hand.
Fig. 19 shows an example of a high-level image-feature-extraction algorithm. We adopted the projected principal-edge distribution (PPED) algorithm [25] as an example feature-extraction method. PPED uses directional edge information to form the feature vector of an image, and it has been successfully applied to handwritten character recognition and face detection [25]. Its processing flow is briefly described in Fig. 19(a). The first step computes the gradient maps in four directions. The second step determines the threshold for edge detection using a 5 × 5 median kernel. The third step applies edge detection to generate four 128 × 128 binary edge maps. These first three steps are all finished by the PE array in parallel. Finally, a 64-dimension vector is formed by concatenating the four histograms of the edge maps in the four directions within a 64 × 64 recognition window; this step is finished jointly by the PE array (summing the edges in each row) and the RP array (combining different rows), and it takes about 6000 cycles. The 128 × 128 image has 4096 recognition windows, and this vector-formation process runs simultaneously on all windows, with the vectors of different windows stored locally in different PEs. The whole feature-vector generation for the 128 × 128 image finishes in 12 000 cycles, or 0.12 ms at 100 MHz. Experimental results are shown in Fig. 19(b) and (c): the acquired image, the threshold value for edge detection, the edge maps in four directions within a selected recognition window, and the output feature vectors. The feature vectors can then be fetched by the MPU for image recognition. In comparison, an ASIC designed specifically for the PPED algorithm generates the vector of a single window in 0.64 μs [37]; scanning a 128 × 128 image would take 2.6 ms, so our vision chip is 22 times faster, owing to the massively parallel PEs working with the mid-grained RPs.
Table II summarizes the chip performance on low-, mid-, and high-level image processing algorithms. All input variables have 8-bit precision. The cycle count per pixel/block, the distribution of cycles among the PE array, RP array, and MPU, and the measured processing time for the captured 128 × 128 8-bit image are shown. Low-level image processing is finished mainly by the PE array; the pixel-parallel and row-parallel parts of mid-level algorithms are finished by the PE array and the RP array, respectively.
The last column in Table II shows the cycle-count improvement of the proposed two-level parallel architecture over a 1D SIMD mid-grained row-parallel architecture. The improvement for image pre-processing and vision applications is significant (ranging from 50% to 900%). The improvement for FFT, DCT, and FIR appears less significant (ranging from 5% to 48%), because FFT and DCT involve a large number of multiplications and the current chip has no dedicated multipliers; a mid-grained processor is superior to the PEs at multiplication. Nevertheless, the chip finishes a 128-point FFT in an equivalent time of 1.8 μs, which is comparable to the fine-grained massively parallel MATRIX chip [23], the mid-grained parallel IMAP-CE chip [26], and the FFTE algorithm based on Intel's Streaming SIMD Extensions 3 (SSE3) instructions running on a dual-core Intel Core2 Duo processor [38]. In Table II, the target-tracking algorithm used the 1-pixel-to-4-PE mapping; the image segmentation and image histogram used the 4-pixel-to-1-PE mapping; the character recognition and PPED extraction used a 2-pixel-to-1-PE mapping; the other algorithms used the 1-pixel-to-1-PE mapping. The flexible mapping relationship increases flexibility and decreases the complexity of realizing different algorithms.
Table III compares the chip with other recently reported vision chips. Our vision chip has two levels of parallel SIMD processors, and the integration of the MPU greatly enhances flexibility.
TABLE II CHIP PERFORMANCES IN REALIZING DIFFERENT LEVELS OF VISION ALGORITHMS
TABLE III COMPARISON WITH PREVIOUS VISION CHIPS
The area and energy efficiency of the chips can be expressed by two parameters: GOPS/mm² and GOPS/W (OPS counts operations on the chip's native data type; for our chip, 8-bit operations). The chip achieves 3.4 GOPS/mm² and 97.8 GOPS/W. Compared with analog vision chips, the proposed chip can handle more complicated algorithms and applications; compared with previous digital vision chips, it offers better flexibility and performance. Another difference is that the proposed chip does not integrate a PE with each pixel: the sensor array and the PE array are separated, which favors a high sensor fill factor, strong PE function, and processing flexibility.
The proposed chip architecture has good scalability.
Thanks to the flexible mapping relationship between the sensor array and the processors, it is possible to increase the sensor resolution and the PE array size separately while maintaining the chip architecture, and the operation performance increases linearly with the number of PEs and RPs. One weakness of the chip is its inefficiency at multiplication. We plan to address this in the next version by integrating more powerful row processors and replacing the 8051 MPU with a 32-bit general-purpose processor, while keeping the chip architecture the same.
V. CONCLUSION
A vision chip based on multiple levels of SIMD parallel processors was proposed. The CMOS sensor, two levels of parallel processors, and the MPU were integrated into a single
chip to form a system-on-a-chip. The fine-grained PEs finish pixel-parallel image processing algorithms, while the row processors finish row-parallel image processing algorithms. The PE array has a reconfigurable mapping relationship with the sensor's image. The multiple levels of parallel processors and the flexible PE array greatly enhance the flexibility of the vision chip and make it suitable for different vision applications. A dedicated parallel programming language and compiler for the vision chip were developed. A prototype chip was designed and fabricated in a 0.18 μm CMOS technology; it can finish low-level to high-level image processing algorithms, and image acquisition plus feature extraction can be finished within 1 ms.
REFERENCES
[1] R. C. Gonzalez and R. E. Woods, Digital Image Processing, 2nd ed. Upper Saddle River, NJ: Pearson Education, 2002. [2] K. Aizawa, “Computational sensors – vision VLSI,” IEICE Trans. Inf. Syst., vol. E82-D, no. 3, pp. 580–588, 1999. [3] M. Ishikawa, K. Ogawa, T. Komuro, and I. Ishii, “A CMOS vision chip with SIMD processing element array for 1 ms image processing,” presented at the IEEE Int. Solid-State Circuits Conf. (ISSCC), San Francisco, CA, 1999, Paper No. TP 12.2. [4] C.-Y. Wu and C.-F. Chiu, “A new structure of the 2D silicon retina,” IEEE J. Solid-State Circuits, vol. 30, no. 8, pp. 890–897, Aug. 1995. [5] A. Rodriguez-Vazquez et al., “ACE16k: The third generation of mixed-signal SIMD-CNN ACE chips toward VSoCs,” IEEE Trans. Circuits Syst. I, vol. 51, no. 5, pp. 851–863, May 2004. [6] Y. M. Chi, U. Mallik, M. A. Clapp, E. Choi, G. Cauwenberghs, and R. Etienne-Cummings, “CMOS camera with in-pixel temporal change detection and ADC,” IEEE J. Solid-State Circuits, vol. 42, no. 10, pp. 2187–2196, Oct. 2007. [7] J. Dubois, D. Ginhac, M. Paindavoine, and B. Heyrman, “A 10 000 fps CMOS sensor with massively parallel image processing,” IEEE J. Solid-State Circuits, vol. 43, no. 3, pp. 706–717, Mar. 2008. [8] E. Funatsu, Y. Nitta, Y. Miyake, T. Toyoda, J. Ohta, and K. Kyuma, “An artificial retina chip with current-mode focal plane image processing functions,” IEEE Trans. Electron Devices, vol. 44, no. 10, pp. 1977–1982, Oct. 1997. [9] V. M. Brea, D. L. Vilariño, A. Paasio, and D. Cabello, “Design of the processing core of a mixed-signal CMOS DTCNN chip for pixel-level snakes,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 51, no. 5, pp. 997–1013, May 2004. [10] D. Kim, J. Cho, S. Lim, D. Lee, and G. Han, “A 5000 S/s single-chip smart eye-tracking sensor,” in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, 2008, pp. 46–594. [11] P. Dudek and P. J. Hicks, “A general-purpose processor-per-pixel analog SIMD vision chip,” IEEE Trans. Circuits Syst. I, vol. 52, no. 1, pp. 13–20, Jan. 2005. [12] N. Massari and M. Gottardi, “A 100 dB dynamic-range CMOS vision sensor with programmable image processing and global feature extraction,” IEEE J. Solid-State Circuits, vol. 42, no. 3, pp. 647–657, Mar. 2007. [13] A. Moini, A. Bouzerdoum, K. Eshraghian, A. Yakovleff, X. T. Nguyen, A. Blanksby, R. Beare, D. Abbott, and R. E. Bogner, “An insect vision-based motion detection chip,” IEEE J. Solid-State Circuits, vol. 32, no. 2, pp. 279–284, Feb. 1997. [14] W. Miao, Q.-Y. Lin, and N.-J. Wu, “A novel vision chip for high-speed target tracking,” Jpn. J. Appl. Phys., vol. 46, no. 4B, Apr. 2007. [15] W. D. Leon-Salas, S. Balkir, K. Sayood, N. Schemm, and M. W. Hoffman, “A CMOS imager with focal plane compression using predictive coding,” IEEE J.
Solid-State Circuits, vol. 42, no. 11, pp. 2555–2572, Nov. 2007. [16] Y. Oike, M. Ikeda, and K. Asada, “A 375×365 high-speed 3-D range-finding image sensor using row-parallel search architecture and multisampling technique,” IEEE J. Solid-State Circuits, vol. 40, no. 2, pp. 444–453, Feb. 2005. [17] T. Komuro, A. Iwashita, and M. Ishikawa, “A QVGA-size pixel-parallel image processor for 1,000-fps vision,” IEEE Micro, vol. 29, no. 6, pp. 58–67, Nov.–Dec. 2009.
[18] W. Miao, Q. Lin, W. Zhang, and N.-J. Wu, “A programmable SIMD vision chip for real-time vision applications,” IEEE J. Solid-State Circuits, vol. 43, no. 6, pp. 1470–1479, Jun. 2008. [19] T. Komuro, S. Kagami, and M. Ishikawa, “A dynamically reconfigurable SIMD processor for a vision chip,” IEEE J. Solid-State Circuits, vol. 39, no. 1, pp. 265–268, Jan. 2004. [20] Q.-Y. Lin, W. Miao, W.-C. Zhang, Q.-Y. Fu, and N.-J. Wu, “A 1000 frames/s programmable vision chip with variable resolution and row-pixel-mixed parallel image processors,” Sensors, vol. 9, pp. 5933–5951, 2009. [21] K. Yamaguchi, Y. Watanabe, T. Komuro, and M. Ishikawa, “Design of a massively parallel vision processor based on multi-SIMD architecture,” in Proc. IEEE Int. Symp. Circuits and Systems (ISCAS 2007), May 2007, pp. 3498–3501. [22] S. Kyo, S. Okazaki, and T. Arai, “An integrated memory array processor for embedded image recognition systems,” IEEE Trans. Comput., vol. 56, no. 5, pp. 622–634, May 2007. [23] H. Noda, M. Nakajima, K. Dosaka, K. Nakata, M. Higashida, O. Yamamoto, K. Mizumoto, T. Tanizaki, T. Gyohten, Y. Okuno, H. Kondo, Y. Shimazu, K. Arimoto, K. Saito, and T. Shimizu, “The design and implementation of the massively parallel processor based on the matrix architecture,” IEEE J. Solid-State Circuits, vol. 42, no. 1, pp. 183–192, Jan. 2007. [24] J. D. Owens, “Computer graphics on a stream architecture,” Ph.D. dissertation, Stanford University, Stanford, CA, 2002. [25] M. Yagi and T. Shibata, “An image representation algorithm compatible to neural-associative-processor-based hardware recognition systems,” IEEE Trans. Neural Networks, vol. 14, no. 5, pp. 1144–1161, Sep. 2003. [26] S. Kyo, T. Koga, S. Okazaki, and I. Kuroda, “A 51.2-GOPS scalable video recognition processor for intelligent cruise control based on a linear array of 128 four-way VLIW processing elements,” IEEE J. Solid-State Circuits, vol. 38, no. 11, pp. 1992–2000, Nov. 2003. [27] A. Abbo, R. Kleihorst, V. Choudhary, L. Sevat, P. Wielage, S. Mouy, and M. Heijligers, “XETAL-II: A 107 GOPS, 600 mW massively-parallel processor for video scene analysis,” in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, 2007, pp. 270–271, Paper No. 15.1. [28] P. P. Jonker, “Why linear arrays are better image processors,” in Proc. 12th IAPR Int. Conf. Pattern Recognition, 1994, vol. 3, pp. 334–338. [29] S. Kleinfelder, S. Lim, X. Liu, and A. El Gamal, “A 10 000 frames/s CMOS digital pixel sensor,” IEEE J. Solid-State Circuits, vol. 36, no. 12, pp. 2049–2059, Dec. 2001. [30] S. P. VanderWiel, D. Nathanson, and D. J. Lilja, “Complexity and performance in parallel programming languages,” in Proc. 2nd Int. Workshop on High-Level Programming Models and Supportive Environments, Apr. 1997, pp. 3–12. [31] S. Kyo, T. Koga, S. Okazaki, and I. Kuroda, “A programmable parallel processor LSI for video-based driver assistance systems,” in Proc. 2003 IEEE Intelligent Transportation Systems, Oct. 2003, vol. 1, pp. 257–262. [32] G. Kahn, “The semantics of a simple language for parallel programming,” in Information Processing, J. L. Rosenfeld, Ed. Amsterdam, The Netherlands: North-Holland, 1974, pp. 471–475. [33] C. Kozyrakis, D. Judd, J. Gebis, S. Williams, D. Patterson, and K. Yelick, “Hardware/compiler codevelopment for an embedded media processor,” Proc. IEEE, vol. 89, no. 11, pp. 1694–1709, Nov. 2001. [34] P. R. Mattson, “A programming system for the Imagine media processor,” Ph.D. dissertation, Stanford University, Stanford, CA, 2002. [35] S. Kyo, S.
Okazaki, and I. Kuroda, “An extended C language and a SIMD compiler for efficient implementation of image filters on media-extended micro-processors,” in Proc. Advanced Concepts for Intelligent Vision Systems (ACIVS 2003), Ghent, Belgium, Sep. 2003, pp. 234–241. [36] T. Komuro, S. Kagami, M. Ishikawa, and Y. Katayama, “Development of a bit-level compiler for massively parallel vision chips,” in Proc. IEEE 7th Int. Workshop on Computer Architecture for Machine Perception (CAMP’05), Jul. 2005, pp. 204–209. [37] H. Yamasaki and T. Shibata, “A real-time image-feature-extraction and vector-generation VLSI employing arrayed-shift-register architecture,” IEEE J. Solid-State Circuits, vol. 42, no. 9, pp. 2046–2053, Sep. 2007. [38] D. Takahashi, “Implementation and evaluation of parallel FFT using SIMD instructions on multi-core processors,” in Proc. Int. Workshop on Innovative Architecture for Future Generation Processors and Systems, 2007, pp. 53–59.
Wancheng Zhang was born in Tianjin, China, in April 1985. He received the B.S. degree in physics from Peking University, Beijing, China, in 2004, and the Ph.D. degree in microelectronics from the Institute of Semiconductors, Chinese Academy of Sciences, Beijing, in 2009.
Qiuyu Fu was born in Hebei, China, in 1982. She received the B.S. degree in electronic science and technology from Beijing Jiaotong University, China, in 2005. She has been pursuing the Ph.D. degree at the State Key Laboratory for Superlattices and Microstructures, Institute of Semiconductors, Chinese Academy of Sciences, Beijing, China. Her research interests are in the field of CMOS integrated optical sensors and real-time vision chip design.
Nan-Jian Wu (M’06) was born in Zhejiang, China, on February 27, 1961. He received the B.S. degree in physics from Heilongjiang University, Harbin, China, in 1982, the M.S. degree in electronic engineering from Jilin University, Changchun, China, in 1985, and the Ph.D. degree in electronic engineering from the University of Electro-Communications, Tokyo, Japan, in 1992. In 1992, he joined the Research Center for Interface Quantum Electronics and the Faculty of Engineering, Hokkaido University, Sapporo, Japan, as a Research Associate. In 1998, he was an Associate Professor in the Department of Electro-Communications of the University of Electro-Communications. Since 2000, he has been a Professor at the Institute of Semiconductors, Chinese Academy of Sciences, Beijing. In 2005, he visited the Research Center for Integrated Quantum Electronics, Hokkaido University, as a Visiting Professor. His research is in the field of semiconductor quantum devices and circuits and the design of analog–digital mixed-signal LSI.