A 345 mW Heterogeneous Many-Core Processor With an Intelligent Inference Engine for Robust Object Recognition

Seungjin Lee, Student Member, IEEE, Jinwook Oh, Student Member, IEEE, Junyoung Park, Student Member, IEEE, Joonsoo Kwon, Student Member, IEEE, Minsu Kim, Student Member, IEEE, and Hoi-Jun Yoo, Fellow, IEEE
Abstract—A heterogeneous many-core object recognition processor is proposed to realize robust and efficient object recognition on real-time video of cluttered scenes. Unlike previous approaches that simply aimed for high GOPS/W, we aim to achieve high Effective GOPS/W, or EGOPS/W, which only counts operations carried out on meaningful regions of an input image. This is achieved by the Unified Visual Attention Model (UVAM), which confines complex Scale Invariant Feature Transform (SIFT) feature extraction to meaningful object regions while rejecting meaningless background regions. The Intelligent Inference Engine (IIE), a mixed-mode neuro-fuzzy inference system, performs the top-down familiarity attention of the UVAM, which guides attention toward pre-learned objects. Weight perturbation-based learning of the IIE ensures high attention precision through online adaptation. The SIFT recognition is accelerated by an optimized array of four 20-way SIMD Vector Processing Elements, 32 MIMD Scalar Processing Elements, and one Feature Matching Processor. When processing 30 fps 640 × 480 video, the 50 mm² object recognition processor implemented in a 0.13 μm process achieves 246 EGOPS/W, which is 46% higher than the previous work. The average power consumption is only 345 mW.

Index Terms—Multi-core processor, network-on-chip, neuro-fuzzy logic, object recognition, visual attention.
I. INTRODUCTION
ROBUST object recognition is the key component in vision-based applications such as augmented reality [1], content-based image retrieval, and intelligent robots. In these applications, local descriptor matching object recognition algorithms, such as the Scale Invariant Feature Transform (SIFT) [2], are widely used due to their invariance to scaling, rotation, and illumination. However, the huge number of computations required by the multiple cascaded transformations makes it difficult to achieve real-time performance on a general purpose processor, especially on a power-constrained battery-powered platform. Even on a modern PC, the SIFT algorithm performs on the order of 1 frame per second for 640 × 480 pixel images.
Object recognition can be accelerated by integrating a large number of processing elements that operate on the image in parallel [3]–[7]. General purpose multi-core processors such as the Stream Processor [3] are very flexible but consume high power. In [4], a multi-core processor optimized for image processing achieved similar performance for about one quarter of the power. In [5], 8 processors are integrated to accelerate SIFT feature extraction on 320 × 240 pixel images. Real-time object recognition is achieved on 320 × 240 pixel images in [6] by dividing the input image into 8 columns, each processed in parallel by eight 8-way SIMD processors. Reference [7] achieves 30 fps performance on 640 × 480 images by doubling the SIMD processor count to 16 and employing a tile-based approach. Simply increasing the number of processing elements, however, comes at the cost of higher power consumption. Visual attention, which confines processing to regions containing meaningful objects, can help control the power consumption. In [6], saliency-based visual attention [8] was accelerated by a visual attention engine (VAE) [9] to select regions that contain conspicuous points in the image. The region selection was improved in [7] with the help of a neuro-fuzzy hardware accelerated region growing scheme [10]. The limitation of saliency-based visual attention is that it relies only on feed-forward bottom-up features to select object regions, under the assumption that objects are more salient than the background. If the background is more salient, then the attention precision will be reduced, leading to wasted computations on background regions. In order to achieve high attention precision even for scenes with salient backgrounds, we presented a new algorithm, the Unified Visual Attention Model (UVAM) [11], shown in Fig. 1, which incorporates the familiarity map on top of the saliency map for the search of attentive points. It can cross-check the accuracy of attention deployment by combining top-down attention, which searches for “meaningful objects”, and bottom-up attention, which just looks for conspicuous points. In this paper, a heterogeneous many-core processor [12] is presented to realize the UVAM algorithm at a 30 fps frame rate for object recognition of cluttered scenes. The proposed processor exploits three key features. First, the analog-digital mixed-mode Intelligent Inference Engine (IIE) [13] accurately distinguishes target objects from clutter using the adaptive neuro-fuzzy inference system (ANFIS) [14] to improve the accuracy of the attention feedback. Second, 4 feature extraction clusters (FEC) composed of 4 SIMD vector processing elements (VPE) and 32 MIMD scalar processing elements (SPE) with hierarchical
Fig. 1. Attention recognition loop of the unified visual attention map.
task management accelerate the feature detection and description generation stages. Third, per-frame power mode control based on workload prediction by the IIE minimizes power consumption.

It is not fair to compare object recognition chips in terms of the traditional power efficiency, which is obtained by dividing the peak performance, measured in GOPS, by the peak power consumption, measured in Watts, because the peak performance includes meaningless operations on background regions. We propose the recognition effectiveness as a fairer metric for comparing the performance of object recognition chips. It is obtained by multiplying the attention precision and the power efficiency, so that only operations on meaningful object regions are accounted for. Here, attention precision is defined as the ratio of object area within the total area selected by attention. The unit for recognition effectiveness is EGOPS/W, or effective Giga-operations per second per Watt. By measuring the recognition effectiveness, we are able to consider not only the brute parallel performance but also the intelligence of the chip.
Fig. 2. (a) 3 steps of SIFT-based object recognition and (b) tile-based object recognition.
II. ALGORITHM

A. Tile-Based Parallel Object Recognition

Local image patch description features such as the Scale Invariant Feature Transform (SIFT) [2] features have been shown to perform well in general object recognition tasks. Object recognition using SIFT features consists of feature detection, feature description, and feature matching, as shown in Fig. 2(a). Since SIFT features can be extracted from a local image region, SIFT-based object recognition of a 640 × 480 pixel image can be decomposed into independent tasks on 300 32 × 32 pixel tiles, as shown in Fig. 2(b). Thus it is possible to accelerate SIFT-based object recognition by executing multiple recognition threads on different tiles in parallel. In the tile-based approach, each image tile can be seen as a unit of attention deployment. In the example shown in Fig. 2(b), out of a total of 300 tiles, only 114 tiles containing meaningful objects are selected for further processing, which amounts to a 62% reduction in the computation workload. However, this reduction would be meaningless if the selected tiles did not contain the target object.
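As a rough software illustration of this tile arithmetic (a minimal Python sketch; the image, tile, and selection counts follow the text, while the helper name is ours):

# Tile-based decomposition of a 640 x 480 frame into 32 x 32 tiles.
# 640/32 = 20 columns and 480/32 = 15 rows give the 300 tiles above;
# selecting 114 object tiles skips 62% of the per-tile workload.

IMG_W, IMG_H, TILE = 640, 480, 32

def tile_origins(w=IMG_W, h=IMG_H, t=TILE):
    """Return the (x, y) origin of every non-overlapping tile."""
    return [(x, y) for y in range(0, h, t) for x in range(0, w, t)]

tiles = tile_origins()
assert len(tiles) == 300                  # 20 x 15 grid

selected = 114                            # attention-selected tiles, Fig. 2(b)
print(f"workload reduction: {1 - selected / len(tiles):.0%}")   # -> 62%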
Fig. 3. Block diagram of the proposed heterogeneous many-core processor.
In tile-based parallel object recognition, the attention precision can be defined as

$$\text{Attention Precision} = \frac{\text{number of selected tiles containing target objects}}{\text{total number of selected tiles}}. \tag{1}$$

Using this definition of attention precision, the recognition effectiveness is defined as

$$\text{Recognition Effectiveness [EGOPS/W]} = \text{Attention Precision} \times \text{Power Efficiency [GOPS/W]}. \tag{2}$$
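As a worked example using the measurements reported later in Section V: a power efficiency of 324 GOPS/W combined with the measured 76% attention precision gives 324 × 0.76 ≈ 246 EGOPS/W, whereas the 58% precision of saliency-only attention would retain barely more than half of the raw GOPS/W as effective operations.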
B. Unified Visual Attention Model

Visual attention in humans selects only the relevant information among the massive amount of data that enters our brain through the retina [15]. This is necessary because 1) the visual cortex (the portion of our brain responsible for vision perception) has limited computational capacity and 2) information that is irrelevant to the task at hand degrades the performance for that task. Similarly, visual attention can be used in computer object recognition to 1) reduce the computational burden and 2) improve recognition accuracy by filtering away background data. In this work, we adopt the Unified Visual Attention Model (UVAM) [11], which combines bottom-up and top-down visual attention to achieve high visual attention precision. Object recognition is performed on unit tiles to facilitate selective execution of object-containing tiles, which are selected by visual attention.
The usefulness of visual attention to object recognition depends on its ability to select regions containing target objects while rejecting regions containing background clutter. The UVAM is outlined in Fig. 1. The bottom-up information, which is encoded in the saliency map [8], integrates the conspicuity information of the low-level features of color, intensity, orientation, and motion. The top-down component is represented by the familiarity map, which contains information on how “familiar” certain regions of the image are to the recognition system. Compared to previous works employing visual attention in object recognition [6], [7], the UVAM adds a top-down attention feedback loop to improve attention precision. Visual attention based on bottom-up saliency cannot distinguish salient backgrounds from salient objects and thus performs poorly when the background contains salient points. However, the top-down attention, newly introduced in the UVAM, utilizes familiarity information fed back from object recognition results to identify potential objects in the scene and concentrate processing on those areas. Familiarity evaluation is performed by neuro-fuzzy inference using size, orientation, and motion as clues, as well as the features stored in the database. A potentially beneficial side effect of the UVAM is the large difference between the minimum and maximum workloads. While exact numbers will vary depending on image content, in the tested VGA video sequences a large portion of the frames required less than 25% of the maximum processing capacity. Since the hardware design must meet the worst case maximum workload, this opens up the possibility of very large power savings through aggressive voltage and frequency scaling.
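The tile-selection idea can be sketched in software as follows (a simplified model under our own assumptions: the two maps are reduced to per-tile scores and combined with fixed placeholder thresholds, whereas the exact UVAM combination rule is given in [11]):

import numpy as np

# Per-tile bottom-up saliency and top-down familiarity scores in [0, 1].
# On the chip, saliency comes from the feed-forward attention units and
# familiarity is fed back by the IIE; random arrays stand in for both
# over the 15 x 20 tile grid of a VGA frame.
rng = np.random.default_rng(0)
saliency = rng.random((15, 20))
familiarity = rng.random((15, 20))

# Illustrative combination: attend a tile if either the bottom-up or the
# top-down evidence is strong enough.
attended = (saliency > 0.7) | (familiarity > 0.6)
print(f"{attended.sum()} of {attended.size} tiles selected for SIFT processing")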
Fig. 4. The Vector Processing Element.

Fig. 5. The Scalar Processing Element.
III. CHIP ARCHITECTURE

The overall block diagram of the heterogeneous many-core processor is shown in Fig. 3. A total of 51 IPs are connected by a hierarchical star + ring Network-on-Chip (NoC) [16] and organized into two layers: the cognitive control layer (CCL), which performs global attention and power management functions, and the parallel processing layer (PPL), which performs SIFT feature extraction and matching. The CCL consists of the IIE, a RISC host processor, the power mode controller (PMC), and fixed function units for accelerating feed-forward visual attention such as the VAE2, Motion Estimator (ME), and the Stereo Correspondence Processor (STCP). The PPL consists of 4 FECs for feature detection and description generation, and 1 feature matching processor (FMP) for feature matching.

The proposed chip reuses several features from previous generations of object recognition chip research by our group [5]–[7]. The host RISC processor is reused from [7] with some modifications to the cache for reduced latency. The VAE2, which performs low-level saliency-based visual attention, is adopted from the VAE [9] first implemented in [6]. The Motion Estimator (ME), first implemented in [7] to extract dynamic features, is also reused. The NoC protocol is practically identical to [7] except that the flit bit width is increased from 34 to 38 to accommodate byte-wise write masking.

A. Feature Extraction Cluster (FEC)

Each FEC consists of 1 SIMD Vector Processing Element (VPE) for exploiting the data level parallelism of the feature detection task, and 8 Scalar Processing Elements (SPE) for exploiting the task level parallelism of the feature description task. The VPE and SPE architectures are shown in more detail in Figs. 4 and 5. Both the VPE and SPE are fully programmable processors with 4 kB of instruction memory, a RISC-like ISA with special extensions, and C compiler support. The VPE's main features are its 20 B wide 40 kB data memory, which supports byte-aligned read and write accesses, and its convolution-optimized coefficient memory (CMEM) and convolution loop controller. The 20 B wide ALU can handle common logic and arithmetic functions including single-cycle MAC. Unlike most similar architectures, it contains only one 20 × 16-bit accumulation register and no SIMD register file, to minimize area and power. Thanks to these features the VPE performs Gaussian filtering
operations at a sustained rate of 3.65 GMAC/s at 200 MHz, and feature detection on a 32 × 32 tile is completed in 180 μs, which is sufficient to process VGA images (300 tiles) at 30 fps.

The SPE is optimized for high scalar IPC. It features a 5-stage pipeline that enables direct operations on memory operands in a CISC-like fashion. Branching with zero delay slots is made possible by condition code generation in the ALU1/Addr stage, which immediately follows the Fetch stage. Special functions such as SQRT, DIV/MOD, SIN/COS, and ATAN are supported by the ALU with a maximum latency of 3 cycles. Thanks to these optimizations, the SPE achieves 0.8 sustained IPC at 200 MHz operation, and the descriptor generation stage can be completed in 161 μs.

B. Fine-Grained Task Scheduling

Resource-aware fine-grained task scheduling, shown in Fig. 6, minimizes external memory accesses and maximizes utilization of the FECs to achieve high performance. The global task management unit (GTMU) maintains a Tile Memory Allocation Table (TMAT) that tracks the input image tiles loaded within each VPE. When the GTMU schedules new tile tasks for the VPEs, the TMAT is used to minimize the number of new tiles that must be loaded. As a result, external memory access is reduced by 53% compared to when input image tiles are not reused, and by 32% compared to sequential task scheduling.

SPE sharing among FECs through the collaboration of the Local Task Management Units (LTMU) enables high utilization of the SPEs. A major challenge in achieving high utilization in our heterogeneous multi-core pipeline is that the number of features detected, N_F, varies for each tile, as shown in Fig. 6. If N_F is less than 8, then some of the 8 SPEs will be unutilized. If N_F is greater than 8, then the recognition pipeline will have to stall to accommodate additional iterations of the SPEs, thus seriously degrading the recognition speed. However, with SPE sharing, unutilized SPEs are made available to neighboring FECs with N_F greater than 8, thus effectively averaging out the variation of N_F among the 4 FECs. As a result, pipeline stall occurrences are reduced by 82%, and the average tile processing speed is increased by 22% to 13811 tiles per second. This translates to a worst case performance of 33.8 fps for a 640 × 480 input when all tiles in the frame are selected.
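The benefit of SPE sharing can be illustrated with a small model (our own simplification: one SPE pass per detected feature, 4 FECs with 8 SPEs each, as in the text):

from math import ceil

def passes_without_sharing(nf_per_fec, spes_per_fec=8):
    # Each FEC independently iterates ceil(N_F / 8) times; the pipeline
    # advances at the pace of the slowest FEC.
    return max(ceil(nf / spes_per_fec) for nf in nf_per_fec)

def passes_with_sharing(nf_per_fec, total_spes=32):
    # Idle SPEs serve neighboring FECs, averaging out the N_F variation.
    return ceil(sum(nf_per_fec) / total_spes)

nf = [2, 5, 12, 9]                   # example per-tile feature counts
print(passes_without_sharing(nf))    # 2 passes: the N_F = 12 FEC stalls
print(passes_with_sharing(nf))       # 1 pass: 28 features fit in 32 SPEs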
Fig. 6. Hierarchical task management scheme.
C. Intelligent Inference Engine (IIE)

The IIE, shown in Fig. 7, is a mixed-mode ANFIS [14], capable of efficiently performing neuro-fuzzy inferences with up to 3 inputs. The mixed-mode design takes advantage of the low power and high performance of current-mode analog functional units, such as multipliers, adders, and Gaussian function circuits, and the high flexibility of a digital controller to perform parameter loading and learning. The conversion overhead of DACs and ADCs is kept to a minimum by using multiplying DACs, which directly calculate the product of analog signals and digital parameters, and by optimizing the resolution of the DACs and ADCs to between 5 and 8 bits.

The analog datapath of the IIE consists of 5 stages for neuro-fuzzy inference. The key circuit in the analog datapath is the parameterized Gaussian membership function, shown in Fig. 7, which performs non-linear conversion of the crisp input to fuzzy values. The high and low boundaries of the parameterized Gaussian membership function's shape can be controlled as shown in the waveforms, and the slope of the Gaussian function is controlled by the sizing of transistors M1 through M4. A detailed explanation of the circuit operation can be found in [13].

Digital parameters for the analog datapath are provided by the digital controller. Parameters for the fuzzy rules corresponding to each object reside in an off-chip memory, which can introduce high latency. 4 kB of internal cache reduces memory access overhead by 86%, thereby improving inference speed by 21%. As a result, the IIE achieves 1 M fuzzy logic inferences per second (FLIPS), and the area and power consumption of the analog datapath are just 0.176 mm² and 1.2 mW, or 54% and 15%, respectively, compared with an equivalent digital implementation.

Fig. 8 shows the perturbation learning [17] scheme employed by the IIE to achieve real-time adaptation. The evaluation result of the outer large loop and the perturbed results of the inner 9 iterative calculation paths are used to calculate the antecedent parameters for the next epoch; the quantities involved are the current antecedent parameters, the input, the familiarity output, the desired output, and the perturbation, with the corresponding symbols defined in Fig. 8. One iteration, or epoch, of perturbation learning takes 3.5 μs, and learning to less than 5% error is achieved in just 20 epochs, or 70 μs.

D. Power Mode Controller (PMC)

There is high potential for power savings through voltage/frequency scaling of the PPL, since the workload of the FECs and FMP varies widely depending on the number of selected tiles. Since frames are input at a fixed interval in a real-time application (33 ms for a 30 fps video stream), power consumption can be minimized by using the lowest possible voltage/frequency level that provides enough performance to process each frame within that time period. However, the required performance is not known for certain until processing on a frame is finished, due to the iterative tile selection of feedback attention. Therefore, accurate prediction of the workload is critical in order to effectively scale the voltage and frequency at the beginning of each frame.

The voltage and frequency of the PPL are scaled before performing recognition on each frame, based on the workload prediction of the IIE. The IIE performs accurate workload prediction using two clues as inputs. First, the workload history of previous frames is used to exploit the high correlation between consecutive frames. However, this is not enough to predict sudden spikes in the workload, so the energy of the saliency map, obtained during the bottom-up attention stage before recognition, is also used. The IIE is an adaptive neuro-fuzzy inference system (ANFIS), which means it can be trained to adapt to different scene environments. In our experiments the workload prediction is trained and tested using two separate video sequences of similar locations but taken from different angles with different events (i.e., camera movement, object onset). Training is performed with a strong bias toward over-prediction by placing a higher (2 times) penalty on under-predictions, because under-prediction could result in increased latency, which is undesirable under real-time constraints.
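A sketch of this asymmetric training objective (our own formulation in Python; the 2× under-prediction penalty follows the text, everything else is illustrative):

def biased_loss(predicted, actual, under_penalty=2.0):
    """Penalty used to bias the workload predictor toward over-prediction:
    under-predicting a frame's workload (which would add latency) costs
    twice as much as over-predicting it by the same amount."""
    err = predicted - actual
    return under_penalty * -err if err < 0 else err

# Training then minimizes the summed biased loss over a training sequence:
history = [(0.30, 0.28), (0.25, 0.31), (0.40, 0.38)]   # (predicted, actual)
print(sum(biased_loss(p, a) for p, a in history))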
Fig. 7. Intelligent Inference Engine.
Fig. 8. Perturbation learning in the Intelligent Inference Engine.
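In software terms, the weight-perturbation loop of Fig. 8 can be sketched as follows (a minimal single-membership toy model under our own assumptions; on the chip, the perturbed evaluations run as nine parallel paths over the antecedent parameters of the analog Gaussian membership circuits):

import numpy as np

def gaussian_membership(x, center, width):
    # Software analogue of the parameterized Gaussian membership circuit
    # of Fig. 7: converts the crisp input x to a fuzzy membership value.
    return np.exp(-((x - center) / width) ** 2)

def perturbation_step(params, x, target, infer, delta=0.01, lr=0.1):
    # One epoch of weight perturbation [17]: re-evaluate the network with
    # each antecedent parameter nudged by delta, estimate the error
    # gradient from the difference, and step in the descent direction.
    base_err = (infer(params, x) - target) ** 2
    grads = np.zeros_like(params)
    for i in range(len(params)):
        nudged = params.copy()
        nudged[i] += delta
        grads[i] = ((infer(nudged, x) - target) ** 2 - base_err) / delta
    return params - lr * grads

# Toy single-rule model: familiarity is one Gaussian membership whose
# antecedent parameters are [center, width].
infer = lambda p, x: gaussian_membership(x, p[0], p[1])
params = np.array([0.2, 1.0])
for _ in range(20):               # the chip converges in about 20 epochs
    params = perturbation_step(params, x=0.8, target=1.0, infer=infer)
print(infer(params, 0.8))         # familiarity output after learning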
The workload prediction result of the IIE for a 100 frame video sequence is shown in Fig. 9. Due to the biased training, for most frames the predicted workload slightly exceeds the actual workload. However, in some cases, such as frame 97, under-prediction may occur.
TABLE I
VOLTAGE AND FREQUENCY SCALING OF PPL POWER MODES
Fig. 9. Workload prediction of the Intelligent Inference Engine.
Fig. 10. Measured minimum VDD of PPL versus frequency.

Fig. 11. Hierarchical Star + Ring Network-on-Chip.
We assume a soft real-time constraint, in which the average frame rate is important, rather than a hard real-time constraint, in which every frame must complete within its allocated deadline. Therefore, in the case of under-prediction the frame processing time is allowed to exceed the deadline of 33 ms, which will appear as increased latency. This latency is compensated by the following frame, which operates at an elevated operating point so that it may complete within its shortened deadline. Should hard real-time operation be absolutely necessary, the premature termination of under-predicted frames will result in some loss of recognition rate, on the order of a 1% decrease according to our experiments.

Based on the workload prediction of the IIE, the power mode of the PPL is selected from among the 8 voltage and frequency pairs described in Table I. The voltage and frequency combinations are obtained by applying the Alpha-Power Law MOSFET Model [18] with wide margins to account for device mismatch at lower operating voltages. The voltage and frequency scaling properties of the implemented chip are shown in Fig. 10. The measured minimum VDD for each frequency point exhibits reasonable margins compared to the PMC settings down to 50 MHz. The lowest operating mode of 50 MHz at 0.65 V was determined by the embedded SRAM modules, whose performance degrades drastically beyond that point.
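The per-frame mode decision itself reduces to picking the slowest operating point whose throughput still covers the predicted workload, roughly as below (a Python sketch; only the 200 MHz/1.2 V and 50 MHz/0.65 V end points are fixed by the text, and the intermediate pairs are placeholders for the Table I entries):

# Choose the lowest PPL power mode whose frequency covers the predicted
# per-frame workload. Modes are (frequency_MHz, VDD_V), sorted low to high.
POWER_MODES = [(50, 0.65), (100, 0.8), (150, 1.0), (200, 1.2)]

def select_mode(predicted_cycles, frame_time_s=1 / 30):
    """Return the slowest (freq, vdd) pair that still finishes the
    predicted number of PPL cycles within the 33 ms frame budget."""
    for freq_mhz, vdd in POWER_MODES:
        if freq_mhz * 1e6 * frame_time_s >= predicted_cycles:
            return freq_mhz, vdd
    return POWER_MODES[-1]            # saturate at the top mode

print(select_mode(1.5e6))   # light frame  -> (50, 0.65)
print(select_mode(5.5e6))   # heavy frame  -> (200, 1.2)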
E. Hierarchical Star + Ring Network-on-Chip

A total of 51 IPs are connected by a 2-level hierarchical star NoC, consisting of a top global router and 4 local routers, augmented by a ring network connecting the FECs' local routers, as shown in Fig. 11. Each link shown provides a theoretical peak bandwidth of 640 MB/s in both directions when operating at 200 MHz, resulting in a theoretical peak bandwidth of 76.8 GB/s. The global router is operated at a clock of 400 MHz for low latency packet forwarding. The hierarchical star network provides low latency and high bandwidth within each local network, as well as a low maximum hop count of 3 between any two nodes in the network. The ring network provides additional bandwidth for inter-FEC communication.

An important role of the global router is the level-shifting and synchronization of incoming and outgoing packets. The only function-related (non-debugging) level shifters in the chip are placed on the 5 links between the global router and the PPL, as shown in Fig. 11, thus significantly simplifying implementation. In addition, synchronizing dual-clock FIFOs are placed only in the input and output ports of the global router.

IV. IMPLEMENTATION

The proposed chip, shown in Fig. 12, occupies 50 mm² in a 0.13 μm 8-metal CMOS process with a NAND2-equivalent gate count of 2.93 M and 626 kB of on-chip SRAM. All blocks are implemented by a standard cell automatic PnR flow, with the exception of the level shifting circuits and the analog datapath of the IIE, which are custom designed. 5 linear arrays of 41 level shifters are implemented as hard macro blocks and manually placed at the power domain boundaries to simplify the power routing. The analog datapath of the IIE, shown in the lower left
Fig. 12. Chip photograph and summary.
corner of the chip, is spaced away from the digital devices by at least 150 μm to mitigate noise effects.

V. EVALUATION

The proposed chip was evaluated in a test setup using ten 60-second 30 fps 640 × 480 pixel videos that were recorded in an urban environment. Target objects included outdoor objects such as vehicles, road signs, and buildings, as well as indoor objects such as book covers, dolls, and soda cans. The recognition accuracy, measured in terms of the true positive rate, was approximately 90% with a false positive rate below 1%. The recognition accuracy of the UVAM algorithm is on par with other implementations of the SIFT algorithm [11]. The attention precision of the UVAM tile selection was measured at an average of 76%, which is higher than the 58% attention precision of the saliency-based attention employed in previous works [6], [7]. Without attention, the attention precision is only 35%, which is equal to the average ratio of object area within the entire image area.

The power efficiency of the chip is measured to be 324 GOPS/W when the PPL is operating at 200 MHz/1.2 V. As shown in Fig. 13(a), this is just 12% higher than the previous work [7]. However, the recognition effectiveness, which factors in the attention precision, is 246 EGOPS/W, or 46% higher than the previous work [7], as shown in Fig. 13(b). The measured voltage and frequency scaling results of the PPL are shown in Fig. 14. Under the control of the PMC, the voltage and frequency are transitioned between frames of different predicted workloads. On average, only 30% of the image tiles in the test videos were selected for SIFT processing. Thanks to the frame-based voltage and frequency control of the PMC, a low average power consumption of 345 mW was achieved, which is 48% lower than when only saliency attention is used without the PMC.

In a second test setup, the chip was integrated with an application processor board, camera, head-mounted display (HMD), and battery to demonstrate a fully integrated augmented reality headset, as shown in Fig. 15. In the demonstration system, the object recognition chip recognizes pre-learned objects that are
Fig. 13. (a) Power efficiency and (b) effective power efficiency comparison.
Fig. 14. Measured dynamic voltage and frequency control of the PPL.
saved in the database at a rate of 30 fps from 640 × 480 video images. Information about the recognized objects is overlaid on the HMD for convenient viewing. Tests carried out in indoor/outdoor environments verify the robustness of the object recognition chip, and the viability of vision-based augmented reality systems in a mobile form factor.

The current generation of the proposed chip supports 30 fps processing of 640 × 480 pixel video, which is sufficient for many mobile real-time applications. However, the chip's architecture is highly scalable and could easily support higher resolutions such as 1280 × 720 or 1920 × 1080 by increasing the number of FECs and FMPs. This could even be achieved by board-level integration of multiple chips, thanks to the off-chip gateway interface.
Fig. 15. Augmented reality headset demonstration.
VI. CONCLUSION

In this paper we propose a heterogeneous many-core processor that employs intelligent computing blocks to achieve high recognition effectiveness. Three key features are proposed. First, the Intelligent Inference Engine (IIE) performs familiarity inference of a potential object match based on its characteristic features to achieve high attention precision. Second, Feature Extraction Clusters (FEC) consisting of SIMD VPEs and MIMD SPEs exploit the data parallelism and task parallelism of SIFT-based recognition to achieve high peak performance while maintaining high utilization through a hierarchical task management scheme. Third, a Power Mode Controller (PMC) performs dynamic voltage and frequency scaling of the PPL based on workload prediction by the IIE to achieve low power consumption. The chip is integrated in a live augmented reality application to verify the viability of a mobile vision-based augmented reality system. It is shown that recognition effectiveness can be greatly improved by combining parallel processing elements with an intelligent neuro-fuzzy inference system.

REFERENCES

[1] D. Wagner, G. Reitmayr, A. Mulloni, T. Drummond, and D. Schmalstieg, “Real-time detection and tracking for augmented reality on mobile phones,” IEEE Trans. Visual. Comput. Graph., vol. 16, no. 3, pp. 355–368, 2010.
[2] D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” Int. J. Comput. Vis., vol. 60, no. 2, pp. 91–110, 2004.
[3] B. K. Khailany et al., “A programmable 512 GOPS stream processor for signal, image, and video processing,” IEEE J. Solid-State Circuits, vol. 43, no. 1, pp. 202–213, Jan. 2008.
[4] S. Arakawa et al., “A 512 GOPS fully-programmable digital image processor with full HD 1080p processing capabilities,” in Proc. IEEE ISSCC 2008 Dig. Tech. Papers, pp. 312–313.
[5] D. Kim et al., “An 81.6 GOPS object recognition processor based on NoC and visual image processing memory,” in Proc. IEEE CICC, 2007, pp. 443–446.
[6] K. Kim et al., “A 125 GOPS 583 mW network-on-chip based parallel processor with bio-inspired visual attention engine,” IEEE J. Solid-State Circuits, vol. 44, no. 1, pp. 136–147, Jan. 2009.
[7] J.-Y. Kim et al., “A 201.4 GOPS 496 mW real-time multi-object recognition processor with bio-inspired neural perception engine,” IEEE J. Solid-State Circuits, vol. 45, no. 1, pp. 32–45, Jan. 2010.
[8] L. Itti, C. Koch, and E. Niebur, “A model of saliency-based visual attention for rapid scene analysis,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 20, no. 11, pp. 1254–1259, Nov. 1998.
[9] S. Lee et al., “The brain mimicking visual attention engine: An 80 × 60 digital cellular neural network for rapid global feature extraction,” in Proc. 2008 Symp. VLSI Circuits Dig. Tech. Papers, pp. 26–27.
[10] M. Kim et al., “A 22.8 GOPS 2.83 mW neuro-fuzzy object detection engine for fast multi-object recognition,” in Proc. 2009 Symp. VLSI Circuits Dig. Tech. Papers, pp. 260–261.
[11] S. Lee et al., “Familiarity based unified visual attention model for fast and robust object recognition,” Pattern Recognition, vol. 43, pp. 1116–1128, 2010.
[12] S. Lee et al., “A 345 mW heterogeneous many-core processor with an intelligent inference engine for robust object recognition,” in Proc. IEEE ISSCC 2010 Dig. Tech. Papers, pp. 332–333.
[13] J. Oh et al., “A 1.2 mW on-line learning mixed mode intelligent inference engine for robust object recognition,” in Proc. 2010 Symp. VLSI Circuits, pp. 17–18.
[14] J.-S. R. Jang, “ANFIS: Adaptive-network-based fuzzy inference system,” IEEE Trans. Syst., Man, Cybern., vol. 23, no. 3, pp. 665–685, May/Jun. 1993.
[15] C. Koch and S. Ullman, “Shifts in selective visual attention: Towards the underlying neural circuitry,” Human Neurobiol., vol. 4, pp. 219–227, 1985.
[16] J.-Y. Kim et al., “A 118.4 GB/s multi-casting network-on-chip for real-time object recognition processor,” in Proc. IEEE ESSCIRC, 2009, pp. 400–403.
[17] M. Jabri, “Weight perturbation: An optimal architecture and learning technique for analog VLSI feedforward and recurrent multilayer networks,” Neural Comput., vol. 3, no. 4, pp. 546–565, 1991.
[18] T. Sakurai, “Alpha-power law MOSFET model and its applications to CMOS inverter delay and other formulas,” IEEE J. Solid-State Circuits, vol. 25, no. 2, pp. 584–594, 1990.
Seungjin Lee (S’06) received the B.S. and M.S. degrees in electrical engineering and computer science from the Korea Advanced Institute of Science and Technology (KAIST), Daejeon, Korea, in 2006 and 2008, respectively, where he is currently working toward the Ph.D. degree in electrical engineering and computer science. His previous research interests include low power digital signal processors for digital hearing aids and body area communication. Currently, he is investigating parallel architectures for computer vision processing.

Jinwook Oh (S’08) received the B.S. degree in electrical engineering and computer science from Seoul National University, Seoul, Korea, in 2008 and the M.S. degree in electrical engineering and computer science from KAIST in 2010, where he is currently working toward the Ph.D. degree in electrical engineering and computer science. His research interests include low power digital signal processors for computer vision. Recently, he has been involved with the VLSI implementation of neural networks and fuzzy logic.

Junyoung Park (S’09) received the B.S. degree in electrical engineering and computer science from the Korea Advanced Institute of Science and Technology (KAIST), Daejeon, Korea, in 2009, where he is currently working toward the M.S. degree in electrical engineering and computer science. Since 2009, he has been involved with the development of parallel processors for computer vision. Currently, his research interests are many-core architectures and VLSI implementation for bio-inspired vision processors.
Joonsoo Kwon (S’09) received the B.S. degree in electrical engineering from the Korea Advanced Institute of Science and Technology (KAIST), Daejeon, Korea, in 2009, where he is currently working toward the M.S. degree in electrical engineering. Since 2009, he has been involved with the development of parallel processors for computer vision. Currently, his research interests are image enhancement algorithms and VLSI implementation for bio-inspired vision processors.
Minsu Kim (S’07) received the B.S. and M.S. degrees in electrical engineering and computer science from the Korea Advanced Institute of Science and Technology (KAIST), Daejeon, Korea, in 2007 and 2009, respectively. His research interests include Network-on-chip based SoC design and bio-inspired VLSI architecture for intelligent vision processing.
Hoi-Jun Yoo (M’95–SM’04–F’08) graduated from the Electronic Department of Seoul National University, Seoul, Korea, in 1983 and received the M.S. and Ph.D. degrees in electrical engineering from the Korea Advanced Institute of Science and Technology (KAIST), Daejeon, in 1985 and 1988, respectively. His Ph.D. work concerned the fabrication process for GaAs vertical optoelectronic integrated circuits. From 1988 to 1990, he was with Bell Communications Research, Red Bank, NJ, where he invented the two-dimensional phase-locked VCSEL array, the front-surface-emitting laser, and the high-speed lateral HBT. In 1991, he became Manager of a DRAM design group at Hyundai Electronics and designed a family of fast 1M DRAMs and 256M synchronous DRAMs. In 1998, he joined the faculty of the Department of Electrical Engineering at KAIST, where he is now a full professor. From 2001 to 2005, he was the Director of the System Integration and IP Authoring Research Center (SIPAC), funded by the Korean government to promote worldwide IP authoring and its SoC application. From 2003 to 2005, he was the full-time Advisor to the Minister of the Korea Ministry of Information and Communication and National Project Manager for SoC and Computer. In 2007, he founded SDIA (System Design Innovation & Application Research Center) at KAIST to research and develop SoCs for intelligent robots, wearable computers, and bio systems. His current interests are high-speed and low-power Networks on Chips, 3D graphics, Body Area Networks, biomedical devices and circuits, and memory circuits and systems. He is the author of the books DRAM Design (Seoul, Korea: Hongleung, 1996; in Korean) and High Performance DRAM (Seoul, Korea: Sigma, 1999; in Korean), and of chapters of Networks on Chips (New York: Morgan Kaufmann, 2006). Dr. Yoo received the Electronic Industrial Association of Korea Award for his contribution to DRAM technology in 1994, the Hynix Development Award in 1995, the Korea Semiconductor Industry Association Award in 2002, the Best Research of KAIST Award in 2007, the Design Award at the 2001 ASP-DAC, and Outstanding Design Awards at the 2005, 2006, and 2007 A-SSCC. He is a member of the executive committees of ISSCC, the Symposium on VLSI Circuits, and A-SSCC. He was the TPC Chair of A-SSCC 2008.