This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination. IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS


1.2-mW Online Learning Mixed-Mode Intelligent Inference Engine for Low-Power Real-Time Object Recognition Processor Jinwook Oh, Student Member, IEEE, Seungjin Lee, Member, IEEE, and Hoi-Jun Yoo, Fellow, IEEE

Abstract— Object recognition is computationally intensive, and it is challenging to meet the 30-f/s real-time processing demand under the sub-watt low-power constraints of mobile platforms, even for heterogeneous many-core architectures. In this paper, an intelligent inference engine (IIE) is proposed as a hardware controller for a many-core processor to satisfy the requirements of low-power real-time object recognition. The IIE exploits the learning and inference capabilities of the neurofuzzy system by adopting the versatile adaptive neurofuzzy inference system (VANFIS) with the proposed hardware-oriented learning algorithm. Using the programmable VANFIS, the IIE can configure its hardware topology adaptively for different target classifications. Its architecture contains analog/digital mixed-mode neurofuzzy circuits for online parameter updates to increase the attention efficiency of the object recognition process. It is implemented in a 0.13-µm CMOS process and achieves 1.2-mW power consumption with 94% average classification accuracy within 1-µs operation delay. The 0.765-mm² IIE achieves 76% attention efficiency and reduces the power and processing delay of the 50-mm² image processor by up to 37% and 28%, respectively, when 96% recognition accuracy is achieved.

Index Terms— Mixed-mode processor, neurofuzzy, object recognition, VLSI.

I. INTRODUCTION

Recently, several image processors have been developed to realize low-power real-time object recognition for mobile environments. They are used in a wide range of vision applications, such as robot vision systems [1], [2], vehicle navigation [3]–[5], and surveillance cameras [6]. Basically, object recognition processors are composed of a large number of parallel processing units, including single instruction multiple data (SIMD) processors, multiple instruction multiple data (MIMD) processors, and/or application-specific instruction-set processors, to perform complex recognition algorithms in real time with low power consumption [1]–[7].

Manuscript received September 23, 2011; revised March 23, 2012; accepted April 12, 2012. This work was supported in part by the Global Frontier Research and Development Program on Human-Centered Interaction for Coexistence, under the National Research Foundation of Korea Grant funded by the Korean Government (MEST) under Grant NRF-M1AXA003-20110028368. The authors are with the Department of Electrical Engineering and Computer Science, Korea Advanced Institute of Science and Technology, Daejeon 305-701, Korea (e-mail: [email protected]; [email protected]; [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TVLSI.2012.2198249

Since their operations involve a series of complex image processing tasks spanning from low-level to high-level image processing, for example, kernel-based filtering and object retrieval/classification, respectively, the processors require high internal and external bandwidth for heavy data communication between cores and memories, and consume a large amount of power for extensively parallelized processing elements. Thus, to satisfy mobile system requirements, they should be optimized with respect to power, speed, and performance at once. To satisfy these requirements, software-based task management has previously been deployed on complex processing units such as multicore CPUs or GPUs, requiring extra computation and hardware resources for interaction [8]. Such an approach can increase system utilization and power efficiency, and, more importantly, it can adaptively cope with different application programs, but its overheads are not acceptable in the mobile environment. On the other hand, application-specific integrated circuit (ASIC)-based task management has been applied to the power control of digital signal processors (DSPs) or CPUs [9], incurring little power and time overhead for resource management. This hardware approach can achieve higher efficiency and shorter response time; however, its inherent limitations in controllability and exception coverage incur performance and efficiency degradation when unexpected situations are encountered.

A. Related Work

To mitigate the overhead and the functionality degradation of conventional approaches to resource management of highly parallel processors, an intelligent inference engine (IIE) was proposed in [10], using a neurofuzzy learning/inference system as an on-chip hardware controller of a heterogeneous many-core processor for object recognition.
It was based on neurofuzzy hardware merging a reduced instruction set computer (RISC)-based controller and ASIC circuits. This approach provides high programmability as well as high energy efficiency. As shown in Fig. 1(a), the processor consists of the cognitive control layer and the parallel processing layer. The cognitive control layer is a cluster of control IPs for the parallel processing layer, which contains a number of SIMD/MIMD

1063–8210/$31.00 © 2012 IEEE


[Fig. 1 appears here: the cognitive control layer (host RISC, DVFS controller, IIE, and global/local task management units) sits above the parallel processing layer of 20-way SIMD cores, MIMD cores, and feature extraction clusters FEC0–FEC3; panel (b) shows fuzzy-rule-based object confidence generation from proto-object features, and panel (c) shows workload prediction from ROI/thread counts, core utilization, and task history.]

Fig. 1. (a) IIE implementation for object recognition processor and use cases of IIE. (b) Algorithm acceleration. (c) Hardware workload prediction.

processing elements for the object recognition algorithm. Belonging to the control layer as a task prediction and allocation IP, the IIE can simultaneously perform neurofuzzy-based object classification to increase application performance and dynamic workload prediction to increase the operational energy efficiency of the overall system. For instance, as an algorithm accelerator, depicted in Fig. 1(b), the IIE generates the input object confidence as the similarity between the features of the proto-object and the fuzzy rule (FR) of target objects in the database. On the other hand, as a hardware resource controller, depicted in Fig. 1(c), the IIE can predict the next workload by comparing the current status of the processor with the prelearned workload history for more energy-efficient hardware control. It adopts the neurofuzzy algorithm for algorithm acceleration and hardware resource management concurrently, contributing to system performance and energy efficiency by using the accurate inference capability of the fuzzy system (FS) and the online learning capability of the neural network (NN).

B. Contribution and Organization

In this paper, the IIE is presented in detail, focusing on its implementation methods. To achieve both energy efficiency, based on fast processing speed and low power consumption, and system performance, the following algorithm and hardware solutions address the challenges in the proposed controller, the IIE.

1) A hardware-oriented neurofuzzy system is proposed by adopting a reconfigurable network topology and a low-cost but fast learning algorithm. The proposed algorithm contributes to increasing neurofuzzy system performance with reduced power and time overhead.

2) Analog/digital mixed-mode circuit implementation of the IIE is employed to reduce overall energy consumption while sustaining system accuracy suitable for various object recognition applications.

3) The online learning IIE adopts the proposed mixed-mode techniques to minimize the throughput degradation, classification accuracy degradation, and domain conversion overhead incurred by the mixed-mode system, achieving robustness and energy efficiency simultaneously.

Thanks to the proposed architecture and mixed-mode techniques, the IIE increases application performance, i.e., the classification accuracy of the neurofuzzy system, with low power consumption. Integrated into the object recognition processor, the overall system can increase recognition accuracy with reduced power and delay with the help of the IIE.

The rest of this paper is organized as follows. Section II describes the neurofuzzy algorithm proposed for the flexible hardware controller. It also briefly introduces the proposed online learning algorithm suitable for on-chip realization. Section III explains the detailed architecture of the IIE, which consists of analog and digital circuits for low energy consumption. Then, Section IV describes the mixed-mode techniques for online learning realization. The chip implementation and evaluation results follow in Section V. Finally, Section VI concludes this paper.

II. DETAILED NEUROFUZZY ALGORITHM OF IIE

In this section, the basic concepts of the neurofuzzy algorithm are introduced to explore the detailed algorithm proposed for the hardware IIE. Unlike the conventional stand-alone algorithm approach, the proposed algorithm is designed for the hardware controller, satisfying performance and energy constraints at the same time. Thus, the major concerns of the algorithm are hardware feasibility and achievable system performance with high energy efficiency.

A. Neurofuzzy Algorithm

Fig. 2(a) shows the overall block diagram of the neurofuzzy algorithm applied in the IIE, consisting of FS-based inference at


the feed-forward path and NN-based learning at the feedback path. Since each path is based on a different algorithm for a different function, algorithm optimization is applied independently with different constraints. The feed-forward path, i.e., the FS-based inference that classifies input features based on FR comparison, consists of fuzzification, FR generation, and rule selection steps. It generates the classification confidence as the similarity between the characteristics of the query input and those of the trained rule of the target. The FS is one of the soft computing algorithms proposed to classify the input observations of a query object by dealing with the qualitative aspects of human knowledge and reasoning processes without employing precise quantitative analyses. We utilize the adaptive neurofuzzy inference system (ANFIS) [11] as the FS for its transparency and functionality. As shown in Fig. 2(b), it is composed of three steps: fuzzification, FR generation, and rule selection. The first step, fuzzification, converts crisp input values to continuous fuzzy numbers through a nonlinear function indicating the degree of qualitative characteristics of the input feature. The function, called the membership function (MF) in the FS, is defined as

f_{A,i}(x) = \frac{1}{1 + \left| (x - c_i)/a_i \right|^{2b_i}},  for i = 1, 2, 3, ..., number of MFs for A    (1)

where A is the input feature type, i is the MF number, {a_i, b_i, c_i} is the parameter set that changes the shape of the MF, and x is the input of A. The second step, FR generation, calculates the combinations of fuzzy numbers that compose the feature rules of the target objects. Every rule node multiplies the incoming signals from the MFs and represents the firing strength as an FR w computed as

R_i(x, y, z) = w_i = f_{A,j}(x) f_{B,k}(y) f_{C,l}(z),  for i = 1, 2, ..., number of FRs    (2)

where wi is the i th FR composed of combinations of MFs and f A, j (x), f B,k (y), and f C,l (z) are the j th, kth, and lth MFs of the input feature A, B, and C, respectively. The last step, rule selection, determines the contribution of FRs to the decision of the target object so that the result of the similarity between the query object and the target object is generated as confidence of the decision with (3) and (4). The i th node of the rule selection layer computes the normalized firing strengths and linear combinations of inputs Si (x, y, z) = w¯ i ( pi x + q i y + r i z + si ), for all i = 1, 2, . . . , number of FRs

(3)

where w¯ i is the normalized rule from FR generation and { pi , qi , ri , si } is the linear parameter set of linear combination of input features. Then each Si (x, y, z) is summed together and generates the confidence of the input features (x, y, z)  C(x, y, z) = Si (x, y, z) i

3

Fig. 2. (a) Block diagram of the applied neurofuzzy system. (b) Functional steps of ANFIS (three inputs–eight rules).

 =

i

wi ( pi x + q i y + r i z + si )  . wi

(4)

i
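Putting (1)–(4) together, the feed-forward inference can be sketched as a short software model. This is a behavioral sketch for clarity, not the authors' mixed-mode hardware; the function name `anfis_confidence`, the parameter layout, and all numeric values are illustrative assumptions.

```python
import itertools

def bell_mf(x, a, b, c):
    # Generalized bell MF of (1): 1 / (1 + |(x - c)/a|^(2b)).
    return 1.0 / (1.0 + abs((x - c) / a) ** (2 * b))

def anfis_confidence(inputs, mf_params, linear_params):
    """Feed-forward ANFIS inference of (1)-(4).
    inputs: crisp features (x, y, z); mf_params[f]: list of (a, b, c)
    sets for feature f; linear_params[i]: (p, q, r, s) of the ith FR."""
    # Fuzzification (1): membership degree of each input in each MF.
    mu = [[bell_mf(v, a, b, c) for a, b, c in mf_params[f]]
          for f, v in enumerate(inputs)]
    # FR generation (2): product over one MF per feature, all combinations.
    w = [m0 * m1 * m2 for m0, m1, m2 in itertools.product(*mu)]
    total = sum(w)
    # Rule selection (3) and confidence (4): normalized firing strengths
    # weight the linear combinations of the inputs.
    x, y, z = inputs
    return sum(wi / total * (p * x + q * y + r * z + s)
               for wi, (p, q, r, s) in zip(w, linear_params))

# Illustrative [3, 3]-style setup: 3 inputs x 3 MFs each -> 27 FRs.
mf = [[(2.0, 2.0, c) for c in (0.0, 5.0, 10.0)] for _ in range(3)]
lin = [(0.1, 0.1, 0.1, 0.5)] * 27        # placeholder consequent sets
conf = anfis_confidence((4.0, 6.0, 2.0), mf, lin)
```

Because the normalized firing strengths sum to one, identical consequent sets make the confidence collapse to p·x + q·y + r·z + s; it is the learning of distinct {p_i, q_i, r_i, s_i} per rule that lets each FR contribute according to its firing strength.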

The feedback path, i.e., the NN-based learning, updates the parameters {a_i, b_i, c_i} of the MFs in fuzzification and {p_i, q_i, r_i, s_i} of rule selection to train the target object characteristics into the FRs of the inference by using a learning algorithm. After measuring the inference result, the learning algorithm is performed using the given parameter sets of each object. In this paper, we apply conventional backpropagation-based NN learning. It updates the parameters in the NN to reduce the accumulated errors of the measured outputs of the updated parameters, thereby converging toward the minimum of the derived error function. The computational complexity and stability of the learning algorithm are critical to overall performance since it mostly consists of iterative numerical operations and has varying divergence characteristics. To realize this system with consideration of system performance and energy efficiency, we propose hardware-friendly learning and inference. Since an excessive number of parameters in the NN causes overfitting [12], which means that the user overestimates the complexity of the target problem, it is important to sustain a proper number of parameters in the FS for the target application by increasing the programmability of the hardwired system. In addition, a complex learning algorithm is not affordable for online learning hardware, so a fast and accurate learning algorithm should be chosen to increase the speed and energy efficiency of the system. Therefore, we propose the versatile adaptive neurofuzzy inference system (VANFIS) to obtain programmability for different types of tasks and target query inputs for inference, and the online



multipliers is integrated at the last node of the VANFIS, thereby calculating the inference result as a confidence of the input query features compared with the target features. The system modifies its inference capability by changing the total number of parameters in the VANFIS for different classifications. We deploy four configurations for VANFIS programmability, as shown in Fig. 3(b). In the [3, 3] configuration of the diagram, the VANFIS has three different kinds of inputs, and each input has three associative MFs, thereby generating 27 compositions of input fuzzy characteristics. Then the system can have 135 parameters, adjusting the MFs and weight strengths that determine the inference complexity and learning information capacity, based on the following relationship:

n(parameter) = 3 × n(input) × n(membership) + {n(input) + 1} × n(membership)^{n(input)}.    (6)


learning algorithm, the adaptive parameter perturbation (APP), to reduce the processing delay and extra hardware costs for learning.

The number of parameters, n(parameter), is determined by the number of MFs, n(membership), and the number of inputs, n(input). To decide the optimized number of connections for hardware implementation without performance degradation, the sustainable classification accuracy of each configuration is examined for different numbers of active rules. Considering the trade-offs between accuracy and network overhead, the four configurations are enough to maintain more than 90% accuracy over the active rule ranges while limiting the extra network cost to only 6% of the total hardware. As a result, the four configurations of Fig. 3(c), which can classify query objects accurately for the target database with minimized energy consumption, are chosen for the target system.
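The parameter budget of each configuration follows directly from (6); a minimal sketch, assuming three antecedent parameters {a, b, c} per MF and n(input) + 1 consequent parameters {p, q, r, s} per FR, consistent with the 135-parameter count of the [3, 3] configuration:

```python
def n_parameters(n_input, n_membership):
    # Eq. (6): 3 antecedent parameters {a, b, c} per MF plus
    # (n_input + 1) consequent parameters {p, q, r, s} per FR.
    n_rules = n_membership ** n_input      # every MF combination is one FR
    return 3 * n_input * n_membership + (n_input + 1) * n_rules

# The four granular-network configurations [n(input), n(membership)]:
counts = {cfg: n_parameters(*cfg) for cfg in [(3, 3), (3, 2), (2, 4), (2, 3)]}
```

For the [3, 3] configuration this gives 3·3·3 + 4·27 = 27 + 108 = 135 parameters over 27 FRs, matching the text above.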

B. VANFIS

C. APP

Fig. 3(a) shows a concept diagram of the proposed VANFIS for the inference operation of the IIE. The ANFIS has a fixed network of connections between the MFs and the FR generation stage. On the contrary, with the help of the proposed granular network, which provides reconfigurability between the fuzzification and FR generation stages, the VANFIS can change the connections according to the complexity of the target object feature characteristics. The FR w of (2) can be configured by changing the number of MFs and FRs, i.e., the indices j, k, and l, and i, respectively. Thus, the inference capability of the system, that is, the maximum composition of FRs of the configured network, is changed by the network configuration, as shown in

Fig. 4(a) shows the concept of the learning algorithm applied in the proposed IIE hardware. In order to achieve higher learning accuracy at small extra cost, we propose the APP algorithm as a hardware-friendly learning operation for the VANFIS. It is fundamentally based on the conventional perturbation learning method [13], which has two fixed perturbation modes: sequential perturbation and simultaneous perturbation. On the contour map of an error function, defined by J(w) = (y − y_d)²/2 = error²/2, where y is an output and y_d is the desired output of the target object, each conventional algorithm searches for the global minimum of the error function by fixed perturbation, suffering from a long processing delay or inaccurate learning results, respectively. However, APP learning, which modifies the perturbation step from fast simultaneous perturbation to accurate sequential perturbation, can increase the learning accuracy as well as the convergence speed [14]. It reduces the number of perturbed parameters when the learning accuracy is not improved by the given step. Fig. 4 shows the flow diagram of APP with the proposed perturbation step modification. After the measured error of the perturbed input vector is generated, the step change occurs depending on the loss index, which measures, for the nth parameter set p_n, the accumulated accuracy increments over the last m epochs


Fig. 3. (a) VANFIS architecture with the proposed granular network. (b) Four configurations of the granular network. (c) Classification accuracy based on the granular network.

R_i(x, y, z) = w_i = f_{A,∀j}(x) f_{B,∀k}(y) f_{C,∀l}(z),  for ∀j ∈ A, ∀k ∈ B, ∀l ∈ C.    (5)
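The configurable FR generation of (5) can be sketched in software, here using the minimum operators of the hardware's min–max inference in place of the products of (2). This is an illustrative behavioral model; the function name `generate_rules` and the list-based crossbar routing are assumptions, not the authors' circuit.

```python
import itertools

def generate_rules(mu, active_inputs):
    """Enumerate every MF combination of the selected inputs, as in (5),
    combining each with a minimum operator (hardware min-max inference).
    mu[f] holds the MF outputs of feature f; active_inputs models the
    crossbar switch routing fuzzification outputs to the rule nodes."""
    selected = [mu[f] for f in active_inputs]
    return [min(combo) for combo in itertools.product(*selected)]

# A [2, 3]-style configuration: two routed inputs, three MFs each -> 9 FRs.
mu = [[0.9, 0.4, 0.1], [0.7, 0.2, 0.05], [0.3, 0.6, 0.8]]
rules = generate_rules(mu, active_inputs=[0, 1])
```

Routing all three inputs instead (`active_inputs=[0, 1, 2]`) yields the maximum 27 FRs of the [3, 3] configuration, so the same rule nodes serve every granular-network configuration.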

As shown in Fig. 3(a), the VANFIS can support at most three inputs, each with three MFs, in the physical implementation, thereby having at most 27 FRs. The granular network can be implemented with a crossbar switch that modifies the data signal paths between the fuzzification and FR generation stages. The FR generation is composed of multiple minimum operators that realize the min–max operation of the fuzzy inference process. The normalization operation is then performed, followed by the weight multiplier arrays. Each output result from

Trace(p_n) = J(p_n) − J(p_{n−1})    (7)
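The APP flow described above can be sketched as a software loop: start with fast simultaneous perturbation of all parameters and, once the loss index shows the error saturating, fall back to accurate one-at-a-time sequential perturbation. This is a hedged illustration with simplified bookkeeping (a single-epoch trace in place of the m-epoch accumulation) and illustrative names such as `app_learn` and `th1`, not the authors' on-chip datapath.

```python
import random

def app_learn(params, loss, n_epochs=100, delta=0.05, th1=1e-4):
    """Adaptive parameter perturbation sketch. `loss` maps a parameter
    list to the scalar error J(w); `th1` is the saturation threshold on
    the loss index of (7) that triggers the perturbation step change."""
    p = list(params)
    j_prev = loss(p)
    simultaneous = True
    for _ in range(n_epochs):
        if simultaneous:
            # Fast phase: perturb every parameter at once with random signs,
            # keeping the candidate only if the error decreases.
            signs = [random.choice((-1.0, 1.0)) for _ in p]
            cand = [v + s * delta for v, s in zip(p, signs)]
            if loss(cand) < j_prev:
                p = cand
        else:
            # Accurate phase: perturb one parameter at a time, keeping
            # the first improving move for each parameter.
            for i in range(len(p)):
                for step in (delta, -delta):
                    cand = list(p)
                    cand[i] += step
                    if loss(cand) < loss(p):
                        p = cand
                        break
        j_now = loss(p)
        # Loss index saturated (improvement below th1): switch modes.
        if simultaneous and j_prev - j_now < th1:
            simultaneous = False
        j_prev = j_now
    return p

random.seed(0)
loss = lambda q: sum((v - 1.0) ** 2 for v in q) / 2.0   # J = error^2 / 2
trained = app_learn([0.0, 0.0], loss, n_epochs=200)
```

On this toy quadratic error surface the simultaneous phase makes coarse progress until its improvement stalls, after which the sequential phase drives each parameter to within one perturbation step of the minimum, mirroring the accuracy/speed trade-off the APP step change is designed to exploit.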


[Fig. 4 appears here: flow diagram of APP learning, in which the perturbed input vector passes through the MFs and minimum operators, the loss index is measured, and the perturbation step changes when the loss index falls below threshold th1 (error saturated), or learning continues while still converging.]