Computer Aided Design of Fuzzy Systems based on generic VHDL Specifications
Thomas Hollstein, Saman K. Halgamuge and Manfred Glesner
Abstract: Fuzzy systems implemented in hardware can operate with much higher performance than software implementations on standard microcontrollers. In this contribution three types of fuzzy systems and related hardware architectures are discussed: standard fuzzy controllers, FuNe I fuzzy systems, and fuzzy classifiers based on a neural network structure. Two Computer Aided Design (CAD) packages for automatic hardware synthesis of standard fuzzy controllers are presented: a hard-wired implementation of a complete fuzzy system on a single or multiple Field Programmable Gate Arrays (FPGAs), and a modular toolbox, called fuzzyCAD, for the synthesis of reprogrammable fuzzy controllers with architectures chosen according to specified designer constraints. In the fuzzyCAD system, an efficient design methodology has been implemented, which covers a large design space in terms of signal representations and component architectures as well as system architectures. VHDL descriptions and the use of powerful synthesis tools allow different technologies to be targeted easily and efficiently. In the last part of this contribution, properties and hardware realizations of fuzzy classifiers based on a neural network are introduced. Finally, future perspectives and possible enhancements of the existing toolkits are outlined.
Keywords: Fuzzy systems, fuzzy hardware, fuzzy controller, fuzzy classifier, generic VHDL fuzzy modules, fuzzyCAD, neuro-fuzzy systems, FPGA designs.
I. Introduction
The functionality of a fuzzy system can be acquired either from expert knowledge or from training data. This source of knowledge has a fundamental impact on the fuzzy system structure to be applied. Generally, a multiple-input multiple-output (MIMO) fuzzy system can be abstracted as a system with $n$ inputs $\{X_1, \ldots, X_n\}$ and $m$ outputs $\{Z_1, \ldots, Z_m\}$ (Fig. 1).
Fig. 2 shows an example of the operation of a standard fuzzy controller.
[Figure 2: rule premises over the labels Low, High, Medium and Low are combined with OR and AND; the rule conclusions are Slow and Fast.]
Fig. 2. Standard Fuzzy Controller (Mamdani)
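Without a priori knowledge of $F$, the number and form of the rules, the membership functions and the defuzzification parameters have to be generated by neuro-fuzzy software based on training data $\{X, Z\}$. Two such models known in the neuro-fuzzy field are shown in Figs. 3 and 4.
[Figure 3: FuNe I fuzzy system; premises over the labels Low, High, Medium and Low are combined with OR and AND into rule outputs $K_1$ and $K_2$, which are weighted by $W_1$ and $W_2$ and passed through $\mathrm{Sig}(\sum_i K_i W_i)$.]
[Figure 1: block "Fuzzy System" with inputs $X_1, \ldots, X_n$ and outputs $Z_1, \ldots, Z_m$.]
Fig. 1. General MIMO Fuzzy System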
Assume that the functionality of this system is described by the functional relation
$$F : X \rightarrow Z \tag{1}$$
with $X = \{X_1, \ldots, X_n\}$ and $Z = \{Z_1, \ldots, Z_m\}$. If $F$ is acquired from expert knowledge, the fuzzy system can be realized by a classical standard fuzzy controller implementation, as introduced by Mamdani [MA75].
The authors are with the Institute of Microelectronic Systems, Darmstadt University of Technology, Darmstadt, Germany.
Fig. 3. FuNe I Fuzzy system
Comparing the standard fuzzy controller operation with these alternative fuzzy systems, it is obvious that the major difference lies in the conclusion/defuzzification parts. The FuNe I fuzzy system performs the defuzzification by weighted addition of singletons, passing the result through a sigmoid function. In the classifier fuzzy system, which is also a neural network that can be interpreted as a fuzzy system [HPG95], [HG95], defuzzification is not required, since a decision about the membership to an output class [Bez93] is sufficient. The three fuzzy models introduced above are considered by the authors for designing fuzzy hardware. Every type of fuzzy system can be implemented on standard microcontrollers.
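[Figure 4: classifier fuzzy system; premises over the labels Low, High, Low and Medium are combined with OR and AND into rule outputs $K_1$ and $K_2$, and the output class is selected by $\mathrm{MAX}(K_i)$.]
Fig. 4. Classifier Fuzzy System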
As soon as dedicated fuzzy hardware is taken into consideration, restrictions can apply due to specialized hardware modules. Assume that a software-programmable fuzzy hardware covers a functionality space $S = \{S_1, \ldots, S_k\}$, where all $S_i$ ($i \in \{1, \ldots, k\}$) are fuzzy systems that can be realized. Any projected fuzzy system functionality $F$ can be mapped onto this hardware if
$$\exists\, S_i \in S \ \text{with}\ S_i = F, \quad i \in \{1, \ldots, k\} \tag{2}$$
If $|S| = 1$, the hardware is a fully hard-wired implementation with a fixed rule base. This special case is especially interesting for rapid prototyping on RAM-based FPGAs, where a software-programmable rule base is not required, since the whole system can be re-synthesized if rule modifications are required.
In the following sections, hardware implementations and synthesis toolkits for the previously introduced fuzzy system types are described. The general structure for standard fuzzy controllers and a similar hardware architecture, which can be configured by the software neuro-fuzzy system FuNe I [HG94], are shown in Fig. 5.
[Figure 5: design and configuration flow. Block labels: Designer: Hardware Architecture and Constraints; Rules and Membership Functions; Training Database; Technology Target Library; fuzzyCAD Design Toolkit & Synopsys Synthesis; FuNe I Software; Programming Software; Generic VHDL Module Library; Hardware Synthesis; Configuration/Programming by Software; Fuzzy System Operation; Standard Fuzzy Controller and FuNe I config. Fuzzy System, each with input X and output Z.]
Fig. 5. Design and Configuration Flow: Standard Fuzzy Controllers (Mamdani) and FuNe I configurable Systems
Based on a generic netlist module library, hard-wired implementations of standard fuzzy controllers for rapid prototyping purposes can be generated for an FPGA technology (Xilinx). Using the advanced system fuzzyCAD, which is based on generic VHDL descriptions, reprogrammable fuzzy controllers can be synthesized for different ASIC target technologies. The system architecture and the module selection are influenced by interaction with the designer in order to meet the required timing and area constraints. By selecting a dedicated defuzzification module, a special architecture can be synthesized, which can be configured by FuNe I (off-line software training).
The hardware requirements for the third system (neuro-fuzzy classifier) are totally different, since the structure of this system is based on a three-layer neural network. This fuzzy-interpretable neural network can be trained on the chip. The number of neurons in the hidden layer is not fixed and can be varied during the training process. Therefore a special generic systolic array architecture is required for the hardware implementation. An initial structure (number of hidden neurons, initial edge weights) can be programmed based on optionally available expert knowledge. The structure of this system is shown in Fig. 6.
[Figure 6: design and configuration flow. Block labels: Expert Knowledge for Neural Network Structure Initialisation; Training Database; Designer: Systolic Array Structure and Size; Target Lib.; Hardware Synthesis (Synopsys); Generic VHDL Component Library; Pre-Configuration by Software; On-Chip Training; Fuzzy System Operation; Neuro-Fuzzy Classifier Hardware with input X and output Z.]
Fig. 6. Design and Configuration Flow: Neuro-Fuzzy Classifier
This special fuzzy system structure and the related hardware are described in the last part of this contribution.
II. Rapid Prototyping of Fuzzy Systems on FPGA Target Architectures
A toolkit, FUZ2LCA, for the automatic generation of application-specific fuzzy controllers from a high-level fuzzy language input has been successfully implemented and tested [HHKG94]. A similar approach has recently been described in [Hun95]. The advantage of a direct hard-wired implementation is the minimum hardware overhead in both the data path and the controller of the design, which leads to a minimum number of logic cells required on the target devices. SRAM-based Field Programmable Gate Arrays (FPGAs) are well suited for prototyping purposes due to their reprogrammability. An additional advantage is that the configurable logic cells may also be used efficiently for an on-chip realization of small SRAM memory blocks. The compiler FUZ2LCA automatically creates a design of a complete standard fuzzy controller, based on a netlist module library with generic parameters. A large design space in terms of timing and area can be covered, since the designer can choose the number of computation units working in parallel. Fuzzy systems written either in the C programming language or in a fuzzy programming language (e.g. Togai's FPL language) can be synthesized and converted to Xilinx Netlist Format (XNF). This enables the user to define the fuzzy system in a problem-specific manner. Problems arising in mapping specifications of large fuzzy systems can be solved by effectively partitioning the design into several FPGAs.
Each fuzzy system design consists of three modules or functional units: fuzzification, rule inference, and composition/defuzzification. All modules have their own local controllers, allowing them to operate independently. The user can set parameters depending on the availability of hardware resources and the required speed, so that a highly parallel design, a completely sequential design or a compromise can be created. Due to the high time consumption of many commonly used methods, the defuzzification unit should normally operate in parallel to the fuzzification and inference units. Depending on user-selectable parameters, the system controller supports both sequential and pipeline modes. Due to its modularity, FUZ2LCA can be easily extended by adding alternative modules. In addition to the FPGAs, external memories are needed for storing antecedent and consequent membership functions (MSF in Fig. 7).
Fig. 7. Generated fuzzy hardware
Fig. 8. Fuzzification block
A. Fuzzification Unit
Membership functions $\mu_{X_k,i}$ can be easily stored as lookup tables, using two different external RAM blocks [SU95]. All odd-numbered membership functions ($i \in \{1, 3, 5, \ldots\}$) are stored in an 'odd' RAM block, while the even-numbered membership functions ($i \in \{0, 2, 4, \ldots\}$) are stored in an 'even' RAM block (Fig. 8). The restriction of this method is that at most two membership functions can overlap, but the RAM blocks can be accessed efficiently in parallel.
B. Rule Inference
Inference is the process in which the premise and the consequent membership function of a single rule are evaluated (Fig. 9 and leftmost part of Fig. 10), whereas in the composition the inference results of many rules are combined (center part of Fig. 10, here: 'max' operation). Three different types of rule evaluators can be generated:
- simple evaluators that can either read or negate the membership value of an input
- Min/Max rule evaluators for rules of low complexity
- complex rule evaluators with a maximum of 16 Min/Max operations and parenthesis hierarchies
The premises are evaluated by using a single rule evaluator or several in parallel, depending on the rule base complexity and the timing constraints. The initially implemented, but easily extendable, inference/composition method is Min/Max. The outcome of the composition is piped directly into the defuzzification unit running in parallel (no additional intermediate memory required).
Fig. 9. Rule inference
Fig. 10. Composition and MOA defuzzification
C. Defuzzification
The defuzzification unit is normally the most time-consuming module, especially if a very resource-consuming method, such as COG, is implemented. Two steps are taken in order to overcome this problem:
- the defuzzification module is always generated as a module running in parallel
- less time-consuming methods are generated with efficient hardware structures
Midpoint of Area (MOA), also known as Center of Area (COA), and Mean of Maxima (MOM) are standard defuzzification methods [DHR93] which can be implemented efficiently in hardware. Considering the composition output curve as $\omega$ (normalized to the maximum value unity: $0 \le \omega \le 1$), and denoting by $Z_{val}$ the finite set of possible normalized output values of a fuzzy controller, with $Z_{val} = \{z_0, \ldots, z_i, \ldots, z_n\}$, the different defuzzification methods can be formalized as follows:
- Center Of Gravity (COG), the center of gravity of the area:
$$z_{out}^{COG} = \frac{\sum_{i=0}^{n} \omega_i z_i}{\sum_{i=0}^{n} \omega_i} \tag{3}$$
- Midpoint Of Area (MOA), the middle of the area:
$$z_{out}^{MOA} = z_h \quad \text{where} \quad \sum_{i=0}^{h} \omega_i = \sum_{i=h}^{n} \omega_i \tag{4}$$
- Mean Of Maxima (MOM), the center of gravity of the area under the maxima of the fuzzy output:
$$z_{out}^{MOM} = \frac{\sum_{i \in M} z_i}{|M|}, \qquad M = \{\, i \mid \omega_i = \max(\omega_1, \ldots, \omega_n) \,\} \tag{5}$$
Here $\omega_1, \ldots, \omega_n$ are the curve segments which originate from the corresponding consequent membership function segments $\mu_{Z,1}, \ldots, \mu_{Z,n}$ (after composition).
- Center of Mean (COM), the middle of the area under the maxima of the fuzzy output, introduced in [HHKG94]. The defuzzified crisp output is
$$z_{out}^{COM} = z_h \quad \text{where} \quad \sum_{i=1}^{h} s_i = \sum_{i=h}^{n} s_i \tag{6}$$
with, for all $k \in \{1, \ldots, n\}$,
$$s_i = \begin{cases} \omega_i & \text{if } \omega_i = \max(\omega_k) \\ 0 & \text{if } \omega_i < \max(\omega_k) \end{cases}$$
The standard MOA method and two variations of it can be generated as defuzzification units [HHKG94]. The hardware implementation of the MOA method is depicted in Fig. 10. The pointers $z_l$ (lower) and $z_u$ (upper) cover the output range, starting from the lower and the upper limit respectively and moving stepwise towards their meeting point. The area underneath the fuzzy output shape is added to a register as the pointer on the left moves and subtracted from this register as the pointer on the right moves. The meeting point $z_m$ equals the crisp output, since the integration is performed such that the condition
$$\sum_{i=0}^{l} \omega_i - \sum_{i=u}^{n} \omega_i \overset{!}{=} \min \tag{7}$$
is optimized in every step (see Fig. 11: $A_1 = A_2$). Since complex operations such as multiplications or divisions are not involved in the MOA strategies, these methods are much faster than COG.
[Figure 11: fuzzy output curve $\omega$ over $z$ from 0 to $n$, with area $A_1$ to the left of pointer $l$ and area $A_2$ to the right of pointer $u$.]
Fig. 11. Efficient implementation of MOA defuzzification
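The defuzzification formulas (3) to (5) can be checked on a small sampled output curve. The following Python sketch is a software model only, not the generated hardware; the function names are illustrative, and because an exact balance in (4) need not exist on a discrete grid, `moa` picks the index with the smallest imbalance, which is an assumption of this sketch.

```python
from typing import Sequence


def cog(omega: Sequence[float], z: Sequence[float]) -> float:
    """Center of gravity, eq. (3): sum(omega_i * z_i) / sum(omega_i)."""
    total = sum(omega)
    if total == 0:
        raise ValueError("fuzzy output is empty (all omega_i are zero)")
    return sum(w * zi for w, zi in zip(omega, z)) / total


def moa(omega: Sequence[float], z: Sequence[float]) -> float:
    """Midpoint of area, eq. (4): z_h where left and right partial sums balance.

    Assumption: on a discrete grid the sums may never be exactly equal, so the
    index with the smallest absolute difference is used (first one on ties).
    """
    def imbalance(h: int) -> float:
        return abs(sum(omega[: h + 1]) - sum(omega[h:]))

    h_best = min(range(len(omega)), key=imbalance)
    return z[h_best]


def mom(omega: Sequence[float], z: Sequence[float]) -> float:
    """Mean of maxima, eq. (5): average of all z_i where omega_i is maximal."""
    peak = max(omega)
    maxima = [zi for w, zi in zip(omega, z) if w == peak]
    return sum(maxima) / len(maxima)


if __name__ == "__main__":
    omega = [1, 1, 1, 3, 0]  # sampled composition output (hypothetical values)
    z = [0, 1, 2, 3, 4]      # output value grid
    print(cog(omega, z), moa(omega, z), mom(omega, z))  # 2.0 2 3.0
```

Only `cog` needs multiplications and a division per output; `moa` and `mom` get by with additions and comparisons, which is the property the text relies on for the hardware units.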
Fig. 12. Fuzzy truck control
D. Application Example
After several tests, the compiler has been successfully applied to generate a fuzzy controller for the fuzzy truck with trailer described in [HRG94]. This fuzzy controller consists of 11 fuzzy rules, 2 inputs and 1 output (each with 5 membership functions), and employs Max-Min inference/composition and MOA defuzzification (Fig. 12). A 4-bit version of the generated fuzzy controller (this accuracy is sufficient for the application) could be implemented in an XC4006 FPGA, and only 42 µs are needed for calculating a new output. This result can be compared with standard solutions such as the DSP TMS320 (150 µs) and special fuzzy solutions such as the Togai ASIC FC110 (32 µs).
III. fuzzyCAD: New Module oriented VHDL-based Design Approach
Based on the experience with the previously described automated fuzzy controller implementations on FPGAs, a completely modular fuzzy controller design toolkit is being developed. Compared to the FPGA solution, which was a pure rapid prototyping approach, the VHDL-based approach provides more flexibility and is intended to become a CAD system for flexible customer-specific solutions. VHDL was chosen as the description language because a lot of design experience with VHDL specifications and the SYNOPSYS simulation and synthesis toolkits was already available. The new system is not restricted to one target technology, and the realized controller is fully reprogrammable by software. The user is able to design a MIMO standard fuzzy controller according to the requirements of one or more application domains. The advantages of the library-oriented VHDL concept are the flexibility concerning the integration of new modules and the possibility of making a rough estimate of the resulting timing and area costs. Another benefit of this concept is the reduced simulation effort, since the modules have already been tested many times (reuse of design components). The main simulation effort is therefore the validation of whether the selected bit widths are sufficient for the required computation accuracy. The fuzzy controller parameters which are determined and fixed by the design process are:
- number $n_{inp}$ of input and $n_{outp}$ of output signals
- number $n_{MFin}$ of membership functions (MFs) for input signals
- number $n_{MFout}$ of MFs for output signals
- the maximum overlap $ov_{max}$ of MFs (the maximum number of MFs which can produce a non-zero value for one crisp input value)
- MF storage technique
- external and internal bit widths
- number and capabilities of rule evaluation modules running in parallel
- defuzzification method
By choosing these parameters, the user influences the timing and area of the resulting implementation.
A. Overview: General Structure
The toolkit consists of a modular generic VHDL description library and a configuration software tool. The selection of VHDL modules and the setting of their generic instantiation parameters are performed automatically or interactively by this CAD program according to specific user requirements. The structure for instantiation is shown in Fig. 13.
[Figure 13: Designer Interaction; fuzzyCAD Hypertext-based Design Manager (instantiation of selected structures); VHDL Module Library; Fuzzy Controller Frame with Fuzzification Unit, Rule Evaluation Unit and Inference/Defuzz. Unit.]
Fig. 13. Structure: Processor Instantiation
For application support of the controller chip, a bit stream generator program for binary rule coding will be provided. The bit stream can be stored in an (EE)PROM which is located on the board adjacent to the fuzzy controller and is read once after power-up (Fig. 14).
[Figure 14: Fuzzy System Knowledge Database; Rule Coding Bitstream Generation Software; Printed Circuit Board with Fuzzy Controller and Rule Memory (EE)PROM.]
Fig. 14. External Rule Configuration Memory
B. The Design Flow
The complete design flow is shown in Fig. 15:
[Figure 15: fuzzyCAD Design Manager (library-based instantiation and composition of the VHDL description, using the VHDL Module Library); VHDL Description of Fuzzy Controller; High-Level and Logic Synthesis; Target Technology Netlist; Standard Cell / FPGA Design Software; Physical Layout Description File (ASIC) or Device Configuration Bitstream (FPGA).]
Fig. 15. Design Flow: Generic Fuzzy Processor
[Figure 16: membership functions MF 0 to MF 4 (degree of membership over crisp input) distributed over RAM 0 (MF 0, MF 3), RAM 1 (MF 1, MF 4) and RAM 2 (MF 2).]
Fig. 16. Overlap-free Membership Function Storage
[Figure 17: membership function (degree of membership over crisp input) approximated by straight-line segments with base points (x0, y0) to (x7, y7) and slopes m0 to m7; y4 = 255.]
Fig. 17. Membership function approximation
Using the fuzzy controller design software fuzzyCAD, a VHDL description of the complete controller is generated. This VHDL source code can be mapped onto a standard cell or FPGA target library using a high-level design tool (SYNOPSYS in our case). With vendor-specific design software, this netlist can be compiled to a physical implementation (layout). Simulation can be performed on every level of abstraction in order to validate the processor functionality.
C. Internal Membership Function Representation
C.1 Overlap-free MF Storage
For efficient defuzzification, an overlap-free membership function storage is very useful. The maximum overlap $ov_{max}$ determines the number of RAM blocks required for storing the MFs: $R_0, \ldots, R_{ov_{max}-1}$. Generally, a membership function $MF_i$, $i \in \{0, \ldots, n_{MF} - 1\}$, is stored in the RAM module $R_x$, where $x = i \bmod ov_{max}$. Fig. 16 shows an illustration for $ov_{max} = 3$. Overlap-free MF storage is very effective for defuzzification, since the $ov_{max}$ RAM banks storing the MFs of an output variable can be processed in parallel for inference computation. This is important, because the defuzzification operation is the critical performance bottleneck.
C.2 MF Shape: Piece-wise linear Representation
In many neuro-fuzzy approaches in which fuzzy systems are generated automatically ([HG94], [HPG95], [HG95]), the resulting membership functions can be bell-shaped. The typical approach of using look-up tables for fuzzification is inefficient in many cases, because it requires a huge fuzzification memory. One solution to this problem is the approximation of membership functions by straight lines, as shown in Fig. 17. In this example each membership function is represented by a maximum of 8 straight lines, which reduces the required memory capacity. For a sigmoidal membership function, 256 bytes are needed for a simple look-up table, compared to 24 bytes for the approach with membership function approximation. Each line $Y = m_i (X - x_i) + y_i$ is characterized by three parameters: the slope $m_i$ and the coordinates $x_i$ and $y_i$ of the leftmost point of the line. These parameters may be held in three memory words, either distributed over three short memories or concatenated into one bit string.
C.3 MF Shape: Look-Up Table Representation
For high-speed MF access, the look-up table representation is well suited: for every crisp input value, the MF-related fuzzified values can be read directly out of the memory without additional computation effort. This method is sometimes also more efficient for MFs with bent shapes, which would require many base points for a piece-wise linear representation.
D. Fuzzification Unit
Since the defuzzification is the most time-consuming unit, fuzzification can be performed serially without influence on the global timing behavior of the circuit. Fig. 18 shows the data-path structure of this unit: crisp inputs are captured in input registers and applied to the fuzzifier serially. The fuzzifier can be realized as a look-up table. Since fuzzification is not too time-critical, in the presented approach the membership functions are stored in piece-wise linear form and the fuzzified values are computed by interpolation. This also implies one search through the MF memory per input variable.
[Figure 18: Crisp Input 1 to Crisp Input n; Input Registers; MUX; Membership Function Memory and Fuzzification; Output RAM; Output to Rule Evaluation Unit.]
Fig. 18. Structure of Fuzzification Unit
Generic VHDL fuzzification unit entity:

    library ieee;
    use ieee.std_logic_1164.all;

    entity fuzzification is
      generic (
        var_width      : integer;  -- bit width for representation of # input variables
        num_width      : integer;  -- bit width for representation of # MF per variable
        x_width        : integer;  -- bit width x-input
        y_width        : integer;  -- bit width y-input
        m_width        : integer;  -- bit width m-input
        adr_width_ram2 : integer;  -- depth of output-memory
        adr_width_ram1 : integer;  -- depth of first memory
        ram1_width     : integer   -- width of first memory
      );
      port (
        start         : in  std_logic;
        reset         : in  std_logic;
        clk           : in  std_logic;
        en_write_ram1 : in  std_logic;  -- enable write to ram1
        write_ram1    : in  std_logic;  -- write to ram1 (from main controller)
        read_ram1     : in  std_logic;
        ram_copy      : in  std_logic;  -- read ram2 (from main controller)
        var           : in  std_logic_vector(var_width - 1 downto 0);
        num           : in  std_logic_vector(num_width - 1 downto 0);
        x             : in  std_logic_vector(x_width - 1 downto 0);  -- MF point: x coordinate
        y             : in  std_logic_vector(y_width - 1 downto 0);
        m             : in  std_logic_vector(m_width - 1 downto 0);
        ein           : in  std_logic_vector(2**var_width * x_width - 1 downto 0);
        end_read_ram1 : out std_logic;
        end_read_ram2 : out std_logic;
        ram1_full     : out std_logic;
        -- calcul_end: port named in the source listing; its mode and type are not legible there
        out_ram2      : out std_logic_vector(var_width + num_width + y_width - 1 downto 0)
      );
    end fuzzification;

[Figure 19: RAM 0, RAM 1 and RAM 2 (ports IN, OUT, adr, r/w, en, ram_adr_sel) addressed from the MF memory module; rule weights written from the Rule Eval. Module; MINIMUM units feeding a MAX tree; COUNTER (lower) and COUNTER (upper) with COMPARE; ALU with carry and accumulator REGISTER.]
Fig. 19. MOA Defuzzification: Inference and Integration Unit
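The interpolation step of the fuzzification unit, one search through the segment table followed by $Y = m_i (X - x_i) + y_i$, can be modelled in a few lines of Python. This is a behavioural sketch under assumptions: the table layout as `(x_i, y_i, m_i)` tuples sorted by `x_i`, the function name and the example triangle are illustrative, and `bisect` stands in for whatever search the hardware actually performs.

```python
from bisect import bisect_right
from typing import List, Tuple

# One segment per line: (x_i, y_i, m_i) = leftmost point and slope.
Segment = Tuple[float, float, float]


def fuzzify(x: float, segments: List[Segment]) -> float:
    """Evaluate a piece-wise linear membership function at crisp input x.

    The segment whose start x_i is the largest one not exceeding x is
    selected (the "search through the MF memory"), then the line equation
    Y = m_i * (X - x_i) + y_i is applied.
    """
    starts = [seg[0] for seg in segments]
    idx = bisect_right(starts, x) - 1
    if idx < 0:
        raise ValueError("crisp input lies left of the first segment")
    x_i, y_i, m_i = segments[idx]
    return m_i * (x - x_i) + y_i


if __name__ == "__main__":
    # Hypothetical triangular MF with peak 20 at x = 20.
    triangle = [(0, 0, 0), (10, 0, 2), (20, 20, -2), (30, 0, 0)]
    print([fuzzify(x, triangle) for x in (5, 15, 20, 25, 35)])
```

With eight segments and one byte per parameter such a table occupies 8 x 3 = 24 bytes, which matches the memory figure quoted in the text if one-byte parameters are assumed.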
E. Inference and Defuzzi cation Modules
The composition and defuzzification unit works similarly to the defuzzification unit in the previously described FPGA prototyping system, using the midpoint-of-area (MOA) method. In contrast to the FUZ2LCA solution, the MF overlap $ov_{max}$ can now take any value. The operational flow is as follows: the integration begins at the zero point of the output variable's value range. This value is applied to the MF memory, and a certain number (< $ov_{max}$) of non-zero MF values and the corresponding MF identifiers are read out. This can be done fully in parallel, since the MFs are stored overlap-free. Then a minimum operation is performed on these MF values and the corresponding rule weights. The results are fed into a max-tree, and the final value is used for the MOA defuzzification (integration), i.e. it is added to or subtracted from the accumulator, depending on the current integration direction. The integration direction for the next step is determined by the sign of the integration result stored in the accumulator register. Fig. 19 shows the operational unit for defuzzification.
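The operational flow just described, minimum with the rule weight, maximum over the overlapping MFs, and a sign-controlled accumulator that moves one of two pointers per step, can be mirrored in a short behavioural model. The Python sketch below is an assumption-laden illustration, not the VHDL unit: MFs are given as full sample lists instead of overlap-free RAM banks, and the tie rule (move the lower pointer when the accumulator is zero) is a choice of this sketch.

```python
from typing import Sequence


def composed_output(i: int, mfs: Sequence[Sequence[int]],
                    weights: Sequence[int]) -> int:
    """Max over all output MFs of min(MF value at position i, rule weight)."""
    return max(min(mf[i], w) for mf, w in zip(mfs, weights))


def moa_two_pointer(mfs: Sequence[Sequence[int]], weights: Sequence[int],
                    z: Sequence[int]) -> int:
    """Sign-driven bidirectional integration for MOA defuzzification.

    acc holds (area taken from the left) - (area taken from the right).
    If acc <= 0 the lower pointer advances and its value is added,
    otherwise the upper pointer retreats and its value is subtracted.
    The meeting point of both pointers is returned as crisp output.
    """
    lower, upper = 0, len(z) - 1
    acc = 0
    while lower < upper:
        if acc <= 0:
            acc += composed_output(lower, mfs, weights)
            lower += 1
        else:
            acc -= composed_output(upper, mfs, weights)
            upper -= 1
    return z[lower]


if __name__ == "__main__":
    # Two hypothetical output MFs sampled on 7 points, with rule weights 2 and 1.
    mfs = [[0, 1, 2, 1, 0, 0, 0], [0, 0, 0, 1, 2, 1, 0]]
    print(moa_two_pointer(mfs, weights=[2, 1], z=list(range(7))))  # 3
```

Each step needs one comparison of the accumulator sign and one addition or subtraction, so the loop finishes after exactly `len(z) - 1` steps regardless of the curve shape.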
F. Programmable Rule Evaluation Kernel
The rule evaluation kernel is programmable by software. During the design phase, the number of rule evaluators running in parallel, their types and the rule memory size are fixed. Three classes of rules can be processed:
1. Trivial Rules: if a is high then x is medium
2. Normal Rules: conditions chained in a sequence with AND, OR and NOT operators
3. Hierarchical Rules: multilevel nesting with parentheses; operators: AND, OR, NOT.
For every class of rules, instances of rule evaluators can be created. For normal applications the classes 1 and 2 are of primary interest. Rule evaluators of a class n may also evaluate rules of class m with m < n. The user has direct influence on the number of rule evaluators created for each class. For low-performance applications it may be sufficient to create only one rule evaluator of the most complex class to be processed later on. All rules are then processed sequentially and the chip area (cost) is minimal. For high-performance applications, multiple rule evaluators of one or different classes may operate in parallel. Each evaluator can process multiple rules sequentially.
IV. Implementation of FuNe I Fuzzy Systems
Two possibilities can be considered for the real-time implementation of fuzzy systems that are automatically generated by neuro-fuzzy systems. The first and simple method is the direct implementation of all neurons and interconnections. The second method is to implement the network as a fuzzy system. In the case of a FuNe I multilayer perceptron based system, the computation time and the hardware complexity needed for the first method are much higher than for the second method. For fuzzy-interpretable neural networks based on nearest prototype classification, however, the first method is more appropriate, as described in Section V. Although FuNe I type fuzzy systems designed with off-line software training can be implemented in commercially available fuzzy processors, an application-specific design would increase the speed. The design must be easily configurable for different generated fuzzy systems. The first hardware implementation has been a simple FPGA design. The FuNe I fuzzy system with 4 inputs and 3 outputs, extracted from the popular Iris data set [And35], is implemented in a single Xilinx FPGA 4005 chip. This design is used in a prototype board that can be connected to a personal computer via the ISA bus for the visualization of classification results. The typical approach of using look-up tables for fuzzification can be inefficient in cases with high fan-in, because it requires a huge fuzzification memory. One solution to this problem is the approximation of membership functions by straight lines, as indicated in a previous section (see also Fig. 17). A comparison of the performance of the two efforts described is summarized in Table I.
V. Real-time Fuzzy Interpretable Classifiers
Although FuNe I fuzzy systems can be efficiently configured in FPGA-based prototype boards as discussed above, the off-line neural network training used for their design can hardly be implemented in FPGAs due to area limitations. However, several new methods for the generation of fuzzy classification systems have been presented that can be implemented in FPGAs:
- Dynamic Vector Quantisation (DVQ) variations (DVQ2 and DVQ3) as improved versions of Learning Vector Quantisation networks [HG95]
- Cubic Basis Function Networks (CBFN) (derived from the well-known Radial Basis Function Networks) with modified Restricted Coulomb Energy (MRCE) learning, presented in [HPG95].
Since these methods can be considered nearest prototype neural networks, the distances between an input vector and all the reference vectors are calculated to decide upon the class membership of the input vector. The prerequisite is the selection of a distance measure with low computational intensity. The distance measure used in competitive learning can be defined more generally by the Minkowski metric [Koh89]:
$$\mathrm{dist}_{\lambda}(\vec{I}, \vec{W}) = \left( \sum_{d=1}^{n} |I_d - W_d|^{\lambda} \right)^{1/\lambda} \tag{8}$$
The most commonly used measures, the Euclidean distance ($\lambda = 2$), the city block distance ($\lambda = 1$) and the Maximum ($\lambda \to \infty$), can be derived from this general form.
A. Computationally Feasible Distance Measures
The Euclidean distance, although good simulation results have been reported for it, requires a multiplication per dimension, which is a disadvantage in a hardware implementation. Therefore, the city block distance and the Maximum measure are compared:
$$\mathrm{city\ block\ dist}(\vec{I}, \vec{W}) = \sum_{d=1}^{n} |I_d - W_d| \tag{9}$$
$$\mathrm{max\ dist}(\vec{I}, \vec{W}) = \max_{1 \le d \le n} |I_d - W_d| \tag{10}$$
[Figure 20: points of equal distance in the $X_1$-$X_2$ plane for the Maximum distance (in all dimensions) and for the city block distance.]
Fig. 20. Points of equal "distances" for different distance measures
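The three measures (8) to (10) and the nearest prototype decision built on them can be written down directly. The Python sketch below is illustrative only: the function names and the prototype list format `(reference_vector, class_label)` are assumptions, not part of the described systems.

```python
from typing import Callable, List, Sequence, Tuple


def minkowski(i_vec: Sequence[float], w_vec: Sequence[float], lam: float) -> float:
    """General Minkowski metric, eq. (8)."""
    return sum(abs(a - b) ** lam for a, b in zip(i_vec, w_vec)) ** (1.0 / lam)


def city_block_dist(i_vec: Sequence[float], w_vec: Sequence[float]) -> float:
    """City block distance, eq. (9): additions and subtractions only."""
    return sum(abs(a - b) for a, b in zip(i_vec, w_vec))


def max_dist(i_vec: Sequence[float], w_vec: Sequence[float]) -> float:
    """Maximum distance, eq. (10): subtractions and comparisons only."""
    return max(abs(a - b) for a, b in zip(i_vec, w_vec))


def nearest_prototype(
    i_vec: Sequence[float],
    prototypes: List[Tuple[Sequence[float], str]],
    dist: Callable[[Sequence[float], Sequence[float]], float] = city_block_dist,
) -> str:
    """Return the class label of the reference vector closest to i_vec."""
    return min(prototypes, key=lambda p: dist(i_vec, p[0]))[1]


if __name__ == "__main__":
    i_vec, w_vec = [1, 2, 3], [4, 6, 3]
    print(city_block_dist(i_vec, w_vec), max_dist(i_vec, w_vec),
          minkowski(i_vec, w_vec, 2))
```

Neither `city_block_dist` nor `max_dist` contains a multiplication, which is the reason the text restricts the hardware comparison to these two measures.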
TABLE I
Comparison of two FuNe I fuzzy system implementations

Features                                      Design 1                       Design 2
Inputs                                        4                              128
Outputs                                       3                              4
Mem. func. per input                          4                              128
No. of rules                                  16                             256
Type of FPGA                                  XC4005                         XC4006
No. of sigmoids (EPROMs)                      3 (256 bytes)                  4 (256 bytes)
No. of memory units for storing mem. func.    1 (256 bytes per mem. func.)   3 (8 bytes per mem. func.)
Speed in million rules per second             1.25                           1.25

TABLE II
Comparison of distance measures (number of false classifications)

Data set    city block dist    max dist
Iris        6                  6
Solder      3                  5
Digit       0                  26
Both measures describe the points with equal "distances" as squares for a two-dimensional input space (inputs $\vec{I} = \{X_1, X_2\}$), as shown in Fig. 20. The Iris data set [And35] is a real-world data set that can also be considered a benchmark, having 4 inputs and 3 classes. The Solder data set is from a real-world application [HPmG93], consisting of 23 inputs and classifying solder joints into 2 classes, "good" or "bad". In this paper the authors use another real-world data set, Digit, from optical digit recognition, containing 36 preprocessed inputs and 10 outputs [HG94]. The complexity of the data sets Iris, Solder and Digit increases in terms of the number of inputs. The comparison in Table II indicates that the city block distance is better, especially for complicated large data sets such as Solder and Digit.
B. Fixed Point Calculation
A reduced fixed-point format should be found that does not significantly degrade performance. The error versus the number of bits is analyzed for different data sets, limiting the number of reference vectors generated per class to one. The simulation results clearly indicate that the number of bits needed is less than 6, even for complex classification examples (Fig. 21). The number of neurons per class remains constant throughout the simulation, since dynamically adding neurons would also compensate for the computational accuracy.
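The kind of experiment described here, reducing the word length and recomputing distances, can be sketched as follows. The Python model below is hypothetical: it assumes inputs and reference vectors normalized to the range 0 to 1 and unsigned rounding to `bits` bits, neither of which is specified in the text.

```python
from typing import Sequence


def quantize(x: float, bits: int) -> int:
    """Map a value from [0, 1] to an unsigned fixed-point code with `bits` bits.

    Assumption: values are normalized to [0, 1]; out-of-range values are clipped.
    """
    levels = (1 << bits) - 1
    x = min(max(x, 0.0), 1.0)
    return int(x * levels + 0.5)


def city_block_fixed(i_vec: Sequence[float], w_vec: Sequence[float],
                     bits: int) -> int:
    """City block distance computed on quantized integer codes."""
    return sum(abs(quantize(a, bits) - quantize(b, bits))
               for a, b in zip(i_vec, w_vec))


if __name__ == "__main__":
    i_vec, w_vec = [0.10, 0.80, 0.45], [0.20, 0.75, 0.50]
    for bits in (8, 6, 4, 2):
        print(bits, city_block_fixed(i_vec, w_vec, bits))
```

Running a classifier once per word length with such a quantized distance and counting the false classifications would yield an error-versus-bits curve of the kind referred to as Fig. 21.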
C. FPGA Architecture for Parallel Processing One can consider most of the tasks solved by neural networks as operations with arrays (e. g. input vectors and weights). These tasks are often solved with highly parallel architectures. Very often systolic arrays, as best suited structures to this class of problems are used. For performing nearest neighbor decisions with m n-dimensional reference vectors, a 2D systolic array with m rows and n + 1 columns can be used (Fig. 22). Reference vectors are stored in processing elements (PEs), one in each row of the systolic array. The rightmost additional column is used to nd the smallest distance. In case of CBFN, additional elements ~rj = fr ; r ; g, which determine the extensions of the hyper box from ~ j = fW ; W g to the dimensions d = the center W f1; 2; g, when the input vector ~I = fX ; X ; g is presented, are stored into each PE. Whatever the signal a PE receives from the left neighbour in the row, it has to pass the original input element X to the next PE in the column. If a PE receives `1' from its neighbour: if jX , W j r , it passes the `1' to the neighbour in the row; otherwise it passes `0' to the neighbour in the row. The output of the PEs in the row will be thereafter `0', since the input element is not in the attraction region of this hyper box. One-dimensional systolic arrays can also be considered for implementations (Fig. 23). Then no parameter data is stored in PEs and the parameters of all hyper boxes are applied to the array one after another. If the number of reference vectors (hyper boxes) is m and number of dimension is n, m(n + 2) + n cycles are needed to make a nearest neighbor decision. The last PE in the row stores the actual smallest distance and compares this with its input. In this way the smallest stored distance is updated. Considering hardware restrictions of FPGAs, systolic arrays with many elements could be hardly implemented. It is also possible to make an array with a feedback loop. 
Fig. 21. Classification error for different fixed point formats (errors versus accuracy [Bit] for "iris_data", "solder_data" and "tyre_data")
Fig. 22. Systolic array (2D) for nearest neighbor classification
Fig. 23. Systolic array (1D) for nearest neighbor classification
Fig. 24. SIMD array
If Single Instruction Multiple Data (SIMD) arrays are used for nearest neighbor classification, reference vectors
are stored in local memories, either connected externally or integrated into each PE. It takes n cycles for a PE to compute the distance between an n-dimensional reference vector and an input vector. If there are k PEs, k distances are computed in parallel. It takes n + k cycles to get the result if the outputs of all PEs are sent sequentially to a common data bus. To classify m input vectors it takes $\frac{m}{k}(n + k)$ cycles. A SIMD array solution which is more appropriate to DVQ variants is illustrated in Section V-D.
C.1 Comparison of Different Architectures
Considering the large application Digit, there should be up to 10 output classes. The feature space should have up to 36 dimensions, and the number of dynamically generated neurons can be limited to 150. For every dimension of the feature space 6 to 8 bits can be used. The first architecture (Fig. 22) is hardly implementable on 4 FPGAs because of the high number of PEs. Although these PEs are simple, not more than 10 of them can be implemented on one chip (with external memory). It is also important to consider that there are many connections between the elements, which in turn take up a lot of the routing resources of the FPGA. When the generated neurons are assigned to PEs, the number of PEs needed for this solution is very large: $150 \times 36 + 150 = 5550$. The second and the third solutions seem to be more suitable for FPGAs, but the first of these two architectures (Fig. 23) is still difficult to implement. With 36 dimensions, 37 PEs are needed. This means that three of the FPGA chips must contain 9 PEs each and one chip 10 PEs. If three parameters ($X_d$, $W_{jd}$, $r_{jd}$) are input to each PE simultaneously, every PE needs $3 \times 8 = 24$ I/O pins, and overall $10 \times 24 = 240$ I/O pins are needed, but an XC4013 FPGA has only 192 I/O blocks. If data is presented to the PEs sequentially in 4-bit portions, the number of available I/O blocks is sufficient, but this is, of course, twice as time consuming, and every PE also needs three 4-bit registers for input data.
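The resource figures quoted in this comparison follow from simple arithmetic, which the short Python sketch below reproduces. The function names are hypothetical; the numeric inputs (150 neurons, 36 dimensions, 8 or 4 bits per parameter, three parameters per PE, 10 PEs per chip, 192 I/O blocks for the XC4013) are the ones stated in the text.

```python
XC4013_IO_BLOCKS = 192  # I/O blocks of one XC4013, as stated in the text


def pes_2d_array(m, n):
    """PEs in the 2D systolic array: m rows of n PEs plus one minimum column."""
    return m * n + m


def pes_1d_array(n):
    """PEs in the 1D systolic array: n distance PEs plus one minimum PE."""
    return n + 1


def io_pins(pes_per_chip, params_per_pe, bits_per_param):
    """I/O pins needed when every PE on a chip receives its parameters in parallel."""
    return pes_per_chip * params_per_pe * bits_per_param


if __name__ == "__main__":
    print("2D array PEs:", pes_2d_array(150, 36))
    print("1D array PEs:", pes_1d_array(36))
    for bits in (8, 4):
        pins = io_pins(10, 3, bits)
        print(bits, "bit:", pins, "pins, fits:", pins <= XC4013_IO_BLOCKS)
```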
Fig. 25. Registers of a processing unit (ac: calculated distance; db: data bus buffer; ir: instruction register; enb: enable buffer; cnt: RAM addressing register; w: RAM for weight vectors)
The third solution (1D systolic array with a feedback loop) is somewhat slower than the second one, but seems to be more easily
implementable, since it needs fewer PEs and fewer registers for input data. Since the neurons representing reference vectors are created dynamically, finding the optimal number of PEs is complicated, and the highest classification speed is seldom achieved. None of the three architectures presented above allows training. In case of CBFN with MRCE learning, the dimension which causes the misclassification must be known, whereas the systolic array solution only shows whether the classification result is correct or not. If a SIMD array is used for implementing a nearest neighbor classifier and CBFN, every PE needs at least 4 to 5 registers and a separate control unit, which can be implemented in a DSP. If an application works in real time and the calculation speed for a single input vector has to be as high as possible, a SIMD array is the best solution.
D. SIMD Array with FPGAs
Each dynamically generated neuron is assigned to a processing unit in an FPGA, which calculates the distance between an input vector and the neuron. Since the distance to all the reference vectors (processing elements) has to be calculated for each input vector, a SIMD array can be used for this purpose. Since no multiplication is used in training or in recall operation, the proposed algorithms seem to be very effective. Although a real-time implementation with a DSP board is sufficient for classification benchmarks such as Iris data, it is absolutely necessary to implement the parallel parts of the algorithms with the SIMD array for complicated applications (e.g. the Digit data set, where the number of reference vectors generated is at least 46 in the case of DVQ3). Even though the clock rate of FPGAs is several times lower than that of a powerful DSP solution, the overall speed gain is much higher for such applications due to the massively parallel implementation of processing units in FPGAs. Depending on the application, either CBFN/MRCE, DVQ2 or DVQ3 can be selected. For highly representative training data sets CBFN/MRCE is appropriate. For other applications either DVQ2 or DVQ3 is more suitable, since the CBFN/MRCE method does not have the ability to generalize properly. For highly overlapping data sets DVQ3 gives the best results due to its excellent generalization capability.
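The role of a single processing unit in such a SIMD array can be sketched in software. The model below follows the register transfers named in Section D.1 (ldcnt, ldac, ldwc, adac, rdac) and the 64-word weight RAM mentioned there; the class layout, the helper functions and the toy vectors are assumptions made for illustration, and the register bit widths are not modelled.

```python
class ProcessingElement:
    """Behavioural model of one PE (register bit widths are not modelled)."""

    RAM_WORDS = 64  # weight RAM size given in the text

    def __init__(self):
        self.w = [0] * self.RAM_WORDS  # weight RAM
        self.ac = 0                    # accumulator (distance)
        self.cnt = 0                   # RAM address register

    def ldcnt(self, db):
        self.cnt = db                  # cnt <- db

    def ldac(self, db):
        self.ac = db                   # ac <- db

    def ldwc(self, db):
        self.w[self.cnt] = db          # w(cnt) <- db
        self.cnt += 1                  # cnt <- cnt + 1

    def adac(self, db):
        self.ac += abs(self.w[self.cnt] - db)  # ac <- ac + |w(cnt) - db|
        self.cnt += 1                          # cnt <- cnt + 1

    def rdac(self):
        return self.ac                 # dbus <- ac


def load_reference(pe, reference):
    """Write one reference vector into the weight RAM of a PE."""
    pe.ldcnt(0)
    for weight in reference:
        pe.ldwc(weight)


def simd_distances(pes, x):
    """Broadcast the same instruction and data word to all PEs in each step."""
    for pe in pes:
        pe.ldcnt(0)
        pe.ldac(0)
    for x_d in x:
        for pe in pes:  # in hardware these PEs operate in parallel
            pe.adac(x_d)
    return [pe.rdac() for pe in pes]  # read out sequentially over the bus


if __name__ == "__main__":
    refs = [[10, 20, 30], [12, 18, 33], [40, 5, 7]]  # hypothetical values
    pes = [ProcessingElement() for _ in refs]
    for pe, ref in zip(pes, refs):
        load_reference(pe, ref)
    print(simd_distances(pes, [11, 19, 31]))
```

The inner loop over `pes` is sequential only because this is a software model; the point of the SIMD array is that one broadcast `adac` updates all accumulators in the same cycle.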
Fig. 26. Structure of a PE
Fig. 27. Calculation of minimum distance (bit-serial search from MSB to LSB; values shown for steps 1 to 4: 1101, 1011, 0100, 0011; 1101, 1011; 1101; 1101)
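The bit-serial minimum search over a Wired-OR bus (Section D.2, Fig. 27) can be imitated in Python as follows. This is a sketch under assumptions: the bus is modelled as a logical OR over the bits written by the active PEs, all distances share a fixed word length, and the function name and return format are hypothetical. The example distances 0010, 0100, 1011 and 1100 are the bitwise inverses of the four values shown in Fig. 27.

```python
def wired_or_minimum(distances, width):
    """Find the minimum of `distances` bit-serially, MSB first.

    Returns the minimum and the indices of the PEs still active at the end.
    """
    mask = (1 << width) - 1
    inverted = [(~d) & mask for d in distances]  # step 1: invert all bits
    active = [True] * len(distances)
    result = 0
    for bit in range(width - 1, -1, -1):         # MSB down to LSB
        written = [(v >> bit) & 1 for v in inverted]
        bus = 0
        for is_active, b in zip(active, written):
            if is_active:
                bus |= b                         # wired OR of active PEs
        if bus == 1:
            # PEs that wrote 0 cannot hold the minimum: deactivate them.
            active = [a and b == 1 for a, b in zip(active, written)]
        result = (result << 1) | bus
    minimum = (~result) & mask                   # invert the final result
    winners = [i for i, a in enumerate(active) if a]
    return minimum, winners


if __name__ == "__main__":
    minimum, winners = wired_or_minimum([0b0010, 0b0100, 0b1011, 0b1100], 4)
    print(format(minimum, "04b"), winners)
```

Inverting the distances turns the minimum search into a maximum search, which is what a Wired-OR bus resolves naturally: a PE holding a `1' at the current bit position always dominates a PE holding a `0'.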
D.1 Distance Calculation
The registers implemented in a processing unit are shown in Fig. 25. The 64-word RAM w with 6-bit word length stores the weight vectors for 64 dimensions. The accumulator ac stores the calculated "distance", and cnt is used for indirect addressing of the RAM. Table III shows the number of Configurable Logic Blocks (CLBs) occupied by each register of a PE.

register    width [Bit]    Number of CLBs
w, RAM      6x64           12
ac          12             6
ir          3              1.5
cnt         6              3
db          6              3
enb         1              0.5
TABLE III. Number of CLBs needed for registers

The following register transfers are implemented for calculation of the city block distance:
adac: calculates the city block distance according to equation 9: $ac \leftarrow ac + |w(cnt) - db|$, $cnt \leftarrow cnt + 1$
ldcnt: loads the index register cnt: $cnt \leftarrow db$
rdac: reads the accumulator ac of a PE: $dbus \leftarrow ac$
ldac: initializes the accumulator ac: $ac \leftarrow db$
ldwc: initializes the weight vectors of a neuron: $w(cnt) \leftarrow db$, $cnt \leftarrow cnt + 1$
The processing elements described in VHDL are synthesized with a commercial tool (Synopsys) to get the FPGA net list (see also Fig. 26).
D.2 Nearest Neighbor Calculation
After the calculation of the city block distances, the minimum distance has to be determined. The method presented in Fig. 27 uses a Wired-OR bus for this purpose [HPaMG94]:
- Invert all the bits.
- Starting from the most significant bit (MSB), for all the bits:
  - All PEs activated by the controller write the current bit of their distance to the Wired-OR bus.
  - If the resulting bit on the bus is "1": deactivate all the PEs that have written a "0" to the bus, then move to the next bit.
  - If the resulting bit on the bus is "0": move to the next bit.
The final result is inverted to get the minimum distance. In the example shown in Fig. 27 the minimum distance is "0010".
VI. Conclusions and Future Work
The presented compiler FUZ2LCA for automatic generation of fuzzy controller implementations on FPGAs has been tested with several application examples. Since the rules are hard-wired, this concept had to be improved for the automated design of fuzzy ASICs, where re-programmability of the rules is required. Since high-level design entry makes mapping onto different target technologies possible (standard cell libraries, FPGA libraries, ...), a VHDL-based approach is well suited for a fuzzy CAD toolkit. An additional advantage is that the system can already be simulated on the behavioral level in order to validate whether the selected bit widths for internal and external signals are sufficient for achieving a required computation precision. Since the whole system is instantiated from a generic VHDL library, basic design faults can be excluded. The basic modules of the VHDL library are already available. Currently the modules are being integrated into the complete controller structure. The fuzzyCAD design manager is currently being implemented with a hypertext-based user interface. Furthermore, implementation cost models, realized as estimators for timing and area, will be developed. Additionally, a defuzzification module for a FuNe I configurable fuzzy system is currently being specified in VHDL. The neuro-fuzzy approaches can either deliver fuzzy modules that can be implemented by the generic fuzzy processor, or they are hardware friendly and fuzzy interpretable neural structures that are directly considered as fuzzy hardware solutions.
References
[And35] E. Anderson. The Irises of the Gaspe Peninsula. Bull. Amer. Iris Soc., 59:2-5, 1935.
[Bez93] J. C. Bezdek. A Review of Probabilistic, Fuzzy, and Neural Models for Pattern Recognition. Journal of Intelligent Fuzzy Systems, 1, 1993.
[DHR93] D. Driankov, H. Hellendoorn, and M. Reinfrank. An Introduction to Fuzzy Control. Springer-Verlag, USA, 1993.
[HG94] S. K. Halgamuge and M. Glesner. Neural Networks in Designing Fuzzy Systems for Real World Applications. International Journal for Fuzzy Sets and Systems, 65(1):1-12, 1994. North Holland.
[HG95] S. K. Halgamuge and M. Glesner. Fuzzy Neural Networks: Between Functional Equivalence and Applicability. IEE International Journal on Neural Systems (in press), 1995. World Scientific Publishing.
[HHKG94] S. K. Halgamuge, T. Hollstein, A. Kirschbaum, and M. Glesner. Automatic Generation of Application Specific Fuzzy Controllers for Rapid Prototyping. In IEEE International Conference on Fuzzy Systems '94, Orlando, USA, June 1994.
[HPaMG94] S. K. Halgamuge, W. Pochmuller, C. Grimm, and M. Glesner. Fuzzy Interpretable Dynamically Developing Neural Networks with FPGA Based Implementation. In Fourth International Conference on Microelectronics for Neural Networks and Fuzzy Systems, Torino, Italy, September 1994.
[HPG95] S. K. Halgamuge, W. Pochmuller, and M. Glesner. An Alternative Approach for Generation of Membership Functions and Fuzzy Rules Based on Radial and Cubic Basis Function Networks. International Journal of Approximate Reasoning (in press), 1995. Elsevier.
[HPmG93] S. K. Halgamuge, W. Pochmuller, and M. Glesner. A Rule based Prototype System for Automatic Classification in Industrial Quality Control. In IEEE International Conference on Neural Networks '93, pages 238-243, San Francisco, USA, March 1993. IEEE Service Center; Piscataway. ISBN 0-7803-0999-5.
[HRG94] S. K. Halgamuge, T. A. Runkler, and M. Glesner. A Hierarchical Hybrid Fuzzy Controller for Realtime Reverse Driving Support of Vehicles with Long Trailers. In IEEE International Conference on Fuzzy Systems '94, Orlando, USA, June 1994.
[Hun95] D. L. Hung. Dedicated Digital Fuzzy Hardware. IEEE MICRO, 15(4), August 1995.
[Koh89] T. Kohonen. Self-Organization and Associative Memory. Springer Verlag, 1989.
[MA75] E. H. Mamdani and S. Assilian. An experiment in linguistic synthesis with a fuzzy logic controller. IJMMS 7, 1975.
[SU95] H. Surmann and A. P. Ungering. Fuzzy Rule-Based Systems on General-Purpose Processors. IEEE MICRO, 15(4), August 1995.