WPA3-1 A Modular gm-C Programmable CNN Implementation

Drahoslav Lím and George S. Moschytz
Signal and Information Processing Laboratory, Swiss Federal Institute of Technology
CH-8092 Zurich, Switzerland

ABSTRACT

A programmable cellular neural network (CNN) has been designed in a 0.8 µm CMOS technology. An arbitrarily large analog CNN can be constructed by modularly connecting 'tile' CNN chips, each with a modest number of cells. The network operates in continuous time and has a piecewise-linear (PWL) output function. The cell connections (template values) are realized as sets of switchable unit and half-unit transconductors. Matching accuracy was verified on uncoupled and coupled templates, including matching among chips from different manufacturing runs, and so was network operation.

1. INTRODUCTION

Numerous applications of cellular neural networks [1] and the CNN 'Universal Machine' [2] have been reported in the literature, for both image and signal processing. Each cell of the network [1] has the schematic shown in Fig. 1. The output function is a piecewise-linear (PWL) limiting function,

$$ f(x) = \frac{1}{2}\left(|x + 1| - |x - 1|\right). $$

To achieve a high degree of efficiency, applications rely on the CNN being implemented as a parallel computing structure in analog hardware. Analog circuits are usually, by their nature, faster and more compact than equivalent digital circuits, and they consume less power, provided the application does not require high accuracy. However, they are typically more difficult to design.

Several different approaches have been taken in implementing CNNs, e.g., [3–5]. The designs differ considerably in their available templates, degree of connectivity and programmability, form of data input/output, and general implementation philosophy. Some implementations modify the circuit. They may use an output nonlinearity other than the piecewise-linear function or a smooth approximation to it [5], or they may restrict the swing of the state variable $x_{ij}$ (the capacitor voltage) [3]. These alterations significantly increase the achievable cell density, e.g., [5]. However, such modified cell circuits may be unsuitable for processing grayscale images, which occur both in image processing [6] and in other signal processing [7].

As with any parallel computing structure, we would like the CNN to comprise as many cells as possible. However, extremely large dies or wafer-scale integration are not practical. As a further consideration, it is therefore desirable to be able to construct an arbitrarily large CNN in a modular way [4, 8]. In other words, chips with a modest number of cells are interconnected to build up a larger network containing many cells. This is analogous to the bit-slice organization used in some digital systems, though in the case of a CNN the 2-D arrangement is more akin to a 'mosaic' composed of tiles. The additional chip area, circuit-board area and power consumption are not a problem in a CNN designed for general algorithm testing and development; for a well-defined task, a high-density, single-purpose chip can then be designed. Also, the cell RC time constant, and hence the CNN transient settling time, must be increased from tens of nanoseconds to a few microseconds. This accommodates the delays and parasitic capacitances due to inter-chip connections. It is usually not a severe performance degradation, as the throughput of a CNN processor tends to be limited by the time required for programming, setup, data input/output, etc., rather than by the cell time constant.

To incorporate the CNN circuit into a signal processing system, it is necessary to have electrical inputs, outputs and control lines, irrespective of how the template programming is actually accomplished inside the CNN. Optical input/output structures may be attractive for their parallel operation [3], but they are practical only when the data is already available in optical form. For data in electrical or computer form, e.g., transformed acoustical data [7], a scheme using serial data or row-by-row transmission has to be used.
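To make the tile ('mosaic') organization concrete, the sketch below does two things. First, it maps a global cell coordinate of a large network onto a tile chip and a local position on that chip. Second, it counts how many nearest-neighbor (8-connected) cell pairs end up on different chips. Each such pair must be connected through inter-chip wiring, whose delay and parasitic capacitance are the reason for the longer cell time constant discussed above. The tile size, the network size and the helper names `locate` and `count_links` are hypothetical illustrations; the paper does not fix them here.

```python
def locate(i, j, tile_rows, tile_cols):
    """Map global cell (i, j) to ((tile_row, tile_col), (local_row, local_col))."""
    return (i // tile_rows, j // tile_cols), (i % tile_rows, j % tile_cols)


def count_links(n_rows, n_cols, tile_rows, tile_cols):
    """Count undirected 8-neighbor cell pairs, split into on-chip and inter-chip pairs.

    Assumes a rectangular network tiled by identical rectangular chips
    (a hypothetical layout used only for illustration).
    """
    on_chip = inter_chip = 0
    # Visiting only these four "forward" offsets counts each undirected pair once.
    forward_offsets = ((0, 1), (1, -1), (1, 0), (1, 1))
    for i in range(n_rows):
        for j in range(n_cols):
            for di, dj in forward_offsets:
                k, l = i + di, j + dj
                if not (0 <= k < n_rows and 0 <= l < n_cols):
                    continue  # neighbor lies outside the network (boundary cell)
                tile_a, _ = locate(i, j, tile_rows, tile_cols)
                tile_b, _ = locate(k, l, tile_rows, tile_cols)
                if tile_a == tile_b:
                    on_chip += 1
                else:
                    inter_chip += 1
    return on_chip, inter_chip
```

For example, `count_links(8, 8, 4, 4)` (an 8×8 network built from four hypothetical 4×4 tiles) returns `(168, 42)`, so one neighbor pair in five crosses a chip edge. `locate(5, 2, 4, 4)` places cell (5, 2) at local position (1, 2) on tile (1, 0).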
This paper presents some aspects of a CNN integrated circuit designed to implement the original CNN circuit [1] with nearest-neighbor connections and a piecewise-linear output characteristic. This cell type was adopted for the implementation without further changes.
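For reference, the state equation of the original CNN cell [1] shows where the programmable quantities enter. It is written here for the 3×3 (nearest-neighbor) neighborhood $N(i,j)$. $A_{kl}$ and $B_{kl}$ denote the feedback and input template entries connecting cell $(k,l)$ to cell $(i,j)$, and $f$ is the PWL output function given above:

$$
C\,\frac{dx_{ij}(t)}{dt} = -\frac{1}{R}\,x_{ij}(t) + \sum_{(k,l)\in N(i,j)} A_{kl}\,y_{kl}(t) + \sum_{(k,l)\in N(i,j)} B_{kl}\,u_{kl} + I,
\qquad y_{ij}(t) = f\big(x_{ij}(t)\big)
$$

Counting terms, each cell has 9 feedback entries, 9 input entries and one constant source. That is 18 template entries plus $I$. The cell dynamics are scaled by the time constant $\tau = RC$.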

0-7803-4455-3/98/$10.00 (c) 1998 IEEE

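Figure 1: CNN circuit of [1] (schematic labels: input $u_{ij}$ with source $E_{ij}$; state $x_{ij}$ across C and R; constant source I; controlled sources $B_{kl} u_{kl}$ and $A_{kl} y_{kl}$; output $y_{ij}$)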
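The dynamics of the Fig. 1 cell can be explored numerically. The sketch below integrates the cell equation of [1], C dx_ij/dt = −x_ij/R + Σ A_kl y_kl + Σ B_kl u_kl + I, using forward Euler. It makes several assumptions:

- normalized units (R = C = 1, so time is measured in RC time constants);
- a space-invariant 3×3 template pair;
- a fixed boundary of zero-valued virtual cells.

The function names, step size and boundary choice are illustrative and do not describe the chip. The chip additionally allows every source in every cell to be programmed separately, which a closer model would represent with per-cell template arrays.

```python
import numpy as np


def pwl(x):
    """PWL CNN output f(x) = 0.5 * (|x + 1| - |x - 1|), i.e. x clipped to [-1, 1]."""
    return 0.5 * (np.abs(x + 1.0) - np.abs(x - 1.0))


def neighborhood_sum(T, v):
    """Return sum over the 3x3 neighborhood of T[dk, dl] * v[i + dk - 1, j + dl - 1].

    Cells outside the array are treated as zero (assumed fixed boundary).
    """
    rows, cols = v.shape
    padded = np.zeros((rows + 2, cols + 2))
    padded[1:-1, 1:-1] = v
    out = np.zeros((rows, cols))
    for dk in range(3):
        for dl in range(3):
            out += T[dk, dl] * padded[dk:dk + rows, dl:dl + cols]
    return out


def simulate_cnn(A, B, I, u, x0, t_end=10.0, dt=0.01):
    """Forward-Euler integration of dx/dt = -x + A*y + B*u + I with R = C = 1."""
    x = np.array(x0, dtype=float)
    drive = neighborhood_sum(B, u) + I  # input and bias terms do not change in time
    for _ in range(int(round(t_end / dt))):
        x += dt * (-x + neighborhood_sum(A, pwl(x)) + drive)
    return x, pwl(x)
```

As a usage example, take a self-feedback template A (center entry 2, all other entries 0) with zero input and I = 0. Each cell is then bistable, and its output settles to +1 or −1 according to the sign of its initial state. With `x0 = [[0.3, -0.3], [0.5, -0.1]]`, the returned outputs are approximately `[[1, -1], [1, -1]]`.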

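Figure 2: Step-wise programmable VCCS realization (schematic labels: differential input Vin; one half-unit source I0/2 and three unit sources I0, switched by S0–S3; bias Vbias; supply VSS; output Iout)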
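Figure 2 can be read as a small digital-to-analog converter. Each switch enables one unit (I0) or half-unit (I0/2) transconductor. The output current is the sum of the enabled units, scaled by the small-signal unit response. The sketch below models only that linear-range behavior, and several of its details are hypothetical:

- the switch assignment (S0 driving the half-unit source);
- the unit transconductance passed in as `gm_unit`;
- the code-to-value table.

The actual value sets were chosen following [8] and are not listed in this section. The text states only that they are not binary-weighted and that a sign bit is added.

```python
from fractions import Fraction
from itertools import product

# Weights of the switched sources in Fig. 2, in units of one full transconductor.
# Assumption: S0 controls the half-unit (I0/2) source, S1-S3 the unit (I0) sources.
SOURCE_WEIGHTS = (Fraction(1, 2), Fraction(1), Fraction(1), Fraction(1))

# Hypothetical magnitude table for a four-value (off-center) source: each code
# selects a switch pattern (S0, S1, S2, S3) instead of a binary-weighted value.
HYPOTHETICAL_OFF_CENTER_TABLE = {
    0b00: (0, 0, 0, 0),  # 0
    0b01: (1, 0, 0, 0),  # 0.5
    0b10: (0, 1, 0, 0),  # 1
    0b11: (1, 1, 1, 0),  # 2.5
}


def template_value(code, negative, table):
    """Signed template value (in units) selected by a magnitude code and a sign bit."""
    switches = table[code]
    # A source contributes only if its bias current is switched on.
    magnitude = sum(w for w, s in zip(SOURCE_WEIGHTS, switches) if s)
    return -magnitude if negative else magnitude


def vccs_output_current(v_in, code, negative, table, gm_unit):
    """Linear-range output current I_out = value * gm_unit * V_in (saturation ignored)."""
    return float(template_value(code, negative, table)) * gm_unit * v_in


def reachable_magnitudes():
    """All magnitudes the four switched sources of Fig. 2 can produce."""
    return sorted({sum(w for w, s in zip(SOURCE_WEIGHTS, pattern) if s)
                   for pattern in product((0, 1), repeat=len(SOURCE_WEIGHTS))})
```

Under these assumptions, one half-unit and three unit sources can reach the magnitudes 0, 0.5, ..., 3.5, i.e., at most eight distinct values. A lookup table, rather than binary weighting, then picks which of them the programming codes select.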

In particular, all eighteen A and B template entries (connections) of the neighborhood were retained and made programmable, as was the constant source, and each controlled source in each cell can be programmed separately.

2. ARCHITECTURE AND PROGRAMMABILITY

Because the entire CNN is composed of 'tile' chips rather than a single die, circuitry must be provided so that chips can be tuned to a common (absolute) standard, even if they were not made in the same fabrication run. In the present design this is accomplished by master/slave control loops and a voltage and resistance reference.

In the original CNN circuit [1], the connections between cells are formed by voltage-controlled current sources (VCCSs) with continuously variable strengths. However, little functional benefit is gained from continuously variable template values, since whole ranges of values perform the same signal processing task. Furthermore, such an arrangement has practical drawbacks. Since the host controller will be a digital system of some kind, a set of digital-to-analog converters is required, either on each chip of a multi-chip network or as one set for the whole CNN. Long analog control wires, liable to pick up noise during the CNN transient, must then be run throughout the chip or even the circuit board. More importantly, both alternatives require the tuning circuit of each chip to guarantee tracking of the tuning voltages. We therefore implement step-wise programmable templates. Their values, however, are not related to the digital programming word by standard binary weighting. Instead, the binary code selects one value from a set chosen so that many signal processing tasks can be performed with a modest number of available codes and a sign bit [8]. In the present design, the center A and B sources have eight values available, the off-center A and B sources four, and the constant source I fifteen.

Conceptually, each VCCS (realizing a single template value in one cell) comprises several unit and half-unit transconductors. Every VCCS is therefore a simple digital-to-analog converter that is nonlinear with respect to the digital control code (Fig. 2). The units are switched on or off by enabling or disabling the bias currents of their input differential pairs. Tuning circuits adjust the set of unit and half-unit transconductances through Vbias. The unit transconductors themselves are realized as conventional cascode OTAs (Fig. 3). Only the differential pair (boxed in Fig. 3) needs to be repeated as in Fig. 2; the surrounding set of current mirrors can be shared by the entire VCCS. A cross-coupled differential pair (not shown) is used to extend the near-linear input range of the OTAs.

Step-wise programmability also allows each VCCS of each cell to be programmed differently, which would be difficult to implement with analog control lines. Few tasks require spatially variant templates, but this feature greatly simplifies testing, since exhaustive testing is virtually impossible without it. It also allows the effects of different network sizes and boundary values to be examined [9]. The cell resistor R (Fig. 1) is also implemented


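[Figure 3 schematic: cascode OTA unit transconductor with transistors M1, M2, M2a/b, M3a/b, M4a/b, M6, M7 and M8; inputs VIN+ and VIN–; output OUT; tail bias Vbias; cascode biases VC and VCP; supply VDD]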