University of California, Santa Cruz

A Reconfigurable Hardware Accelerator for Back-Propagation Connectionist Classifiers

A thesis submitted in partial satisfaction of the requirements for the degree of Master of Science

in Computer Engineering

by Marcelo H. Martín

June 1994

The thesis of Marcelo H. Martín is approved:

Pak K. Chan
Martine D. F. Schlag
Anujan Varma

Dean of Graduate Studies and Research

Copyright © by Marcelo H. Martín 1994


Contents

Abstract  xiii
Acknowledgments  xiv

1. Introduction  1
   1.1 Background  1
   1.2 Technology  6
   1.3 Design Flow, Programming, and Terminology  8
   1.4 Contribution  9

2. Interconnection Network  11
   2.1 Motivation  12
       2.1.1 Terminology  12
       2.1.2 Motivating a Flexible Reconfigurable Interconnection Network  16
   2.2 Interconnection Architectures  16
       2.2.1 A Bus Can be Simple to Expand  19
       2.2.2 Mesh Structures Provide an Elegant Expansion  20
       2.2.3 Direction Sensing: Routing Bi-directional Nets  22
       2.2.4 A Crossbar Provides Relatively Predictable Delay, But is Difficult to Expand  23
       2.2.5 Passing the Baton  26
   2.3 Theory of Clos Interconnection Networks  27
       2.3.1 Mapping a Clos (Three-Stage) Network to FPGAs  31
       2.3.2 Implementable Interconnection Architectures  35
       2.3.3 Existing Routing Architectures  36
   2.4 Connectivity in ACME  37
       2.4.1 Types of Connections  37
       2.4.2 Pins Required by Inter-neuron Connections (pinc)  41
       2.4.3 Pins Required by Memory Connections (pmc)  42
       2.4.4 Growth of Pins: Pin-Count is Bound by f²  43
       2.4.5 Pin-count Required for a System of 14 Functional Units  44
   2.5 The Middle Stage of The Clos-like Network is Implemented With Routing FPGAs  46
       2.5.1 Number of Pins Needed by a Routing FPGA  47
       2.5.2 Analysis of Three Different Routing Schemes  48
       2.5.3 Pins Required by Inter-Neuron Connections (pinc) According to Scheme (c)  57
       2.5.4 Pins Required by Memory Connections (pmc) According to Scheme (c)  57
       2.5.5 Growth of Pins According to Scheme (c)  59
       2.5.6 Minimum Number of Routing FPGAs  60
   2.6 Implementation of a Clos-like Interconnection Network  63
   2.7 A System With 6 Input, 3 Hidden, and 2 Output Nodes  65
   2.8 A Note on the Scalability of ACME  70

3. Hardware Implementation  71
   3.1 Introduction: Overview of Implementation  72
   3.2 The ACME Board  75
   3.3 Interfacing the ACME Board and the SPARC Station  80
       3.3.1 Direct Memory Access  82
       3.3.2 Port Input/Output Transfers  82
       3.3.3 DMA Controller Card and Alterations  83
   3.4 X0: the Controller  83
       3.4.1 Controller Board  85
       3.4.2 Internal View of the X0 Controller and Description of Signals  86
       3.4.3 Port Decoding  87
       3.4.4 Configuring the Neurons and Routing Network  88
       3.4.5 Controlling the Neurons: Epoch Counter and Control Signals  91
       3.4.6 Reading From X0: Read-Back Multiplexer  93
       3.4.7 Memory Interface  93
   3.5 Status and Performance  95
       3.5.1 Status  95
       3.5.2 Neural Network Performance  96
       3.5.3 Memory Access and Configuration Performance  97

4. Supporting Software  99
   4.1 Introduction  99
   4.2 System Generation  99
   4.3 Utility Software  101

A. Synchronization  103
   A.1 Synchronization Issues  103
   A.2 Lessons Learnt  104

B. Software  108
   B.1 The color Program  108
   B.2 The configR0 Program  116
   B.3 Accessing Memory Chips With access memo  117
   B.4 Program net Controls ACME's Basic Functions  120
   B.5 Testing the Components  121

C. Schematics  125
   C.1 Description of Schematics  125
   C.2 Schematics  139

References  143

List of Figures

1.1  The McCulloch-Pitts neuron. The output is activated if the weighted sum of inputs is greater than or equal to θ, the threshold value.  2
1.2  A feed-forward neural network with two layers. Data is presented at the inputs (s1, s2, and s3). o1 and o2 are the outputs.  3
1.3  A back-propagation neural network. (a) feed-forward. The four hidden neurons (H1, H2, H3, and H4) broadcast their output, Vj, to all output neurons. O1 and O2 produce their outputs ok. (b) back-propagation. The output nodes read the target output (tk) and send the error values Δk,j to the hidden nodes.  4
1.4  Block diagrams of (a) hidden and (b) output neurons showing the synaptic processors responsible for generating dot product multiplications and back-propagation weight updates (marked with '*'), the dot product summer, the σ and σ′ function generators and the control units. External signals si represent the outputs of the input neurons (input neurons not shown). Vj represents the output of hidden node j. Δk,j represent the weight updates from output neurons to hidden neurons. tk represents the target output, and ok the real output. There are a maximum of n output neurons, m hidden neurons, and l input neurons. wj,i are hidden weights that are multiplied by the output of input neurons. wk,i are output weights that are multiplied by the output of hidden neurons.  6
2.1  Interconnections required during the feed-forward (solid lines) and feed-back phases of a system with three input, four hidden, and two output nodes. Square boxes denote particular data bits of the global memory, while circles denote the input, hidden, and output nodes of the system.  13
2.2  High-level view of the ACME architecture showing a system with n neurons. Forced external signals (to X0 and to the private memories) are shown in solid lines, group-forced external signals among functional units and the global memory in dotted lines, and group-forced external signals among functional units in dashed lines.  15
2.3  Some representative interconnection schemes: (a) shared bus, (b) linear array, (c) 4-way mesh (or simply mesh), (d) Clos-like, (e) crossbar.  18
2.4  Bi-directional signals need a second "direction" signal if FPGAs are used as the interconnect medium. (a) FPGA used as interconnect device, (b) FPIC used as interconnect device.  23
2.5  Block diagram of the Aptix AXB-AP4. The FPGAs can be Xilinx XC4000 series, and the FPICs are AX1024. Inside the dotted lines is group 1 of four.  25
2.6  (a) structure of a connecting system as described in [1], (b) 2-by-2 crossbar switch providing two point-to-point connections, (c) 2-by-2 crossbar switch providing a broadcast connection.  28
2.7  (a) shows a single crossbar with 12 inputs, 12 outputs, and 144 cross-points. (b) shows a three-stage symmetrical Clos network with 12 inputs, 12 outputs, and 120 cross-points. The advantage of (b) over (a) is that the crossbars are smaller (perhaps more cost-effective) and the number of cross-points is less.  30
2.8  Two possible mappings of a Clos network. (a) same user FPGA pins connected to the same middle routing chip. (b) shifting pattern. The boxes inside the functional units and routing FPGAs denote conceptual switches that we treat as small crossbars.  33
2.9  (a) Conceptual view of ACME's physical functional units (F) and routing (R) FPGA placement. (b) Areas assigned to R1 for its connections to functional units. There are eleven wires between functional units and routing FPGAs.  33
2.10  A simple example showing ACME's interconnect architecture.  34
2.11  ACME shows an interconnect architecture of the Clos type once a system is defined. An example of a system with two hidden and one output nodes. The hidden nodes can be considered to form part of the input stage, while the output nodes form part of the output stage.  35
2.12  Block diagrams of (a) hidden and (b) output nodes showing the synaptic processors responsible for generating dot product multiplications and back-propagation weight updates (marked with '*'), the dot product summer, the σ and σ′ function generators and the control units. External signals si represent the outputs of the input nodes (input nodes not shown). Vj represents the output of hidden node j. Δk,j represent the weight updates from output nodes to hidden nodes. tk represents the target output, and ok the real output. There are a maximum of n output nodes, m hidden nodes, and l input nodes. wj,i are hidden weight units that are multiplied by the output of input nodes. wk,i are output weight units that are multiplied by the output of hidden nodes.  38
2.13  The ACME architecture. Hidden and output nodes are functional units. Dashed lines represent memory connections (mc) and solid lines represent inter-neuron connections (inc).  39
2.14  A system composed of 13 hidden nodes, 1 output node, and a 24-bit input word. (a) Memory connections (mc). (b) Inter-neuron connections (inc). The circles in (a) and (b) stand for the hidden and output nodes. The hidden nodes in (b) are unfolded to show the connections to and from the output node more clearly.  40
2.15  A system composed of 12 hidden nodes, 2 output nodes, and a 24-bit input word. Connections to o1 are drawn using a dashed line for clarity. (a) Memory connections (mc), (b) inter-neuron connections (inc). The circles in (a) and (b) stand for the hidden and output nodes. The hidden nodes in (b) are unfolded to show the connections to and from the output node more clearly.  41
2.16  Plot of tp assuming a system with 40 functional units. h (number of hidden nodes) and o (number of output nodes) are subject to constraints (2) and (3). The input data width is fixed at 24 bits. Vertical axis: tp.  44
2.17  Pin usage versus number of output nodes. The number of hidden nodes is always h = 14 − o, and the input data width is always 24. tp = rhombus, pinc = crosses, pmc = squares.  46
2.18  Routing chips (r1, r2, ..., rk) are used to implement the middle stage of our three-stage Clos-like routing network.  47
2.19  Two possible mappings of a Clos network. (a) same user FPGA pins connected to the same middle routing chip. (b) shifting pattern. The boxes inside the functional units and routing FPGAs denote conceptual switches that we treat as small crossbars.  49
2.20  System with 2 hidden nodes, 2 output nodes, and 3 routing chips assuming the fixed-pin pattern interconnect. (a) feed-forward and feed-back connections. The routing chips are unfolded for clarity. No memory connections are shown. (b) Connections between routing chip ri, one hidden and one output node. Memory connections are accounted for: din/r between ri and hj, and 1 (target or real output) between ri and ok. Connections among hidden node hj and routing chip rx, and output node ok and routing chip rx.  50
2.21  System with 2 hidden nodes, 2 output nodes, and 3 routing chips assuming the shifting pattern construction. Routing chip r3 cannot supply either V1 or V2 to output node o2.  52
2.22  (a) A system with 3 hidden nodes, 2 output nodes, and 2 routing chips assuming the shifting pattern construction of Scheme (b). Routing chips and hidden nodes are unfolded for clarity. Fixed pins are used by output nodes to receive V3, and produce Δ1,3 and Δ2,3. (b) Connections between routing chip ri, one hidden, and one output node. Memory connections are accounted for: din/r among ri and hj, and 1 (target or real output) among ri and ok.  53
2.23  System with 2 hidden nodes, 2 output nodes, and 3 routing chips assuming the shifting pattern interconnect. (a) Feed-forward connections. (b) Feed-back connections.  54
2.24  Connections between routing FPGA ri, hidden node hj, and output node ok according to Scheme (c).  56
2.25  Connections among routing chip ri, hidden node hj, and output node ok.  56
2.26  Pin consumption (tp2) versus number of output nodes (o). The number of hidden nodes is always 14 − o, and the input data width is always 24. Rhombus: four routing chips; plus: five routing chips; square: 10 routing chips; x: 14 routing chips; triangle: 32 routing chips.  59
2.27  ACME architecture.  64
2.28  Footprint of functional units. The numbers inside the FPGAs represent input/output pads grouped into 11 "switches." The labels on the outside of the boxes represent connections to routing FPGAs.  65
2.29  Organization of the global memory.  66
2.30  Memory connections of a system with 3 hidden nodes, 2 output nodes, and 6 inputs. The figure shows memory-to-hidden connections on the top and memory-to-output connections on the bottom. The routing FPGAs on the top and on the bottom of the figure are the same physical units that have been unfolded for clarity. Note that because the system has six inputs (and six is not a multiple of the number of routing FPGAs), the sixth input is allocated to memory data bit 20 (one of the special data bits). The boxes inside the hidden nodes are weight units. The weights are shifted serially into the hidden nodes. The dotted lines going into the hidden nodes and feeding the first (or last) weight unit are bi-directional lines connected to the X0 controller.  68
2.31  Inter-neuron connections of a system with 3 hidden (H1, H2, and H3) and 2 output (O1 and O2) nodes. Notice how the Vj signals are broadcast inside the hidden nodes, in the first stage of the routing network. The routing FPGAs on the top and the bottom of the figure are the same physical devices. They are unfolded to illustrate the two different types of inter-neuron connections: feed-forward (Vj) and feed-back (Δk,j). The boxes inside the output nodes are output weight units. The weights are shifted serially into the hidden nodes. The dotted lines going into the output units and feeding the first (or last) weight unit are bi-directional lines connected to the X0 controller.  69
3.1  High-level view of the ACME environment.  73
3.2  Picture of the ACME development center. From left to right: top: oscilloscopes and power supplies. Bottom: ACME board and controller board, SPARC station, monitor and keyboard.  75
3.3  The ACME board at an intermediate stage of its development. Ribbon cables and buffers are at the top. The larger chips are the Xilinx FPGAs. The five at the periphery are XC4010s, and the six in the center are XC3195s. The smaller chips are the memory units. The seven in the center are global memories. The five at the periphery are private memories.  76
3.4  ACME board, DMA controller card, and X0 controller board.  77
3.5  Processing element.  78
3.6  Tree of buffers to decrease the load on each memory data bit. The P stands for pull-up resistors.  79
3.7  Signals used among the controller board and the DMA controller card.  81
3.8  Timing of "reads" generated by the DMA controller card. (a) Port I/O, (b) direct memory access. Both (a) and (b) according to slow transfer mode. Subtract two cycles if using "fast" transfer mode. Charts show "reads"; "writes" have similar waveforms. The frequency of the clock (clk) is 25 megahertz, its period 40 nanoseconds.  85
3.9  Controller board. From left to right, top: connector to SPARC station and transceivers, connector to power, X0 controller FPGA. Bottom: two 50-pin connectors to ACME board and transceivers, bank of LEDs. Right: connector for the xchecker cable and PROMs for storing a permanent configuration for the X0 controller.  86
3.10  High-level view of X0's architecture.  87
3.11  Implementation of (a) port I/O decoder and (b) a one-bit register.  87
3.12  Multiplexer for selecting among the different types of configuration streams and a weight value to be written to the XC4010 FPGAs. Only two inputs to the multiplexer are shown: one for FPGA F11 and another for FPGA F10. There are twelve more inputs to the multiplexer.  90
3.13  Speed of the DMA controller during DMA transfers. d ack- signals the transfer of a byte from the SPARC station to the X0 controller board. Fast mode: 1.57 microseconds per four bytes, or 2.5 MB/sec. Slow mode: 2.02 microseconds per four bytes, or 1.9 MB/sec.  97
4.1  Flow of system configuration.  100
A.1  Synchronization of functional units fails because go- is sampled at slightly different times. busy- signals generated by the three functional units in our system: first and second traces starting from the top (labeled 4 and 2 respectively) show hidden 1 and hidden 2's busy- signals. The third trace (labeled 1) shows busy- belonging to the output node being delayed by one cycle according to the fourth trace (main board's clock, labeled 3).  105
A.2  Signals fighting in the ribbon cable. Dotted lines inside X0 show the original implementation for the direction control to the transceivers. Solid lines inside X0 show the solution.  106
B.1  A run of the color program.  109
B.2  Steps for creating the unrouted LCA and XNF files for routing chip R0.  117
B.3  Final steps performed to obtain a file for routing chip R0 that can be downloaded to the main board. All the steps are performed using Xilinx proprietary tools. The last statement produces the file r0 r.mcs.  117
B.4  Organization of the global memory.  119
B.5  Original, non-permuted wts0 file shows the format of weights to be downloaded to the hidden and output neurons in ACME.  121
B.6  wts0p is the permuted version of wts0. Note that only hidden node weights need be permuted.  121
C.1  X0 controller top level.  128
C.2  Memory address counter and memory chip selectors.  129
C.3  Xilinx's macro for a 4-bit up-down counter.  130
C.4  Port address decoder.  131
C.5  Xilinx's macro for a decoder.  132
C.6  Miscellaneous functions.  133
C.7  Multiplexer.  134
C.8  Xilinx's macro for a 2-to-1 multiplexer.  135
C.9  Interface to the net-board. Epoch counter.  136
C.10  Pak Chan's and Martine Schlag's design of a preloadable up-down 3-bit counter.  137
C.11  Configuration path for ACME board FPGAs.  138

A Reconfigurable Hardware Accelerator for Back-Propagation Connectionist Classifiers

Marcelo H. Martín

Abstract

This thesis describes the realization of the prototype of a reconfigurable hardware accelerator. The accelerator emulates back-propagation connectionist classifiers. The prototype is called ACME, for Adaptive Connectionist Model Emulator. ACME is based on reconfigurable hardware devices called Field Programmable Gate Arrays (FPGAs) and facilitates the emulation of different-size connectionist classifiers. This thesis details the implementation of a Clos-like interconnection network that enables the basic units of the classifier to communicate and allows the classifier to expand to fourteen units while maintaining constant configuration time and memory storage requirements. The interconnection network exploits the reconfigurability of the FPGA devices to attain its flexibility. A careful design of the interface to the host computer aids in accomplishing the constant configuration time.


Acknowledgments

I wish to acknowledge Dr. Pak K. Chan for his invaluable help in all the aspects of the design and implementation of the hardware, and for teaching me through his exemplary perseverance how to get the work completed. Dr. Martine Schlag patiently dedicated many hours to help answer my repetitive questions, and was instrumental in the development of schemes (a), (b), and (c) that appear in Chapter 2. Dr. Anujan Varma read a copy of the thesis and made valuable comments and suggestions. Aaron Ferrucci is a key member of the group and designer of the units of ACME. His contributions, from the specification of the system requirements and excellent suggestions for the architecture, to the endless hours of debugging, were of paramount importance during our collaborative efforts. Jason Zien read part of this thesis and gave helpful suggestions. Dimitrios Stiliadis and Lampros Kalampoukas of the High-Speed Network lab (next door to ours) made the rough times more pleasant. My mother, Beatriz, fed my spirit. I gratefully acknowledge the support provided by NSF Grants MIP-8896276, MIP-9223740, RIA MIP-9111607, by the University of California MICRO Grant, and by Xilinx Inc.


1. Introduction

This thesis describes the realization of the prototype of a reconfigurable hardware accelerator. The accelerator emulates back-propagation connectionist classifiers. The prototype is called ACME, for Adaptive Connectionist Model Emulator, and has been jointly developed with [2]. The design and implementation of the basic units of ACME (neurons) can be found in [2]. This thesis focuses on the hardware environment that supports the neurons. The contribution of our joint work is the realization of a system which enables the back-propagation training stage of the neural network to be performed in hardware. The main contribution of this thesis is the design and realization of the interconnection network among the neurons, the interface of ACME to our host machine, and some of the details of the hardware implementation. The particular topology of the interconnections among the neurons, and the fact that we allow the number of neurons to vary, motivates the study of a means to reconfigure the interconnections among the neurons. We present such a study in Chapter 2. The hardware platform we design, together with the interface requirements, are detailed in Chapter 3. The supporting software is discussed in Chapter 4. Appendix A includes a discussion of solutions to some of the problems we encountered while prototyping ACME. Appendix B gives the most relevant details of the supporting software. Finally, Appendix C includes the schematics of most of the hardware and of our SBus interface controller. In the following section we describe the type of neural networks that ACME emulates. Section 1.2 introduces the integrated circuit devices we use to implement our prototype. Section 1.3 gives a brief overview of the steps taken to design and use the devices presented in Section 1.2. Section 1.4 summarizes the contributions of our work.

1.1 Background

Neural networks were originally developed towards the modeling of neurons in the brain [3]. The term connectionist network, classifier, or model [2, 4], is used to highlight the

difference between the biological neural networks and their emulations. We will call the networks neural or connectionist networks. Neural networks offer an alternative computational paradigm to Von Neumann computers (based on a programmed instruction sequence). Neural networks are built of basic elements called neurons, units, or nodes. A simple model for a neuron was proposed by McCulloch and Pitts in 1943 [3], and is shown in Figure 1.1. In this model, a neuron is viewed as a binary threshold unit. It computes the weighted sum of its inputs from other neurons, and outputs a one or a zero according to whether the sum is above or below a threshold, θ. To form a neural network, neurons are connected among themselves. The connections among neurons are weighted, and wj,i represents the strength of the connection from neuron i to neuron j. The McCulloch and Pitts model of a neuron, while it emulates a real neuron, is far from its biological counterpart. This is also true for neural networks. There is no intention to perfectly simulate the human brain; the focus is on a new paradigm of computation, not on biological accuracy. The information presented in this section is gathered from [2], [3], and [4].
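In software, the McCulloch-Pitts unit amounts to a thresholded dot product. The sketch below is purely illustrative; the function name, the use of NumPy, and the example values are our own and are not part of ACME:

    import numpy as np

    def mcculloch_pitts(inputs, weights, theta):
        # Weighted sum of the inputs; the unit fires (outputs 1) iff the sum reaches theta.
        total = float(np.dot(weights, inputs))
        return 1 if total >= theta else 0

    # Example: with unit weights and theta = 2, the unit fires when at least
    # two of its three binary inputs are active.
    print(mcculloch_pitts([1, 0, 1], [1.0, 1.0, 1.0], theta=2.0))   # prints 1

With unit weights on three binary inputs, the choice of θ alone turns the same unit into an OR gate (θ = 1), a majority gate (θ = 2), or an AND gate (θ = 3).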


Figure 1.1: The McCulloch-Pitts neuron. The output is activated if the weighted sum of inputs is greater than or equal to θ, the threshold value.

Neurons connected in the form of Figure 1.2 give rise to a two-layer feed-forward neural network. The solid circles are called input neurons and do no processing. Input neurons simply present a pattern to the neurons in layer one. In feed-forward neural networks every neuron provides inputs only to the neurons in the next layer. There are no connections among neurons that are in the same layer. In Figure 1.2 the neurons in layer one are called hidden neurons, because they have no connections outside of the neural network. The neurons in layer two are called output neurons. Input patterns are provided by the input

neurons, some processing is done in the hidden and output neurons, and the output of the neural network is generated by the output neurons. All hidden neurons (and in turn all output neurons) can process data simultaneously. This is because of the independence of the various neuron outputs in each layer from one another.


Figure 1.2: A feed-forward neural network with two layers. Data is presented at the inputs (s1, s2, and s3). o1 and o2 are the outputs.

One of the features of a neural network is its ability to adapt to new situations. The neural network can be trained on a number of examples until it learns the correct or an approximate answer. A neural network learns by adjusting the weights wj,i. There are two primary methods to adjust the weights: supervised learning, and unsupervised learning. Supervised learning can be accomplished by presenting the correct answers to the neural network. In unsupervised learning, the neural network discovers "statistically salient features" of the inputs on its own, with no hints provided. The back-propagation algorithm or learning rule is a supervised learning algorithm, and is a means to adjust the weights among the neurons. The weights are updated by propagating errors backwards, from neurons in one layer to the neurons in the previous layer. Neurons in one layer of a back-propagation neural network have connections to neurons in the previous layer. Figure 1.3(a) shows our original feed-forward two-layer neural network, and Figure 1.3(b) shows the connections needed for back-propagating the errors (Δj,k). Only output neurons propagate errors back in a two-layer neural network. Input neurons are idle during the back-propagation stage.


Figure 1.3: A back-propagation neural network. (a) feed-forward. The four hidden neurons (H1, H2, H3, and H4) broadcast their output, Vj, to all output neurons. O1 and O2 produce their outputs ok. (b) back-propagation. The output nodes read the target output (tk) and send the error values Δk,j to the hidden nodes.

ACME can emulate back-propagation connectionist classifiers. ACME uses back-propagation, a supervised learning algorithm, for updating the weights among the neurons. The classifier can train as well as classify in hardware. The interconnections of the neurons of ACME resemble those of Figure 1.3. We can distinguish two main processing stages: feed-forward, and back-propagation. The following describes the steps taken to accomplish the feed-forward and back-propagation processing stages. The description is an excerpt from Aaron Ferrucci's Master Thesis ([2]). The output of the neural network is computed layer by layer, starting with the application of data to the input neurons. Each input unit's value si is set to the value of the current input pattern. Each hidden neuron j, receiving the si inputs, computes the value hj, the dot product of input values si and its weight values wj,i. The output of a hidden unit, Vj, is the result of applying an activation function σ to hj. Let i range over all inputs, j range over all hidden neurons, and wj,i be the weight in hidden neuron j corresponding to input value si. Hidden neurons provide data to each output neuron k which computes the dot product of the Vj's of the hidden neurons and the output neuron's weights, wk,j. The output neuron applies the activation function σ to yield the output value ok with k

ranging over all output neurons, and wk,j being the weight value in output neuron k that multiplies the output of hidden neuron j. The activation function is typically a sigmoid function, σ(x) = 1/(1 + e^(−βx)), where β is a parameter used to regulate the steepness of the function. During the back-propagation step, the desired output is compared to the actual output, and the weight parameters can be adjusted in order to decrease error. The amount of the adjustment is regulated by a small constant, called the learning rate and denoted by η. The functions used by the neurons to accomplish the back-propagation are summarized below.

Feed-forward: Inputs (si) are presented to the hidden neurons. Each hidden neuron applies Equation (1.1) to produce its output Vj that is broadcast to the output neurons.

    V_j = \sigma\left( \sum_i w_{j,i} \, s_i \right)        (1.1)

Each output neuron receives the Vj outputs from all hidden neurons and in turn applies Equation (1.2) to produce its output ok .

    o_k = \sigma\left( \sum_j w_{k,j} \, V_j \right)        (1.2)

At this point, the output neurons show the output generated by the application of the inputs s to the neural network.

Back-propagation: The output nodes read their expected outputs (target outputs tk) and use them to update the weights wk,j that modify the connections from the hidden neurons. Output neurons use Equation (1.3) to update wk,j. Output neurons compute the error value Δk,j, Equation (1.4), and back-propagate it to the hidden neurons.

    w_{k,j} = w_{k,j} + \eta \, (t_k - o_k) \, V_j          (1.3)

    \Delta_{k,j} = (t_k - o_k) \, w_{k,j}                   (1.4)

After the output neurons have passed the Δk,j error values to the hidden neurons, hidden neurons apply Formula (1.5) to update the wj,i weights that affect their connections with the input neurons. σ′ is the derivative of the sigmoid function.

    w_{j,i} = w_{j,i} + \eta \, s_i \, \sigma'(V_j) \sum_k \Delta_{k,j}        (1.5)
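For reference, Equations (1.1)–(1.5) can be read as the following minimal software sketch. It is illustrative only: it assumes NumPy arrays, one training pattern at a time, and a logistic sigmoid whose derivative is taken as σ′ = βσ(1 − σ); none of these choices is prescribed by ACME's hardware implementation.

    import numpy as np

    beta, eta = 1.0, 0.1                          # sigmoid steepness and learning rate (assumed values)

    def sigma(x):
        # Logistic sigmoid, sigma(x) = 1 / (1 + exp(-beta * x))
        return 1.0 / (1.0 + np.exp(-beta * x))

    def train_step(s, t, W_hid, W_out):
        # One feed-forward / back-propagation pass for a single pattern.
        # s: inputs (l,), t: targets (n,), W_hid: hidden weights (m, l), W_out: output weights (n, m).
        V = sigma(W_hid @ s)                      # Eq. (1.1): hidden outputs V_j
        o = sigma(W_out @ V)                      # Eq. (1.2): network outputs o_k

        err = t - o                               # (t_k - o_k)
        W_out = W_out + eta * np.outer(err, V)    # Eq. (1.3): output weight update
        Delta = err[:, None] * W_out              # Eq. (1.4): Delta_{k,j} = (t_k - o_k) w_{k,j}

        # Eq. (1.5): hidden weight update; sigma'(V_j) expressed with the logistic
        # identity beta * V_j * (1 - V_j) (an assumption about the derivative's form).
        dsig = beta * V * (1.0 - V)
        W_hid = W_hid + eta * np.outer(dsig * Delta.sum(axis=0), s)
        return W_hid, W_out, o

    # Example: the network of Figure 1.3 (3 inputs, 4 hidden neurons, 2 output neurons).
    rng = np.random.default_rng(0)
    W_h, W_o = rng.normal(size=(4, 3)), rng.normal(size=(2, 4))
    W_h, W_o, out = train_step(np.array([1.0, 0.0, 1.0]), np.array([1.0, 0.0]), W_h, W_o)

The sketch follows the order given in the text: the output weights are updated by Equation (1.3) before the error values of Equation (1.4) are formed and propagated back to the hidden neurons.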

Figure 1.4 (a) and (b) show a hidden and an output neuron in ACME, respectively. The nodes and the principle of their operation are explained in [2].


Figure 1.4: Block diagrams of (a) hidden and (b) output neurons showing the synaptic processors responsible for generating dot product multiplications and back-propagation weight updates (marked with '*'), the dot product summer, the σ and σ′ function generators and the control units. External signals si represent the outputs of the input neurons (input neurons not shown). Vj represents the output of hidden node j. Δk,j represent the weight updates from output neurons to hidden neurons. tk represents the target output, and ok the real output. There are a maximum of n output neurons, m hidden neurons, and l input neurons. wj,i are hidden weights that are multiplied by the output of input neurons. wk,i are output weights that are multiplied by the output of hidden neurons.

1.2 Technology

We use Xilinx XC4010 field-programmable gate arrays (FPGAs) to implement the neurons of ACME [2] and Xilinx XC3195 field-programmable gate arrays to implement

the interconnection network among the neurons. A Xilinx XC3190 field-programmable gate array is our controller; it facilitates the interface between the host computer and the neurons. Field-Programmable Gate Arrays (FPGAs) provide a medium to accelerate the process of prototyping digital designs [5]. They are integrated circuits that consist of arrays of gates. These devices can be configured and re-configured by the system designer through software in the designer's laboratory, rather than by the chip manufacturer in the factory [6]. In particular, Xilinx Field Programmable Gate Arrays [6] consist of an interior matrix of Configurable Logic Blocks (CLBs) and a surrounding ring of Input/Output Blocks (IOBs). Interconnect resources occupy the channels between the rows and the columns of logic blocks, and between the logic blocks and the input/output blocks. The functions of the logic blocks, input/output blocks, and the interconnections among them are controlled by a configuration program stored in the on-chip memory. In our case, the configuration program (or configuration bit-stream) is loaded into the FPGAs from our host computer. Each input/output block at the periphery of the FPGAs can be configured to be an input, an output, or a bidirectional pin. We select the Xilinx FPGAs because of our familiarity with the devices and because they match our needs. We use the Xilinx XC4010 FPGAs to implement the neurons because they provide design flexibility, such as a fast-carry chain to aid in the construction of fast adders, and abundant routing resources. More on the criteria for the selection of the XC4010 FPGAs can be found in [2]. We exploit the flexibility of the internal interconnect structure of the Xilinx FPGA as well as the device's re-programmability to realize our interconnection network. We select the XC3195 FPGA as the building block because it offers the highest pin-count (176 user input/output blocks) of all the Xilinx FPGAs that come in Pin Grid Array (PGA) packages as of December 1993. Xilinx XC3195 FPGAs are also the fastest rated of the Xilinx FPGAs. The controller is realized in an XC3190 FPGA because of the high pin-count and the device's relatively good performance.


1.3 Design Flow, Programming, and Terminology

A design to be realized in an FPGA can be specified by means such as schematic entry, logic expressions, or hardware description languages [7]. Once the design is specified using one or a combination of the methods mentioned above, it is mapped to a network of logic cells particular to the FPGA architecture (technology mapping). The next step is the assignment of the network cells to physical cells on the logic cell array and the allocation of routing structures to realize the interconnection of the cells as in the network (placement and routing) [8]. The last steps in the process are to generate the configuration bit-stream and, finally, to configure (or program) the FPGA. Xilinx provides the xact graphical design editor to enter a design at the lowest possible level. The editor enables absolute control over the utilization of the available resources in the FPGA. We use the xact editor to enter the specification for five of the six XC3195 FPGAs we use to implement the routing among the neurons. Our controller is designed at the schematic-entry level, using ViewLogic's workview schematic capture software. The neurons are designed at the schematic-entry level too, using the xdp/wireC schematic editor (a university tool). We use the vendor's tools to map, place, and route our designs. The Partitioning, Placing and Routing (PPR) program is used to realize the neurons. PPR can accept a constraint file (extension cst) where different constraints (such as the assignment of signals to specific input/output blocks) can be defined. The input to PPR is an xnf file (Xilinx netlist format), and its output is an lca file (Xilinx logic cell array). The six XC3195 FPGAs and the controller FPGA are placed and routed with the Automatic Placement and Routing (APR) program. APR is the XC3000-series version of PPR. A command-line option tells APR to route a design incrementally. For example, if only minor changes are made to a design, the incremental option of APR only routes the changes, leaving the previous information relatively intact. An incremental option for PPR was not available at the time of writing this thesis. The makebits program generates the configuration bit-stream required to program the

particular device (XC4010, XC3195, or XC3190). The input to makebits is a placed and routed lca design file. The output is a file with extension bit. The bit description of the design is good enough to program the FPGAs via the vendor's xchecker software and cable. We develop our own configuration software, particular to our host machine, interface, and design requirements, to program the FPGAs in ACME. Our program expects the configuration files to be in Intel hexadecimal format. Program makeprom translates the bit format to the hexadecimal format. The input to makeprom is a bit file, and the output an mcs file.
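The flow above can be summarized by the file format each stage produces. The snippet below is only a mnemonic restatement of that flow; the helper and its names are hypothetical and are not part of ACME's supporting software.

    # Illustrative summary of the design flow described above: each stage is paired
    # with the file extension it produces (xnf -> lca -> bit -> mcs).
    FLOW = [
        ("design entry (xact, workview, or xdp/wireC)", "xnf"),   # Xilinx netlist format
        ("place and route (PPR for XC4000, APR for XC3000)", "lca"),
        ("makebits", "bit"),                                      # configuration bit-stream
        ("makeprom", "mcs"),                                      # Intel hexadecimal format
    ]

    def output_files(basename):
        # Return the file name produced at each stage for a given design.
        return [f"{basename}.{ext}" for _, ext in FLOW]

    print(output_files("mydesign"))   # ['mydesign.xnf', 'mydesign.lca', 'mydesign.bit', 'mydesign.mcs']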

1.4 Contribution

Our work is jointly developed with [2]. In [2], Aaron Ferrucci reports a digital implementation of a self-adapting and reconfigurable connectionist classifier using field-programmable gate arrays (ACME). The implementation of the back-propagation algorithm in FPGAs and of the neurons is discussed in [2]. Our work is inspired by GANGLION [9]. GANGLION is an implementation of a connectionist classifier in field-programmable gate array (FPGA) devices. But GANGLION has no learning ability: it performs only the feed-forward part of the classification problem; the weights are computed off-line using a software program. The weights are downloaded onto the hardware, so no weights are updated during processing. In contrast, the architecture of ACME is self-adapting and scalable, as well as reconfigurable. The architecture is self-adapting because ACME can implement the back-propagation algorithm in hardware. Scalability is attained by interconnecting the neurons via a flexible Clos-like interconnection network. Reconfigurability is inherent in the field-programmable devices [2]. Providing the interconnections among the neurons presents a special challenge, because the number of hidden nodes, output nodes, and input nodes is variable. To decrease the number of connections, the information is passed among the neurons in serial fashion. In this thesis we report a way to provide the flexible interconnections among the neurons: we analyze the neurons' requirements and implement a Clos-like interconnection network.

Only one copy of a hidden and one copy of an output neuron are needed because of the construction of the interconnection network. The copies are replicated in hardware, during configuration time, to produce neural networks of up to fourteen neurons. The configuration of the integrated circuit devices (FPGAs) on the ACME board is performed in parallel. The configuration speed is constant for any neural network of up to fourteen neurons. We provide a detailed study of the interconnection network, a description of the ACME board, and a description of the interface between the ACME board and our host computer.


2. Interconnection Network

The best architecture to interconnect elements on a multi-element board is the one specifically designed to suit a particular application. This architecture is obviously the best for such an application, but it might not adapt well to accommodate other, more general applications, or even (as in ACME) might not be able to support another, similar system built of the same basic units. There is no clear-cut "best" architecture to establish interconnections among FPGAs in a multi-FPGA board. In the particular case of ACME, we decided to implement a Clos-like interconnection network. We arrived at this decision after studying the different architectures that are possible to implement and after comparing their strengths and weaknesses. The Clos interconnection network is the most suitable to accommodate the topology required by a back-propagation neural network. In this chapter we study in detail the interconnection requirements as well as the implementation of the routing network. We start this chapter by defining some terms and clarifying the requirements of our board, specifically for the interconnection network, in Section 2.1.1. Section 2.2 briefly describes the interconnection schemes used on other FPGA-based boards, and analyzes their strengths and weaknesses. Section 2.3 presents a more detailed analysis of one of the interconnection schemes: the Clos interconnection network. The section reviews the basic requirements for an interconnection network, and later explains how we map a Clos-like network onto ACME. Having decided upon a specific type of interconnection architecture, in Section 2.4 and Section 2.5 we elaborate on the hardware requirements to implement such a network. In Section 2.6 we use the information from our previous analysis to implement our particular routing network. Finally, we present a sample system in Section 2.7 and briefly discuss scalability issues in Section 2.8.


2.1 Motivation

2.1.1 Terminology

The purpose of ACME is to provide a means to exercise different configurations of back-propagation connectionist classifiers [2]. We call each different configuration a system. Systems differ according to the function to be implemented, the precision desired, and the size of the classifier. Function and size of the classifier dictate the replication of the basic units that are present. We call the basic units nodes or neurons. There are three types of nodes: input, hidden, and output. A system consists of a set of input, hidden, and output nodes together with their interconnections as described in Chapter 1. The number of nodes is the size of the system. Nodes are connected to allow both the feed-forward and feed-back phases of the training of the classifier to take place. The number of connections among hidden, output, and input nodes varies from system to system according to the number of nodes. Variations in the precision of the weights, input data, or output data have no effect on the number of interconnections between nodes, because the data is transferred in serial fashion. The precision does affect the size of individual nodes; the higher the precision, the larger the nodes. The size of each node depends on the number of nodes in the system. Figure 2.1 illustrates a system consisting of three input, four hidden, and two output nodes. The figure illustrates the interconnections among the nodes during the feed-forward and feed-back phases of the training stage. The input nodes do no processing and reside in the global memory, as do the target and real outputs. For the remainder of this chapter let subscript i range over all input nodes (unless otherwise noted), j range over all hidden nodes, and k range over all output nodes. The input nodes, si, are located in what we call the global memory, or Gmem for short. Each input node occupies exactly one bit of the global memory, for example d0 through d2 in Figure 2.1, so the more input nodes there are, the more input bits of global memory that are required. The number of input nodes dictates the width of the input data, which is called the input word. The precision of an input node is the depth of the input node. The higher the node's precision, the deeper the input word.



Figure 2.1: Interconnections required during the feed-forward (solid lines) and feed-back phases of a system with three input, four hidden, and two output nodes. Square boxes denote particular data bits of the global memory, while circles denote the input, hidden, and output nodes of the system.

A target output (tk) is required for each output node to compare against its produced output to generate a correction or error value. The target output is also stored in the global memory (for example bits d3 and d4 as shown in Figure 2.1), and consists of one bit for each output node; the precision of the target output dictates the depth of bits d3 and d4 of the global memory. Hidden and output nodes (hj and ok) are implemented by XC4010 FPGAs, and both hidden and output nodes require a private memory (Pmem) to store the values of an activation function required to produce the proper outputs. There is one private memory associated with each neuron. We call both the output node and its output ok. The meaning of ok will be clear from the context. The FPGAs can realize either a hidden, an output, or no neuron. We call those FPGAs that hold a neuron a functional unit, F. We call the set of one FPGA plus its private memory a processing element, or PE for short. Finally, a separate XC3190 FPGA interfaces ACME to the SPARC computer and provides the system the required signals for proper functioning. We call this FPGA controller X0. There are two types of interconnections among functional units, and between the functional units and the global memories, that arise in implementing a system in ACME: one-to-one and broadcasts. In general, a connection is one-to-one when there exists one source and one destination. One-to-one connections are those that are required among output nodes

and hidden nodes (weight updates, labeled Δk,j in Figure 2.1), and among output nodes and the global memory (target output and real output, labeled tk and ok in Figure 2.1). Broadcasts have a source and many destinations. In ACME the hidden nodes broadcast their output signals (Vj) to every output node, and the global memory broadcasts the input data (from each si input node) to every hidden node. Hidden and output nodes are designed to fit into one Xilinx XC4010 FPGA, and signals that enter and exit each FPGA have to do so via an Input-Output Block (IOB for short). We call these signals external signals. Particular to the Xilinx architecture and placement software (PPR) (see Chapter 1 for an explanation of PPR), external signals can be:

a) pre-assigned to particular IOBs (we call these forced signals),
b) pre-assigned to one of a group of IOBs (we call these group-forced signals), or
c) left unconstrained (unconstrained signals).

From the perspective of a hidden or an output node, external signals that connect to the X0 controller and to the private memories are forced, while the rest of the external signals are group-forced. During the placement and routing phases of PPR it is beneficial to let PPR allocate external signals in the locations PPR deems most reasonable; that is, it would be ideal to let external signals be unconstrained. Several factors prevent us from achieving such a goal. In the case of signals that connect a functional unit to the X0 controller or a functional unit to its private memory, the reason is simple: there exists only one physical trace joining a particular pin in each FPGA to a pin in X0, and a particular pin in each FPGA to a pin in its private memory. The rest of the external signals in any functional unit are to be routed via an interconnection network, and will be group-forced for the following three reasons:

1) we would like to run the placement and routing program PPR as few times as possible,
2) there exists no incremental version of PPR at the moment of writing this thesis (see Chapter 1 for an explanation of "incremental"),

3) there exists only a limited number of resources (IOBs) to choose from for connecting external nets.

Because there exist several different potential paths through the interconnection network, these signals can be group-forced (more than one pin in a functional unit may be able to provide the required connection). These external signals cannot be left unconstrained because there are some pins in each functional unit (about half the pins) that have no connection to any other device on the board, and some of the usable pins in a functional unit connect to the wrong part of the interconnection network. Usable pins are those that connect an FPGA to the interconnection network. Because PPR does not have the incremental place and route feature (at the time of the writing of this thesis), the assignment of a signal to an IOB needs to be done correctly the first time through. The latter requirement, in conjunction with the fact that we use a tailored interconnection network and that we wish to run PPR as few times as possible, prevents us from leaving external signals unconstrained. These reasons together with point (3) will become more apparent during the development of this chapter.

Figure 2.2 shows a block diagram of the ACME hardware and the inter-processor connection types. Forced nets are shown in solid lines. Group-forced nets are shown in dotted lines to and from the global memory and in dashed lines among functional units.


Figure 2.2: High-level view of the ACME architecture showing a system with n neurons. Forced external signals (to X0 and to the private memories) are shown in solid lines, group-forced external signals among functional units and the global memory in dotted lines, and group-forced external signals among functional units in dashed lines.
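To make the two connection types concrete, the following sketch enumerates the source-destination pairs a system requires, following the rules given above for broadcasts and one-to-one connections. The helper and its names are hypothetical and are not part of ACME's supporting software.

    def connection_summary(n_in, n_hid, n_out):
        # Enumerate the logical source-destination pairs of a system (cf. Figure 2.1).
        conns = []
        # Broadcasts: each input bit s_i reaches every hidden node,
        # and each hidden output V_j reaches every output node.
        conns += [("broadcast", f"s{i}", f"h{j}") for i in range(n_in) for j in range(n_hid)]
        conns += [("broadcast", f"V{j}", f"o{k}") for j in range(n_hid) for k in range(n_out)]
        # One-to-one: weight updates Delta_{k,j} from each output node back to each hidden
        # node, plus target (t_k) and real (o_k) outputs exchanged with the global memory.
        conns += [("one-to-one", f"Delta{k},{j}", f"h{j}") for k in range(n_out) for j in range(n_hid)]
        conns += [("one-to-one", f"Gmem t{k}", f"o{k}") for k in range(n_out)]
        conns += [("one-to-one", f"o{k}", "Gmem") for k in range(n_out)]
        return conns

    # The system of Figure 2.1: three input, four hidden, and two output nodes.
    print(len(connection_summary(3, 4, 2)))   # 12 + 8 + 8 + 2 + 2 = 32 source-destination pairs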


2.1.2 Motivating a Flexible Reconfigurable Interconnection Network

The number of interconnections among different functional units and the function performed by the individual functional units differ from system to system. The dual personality of the functional units (output node or hidden node) affects the topology of the interconnections between the nodes. If we were to fix the interconnections between functional units to avoid the use of a routing network, the fact that the functional units can change personality would make it very hard to re-use the fixed interconnections from system to system. Because the purpose of ACME is to accommodate different systems (at different times) using the same hardware, we need to implement a flexible, reconfigurable interconnection scheme. In particular, and for reasons of performance, our interconnect scheme has to provide the one-to-one (bipartite) and broadcast connections required by each system introducing as little delay as possible. In summary, the need to implement a flexible, reconfigurable interconnection network to link the different nodes in our system stems from two facts:

1) the need to provide different types of communication links according to the size of the system, and
2) the wish to introduce the minimum constraints on the assignment of external signals to input-output blocks during the placement and routing of the FPGAs.

In the following section we discuss architectures that are built or have recently been implemented using Field Programmable Gate Arrays (FPGAs) and Field Programmable Interconnection Chips (FPICs).

2.2 Interconnection Architectures

There exist several implementations of emulators and prototyping systems built with FPGAs and FPICs as their basic units. Some of the emulators are specific to their applications, while others are intended to be general purpose. Several FPGAs have to be used because

a single FPGA is limited in logic capacity. The number of FPGAs on different boards ranges from a few ([5, 10]) to tens ([11, 12, 13, 14]). The ideal situation, of course, is a single-chip implementation of the entire emulator. This is not only prohibitive in cost, but is simply not attainable with the current technology and pin-count. Efforts to produce such a chip inexpensively have fueled the current research on "multi-chip modules" [15]. Since one single chip to hold all the logic is infeasible, the boards being built at this time try to implement the function of one big chip with many smaller ones. The fact that logic functions are spread across different chips requires those chips to be interconnected in some manner, and raises the interconnection issues we study in this chapter. Different interconnection schemes have been proposed and studied, some with more theoretical foundation than others. The interconnection schemes most applicable to our ACME emulator are mesh structures and crossbar switches. Other variations have been implemented, and will be highlighted. Table 2.1 shows a list of some of the existing emulator and prototyping architectures.

Board name         Chips used                 Interconnect among chips
Splash I           XC3090                     linear array
Splash II          XC4010, TI SN74ACT8841     linear array + crossbar
Realizer           XC3090, XC2018             hierarchy of partial crossbars
PeRLe0, PeRLe1     XC3020/XC3090              4-way mesh + torus
Virtual Computer   XC4010, IQ160              4-way mesh
Virtual Wires      XC4005                     4-way mesh
Aptix AXB-AP4      XC4xxx, AX1024             4 full-crossbars interconnected
Anyboard           XC3xxx                     linear array + bus
BORG               XC3xxx                     Clos
ACME               XC4010, XC3195             Clos-like
X-12               XC3xxx                     bus

Table 2.1: Partial list of emulators and prototyping boards describing interconnection architecture and chips supported.

Some of the boards listed in the table use what we call pure interconnect structures; for example, PeRLe0 [11], PeRLe1 [11], and Virtual Wires [16] all use a mesh architecture to implement the interconnection among FPGAs, while the X-12 [17] of National Technologies Inc. uses a bus. The Anyboard [10] and Splash 1 [13] use a linear array. BORG [5] and ACME consist of user FPGAs connected together via a Clos-like network, also implemented with FPGAs. Realizer [18] consists of several boards; each board consists of "user" FPGAs

Realizer [18] consists of several boards; each board consists of "user" FPGAs interconnected by "partial crossbars." Realizer is able to expand by allowing several boards to be connected with more "partial crossbars," effectively attaining an interconnection network of more stages (or, as it is called in [18], a "hierarchy of partial crossbars"). Realizer also uses FPGAs to implement the interconnection network. Splash II [14] provides a structure that is somewhat more flexible than the linear array (or neighbor-to-neighbor) interconnection structure by introducing a partial crossbar implemented with Texas Instruments crossbar switches. The Aptix AXB-AP4 [19] board allows sets of up to four FPGA chips to be interconnected via a single FPIC. There are four groups of these crossbars connected to a middle FPIC. Virtual Computer [12] consists of up to four "virtual pipelines" that can be connected together. The overall structure of the Virtual Computer is a mesh, where user FPGAs and FPICs are used to attain the routing.

Figure 2.3 shows the basic structure of the different connection schemes presented in Table 2.1. The dots in Figure 2.3 indicate the direction of expansion of the system. In particular, it can be seen that a bus system is the architecture that allows the easiest expansion. Parameters m and r in Figure 2.3(d) stand for the number of "routing" and "user" units present in the system. Figure 2.3(e) shows one of the most difficult systems to expand in terms of the number of connections required.


Figure 2.3: Some representative interconnection schemes: (a) shared bus, (b) linear array, (c) 4-way mesh (or simply mesh), (d) Clos-like, (e) crossbar.

Following is a brief description of the advantages and disadvantages posed by using each one of the different schemes we depict in Figure 2.3 to implement an interconnection network.

2.2.1 A Bus Can be Simple to Expand

A bus-oriented architecture (Figure 2.3(a)) facilitates linear growth of a system and may be the easiest to expand. After all, we only need to extend the physical lines that form the bus and plug the new components onto the bus to attain a larger system. At first sight this seems a very good solution, but there are many problems, some of which are stated below:

a) electrical constraints are created because of the length of the bus and the number of elements placed on the bus,
b) signals need to time-share the bus, and
c) all signals coming in/out of an FPGA need to be forced to particular input-output blocks.

Point (a) above is perhaps one of the most problematic. As the bus is expanded in length, the distance between the endpoints may get very large (for example, if ACME were implemented as a bus structure, the bus might extend twice the length of the board minus a few inches, which is well above 30 cm). Such length poses serious problems in terms of signal degradation and introduces serious reflection problems. A long wire needs to be terminated in some manner, adding complexity to the design of a board [20]. The load on each line may be large too, depending on the number and type of components that are placed on the bus. The major problem is that driving components need enough current capability to drive the bus (and, for reasons of performance, we want the driving to be fast!). But the elements that are placed on the bus add both capacitances and leakage currents, and this is the case even with elements that are tristated. Both capacitances and leakages impair the ability of a driver to be fast and to maintain a valid logic level. A key point to note is that there should be only one driver on a specific wire of a bus at any time.

This means that all the elements on the bus must be synchronized carefully to prevent contention (point (b) above). Synchronizing the elements may not be an easy task in a system with many elements, and at the least it may reduce the speed at which results can be generated. At times some contention may be inevitable. In those cases the bus will generate much unwanted and possibly dangerous noise that could affect the system adversely. To solve the driving and reflection problems we can include buffers to re-power the signals. The major problem is where to put the buffers. This is not an easy question to answer. There may be many different places to put the buffers. One choice might be to simply partition the bus at certain places with buffers. Another approach might be to create a tree of buffers. Buffers need to be bi-directional if the board is going to be reconfigured to allow for new designs. Who will drive the direction pin of the buffers? To which input-output block will signal x from FPGAi be routed such that signal y from FPGAi+1 can also use the same buffer? These questions pose the problem of forcing external signals to particular input-output blocks in all FPGAs, which brings up problem (c) above. Point (c) is a problem if the FPGAs are to be re-used in different applications. We would like to let the external signals be unconstrained (or at most be group-forced) so that the chances of generating a routed design are enhanced, but adding buffers necessarily creates major constraints. A bus, though, is a good vehicle for broadcasting signals. In fact, in ACME we use a small 6-bit bus to broadcast signals from X0 that are required by each one of the PEs to function correctly and as a unit. This means that, in ACME, those external signals that are on the bus are forced to particular input-output blocks of an FPGA.

2.2.2 Mesh Structures Provide an Elegant Expansion

Mesh-based architectures (Figure 2.3(c)) are perhaps, after the bus architecture, the easiest to expand. Use of a mesh for interconnection is very common, as can be seen in Table 2.1. The reason may lie in the mesh's simplicity and relative flexibility. Many designs can be mapped easily to mesh architectures, especially those that are systolic in nature.

To expand the system one need simply add chips at the periphery of the mesh, assuming that there are still free connections in the FPGAs at the periphery (the number of free connections might decrease if there is some element, like a memory or a controller, attached to the periphery). No major electrical problems (like those of a bus) are encountered in the designs, because wires between FPGAs only travel among neighbors and should be very short assuming a good layout of the FPGAs. The complications (as relates to wire lengths) arise if the mesh is enhanced to become a torus, because then the FPGAs at the north and south (or east and west) edges become neighbors, and the distance between them might be large. A mesh architecture has the obvious advantage of introducing no more delay than might be introduced by a bus when an external signal has its source/destination pair in neighboring FPGAs. The disadvantages of this architecture appear when source/destination pairs are not in neighboring FPGAs. When the latter is the case, and considering a 4-way mesh, a net might have to feed through up to n FPGAs (where n is the number of FPGAs present on the board) to get to its destination. The delays of a mesh architecture may thus become highly unpredictable. A number of solutions have been proposed to alleviate this problem, either by introducing connections to more neighbors (modifying a 4-way mesh into an 8-way mesh, for example) or by providing (like PeRLe0 and PeRLe1 [11]) wrap-around connections to those FPGAs that populate the periphery of the board, which results in a torus. For some designs, FPGAs in the middle of the mesh might need to have access to a global memory. This poses a serious problem, since the FPGAs in the middle of the mesh have no connections to the periphery other than through other FPGAs. The people at DEC [11] solved this problem by introducing a bus to broadcast the memory data, bypassing the underlying mesh architecture. But the main problem posed by the mesh architecture is that both logic and external signal routing may have to share the same chip (the user FPGA). The fact that some external signals might need to use intermediate "logic holding" chips to route from source to destination means that those intermediate chips lose part of their internal routing resources to signals that do not belong to them [18]. FPGAs, to the best of our knowledge, are not designed to support external signals and local logic.

While they can certainly accommodate both in many cases, the ratio of internal logic blocks or cells to routing resources does not take the external signals into account, which complicates the routing of the user's logic and could even impair the routing of a design. The interconnection structures we have just discussed (bus and mesh) can be built without any extra hardware dedicated to routing (other than perhaps re-powering buffers in the case of a bus and, of course, the wires to connect the pins of the user FPGAs). Only "user" FPGAs are present on these boards, and signals (in the case of a mesh structure) may need to be routed via intermediate "user" FPGAs to get from source to destination. The next structure we present, the crossbar, requires chips other than the user FPGAs; we call those extra chips routing chips. Crossbars may be implemented with commercially available crossbar switches, field-programmable interconnection devices, or FPGAs. A brief explanation about signal transfers is timely before discussing crossbars and larger routing networks built from them, such as the Clos network.

2.2.3 Direction Sensing: Routing Bi-directional Nets

An important factor to consider in analyzing a routing strategy is that external signals can be uni-directional or bi-directional. If we use a crossbar for interconnecting user FPGAs, the chip used to implement the crossbar (commercial crossbar, field-programmable interconnection device, or FPGA) has to be able to accommodate uni-directional and bi-directional signals. FPGAs allow their pins to be configured as inputs, outputs, or bi-directional, but in the case of a signal requiring bi-directionality, a second direction signal is required to change the direction of the input-output blocks "on the fly." Field Programmable Interconnection Chips (FPICs) have an advantage over FPGAs due to their direction-sensing capabilities. No extra control signals are required to change the direction of a pin.

In mesh-type interconnect structures, bi-directional signals are routed together with a direction signal that allows for the "on the fly" direction change. Other techniques exist for avoiding the use of a second signal, but their implementation is cumbersome [21]. The situation is further complicated in the case of a bi-directional bus that has a large fanout (by definition). Assuming that all the signals are routed via the same intermediate "user" FPGA, the bus signals need just one direction control signal, but if the bus gets fragmented and routed via different chips, then one "direction" signal needs to be sent to each one of the different routing chips. Figure 2.4 shows the requirements for bi-directional, digital signal communication according to two types of interconnect devices: FPGAs and field-programmable interconnection chips (FPICs).


Figure 2.4: Bi-directional signals need a second "direction" signal if FPGAs are used as the interconnect medium. (a) FPGA used as interconnect device, (b) FPIC used as interconnect device.

2.2.4 A Crossbar Provides Relatively Predictable Delay, But is Difficult to Expand

The idea with crossbars is to connect all processing elements ("user" FPGAs) to each other via one single, specialized chip: a routing chip. So, for example, in Figure 2.3(e) the grid pattern would be contained inside one single chip. The advantages of using a crossbar are:

a) it adds a predictable delay to the external signals that have source/destination pairs in different FPGAs connected to the crossbar (predictable delay is only true for real crossbars, about 30 ns for the ICUBE-160; for the Aptix FPIC and for FPGAs one can only give a worst-case estimate, and we measured a worst case of about 25 ns for a Xilinx XC3195 FPGA),
b) it is able to support both broadcasts and one-to-one connections, and
c) it allows external signals to be unconstrained during the placement and routing of the logic into the "user" FPGAs.

Architectures that employ a crossbar to fulfill their interconnection requirements need just one level of interconnection devices between the source and destination pairs of external signals. A full crossbar is one that can be built with one chip, and it requires that all external signals be routed via that one, powerful chip. We call this a one-layer interconnect architecture, since there is just one "hop" required for an external signal to get from its source to its destination. The Aptix AXB-AP4 uses the concept of a full crossbar by letting four FPGAs be interconnected with each other via an Aptix 1024-pin FPIC. On this board, all input/output pins that belong to the four FPGAs are connected to the input-output pins of the FPIC that implements the crossbar. There are 4 groups of four FPGAs plus an FPIC. The FPICs are connected to each other with the input/output pins that are left free, as can be seen in Figure 2.5. An architecture based on crossbars has two main problems, namely:

a) the architecture is hard to expand, and
b) external signals are delayed through the routing chip.

Expandability of crossbars has been studied in the context of telephone switching systems. The problem then was the number of "crosspoints" needed to build a crossbar. The elements connected to the crossbar can be, for example, telephone units, and the crosspoints are connections that can be opened and closed accordingly to create a path from the source (the calling party) to the destination (the receiving party).


Figure 2.5: Block diagram of the Aptix AXB-AP4. The FPGAs can be Xilinx XC4000 series, and the FPICs are AX1024. Inside the dotted lines is group 1 of four.

A measure of complexity is the number of crosspoints, since each crosspoint requires some hardware to be implemented. If we are to expand a system, and taking the number of crosspoints into account, the system grows as the number of callers times the number of receiving parties, which ends up being the number of crosspoints. In our case, because we consider our routing FPGA or FPIC to be able to perform the function of a full crossbar, we are concerned with the number of input/output pins present on the routing chip, rather than with the number of "crosspoints." Our board can scale only according to the number of input/output pins present on the routing chip. Let pr denote the number of pins available on the routing chip, pi the number of pins connected to the routing chip that belong to FPGA i, and n the number of "user" FPGAs. The system can scale as long as p1 + p2 + ... + pn ≤ pr. The number of input/output pins present on the routing chip directly dictates the number of "user" FPGAs the system may accommodate. To obtain the largest possible system, the size of the routing chip has to be increased (in terms of the number of input/output pins). The latter poses problems in both expense and convenience. The larger the routing chip, the more expensive it is, and the more input/output pins it has, the more traces (or, in the case of ACME, wire-wrap) that have to be routed to the chip. While we desire chips with more input/output pins, the undesirable side-effect is an increased density of traces in the proximity of the routing chip.
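As an aside not found in the thesis, the scaling rule just stated can be checked mechanically; in the sketch below the pin counts are hypothetical placeholders (a 1024-pin routing chip, 160 pins per user FPGA) chosen only to illustrate the inequality.

# A minimal sketch (not from the thesis) of the scaling condition above: a
# routing chip with pr pins can serve the user FPGAs only while the sum of
# the pins they tie to it stays within pr.
def can_scale(user_fpga_pins, pr):
    """user_fpga_pins: list of p_i, the pins FPGA i connects to the routing
    chip; pr: input/output pins available on the routing chip."""
    return sum(user_fpga_pins) <= pr

if __name__ == "__main__":
    # Hypothetical numbers: user FPGAs tying 160 pins each to a 1024-pin chip.
    print(can_scale([160] * 4, 1024))   # True: 640 pins fit
    print(can_scale([160] * 7, 1024))   # False: 1120 pins exceed the routing chip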

External signals that are routed via a crossbar suffer a delay. The delay is particular to the technology and the interconnect device in use. A solution to the delay problem is to insert latches at the source of the external signals to be routed through the crossbar, as suggested in [22]. Applying this method may not always be straightforward, since the original design has to be altered to allow for the latch insertion, and there might be timing problems to consider, especially if some of the external nets belong to combinational circuits and others to sequential circuits. The cost and expandability issues can be attacked by replacing the lone routing chip with several, possibly smaller ones (in terms of number of input/output pins), to give the same number of input/output pins, perhaps more, than that of the original routing chip. The problem with this strategy is that the interconnect structure no longer has the properties of a "full" crossbar. For example, broadcasting is no longer guaranteed to work if it is done only inside the crossbars. Another idea to enhance expandability is to add more layers of interconnection devices, effectively increasing the number of hops an external signal has to go through to get from its source to the destination. This is the topic of Section 2.3. Following is a brief description of the different ways that can be used to transfer information across a routing medium.

2.2.5 Passing the Baton

The information from chip to chip can be transferred in at least three different ways, regardless of the interconnection architecture:

a) frequency coding,
b) time multiplexing (using more than one signal per pin), and
c) simple, direct connections (using one signal per pin).

Frequency coding can be accomplished in digital systems by using the method called "pulse-code" modulation. In this method, a high-frequency signal is represented by many changes in the logic level of the signal, while a low frequency might be represented by fewer changes in the logic level (always referring to time).

But this technique does not really help us for ACME, since the neurons are designed to process data in serial fashion. Time multiplexing is used as a means to reduce the number of pins required to implement a system and is used by "Virtual Wires" of MIT [16]. A specialized program has to insert extraneous logic at the edge of the "user-defined logic" to enable the multiplexing and de-multiplexing of signals (a chain of shift registers). The disadvantages of this implementation are that debugging with an oscilloscope is rendered almost impossible, and that the maximum frequency of operation is slower than if direct connections were used, since external nets are time-multiplexed. Direct connection paths allow for easy monitoring of nets and introduce the fewest modifications to the original design, with the drawback that many pins are required to implement a design. Direct connections allow a system to perform at its maximum frequency of operation with no additional delay other than that of the interconnect medium. We decided to implement what we considered to be the easiest way to transfer the data by using direct connections.
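To illustrate the time-multiplexing option (b), here is a small software analogy (not from the thesis, and only an analogy of the shift-register chains used by Virtual Wires): k logical signals are shifted out over a single pin, one bit per pin-clock cycle, and re-assembled on the receiving side, so the pin count drops by a factor of k at the cost of a k-times slower signal rate.

# A minimal sketch (not from the thesis): time multiplexing k logical signals
# over one physical pin, then demultiplexing them on the other side.
def serialize(frames):
    """frames: sequence of k-bit tuples, one tuple sampled per frame."""
    for frame in frames:
        for bit in frame:          # one bit leaves the pin per clock cycle
            yield bit

def deserialize(stream, k):
    """Rebuild the k-bit frames from the single-pin bit stream."""
    frame = []
    for bit in stream:
        frame.append(bit)
        if len(frame) == k:
            yield tuple(frame)
            frame = []

if __name__ == "__main__":
    samples = [(1, 0, 1), (0, 0, 1), (1, 1, 0)]   # three signals over three frames
    assert list(deserialize(serialize(samples), 3)) == samples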

2.3 Theory of Clos Interconnection Networks

Clos and Benes [23], [1] networks are a result of research conducted in the field of telephone switching. In telephone switching there is the notion of a connecting system. A connecting system is defined in [1] to consist of a set of terminals (inputs and outputs), a connecting network to provide connections among the terminals, and a control unit to accomplish a physical path for the connection. Internally, the connecting network is composed of switches and interconnections among the switches. Switches may or may not have the ability to broadcast a signal. Figure 2.6(a) shows a high-level view of a connecting system. Figures 2.6(b) and 2.6(c) show a 2-by-2 switch implemented with a crossbar. Figure 2.6(b) shows a switch providing two point-to-point connections, while Figure 2.6(c) shows a switch providing a broadcast.


Figure 2.6: (a) structure of a connecting system as described in [1], (b) 2-by-2 crossbar switch providing two point-to-point connections, (c) 2-by-2 crossbar switch providing a broadcast connection.

Because in telephone systems calls arrive at any time and require new connections to be made, the control unit has to change the state of the connecting network to accommodate the new calls. A valid combination of open and closed switches in a connecting network is called the state of the connecting network [24]. We can call the connecting network in a telephone system dynamic, because as time passes the state of the network changes. As new calls are processed, new connections are made and old connections are removed or re-routed, creating a new network state. In the case of ACME, the connecting network is static, because ACME only requires a network that need not change its topology once a system is defined. ACME behaves as if all calls were known beforehand, so there is no need for a control unit. The control unit is conceptually moved to a software stage. An assignment for a given connecting network describes a set of input/output pairs that are connected at a given time. In ACME, the assignment of inputs to outputs happens only once: during the system configuration stage. An assignment is realizable if there exist disjoint paths in the network connecting all input/output pairs. A network is rearrangeable if every possible assignment is realizable [25]. A connecting network is said to be non-blocking if any request for a connection can be satisfied without the need to re-arrange the network. We will use the terms connecting network and interconnection network interchangeably. The most direct way to implement the connecting network is by the use of a crossbar, but as we have discussed in Section 2.2.4, this is an expensive solution. Charles Clos [23] describes a method for reducing the number of cross-points required to implement an interconnection network. The idea is to replace a one-stage connecting network with several stages, to effectively reduce the number of cross-points.

It is important to note that the original architecture of Clos assumes that all connections are point-to-point, that is, there is only one output terminal assigned to each input terminal. No broadcasts or multi-terminal nets are taken into consideration. Figure 2.7(a) shows a one-stage crossbar and Figure 2.7(b) shows a three-stage symmetrical Clos network that can be used to replace the original, one-stage crossbar of Figure 2.7(a). Both the crossbar of Figure 2.7(a) and the three-stage Clos network of Figure 2.7(b) allow for every possible permutation of input/output pairs to occur. The architecture of Figure 2.7(a) supports broadcasts, while the three-stage symmetrical Clos network does not guarantee that broadcasts can be accommodated. The square boxes represent full crossbars, a concept that we emphasize by the use of the crosses. A symmetrical Clos network has three parameters n, m, and r that completely describe its architecture. In a symmetric Clos network the two outer stages are identical and consist of r (n × m) switches. The middle stage consists of m (r × r) switches. The network thus has N = r · n input and output terminals. The symmetrical Clos network can be characterized collectively by C(n, m, r). Figure 2.7(b) is then a C(3, 3, 4) Clos network [25]. Clos networks can be expanded by replacing a switch or a number of switches by another Clos network. The switches in a three-stage Clos network can be replaced by non-blocking networks to preserve a non-blocking network, or by rearrangeable networks to preserve a rearrangeable network [23, 26]. The original study of Clos networks assumed that no broadcasts take place, that is, all connections are point-to-point and none of the switches are allowed the capability to provide broadcasts. Several studies have been performed to find upper and lower bounds on the number of switches that are required if broadcasts are allowed. Below we present a summary of known results. A three-stage Clos network is rearrangeable under the following conditions:

1) m ≥ n; broadcasts are not guaranteed [1].
2) m ≥ n · r; broadcasts are guaranteed at the expense of more switches [27].

The Slepian-Duguid theorem states that a Clos network C(n, m, r) is rearrangeable if and only if m ≥ n (condition (1) above) [1].


Figure 2.7: (a) shows a single crossbar with 12 inputs, 12 outputs, and 144 cross-points. (b) shows a three-stage symmetrical Clos network with 12 inputs, 12 outputs, and 120 cross-points. The advantage of (b) over (a) is that the crossbars are smaller (perhaps more cost-effective) and the number of cross-points is less.

For our purposes we can assume, as in [25], that m = n; therefore all first and last stage switches are square. Condition (2) above indicates a trade-off between the amount of hardware resources and the ability to accommodate broadcasts. In [27] it is pointed out that only the first and last stage switches need to have fanout (broadcast) capability for condition (2) to hold. It is not necessary for middle-stage switches to have fanout capability. A three-stage Clos network can be non-blocking in two different ways: a) in the wide sense, or b) in the strict sense. For a network to be wide-sense non-blocking means that there exists a control algorithm to prevent blocking. A controller has to assign the intermediate switches to the input/output terminal pairs in a manner that allows more input/output terminal pairs (or multi-terminal nets) to be connected later. Strictly non-blocking means that there is enough hardware to support any connection with no control algorithm; we can pick any available intermediate switch to accomplish a connection at any given time. There is a trade-off between the amount of hardware and the complexity of the control algorithm: the less hardware there is, the more complex the control algorithm [27]. Wide-sense (control-algorithm dependent) non-blocking Clos networks can be constructed as long as:

1) m ≥ ⌊3n/2⌋; broadcasts are not guaranteed [1].
2) m > (n − 1)(log r + 2); broadcasts are guaranteed to be accommodated [27].

The control algorithm for condition (1) above is simple: "do not use a fresh middle switch unless you have to!" [1]. Following this simple rule allows the network to be non-blocking in the wide sense. In condition (2) above, Masson assumes that any switch has the capability of broadcasting (the broadcasting switch can be in any of the first, second, or third stages) [27]. Finally, non-blocking Clos networks in the strict sense can be attained as long as:

1) m ≥ 2n − 1; broadcasts are not guaranteed [23].
2) m ≥ n · (r + 1) − 1; broadcasts are guaranteed to be accommodated [27].

As can be seen from the results above, the networks that require the most hardware (switches and interconnections) arise if the network is to be strictly non-blocking. In the case of ACME it is sufficient to have a rearrangeable network, since the interconnection requirement is static once it is established.
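The following sketch is not part of the thesis; it merely evaluates the cross-point counts and the sufficiency conditions quoted above for the C(3, 3, 4) example of Figure 2.7. The base of the logarithm in the wide-sense broadcast condition is not stated in the text and is assumed to be 2 here.

# A minimal sketch (not from the thesis) of the Clos-network figures of merit
# discussed above: cross-point counts and the quoted sufficiency conditions.
from math import floor, log2

def crosspoints_single(N):
    """Cross-points of a full N-by-N crossbar."""
    return N * N

def crosspoints_clos(n, m, r):
    """Cross-points of a symmetric three-stage Clos network C(n, m, r):
    2r outer switches of size n-by-m plus m middle switches of size r-by-r."""
    return 2 * r * n * m + m * r * r

def conditions(n, m, r):
    """Which of the quoted sufficiency conditions hold for C(n, m, r)."""
    return {
        "rearrangeable (m >= n)": m >= n,
        "rearrangeable with broadcasts (m >= n*r)": m >= n * r,
        "wide-sense non-blocking (m >= floor(3n/2))": m >= floor(3 * n / 2),
        "wide-sense with broadcasts (m > (n-1)(log r + 2))": m > (n - 1) * (log2(r) + 2),
        "strictly non-blocking (m >= 2n - 1)": m >= 2 * n - 1,
        "strict with broadcasts (m >= n(r+1) - 1)": m >= n * (r + 1) - 1,
    }

if __name__ == "__main__":
    n, m, r = 3, 3, 4                      # the C(3, 3, 4) example of Figure 2.7
    print(crosspoints_single(n * r))       # 144 cross-points for a 12-by-12 crossbar
    print(crosspoints_clos(n, m, r))       # 120 cross-points for C(3, 3, 4)
    for name, holds in conditions(n, m, r).items():
        print(name, holds)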

2.3.1 Mapping a Clos (Three-Stage) Network to FPGAs

In this section we show how three-stage Clos networks are used in ACME. As in [25], the first and last stages of the Clos network are implemented in part of the functional units, and the middle-stage switches are implemented with "routing" FPGAs. Both the functional units and routing FPGAs can support several switches each, and we call such switches conceptual. Later we shall illustrate how the routing FPGAs themselves can be viewed as a large switch that can accommodate several of the conceptual switches. We shall use the term "switch" when it is not necessary to distinguish between "conceptual" and "large." We take advantage of the reprogrammability of the FPGAs to rearrange the external signals among different Input/Output Blocks (IOBs), performing the necessary signal-to-IOB assignment according to which middle-stage "large" switch is used by the external signal.

The rearrangeability of the network is accomplished by reprogramming the functional units at the outer stages of the interconnection network and the routing FPGAs at the middle stage of the interconnection network. The physical connections among the functional units and routing FPGAs (just as in the definition of a Clos network) do not change. Because, as we have already mentioned, ACME requires no more than a static interconnection network, the programming of the switches of the network is only done once per system. To allow for maximum flexibility when rearranging the network we decided to mimic a Clos network as it relates to assigning physical wires to pins on the "conceptual" switches. Following such reasoning we decided to use contiguous IOBs in each user FPGA to form each of the "conceptual" switches. We considered two different ways to physically connect the functional units to the routing FPGAs:

a) in an orderly (or "fixed") pattern: each functional unit in the outer stage has the same physical input/output pin connected to the same routing FPGA, and
b) in a "shifting" pattern: pin 1 of functional unit 1 connects to routing FPGA 1, pin 1 of functional unit 2 connects to routing FPGA 2, etc.

Figure 2.8 illustrates patterns (a) and (b), and the sketch below contrasts the two assignments. Notice that m is the number of routing FPGAs in the middle stage, and it dictates the size of the "conceptual" switch labeled S1 in the outer-stage FPGAs. The more routing chips there are, the bigger S1 becomes and the more difficult it becomes to reassign nets to the pads of the "conceptual" switch. Both Figure 2.8(a) and Figure 2.8(b) show "conceptual" switches in the routing FPGAs. There would be more conceptual switches in the routing FPGAs if we had more conceptual switches in each functional unit. Because of the size of the prototyping board, ACME is limited to fourteen neurons (each neuron is an XC4010 FPGA). We decided to use five routing FPGAs to implement the middle stage of the Clos interconnect network. The decision was taken after carefully studying the interconnection requirements of the possible systems that can potentially be exercised with fourteen functional units (see Sections 2.5.6 and 2.6).
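As a sketch of the two constructions just listed (not from the thesis; the modular indexing is only an assumed way of expressing the patterns), the functions below map pin p of functional unit u to one of m routing FPGAs under the fixed and the shifting patterns.

# A minimal sketch (not from the thesis) of the two pin-assignment patterns.
def fixed_pattern(u, p, m):
    """Fixed pattern: the same pin index of every functional unit is wired to
    the same routing FPGA, independently of the unit index u."""
    return p % m

def shifting_pattern(u, p, m):
    """Shifting pattern: the assignment rotates with the functional-unit index,
    so pin 1 of unit 1 goes to routing FPGA 1, pin 1 of unit 2 to FPGA 2, etc."""
    return (p + u) % m

if __name__ == "__main__":
    m = 5   # ACME uses five routing FPGAs
    for u in range(3):   # first three functional units
        print([fixed_pattern(u, p, m) for p in range(m)],
              [shifting_pattern(u, p, m) for p in range(m)])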


Figure 2.8: Two possible mappings of a Clos network. (a) same user FPGA pins connected to the same middle routing chip. (b) shifting pattern. The boxes inside the functional units and routing FPGAs denote conceptual switches that we treat as small crossbars.

Each routing FPGA has eleven physical connections to each functional unit, and those connections follow an "area" assignment rather than a permutation (as is done to create the conceptual switches of the functional units). As an illustration, all the IOBs in routing FPGA 1 that are connected to functional unit 1 share contiguous IOB locations in routing FPGA 1, to decrease the wire lengths. The functional units are placed on the prototyping board around the routing FPGAs in a ring structure. This placement reduces the difference in wire lengths. Figure 2.9(a) shows a conceptual view of ACME's functional unit and routing FPGA placement. Figure 2.9(b) shows the assignment of groups of IOBs in the routing FPGAs to the functional units.


Figure 2.9: (a) Conceptual view of ACME's physical functional units (F) and routing (R) FPGA placement. (b) Areas assigned to R1 for its connections to functional units. There are eleven wires between functional units and routing FPGAs.

ACME is a reconfigurable processor: its interconnection architecture does not quite fit any of the standard network types we have described until ACME is configured. ACME's interconnection network resembles a Clos network. A deeper analysis of the architecture reveals that this is not the case, mainly because before ACME is configured there is no clear notion of input and output stages (see Figure 2.10).


Figure 2.10: A simple example showing ACME's interconnect architecture.

It is only after assigning a function to the functional units that we can categorize ACME's routing network into a known class. ACME is a special-purpose board that accommodates back-propagation connectionist classifiers. Only hidden and output nodes are implemented with FPGAs. We can divide hidden and output nodes into two subsets that can be respectively mapped to the "input" and "output" stages of a Clos-like interconnect network. The hidden nodes do not need to communicate amongst each other (the same is true for the output nodes). We may thus map hidden nodes to the input stage and output nodes to the output stage of the Clos-like interconnect network. It is only after creating these two subsets that we can analyze the network and categorize it. Only after such observations are made can ACME's interconnect architecture be called a Clos-like connecting network; furthermore, the network is asymmetric unless the numbers of hidden and output nodes are equal. We call the network "Clos-like" because, although we can analyze the resulting "conceptual" switches in ACME as if they formed a Clos network (see Figure 2.11), subsets of such conceptual switches may be collapsed into one larger switch (because of the reprogrammability of the functional units and of the flexibility of the routing FPGAs).

Figure 2.11 shows how Figure 2.10 might be re-interpreted so that we can treat the interconnection architecture as a Clos connecting network. Note that if F1, F2, and F14 required connections amongst each other, the resulting model would not be a Clos architecture, because the middle switches would have three "input-output" sides instead of two, as can be seen in Figure 2.11.

routing FPGAs

functional units

R1

F1

a1

r1,1

logic

a2

r1,2

c1

F14 logic

F2

b1 logic

R1

c2

r1,1

b2 r1,2

Input stage

middle stage

output stage

Figure 2.11: ACME shows an interconnect architecture of the Clos type once a system is defined. An example of a system with two hidden nodes and one output node. The hidden nodes can be considered to form part of the input stage, while the output nodes form part of the output stage.

2.3.2 Implementable Interconnection Architectures

Several papers suggest solutions to the general problem of interconnection networks. We discarded early on those papers that reflect an architecture different from that of the Clos networks. From our perspective, many of the solutions presented in these papers are not practical, requiring either too many resources (especially in the first and last stages) [28], too many interstage interconnections [29], or too many stages [29, 30]. Of all the network topologies examined, only the basic Clos interconnection network is rendered implementable. While our implementation of the basic three-stage Clos interconnection network requires a number of interstage wires on the order of n·m (where n is the number of inputs and m is the number of routing FPGAs), the network is regular and provides all the connections that ACME requires.

The papers that are perhaps closest to ACME's implementation are those by Douglass [31] and by Yang and Masson [27]. In [27], Yang and Masson analyze non-blocking broadcast switching networks similar to the Clos network. In that paper it is assumed that all switches can broadcast. As in [27], we also allow the conceptual switches in the routing FPGAs and at the outer stages of the network to broadcast. The conceptual switches at the output stage of ACME's interconnection network do not need to broadcast their signals, though. If we transform our "conceptual" switches into the "large" switches, then Douglass's results can be applied to ACME. In [31], Douglass shows that the new network of large switches (built in terms of smaller switches) is also rearrangeable (since its underlying structure is that of a Clos network), yet the new network is more powerful than a Clos network, being routable in a shorter time. A major deviation of ACME's interconnection architecture from the standard Clos (or more powerful derivative) interconnection network is our insertion of elements (the global memory) directly into the middle stage of the interconnection network. This effectively breaks the formal definition of a three-stage interconnection network, making the results of [27] or [31] difficult to apply. We are forced to do a very detailed analysis of the interconnection network according to the specifics of the systems that ACME can emulate. The analysis of ACME's interconnection network is presented in Sections 2.4 and 2.5.

2.3.3 Existing Routing Architectures

Quickturn's Realizer [18] is the only architecture that has an interconnection network close to what ACME requires (a three-stage architecture). One major problem is the mapping of the interconnection network onto the Realizer board. Quickturn uses the fixed-pattern construction to accomplish the connections between user FPGAs and the routing chips. It is because of this approach that we cannot run just one Partition, Placement, and Routing (PPR) per hidden node and one per output node, and because of the problems posed by this requirement (to be explained in the following sections) we are unable to use the Realizer's interconnection scheme. The Aptix AXB-AP4 board [19] might have worked (no detailed analysis was performed with the board).

The problem is that we required space to place global and private memories, and there is not enough space on the AXB board to place the memories. Other boards have mesh-oriented interconnection networks, and we saw no straightforward way in which hidden and output nodes could be easily mapped to produce a workable system.

2.4 Connectivity in ACME

2.4.1 Types of Connections

We established the requirement for a flexible interconnection network to support our back-propagation neural engine (ACME) in Section 2.1. Furthermore, we already discussed that the best interconnection scheme is a three-stage Clos interconnection network. We assign the two outer stages to the functional units and implement the middle stage with routing FPGAs. In this section, we derive a formula to calculate the number of pins required for the middle stage of the interconnection network to accommodate all traffic, regardless of the internal implementation of the middle stage. We restrict the number of wires between the pins of the functional units (outer stages of the interconnection network) and the pins of the middle stage to one. To calculate the number of pins required for the interconnection network we need to show the type and quantity of connections required for the possible implementable systems. We first calculate a general result that can be used for any number of functional units and for an input data word din bits wide. Later we calculate the specific pin-count required to implement a system with 14 functional units and a 24-bit wide input word. Throughout the calculations we assume that each hidden and each output node fits in one functional unit. From now on, we refer to the middle stage of ACME's interconnection network as the routing network for simplicity. It should be clear that the interconnection network encompasses both the functional units and the routing FPGAs; the routing network refers only to the routing FPGAs.

Because each hidden and each output node fits in one functional unit, we refer to Figure 2.12 to see the external signals generated by a hidden node (a) and by an output node (b). We use this figure to analyze the type of connections needed and the number of those connections.


Figure 2.12: Block diagrams of (a) hidden and (b) output nodes showing the synaptic processors responsible for generating dot-product multiplications and back-propagation weight updates (marked with '*'), the dot-product summer, the activation-function and derivative generators, and the control units. External signals si represent the outputs of the input nodes (input nodes not shown). Vj represents the output of hidden node j. δk,j represent the weight updates from output nodes to hidden nodes. tk represents the target output, and ok the real output. There are a maximum of n output nodes, m hidden nodes, and l input nodes. wj,i are hidden weight units that are multiplied by the outputs of the input nodes. wk,i are output weight units that are multiplied by the outputs of the hidden nodes.

Assume the FPGAs have been programmed as either hidden or output nodes. We collectively call these FPGAs "neurons." We ignore how the interconnection network is implemented internally for the moment. We concentrate instead on the number of inputs and outputs into and out of the routing network. We only allow one connection from each pin of the neurons to the routing network. Figure 2.13 shows a conceptual view of a system, where hidden and output nodes are represented by the rectangle labeled functional units. There are two types of interconnections in any implementable system. The dashed lines in Figure 2.13 show the interconnections among the neurons and the global memory; we call these memory connections, or mc for short.

The solid lines in Figure 2.13 show the connections among the neurons; we call these inter-neuron connections, or inc for short.


Figure 2.13: The ACME architecture. Hidden and output nodes are functional units. Dashed lines represent memory connections (mc) and solid lines represent inter-neuron connections (inc).

Referring to Figure 2.12, we let subscript i range over all input nodes, j over all hidden nodes, and k over all output nodes (unless otherwise stated). The memory connections are then composed of the signals si, tk, and ok; let din be the number of si signals and dout the number of ok signals. The inter-neuron connections, from the perspective of a hidden node (see Figure 2.12(a)), are composed of the signals labeled Vj and δk,j. As shown in Figure 2.13, the memory connections (mc) and the inter-neuron connections (inc) are composed of two kinds of connections:

1) one-to-one connections (bipartite connections), and
2) broadcast connections (multi-terminal signals).

Referring to Figure 2.12, one-to-one connections appear in both the inter-neuron and the memory connections. The inter-neuron connections labeled δk,j are one-to-one, as are the memory connections labeled tk and ok. Broadcast (or one-to-many) connections also appear in both the inter-neuron and the memory connections. The hidden nodes broadcast the signal Vj to all the output nodes, and the global memory broadcasts the input data si to all hidden nodes. It is important to explain at this point that the global memory is divided into input memory and output memory. These memories are controlled by two address/control busses. Each bus originates in a different functional unit. The address/control busses are not routed through the routing network; instead, another FPGA (R0) is used to route the address/control busses.

The memory connections mentioned in this section always refer to data connections (input data, target output, and real output); the address and control busses are not included in the discussion. Figures 2.14 and 2.15 show the different connections required for two different systems. Figure 2.14 shows a system with thirteen hidden nodes and one output node. Figure 2.14(a) shows the connections to and from the global memory (mc), while Figure 2.14(b) shows the connections among hidden and output nodes (inc). Note that the inc connections for this system are all one-to-one.


Figure 2.14: A system composed of 13 hidden nodes, 1 output node, and a 24-bit input word. (a) Memory connections (mc). (b) Inter-neuron connections (inc). The circles in (a) and (b) stand for the hidden and output nodes. The hidden nodes in (b) are unfolded to show the connections to and from the output node more clearly.

Figure 2.15 shows a system with twelve hidden and two output nodes. For ease of reference the connections to and from output node o1 are shown with a dashed line. Broadcast connections are now evident not only in the mc connections, but also in the inc connections, as shown in Figure 2.15(b). We use the interconnection requirements shown in Figures 2.14 and 2.15 as a basis to express the total number of pins consumed at the routing network as a function of the number of hidden nodes h, the number of output nodes o, and the data width of the global memory (encompassing din + dout + tk + ok). The number of memory and inter-neuron connections (mc and inc) does not directly correspond to the number of pins consumed at the routing network. If broadcasts are not allowed to be performed in the routing network, then the number of pins equals twice the total number of connections.


Figure 2.15: A system composed of 12 hidden nodes, 2 output nodes, and a 24-bit input word. Connections to o1 are drawn using a dashed line for clarity. (a) Memory connections (mc), (b) inter-neuron connections (inc). The circles in (a) and (b) stand for the hidden and output nodes. The hidden nodes in (b) are unfolded to show the connections to and from the output node more clearly.

Because the routing network is allowed to perform broadcasts in ACME, the latter observation is no longer valid, and we have to carefully account for the number of pins consumed. We calculate the pin consumption at the routing network. Let pinc be the pins consumed by inter-neuron connections, pmc the pins consumed by memory connections, and tp the total pins consumed at the routing network. Table 2.2 lists the variable names and their descriptions.

Variable   Description
h          hidden nodes
o          output nodes
f          functional units
din        input memory data word
mc         memory connections
inc        inter-neuron connections
pinc       pins consumed by inter-neuron connections
pmc        pins consumed by memory connections
tp         total pins consumed

Table 2.2: Variables and their description.

2.4.2 Pins Required by Inter-neuron Connections (pinc)

We stress that there are two types of external signals created: broadcasts (from hidden nodes to output nodes), and one-to-one connections (from output nodes to hidden nodes). There are two possible ways to deal with the broadcast connections:

1) produce the broadcast at the source, that is, at the hidden node, or
2) produce the broadcast inside the routing network.

In this section we show the pin requirements for Scheme (2). Table 2.3 lists the pin requirements for inter-neuron connections for the routing network to implement a neural-net system. Formula (2.1) is an expression for the number of pins required by inter-neuron connections. We add the entries under the column labeled Pins in Table 2.3 to obtain Formula (2.1).

Direction                    Signal   Pins          Description
hiddens-to-routing network   Vj, θ    h + 1         assuming broadcasts in routing network
routing network-to-outputs   Vj, θ    (h · o) + o   hidden 1 sends threshold to outputs
outputs-to-routing network   δk,j     h · o         errors sent to hiddens
routing network-to-hiddens   δk,j     h · o         hiddens receive errors

Table 2.3: Pins required by inter-neuron connections. θ is the threshold value broadcast from hidden node 1 to all output nodes.

pinc = (3 · h · o) + o + h + 1        (2.1)

2.4.3 Pins Required by Memory Connections (pmc)

Assume that the system has a maximum of din-bit wide input words. There are two types of memory connections: one-to-one (output nodes read one bit and write one bit from/to the global memory), and broadcasts (the hidden nodes all get the same input data from the global memory). The broadcasts can happen in the routing network or at the source (the global memory). We let the routing network generate the broadcast of the din bits. Regarding the global memory, we only need to consider the data connections, since the memory address and control busses are routed by the routing chip R0. Table 2.4 lists the number of pins required by the routing network to implement a system. Formula (2.2) is an expression for the number of pins required by memory connections. We add the entries under the column labeled Pins in Table 2.4 to obtain Formula (2.2).

Direction                    Signal       Pins           Description
memory-to-routing network    si, tk, ok   din + (2 · o)  inputs + target outputs + real outputs
routing network-to-hiddens   si           din · h        each hidden node gets at most din input bits
routing network-to-outputs   tk           o              target read by output node
outputs-to-routing network   ok           o              output written to memory

Table 2.4: Pins required by memory connections.

pmc = (din · h) + (4 · o) + din        (2.2)

2.4.4 Growth of Pins: Pin-Count is Bounded by f²

Assuming that broadcasts occur in the routing network, the total number of pins used by a system (tp) is:

tp = pinc + pmc

tp = (3 · h · o) + (5 · o) + ((din + 1) · h) + din + 1        (2.3)
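Formulae (2.1) through (2.3) can be captured in a few lines of code; the sketch below is not from the thesis, but its outputs agree with Table 2.5 (for example, the 13-hidden/1-output and 12-hidden/2-output systems of Figures 2.14 and 2.15 consume 394 and 407 pins, respectively).

# A minimal sketch (not from the thesis) encoding Formulae (2.1)-(2.3).
# h = hidden nodes, o = output nodes, din = input data width in bits.
def pinc(h, o):
    """Pins consumed by inter-neuron connections, Formula (2.1)."""
    return 3 * h * o + o + h + 1

def pmc(h, o, din):
    """Pins consumed by memory connections, Formula (2.2)."""
    return din * h + 4 * o + din

def tp(h, o, din):
    """Total pins consumed at the routing network, Formula (2.3)."""
    return pinc(h, o) + pmc(h, o, din)

if __name__ == "__main__":
    print(tp(13, 1, 24))   # 394 pins for the system of Figure 2.14
    print(tp(12, 2, 24))   # 407 pins for the system of Figure 2.15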

We can express tp in terms of f (the number of functional units) and din (the width of the input data word). For any system, the number of hidden nodes has to be at least two, and there has to be at least one output node. The number of hidden and output nodes has to be less than or equal to the number of functional units. These three constraints give rise to:

f ≥ h + o        (2.4)

f > o + 1        (2.5)

f ≥ h + 1        (2.6)

We can replace o and h by ⌈f/2⌉ in Formula (2.3) to get the following equation:

tp = (3 · f²)/4 + (5 · f)/2 + ((din + 1) · f)/2 + din + 1        (2.7)

The conditions established by Formulae (2.4), (2.5), and (2.6) are satisfied if we replace o and h by ⌈f/2⌉, since: (2.4) f ≥ f, (2.5) f > f/2 + 1 ≥ 2, and (2.6) f ≥ f/2 + 1. Formula (2.7) is bounded from above by f², which means that the number of pins in the routing network grows quadratically, at the rate of f². Figure 2.16 shows a plot of Equation (2.3) subject to Formulae (2.4), (2.5), and (2.6). The input data word (din) is kept constant at a 24-bit width. We plot Equation (2.3) to show the dependence of the number of pins in the routing network on the number of hidden and output nodes.
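A quick numerical check of Formula (2.7) (a sketch, not part of the thesis) shows the quadratic trend: with h = o = ⌈f/2⌉ the pin count for f = 14 is 382, the h = o = 7 entry of Table 2.5, and the ratio tp/f² approaches 3/4 as f grows.

# A minimal sketch (not from the thesis) evaluating Formula (2.7), i.e.
# Formula (2.3) with h = o = ceil(f/2), to show the quadratic growth in f.
from math import ceil

def tp_balanced(f, din=24):
    half = ceil(f / 2)
    return 3 * half * half + 5 * half + (din + 1) * half + din + 1

if __name__ == "__main__":
    for f in (14, 100, 1000):
        t = tp_balanced(f)
        print(f, t, t / f**2)   # the ratio tends toward 3/4 for large f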


Figure 2.16: Plot of tp assuming a system with 40 functional units. h (number of hidden nodes) and o (number of output nodes) are subject to constraints (2) and (3). The input data width is fixed at 24 bits. Vertical axis: tp.

2.4.5 Pin-count Required for a System of 14 Functional Units

It is important to calculate an upper bound on the total possible number of pins required by a system. In this section we limit the number of functional units to 14 and the data input width to 24 to find the worst possible consumption of pins. Figure 2.17 shows the pins required by the routing network versus the number of output nodes present in different 14-functional-unit, 24-bit input word systems. For ease of reference, Table 2.5 shows the number of pins required per system.

The number of hidden nodes is h = f − o. We can see that the maximum pin usage occurs for a system with 10 hidden nodes and 4 output nodes. The total number of pins required for that system is 415. The routing network therefore has to have at least 415 pins. The number of output units that causes the largest pin count is derived by differentiating Equation (2.3) with respect to o:

0 = d(tp)/do = d(3 · h · o + 5 · o + h · (din + 1) + din + 1)/do

Let h = f − o, f = 14, and din = 24:

0 = d(42 · o − 3 · o² + 5 · o + 25 · 14 − 25 · o + 25)/do
0 = 42 − 6 · o + 5 − 25
o = 22/6 ≈ 3.66, so o = ⌈3.66⌉ = 4

h    o    pinc   pmc   tp
2    12    87    120   207
3    11   114    140   254
4    10   135    160   295
5     9   150    180   330
6     8   159    200   359
7     7   162    220   382
8     6   159    240   399
9     5   150    260   410
10    4   135    280   415
11    3   114    300   414
12    2    87    320   407
13    1    54    340   394

Table 2.5: Pin-counts for ACME's routing network assuming all broadcasts are performed in the routing network.

This study shows that there is no clear-cut way to analyze the worst-case pin consumption, since neither extreme case (the systems composed of 12 output and 2 hidden nodes, or 1 output and 13 hidden nodes) produces the worst-case consumption. More analysis is required as we study the architecture of the routing network in the coming sections.
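The whole of Table 2.5 and the worst-case figure of 415 pins can be reproduced by sweeping o from 1 to 12 with f = 14 and din = 24; the sketch below is not from the thesis and restates the Formula (2.1)-(2.3) helpers so that it runs on its own.

# A minimal sketch (not from the thesis) reproducing Table 2.5 and its
# worst case for f = 14 functional units and din = 24.
def pinc(h, o):            # Formula (2.1)
    return 3 * h * o + o + h + 1

def pmc(h, o, din):        # Formula (2.2)
    return din * h + 4 * o + din

def tp(h, o, din):         # Formula (2.3)
    return pinc(h, o) + pmc(h, o, din)

if __name__ == "__main__":
    f, din = 14, 24
    rows = [(f - o, o, pinc(f - o, o), pmc(f - o, o, din), tp(f - o, o, din))
            for o in range(1, 13)]
    for h, o, pi, pm, t in rows:
        print(h, o, pi, pm, t)
    print("worst case:", max(rows, key=lambda row: row[-1]))  # (10, 4, 135, 280, 415)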

46 450

"tc.data" 3 3 3 3 3 3 3 "ic.data" + 3 2 "mc.data" 3 2 3 2 2 3 2 2 3 2 2 3 2 2 + + + + 2 + + + 2 2 + +

400 350 300 Pin count 250 200 150 100 50

0

+

+ 2

+

4 6 8 Number of output nodes

10

12

Figure 2.17: Pin usage versus number of output nodes. The number of hidden nodes is always h = 14 − o, and the input data width is always 24. tp = rhombus, pinc = crosses, pmc = squares.

2.5 The Middle Stage of The Clos-like Network is Implemented With Routing FPGAs

We determined the total number of pins required for the middle stage of the interconnection network in Section 2.4. There we assumed that the routing network could generate all broadcasts and paid no particular attention to its implementation. Now we study in more detail how to implement the routing network. We have settled on a Clos-like three-stage interconnection network, and one of its disadvantages is that it is rearrangeable only for one-to-one interconnections. We let the middle stage of our interconnection network consist of one stage of switches, the switches being implemented by routing FPGAs. We assume that each routing FPGA has the ability to perform the functions of a full crossbar. Figure 2.18 shows the composition of our routing network. Let ri stand for routing FPGA i.


Figure 2.18: Routing chips (r1, r2, ..., rk) are used to implement the middle stage of our three-stage Clos-like routing network.

In the next section we study how to accommodate the broadcast connections required by ACME. We bound the maximum number of connections possible between a routing chip ri and any functional unit.

2.5.1 Number of Pins Needed by a Routing FPGA

In Section 2.3.1 we discussed the use of FPGAs to accommodate the middle stage of our routing network. We now use the notions of inter-node (inc) and memory (mc) connections (presented in Section 2.4.1), together with our understanding of the implementation of the middle stage, to bound the number of pins required by a generic routing FPGA. Below we reiterate the two different types of connections present in ACME and give a summary of the most important points.

Memory connections (mc): (1) The data input (din) is to be broadcast to each one of the hidden nodes. Since we are not using a bus structure, broadcasts are done via the routing network. The alternative is fanout from the memory data pins. This can produce electrical problems, because the loading on each pin is large. We interleave the memory data bits such that they are evenly distributed among the routing chips. The interleaving effectively reduces the load on a single routing chip; there is no single routing chip that needs to broadcast all the data bits (a small sketch of this interleaving follows the inter-node list below). The three routing Schemes (a), (b), and (c) in Section 2.5.2 all assume this bit-wise interleaving. Notice in Figure 2.12 that each input data bit is multiplied by a weight. This factor is important when considering the different schemes to be analyzed.

Inter-node connections (inc): (1) Each hidden node broadcasts its output (Vj) to all of the output nodes. In the output nodes a different weight (wk,j) multiplies each Vj coming from each one of the hidden nodes, and the same weight multiplies the signal ok to produce δk,j going back to hidden node j, so there is a definite dependency among the hidden-to-output, wk,j, and output-to-hidden signals. This can be seen in Figure 2.12. Hidden node 1 also broadcasts a threshold (θ) to all the output nodes. (2) Output-to-hidden node connections (δk,j) are one-to-one and require no special treatment. (3) Output-to-global-memory connections (ok and tk) are one-to-one and require no special treatment.
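The bit-wise interleaving of the memory data word mentioned under the memory connections can be sketched as follows (not from the thesis; the particular mapping, bit b to routing chip b mod r, is only an assumed way of realizing an even distribution).

# A minimal sketch (not from the thesis) of spreading the din memory data bits
# evenly over the r routing chips so no single chip broadcasts every bit.
def chip_for_bit(b, r):
    """Assumed interleaving: data bit b is handled by routing chip b mod r."""
    return b % r

def bits_per_chip(din, r):
    """How many data bits each routing chip has to broadcast."""
    counts = [0] * r
    for b in range(din):
        counts[chip_for_bit(b, r)] += 1
    return counts

if __name__ == "__main__":
    print(bits_per_chip(24, 5))   # [5, 5, 5, 5, 4] for ACME's 24 bits and 5 chips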

2.5.2 Analysis of Three Different Routing Schemes

We present three strategies to provide the best possible implementation of the middle stage of our interconnection network, judged both by the number of placement and routing (PPR) runs (see Chapter 1 for an explanation of PPR) and by the ability to expand the system to incorporate additional functional units. We would like to run the placement and routing tool PPR as few times as possible, and the least number of PPR runs is two: one for a hidden node and one for an output node. So there are only two "master" nodes, one hidden and one output. We replicate the master nodes into functional units as required by the system. Once a master hidden node has been routed, all of its external signals are assigned to some set of input/output pads. The set of input/output pads for all hidden nodes is the same (all hidden nodes might have a signal coming out of pad P15, for example), and the same applies to the output nodes. Scheme (a) relies on the fixed pattern construction described in Section 2.3.1. Schemes (b) and (c) rely on the shifting pattern construction to accommodate all the required signals. Figure 2.19 shows the two constructions and is a reprint of Figure 2.8. According to the shifting pattern construction, the external signals of a functional unit are connected to different routing chips. Shifting of the pins is in accordance with the principle of operation of a Clos interconnection network, and we need it to alleviate the problems otherwise posed by Schemes (a) and (b).

Figure 2.19: Two possible mappings of a Clos network. (a) same user FPGA pins connected to the same middle routing chip. (b) shifting pattern. The boxes inside the functional units and routing FPGAs denote conceptual switches that we treat as small crossbars.

The studies in this section only consider those signals to and from functional units and routing chips. Signals that connect the global memory to the routing chips are not taken into account. Table 2.6 defines the variables to be used in the discussions that follow.

Variable   Description
r          routing chips
din        input word length
o          output nodes
h          hidden nodes
pmcri      pins consumed in routing chip i due to memory connections to hidden and output nodes
pincri     pins consumed in routing chip i due to inter-neuron connections
rhij       pin count among ri and hj
roik       pin count among ri and ok
pinc2      pin count due to inter-neuron connections according to Scheme (c)
pmc2       pin count due to memory connections according to Scheme (c)
tp2        total pin count according to Scheme (c)

Table 2.6: Variables and their description.

Scheme (a): Assume that all pin 1s of all functional units are connected to ri, all pin 2s are connected to ri+1 , etc. as in the Realizer board [18]. All hidden nodes transmit

their Vj signals via the same routing chip ri; thus hidden node hi has 1 connection to routing chip ri (we disregard the threshold that hidden node h1 produces for the output nodes). Routing chip ri broadcasts the Vj signal to every output node, so there are h connections from routing chip ri to output node ok. To accomplish the back-propagation of errors (Δk,j), all output nodes send their error values Δk,j to the same routing chip, rx (the routing chip may be different from ri). The number of connections needed from ok to rx is h, the number of hidden nodes. Every hidden node reads the error values from the same routing chip, rx, so the number of connections between rx and hj is o. If we restrict broadcasts to the routing network and interleave the memory data bits to reduce the load on each routing chip, the number of connections required between routing chip ri and hidden node hj is din/r. Output node ok may read its target output from ri. Figure 2.20 shows the inter-neuron connections between two hidden and two output nodes. Figure 2.20 assumes the fixed-pin construction. Formulae (2.8) and (2.9) list the pins used by inter-neuron connections among hidden node j, output node k, and routing chips ri and rx.

Figure 2.20: System with 2 hidden nodes, 2 output nodes, and 3 routing chips assuming the fixed-pin pattern interconnect. (a) Feed-forward and feed-back connections. The routing chips are unfolded for clarity. No memory connections are shown. (b) Connections between routing chip ri, one hidden node, and one output node. Memory connections are accounted for: din/r between ri and hj, and 1 (target or real output) from ri to ok. Connections among hidden node hj and routing chip rx, and output node ok and routing chip rx.

pincri = 1 + h    (2.8)

pincrx = o + h    (2.9)

Both pincri and pincrx depend directly on the number of hidden and output nodes present in the system. This configuration is not scalable, since we need a much larger routing chip as our system grows.

Scheme (b): As in Scheme (a) we assume that the data to the global memory is interleaved, but we apply the shifting pattern construction in an effort to reduce the direct dependency of Formulae (2.8) and (2.9) on h and o. The shifting pattern construction poses problems when the number of hiddens, outputs, or inputs is not a multiple of r. Let a hidden node send Vj to only one routing FPGA, ri. In the case that the number of hidden nodes is not a multiple of r, and that the number of output nodes is greater than one, the output nodes have to use a set of fixed pins to receive the residual hidden-to-output signals. If they were to receive Vj via the set of pins that shift, then some of the output nodes would not be able to connect to the proper routing FPGA. This problem can be solved by performing another PPR run to get a new output node with a different pin assignment for the Vj signal, but we want to perform only one PPR run for the output nodes. For example: let us assume that there are two hidden and two output nodes in our system, as in Figure 2.21. Let output node o1 receive its hidden-to-output (V1 and V2) signals via input/output pads P12 and P14 (connected respectively to r1 and r2, for the sake of argument). Because we replicate the same output node from fi to fi+1, o2 will also have P12 and P14 configured to expect the hidden-to-output signals V1 and V2. Because these input/output pads belong to the shifting pin set, they are no longer connected to the same routing chips. P14 in fi+1 now connects to r3, which cannot supply a hidden-to-output connection. Let there be a subset of fixed pins to solve this special case. Because each functional unit has dual identity (hidden or output node), all functional units have to have the subset of fixed pins.

Figure 2.21: System with 2 hidden nodes, 2 output nodes, and 3 routing chips assuming the shifting pattern construction. Routing chip r3 can supply neither V1 nor V2 to output node o2.

Because the output weight units multiply both the hidden-to-output and the output-to-hidden signals, outputs cannot go back through different routing chips; the output weights would be modifying the wrong signals. When the number of hidden nodes is a multiple of r, the hidden-to-output connections can be made entirely via the pins that shift; otherwise the rest (the residue of h mod r) goes through pins that are fixed. For those signals that use the set of pins that shift, we need to scramble the weights in the output nodes. The input data to the hidden nodes exhibits the same behavior: there are din mod r data bits that need to be routed using fixed pins, and again, the weights in the hidden nodes have to be updated accordingly. Because of the shifting pin construction there is 1 connection among hi and ri due to the Vj signals (we disregard the threshold that hidden node 1 provides to the output nodes). Routing chip ri broadcasts the Vj signal to the output nodes, and this creates h/r connections among output node ok and routing chip ri. During the back-propagation phase, the connections on routing chip ri due to the error values from output node ok number h/r. The error values are one-to-one connections, so hidden node hi has o connections to routing chip ri. There are din/r connections among routing chip ri and each hidden neuron; these provide the input patterns to the hidden neurons. We let output node ok receive its target output tk from routing chip ri. We provide a pin-count to show the effect of this scheme:

pmcri = din/r + 1

pincri = 1 + o + h/r + h/r

Figure 2.22 shows a system with three hidden nodes, two output nodes, and two routing chips.

Figure 2.22: (a) A system with 3 hidden nodes, 2 output nodes, and 2 routing chips assuming the shifting pattern construction of Scheme (b). Routing chips and hidden nodes are unfolded for clarity. Fixed pins are used by the output nodes to receive V3 and produce Δ1,3 and Δ2,3. (b) Connections between routing chip ri, one hidden node, and one output node. Memory connections are accounted for: din/r among ri and hj, and 1 (target or real output) among ri and ok.

We have reduced the dependency of pincri on h by a factor of r. But pincri still depends on o, the number of outputs. We want to reduce that dependency even further, so we need Scheme (c):

Scheme (c): We shall solve the problems encountered in the previous two schemes. We eliminate the subset of fixed pins and are able to limit the number of feed-back connections that a routing chip accommodates to o/r. We differentiate among feed-forward, feed-back, and memory connections in the analysis that follows. Figure 2.23 shows the feed-forward and feed-back connections among two hidden and two output nodes. There are three routing chips in the system.

Figure 2.23: System with 2 hidden nodes, 2 output nodes, and 3 routing chips assuming the shifting pattern interconnect. (a) Feed-forward connections. (b) Feed-back connections.

Feed-forward: Let hidden nodes broadcast their hidden-to-output signal Vj to every routing chip. This is done by fanning out the external signal (Vj) r times inside each hidden node and routing the copies to r input/output pads. The fanout at the hidden nodes enables every routing FPGA to broadcast the hidden-to-output signal. Output nodes can select the routing chip that provides Vj. The fact that all hidden nodes broadcast Vj to different routing chips also solves the problem posed by having a system where the number of hidden nodes is not a multiple of r, since now each output node can select from which routing FPGA to get any one of the hidden-to-output signals. The requirement for a subset of fixed pins is eliminated. We let hidden node 1 broadcast its threshold value to all routing FPGAs, so that output nodes can receive that value from any routing FPGA. We use hidden node 1 as our "master" node to be replicated. As a result, all the hidden nodes have connections allocated to a threshold. These connections do not get used, except for the ones of hidden node 1. As a result of the broadcasts, the number of signals that routing chip ri receives from hidden node hj is 2: one signal is Vj and the other is the threshold value. Output node ok requires a Vj from each of the hidden nodes and can select the routing chip that provides it. The Vj signals are interleaved among the routing chips, therefore each output node may have at most h/r signals from a routing chip ri. All output nodes require the threshold signal from hidden node 1. Output node k selects a routing chip other than ri for its threshold signal.

Feed-back: The feed-back from outputs to hiddens (Δk,j) can use the same routing chips as the feed-forward (Vj) signals because output nodes can select from which routing chip to get the Vj or h1's threshold signals. The benefit of synchronizing the Δk,j and Vj signals is that the weights in the output nodes do not need to be scrambled. Each output node produces h of the feed-back signals (Δk,j), one for each hidden node. The signals can be interleaved among the r routing chips, so there are h/r signals from output node ok to routing chip ri. Because the feed-back is interleaved, each hidden node receives o/r signals from routing chip ri.

Memory connections: Finally, the memory connections are also interleaved. There are din/r memory connections from routing chip ri to hidden node hj. Output node ok has two connections: one from memory (the target output tk) and one to memory (the real output). We assign tk to routing chip ri. The real output is assigned to another routing chip. Because of the shifting pattern, if the width of the input memory is not a multiple of the number of routing chips, the residue of the modulus has to be broadcast from the source (the global memory) to all the routing chips. This is so because, due to the shifting construction, some hidden nodes might not have connections to the proper routing chip. There are r − 1 data bits of the global memory that need be broadcast to all the routing chips. Hidden node weights have to be scrambled because of the shifting pattern construction. Section 2.7 explains how to accomplish the weight scrambling. Figure 2.24 shows the inter-neuron and memory connections between routing chip ri, hidden node hj, and output node ok. The following equations express the number of connections between routing FPGA i (ri), hidden node hj, and output node ok:

pmcri = din/r + 1    (2.10)

pincri = 2 + h/r + h/r + o/r    (2.11)

Scheme (c) is our best solution. The factor o has been reduced by a factor of r.
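As a brief illustration (the numbers are chosen only for this example and are not an ACME configuration prescribed by the text): with din = 24, h = 10, o = 4, and r = 2, Formulae (2.10) and (2.11) give pmcri = 24/2 + 1 = 13 and pincri = 2 + 10/2 + 10/2 + 4/2 = 14 pins between routing chip ri and a single hidden/output node pair.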

Figure 2.24: Connections between routing FPGA ri, hidden node hj, and output node ok according to Scheme (c).

We can re-write Formulae (2.10) and (2.11) to highlight the connections among one hidden node, one output node, and one routing chip. Let rhij stand for the pins consumed in routing chip i due to connections among routing chip i and hidden node j, and roik for the pins consumed in routing chip i due to connections among routing chip i and output node k. Figure 2.25 shows the number of connections between routing chip ri, hidden node hj, and output node ok: din/r + 2 + o/r pins between ri and hj, and 2h/r + 1 pins between ri and ok.

Figure 2.25: Connections among routing chip ri, hidden node hj, and output node ok.

The following equations express the number of connections between routing FPGA i (ri), hidden node hj, and output node ok:

rhij = (din + 2r + o)/r    (2.12)

roik = (2h + r)/r    (2.13)

Note that Formulae (2.12) and (2.13) do not include the pins consumed by connections from ri to the global memory.


2.5.3 Pins Required by Inter-Neuron Connections (pinc) According to Scheme (c)

In this section we show the pin requirements to accomplish Scheme (c). In contrast with Section 2.4.2, in our present study we do not allow broadcasts in the routing network. Broadcasts happen at the source, the hidden nodes. Table 2.7 lists the pin requirements for inter-neuron connections to implement a neural net system. Formula (2.14) is an expression for the pins required. We arrive at Formula (2.14) by adding the entries under the column labeled Pins in Table 2.7.

Direction                     Signal   Pins       Description
hiddens-to-routing network    Vj       h·r        all hiddens broadcast at the source
hiddens-to-routing network    θ        r          hidden node h1 broadcasts the threshold value
routing network-to-outputs    Vj       h·o + o    some Vj and the threshold are used by the output nodes
outputs-to-routing network    Δk,j     h·o        output nodes send back their error values
routing network-to-hiddens    Δk,j     h·o        (same as above)

Table 2.7: Pins required by inter-neuron connections.

pinc2 = 3·h·o + o + r + h·r    (2.14)
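For example (an illustrative configuration, not one prescribed by the text): with h = 12, o = 2, and r = 5, Formula (2.14) gives pinc2 = 3·12·2 + 2 + 5 + 12·5 = 139 pins consumed by inter-neuron connections across the whole routing network.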

2.5.4 Pins Required by Memory Connections (pmc) According to Scheme (c)

In this section we account for the number of connections among the neurons and the global memory. We let din stand for the input data word. According to Scheme (c), the global memory broadcasts r − 1 signals at the source. Table 2.8 lists the number of pins required by the routing network.

Direction                     Signal   Pins            Description
memory-to-routing network     si       x (see below)   data input to hiddens
memory-to-routing network     tk       o               target outputs
routing network-to-memory     ok       o               real outputs
routing network-to-hiddens    si       din·h           each hidden node gets the inputs
routing network-to-outputs    tk       o               target outputs
outputs-to-routing network    ok       o               real outputs

Table 2.8: Pins required by memory connections.

The number of connections required differs according to the number of hidden nodes and the number of data inputs. Because of Scheme (c) we need to account for the broadcasting of some data bits at the source of the global memory. If the number of inputs is not a multiple of r, we broadcast the residue of the modulus from the global memory using what we call special bits. The global memory needs to have r − 1 special bits. Three different cases are generated according to the number of inputs and the number of hidden nodes; they are shown below.

If din mod r ≠ 0:

x = din − (din mod r) + (din mod r)·h    if h ≤ r
x = din − (din mod r) + (din mod r)·r    otherwise        (2.15)

If din is a multiple of r:

x = din        (2.16)

In Formula (2.15) there are three terms, labeled 1, 2, and 3, defined as follows:

1. din − (din mod r): bits that do not need to be broadcast at the source (global memory).
2. (din mod r)·h: bits that need to be broadcast at the source (global memory).
3. (din mod r)·r: once we broadcast the first din mod r data bits to the r routing chips, the routing chips can broadcast the data inputs that they already have to the extra hidden nodes (given that there are more than r hidden nodes).

Formula (2.17) is an expression for the total number of pins required due to memory connections. We generate this formula by adding the entries under the column labeled Pins in Table 2.8.

pmc2 = x + 4·o + din·h    (2.17)


2.5.5 Growth of Pins According to Scheme (c)

The pins required for our routing network now reflect the broadcasts at the source (global memory and hidden nodes). The expression that accounts for those pins is:

tp2 = pinc2 + pmc2 = 3·h·o + o + r + h·r + x + 4·o + din·h    (2.18)

where x is defined as in the previous section (Formulae (2.15) and (2.16)). We let din = 24 and h range from 2 to 14 to obtain our parameters for tp2. According to these parameters we plot tp2 in Figure 2.26. We observe that as the number of routing chips increases, so does the number of connections. This is due to the broadcasts at the hidden nodes and the global memory.


Figure 2.26: Pin consumption (tp2) versus number of output nodes (o). The number of hidden nodes is always 14 − outputs, and the input data width is always 24. Rhombus: four routing chips; plus: five routing chips; square: 10 routing chips; x: 14 routing chips; triangle: 32 routing chips.
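The growth of Formula (2.18) is easy to reproduce mechanically. The sketch below is a minimal illustration written for this document (the function names x_pins and tp2 are ours, not part of the ACME software); it evaluates x per Formulae (2.15) and (2.16) and tp2 per Formula (2.18) for the same parameters used in Figure 2.26 (din = 24, h = 14 − o, r from the plot legend).

#include <stdio.h>

/* x as defined by Formulae (2.15) and (2.16). */
static int x_pins(int din, int h, int r) {
    int res = din % r;
    if (res == 0)
        return din;                        /* Formula (2.16)                 */
    if (h <= r)
        return din - res + res * h;        /* Formula (2.15), case h <= r    */
    return din - res + res * r;            /* Formula (2.15), otherwise      */
}

/* tp2 = pinc2 + pmc2, Formula (2.18). */
static int tp2(int din, int h, int o, int r) {
    int pinc2 = 3*h*o + o + r + h*r;              /* Formula (2.14) */
    int pmc2  = x_pins(din, h, r) + 4*o + din*h;  /* Formula (2.17) */
    return pinc2 + pmc2;
}

int main(void) {
    int din = 24, f = 14;
    int chips[] = {4, 5, 10, 14, 32};
    for (int i = 0; i < 5; i++)
        for (int o = 2; o <= 12; o++)
            printf("r=%2d o=%2d h=%2d tp2=%4d\n",
                   chips[i], o, f - o, tp2(din, f - o, o, chips[i]));
    return 0;
}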


2.5.6 Minimum Number of Routing FPGAs

We perform three different analyses to calculate the minimum number of routing chips of 175 pins each required for a system with fourteen nodes. In our first analysis we apply Formula (2.18) and assume that the number of hidden nodes is larger than the number of routing chips. In our second analysis we use Formulae (2.12) and (2.13) and maximize the number of output nodes. In our final analysis we use the same formulae as in the previous case, but maximize the number of hidden nodes instead. The difference between the first analysis and the last two is that the first analysis approximates the maximum number of connections from a global perspective (the equation used involves the total pins consumed at the routing network as a whole). The other two cases apply local information only (the number of pins consumed at a routing chip ri) and "guess" the maximum load on each routing chip by maximizing the number of output nodes (second analysis) or the hidden nodes (third analysis). In all the analyses we vary the number of routing chips and compare the results obtained with a maximum bound. The three cases converge to the same result: the minimum number of routing chips for a system with 14 nodes applying Scheme (c) is four.

(There are 175 user input/output pins in an XC3195 Xilinx FPGA; our routing chips are Xilinx XC3195 FPGAs.)


Analysis 1: Formula (2.18) depends on x, and x is defined according to Formulae (2.15) and (2.16). We will assume that the number of hidden nodes is greater than the number of routing chips, and that the input data width is not a multiple of the number of routing chips. Given these assumptions, we let

x = din − (din mod r) + (din mod r)·r

according to Formula (2.15) to obtain

tp2 = 3·h·o + o + r + h·r + din + (din mod r)·(r − 1) + 4·o + din·h    (2.19)

according to Formula (2.18). We perform an analysis on Formula (2.19) to derive the number of output nodes that produces the maximum (minimum) number of connections in the routing network, assuming that we apply Scheme (c):

0 = d(tp2)/do = d(3·h·o + 5·o + h·(din + r) + din + (din mod r)·(r − 1))/do

Let h = f − o, f = 14, and din = 24; then

0 = d(23·o − o·r − 3·o^2 + 360 + 14·r + (24 mod r)·(r − 1))/do

0 = 23 − r − 6·o

o = ⌈(23 − r)/6⌉    (2.20)

We use Formulae (2.12) and (2.13) to calculate the pins needed by routing chip ri to connect to hidden node hj and output node ok. We use Formula (2.20) to find the number of output nodes o that produces the maximum pin requirement at the routing network. We let h = 14 − o, f = 14, and assume each routing chip has 175 input/output pins. We reprint the formulae for convenience:

o = ⌈(23 − r)/6⌉
rhij = (din + 2·r + o)/r
roik = (2·h + r)/r

We assume our global memory to be 56 bits wide. We distribute the bits among the routing chips so that each routing chip has ⌈56/r⌉ pins connected to the global memory. According to Scheme (c) we allow for an extra r − 1 connections from the global memory to each routing chip. In Table 2.9 let tap be the total available pins in each routing chip (175). Let tpl be the total pins left after the connections to the global memory are accounted for: tpl = tap − ⌈56/r⌉ − (r − 1). Finally, let ⌊tpl/14⌋ be the total pins left from routing chip ri to any functional unit.

r   tap   ⌈56/r⌉   tpl   ⌊tpl/14⌋   o = ⌈(23−r)/6⌉   h    ⌈rhij⌉   ⌈roik⌉
5   175   12       159   11         3                11   8        6
4   175   14       158   11         4                10   9        6
3   175   19       154   10         4                10   12       8
2   175   28       147   10         4                10   16       11

Table 2.9: Worst-case input/output pin requirement on routing chip ri.

The following two inequalities have to be satisfied for routing chip ri to accommodate all signals:

⌊tpl/14⌋ ≥ ⌈rhij⌉    (2.21)

⌊tpl/14⌋ ≥ ⌈roik⌉    (2.22)

The column labeled ⌊tpl/14⌋ in Table 2.9 shows that we cannot build the middle stage of our interconnection network with fewer than four routing chips: Inequality (2.21) is not satisfied if we use two or three routing chips.


Analysis 2: The following is a study similar to Analysis 1. Instead of finding the number of output nodes that maximizes the pins needed by the routing network, we maximize the number of output nodes and calculate the pins required by routing FPGA ri, as shown in Table 2.10. tpl and ⌊tpl/14⌋ are defined as in Analysis 1.

r   o    h   ⌊tpl/14⌋   ⌈rhij⌉   ⌈roik⌉
5   12   2   11         10       2
4   12   2   11         11       2
3   12   2   10         14       3
2   12   2   10         20       3

Table 2.10: Worst-case pin requirement of ri according to the maximum number of output nodes (12).

Table 2.10 shows that we can build the routing network with four routing chips because Inequalities (2.21) and (2.22) are satisfied.

Analysis 3: We perform a pin-count in a manner similar to that of Analysis 2. In this case we maximize the number of hidden nodes. Table 2.11 shows the input/output pins required by routing chip ri. The table shows that we can build the routing network with four routing chips because Inequalities (2.21) and (2.22) are satisfied. tpl and ⌊tpl/14⌋ are defined as in Analysis 1.

r   o   h    ⌊tpl/14⌋   ⌈rhij⌉   ⌈roik⌉
5   1   13   11         7        7
4   1   13   11         9        8
3   1   13   10         11       10
2   1   13   10         15       14

Table 2.11: Worst-case pin requirement of ri according to the maximum number of hidden nodes (13).

From our study we conclude that the minimum number of routing chips with 175 input/output pins to implement the routing network is four.
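The analyses above are easy to mechanize. The sketch below is our own helper, not part of the ACME tools: it recomputes the quantities behind Tables 2.9 through 2.11 for the worst-case o given by Formula (2.20) and checks Inequalities (2.21) and (2.22). Pins reserved for signals not modeled here would lower the per-unit budget further, so small differences from the printed tables are possible, but the conclusion that at least four routing chips are needed is the same.

#include <stdio.h>

static int ceil_div(int a, int b) { return (a + b - 1) / b; }

/* Check Inequalities (2.21) and (2.22) for one candidate (r, o, h). */
static int fits(int r, int o, int h, int din, int pins_per_chip, int units) {
    int tpl    = pins_per_chip - ceil_div(56, r) - (r - 1); /* pins left after global memory */
    int budget = tpl / units;                               /* floor(tpl/14)                 */
    int rh     = ceil_div(din + 2*r + o, r);                /* Formula (2.12), rounded up    */
    int ro     = ceil_div(2*h + r, r);                      /* Formula (2.13), rounded up    */
    printf("r=%d o=%2d h=%2d tpl=%3d budget=%2d rh=%2d ro=%2d\n",
           r, o, h, tpl, budget, rh, ro);
    return budget >= rh && budget >= ro;
}

int main(void) {
    int din = 24, f = 14;
    for (int r = 2; r <= 5; r++) {
        int o = ceil_div(23 - r, 6);   /* Formula (2.20): worst-case o for Analysis 1 */
        int ok = fits(r, o, f - o, din, 175, f);
        printf("  -> %s with %d routing chips\n", ok ? "feasible" : "infeasible", r);
    }
    return 0;
}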

2.6 Implementation of a Clos-like Interconnection Network

Based on our analysis, we implement a Clos-like three-stage interconnection network. The middle stage is built with five Xilinx XC3195 FPGAs, as depicted in Figure 2.27. The XC4010 FPGAs implement the outer stage of the Clos-like interconnection network.

Figure 2.27: ACME architecture.

We use the XC3195 as the basic unit to implement the middle stage because it is the Xilinx part with the highest pin-count (in a PGA package) and it is the fastest Xilinx part available (as of February 1994). We use five routing FPGAs to make the parameter m of a Clos network relatively small. A small m means shorter distances among pins connected to the same routing FPGA inside the XC4010s. This helps the place and route program PPR to produce a routable design. We could have used four routing FPGAs. The reason we use five is threefold: 1) the conceptual switches inside the XC4010 FPGAs are spread more evenly around each XC4010 FPGA, 2) we decided upon Scheme (c) late in the development of the board, and we already had five routing FPGAs partially connected in the prototyping board, and 3) using fewer than five routing FPGAs might over-constrain systems with a large number of output nodes.

Each XC3195 has 175 input/output pins. There are eleven conceptual switches with 5 connections each per XC4010, so m = 5. Figure 2.28 shows how the outer stage is built inside the XC4010 FPGAs. Note that there are really five different footprints, where each footprint shifts its connections according to the shifting pattern. Figure 2.28 only shows the first three footprints. F1 and F6 share the same footprint, F2 and F7 share the same footprint, and so on.

Figure 2.28: Footprint of functional units. The numbers inside the FPGAs represent input/output pads grouped into 11 "switches." The labels on the outside of the boxes represent connections to routing FPGAs.

The global memory unit is attached to the middle stage of the interconnection network for the purpose of broadcasting the data bits to the different XC4010s. The memory unit is 56 bits wide, and we distribute its data bits (d0 through d55) so that they are evenly spread across the routing FPGAs. Bits d0 through d23 hold the data input word din. Bits d20 through d23 are the special pins used to broadcast inputs from the global memory to the routing network according to Scheme (c). Bits d24 through d35 hold the target outputs (tk), and bits d40 through d51 hold the real outputs (ok). Appendix B.3 shows how the global memory is used. Figure 2.29 shows the architecture of the global memory.
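Figure 2.29 suggests that, apart from the four special bits, the 56 memory data bits are dealt out to the five routing FPGAs in a simple round-robin order. The fragment below is our own illustration of such a mapping; the function name and the round-robin assumption are ours, and Figure 2.29 together with Appendix B.3 remains the authoritative assignment.

#include <stdio.h>

/* Illustrative only: map a global-memory data bit to a routing FPGA,
 * assuming a round-robin distribution of bits d0..d55 over R1..R5.
 * Special bits d20..d23 are broadcast to every routing FPGA instead. */
static int routing_fpga_for_bit(int bit) {
    if (bit >= 20 && bit <= 23)
        return 0;                 /* 0 means: broadcast to all routing FPGAs */
    return (bit % 5) + 1;         /* 1..5 stands for R1..R5 */
}

int main(void) {
    for (int bit = 0; bit < 56; bit++) {
        int r = routing_fpga_for_bit(bit);
        if (r == 0)
            printf("d%d -> broadcast to R1..R5 (special bit)\n", bit);
        else
            printf("d%d -> R%d\n", bit, r);
    }
    return 0;
}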

2.7 A System With 6 Input, 3 Hidden, and 2 Output Nodes

We use an example to illustrate how to build a neural system with Scheme (c). This example shows only group-forced external signals among functional units and among functional units and the global memory. We only show one forced net to the X0 controller.

Figure 2.29: Organization of the global memory.

The net is used to shift the weights in and out of both hidden and output nodes. Other group-forced external signals to the routing chip R0 and external forced signals to the X0 controller are not shown. We adopt Scheme (c) for our implementation and use the shifting pattern; therefore each hidden node broadcasts its hidden-to-output signal Vj at the source. Figure 2.31 shows the inter-neuron connections between the hidden and output nodes. Note that the hidden weight units are scrambled in the hidden nodes. This example has six input bits, and the first five weights are shifted from hidden node to hidden node. The sixth input is fixed and receives its data from bit d20 of the global memory. Bits d20, d21, d22, and d23 are broadcast at the source (the global memory) to each routing chip. We call them special bits. Columns two through five (inclusive) of Table 2.12 show the order in which the hidden node weights should be shifted into a hidden node given that there are fewer than r weights left. The last column in Table 2.12 shows the order in which the hidden node weights should be shifted into a hidden node given that there are more than r weights left to be shifted in. Figure 2.30 shows the external signals connected to and from the global memory. Notice that the routing FPGAs are replicated to enhance readability. The hidden weight units are scrambled to compensate for the pin-shifting effect of Scheme (c). Referring to Table 2.12: the first weight shifted into the hidden nodes is not scrambled, according to column din mod 5 (the column labeled 1); the last five weights are scrambled according to column (din − 1) mod 5 (the column labeled 0) and the row that corresponds to the hidden node in question.


          din mod r
h    1    2       3          4             0 (five weights)
1    w0   w0 w1   w0 w1 w2   w0 w1 w2 w3   w0 w1 w2 w3 w4
2    w0   w1 w0   w1 w2 w0   w1 w2 w3 w0   w1 w2 w3 w4 w0
3    w0   w0 w1   w2 w0 w1   w2 w3 w0 w1   w2 w3 w4 w0 w1
4    w0   w1 w0   w0 w1 w2   w3 w0 w1 w2   w3 w4 w0 w1 w2
5    w0   w0 w1   w1 w2 w0   w0 w1 w2 w3   w4 w0 w1 w2 w3
6    w0   w1 w0   w2 w0 w1   w1 w2 w3 w0   w0 w1 w2 w3 w4
7    w0   w0 w1   w0 w1 w2   w2 w3 w0 w1   w1 w2 w3 w4 w0
8    w0   w1 w0   w1 w2 w0   w3 w0 w1 w2   w2 w3 w4 w0 w1
9    w0   w0 w1   w2 w0 w1   w0 w1 w2 w3   w3 w4 w0 w1 w2
10   w0   w1 w0   w0 w1 w2   w1 w2 w3 w0   w4 w0 w1 w2 w3
11   w0   w0 w1   w1 w2 w0   w2 w3 w0 w1   w0 w1 w2 w3 w4
12   w0   w1 w0   w2 w0 w1   w3 w0 w1 w2   w1 w2 w3 w4 w0
13   w0   w0 w1   w0 w1 w2   w0 w1 w2 w3   w2 w3 w4 w0 w1

Table 2.12: Hidden node weights are scrambled. If din mod 5 ≠ 0, then the special data pins in the global memory need to be used to broadcast the residual din mod 5 data input bits. In the table, each row represents the hidden node number, and each column represents the number of weights that are scrambled. If the input data width is less than 5, then columns two through five (inclusive) of this table are used as a guide to scramble the weights. If there are more than 5 data bits, then the sixth column is used as a guide to scramble the first five weights, and columns two through five (inclusive) are used as a guide to scramble the last din mod 5 weights.
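The pattern in Table 2.12 is a left rotation: for hidden node h and a group of k weights, the group is loaded starting at weight index (h − 1) mod k. The sketch below is our own restatement of that rule for illustration only (the function name is ours; the actual weight shifting is performed by the net program and the hardware shift chain described in Chapter 3).

#include <stdio.h>

/* Print the order in which a group of k weights (w0..w[k-1]) should be
 * shifted into hidden node h, following the left-rotation pattern of
 * Table 2.12: start at index (h - 1) mod k. */
static void print_scrambled_order(int h, int k) {
    int start = (h - 1) % k;
    printf("h%d, %d weights:", h, k);
    for (int i = 0; i < k; i++)
        printf(" w%d", (start + i) % k);
    printf("\n");
}

int main(void) {
    /* Reproduce the last column of Table 2.12 (groups of 5 weights). */
    for (int h = 1; h <= 13; h++)
        print_scrambled_order(h, 5);
    /* And the din mod r = 3 column. */
    for (int h = 1; h <= 13; h++)
        print_scrambled_order(h, 3);
    return 0;
}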

Figure 2.31 shows the inter-neuron connections. The output node weights are not scrambled, a consequence of using Scheme (c). The routing FPGAs are replicated to enhance readability. The figure does not show the threshold broadcast from hidden node 1 to the output nodes.

Figure 2.30: Memory connections of a system with 3 hidden nodes, 2 output nodes, and 6 inputs. The figure shows memory-to-hidden connections on the top and memory-to-output connections on the bottom. The routing FPGAs on the top and on the bottom of the figure are the same physical units; they have been unfolded for clarity. Note that because the system has six inputs (and six is not a multiple of the number of routing FPGAs), the sixth input is allocated to memory data bit 20 (one of the special data bits). The boxes inside the hidden nodes are weight units. The weights are shifted serially into the hidden nodes. The dotted lines going into the hidden nodes and feeding the first (or last) weight unit are bi-directional lines connected to the X0 controller.

Figure 2.31: Inter-neuron connections of a system with 3 hidden (H1, H2, and H3) and 2 output (O1 and O2) nodes. Notice how the Vj signals are broadcast inside the hidden nodes, in the first stage of the routing network. The routing FPGAs on the top and the bottom of the figure are the same physical devices; they are unfolded to illustrate the two different types of inter-neuron connections: feed-forward (Vj) and feed-back (Δk,j). The boxes inside the output nodes are output weight units. The weights are shifted serially into the output nodes. The dotted lines going into the output units and feeding the first (or last) weight unit are bi-directional lines connected to the X0 controller.


2.8 A Note on the Scalability of ACME

Our implementation of ACME has been designed to accommodate any number of hidden and output nodes in 14 functional units. ACME can scale within those bounds. We use the shifting pattern construction to connect the functional units to the routing chips. We apply Scheme (c) to accomplish any neural network using only one hidden and one output master node. Scheme (c) assumes that the hidden nodes broadcast their Vj signals. The global memory has the capability to broadcast r − 1 data bits to every routing chip, and every routing chip can broadcast the input data to the hidden nodes. Formulae (2.12), (2.13), (2.19), and (2.20) allow us to check that any system is implementable. Assuming that we use Scheme (c), we can apply the same formulae to build systems that have more than fourteen nodes, and we can perform a study similar to that of Section 2.5.6 to check that such a system is implementable. There is a trade-off among the number of pins in each routing chip, the number of routing chips, the number of data input bits, and the number of hidden and output nodes. Acquiring a balance among those variables is not an easy task, and may require the application of a scheme other than Scheme (c) to provide a balanced load on the routing devices. If a scheme different from Scheme (c) is used, the formulae have to be re-derived. A large degree of freedom is added if we allow the use of more than one hidden and one output master node to accomplish the neural network. The latter requires that the place and route program (PPR) be run more than two times, and complicates the issue of accomplishing the correct interconnections among the nodes.


3. Hardware Implementation

This chapter describes the platform that supports the ACME back-propagation emulator. It focuses mainly on the hardware aspects, and mentions only briefly the software that directly relates to the hardware. ACME is composed of three basic units defined in Chapter 2: input, hidden, and output neurons. Input neurons are mapped to the global input memory. Hidden and output neurons are implemented in Xilinx XC4010 FPGAs. Output neurons require a target output and in turn generate a real output. Target outputs are read from the global input memory. Real outputs are written to the global output memory. The decision to implement hidden and output neurons on XC4010 FPGAs and the details of their realization can be found in [2]. Xilinx XC3195 FPGAs connect the neurons among themselves and the neurons to the global memory. Of concern to us in this chapter are the architecture of the hardware and an overview of the software that surrounds and supports the operation of the neurons. We solve several key issues in our hardware implementation of ACME, two being the most prominent: 1) an efficient way to configure the neurons and routing FPGAs, and 2) an efficient way to access the memory units. Both the XC4010 FPGAs and the XC3195 FPGAs are configured in parallel. Each XC4010 FPGA is configured in one of three ways: as a hidden neuron, as an output neuron, or with no function. Our configuration method is particularly effective because only three files are required to configure any number of neurons in any size system. The advantages are twofold: 1) the speed of configuration is constant for any size system (i.e., a system scales with no configuration overhead), and 2) the amount of memory required in the SPARC station (our host machine) to store the neurons is constant and minimal for any system. We take advantage of direct memory access (DMA) to maximize the performance of both configuration and memory transfers. We use dual-ported memory chips to simplify and streamline the hardware design. The architecture of the routing network (explained in Chapter 2) enables us to use one copy of a hidden neuron and one of an output neuron. We consider ACME to be a SIMD (Single Instruction Multiple Data) processor: we only need one copy of the neurons, and all neurons receive the same instruction and act on different data (for example during back-propagation). The following section presents an overview of the ACME hardware. An important issue, synchronization among components, is presented in Appendix A.2 together with some of the interesting points we learnt during the design and implementation of the hardware.

3.1 Introduction: Overview of Implementation

We can divide the ACME system into three major components: a) the host computer and software, b) the interface cards: the DMA controller card [32] (the correct name of the card is "DPS-1"; we call it the "DMA controller card" for convenience) and the X0 controller board, and c) the ACME board. The host computer is a SPARC running the UNIX operating system. The computer is linked to our Local Area Network (LAN) at UCSC. The computer has a bus (SBus) with two expansion slots. One of the slots hosts an ethernet card that connects to the LAN. The other slot accommodates the DMA controller card. The DMA controller card allows data to be transferred into and out of the SPARC processor via single-port input/output transfers (port I/O) or direct memory access (DMA). We built a controller board that provides the necessary handshaking between the DMA controller card and the ACME board. The controller board holds X0 and is connected to the DMA controller card through a 37-pin ribbon cable. The controller board talks to the ACME board via a pair of 50-pin ribbon cables. Because of wire lengths we place buffers and transceivers (we use AS245 TTL chips both as uni-directional buffers and transceivers because of their nice pin layout) on both ends of all the ribbon cables, and because of reflection problems we terminate the signals on many of the wires that belong to the ribbon cables. The ribbon cable that connects the controller board to the DMA controller card is particularly long (about 20 inches). Because of the cable's length we interleave signal and ground wires to prevent capacitive coupling among signal wires [20]. As a result of the interleaving, fourteen of the 37 wires are connected to the SPARC's ground on one end and to the controller board's ground on the other end. Two separate regulated power supplies provide power to a) the controller board and b) the ACME board. Figure 3.1 shows a high-level view of ACME's hardware.

Figure 3.1: High-level view of the ACME environment.

The software running on the SPARC station drives ACME's hardware platform. The software can be sub-divided into six components: 1) generation of hidden and output neurons, 2) generation of the middle stage of the routing network, 3) configuration of the ACME board, 4) access to private and global memories, 5) the program net for run-time control, and 6) a test suite. ACME's hidden and output neurons are held in XC4010 FPGAs. The middle stage of the routing network is implemented with XC3195 FPGAs. The design of the hidden and output neurons (component 1) is described in [2]. After the hidden and output neurons are generated and before they are routed with the place and route program PPR (see Chapter 1 for an explanation of PPR), the program color assigns a set of pins to be used by the external signals of the two hidden and output neurons.

Program color assigns the external signals according to our routing network described in Chapter 2. The two hidden and output neurons are individually routed using PPR and the pinout information generated by color. After the hidden and output neurons are routed, the locations of the external signals are read from the hidden and output report files. The locations thus obtained are used to generate the files for the routing FPGAs. These files are generated by hand, except for R0. We have only partially implemented component 2. Chapter 4 gives an overview of the file generation steps. There can be a total of fourteen XC4010s and six XC3195s on the main board. All of these FPGAs need to be configured because of their SRAM technology: the connections and functions in each FPGA are performed according to the setting of internal memory bits that are erased once power is turned off. The internal memory bits are set to their proper value once: during the configuration of the device. On power-up, the FPGAs on the ACME board are unconfigured. Two software programs (fdp and rdp) are used to program the FPGAs on the board with the two hidden and output neuron files and the six routing FPGA files that were previously generated. The process of downloading data bits into the FPGAs is called "configuring" the FPGAs. Configuration data is sent through X0, and both the XC4010s and the XC3195s are configured in parallel using DMA transfers. X0, the FPGA in the controller board, is configured differently: X0 reads its configuration bit stream from a programmable read-only memory (PROM). X0 is sometimes configured in a bit-serial fashion using the vendor's xchecker software and cable. We use xchecker when developing changes to, or while debugging, the interface requirements. There are fourteen "private" memories and seven "global" memory chips in ACME. The memory chips are all the same: dual-ported 4Kx8-bit Integrated Device Technology IDT7134SA. The program access memo transfers data to and from any memory on the ACME board. Transfers are made via direct memory access facilitated by the L64853 SBus controller from LSI Logic present in the DMA controller card. The net program is used to provide the basic functions for the neural net to work. Finally, script files exist to test the memories. The test suites are very complete, involving also the XC4010 FPGAs in the process of testing the memories. More detail is given on the software programs and test suites in Chapter 4. Figure 3.2 shows a view of the ACME development center.

Figure 3.2: Picture of the ACME development center. From left to right: top: oscilloscopes and power supplies. Bottom: ACME board and controller board, SPARC station, monitor and keyboard.

3.2 The ACME Board

The ACME board is implemented on a sea-of-holes wire-wrap board with power and ground planes. The ACME board has fourteen XC4010 FPGAs (F1 through F14) and six XC3195 FPGAs (R0 through R5). Each XC4010 FPGA has its own private memory chip, and there are seven global memory chips. Figure 3.3 shows a picture of the ACME board at an intermediate stage of its development. Figure 3.4 shows the architecture of the ACME board and, at a higher level, the controller board and the DMA controller card. Figure 3.4 does not show the buffers and transceivers used to connect the controller board to the ACME board.


Figure 3.3: The ACME board at an intermediate stage of its development. Ribbon cables and buffers are at the top. The larger chips are the Xilinx FPGAs: the five at the periphery are XC4010s, and the six in the center are XC3195s. The smaller chips are the memory units: the seven in the center are global memories, and the five at the periphery are private memories.

Figure 3.5 shows a more detailed view of processing element x, which consists of XC4010 FPGA Fx and private memory pmemx. There are eleven pins connected from each XC4010 FPGA to each XC3195 FPGA. The XC4010 FPGAs F1 and F14 also have fifteen connections to R0. R0 transfers the twelve address and three control lines to the two sets of global memories: global input memories and global output memories. The address and controls to these memories are on two separate busses: one for the two global output memory chips and another for the five global input memory chips. We require two output memory chips because there can be at most twelve output neurons in one of ACME's systems, and since each output neuron produces a one-bit output (remember that the data to and from the global memory are transferred in serial fashion), there have to be at least two memory chips of eight bits each. A similar reasoning accounts for the five input memory chips: we arbitrarily set the maximum width of the input neurons to be 24 bits (requiring at least three memory chips), and there can be at most twelve output neurons, which require 12 memory bits for their respective target outputs.

Figure 3.4: ACME board, DMA controller card, and X0 controller board.


Figure 3.5: Processing element.

There is just one XC4010 FPGA (F1) that transmits the busy- signal to the controller X0. The busy- signal tells the controller that the neurons are in operation. The configuration data is loaded (as are the weights) bit-by-bit and in parallel via the connection labeled din/weights in Figure 3.5. A six-bit wide bus broadcasts the signals program-, reset-, ws-, go-, wtwr/rd-, and shift-. Table 3.1 describes the use of each signal in the bus.

Name       Direction           Description
program-   X0-to-ACME board    asserted before configuration
reset-     X0-to-ACME board    global reset for all XC4010 FPGAs
ws-        X0-to-ACME board    used during configuration
go-        X0-to-ACME board    start the neurons
wtwr/rd-   X0-to-ACME board    direction of weight transfer
shift-     X0-to-ACME board    shift in/out one bit of weights

Table 3.1: Signals that belong to the six-bit wide bus.

Each of the XC4010 FPGAs has a "done" pin. The done pin is used to tell whether the configuration has been successful. All the done pins are individually connected to a different pin of a Programmable Array Logic device (PAL). The PAL is programmed as a 14-input NAND gate. The output of the NAND gate is fed back to X0 and read by the configuration program fdp right after configuration. The NAND gate produces a low output if all the XC4010 FPGAs are configured.

Each of the done pins has an LED that is used for visual verification of the programming. Figure 3.5 shows the crystal that provides the clock for the ACME board. It should be clear that the SPARC station's clock is not used in the ACME board. Different systems require different clock frequencies, and the best and easiest way for us to provide those frequencies is by using a crystal oscillator that can be changed accordingly. The output of the oscillator is fed to a buffer for signal strength, and the output of the buffer is broadcast to all the XC4010 FPGAs. Finally, both the private and the global memory chips are dual ported. We decided to use dual-port memory chips to simplify the design and to reduce the area used by the memory chips. For electrical reasons we buffer the memory chips as shown in Figure 3.6. X0, the controller, broadcasts the address and two control signals (we- and oe-) to every memory chip. The cs- signals identify each chip separately, and there are twenty-one of those signals. Six signals identify the sub-group that a memory chip belongs to. The sub-group enables (gsel0-, gsel1-, and psel0- through psel3-) are generated automatically by X0 on selecting the proper memory chip.


Figure 3.6: Tree of buffers to decrease the load on each memory data bit. The P stands for pull-up resistors.


3.3 Interfacing the ACME Board and the SPARC Station

Information transfer between the SPARC station and the ACME board is facilitated by the DMA controller card. The DMA controller card is a product of Dawn VME Products and allows us to perform 8-bit wide DMA as well as port I/O transfers to and from the SPARC station. DMA transfers are controlled by LSI Logic's SBus L64853 DMA Controller that resides in the DMA controller card. The DMA controller card can transfer data at two speeds: fast or slow. We use the slow speed. The DMA controller card has two different I/O channels, D and E, with the capability of performing 8-bit and 16-bit wide transfers respectively. Because our SPARC station is equipped with an ethernet card we cannot use port E; we are bound to use port D. We can use 8-bit transfers because ACME's memory chips are 8 bits wide, and we design the configuration process of the XC4010s and XC3195s so that fewer than 8 bits of data are required to be transferred in parallel. The DMA controller card comes with an EPROM that is loaded with initialization firmware written in Forth code. The EPROM is used at boot time by the SPARC processor to identify the card's ID and create configuration tables in system memory which characterize the board as a system I/O device [32]. A device driver is provided with the DMA controller card and allows us to use high-level system calls to access the card. We modified the driver slightly to add more port input/output addresses. The card itself is identified as a device file, and as such it has to be opened and closed. Before using the card the software issues the statement

DevicePointer = open ("/dev/ddps0\0", O_RDWR)

where DevicePointer can be initialized to be a static integer and holds the returned pointer to the DMA controller card, "/dev/ddps0" is the device name, and O_RDWR allows DevicePointer to be written and read. (The latter is a flag that is maintained by the UNIX operating system.) Once DevicePointer has been associated with the DMA controller card via the open() system call, it can be used as a parameter to perform port input/output transfers via the UNIX system call ioctl(), and DMA transfers using the system calls read() or write(). Before the program ends, the DMA controller card is closed by issuing the command close(DevicePointer).

The DMA controller that resides in the DMA controller card has a set of internal registers. For the purposes of our implementation we only need to reset (and release the reset from) the control register. These operations are done with the UNIX system call ioctl() with the correct parameters [32]. A Programmable Array Logic PAL22V10 in the DMA controller card provides us with three address bits (pa[4:2]) and pre-decoded active-low read/write signals (d-rdf- and d-wrf-) that are asserted when either DMA or port I/O is performed to the DMA controller card. The SPARC's 25-megahertz clock clk is also available. Other signals (that are useful to us) are sent directly from the DMA controller and through re-powering buffers: d-ack, d-wr-, d-rd-, and d-cs-. A signal generated in X0, d-req (for DMA request), is used by the DMA controller to continue/stop the transfer of bytes. The function and timing of the signals will be explained in the following sections. It is important to notice that the eight-bit data bus used by the DMA controller to provide transfers among the SPARC and the ACME board, together with the control and address signals generated by the DMA controller, go to X0. No signal connects the DMA controller directly to the ACME board. Figure 3.7 shows some of the signals used to interface the controller board and the DMA controller card.


Figure 3.7: Signals used among the controller board and the DMA controller card.


3.3.1 Direct Memory Access

A direct memory access can only be initiated by the software resident in the SPARC station, because we are using the D channel of the DMA controller card. The statement

read (DevicePointer, SunBfr, Bytes)

initiates the DMA transfer. DevicePointer is the file pointer that identifies the DMA controller card, SunBfr is a pointer to an array of elements (which, because we are using the "D" channel, need to be eight bits wide), and Bytes is the number of bytes to transfer. Once the DMA is finished, the program continues executing. A similar call to the write() function allows the SPARC computer to write as many as Bytes elements from the SunBfr array. The signal d-req is asserted by X0 to request a new byte, and d-ack- is generated by the DMA controller card to announce/request each byte to/from X0 during the DMA transfer. Direct memory access is used in two different instances: a) configuration of the XC3195 and XC4010 FPGAs, and b) access to the global and private memories. In the first case the data are written directly into the FPGAs, and X0 does no address generation to store the data bytes. Data simply flow into the Xilinx FPGAs that reside in the ACME board. The data bytes are clocked into the FPGAs by a signal decoded by X0 from the DMA controller card. The Xilinx FPGAs provide their own address generation (i.e., the FPGAs know where each bit goes). The second case is different, though, because X0 has to control where the data is read from or written to while DMA is performed to read or write the memory chips. X0 is therefore equipped with an address counter that increments by one every time X0 is signaled for a byte transfer from the DMA controller card. The counter is 12 bits wide and allows X0 to access all 4096 locations on ACME's 4K-byte memory chips.
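As a small illustration of the second case, the fragment below is a behavioral model of that 12-bit address counter. It is ours and for exposition only; the real counter is logic inside X0, not C code.

#include <stdio.h>

/* Behavioral model of X0's DMA address counter: 12 bits wide, so it
 * addresses the 4096 locations of one 4K x 8-bit memory chip and wraps. */
static unsigned int dma_addr = 0;

static unsigned int next_dma_address(void) {
    unsigned int current = dma_addr;
    dma_addr = (dma_addr + 1) & 0x0FFF;   /* advance once per transferred byte */
    return current;
}

int main(void) {
    dma_addr = 4094;                      /* start near the end to show the wrap */
    for (int i = 0; i < 4; i++)
        printf("byte %d -> address %u\n", i, next_dma_address());
    return 0;
}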

3.3.2 Port Input/Output Transfers Port I/O transfers are used to initialize the three main control registers in X0, load X0's address and epoch counters, select the proper memory chip, and setup the correct path for

83 the con guration data to con gure the Xilinx parts in the ACME board. These issues are described in Section 3.4. The statement ioctl (DevicePointer, Address, DataByte)

produces the transfer of an 8-bit wide DataByte among the SPARC and X0. Address tells the operating system if this is to be a write or read to/from X0 and the address that the DMA controller card will generate. The DMA controller card supplies three address lines that can be decoded in X0 to generate eight di erent ports. A signal from the DMA controller card (d cs-) announces/requests the byte after the proper address has been setup and allows X0 to distinguish the port I/O from DMA transfers.

3.3.3 DMA Controller Card and Alterations We perform minor alterations to the original DMA controller card. We extract the PAL that comes with the card and replace it with a 22V10 that we burn with slightly di erent functions. We include the CUPL description of the 22V10 in Appendix C. Several bu ers are added to provide strength to signals, and series terminator resistors are used to compensate for re ections on the ribbon cable that we use to connect the DMA controller card to the controller board. The series resistors are in the 150 Ohm range. We add a switch to control the transfer rate of the card (fast or slow), and pull-up the data bus with 4.7 K Ohm resistors. The data bus coming out of the DMA controller chip oated at dangerous levels when tri-stated. We decided on the direction of the transceiver that bu ers the data bus to be from the DMA controller card towards the controller board when the data bus is not in use. The data bus oating at borderline levels made the transceiver output's swing very fast from one logic level to the other, and this introduced large amounts of noise in the ribbon cable. By pulling-up the bus we eliminated this source of noise.

3.4 X0: the Controller X0 is implemented in a Xilinx XC3190-4 PG175 eld-programmable gate array. This particular FPGA has 320 con gurable-logic blocks and 144 user input-output pins. There

84 are two global clock bu ers: Aclk and Gclk that are used to internally broadcast signals that have a large fan-out, like clocks. X0 resides in the controller board and controls the transfer of data for con guring ACME's board, the access to private and global memories, and the neural network basic functions. The next section describes the controller board. Later sections describe the functions of the controller chip X0. We place X0 in a separate board from the ACME board to save space in the ACME board and to allow for the concurrent development of both boards. There are two basic modes of transfer among a program running on the SPARC computer and X0: direct memory access and single-byte transfers. X0 distinguishes among the two di erent modes by using a decoder. DMA transfers are used to con gure the FPGAs in the ACME board, and to read and write the global and private memories. Single-byte transfers are used to initialize the three main control registers, download weights for hidden and output neurons, and set-up a selector to allow either the download of weights or the con guration of the XC4010s of the ACME board with a hidden neuron, an output neuron, or an empty bit-stream. Below we explain how X0 distinguishes DMA from port I/O transfers. We tabulate the signal names in Table 3.2, and present two timing diagrams from the \LSI Logic" manual [33] in Figure 3.8: one for DMA, and another for port I/O. Name

clk d rdd wrd ackd cspa[2:4] d req d d[7:0]

Direction DMA controller-to-X0 " " " " " X0-to-DMA controller bi-directional

Description SPARC's clock. Frequency 25mhz data from X0 to SPARC should be ready data from SPARC to X0 is ready signals one byte of dma transfer signals a port I/O transfer address used in port I/O when low, DMA controller waits data bus

Table 3.2: Name of signals available to interface to the DMA controller card. The DMA controller card uses ve main signals to initiate transfers of data to and from X0, three address bits to decode a particular register or \port" when performing a single-byte transfer, and the eight-bit data bus to transfer the actual data. One particular signal: d req can be controlled by X0 to prevent the next byte transfer from/to the DMA

85 controller card while DMA is in progress. X0 need only assert the signal to let the transfer continue. Figure 3.8 shows the two transfer modes: port I/O and DMA. Figure 3.8 shows only the d rd- signal; the same timing information applies to writes. The timing charts presented assume that the DMA controller card is operating in \slow" mode. On \fast" mode, the width of the d rd- (d wr-) pulse is ve cycles. d_cs-

clk d_ackd_req

d_rdd_ackd_d[7:0]

d_rdd_csd_d[7:0]

clk pa[4:2]

(a)

(b)

Figure 3.8: Timing of \reads" generated by DMA controller card. (a) Port I/O, (b) direct memory access. Both (a) and (b) according to slow transfer mode. Subtract two cycles if using \fast" transfer mode. Charts show \reads"; \writes" have similar waveforms. The frequency of the clock (clk) is 25 megahertz, its period 40 nanoseconds.

3.4.1 Controller Board The controller board is implemented in a sea-of-holes board that has neither a ground plane nor a power plane. We are forced to wire-wrap all power and ground wires. We connect the power and ground busses in the form of a star to reduce the path resistances. Figure 3.9 shows a picture of the controller board. The controller board hosts X0 and its PROM, a connector for the xchecker cable, and several transceivers and bu ers. Two 50-pin sockets accept the ribbon cables that connect the controller board to the ACME board. One 37-pin socket accepts the ribbon cable that connects the controller board to the DMA controller card inside the SPARC station. As we already discussed, we add the bu ers for re-powering and cleaning of the signals that travel via ribbon cables. Moreover, some of those signals are terminated at the source

86

Figure 3.9: Controller board. From left to right, top: connector to SPARC station and transceivers, connector to power, X0 controller FPGA. bottom: two 50-pin connectors to ACME board and transceivers, bank of LEDs. Right: connector for the xchecker cable and proms for storing permanent con guration for the X0 controller. with series resistors. We use various ratings of resistor values, ranging from 47 to 100 Ohm, for signals going from the controller board and into the ACME board.

3.4.2 Internal View of the X0 Controller and Description of Signals In this section we describe the internals of the X0 controller. Figure 3.10 shows a block diagram of X0. Some of the inputs/outputs to X0 in Figure 3.10 are named according to existing signal names, other have descriptive names that are not to be found in the lower-level schematics that we include in Appendix C. We start the description of X0 with the module labeled decoder in Figure 3.10. The decoder generates the di erent port addresses required for writing or reading data from the requested registers.

87 port address

port generator pa[4:2] control

ipad

read/write signals

ipad

decoder

memory interface

X0

address generator

read/write

opad

chip selector network interface

opad

net control register

opad net-status

epoch counter

main control register

configuration interface

configuration register

ipad

XC4010 bus

XC4010 configuration

ma[11:0] mcsel

net control signals busy-

XC4010 configuration/weights bpad

weights

d[7:0]

bpad

read-back multiplexer

read/write data in data out

weights

opad

memory data bus data in bus read/write

bpad

XC3195 configuration memodata[7:0]

Figure 3.10: High-level view of X0's architecture.

3.4.3 Port Decoding The address decoding is done using a TTL 74-138-like macro cell from the Xilinx library. We use the decoded port signals to clock a register to implement the I/O port write transfer. On studying the Xilinx library's macro of a 74-138 (included in Appendix C) one can see that it is prone to produce glitches. We carefully assign inputs and controls to the decoder to prevent glitches. Two signals generated by the DMA controller card, d ack- and d cs-, are the control inputs. The three addresses pa[4:2] are the data inputs. Three of the eight outputs of the decoder are gated with d rd- to produce three di erent \port reads." All eight outputs of the decoder are gated with d wr- to produce eight di erent \port writes". Figure 3.11(a) shows the decoder and Figure 3.11(b) shows a one-bit register. inputs

control

pa[2] pa[3] pa[4]

p0 3-to-8 decoder

... p7

data[0] p0 d_wr-

D Q

port0wr

d_ackd_cs-

(a)

(b)

Figure 3.11: Implementation of (a) port I/O decoder and (b) a one-bit register. DMA signals are also generated with no glitches. Refer to Appendix C to see how we

88 generate these signals. Table 3.3 shows a list of port names and a brief description of their use. Name Function port0wr- write into the main control register. port1wr- load the lower 8 bits of the memory address counter. port2wr- load the upper 4 bits of the memory address counter. port3wr- load the encoded value to select the right memory chip. port4wr- in combination with signal wthigh/low- of the net control register loads the weights for the neurons. port5wr- in combination with signal prepare of the net control register loads the epoch counter. port6wr- set the selector for con guring the XC4010 FPGAs with the proper bit-stream. port7wr- loads the net control register. port0rd- read back the status word. port1rd- read back the weights of neurons one through eight. port2rd- read back the weights of neurons nine through fourteen. dma wrwrite the con guration for the XC4010 FPGAs if memtran is not asserted, otherwise write to the selected memory chip. dma rdwhen signal memtran is asserted, reads back the contents of the selected memory. Table 3.3: Name and description of each of the port addresses and dma signals present in X0. Having described the decoder, we now focus on the main control register module shown in Figure 3.10. The register is implemented (as the rest of the registers in X0) with D-type

ip- ops, and holds several signals that control basic functions. Table 3.4 shows the signal names and a brief description of each signal's function. Signal Name Bit Description resetd0 resets the XC4010 FPGAs programd1 according to fx/rx- below strobes the \program-" pad of the XC4010 or XC3195 FPGAs dma configd2 di erentiates between memory DMA transfers and con guration DMA transfers memtran d3 the DMA is a memory transfer fx/rxd4 con guring XC4010 or XC3195 FPGAs Table 3.4: Signals in the main control register.

3.4.4 Con guring the Neurons and Routing Network Given the bit-streams for the hidden and output neurons, fdp takes care of con guring the proper XC4010 FPGAs. The command

89 fdp hiddens outputs hidden.mcs output.mcs empty.mcs

accomplishes the con guration of the XC4010 FPGAs. The number of hidden (hiddens) and output (outputs) neurons, together with the hidden, output, and an empty mcs3 les are passed as a parameter to this program. The number of hidden and output neurons are encoded into a word that is written to the con guration register in X0. The contents of the register then act as selectors for enabling the proper tri-state drivers of the con guration multiplexer to replicate in X0 the proper bit-stream. The con guration multiplexer is required because the same wires used to con gure the XC4010 FPGAs are also used to transfer the weights for each neuron, and we need to select among the weights and the con guration data. The con guration multiplexer and con guration register are part of the con guration interface module shown in Figure 3.10. After the con guration register is initialized, fdp opens the hidden, output, and empty les and proceeds to pack three-bit wide words, interleaving each one of the bits of hidden, output, and empty les. As each 3-bit wide word is nished, it gets written in consecutive locations in an array. Each location in the array is 8-bits deep so the upper order ve bits of each word are disregarded. After the three les have been completely read, fdp starts a DMA call to X0. The array is sent byte-by-byte to the XC4010 FPGAs. The FPGAs are con gured in \slave-serial" mode [6], and we use the \DMA write" signals to clock the data into the FPGAs. The SPARC station sends three bits to X0, and X0 uses the multiplexer built of tri-state busses to replicate the proper bits. Because all hidden and output neurons are the same, the same bit can be used to con gure di erent FPGAs. The hidden neurons grow from FPGA F1 towards FPGA F13, and the output neurons grow from FPGA F14 towards FPGA F3. We call F14 output neuron 1, and F1 hidden neuron 1. There may be unused FPGAs between F2 and F13 according to the system that is being downloaded. Those FPGAs are also con gured, but with the \empty" bit-stream. By default, F1 and F2 are con gured as hidden neurons, and F14 as an output neuron. For example, in a system with ve hidden and two output neurons, FPGAs F1, F2, and F3 hold hidden neurons 1, 3

See Chapter 1 for an explanation of mcs.

90 2, and 3 respectively, while FPGAs F14 and F13 hold output neurons 1 and 2 respectively. FPGAs F4 through F12 are con gured with an empty bit stream. Figure 3.12 shows the implementation of the multiplexer. X0 selector

14

f11-hid en-hid[0:3] en-out[0:3]

f11-hid f11-out

dma-config f11-hid f11-out

f11-empty

dma-config f10-hid f10-out

f10-empty

f10-hid

d2 dma-config weight11

XC4010 configuration/weights

d0

d0 f11-out d1 f11-empty

bpad

f10-out 1

1

d1 f10-empty d2 dma-config weight10

Figure 3.12: Multiplexer for selecting among the di erent types of con guration streams and a weight value to be written to the XC4010 FPGAs. Only two inputs to the multiplexer are shown: one for FPGA F11 and another for FPGA F10. There are twelve more inputs to the multiplexer. A similar program, rdp, is used to con gure the routing FPGAs. Six les (one for each of the routing chips R0, R1, R2, R3, R4, and R5) are read by rdp and processed in the same manner as fdp, that is, each bit is interleaved to pack a new 6-bit wide word. Again, a decoding of the \DMA write" signal is used as a clock to strobe the con guration data into the FPGAs. These are XC3195-4 FPGAs, and are also con gured in \slave-serial" mode. There is no multiplexer required in X0 for con guring the XC3195s because there is no further data transfer among X0 and the routing FPGAs required (i.e., the con guration data wires are single-purpose wires). There is a large overhead in con guring the XC4010 and XC3195 FPGAs because of the le manipulation steps (see Section 3.5 for an empirical measure of the overhead). Once the les have been generated, the DMA transfers are quite fast. We can see two ways to reduce the con guration time:

91 1) re-write the speci c parts of the software that are in charge of manipulating the les, or 2) create a pre-processed le for each system such that the bit interleaving does not need to be done \on-the- y." A third solution can be implemented by introducing a \cache" in X0 to pipe-line the con guration data streams. We discard this idea because the ACME board does not need to be recon gured \on-the- y."

3.4.5 Controlling the Neurons: Epoch Counter and Control Signals Once a system is in place, there are six signals needed to control the ACME board. The net control register (see Figure 3.10) controls most of them. The basic actions to operate a system are summarized below. 1) load private memories with the function tables, 2) load target outputs and input patterns into the global memories, 3) reset the neurons, 4) load weights into the neurons, 5) unload weights from the neurons, 6) start the neurons, 7) stop the neurons, and 8) read-back the real outputs from the global output memory. Program net controls the loading and unloading of the weights, and the starting and stopping of the neurons. We describe actions 1, 2, and 8 in the next section. The reset (action 3) is controlled by the main control register (Section 3.4.2). The net control register holds the signals required to perform actions 4 through 7. Table 3.5 lists and brie y describes the signals found in the net control register. Loading and unloading weights (actions 4 and 5) is performed by rst loading the con guration register with a code-word that enables the con guration multiplexer to direct the weights to the neurons held in XC4010 FPGAs. An option in program net allows

92 Signal Name

work shiftwthigh/lowwtwr/rdld high/lowprepare preload

Bit d0 d1 d2 d3 d4 d5 d6

Description start the neurons shift one bit either in or out of FPGA F1 through FPGA F14 enable one of the two weight registers to be written. direction of weight transfer (according to X0- XC4010 FPGA) together with prepare load the registers that hold the starting count for the epoch counter together with ld high/low- enable the registers that hold the starting count for the epoch counter while active, preload the epoch counter

Table 3.5: Signals in the net control register. weights to be written to the XC4010 FPGAs. The program rst sets the direction wtwr/rdto write. The rst bit of the weights corresponding to FPGAs F1 through F8 are loaded into a register in X0, and the rst bit corresponding to weights in FPGAs F9 through F14 are loaded into another register. Once the registers are loaded, a strobe on shift- signals the neurons to shift in one bit of the weights. The weight read-back is performed in a similar manner, and is another option of program net. The weights for both hidden and output neurons are shifted in parallel, thus a preprocessing step is performed to arrange the weights in their proper position. The total number of transfers performed for hidden weights is the precision of the hidden weights times the number of hidden weights (since the transfer is done bit-by-bit). The same is true for the output neurons. The total number of transfers (since the transfers are done in parallel) is the maximum number of bits resulting from either the hidden or the output neuron weights. The smallest of the two streams from both hidden or output neurons is axed with zeros so that the correct values are shifted in. ACME's epoch counter determines the number of epoch runs. The epoch counter is a 12-bit pre-loadable down-counter, and is initialized by rst writing the 12 bits that set the desired number of epochs into two registers, and then asserting the preload signal. The neurons are started by the signal work being activated in the controller. A small state machine asserts the signal go- that is broadcast to all the neurons two cycles later. X0 de-asserts the signal work immediately after it was asserted. Signal busy- is generated by the neuron in FPGA F1 and signals the start/end of a processing cycle. The epoch counter

93 decrements by one every time the neuron in F1 produces a falling edge on the busy- signal. Once the epoch counter under ows, the state machine de-asserts go-. ACME stops as soon as it samples both the go- and work signals being de-asserted.

3.4.6 Reading From X0: Read-Back Multiplexer The SPARC station reads several signals from the ACME board and the controller X0. The read-back multiplexer shown in Figure 3.10 is in charge of directing the information to the SPARC station. According to the correct setting of the selector the multiplexer can pass memory data, one weight bit belonging to the neurons in F1 through F8, one weight bit belonging to the neurons in F9 through F14, or miscellaneous control signals. This multiplexer is implemented with gates rather than with tri-state busses (as the con guration multiplexer) because it is simpler. Table 3.6 shows the proper settings to enable the correct data to be transferred to the DMA controller card. Among the miscellaneous signals that the DMA controller card reads is a ground signal. This is used by the SPARC station to identify if the X0 controller board is turned o (the data bus is pulled-up in the DMA controller card). Another of the miscellaneous signals is the busy- signal generated by the net. memtran port0rd- port1rd- port2rdpass h l h h memory data l l h h miscellaneous signals l h l h lower 8 bits of weights l h h l higher 8 bits of weights Table 3.6: Settings to enable correct data to be read from the SPARC.

3.4.7 Memory Interface The main function of the memory interface module shown in Figure 3.10 is to interface the SPARC station to the memory chips present in the ACME board. Program access memo controls the memory access. All of the memory chips are dual-ported, thus they can be read or written asynchronously by both, the neurons or the SPARC station. There is no hardware

94 to arbitrate the access. Arbitration is performed by software. Program access memo polls the signals busy- and go- to see if the network is active. If the network is active, the program simply exits4 . There are two steps involved in setting-up X0 to perform a memory access: 1) preload the memory address counter and 2) select the correct memory chip. The address counter is a 12-bit preloadable counter. Address port1wr- is used to load the lower eight bits, while port2wr- is used to load the upper four bits. The counter is incremented once upon each memory access (read or write.) A memory chip is selected by loading a 5-bit wide register with the proper value such that one of the outputs of the three 74-138-like decoders (used to generate the proper 21 signals) is asserted. A secondary set of signals is also generated by this decoding. These signals enable a bu er in the bu er-tree to select the proper memory chip. Table 3.7 shows what memory chips are enabled according to the setting of the decoders. Figure 3.6 in Section 3.2 shows the implementation of the bu er tree.

The correct way to implement the handshaking, without polling, is to use an interrupt. The SPARC should set a bit in one of X0's registers, and X0 should interrupt the SPARC when the go- and work signals are not asserted. We did not want to deal with interrupts for this round. 4

95 Code (Decimal) 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

Signal Name

csel0csel1csel2csel3csel4csel5csel6csel7csel8csel9csel10csel11csel12csel13csel14csel15csel16csel17csel18csel19csel20-

Memory Chip Signal Name Selected of Bu er gmem1 gmem2 gmem3 gmem4 gmem5 gmem6 gmem7 pmem1 pmem2 pmem3 pmem4 pmem5 pmem6 pmem7 pmem8 pmem9 pmem10 pmem11 pmem12 pmem13 pmem14

gsel0-

" " " " gsel1" psel0" " " psel1" " " psel2" " " psel3"

Table 3.7: Code required to enable correct memory chip and memory bank.

3.5 Status and Performance This section presents the status of our project, and the performance of two classi ers that we run in ACME. The section also presents calculations and empirical observations of the raw speed of two operations performed in the ACME board: access to the private and global memories, and con guration of the Xilinx FPGAs.

3.5.1 Status At the time of writing this thesis, there are ve XC4010 FPGAs (F1, F2, F3, F13, and F14) and their private memories installed in ACME. All of the routing FPGAs (R1 through R5) as well as R0, the FPGA used to broadcast address and control to the global memory chips, are in place. The seven global memory chips are also in place and can be read or written by the host computer. See Figure 3.3 for a picture of the ACME board. Only those connections needed for accomplishing the two designs that we run in ACME are in place among the XC4010 FPGAs and the routing FPGAs; the same is true for the global memory connections towards the routing network. All connections are in place among the

96 ve Xilinx XC4010 FPGAs and their respective private memories. The controller board is completely functional (see Figure 3.9). Program wirer has to be completed. The generation of R1 through R5 has not been automatized yet. The generation of R0 is done automatically by program configR0, a subset of program wirer (see Section 4.2).

3.5.2 Neural Network Performance The X0 controller board and X0 controller operate at 25 megahertz. The ACME board operates reliably at 10 megahertz while exercising a particular neural network. We exercise two designs: a simple two hidden node, one output node, and two input node xor circuit (XOR), and a three hidden node, two output node, and ten input node character recognition circuit (OCR). Aaron Ferrucci's Master Thesis [2] explains the performance of ACME and compares our prototype with other existing and \theoretical" architectures. Performance is measured in MCUPS (Millions of Connection Updates Per Second). According to the de nition in [2] our examples run at XOR: 1.5 MCUPS (@10 megahertz)

OCR: 5.66 MCUPS (@8 megahertz) Maximum attainable: 54.3 MCUPS (this is a theoretical system with 13 hidden, 1 output, 23 inputs and threshold, 60 cycles per iteration and a 10 megahertz clock) The eciency of ACME is measured as MCUPS per weight, so according to this measure and using the maximum attainable system above, ACME shows an eciency of 0.17. To give a perspective to these numbers, ACME is compared with several architectures in [2] . Of these architectures, the best performance is attained by CNAPS with 2300 MCUPS for a system with 1900 inputs, 500 hidden nodes and 12 output nodes, and the worst performance by CM-2, with 40 MCUPS for a system with 256 inputs, 128 hidden nodes, and 256 output nodes. The architecture closest to ACME in eciency is that of GF11 with a measure of 0.090. The worst ecient is SPERT with 0.000012.

97

3.5.3 Memory Access and Con guration Performance The global and private memories of the ACME board are memory mapped to the SPARC computer's address space. They can be accessed via DMA transfers. The XC3195 and the XC4010 FPGAs can be considered to be memory locations programmed from the SPARC computer via DMA transfers. The X0 controller has built-in I/O ports that are mapped to the SPARC address space and that control the transfer of data. We measure the speed of DMA transfers from the SPARC station and into the ACME board. We use signal d ack- to measure the frequency of operation of the DMA controller that resides in the SPARC station. Figure 3.13 shows the timing of signal d ack- as observed with an oscilloscope. In fast mode we read 1.57 micro seconds per four byte transfers, or 2,547,770 bytes/sec. In slow mode we read 2.02 micro seconds per four byte transfers, or 1,980,198 bytes/sec. d_ackfast mode slow mode

730 ns.

800 ns.

1.37 us.

650 ns.

Figure 3.13: Speed of the DMA controller during DMA transfers. d ack- signals the transfer of a byte from the SPARC station to the X0 controller board. Fast mode: 1.57 micro seconds per four bytes, or 2.5 Mb./sec. Slow mode: 2.02 micro seconds per four bytes, or 1.9 Mb./sec. Table 3.8 shows the time taken for the con guration of the Xilinx parts and to access any one of the memory chips. We tabulate three di erent types of measurements: 1) oscilloscope: duration of the DMA transfer itself, with no operating system overhead. 2) calculated: according to the frequency of DMA byte transfers as shown in Figure 3.13 and the number of transfers shown in Table 3.8. 3) stopwatch: including operating system overhead. Time taken from pressing the carriage return in the keyboard the return of the prompt on the screen of the SPARC station.

98 Action

Device

Transfers Measurement DMA Controller Speed (msec.) Fast Slow 178,140 oscilloscope 96.0 125.0 con guration XC4010 three-bit calculated 70.0 90.0 words stopwatch 7.0 sec. 7.0 sec. 94,996 oscilloscope 51 65 con guration XC3195 six-bit calculated 37.3 47.9 words stopwatch 7.0 sec. 7.0 sec. 4096 oscilloscope 1.5 2.0 read/write memory eight-bit calculated 1.6 2.1 words stopwatch 1.0 sec. 1.0 sec. Table 3.8: Time taken to perform the con guration of the XC4010s, XC3195s, and to access any memory chip in the ACME board. Column labeled \Transfers" in Table 3.8 shows the total number of bytes transferred from the DMA controller card in the SPARC computer, through the X0 controller board and into the ACME board. Each transfer from the DMA controller card is eight-bits wide. Only three bits are used in each of the XC4010 transfers, and only six bits are used in each of the XC3195 transfers. All eight bits are used in each of the memory transfers.

99

4. Supporting Software 4.1 Introduction Several steps have to be taken to run a new neural network con guration in ACME. Among the steps are the generation of two les, one for a hidden and another for an output neuron, and the generation of six les to con gure the six routing chips. Recall from Chapter 3 that the processing elements of the ACME board behave as those of a SIMD machine, therefore only one copy of the con guration le for each hidden and one for each output neuron is needed. After the con guration les are properly generated, a variety of programs are used to con gure the hardware and exercise the net. We divide the software in two distinct groups: 1) system generation, and 2) utility. The following section describes the steps for generating a system. Section 4.3 lists the typical steps and software to initialize and run ACME. Program wirer remains to be developed at the time of this writing.

4.2 System Generation Both the neurons and the routing chips are implemented in Xilinx FPGAs, and we use the vendor's tools to realize their designs. The place and route program PPR places and routes the neurons. PPR reads a constraint le to assign the neuron's external signals to the correct IOBs. The placement and routing program APR routes the routing chips R0 through R6. The output of both the PPR and APR programs are routed LCA les. The XACT editor allows us to accomplish the placement of the I/O pads in the routing chips by hand. The ow for con guring a neural system in ACME is shown in Figure 4.1. The rst step in generating a new system is to design a hidden and an output neuron as in [2]. The results are two unrouted XNF les (hidden.xnf and output.xnf in Figure 4.1). Two programs, color and wirer, aid in the mapping of the neurons to the ACME hardware. The color

100 program reads the number of hidden, output, and input nodes in the new system and generates an ASCII constraint le for a hidden neuron and another for an output neuron. The les are called hidden.cst and output.cst respectively in Figure 4.1. Program color generates a third ASCII text le, configuration.txt in Figure 4.1, that aids in the creation of the ve routing chips R1 through R5. Program color outputs no information related to the rst routing chip, R0, because R0 is simply used to transfer control and address signals from the hidden neuron in F1 and the output neuron in F14 to the global memory chips. #inputs

#hiddens

#outputs

color hidden.xnf

hidden.cst

configuration.txt

output.cst

PPR

PPR

hidden.mcs hidden.rpt acme.wires

output.xnf

output.rpt

output.mcs

wirer

R0.lca R1.lca R2.lca R3.lca R4.lca R5.lca apr -l

apr -l apr -l

apr -l

apr -l

apr -l

r0.mcs r1.mcs r2.mcs r3.mcs r4.mcs r5.mcs

Figure 4.1: Flow of system con guration. In the hidden and output constraint les (shown as hidden.cst and output.cst in Figure 4.1), color speci es the input/output pin assignment of the external signals of hidden and output neurons so that they conform to the board level constraints. Two place and route (PPR) runs are performed using these constraint les: one run for the hidden and another for the output neurons (shown as hidden.xnf and output.xnf respectively in Figure 4.1). After the hidden and output neurons have been placed and routed, the second step is

101 to generate the six LCA les that describe the routing chips. This step is not completely automated, and should be performed by a program or a set of programs under the name of wirer. Program wirer should read the board constraints in acme.wires, the report les hidden.rpt and output.rpt, and the configuration.txt le as shown in Figure 4.1 to generate the six routing chips R0 through R5. Program configR0 exists and is part of wirer. configR0 is used to automatically con gure the routing chip R0. At the time of writing this thesis, we create the les for routing chips R1 through R5 using the XACT editor. The les for the routing chips have the I/O pad assignments only, no logic is required. Six APR -l runs (the -l option locks the input/output blocks) are performed, one for each le. Below we summarize the design ow for con guring a neural net system to prototype using ACME. a) produce two unrouted XNF les, one describing a hidden and another describing an output neuron as explained in [2], b) before routing the XNF les with PPR, create constraint les hidden.cst and output.cst using color, c) route the hidden and output neuron LCA les using the constraint les as a guide, d) after the neurons are routed, read the pins assigned to the external signals of the neurons by PPR from the report les hidden.rpt and output.rpt, e) produce six LCA les, one for each of R0 through R5 according to the report les and the physical wires joining the neurons to the routing chips, and f) route the six LCA les generated in point (e). neurons. Usage of color and configR0 is found in Appendix B.

4.3 Utility Software The following are typical steps to initialize and run ACME: a) load private memories with the function tables,

102 b) load target outputs and input patterns into the global input memory, c) reset the neurons, d) initialize the weights of the neurons, e) start the neurons, f) stop the neurons, and g) read-back the real outputs from the global output memory. These steps have all been described in Section 3.4. Program access memo allows access to the private and global memories in the ACME board. Program net allows for the resetting of the neurons, the initialization of weights, and the starting and stopping of the neurons. Usage of access memo and net is found Appendix B.

103

Appendix A. Synchronization A.1 Synchronization Issues Synchronization becomes an issue when two parties that need to communicate among each other operate at di erent frequencies. Humans can use their senses to notice when things start to go \out-of-sync" and correct the situation by adapting their behavior. Our ACME hardware has to do otherwise. In ACME there are three instances when synchronization is important: 1) when pre-loading the epoch and address counters in the X0 controller (see Section 3.4), 2) when transferring information and control among the ACME board and the X0 controller board, and 3) when starting the neurons. Instance (1) is important because of the way the counters are designed. The address counter in the X0 controller is preloaded from the SPARC station. The starting address is loaded into the ip- ops of the counter with port I/O transfers. The counter increments by one upon each read or write issued by the SPARC station to the global memory. We avoid synchronization problems when preloading the counter by clocking the ip- ops of the counter with a combination of the three signals: preload, DMA read, and DMA write. The epoch counter is initialized by the SPARC station with the number of epochs for the neurons to run. The epoch counter is clocked with the SPARC's clock and decrements by one on every falling edge of the busy- signal. The busy- signal is produced by the neuron in FPGA F1. To avoid gating the clock [7] we device a simple method to synchronize the preloading of the number of epochs. We add a set of registers whose outputs are connected to the inputs of the counter's ip- ops. The SPARC loads the number of epochs in the registers with port I/O transfers and signals the counter to preload by asserting the preload signal. Immediately after the signal preload is asserted, the SPARC issues another port I/O transfer to de-assert the preload signal. Right after signal preload is

104 asserted and before it is de-asserted the counter loads the number of epochs on every rising edge of the SPARC's clock. Instance (2) is a problem because the ACME and the controller boards operate at di erent frequencies. The busy- signal generated by the neuron in FPGA F1 is fed into a d-type ip- op upon its arrival to the controller X0. The SPARC's 25 megahertz clock is used to clock the ip- op. We also provide ip- ops in the hidden and output neurons to synchronize the signals coming from X0 and into the neurons. The ACME board's clock is used to clock the ip- ops. We decide not to use DMA transfers for loading and unloading the weights from the neurons, and avoid the problem of synchronization because we provide enough time for signals to settle. Finally, the go- signal is broadcast from the X0 controller to all the neurons. All the neurons should start working in unison when go- is asserted. Even though each of the neurons in the ACME board has go- synchronized to the ACME board's clock, the neurons may not be synchronized among themselves. While testing the ACME board we noticed that sometimes one of the hidden or output neurons lagged one cycle behind the others, with no preference for a particular neuron. We solved the problem by using an external synchronizer (a 74LS74 TTL ip- op) that is clocked with the ACME board's clock to synchronize the go- signal. The new syncgo- signal is broadcast to all the neurons. Notice that we can not use X0 to produce syncgo- because the main board's clock would have to be routed to X0, and the delays involved are too great. Figure A.1 shows an oscilloscope print-out of the busy- signal generated by the three neurons of an example we run before using the syncgo- signal. The gure proves that neurons may start operating at di erent times. The clock of the ACME board (10 megahertz for this particular example) is shown as a reference.

A.2 Lessons Learnt There are many things we learnt during the design and implementation of ACME. Three of the most instructive \bugs" that we solved are:

105

Figure A.1: Synchronization of functional units fails because go- is sampled at slightly di erent times. busy- signals generated by the three functional units in our system: First and second traces starting from the top (labeled 4 and 2 respectively) show hidden 1 and hidden 2's busy- signals. The third trace (labeled 1) shows busy- belonging to the output node being delayed by one cycle according to the fourth trace (main board's clock, labeled 3). 1) the power in the controller board is noisy, 2) many \tri-stated" outputs oat at the TTL threshold level, and 3) FPGAs can be used in such a manner that internal signals ght. As we explained in Chapter 3, the controller board does not have power and ground planes. The fact that we use wires to provide power and ground to all the components increases the chances for ground bounce and noise (since the ground return is more resistive and inductive than that of a plane). We noticed large surges in the power line of the controller board. The main causes for the surges were 1) contention in the ribbon cable that connects the DMA controller card to the controller board, and 2) bug (2) above. The contention on the ribbon cable was caused by the data-bus transceivers (labeled A and B in Figure A.2) ghting on both ends of the ribbon cable. The data bus is bu ered in both the DMA controller card and the controller board. There is a delay on all the signals that travel from the DMA controller card and into the X0 controller board. We use the d rdf-

106 signal to change the direction of the transceiver A in the DMA controller card, and we were originally decoding two control signal in X0 to change the direction of the transceiver B in the controller board. There was a relatively large delay from the time when the direction for transceiver A was ready, and when X0 nally produced the direction for transceiver B. During this time, the transceivers were ghting in the ribbon cable (for about 70 ns.) This was a major source of noise in the controller board. It was most noticeable during DMA transfers. We solved the problem by using the inversion of the signal that controls transceiver \A" to also control transceiver \B." X0 is in charge of inverting the signal. This solution helped to reduce to a minimum the VCC bounce and became obvious after implementing it. sbus 2 decoder

data[7:0]

X0

AS245

AS245

rd-,cs-

AS245

AS245

rdf-

dir B AS245

controller board

ribbon cable

dir A AS245

2

pal22V10

data[7:0]

DMA

5v 4.7K

controller DMA controller card SPARC station

Figure A.2: Signals ghting in the ribbon cable. Dotted lines inside X0 show the original implementation for the direction control to the transceivers. Solid lines inside X0 show the solution. Some of the bu ered signals are tri-stated when not in use. The data bus coming from the DMA controller card, the fourteen signals used for con guring/transferring the weights to the XC4010s, and also the data bus for the memories all had to be pulled-up at the source. The reason is that all of these signals oated at dangerous levels, very close to the TTL threshold value of 1.4 volt. This is due to the large loading presented to the tri-stated signals. The data bus coming from the DMA controller card was especially dangerous, since it introduced large amounts of noise in the ribbon cable. The noise was transferred to the control signals also in the ribbon cable{that we assumed would never glitch{, and that in

107 turn introduced unwanted side-e ects in X0. TTL chips consume large amounts of power when being switched, and introduce re ections on the wires connected to their outputs because the edges are sharp. In our particular case, since the inputs of the TTL chips were oating close to the threshold value and were coupling to the noise already present in the ground plane (remember that the controller board is noisy because its power/ground distribution is prone to be noisy itself), the TTL chips switched very quickly, following its noisy input. The fast switching of the TTLs produced even more noise by draining large amounts of current for every switch. The noise got transferred into the ground system, producing an endless cycle (well, at least from an electron's perspective). We solved this problem by pulling-up the inputs to the bu ers. Especially with the memory bu ers, the noise was actually produced by a very slow rising data bus (as the memory chip tri-stated the data bus). We reduced the rising time by pulling the signals at the places marked \P" in Figure 3.6 , and much of the noise in those particular bu ers was cleaned-up. Finally, the current meter in our power supply revealed that contention can be produced inside the Xilinx FPGAs. In our case, the multiplexer implemented with busses (shown in Figure 3.12) allowed more than one driver per line when X0 was idle (in the default state). This is a major problem, because it is easy to create a path with relatively small resistance from power to ground (i.e. almost a short circuit). By simply having one of the drivers output a logic zero and another a logic one, we create such a path. We noticed this fact because X0 consumed too much power after we implemented the multiplexer. We corrected the problem by loading the proper word into the selector of the multiplexer so that no more than one driver drives a line at a time.

108

Appendix B. Software B.1 The color Program Once the hidden and output neurons are generated as in [2], the color program automatically generates a constraint le for a hidden and another for an output neuron. Program color has an intimate knowledge of the architecture of the board, and is thus capable of assigning signal names to the correct group of input/output blocks in both the hidden neuron in F1 and the output neuron in F14. The color program assumes the signal names to be the ones that appear as signal names in the report les for hidden and output neurons, and knows which signals are to be connected among hidden and output nodes. color has an internal one-dimensional array structure that holds the name of the external signals produced by hidden and output nodes, and another two-dimensional array structure that holds the exact pad names that connect the neuron in F1 to the routing chips R1 through R5. These structures are initialized at compile time and can be found in the le globals.h. Program color can produce three outputs: a text le showing all fourteen FPGAs and how the signals are assigned (option \Print assignment"), and two more text les that are valid constraint les for XC4010s and are used to route the two neurons (option \Generate constraint les"). The program can also check all possible systems implemented with the number of inputs desired if option \Generate all possible con gurations automatically" is selected. In the latter case, the check program evaluates di erent systems by creating a ctitious number of hidden and output nodes, always assuming a total of fourteen XC4010 FPGAs. After evaluating the di erent systems, color writes the results in a le called status.txt. Figure B.1 shows a run of color. In the run we specify that we want a system with ten input nodes, that we do not want to generate all possible con gurations automatically, that we wish to print the signal names as they are assigned to the di erent functional units, and nally, that our speci c system has 2 output and 3 hidden nodes.

109 pico 396> color Program to check different systems on the board configuration chosen on January 7, 1994 (footprint 3) Number of inputs: 10 Generate all possible configurations automatically? 0 Print assignment? 1 Generate constraint file? 0 Number of outputs: 2 Number of hiddens: 3 checking system with 10 input, 3 hidden, and 2 output nodes.

Figure B.1: A run of the color program. Program check automatically evaluates that the system speci ed is a valid one. The following is the output le \con guration.txt" generated by the run of color as speci ed in Figure B.1. Notice how the signal names switch routing FPGAs from functional unit to functional unit. This is the result of using the shifting pattern strategy as explained in Chapter 2 to connect the neuron holding FPGAs to the routing chips. =========================================================================== fpga: 1 hidden #1 brown red orange yellow green Switch Signal Signal Signal Signal Signal 0: DATA_IN DATA_IN DATA_IN DATA_IN DATA_IN 1: DATA_IN DATA_IN DATA_IN DATA_IN DATA_IN 2: HT_OUT HT_OUT HT_OUT HT_OUT HT_OUT 3: HO_OUT HO_OUT HO_OUT HO_OUT HO_OUT 4: DW_HID DW_HID nc nc nc 5: nc nc nc nc nc 6: nc nc nc nc nc 7: nc nc nc nc nc 8: nc nc nc nc nc 9: nc nc nc nc nc 10: nc nc nc nc nc =========================================================================== fpga: 2 hidden #2 brown red orange yellow green Switch Signal Signal Signal Signal Signal 0: DATA_IN DATA_IN DATA_IN DATA_IN DATA_IN 1: DATA_IN DATA_IN DATA_IN DATA_IN DATA_IN 2: HT_OUT HT_OUT HT_OUT HT_OUT HT_OUT 3: HO_OUT HO_OUT HO_OUT HO_OUT HO_OUT 4: nc DW_HID DW_HID nc nc 5: nc nc nc nc nc 6: nc nc nc nc nc 7: nc nc nc nc nc 8: nc nc nc nc nc 9: nc nc nc nc nc 10: nc nc nc nc nc

110 =========================================================================== fpga: 3 hidden #3 brown red orange yellow green Switch Signal Signal Signal Signal Signal 0: DATA_IN DATA_IN DATA_IN DATA_IN DATA_IN 1: DATA_IN DATA_IN DATA_IN DATA_IN DATA_IN 2: HT_OUT HT_OUT HT_OUT HT_OUT HT_OUT 3: HO_OUT HO_OUT HO_OUT HO_OUT HO_OUT 4: nc nc DW_HID DW_HID nc 5: nc nc nc nc nc 6: nc nc nc nc nc 7: nc nc nc nc nc 8: nc nc nc nc nc 9: nc nc nc nc nc 10: nc nc nc nc nc =========================================================================== fpga: 4 brown red orange yellow green Switch Signal Signal Signal Signal Signal 0: nc nc nc nc nc 1: nc nc nc nc nc 2: nc nc nc nc nc 3: nc nc nc nc nc 4: nc nc nc nc nc 5: nc nc nc nc nc 6: nc nc nc nc nc 7: nc nc nc nc nc 8: nc nc nc nc nc 9: nc nc nc nc nc 10: nc nc nc nc nc =========================================================================== FPGAs 5 through 12 look exactly like FPGA 4, so we skip them to save paper =========================================================================== fpga: 13 output #2 brown red orange yellow green Switch Signal Signal Signal Signal Signal 0: SO_OUT DATA_IN DATA_IN nc TO_IN 1: DATA_IN DW_OUT DW_OUT nc nc 2: DATA_IN nc nc nc nc 3: DW_OUT nc nc nc nc 4: nc nc nc nc nc 5: nc nc nc nc nc 6: nc nc nc nc nc 7: nc nc nc nc nc 8: nc nc nc nc nc 9: nc nc nc nc nc 10: nc nc nc nc nc =========================================================================== fpga: 14 output #1 brown red orange yellow green Switch Signal Signal Signal Signal Signal 0: TO_IN SO_OUT DATA_IN DATA_IN nc

111 1: 2: 3: 4: 5: 6: 7: 8: 9: 10:

nc nc nc nc nc nc nc nc nc nc

DATA_IN DATA_IN DW_OUT nc nc nc nc nc nc nc

DW_OUT nc nc nc nc nc nc nc nc nc

DW_OUT nc nc nc nc nc nc nc nc nc

nc nc nc nc nc nc nc nc nc nc

Program color uses two template ASCII les for creating the constraint les for the hidden and output nodes. The templates are called hidden foot.txt and output foot.txt and are extracted from footprint3.txt. In the templates, the forced external signals of hidden and output nodes are speci ed with statements like \place instance [signal name]: [pad name]," and the pads that are unused in all the XC4010 FPGAs are pre-speci ed using the statement \notplace instance: [pad name list]." These are statements that are interpreted by the place and route program PPR. The le used as a template by color to generate the constraint le for a hidden node is printed below. Note that this is simply the template. Program color appends statements to this le to allow the proper signals to be group forced to the correct set of input/output blocks. The appended information for the hidden node (as was produced by the statements shown in Figure B.1) is shown after this list. # # # # # # # #

hid_foot.txt assumes that there are FIVE routing chips: brown red orange yellow green R1 R2 R3 R4 R5 R0 is in charge to route signals for control and address to staging memory. updated on April 8, 1994

# clock input place instance CLK_IN: B17; # Locked pins for # control place instance place instance place instance # address place instance place instance

# pck2

private memory: HS_OE: HS_CE: HS_WE:

V16; T15; # D7 used to be T16 V17; # D6

HA_OUT: HA_OUT:

N18; P18;

112 place instance HA_OUT: T18; place instance HA_OUT: M17; place instance HA_OUT: P17; place instance HA_OUT: R18; place instance HA_OUT: P16; place instance HA_OUT: R17; place instance HA_OUT: T14; place instance HA_OUT: U18; place instance HA_OUT: T17; place instance HA_OUT: N17; # data place instance HO_IN: U4; place instance DER_IN: T5; place instance HT_IN: V3; # notice we are missing 4 data pins

# D0 of memo. # D1 of memo. (Used to be U3) # D2 of memo.

# PINS TO PRIVATE MEMORIES ARE OK. # These go to X0. place instance place instance place instance place instance place instance place instance

They are definitely GO_INT: P3; BUSY: U7; RESET_INT: B2; # SW_INT: B10; WT_IO: U3; # W_DIR: T3; #

locked pins. SGCK1 d0 WS-

# # # # # # # #

now come the addresses and ctl to the staging memory These should be allowed to be placed loosely, they should not be constrained pin to pin, so I list the pinout to R0 here only. Connections to R0 (only really used by F1 and F14): V7 U9 T6 U11 V8 V5 T8 V13 U14 T1 U6 R2 U5 K1 N2

# # # # #

connections connections connections connections connections

to to to to to

R1: R2: R3: R4: R5:

T2 P2 R1 P1 M2

L3 L1 K2 J3 J1

H2 G1 F1 F2 C1

F3 B1 C2 C3 B3

# disallow all unconnected pins notplace instance *: T4 V2 V4 V6 notplace instance *: T16 N16 L16 notplace instance *: G18 F18 F17 notplace instance *: B16 A17 C14 notplace instance *: B12 A12 B11 notplace instance *: E3 D2 E2 D1

A2 C6 B5 A4 B7

M18 L17 K18 K16 J17

T13 V15 U13 U12 T11

V11 H18 C8 A11 U10 H16 B8 B11 T9 G17 B9 B12 V9 E18 C10 B13 U8 D18 A9 A15

T10 V10 V12 L18 K17 J16 C18 F16 E17 B15 A16 B14 C11 A11 A10 E1 G2 H3 H1

V14; J18 H17; B18 D17 C17 C13 A15 A14 C9 A8 A7 A6 J2 K3 L2 M1

E16 C5 U16; B13 A13; A5 B6 A3 B4 C4; N1 N3 U1;

# special for hidden: place instance _GO_1: U15; place instance _GO_2: U16;

Below are the signals that color group-forced to di erent input/output blocks according

113 to the routing chips that should be used for the interconnections. This information is appended by color to the template le that is printed above. This information belongs to a hidden node. # # # #

now come the addresses and ctl to the staging memory These should be allowed to be placed loosely, yet should not be constrained to go to any of R1 through R5 do notplace on R1 through R5 (this will automatically allow R0 pins only.

The constraint files consist of notplace statements for the group-forced instances of each functional-unit FPGA (I_ADDR, DATA_IN, HT_OUT, HO_OUT, and DW_HID). The complete listings, which differ only in the instance names and pin lists, run for several pages; two representative constraints are:

notplace instance I_ADDR: T2 L3 H2 F3 A2 M18 T13 V11 H18 C8 A11 P2 L1 G1 B1 C6 L17 V15 U10 H16 B8 B11 R1 K2 F1 C2 B5 K18 U13 T9 G17 B9 B12 P1 J3 F2 C3 A4 K16 U12 V9 E18 C10 B13 M2 J1 C1 B3 B7 J17 T11 U8 D18 A9 A15;
notplace instance DATA_IN: P2 R1 P1 M2 L1 K2 J3 J1 G1 F1 F2 C1 B1 C2 C3 B3 C6 B5 A4 B7 L17 K18 K16 J17 V15 U13 U12 T11 U10 T9 V9 U8 H16 G17 E18 D18 B8 B9 C10 A9 B11 B12 B13 A15 V7 U9 T6 U11 V8 V5 T8 V13 U14 T1 U6 R2 U5 K1 N2;

Note that the exact pins for the pad assignments of the group-forced external signals of the hidden and output nodes are not known after running color, because the hidden and output nodes have not yet been routed. After the hidden and output nodes are routed according to the constraint files, the routing chips' LCA descriptions are created. A program, wirer, should be written to accomplish this task. The software required for wirer is not complex, because program color produces an assignment that leaves the routing FPGAs with enough resources in all cases.

B.2 The configR0 Program

Program configR0 generates the routing chip R0. R0 is only required to route control and address signals to the global memory chips, so we use two programs, parse_hid and parse_out, to identify the correct signal names from the hidden and output report files

and create two tables used by configR0 to generate R0. configR0 searches for the input/output block in R0 that is connected to the external signal's input/output block in FPGA F1 or FPGA F14. Because it uses the signal name to identify the function, the program can then assign the correct input/output block to the proper pin of the global memory chips. Figure B.2 shows the steps taken to create a valid unrouted LCA file that describes R0.

pico 479> parse_hid < hidden.rpt > hid_rpt.parsed
pico 480> parse_out < _output.rpt > out_rpt.parsed
pico 481> configR0 acme.wires hid_rpt.parsed out_rpt.parsed r0.lca r0.xnf

Figure B.2: Steps for creating the unrouted LCA and XNF files for routing chip R0.

Figure B.3 shows the last steps applied to the r0.lca file obtained as in Figure B.2 to generate an r0_r.mcs file suitable for downloading into the ACME board.

pico 482> apr -l r0.lca r0_r.lca
pico 483> makebits -t r0_r.lca
pico 484> makeprom -u 0 r0_r.bit

Figure B.3: Final steps performed to obtain a file for routing chip R0 that can be downloaded to the main board. All the steps are performed using Xilinx proprietary tools. The last statement produces the file r0_r.mcs. The steps shown in Figure B.3 are the same as those used to generate the other five mcs files (see Chapter 1 for an explanation of mcs) for the routing chips R1 through R5.
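The search that configR0 performs can be pictured as a table lookup keyed by signal name. The sketch below is a minimal illustration of that idea; the structure, table contents, and identifier names are hypothetical and are not configR0's actual code.

/* Hypothetical sketch of the signal-name lookup configR0 performs. */
#include <stdio.h>
#include <string.h>

struct wire_entry {
    const char *signal;   /* signal name taken from hid_rpt.parsed / out_rpt.parsed */
    const char *f_pin;    /* pin of the IOB on F1 or F14 that carries the signal    */
    const char *r0_pin;   /* R0 pin wired to that F1/F14 pin (from the wiring table) */
    const char *mem_pin;  /* global-memory pin this function must drive              */
};

/* illustrative excerpt of a board wiring table; not the real acme.wires data */
static const struct wire_entry wires[] = {
    { "GMEM_A0",  "P7", "K3", "A0"  },
    { "GMEM_WE-", "R9", "L4", "WE-" },
};

static const struct wire_entry *lookup(const char *signal)
{
    size_t i;
    for (i = 0; i < sizeof wires / sizeof wires[0]; i++)
        if (strcmp(wires[i].signal, signal) == 0)
            return &wires[i];
    return NULL;
}

int main(void)
{
    const struct wire_entry *w = lookup("GMEM_A0");
    if (w)
        printf("%s: enters R0 at pin %s, must drive memory pin %s\n",
               w->signal, w->r0_pin, w->mem_pin);
    return 0;
}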

B.3 Accessing Memory Chips With access_memo

Program access_memo is used to load the private and global memories on the board. The program can perform reads or writes to memory and is command-line driven, as shown below.

access_memo [a] [b] [c] [filename]

Parameter a specifies the function to perform: if a = 1, access_memo tests a particular memory chip with its own nine different test files. If a = 2, access_memo performs a read, and if a = 3, access_memo performs a write. If a read is performed, the file memo_read.txt is automatically generated. The file lists the number of each byte (in decimal) followed by the data (in hexadecimal) read from the specified memory chip. When writing to memory, access_memo expects the input file's name to be specified as the last option of the command line. Each byte of data in the file to be written is expected to be entered in ASCII hexadecimal, separated from the preceding and following data bytes by a space. To differentiate these files from others we append the extension ".md" to the file name. access_memo produces the file memo_sent.txt after performing a write to memory. File memo_sent.txt lists an address (in decimal) followed by a data byte (in hexadecimal).

Parameter b specifies the number of the memory chip to be accessed. Table 2.12 in Chapter 2 shows the codes for accessing the different memory chips.

Parameter c specifies the number of bytes to read or write to/from each memory. The number of bytes is specified in decimal and ranges from 0 to 4096. There is an option to pre-set the address at which to start writing or reading, but at this moment we have fixed access_memo to always write or read to/from a memory chip starting at address 0.

The line below shows the contents of example.md, where 25 data bytes are specified to be written to memory.

1 2 3 4 5 6 7 8 9 a b c d e f 10 11 12 13 14 15 16 17 18 19

The statement "access_memo 3 0 25 example.md" writes the file to global memory chip 0. On writing, access_memo produces the file memo_sent.txt, which looks as shown below.

[0] 1 [1] 2 [2] 3 [3] 4 [4] 5 [5] 6 [6] 7 [7] 8 [8] 9 [9] a [10] b [11] c [12] d [13] e [14] f [15] 10 [16] 11 [17] 12 [18] 13 [19] 14 [20] 15 [21] 16 [22] 17 [23] 18 [24] 19

The latter is the same format that the file memo_read.txt would have if the first 25 locations of global memory chip 0 were read.
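As a small illustration, the sketch below writes a data file in the ".md" format described above: each byte in ASCII hexadecimal, separated by spaces. The helper and file names are only illustrative; this is not part of access_memo itself.

/* Sketch of a helper that produces a ".md" input file for access_memo. */
#include <stdio.h>

int write_md(const char *path, const unsigned char *data, int nbytes)
{
    FILE *fp = fopen(path, "w");
    int i;
    if (fp == NULL)
        return -1;
    for (i = 0; i < nbytes; i++)
        fprintf(fp, "%x ", data[i]);     /* lowercase hex, space separated */
    fputc('\n', fp);
    return fclose(fp);
}

int main(void)
{
    unsigned char buf[25];
    int i;
    for (i = 0; i < 25; i++)
        buf[i] = (unsigned char)(i + 1); /* reproduces the example.md shown above */
    return write_md("example.md", buf, 25) == 0 ? 0 : 1;
}

Running "access_memo 3 0 25 example.md" on the resulting file would then write these 25 bytes to global memory chip 0, as in the example above.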

The global input memory is divided into two sections: a) input nodes, and b) target outputs. If the number of input nodes is not a multiple of the number of "routing" FPGAs (five in our case), then the extra d_in (mod 5) bits need to be assigned to the global input memory bits d20, d21, d22, and d23. Otherwise, the input nodes are mapped to bits d0 to d(num_inputs - 1) of the global memory. The global input memory chips destined to store the input nodes are gmem0, gmem1, and gmem2. The target outputs are mapped to gmem3 and gmem4. The last output node reads its target output from bit d0 of gmem3, and the rest read their bits in ascending order. For example, for a system with three output nodes, o1, o2, and o3, o3 reads its target output from d25, o2 from d26, and o1 from d27. At this point we are editing the memory data files by hand.

The global output memory consists of gmem5 and gmem6. These memory chips are written with the output generated by the output nodes. As an example, a system with three output nodes (such as the one above) will get the outputs mapped in the following manner: o3 writes its output to data bit d40, o2 to bit d41, and o1 to bit d42. Figure B.4 shows a detailed view of how the memory chips are connected to the routing FPGAs and how they are used.

Figure B.4: Organization of the global memory. (The drawing maps each data bit of gmem0 through gmem6 to one of the routing FPGAs r1-r5 and groups the data bits into Input Nodes (1 -> 24), Target Outputs (1 -> n), and Real Outputs (1 -> n).)
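To make the bit bookkeeping concrete, the sketch below decomposes a global data-bit index into a memory chip and a chip-level bit (eight data bits per gmem chip, as in Figure B.4), and computes the real-output bit of each output node from the worked example (o3 at d40, o1 at d42). The constants follow the example above; the function names are illustrative only.

/* Sketch of the global-memory bit bookkeeping described above. */
#include <stdio.h>

/* a global data bit d lives in chip d/8, bit d%8 (gmem0 holds d0-d7, ...) */
static void locate(int d, int *chip, int *bit)
{
    *chip = d / 8;
    *bit  = d % 8;
}

/* real output of node k (1..n_out): last node writes d40, the rest ascend */
static int real_output_bit(int k, int n_out)
{
    return 40 + (n_out - k);
}

int main(void)
{
    int chip, bit, k, n_out = 3;
    for (k = 1; k <= n_out; k++) {
        locate(real_output_bit(k, n_out), &chip, &bit);
        printf("o%d -> d%d (gmem%d, bit %d)\n",
               k, real_output_bit(k, n_out), chip, bit);
    }
    return 0;   /* prints o3 -> d40 (gmem5, bit 0), o1 -> d42 (gmem5, bit 2) */
}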

B.4 Program net Controls ACME's Basic Functions

The net program controls the resetting of the neurons, the starting and stopping of the neurons, the loading of the epoch counter, and the writing and reading of weights to and from the hidden and output neurons. Program net includes the file params (see [2]) for setting up the internal data structures that enable weight initialization. The net program should be re-compiled whenever a new params file is generated. The net program is command-line driven and is called in the following manner:

net [a] [b/filename]

Parameter a specifies the basic function that net performs: if a = 8, net resets the neurons; if a = 10, net writes the weights to the neurons. If a = 10, the second parameter should be a file specifying the weights, formatted as shown in Figure B.5. To read the weights from the hidden and output neurons, a should be set to 11; net then writes the weights to the CRT, so the output of the program should be redirected to a file. Finally, if a equals 1, then b should equal the number of epochs to be loaded into the epoch counter. The net program loads the epoch counter with b epochs (up to 1024).

Figure B.5 shows the format expected of a weight file to be downloaded by net into the hidden and output neurons. In this particular case the system consists of 3 hidden neurons, 10 input neurons (9 plus a threshold input), and 2 output neurons. The hidden neuron weights are 12 bits deep, and the output neuron weights are 8 bits deep. There are nine weights plus a threshold weight for each hidden neuron, and three weights plus a threshold weight for each output neuron. This file format is specified in [2].

#inputs: 9  #hidden: 3  #outputs: 2  HWtBits: 12  OWtBits: 8
t 1 h #0  wts: -1701 -1644 -2004 -1299 272 40 1370 -808 1236  thresh: -1317
t 1 h #1  wts: 1159 162 1638 603 1083 1703 1350 981 -664  thresh: -1867
t 1 h #2  wts: 1491 1142 666 822 1428 -1746 -404 1172 213  thresh: 1718
t 1 o #0  wts: 61 -40 118  thresh: -44
t 1 o #1  wts: -92 -108 -2  thresh: 93

Figure B.5: The original, non-permuted wts0 file shows the format of the weights to be downloaded to the hidden and output neurons in ACME.

Note that the hidden weights must be permuted because of the effect of the "shifting pins" strategy on the input neurons as they are routed to the hidden neurons. The original weight file has to be permuted in order to load the weights to their correct destinations. Program wscramble permutes the weights: it reads the input file named on the command line and writes the permuted weight file to the CRT. If a new file with the permuted weights is desired, the output of wscramble should be redirected into the new file. We issue the following command to obtain wts0p: wscramble wts0 > wts0p. Figure 2.12 can be used to check the correct weight re-arrangement. Figure B.6 shows how the file wts0 of our example is permuted to accomplish the correct weight assignment.

#inputs: 9  #hidden: 3  #outputs: 2  HWtBits: 12  OWtBits: 8
t 1 h #0  wts: -1701 -1644 -2004 -1299 272 40 1370 -808 1236  thresh: -1317
t 1 h #1  wts: 162 1638 603 1083 1159 1350 981 -664 -1867  thresh: 1703
t 1 h #2  wts: 666 822 1428 1491 1142 1172 213 1718 -1746  thresh: -404
t 1 o #0  wts: 61 -40 118  thresh: -44
t 1 o #1  wts: -92 -108 -2  thresh: 93

Figure B.6: wts0p is the permuted version of wts0. Note that only the hidden node weights need be permuted.
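The pattern visible in Figures B.5 and B.6 suggests that the ten values of hidden neuron h (nine weights plus the threshold) are split into two groups of five, one value per routing FPGA, and each group is rotated left by h positions. The sketch below reproduces the wts0p line for hidden neuron #1 of the example, but it is an inference from that example only, not a statement of wscramble's actual implementation.

/* Inferred sketch of the hidden-weight permutation; illustrative only. */
#include <stdio.h>

static void rotl(int *v, int n, int k)      /* rotate v[0..n-1] left by k */
{
    int tmp[16], i;
    for (i = 0; i < n; i++)
        tmp[i] = v[(i + k) % n];
    for (i = 0; i < n; i++)
        v[i] = tmp[i];
}

int main(void)
{
    /* hidden neuron #1 of the example: nine weights followed by the threshold */
    int w[10] = { 1159, 162, 1638, 603, 1083, 1703, 1350, 981, -664, -1867 };
    int h = 1, i;

    rotl(w,     5, h % 5);                  /* first group of five values  */
    rotl(w + 5, 5, h % 5);                  /* second group of five values */

    for (i = 0; i < 10; i++)
        printf("%d ", w[i]);                /* 162 1638 603 1083 1159 1350 ... */
    printf("\n");
    return 0;
}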

B.5 Testing the Components

There are 21 dual-ported memory chips on the ACME board. In developing a new design it is important to know that all the parts are working properly. We designed two different methods for checking the memory chips. Program access_memo can check an individual memory chip by downloading nine different bit patterns to the memory chip in question, reading back its contents, and issuing the UNIX command diff with parameters memo_sent.txt and memo_read.txt. An error is reported if diff finds a difference between the files.

The second way to test the memories is more complete, because it involves the XC4010 FPGAs. First, access_memo downloads the memory chips with a specific pattern file; next, the program fdp configures the XC4010 FPGAs with a special "memory checker" configuration file and configures the routing FPGAs to allow control signals to flow to the global memory chips. The routing chips need to be carefully configured, since the global memory chips must be read and written by the XC4010 FPGAs. Finally, once all the elements of the board have been configured, the net program is used to load the epoch counter in X0 with a 1 and to issue the start command. The XC4010 FPGAs then invert the bits of the private and global memories once (or as many times as there are epochs loaded in the epoch counter). access_memo is then used to read back the data from the memories, and the recently read data is compared with the expected data. Because of the complications with the global memory we developed two different test suites: one for the private memories, and another for some private memories and one global memory chip. Below we include the script file we used during an intermediate stage of ACME's development; it tests two private memories and one global memory.

# This file includes a test for private and global memories
# Testing private memories 3 and 14 and global memory 3.
# F1 and F2 have a global memory tester (by default).
# (F2 is not used)
# F3 and F14 have private memory tester units.

@ err = 0
xchecker /X0/x0_test.bit
# Following line loads gmemtest in F1 and F2, and loads pmemtest
# in F14. Other FPGAs are configured with pmemtest bit-streams.
../bin/fdp 2 12 mcs/hid_memtest4.mcs mcs/hid_memtest.mcs \
    mcs/emp4010.mcs
# Configure r0 in the same manner as for the xor circuit.
# r1 and r3 share the same pinout => use the same configuration file
# r2 needs a different file. r4 and r5 are loaded with empty files
../bin/rdp mcs/r0_rt.mcs mcs/r1_testr.mcs mcs/r2_testr.mcs \
    mcs/r1_testr.mcs mcs/emprx.mcs mcs/emprx.mcs
# number of iterations
if ($1 != "") then
    set count = $1
else
    @ count = 1
endif
@ i = 0
while (${i} < ${count})
    @ i += 1
    echo "test number $i"
    set beep=0
    # load the private memories
    # F3 and F14 talk to their private memory
    ../bin/access_memo 3 9 4096 mtest/pattern.txt
    ../bin/access_memo 3 20 4096 mtest/pattern.txt
    # load the input staging memory number 3.
    # F1 talks to chip 2 of the input staging memory
    ../bin/access_memo 3 2 4096 mtest/pattern2.txt
    # reset net
    ../bin/new_net 8
    # run net for one cycle
    ../bin/new_net 1 1
    # read pmem3
    ../bin/access_memo 2 9 4096
    # compare pmem3 to desired data
    cmp -s memo_read.txt mtest/desired.txt
    if ($status != 0) then
        set beep=1
        echo ". ERROR: pmem3 and desired data file differ"
        cp memo_read.txt mtest/pm3read.${err}.err
    endif
    # read pmem14
    ../bin/access_memo 2 20 4096
    # compare pmem14 to desired data
    cmp -s memo_read.txt mtest/desired.txt
    if ($status != 0) then
        set beep=1
        echo ". ERROR: pmem14 and desired data file differ"
        cp memo_read.txt mtest/pm14read.${err}.err
    endif
    # read gmem2
    ../bin/access_memo 2 2 4096
    # compare gmem2 to desired data
    cmp -s memo_read.txt mtest/desired.txt
    if ($status != 0) then
        set beep=1
        echo ". ERROR: gmem2 and desired data file differ"
        cp memo_read.txt mtest/gm2read.${err}.err
    endif
    if ($status != 0) then
        set beep=1
        echo ". ERROR: global memory and desired data file differ"
        cp memo_read.txt mtest/gmread.${err}.err
    endif
    if ($beep == 1) then
        @ err += 1
        echo
        echo "number of errors:" $err
    endif
end
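Since the memory-checker configuration inverts every memory bit once per epoch, the expected-data file compared against memo_read.txt is simply the bitwise complement of the downloaded pattern. The sketch below shows one way such a file could be produced; it assumes the pattern file uses the same ASCII-hex byte format as the ".md" files and that the output must match access_memo's memo_read.txt layout. The helper itself is hypothetical and is not one of the ACME tools.

/* Hypothetical generator of the expected data for the one-epoch memory test. */
#include <stdio.h>

int main(void)
{
    FILE *in  = fopen("mtest/pattern.txt", "r");   /* pattern downloaded to the chip */
    FILE *out = fopen("mtest/desired.txt", "w");   /* data expected after one epoch  */
    unsigned int byte;
    int addr = 0;

    if (in == NULL || out == NULL)
        return 1;
    while (fscanf(in, "%x", &byte) == 1)           /* one inversion = one epoch */
        fprintf(out, "[%d] %x ", addr++, (~byte) & 0xffu);
    fputc('\n', out);
    fclose(in);
    fclose(out);
    return 0;
}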


Appendix C. Schematics

C.1 Description of Schematics

This section includes the schematics for the controller board, the alterations performed to the Dawn card, and the controller FPGA X0. The documentation for the alterations to the Dawn card and the implementation of the controller board was produced using PADS, a program for producing circuit schematics. The design of X0 was performed using Workview, a schematic capture program. We give a brief summary of the schematics.

1. Controller board.
2. Dawn board alterations. (The DMA controller and SBus are not shown.)
3. X0 top-level schematic. The following schematics show the implementation of its functional blocks.
4. Address generation and memory chip control.
5. A 4-bit binary-coded preloadable, resettable up/down counter.
6. The decoder and port address generator.
7. A macro for a 74-138.
8. Miscellaneous control signals.
9. The read-back multiplexer.
10. Implementation of a 2-to-1 multiplexer.
11. Epoch counter.
12. 3-bit loadable down counter.
13. 14 multiplexers implemented using tri-state buffers.
14. BDSYN description of the selector.
15. CUPL description of the PAL22V10 we replaced in the DMA controller card.

Schematic sheet title blocks:

X0 --- SBUS Controller for ACME. file name: x0. March 21, 1994. Marcelo Martin, modified by Pak K. Chan and Martine Schlag.
memory3 (PART=3190PG175-4). file: memory3, CRITICAL. January 24, 1994.
decoder. File: decoder. January 24, 1994.
miscel3. file: miscel3. January 24, 1994.
mux. file name: mux. January 24, 1994.
EPOCH Counter. Marcelo Martin / Pak K. Chan. March 22, 1994. net-int4.
3-bit LOADABLE Down Counter. Pak K. Chan and Martine Schlag. March 21, 1994.
14 multiplexers implemented using tri-state busses (selector). January 14, 1994.

C.2 Schematics

! Bdsyn description of the selector
!
! Module to be used for configuring 4010s.
! Assert enabling signals to pass config. data to F2 through F13
! (remember: F1 and F2 are hidden by default, F14 is output by default.)
! F can be either a hidden or an output node,
! so we have two sets of Fs: FHID or FOUT.
!
! NOTE: it is up to fdp.exe to download the correct byte into selF
! so that FHID and FOUT are mutually exclusive.
!
! The order of the enabling byte (selF) is:
!
! SelF bit:
!    7     6     5     4     3     2     1     0
!  enout enout enout enout enhid enhid enhid enhid
!
! As an example: selF xxxx 0010 => F1, F2, F3, and F4 are hiddens.
! maximum number of hiddens 13: 1011 => F1, F2 ... F13 are hiddens.
!
! selF 0011 xxxx => F14, F13, F12, and F11 are outputs.
! maximum number of outputs 11: 1011 => F14, F13 ... F3 are outputs.

MODEL selector
  Fhid, Fout = enhid, enout;   ! hiddens from F3 to F13
                               ! outputs from F13 to F3 (assume that #hiddens > 1)
behavior;
  PORT enhid, enout input, Fhid, Fout output;
  ROUTINE sel;
    SELECT enhid FROM
      [0]:  Fhid = 0;          ! select F1 and F2 (default)
      [1]:  Fhid = 1;          ! select F1, F2 and F3
      [2]:  Fhid = 3;
      [3]:  Fhid = 7;
      [4]:  Fhid = 15;
      [5]:  Fhid = 31;
      [6]:  Fhid = 63;
      [7]:  Fhid = 127;
      [8]:  Fhid = 255;
      [9]:  Fhid = 511;
      [10]: Fhid = 1023;
      [11]: Fhid = 2047;
      [12,13,14,15]: Fhid = 0; ! select F1 and F2 (by default)
    ENDSELECT;
    SELECT enout FROM
      [0]:  Fout = 0;          ! select F14 (default)
      [1]:  Fout = 1024;       ! select F14 and F13
      [2]:  Fout = 1536;
      [3]:  Fout = 1792;
      [4]:  Fout = 1920;
      [5]:  Fout = 1984;
      [6]:  Fout = 2016;
      [7]:  Fout = 2032;
      [8]:  Fout = 2040;
      [9]:  Fout = 2044;
      [10]: Fout = 2046;
      [11]: Fout = 2047;
      [12,13,14,15]: Fout = 0; ! select F14 by default
    ENDSELECT;
  ENDROUTINE sel;
ENDBEHAVIOR;
ENDMODEL selector;
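Both SELECT tables are thermometer codes over the eleven configurable FPGAs F3 through F13: Fhid enables FPGAs from the low end upward, Fout from the high end downward. The small C rendering below reproduces the table values for illustration; the hardware is generated from the BDSYN description above, and the function names here are not part of any ACME tool.

/* Illustrative C equivalent of the selector's thermometer-code tables. */
#include <stdio.h>

static unsigned fhid(unsigned enhid)   /* enable F3 .. F(2+enhid) as hidden nodes */
{
    return (enhid <= 11) ? ((1u << enhid) - 1u) : 0u;
}

static unsigned fout(unsigned enout)   /* enable F13 downward as output nodes */
{
    return (enout >= 1 && enout <= 11)
         ? (((1u << enout) - 1u) << (11 - enout))
         : 0u;
}

int main(void)
{
    printf("fhid(3)  = %u (table value 7)\n",    fhid(3));
    printf("fout(2)  = %u (table value 1536)\n", fout(2));
    printf("fout(11) = %u (table value 2047)\n", fout(11));
    return 0;
}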

/* CUPL description */
Name     pal_new2;
Partno   ;
Date     9/14/93;
Revision ;
Designer ;
Company  ;
Assembly ;
Location ;
Device   p20l8;
Format   j;

/* command line: cupl -j pal_new2
   checksum: 5834
   Same as pal_new1 except that it is compiled for pal20L8 instead of 22v10. */

PIN 1  = CLK;
PIN 2  = D_RESET;
PIN 3  = !D_WR;
PIN 4  = !D_RD;
PIN 5  = !D_CS;
PIN 6  = !D_ACK;
PIN 7  = S_PA3;
PIN 8  = S_PA2;
PIN 9  = S_PA4;
PIN 10 = NC9;
PIN 11 = NC6;
PIN 13 = NC7;
PIN 14 = NC8;
PIN 15 = !D_RS;
PIN 16 = !D_RDF;
PIN 17 = !D_WRF;
PIN 18 = NC15;
PIN 19 = BSPA4;
PIN 20 = BSPA2;
PIN 21 = BSPA3;
PIN 22 = !BCLK;
PIN 23 = NC14;

/*** declarations, intermediate variables */
D_WRF = !D_RESET &  D_ACK & !D_CS & D_WR
      # !D_RESET & !D_ACK &  D_CS & D_WR;
D_RS  = D_RESET;
D_RDF = !D_RESET &  D_ACK & !D_CS & D_RD
      # !D_RESET & !D_ACK &  D_CS & D_RD;
BCLK  = CLK;
BSPA2 = S_PA2;
BSPA3 = S_PA3;
BSPA4 = S_PA4;


References

[1] V. E. Benes, Mathematical Theory of Connecting Networks and Telephone Traffic. Academic Press, 1965.
[2] A. T. Ferrucci, ACME: A Field Programmable Gate Array Implementation of a Self-Adapting and Scalable Connectionist Network. University of California, Santa Cruz, March 1994.
[3] J. Hertz, A. Krogh, and R. G. Palmer, Introduction to the Theory of Neural Computation. A Lecture Notes Volume in the Santa Fe Institute Studies in the Sciences of Complexity, Addison Wesley, 1991.
[4] B. J. A. Krose and P. P. V. D. Smagt, An Introduction to Neural Networks. University of Amsterdam, 1993.
[5] P. K. Chan, M. Schlag, and M. Martin, "BORG: A reconfigurable prototyping board using Field-Programmable Gate Arrays," in Proceedings of the 1st International ACM/SIGDA Workshop on Field-Programmable Gate Arrays, (Berkeley, California, USA), pp. 47-51, Feb. 1992.
[6] XILINX: The Programmable Gate Array Data Book. 2100 Logic Drive, San Jose, CA 95124, 1993.
[7] P. K. Chan and S. Mourad, Digital Design Using Field Programmable Gate Arrays. Prentice Hall, 1994.
[8] M. Schlag, J. Kong, and P. K. Chan, "Routability-driven technology mapping for lookup table-based FPGAs," in 1992 IEEE International Conference on Computer Design: VLSI in Computers and Processors, (Cambridge, Massachusetts), pp. 86-90, Oct. 1992.
[9] C. E. Cox and W. E. Blanz, "GANGLION - A Fast Field-Programmable Gate Array Implementation of a Connectionist Classifier," IEEE Journal of Solid-State Circuits, vol. 27, pp. 288-299, March 1992.
[10] J. N. Morris, Hardware Design of a Rapid Prototyping Reconfigurable Logic System. North Carolina State University, Raleigh, NC, 1992.
[11] P. Bertin, D. Roncin, and J. Vuillemin, "Programmable Active Memories: a Performance Assessment," in Advanced Research in VLSI, (Seattle), pp. 88-102, MIT Press, 1993, Feb. 1992.
[12] S. Casselman, "Virtual Computing and the Virtual Computer," Proceedings of the IEEE Workshop on FPGAs for Custom Computing Machines, pp. 44-48, April 1993.
[13] M. Gokhale, W. Holmes, A. Kopser, S. Lucas, R. Minnich, and D. Sweely, "Building and Using a Highly Parallel Programmable Logic Array," IEEE Computer, vol. 24, pp. 81-89, January 1991.
[14] IDA Supercomputing Research Center, 17100 Science Drive, Bowie, MD 20715, Splash 2: Programmer's Manual, Version 0.7, 1992.
[15] J. Darnauer, P. Garay, T. Isshiki, J. Ramirez, and W. W.-M. Dai, "A Field Programmable Multichip Module (FPMCM)," IEEE Workshop on FPGAs for Custom Computing Machines, pp. 1-10, April 1994.
[16] J. Babb, R. Tessier, and A. Agarwal, "Virtual Wires: Overcoming Pin Limitations in FPGA-based Logic Emulators," Proceedings of the IEEE Workshop on FPGAs for Custom Computing Machines, pp. 142-151, April 1993.
[17] National Technologies Inc., "X-12 Product Announcement," 1993.
[18] J. Varghese, M. Butts, and J. Batcheller, "An Efficient Logic Emulation System," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 1, pp. 171-174, June 1993.
[19] APTIX, "FPCB AXB-AP4: Field Programmable Circuit Board for ASIC Prototypes," Product Preview, April 1993.
[20] Paul Horowitz and Winfield Hill, The Art of Electronics. Cambridge University Press, 1989.
[21] S. Hauck, G. Borriello, and C. Ebeling, "Springbok: A Rapid-Prototyping System for Board-Level Designs," in Second International Workshop on Field Programmable Gate Arrays, (Berkeley, California), February 1994.
[22] K. Yamada, H. Nakada, A. Tsutsui, and N. Ohta, "High-Speed Emulation of Communication Circuits on a Multiple-FPGA System," in Second International Workshop on Field Programmable Gate Arrays, (Berkeley, California), February 1994.
[23] C. Clos, "A study of non-blocking switching networks," Bell System Technical Journal, vol. 32, pp. 406-424, Mar. 1953.
[24] G. M. Masson and B. W. Jordan, "Generalized Multi-Stage Connection Networks," Networks, vol. 2, pp. 191-209, Dec. 1972.
[25] P. K. Chan and M. Schlag, "Architectural Tradeoffs in Field-Programmable-Device-Based Computing Systems," in Proceedings of the 1st IEEE Workshop on FPGAs for Custom Computing Machines, (Napa, CA, USA), pp. 152-161, Apr. 1993.
[26] F. K. Hwang, "Rearrangeability of Multi-Connection Three-Stage Clos Networks," Networks, vol. 2, pp. 301-306, Aug. 1972.
[27] Y. Yang and G. M. Masson, "Nonblocking Broadcast Switching Networks," IEEE Transactions on Computers, vol. C-40, pp. 1005-1015, Sept. 1991.
[28] G. W. Richards and F. K. Hwang, "A Two-Stage Rearrangeable Broadcast Switching Network," IEEE Transactions on Communications, vol. COMM-33, pp. 1025-1035, Oct. 1985.
[29] C. D. Thompson, "Generalized Connection Networks for Parallel Processor Intercommunication," IEEE Transactions on Computers, vol. C-27, pp. 1119-1125, Dec. 1978.
[30] Y.-M. Yeh and T.-Y. Feng, "On a Class of Rearrangeable Networks," IEEE Transactions on Computers, vol. 41, pp. 1119-1125, Nov. 1992.
[31] B. G. Douglass, "Rearrangeable Three-Stage Interconnection Networks and Their Routing Properties," IEEE Transactions on Computers, vol. 42, pp. 559-567, May 1993.
[32] Dawn VME Products, DPS-1. 47073 Warm Springs Blvd, Fremont, CA 94539, 1990.
[33] LSI Logic Corporation, L64853 SBus DMA Controller. 1551 McCarthy Blvd, Milpitas, CA 95035.