Cellular Automata for Structural Optimization on Reconfigurable Computers

Thomas R. Hartka

Thesis submitted to the Faculty of the Virginia Polytechnic Institute and State University in partial fulfillment of the requirements for the degree of

Master of Science in Computer Engineering

Dr. Mark T. Jones, Chair
Dr. Peter M. Athanas
Dr. Michael S. Hsiao

May 12, 2004 Blacksburg, Virginia

Keywords: configurable computing, cellular automata, design optimization

Copyright 2004, Thomas R. Hartka

Cellular Automata for Structural Optimization on Reconfigurable Computers

Thomas R. Hartka

(ABSTRACT)

Structural analysis and design optimization is important to a wide variety of disciplines. The current methods for these tasks require significant time and computing resources. Reconfigurable computers have shown the ability to speed up many applications, but are unable to handle efficiently the precision requirements for traditional analysis and optimization techniques. Cellular automata theory provides a method to model these problems in a format conducive to representation on a reconfigurable computer. The calculations do not need to be executed with high precision and can be performed in parallel. By implementing cellular automata simulations on a reconfigurable computer, structural analysis and design optimization can be performed significantly faster than conventional methods.

This work was partially supported by NSF grant #9908057 as well as by the Virginia Tech Aspires program.

Acknowledgements

I would first like to thank my advisor, Dr. Mark Jones, for his guidance through my entire research. Without his guidance I never would have been able to complete this thesis. Thanks to Dr. Athanas for serving on my thesis committee and for making development on the DINI board possible. Thanks to Dr. Hsiao for serving on my committee and being an excellent teacher. Thanks to Dr. Gurdal and his researchers for providing the mathematics for the cellular automata models and for all the effort they spent learning about reconfigurable computing so that the equations mapped efficiently. Thanks to all the professors and students involved with the Configurable Computing Lab for making it a great place to work. Thanks to all the people who helped in the process of reviewing and editing this thesis; I am forever indebted to anyone who will review sixty pages of my writing. Thanks to everyone else who I have not mentioned that helped with my work. I could not have done it without the support from the people around me.


Contents

1 Introduction
  1.1 Thesis Statement
  1.2 Thesis organization
2 Background
  2.1 Cellular Automata
  2.2 Configurable Computers for Scientific Computations
  2.3 Limited Precision
3 System Design
  3.1 Design Background
    3.1.1 System Hardware
    3.1.2 Distributed Layout
  3.2 Problem Specific Design
  3.3 Program Based Design
4 Results
  4.1 Problem Formulation
  4.2 Problem Specific Design Results
  4.3 Program Based Design Results
  4.4 Comparison of Designs
5 Conclusions
  5.1 Summary
  5.2 Results
  5.3 Future Work
Appendix A
Vita

List of Figures

3.1 Setup of the configurable computer used for simulating the CA model.
3.2 Distribution of logical CA cells among PEs.
3.3 Arithmetic unit for Problem Specific design.
3.4 PE layout for Problem Specific design.
3.5 Return data chains for Program Based design.
3.6 Control Unit for Program Based design.
3.7 Analysis cycle flow and precision for each operation.
3.8 Computational unit for Program Based design.
3.9 Multiply accumulator used in computational unit.
3.10 MSB data return chain, used for determining the most significant ‘1’ of residuals.
3.11 Unit for shifting the precision of intermediate results.
3.12 Matrix accumulator used for analysis updates.
3.13 Data flow for uploading and downloading data to FPGAs.
4.1 Diagram of CA model for performing analysis on a beam.
4.2 Beam analysis problem modeled on the configurable computer.
4.3 Precision of PE vs. Percent Utilization of FPGA for Problem Specific design.
4.4 % Utilization of FPGA and maximum clock frequency vs. number of PEs for Problem Specific design.
4.5 Cell updates per second vs. number of PEs for the Problem Specific design.
4.6 Actual results and results from Problem Specific design for beam problem.
4.7 Precision of PE vs. % Utilization of FPGA for Program Based design.
4.8 Efficiency vs. number of inner iterations per analysis cycle.
4.9 % Utilization of FPGA and maximum clock frequency vs. number of PEs for Program Based design.
4.10 Cell updates per second vs. number of PEs for Program Based design.
4.11 Actual results and results from Program Based design for beam problem.
A.1 Spreadsheet with position of control signals and short description.
A.2 Spreadsheet containing update program.
A.3 Spreadsheet converting signals to the form used by the Program Based model.
A.4 Spreadsheet containing the data values in a form that can be loaded into memory.

List of Tables

4.1 Times for operations associated with Problem Specific analysis cycle for DINI board.
4.2 Clock cycles for different phases of residual-update analysis cycle.
4.3 Time for operations associated with analysis on Program Based design.
4.4 Maximum cell updates per second for both implementations.
4.5 Maximum cell updates per second and speed-up for both implementations compared to PC.

Chapter 1

Introduction

Structural analysis and design optimization are an integral part of many industries. Applications range from simple tasks, such as testing and optimizing a support beam, to very complicated ones, such as optimizing the structure of a car for crash resistance. Performing the design iterations manually is very time consuming. Therefore, a significant amount of research has been conducted to develop efficient methods to automate the design process.

Traditional methods for automating design have involved running simulations on general purpose processors. In these methods, calculations must be performed in high precision. Parallelization of the calculations, where possible, is done through expensive supercomputers with hundreds of processors. Even with massive amounts of computing power, the simulations will usually take hours to complete.

Cellular Automata (CA) has proved to be a very powerful tool for modeling physical phenomena. CA models have successfully captured the behavior of complex systems such as fluid flow around a wing and pedestrian traffic [?, ?]. Recently, CA theory has been extended to structural analysis and design optimization [?]. Using CA in structural models changes analysis and design optimization into a highly parallelizable form that does not require high-precision calculations. This provides the potential for significant speed-up.

1.1 Thesis Statement

Using CA provides a method to efficiently map structural design optimization problems onto FPGAs. By exploiting the inherent parallelism of FPGAs there is the potential for speed-up over general purpose processors. To achieve this objective, distributed processing systems were implemented on a configurable computer. The system consisted of a host PC connected to a PCI-based board with five FPGAs. Two designs were developed for the FPGAs to rapidly iterate CA models for structural analysis. The two designs represent significantly different approaches to accomplishing the same objective.

The author's contributions to this work are the following:

- developed a custom FPGA design for simulating a beam CA model,
- developed a separate FPGA design that executes programs to simulate CA models,
- wrote programs for the FPGA design to simulate a beam CA model, and
- implemented a limited precision method in hardware for solving iterative improvement problems.

1.2 Thesis organization

Chapter 2 presents background information about CA theory, scientific computations on configurable computers, and the limited precision method used in the designs. Chapter 3 gives details on the two implementations developed to solve the CA models. Chapter 4 presents the results for each of the two implementations and comparisons to traditional methods. Chapter 5 summarizes the work performed for this project and the results obtained.

Chapter 2

Background

This chapter presents previous work in areas related to this research. The combined contributions discussed were used in the completion of the simulation environment and prototype presented in this thesis.

2.1 Cellular Automata

The concept behind Cellular Automata (CA) theory is to model systems with many interacting objects [?]. The systems are divided into discrete units, or cells, that act autonomously. The advantage of using CA is that the behavior of some complex systems can be captured using relatively simple rules for each cell [?]. Attempting to reproduce this behavior without breaking those systems into autonomous units, even if possible, would be complicated.

Each cell in a CA model can be in a single state at any given point in the simulation. The number of states the cell may be in depends on the problem being solved. In many models the number of elements in the set of states is small (8 or less), but there is a newer class of CA models that uses a continuous state space. These continuous state space CA models are known as coupled map-lattice or cell dynamic schemes. The next state of a cell is based on an update rule, sometimes referred to as a transition rule, which is a function of its current state and the current state of its neighbors [?]. The collective state of all of the cells in the model at any given point is known as the global state [?].

Stanislaw Ulam is generally credited with the first work in CA, originally referred to as cellular space or automata networks. John von Neumann extended Ulam's work and proposed CA as a way to model self-reproducing biological systems [?, ?]. The work of Ulam and von Neumann provides a formal method for simulating complex systems. Their research, and much of the current research in CA, focused on modeling dynamic systems in which time and space are discrete. Each calculation of the next state of all the cells in a system represents a step in time [?]. A good example of this type of CA model is Conway's Game of Life, in which cells can be in one of two states: alive or dead. Each update of the global state represents a new generation of organisms [?].

There are a number of architectures for CA models, each resulting in different behavior. The number of dimensions of the CA model can differ greatly depending on the system being modeled. Models are typically one, two, and three dimensions in practice; however, there is no limit to the number of dimensions that can be used [?]. The number of dimensions of the grid has a large effect on the communication network among cells, known as the cellular neighborhood. In the work on two-dimensional grids there are two common cellular neighborhoods. The first is the von Neumann neighborhood, in which each cell communicates only with the four cells that are orthogonally adjacent to it. The second is the Moore neighborhood, in which a cell communicates with all eight cells surrounding it [?]. Though von Neumann and Moore neighborhoods are common, cells are not limited to communicating only with those that are adjacent. The "MvonN Neighborhood" uses the nine cells (including the center cell) in the Moore neighborhood as well as the four cells orthogonally one space away from the current cell. Additionally, the communication of the cells within a model is not required to be consistent throughout the model [?].

There has also been a significant amount of work investigating non-rectangular cell systems. Gas lattice automata are a subset of cellular automata that commonly use the FHP model. The FHP model uses a hexagonal grid, where cells communicate with their six immediate neighbors [?, ?]. The use of triangular and regular polygonal lattices is common in specialized applications of cellular automata because they can better capture the behavior of certain systems [?].

Models in which communication and update rules are consistent throughout the model are called uniform. Though most of the work in the area of CA has used uniform models, the use of non-uniform rules does not necessarily detract from the effectiveness of using CA. A number of experiments have been conducted to model the effect of "damaged" areas of a grid where cells use different rules [?]. In terms of simulating a CA model on a serial processor, a uniform grid has the advantage that only one update rule is needed [?].

The grid for a CA model may be finite or infinite. In his work, von Neumann examined infinite grids as a method to construct a universal computer [?]. Although von Neumann's work on infinite grids was theoretical, methods for representing and calculating CA models on infinite grids have been developed [?]. Finite grids are much simpler to implement and process in parallel because the maximum size of the active area is known before processing begins. However, the use of finite grids introduces the problem of how to calculate cells on the edge of the grid, known as the boundary conditions.

There are several ways to handle the processing of cells on the edge of a finite grid. The first method is to logically connect the cells on one edge of the grid to cells on the opposite edge, producing a loop. Another way to handle boundary conditions is to use a fixed value for cells at the perimeter of the grid. In systems with fixed boundary conditions, the edge cells are known as dummy cells because they do not need to be updated. The third method for calculating the update for edge cells is to use an update rule that is different than that used in the internal cells [?]. An example of such a non-uniform rule would be an edge cell that simply mirrors the value of the closest internal cell. The type of boundary condition used depends largely on the problem being modeled.

The early work in CA theory concentrated on theoretical computational questions, such as computational universality. In later work, it has been used as a method to study social, physical, and biological systems [?]. A number of studies have been conducted to use CA to capture the aggregate behavior of groups of autonomous beings, for example, car traffic, pedestrian flow, and ant colonies [?, ?, ?]. In scientific computing, many successful attempts have been made to model such phenomena as fluid dynamics, chemical absorption, and heat transfer using CA [?, ?, ?].

Some of the most recent work in CA has been in the field of structural analysis and design. The first work in this area was the development of methods to optimize the angle and cross-section area of trusses in a fixed structure [?]. These methods proved successful in merging analysis and design into a CA model and showed powerful computational properties. This success prompted more work to extend CA theory to create models for other structural design problems. A model was developed to minimize the weight of a beam needed to prevent buckling [?]. The beam is represented in sections, to which constraints and external forces can be applied independently. The cross-sectional area for each section is determined to produce the minimum total weight of the beam. Experiments with this method showed models converged to the correct minimum solution. This area of CA research shows substantial promise for accelerating structural design optimization through parallel computing.
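To make the mechanics above concrete, the following minimal sketch (written for this discussion, not taken from the thesis) steps a one-dimensional CA with a uniform transition rule and fixed-value dummy cells at the boundaries. The rule, grid size, and initial state are arbitrary illustrations.

```python
def step(grid, rule, left_bc=0, right_bc=0):
    # One synchronous update: every cell reads the current states of its
    # neighbors, so the new grid is computed entirely from the old one.
    padded = [left_bc] + grid + [right_bc]   # fixed-value dummy cells at the edges
    return [rule(padded[i - 1], padded[i], padded[i + 1])
            for i in range(1, len(padded) - 1)]

# Illustrative uniform rule: the next state is the XOR of the two neighbors
# (elementary CA "Rule 90"), chosen only because it is simple and well known.
rule90 = lambda left, center, right: left ^ right

state = [0, 0, 0, 1, 0, 0, 0]
for _ in range(4):
    state = step(state, rule90)
    print(state)
```

Swapping the rule function or the boundary handling reproduces the other variations discussed above (wrap-around edges, mirrored edge cells, or non-uniform rules) without changing the update loop.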

2.2 Configurable Computers for Scientific Computations

FPGAs, which are usually the basis for configurable computers, show considerable speed-up for a variety of applications when compared to general purpose processors. These applications include signal processing, pattern searching, and encryption. The use of FPGAs for these tasks, which mainly involve bit manipulation, has shown orders of magnitude acceleration [?, ?, ?]. These accelerations are possible because the tasks can be broken down into simple operations. The operations can then be performed in parallel throughout the chip.

Although pattern matching and bit manipulation have been widely studied on FPGAs, FPGAs have typically not been used for scientific computations. Two significant deterrents to using FPGAs are the limited available programmable logic and the slow clock speeds. In the past, FPGAs have only been able to represent circuits that had gate counts in the low thousands [?]. This low gate count is restrictive for scientific computations. For example, a 32-bit parallel multiplier could not be emulated by most of the FPGAs in the Xilinx XC3000 family, chips that were first produced in the mid 1990s (based on number of CLBs). This becomes even more of a handicap because FPGAs typically operate at clock speeds much lower than average CPUs. A general purpose processor will usually outperform an FPGA if the FPGA cannot carry out parallel or deeply pipelined operations.

These limitations of FPGAs have been greatly reduced in recent chips because of the much larger transistor densities. The latest Xilinx FPGAs can emulate circuits with up to 10 million gates [?]. With increased programmable logic, it is possible to have many more arithmetic units performing complicated operations in parallel. In comparison to the previous example, a Xilinx XC2V8000 chip, currently Xilinx's largest FPGA, has enough logic to represent thirty-five 32-bit multipliers (based on number of CLBs). Floating-point operations continue to require a large percentage of available resources. Still, researchers have begun exploring scientific computations on FPGAs. A paper from researchers at Virginia Tech used the flexibility of FPGAs to develop representations of floating-point numbers that are more efficient on FPGAs [?]. In 2002, researchers published a paper detailing the development of a limited precision floating-point library and an optimizer to determine the minimum precision needed in DSP calculations [?].

Using the least precision possible is important on an FPGA. General purpose processors usually compute operations in higher precision than is needed because of the limited choices for precision. However, the fine-grain control of the logic in an FPGA allows custom arithmetic units of any precision. This flexibility can be extended to dynamically controlling the precision of different calculations on the same unit. Other work has presented a variable precision coprocessor for a configurable computer and given algorithms for variable precision arithmetic units [?]. Two papers have been published investigating how to manage dynamically varying precision and showing how the overall runtime is substantially decreased by using minimal precision [?, ?].

CA has been used in the computer science community for some time. In 1985, a book was published describing implementations of CA simulations on massively parallel computers [?]. However, there has been little work on running these models on configurable computers. There have been some papers written on using CA on FPGAs, but they all focus on models with simple cell update rules and small state sets. For example, the CAREM system was developed to efficiently model CA on FPGAs [?]. The two models published as examples of using the CAREM system were an image thinning algorithm and a forest fire simulation. In both cases the models were simple, having state set sizes of 4 or less. Other cellular automata simulation systems, concentrating on fluid dynamics, have been proposed for reconfigurable computers [?, ?]. However, like CAREM, these systems are only capable of handling simple models with a very limited number of states.

Custom hardware architectures for processing CA were implemented by Norman Margolus at MIT. The most successful was known as the Cellular Automata Machine 8 (CAM-8) [?]. The CAM-8 is based on custom SIMD processors that are connected in three dimensions. Each processor is responsible for a section of data in the model, which is stored in DRAM. Processing of each cell's data is performed using look-up tables (LUTs) stored in SRAM. This architecture shows impressive results, generating up to 3 billion cell updates per second. However, the LUT-based processing limits models to a fairly small state size. There have been a number of projects which use the CAM-8 in areas such as modeling fluid motion [?] and gas lattices [?]. The CAM-8 is now sold commercially.

2.3 Limited Precision

The use of configurable computers has renewed study in the area of limited precision computing. Determining the least number of bits needed for a task was important when many chips were custom designed and silicon was expensive. With the rise of cheaper fabrication methods and inexpensive, powerful CPUs, this area became less important. The use of general purpose processors with dedicated floating-point units lessens the penalty for using floating-point for all calculations. However, as configurable computers become popular, the use of limited precision for calculations has again become important [?].

All configurable computers are based on programmable logic at some level of granularity. Historically, the most popular type of programmable logic is the FPGA. FPGAs have bit-level granularity, so arithmetic units can be built with any precision. In most cases, each additional bit of precision of an arithmetic unit will require more chip resources. Also, the maximum clock frequency for an arithmetic unit may decrease with each additional bit of precision. This high sensitivity makes using the lowest precision possible very important to optimizing a design on an FPGA.

The use of limited precision on FPGAs has been extended further to solving iterative problems in a recent paper [?]. This paper describes a method for performing low precision calculations that are collated into high precision results. Similar ideas were developed for CPUs, but those studies focused on using single precision floating-point calculations to find double precision solutions [?, ?]. As mentioned earlier, FPGAs have a much finer grain of control over precision, and floating-point calculations are expensive on FPGAs. Therefore, a new, modified version of this concept, specifically suited for use on a configurable computer, has recently been investigated [?].

The reason that low-precision arithmetic can be used in iterative improvement problems is that the answer converges gradually. During each step, a correction is found that improves the solution. When the correction is large and the highest bits of the solution are converging, the low bits do not hold any useful information. Therefore, there is no advantage to using a precision that calculates the low order bits before the upper bits have converged. As the solution becomes closer to the final answer, the refinement at each step becomes smaller. Because the refinement is small, the high order bits no longer change. At this point there is no longer any reason to recompute the high bits of the solution. This property of iterative improvement problems makes it possible to use fewer bits to calculate the correction than the number of bits that are in the final result. In this way, only the high order bits are calculated while the correction is large; conversely, only the low order bits are computed when the correction becomes small. This is made possible by calculating the error (residual) in the equation for the iterative improvement problem. The goal of the example below is to find a value for x which satisfies the equation

A ∗ x_i = b.    (2.1)

The residual (or error) in this equation can be written as

r = b − A ∗ x_i.    (2.2)

Instead of using the initial equation, the change in x can be calculated:

∆x_i = A^(−1) ∗ r.    (2.3)

The previous calculation can be performed with lower precision arithmetic. This step is iterated a number of times, and ∆x_i is then added back into the previous x:

x_(i+1) = x_i + ∆x_i.    (2.4)

This method has been shown to converge to the correct solution [?]. It is applicable to our work in CA because the CA models we use for structural design optimization are in a form that utilizes this method. The advantage of using this method on reconfigurable computers comes from the fact that the bulk of the operations are performed during the update phase. A large number of resources can be devoted to accelerating the update calculations because the update can be calculated at a low precision.
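As a numerical illustration of Equations 2.1 through 2.4 (a sketch written for this discussion, not the hardware implementation), the loop below forms the residual in full floating-point precision, deliberately truncates it to mimic low precision arithmetic, and accumulates the corrections. The matrix, right-hand side, and bit width are arbitrary assumptions.

```python
import numpy as np

def quantize(v, bits=8):
    # Crude stand-in for k-bit arithmetic: keep roughly the top `bits` bits
    # of each entry, relative to the largest magnitude present in the vector.
    scale = (2.0 ** (bits - 1)) / max(float(np.max(np.abs(v))), 1e-30)
    return np.round(v * scale) / scale

A = np.array([[4.0, 1.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0])
x = np.zeros(2)

for _ in range(20):
    r = b - A @ x                          # Eq. 2.2: residual in full precision
    if np.max(np.abs(r)) < 1e-9:
        break
    dx = np.linalg.solve(A, quantize(r))   # Eq. 2.3: correction from a truncated residual
    x = x + quantize(dx)                   # Eq. 2.4: accumulate into the full-width solution

print(x, np.linalg.solve(A, b))            # x approaches the exact solution
```

Even though each correction is computed from a coarsened residual, the accumulated x keeps refining, which is the property the designs in Chapter 3 exploit.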

Chapter 3

System Design

This chapter describes two approaches to implementing CA models on FPGAs. Both approaches use an array of uniform, simple processing elements (PEs) spread throughout the chip. A large number of PEs can fit on a single FPGA because the PEs are relatively simple. This distributed computing is effective because of the parallel nature of solving the CA models.

The two designs described in this chapter illustrate a fundamental tradeoff in hardware design: flexibility versus speed. The first implementation is a custom circuit developed to solve the analysis equation for a given design. The second implementation executes a program stored in memory that controls arithmetic operations. Both designs solve the same analysis problem.

It is important to note that the underlying theory behind the two designs is the same. In both cases, the design is intended to determine the displacement and rotation of sections of a beam given the constraints and external forces on the beam. Though they solve the same problem, the motivation behind each design is fundamentally different. Therefore, although the same equations are used for solving for the beam variables, the form of the equations is optimized for the specific implementation.

3.1 Design Background

When performing operations on an FPGA, it is much more efficient to use fixed-point arithmetic than floating-point arithmetic. For this reason, both of the designs represent numbers in fixed-point notation. The nature of the CA models allows for this type of representation. The position of the binary point depends on the architecture and the type of data being stored. The number of bits of precision varies based on the operation being performed. In both models, intermediate values produced during calculations are stored in increasing precision to avoid loss of data. The data is then truncated before the final value is stored.

These designs were developed to perform calculations for a one-dimensional CA model, with two degrees of freedom for each cell. The arithmetic for higher dimensional problems can be performed without significant changes to the structure of the PE. The main difference in higher dimensional problems is the change in the communication pattern. In the one-dimensional models considered, each cell only needs to communicate with its immediate right and left neighbors. In the case of two-dimensional problems, cells often need to communicate data with four to eight neighboring cells.
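The fixed-point convention just described can be sketched in a few lines. The fractional and storage widths below are assumptions chosen for illustration, not the widths used in either design; the point is only that an intermediate product keeps its extra bits and is truncated back to the storage width afterward.

```python
FRAC_BITS = 6     # assumed position of the binary point
STORE_BITS = 8    # assumed storage width of a cell variable

def to_fixed(x, frac=FRAC_BITS):
    return int(round(x * (1 << frac)))

def to_float(q, frac=FRAC_BITS):
    return q / float(1 << frac)

def fixed_mul(a, b, frac=FRAC_BITS, store=STORE_BITS):
    wide = a * b                     # intermediate result carries roughly twice the bits
    narrowed = wide >> frac          # rescale back to one implied binary point
    limit = 1 << (store - 1)
    return max(-limit, min(limit - 1, narrowed))   # truncate/saturate to storage width

a, b = to_fixed(1.25), to_fixed(0.75)
print(to_float(fixed_mul(a, b)))     # 0.9375
```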

3.1.1 System Hardware

The concepts presented in this thesis for solving CA models on configurable computers can be applied to many hardware configurations; however, both designs were developed with a particular system in mind. The system uses a host PC connected over a PCI bus to a card containing five FPGAs (see Figure 3.1). The FPGAs are all Xilinx Virtex II XC2V4000 chips [?].

Figure 3.1: Setup of the configurable computer used for simulating the CA model.

There are several features which make the Virtex II desirable for simulating CA models. The first advantage of the Virtex II is the large amount of internal RAM distributed throughout the chip. These internal BlockRAMs have customizable width and depth. They also have two ports that can independently read and write to different addresses. Transferring data on and off chip is an expensive operation and is typically the bottleneck in most applications. By utilizing these memories, we avoid having to transfer data to external banks of RAM.

The second advantage of the Virtex II is the built-in multiplication units. In the sea-of-gates model for FPGAs, implementing multipliers is expensive. This is especially true if the precision is large, because the size of a multiplier grows with the square of the number of bits of precision. In the Virtex II, there is a built-in multiplier associated with each BlockRAM. This lends itself to the distributed processor models we used.

3.1.2 Distributed Layout

FPGAs are designed to be as flexible as possible so they can be used in many applications, but this flexibility comes at a cost in terms of space and speed for any arithmetic unit when compared to custom VLSI. Chips such as general purpose processors have custom designed arithmetic units that have a significant advantage in executing sequential operations. The reason an FPGA has the potential for speed-up versus a general purpose processor is that it can perform many operations in parallel or deeply pipeline the operations. In order to maximize the ability of an FPGA to perform operations in parallel, as much of the reconfigurable resources as possible should be in use at the same time. To accomplish this objective, both designs use many uniform, simple PEs operating in parallel. Each PE is responsible for calculating the next value for a section of cells in the CA model (see Figure 3.2).

Figure 3.2: Distribution of logical CA cells among PEs.

This distribution is simplified because each cell in the CA model is governed by the same equations. The arithmetic units in each PE implement the governing equation; each logical cell is represented by the data values that are inserted into the equation. There is a BlockRAM associated with each processing unit that stores the set of data values for each cell. The number of cells represented by a PE is determined by the number of logical cell data sets that can be stored in the BlockRAM. This concept of having multiple cells per PE greatly increases the number of logical cells that can be represented in a design. A certain amount of chip resources is needed to calculate a cell update. If only one cell was represented in each PE, then the PE could be slightly smaller and a BlockRAM would not be needed. However, the resources required do not increase greatly when moving from a PE that calculates the update for one cell to a PE that handles many cells. There are enough BlockRAMs on the Virtex-II so that the number of BlockRAMs does not limit the number of PEs that can fit on the chip.

During a single iteration, all of the logical cells contained within a processing unit are updated once. The update for a cell depends on its right and left neighbors. To calculate the update for cells on the edge of the section of logical cells a PE represents, the PE needs data from the PEs to its right and left. At the end of an iteration, each PE transfers the data from its leftmost cell to the PE representing cells to the left. Likewise, the data from the rightmost cell must be transferred to the PE representing the cells to the right. After this transfer, each processing unit has all of the information needed to compute the next update for all of the cells it represents. Calculations for all cells can start simultaneously because the necessary information about all cells is known at the beginning of the iteration.

In both designs, registers are placed between arithmetic units. If an arithmetic unit required more than one cycle to complete, a pipelined version of the component was used. This pipelining allows multiple cell updates to be computed concurrently, because cells do not need to wait until the previous cell has completely finished processing.
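The following behavioral sketch (software only, written for this discussion) mimics the distribution in Figure 3.2: the cell array is split into equal sections, one per PE, and every iteration begins by exchanging edge values with the neighboring sections. The PE count, the boundary constants, and the placeholder update rule are all assumptions for illustration.

```python
def partition(cells, num_pes):
    size = len(cells) // num_pes               # assumes an even split for simplicity
    return [cells[i * size:(i + 1) * size] for i in range(num_pes)]

def iterate(sections, update, left_bc=0.0, right_bc=0.0):
    new_sections = []
    for p, sec in enumerate(sections):
        # Edge values from the neighboring sections; boundary sections see
        # fixed constants (one possible boundary treatment).
        left = sections[p - 1][-1] if p > 0 else left_bc
        right = sections[p + 1][0] if p < len(sections) - 1 else right_bc
        padded = [left] + sec + [right]
        new_sections.append([update(padded[i - 1], padded[i], padded[i + 1])
                             for i in range(1, len(padded) - 1)])
    return new_sections

cells = [float(i) for i in range(16)]
sections = partition(cells, num_pes=4)
sections = iterate(sections, update=lambda l, c, r: 0.5 * (l + r))
print(sections)
```

Because each section only needs a single edge value from each neighbor per iteration, the communication cost stays constant regardless of how many cells a PE holds, which is the property the hardware layout exploits.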

3.2 Problem Specific Design

The original direction of this project was to develop a toolset that could rapidly produce custom hardware models based on specific problems. A designer who wanted to use the tools would specify the problem in a custom programming language. A compiler would interpret the input code and produce a custom FPGA configuration to solve the problem. It was expected that a toolset could be developed for creating custom bitstreams rapidly enough to make the system useful.

The first step in this development process was to analyze typical CA analysis equations and manually create an optimized layout. The equations used are based on an analysis problem with two degrees of freedom, v and Θ:

v̄_c = (C0 ∗ (v_l + v_r) + C1 ∗ (Θ_l − Θ_r)) + F_c
Θ̄_c = (C2 ∗ (v_l − v_r) + C3 ∗ (Θ_l + Θ_r)) + M_c
(3.1)

The variable v_c represents the v value for the current cell being processed. The variables v_l and v_r are the v values for the current cell's left and right neighbors. F_c represents an external force and M_c an external moment. v̄_c is the value of v_c at the next time step. These equations can be used to solve a one-dimensional CA analysis problem, such as deflection of a uniform beam. This form of the equations was chosen because it can be mapped to a small, linear circuit. The main goal was to minimize the number of multiplications needed, because multiplication units are costly in terms of space on the FPGA.

An optimized design was built to solve these equations. For each operation a variety of components was considered, and multiple layouts were investigated in the implementation of the equations. Maximum clock frequency, latency, and size were examined when selecting each component. To further optimize the circuit, v_c and Θ_c are computed simultaneously because they are independent. Figure 3.3 shows the final optimized design.

Figure 3.3: Arithmetic unit for Problem Specific design.

The outputs of all the components shown in Figure 3.3 are registered. Additionally, the constant multipliers are pipelined. The resulting latency through the circuit is 6 clock cycles. The circuit is designed such that all information to compute the update value is provided at the point at which it is needed. In particular, the F_c and M_c values are loaded 5 clock cycles after the corresponding Θ and v values. In this design, when the pipeline is filled the circuit can produce an updated value every clock cycle.

The constant multipliers were used because they had a much lower latency and were much smaller than traditional multipliers. Using constant multipliers is only possible if the coefficients in Equations 3.1 are fixed. In the case of analyzing the deflection of a uniform beam, these coefficients are constant. These multipliers have the characteristic of having a structure independent of the constant multiplicand. Therefore, if the location in the bitstream of the constant multiplier is known, the values in the FPGA look-up tables (LUTs) could be modified directly to reflect changes in the coefficient. The disadvantage of using constant multipliers is that design optimizations made to the density of the beam would require that a different type of multiplier be used. Also, if the beam were not uniform, the values of v_c and Θ_c would be needed to compute the updated values v̄_c and Θ̄_c. This narrows the usefulness of this design, but it provides an optimized baseline for comparing other designs.

Each PE in the Problem Specific design contains arithmetic logic, a finite state machine (FSM), and a BlockRAM. The BlockRAM contains all of the values for the cells. The FSM controls the addresses from which data is loaded and stored in the BlockRAM. The PE operates most efficiently when the pipeline is filled. When the pipeline is filled, a new set of data needs to be applied each clock cycle, and updated values need to be stored each clock cycle. To accommodate this flow of data, one port of the BlockRAM is devoted to loading data and the other is devoted to storing data (see Figure 3.4).

Figure 3.4: PE layout for Problem Specific design.

The Edge Registers are used to communicate data to neighboring PEs. When the update for a cell on either end of the section of the model for which the PE is responsible is calculated, the new value is stored in the Edge Registers. Each PE has access to these registers in its right and left neighbors. When data is needed from a neighbor, the values are loaded from the Edge Registers instead of from the BlockRAM. For PEs on the boundary of the model, the Edge Registers are connected to constant values.

To implement design optimization, FPGA configuration bitstreams need to be produced for both analysis and design phases. The FPGA would first be loaded with the analysis design, and the configuration would be iterated until the data values converged. After the cells converge, the data needed for the design improvement phase is stored in the internal BlockRAMs. The FPGA would next perform a partial reconfiguration and load the design improvement bitstream, during which the contents of the BlockRAMs would not be changed. In this way, data would be passed between the analysis and design phases.

The results would be extracted from the board through readback. During the readback operation, the FPGA dumps its entire configuration, including flip-flops and BlockRAM contents. Once the contents of the FPGA are dumped, careful filtering of the data would yield the current results. This method negates the need for using specialized hardware to support downloading data.

The residual-update method, described in the Background chapter, can be used in finding the solution for a CA model because it is an iterative improvement problem. The advantage of using this method would be that low precision calculations can be used to generate a high precision result. The reconfiguration between analysis and design phases would provide the opportunity needed for loading updated coefficients to the FPGAs. The result of implementing the residual-update method would be that an 8-bit design could produce results with precisions such as 16 or 32 bits.
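As a plain software reference for the per-cell update of Equation 3.1 (a sketch for this discussion, not the pipelined circuit of Figure 3.3), the function below computes both degrees of freedom from the left and right neighbors plus the force and moment terms. The coefficient and input values are arbitrary placeholders.

```python
def cell_update(vl, thl, vr, thr, Fc, Mc, C0, C1, C2, C3):
    # Equation 3.1: both updates depend only on the neighbors and the external
    # terms, which is what lets the hardware compute them in parallel.
    v_new = (C0 * (vl + vr) + C1 * (thl - thr)) + Fc
    th_new = (C2 * (vl - vr) + C3 * (thl + thr)) + Mc
    return v_new, th_new

# Placeholder coefficients and neighbor values, for illustration only.
print(cell_update(vl=0.5, thl=0.1, vr=0.7, thr=-0.1,
                  Fc=0.02, Mc=0.0,
                  C0=0.5, C1=0.25, C2=0.25, C3=0.5))
```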

3.3 Program Based Design

The Program Based design represents a fundamentally different approach to solving the same analysis problem as the Problem Specific design. The Problem Specific design can perform analysis updates very rapidly because it uses custom hardware. However, using a custom design means that for each new problem an optimized circuit must be designed, and an FPGA configuration must be generated. The overhead of building a custom configuration for each problem could easily erase any speed advantage. On the opposite end of the spectrum, a compiled program running on a general purpose CPU has very low initial overhead, but it cannot take advantage of the inherent parallelism of CA. The Program Based design was developed to bridge the gap between the analysis speed of custom hardware and the flexibility of a general purpose processor.

The first major change, compared to the Problem Specific model, is that the Program Based design executes a program stored in internal BlockRAM to control data accesses and the arithmetic units in the PEs. In the Problem Specific model these operations were performed using a fixed finite state machine. Another significant change is that the control logic is removed from the PEs and placed in a central control unit. The signals are then propagated to the PEs throughout the chip. The third important modification is that the equations are represented in a matrix form to provide a more flexible architecture that can handle a variety of problems. This matrix arithmetic is expressed in the layout of the arithmetic units. The last major difference is that the Program Based model has the capability to compute results in both high precision and low precision forms on the FPGA and then combine the two results.

The goal of flexibility for the Program Based design is reflected in the form of the equations for the model. The hardware is designed to solve problems set up in matrix form. This provides a simpler method to implement, and eventually automate, CA design algorithms. The matrix form of the beam equations is shown in the following equation:
\begin{bmatrix} \bar{v}_c \\ \bar{\Theta}_c \end{bmatrix} =
\begin{bmatrix} C_0 & C_1 \\ C_2 & C_3 \end{bmatrix}
\begin{bmatrix} v_l \\ \Theta_l \end{bmatrix} +
\begin{bmatrix} K_0 & K_1 \\ K_2 & K_3 \end{bmatrix}
\begin{bmatrix} v_c \\ \Theta_c \end{bmatrix} +
\begin{bmatrix} C_4 & C_5 \\ C_6 & C_7 \end{bmatrix}
\begin{bmatrix} v_r \\ \Theta_r \end{bmatrix} +
\begin{bmatrix} F_c \\ M_c \end{bmatrix}
(3.2)

This equation solves the same analysis problem as the Problem Specific design. This is one of a range of two-dimensional problems that can be solved by the Program Based design. Equations can be implemented with any number of terms and are expressed in matrix form. The Problem Specific model only solves problems that can be represented in the form of the beam equations, while the Program Based model has the capability to capture the behavior of a variety of problems.

The complexity of control logic increased greatly in the Program Based model as compared to the Problem Specific model. The finite state machines that controlled the load and store logic of the Problem Specific model are ill-equipped to handle the increase in complexity. The control logic in the Program Based model uses significantly more resources, so it is advantageous to move the control logic to a centralized location. There is a penalty involved in distributing the control signals; however, the size of each PE would more than double if the control logic was not centralized.

The architecture of having a single control unit makes the Program Based design similar to a Single Instruction, Multiple Data (SIMD) parallel computer (see Figure 3.5). Removing the control from the individual PEs is possible because all cells, including boundary cells, can be represented by changing the coefficients in the matrix equation. Historically, there has been a lack of widespread interest in SIMD parallel computers because they are inflexible and require custom processors. However, SIMD machines have been successful in multimedia and DSP applications [?]. These applications involve repetitive calculations that can be performed in parallel, similar to those needed for CA models.

Figure 3.5: Return data chains for Program Based design. The Control Unit (CU) requires feedback from the PEs, for example a flag indicating that calculations are complete. The routing resources around the CU would be consumed quickly because there are a large number of PEs that need to communicate with the CU. To avoid this problem, there are multiple PEs on each return data bus so only the last PE in the chain needs to be routed directly to the CU. The drawback is that extra computational cycles are needed. This is because the returning data takes an extra clock cycle to propagate back to the CU for every link.

Instructions stored in the CU are not like those of a traditional microprocessor. The instructions for a traditional general purpose processor are encoded, while the instructions in this design are stored as a 72-bit word that requires no decoding. The result is that most control signals can be connected directly from the memory in the CU to the PEs (see Figure 3.6). This method of storing instructions has the advantage of being fast while allowing any combination of control signals, achieving maximum parallelism.

Figure 3.6: Control Unit for Program Based design.

The instructions contain two main parts: the flow control portion and the control signals. The flow control portion interacts with the flow control logic in the CU to determine which instruction is executed next. The flow control logic allows for increments to the program counter, branches, and conditional branches. The control signals manage operations in the PEs; these include clearing registers, loading data, and shifting data. The signals for controlling the BlockRAMs in the PEs are fed through address logic to allow absolute and relative address jumps.
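To make the idea of a wide, undecoded instruction concrete, the sketch below packs named control fields into a single 72-bit word. The field names, widths, and offsets here are hypothetical illustrations; the actual signal layout is the one captured in the spreadsheets of Appendix A.

```python
# Hypothetical field layout: name -> (bit offset, width). Widths sum to 72.
FIELDS = {
    "branch_target": (0, 10),
    "flow_ctrl":     (10, 3),
    "bram_addr":     (13, 11),
    "load_enables":  (24, 8),
    "shift_ctrl":    (32, 4),
    "misc":          (36, 36),
}

def pack(signal_values):
    # Each signal is dropped directly into its bit position; there is no
    # encoding or decoding step, matching the "no decoding" property above.
    word = 0
    for name, value in signal_values.items():
        offset, width = FIELDS[name]
        assert 0 <= value < (1 << width), f"{name} out of range"
        word |= value << offset
    return word

word = pack({"flow_ctrl": 0b001, "bram_addr": 42, "load_enables": 0b00001111})
print(f"{word:018x}")   # the 72-bit control word, shown as 18 hex digits
```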

Though there is a plan to automate the process of writing the programs loaded into the control unit, the first programs were written manually. A spreadsheet was used to select the values of every control signal at each time step. The spreadsheet was set up to automatically insert the signal values into the proper bit positions (see Appendix A for an example). The integer equivalent of the binary number is loaded into the control unit memory at compilation time. In the current design, the program cannot be changed at run time.

To understand the reasoning behind the arithmetic logic in the PEs, it is necessary to understand the process of using a residual and an update to calculate results. It is possible to use the residual-update method, described in Chapter 2, to find the solution because the CA solutions are attained by iterative improvement. This method has the advantage of using low precision arithmetic for most calculations. In describing this method, n is the number of bits used in high precision calculations and k is the number of bits used in low precision calculations.

This method works by first calculating the residual, or error, in the equation that is being solved. The residual calculation must be performed in n bits for every cell in the model. The most significant k bits of the residuals are then extracted and stored. The k bits must be taken from the same position in every residual; the largest element in the residual vector dictates which bits are selected. The update equation is then calculated in k bits, and the k-bit version of the residual is used in place of F_c and M_c in Equation 3.2. This k-bit update is performed until the results converge. After the k-bit updates are found, they are added into an accumulated version of the variables at the same offset as the bits that were taken out of the residual. The cycle repeats using the accumulated version of the variables in the residual equation. These iterations are repeated until the accumulated versions of the variables converge. The flow chart in Figure 3.7 shows an analysis cycle using this method. This method is effective because the majority of the time is spent calculating the update in k bits. More parallel arithmetic units can be used to speed up the calculations because the update calculation is performed in k bits.

Figure 3.7: Analysis cycle flow and precision for each operation.

There are three main parts to the PEs used in the Program Based design (as shown in Figure 3.8):

- Multiply Accumulator: calculates the residual in n bits
- Shift Unit: extracts k bits from the n-bit residual and adds the update into the accumulated variables
- Matrix Accumulator: calculates cell updates in k bits

Figure 3.8: Computational unit for Program Based design.

The Multiply Accumulator is simply a multiplication unit and an adder with the registered version of its output connected to one of its inputs (see Figure 3.9). There is only one multiply accumulator per PE because it uses n-bit arithmetic, and these n-bit precision units are large. The Multiply Accumulator takes advantage of the built-in 18x18 multiplier units on the Virtex-II FPGAs to save resources. The minimization of the hardware results in multiple clock cycles being needed to compute residual values. The latency through each unit is one clock cycle, so the pipeline is two stages. For the equation proposed in the beginning of this section, it takes 16 clock cycles to calculate the residuals for one cell. The expense of the residual calculation is tolerable because many update calculations are performed between residual calculations.

Figure 3.9: Multiply accumulator used in computational unit.

After the residual is calculated it must be converted to a k-bit number. During the residual calculation, the most significant bit of the largest residual value is found. There is a mechanism in each PE that stores the absolute value of the largest residual calculated. This value is passed along the return data chain until it arrives at the control unit. Each PE performs a logical OR on the value passed to it and the largest value it has calculated.

This process destroys the actual value of the largest residual, but the number passed to the control unit shows the position of the most significant ‘1’. This position is used to determine which bits of the residual are stored for the update phase.
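The two steps just described, OR-ing the residual magnitudes along the return chain to locate the most significant ‘1’ and then taking the same k-bit window from every n-bit residual, can be sketched as follows. The bit widths, the extra bit reserved for the sign, and the sample residual values are assumptions made for illustration.

```python
K_BITS = 8   # assumed low precision k; the residuals themselves are wider (n bits)

def msb_position(residuals):
    combined = 0
    for r in residuals:          # models the OR-ing of |residual| along the chain
        combined |= abs(r)
    return combined.bit_length() - 1 if combined else 0

def extract_window(residuals, k=K_BITS):
    msb = msb_position(residuals)
    shift = max(msb - (k - 2), 0)    # keep k-1 magnitude bits plus a sign bit
    # Python's >> is an arithmetic shift, a reasonable model of the hardware shifter.
    return [r >> shift for r in residuals], shift

residuals = [-1200, 345, 27, -9000]          # example n-bit residuals
window, shift = extract_window(residuals)
print(window, "shift =", shift)              # the shift is reused when accumulating the update
```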

Figure 3.10: MSB data return chain, used for determining the most significant ‘1’ of residuals.

The logic to extract the bits is based on a multiplexer, a register, and a right shifter. The multiplexer selects between an input from memory and a right-shifted version of the value stored in the register. The value output from the multiplexer is loaded into the register. During the bit shifting phase, the register is first loaded with the n-bit value. The right-shifted value is then selected from the multiplexer, and the value is looped through the right shifter until the desired bits are in the lowest positions. The number of clock cycles required depends on the number of bit positions the value needs to be shifted.

Figure 3.11: Unit for shifting the precision of intermediate results.

The adder, after the shift logic, is used during the addition phase at the end of the outer analysis cycle. During the addition phase, the update is loaded into the highest bits and shifted to the correct position. It is then added to the previous value, which is read from memory. A signal from the control unit selects which value is output from the unit.

The final piece of the PE for the Program Based design is the Matrix Accumulator. The Matrix Accumulator is similar to the Multiply Accumulator unit, except the arithmetic is performed in k bits and more hardware is used to speed up calculations. The unit is designed specifically to be able to multiply a 2x2 matrix by a 2x1 matrix. For example, Figure 3.12 shows the circuit calculating the equation:

\begin{bmatrix} \bar{v} \\ \bar{\Theta} \end{bmatrix} =
\begin{bmatrix} C_0 & C_1 \\ C_2 & C_3 \end{bmatrix}
\begin{bmatrix} v \\ \Theta \end{bmatrix}
(3.3)

Figure 3.12: Matrix accumulator used for analysis updates.

The multiplier has a three clock cycle latency and is fully pipelined. The entire unit has a latency of five clock cycles. The update for each cell using the matrix version of the beam equations, described earlier in this section, takes 9 clock cycles. However, when the pipeline is filled, the circuit can produce an update every five clock cycles, and this circuit calculates the update for both analysis variables simultaneously.

Every PE can select to have the input to Port B of its memory connected directly to the output of Port A of its left or right neighbor. In this way, PEs transfer data about the cells on the edge of the section of the CA model for which they are responsible. This system is also used to upload and download data from the FPGA. The PE that calculates the values for the cells on the left end of the model can read data from the PCI bus, while the PE that calculates the values for cells on the right end of the model can write data to the PCI bus.

Figure 3.13: Data flow for uploading and downloading data to FPGAs.

To upload coefficients and external forces, as well as to initialize variable values, the host computer begins by writing the data destined for the rightmost PE into the memory of the leftmost PE. The data is then shifted through all the PEs until it reaches the proper place; at the same time, new data is shifted into the leftmost PE. Downloading the results is a similar process: it involves shifting the data right and reading it off the rightmost PE. An external clock is used to keep data transfers synchronized.

Although reconfiguration is not part of the analysis cycle, the implementation of the system for performing design optimization will use reconfiguration in a number of ways. Each analysis model has fixed connections for communicating among PEs. It is possible to pass data through intermediate PEs to transfer data between PEs that are not directly connected. However, to achieve maximum efficiency, communication should be done over direct connections when possible. The design system will have a number of different analysis models, each with a different communication pattern. When the user specifies the initial problem, the system will select the bitstream for the most appropriate model and load it into the FPGAs.

Design optimization may be performed in a number of different ways. The first possible technique is to use reconfiguration. A bitstream developed to perform design optimization could be loaded on the FPGAs using partial reconfiguration. The data would be passed between analysis and design models through the BlockRAMs, like the method proposed for the Problem Specific model. Another technique would be to use the analysis model to perform the design calculations. Design would require new coefficients, which could be loaded into the FPGA using the uploading and downloading method described earlier. The disadvantage of this method is that the analysis design might not be capable of performing all of the operations needed, or the operations may be very inefficient. The final possibility is to use a Virtex-II Pro FPGA, which contains built-in PowerPC processors. These internal processors could be used to run a program to calculate the new design values.

Chapter 4

Results

4.1 Problem Formulation

The results in this section are based on solving the analysis of a CA model of a one-dimensional beam. The model is formulated from work by researchers at Virginia Tech [?]. The beam is divided into cells that have two degrees of freedom, vertical displacement (w) and rotation (q). Each cell also has a separate vertical thickness, which is the design variable. The thickness of the beam is specified at the middle of each cell, and then linearly interpolated in between the specified points (see Figure 4.1). Cells in the model are evenly distributed along the beam.

Figure 4.1: Diagram of CA model for performing analysis on a beam.

There are a number of possible configurations for each cell. The cell can have a fixed displacement, a fixed rotation, a fixed displacement and rotation, or it can be free in displacement and rotation. External forces can be applied to any cell. The forces can be in the form of a vertical force (F) or a bending moment (M). These different configurations are represented by changing the coefficients in the equation that is solved by each model. Using these available cell configurations, many classical static beam problems can be solved.

Figure 4.2: Beam analysis problem modeled on the configurable computer.

The CA model, shown in Figure 4.2, was modeled with 20 cells and was run on both the Problem Specific and Program Based designs. Twenty cells were chosen so the model could be quickly simulated. The first cell in the model is a dummy cell, for which no computations are performed. The cells (1 and 19) on the ends of the beam have fixed displacement and rotation. Cell 14 has a fixed vertical displacement. All other cells in the model are free in displacement and rotation. There is a vertical force pushing up on cell 9. The force is scaled to produce a maximum displacement of slightly less than 127, so the result can be represented in 8 bits.
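Expressed as plain data, the test problem above might be set up as in the sketch below. The cell indices follow the description in the text; the force magnitude is a placeholder, since the thesis states only that it is scaled so the peak displacement stays just under 127.

```python
NUM_CELLS = 20

cell_type = ["free"] * NUM_CELLS
cell_type[0] = "dummy"                         # no computations performed
cell_type[1] = cell_type[19] = "fixed_v_and_rotation"
cell_type[14] = "fixed_v"                      # fixed vertical displacement only

force = [0.0] * NUM_CELLS                      # vertical force F per cell
moment = [0.0] * NUM_CELLS                     # bending moment M per cell
force[9] = 1.0                                 # placeholder magnitude, scaled in practice

for i, t in enumerate(cell_type):
    if t != "free":
        print(i, t)
```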


4.2

Problem Specific Design Results

The designs presented in the implementation section were intentionally developed independently of any fixed precision for the results. There is a trade-off between the number of bits of the solution that will be calculated and the number of cells that can be represented in the system. In addition, larger precision results in a lower maximum clock frequency and/or an increased pipeline length. Another factor in changing the precision is memory access. The BlockRAMs have programmable port widths that can accommodate some changes in precision. The BlockRAMs' two ports can each handle up to 36 bits and can independently read or write. Once this data transfer limit is exceeded, accessing the needed data takes multiple clock cycles or more memories must be used in the design. During development, multiple versions of the Problem Specific design were built that used different precisions for the calculations. Figure 4.3 shows the growth of the size of a PE as the precision of the calculations is increased. The graph shows that the size grows rapidly as the number of bits is increased, which makes it very important to use the least precision possible for the calculations. From this data, and based on the beam problem being solved, 8 bits of precision was chosen as likely to be the most effective. In most of the following analysis, an 8-bit model was studied for the Problem Specific design. However, this precision could change based on the type of problem being solved and the number of cells in the model. In this respect, the Problem Specific design has more flexibility with regard to precision than the Program Based design because the Problem Specific design is custom-made for each problem. Once 8 bits was selected for the precision, the number of PEs had to be chosen. The maximum number of PEs that can fit on an FPGA is limited by the programmable logic and routing resources on the chip. However, when the chip usage gets high, the routing becomes inefficient and the maximum frequency at which the circuit can be clocked drops rapidly.
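The memory-access constraint can be checked with a quick calculation. The sketch below simply divides the number of bits a PE must move per access by the 36-bit port width; the assumption that all of a cell's values go through a single port is made only for illustration and may not match the actual memory layout of the designs.

    #include <cstdio>

    // Rough check of how many clock cycles one 36-bit BlockRAM port needs to
    // deliver a cell's data at a given precision (hypothetical layout).
    int cycles_per_access(int values_per_cell, int bits_per_value, int port_width = 36) {
        int total_bits = values_per_cell * bits_per_value;
        return (total_bits + port_width - 1) / port_width;   // ceiling division
    }

    int main() {
        // e.g. two 8-bit degrees of freedom fit in one access,
        // but four 12-bit values would need two accesses.
        std::printf("%d\n", cycles_per_access(2, 8));    // prints 1
        std::printf("%d\n", cycles_per_access(4, 12));   // prints 2
        return 0;
    }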


Figure 4.3: Precision of PE (bits) vs. percent chip utilization (based on Virtex II-4000) for Problem Specific design.


Figure 4.4: % Utilization of FPGA and maximum clock frequency vs. number of PEs for Problem Specific design.

Figure 4.4 shows the chip utilization and the maximum clock frequency versus the number of PEs. The number of logical cells that can be represented increases linearly with the number of PEs. However, the maximum clock frequency decreases gradually as the number of PEs increases, then drops quickly after the 35th PE. The Problem Specific and Program Based designs vary widely in the number of cells they can represent and in the precision of the result. In order to compare these differing designs, the maximum number of cell updates per second is used as a metric. This is also the metric used to determine the speed-up over a program running on a general-purpose processor. The number of cell updates per second for the Problem Specific design is simply the number of PEs multiplied by the maximum clock frequency, because in the Problem Specific implementation each PE produces a result every clock cycle during analysis.
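Written as a formula (the symbols N_PE and f_max are introduced here only for convenience):

\[
\text{cell updates per second} = N_{\mathrm{PE}} \times f_{\mathrm{max}}
\]

where N_PE is the number of PEs on the chip and f_max is the maximum clock frequency. For example, 35 PEs clocked at 64.5 MHz give roughly 35 x 64.5 x 10^6, or about 2.3 billion, cell updates per second.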


Figure 4.5: Cell updates per second vs. number of PEs for the Problem Specific design.

Figure 4.5 shows a peak in the maximum number of cell updates when 35 PEs are on the chip. With 35 PEs, the 8-bit design has a maximum clock frequency of 64.5 MHz. In comparison, a 12-bit model with 35 PEs cannot fit on the Virtex-II 4000 FPGA. There are additional factors that limit the cell updates per second that can actually be achieved on the system (see Table 4.1). The most costly factors are the reconfiguration and readback times on the FPGA and the time it takes the host to compute the coefficients for a given design. These times are important because the communication with the host is done through reconfiguration and readback.

Operation            1 PE    1 FPGA    DINI Board
Reconfig             N/A     1190      4760
Readback             not yet implemented in the DINI API
Host computations    1.11    39        156

Table 4.1: Times (ms) for operations associated with the Problem Specific analysis cycle for the DINI board.

These results are dependent on the design being able to produce accurate analysis results. The problem described earlier in this chapter in the Problem Formulation section (see Figure 4.2) was modeled on the Problem Specific design. The force was scaled so the result could be represented in 8 bits. The actual results were calculated using a C++ program running on a PC, which used floating-point arithmetic for all calculations. The results show (see Figure 4.6) that the system was able to produce results that were similar to, but not exactly, the correct values. The mean error between the actual results and the results attained from the Problem Specific model was 38.4% for displacement and 41.4% for rotation. This large error is due to the rounding that takes place because fixed-point arithmetic is used. If better accuracy is needed, the answer could be improved by using reconfiguration and the residual-update method for iterative improvement.
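The error figures above are a mean of per-cell percent errors against the floating-point reference. The helper below shows one plausible way to compute such a figure; normalizing each cell by its reference value and skipping zero-valued cells are assumptions, since the thesis does not spell out the exact formula.

    #include <cmath>
    #include <cstddef>
    #include <vector>

    // Mean percent error between fixed-point results and a floating-point
    // reference (assumed normalization; illustration only).
    double mean_percent_error(const std::vector<double>& reference,
                              const std::vector<double>& fixed_point) {
        double sum = 0.0;
        std::size_t n = 0;
        for (std::size_t i = 0; i < reference.size(); ++i) {
            if (reference[i] == 0.0) continue;   // skip cells with a zero reference value
            sum += std::fabs((fixed_point[i] - reference[i]) / reference[i]) * 100.0;
            ++n;
        }
        return n ? sum / n : 0.0;
    }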

4.3

Program Based Design Results

The Program Based design has much the same sensitivity to precision as the Problem Specific design, but the Program Based design does not have quite as much flexibility in terms of precision. Because the full-precision calculations of the residual rely on the built-in 18x18 multipliers, it is difficult to increase the full-precision result to more than 18 bits. However, there is some flexibility in the precision of the update calculations.

Figure 4.6: Actual results and results from Problem Specific design for beam problem (displacement in mm and rotation, plotted against position in m).


Figure 4.7: Precision of PE vs. % Utilization of FPGA for Program Based design.

Figure 4.7 shows how the size of the PE grows as the precision of the update arithmetic is increased. The growth is similar to that of the Problem Specific model. Based on this data, 6 bits was selected for the precision of the update calculations. The 6-bit design attains the maximum number of cell updates per second with 60 PEs, at a maximum clock frequency of 94.8 MHz; if the same design used 8 bits of precision, the maximum clock frequency would be 88.1 MHz. The precision selected for the update is also a trade-off between having smaller update units and having to perform the outer iteration more often. When the precision of the update calculation is larger, more inner iterations can be performed before new residuals need to be calculated. Using a smaller precision has the advantage of being able to devote more, smaller units to calculating the update. The number of clock cycles needed for each phase of the analysis cycle is shown in Table 4.2.

Analysis Cycle Phase    Clock Cycles
Residual Calc           550
Shift Residual          330-990
Cell Update             190 * inner iterations
Add                     410-1135

Table 4.2: Clock cycles for the different phases of the residual-update analysis cycle.

As the number of inner iterations during each analysis cycle increases, the Program Based model becomes more efficient. Figure 4.8 shows the efficiency of the design versus the number of inner update iterations per residual calculation. For this graph, the average number of cycles for the Shift and Add phases of the analysis cycle was used. When the number of inner iterations is below 10, more than half the time is spent calculating the residual or shifting the results. However, the model rapidly becomes more efficient: with 35 inner iterations per analysis cycle the design achieves 75% efficiency, and at 90 inner iterations the efficiency is 90%. The number of iterations needed will depend on the type of problem and the number of cells in the model. The total number of cell updates per second that can be performed with the whole chip depends on the number of PEs on the FPGA. The maximum number of cell updates per second for the Program Based design is achieved slightly before the chip resources are exhausted because of routing congestion (see Figure 4.9). Figure 4.10 shows how the maximum number of cell updates per second rises and then falls off as the number of PEs is increased. The Program Based model depends on communication with a host through the PCI bus in the current design. Before the calculation can begin, the coefficients for a specific design need to be loaded into each of the PEs. Then, after the analysis is complete, the results need to be downloaded back to the host. The transfer times for these operations are shown in Table 4.3.
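A rough model of this efficiency curve can be written directly from the clock-cycle counts in Table 4.2. The sketch below assumes efficiency is the fraction of cycles spent in the Cell Update phase, with the Shift and Add phases taken at the midpoint of their ranges; this is an interpretation, not the exact model used to produce Figure 4.8.

    #include <cstdio>

    // Approximate fraction of each analysis cycle spent performing cell updates,
    // using the cycle counts from Table 4.2 (Shift and Add taken at their midpoints).
    double efficiency(int inner_iterations) {
        const double residual = 550.0;
        const double shift    = (330.0 + 990.0) / 2.0;     // Shift Residual, midpoint
        const double update   = 190.0 * inner_iterations;  // Cell Update
        const double add      = (410.0 + 1135.0) / 2.0;    // Add, midpoint
        return update / (residual + shift + update + add);
    }

    int main() {
        // Roughly 0.77 at 35 inner iterations and 0.90 at 90, close to Figure 4.8.
        std::printf("%.2f %.2f\n", efficiency(35), efficiency(90));
        return 0;
    }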


Figure 4.8: Efficiency (percentage of time spent calculating updates) vs. number of inner iterations per analysis cycle.


Figure 4.9: % Utilization of FPGA and maximum clock frequency vs. number of PEs for Program Based design.



Figure 4.10: Cell updates per second vs. number of PEs for Program Based design.

Operation                 1 PE     1 FPGA    DINI Board
Uploading Coefficients    2.10     114       228
Downloading Results       0.311    18.5      37
Host Computations         0.360    21.5      86.0

Table 4.3: Times (ms) for operations associated with analysis on the Program Based design.

This communication is synchronized by an external clock supplied by the host. There is additional overhead due to the computation of the analysis coefficients on the host for each design. The problem shown in Figure 4.2 was modeled on the Program Based design. This was the same model run on the Problem Specific design, except that the force was scaled up so that the maximum result was closer to an 18-bit number. The results were again compared to a C++ simulation that used floating-point arithmetic for all calculations. Figure 4.11 shows that the results attained from the Program Based model were very close to the actual results. The mean percent error between the Program Based model data and the actual results was 0.099% for displacement and 0.118% for rotation. The results from the Program Based model were much more accurate than the results from the Problem Specific model because the Program Based model uses 18 bits of precision during the residual calculations; its speed comes from the use of only 6 bits for the update calculations. The external force was scaled so the maximum displacement would be close to the 18-bit limit, and after the computation the results were scaled back down to match the original problem.
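The scaling step can be illustrated with a small sketch. The names and the source of the expected peak displacement are hypothetical; the thesis only states that the force was chosen so the peak displacement approached the 18-bit limit and that the results were scaled back afterward.

    #include <cstdint>
    #include <vector>

    // Choose a factor so the expected peak displacement approaches the top of the
    // signed 18-bit range; the same factor is applied to the external force before
    // upload (not shown) and removed from the results afterward. The peak estimate
    // is assumed to come from a quick floating-point solve or a prior run.
    struct Scaling {
        double factor;
    };

    Scaling choose_scaling(double expected_peak_displacement) {
        const double limit = (1 << 17) - 1;   // maximum positive 18-bit value
        return {limit / expected_peak_displacement};
    }

    // Scale fixed-point results back down to the units of the original problem.
    std::vector<double> unscale(const std::vector<int32_t>& fixed_results,
                                const Scaling& s) {
        std::vector<double> out;
        out.reserve(fixed_results.size());
        for (int32_t v : fixed_results)
            out.push_back(static_cast<double>(v) / s.factor);
        return out;
    }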


Figure 4.11: Actual results and results from Program Based design for beam problem (displacement in mm and rotation, plotted against position in m).

Operation           1 PE    1 FPGA    DINI Board
Problem Specific    65.1    2279      9116
Program Based       18.9    1137      4548

Table 4.4: Maximum cell updates per second (millions) for both implementations.

Operation           Maximum Cell Updates per Second (millions)    Speed-up
PC                  48.9                                           -
Problem Specific    9116                                           186.4
Program Based       4548                                           93.0

Table 4.5: Maximum cell updates per second and speed-up for both implementations compared to the PC.

4.4

Comparison of Designs

Most of the results presented in the earlier sections were for systems using one FPGA. However, the DINI board intended for this system has four FPGAs, and the total computing power increases linearly because all the FPGAs can run in parallel. Table 4.4 shows the maximum number of updates for each design at each level of the system: one PE, one FPGA, and the full DINI board. To compare these FPGA-based designs to conventional methods, a C++ program was written to calculate the results. To make the comparison as fair as possible, the program uses integer arithmetic instead of slower floating-point arithmetic; integer computations are closer to the fixed-point math used by the FPGA designs. The program was compiled using GCC with optimization enabled and executed on a PC with a 1.7 GHz processor and 1 GB of RAM running Debian Linux. Table 4.5 shows the speed-up attained by the FPGA designs over the general-purpose processor version.
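The speed-up figures in Table 4.5 follow directly from the cell-update rates, as the small check below illustrates.

    #include <cstdio>

    // Speed-up of each FPGA design over the PC version, computed from the
    // maximum cell-update rates (millions per second) in Tables 4.4 and 4.5.
    int main() {
        const double pc = 48.9;
        const double problem_specific = 9116.0;   // DINI board, Problem Specific
        const double program_based    = 4548.0;   // DINI board, Program Based
        std::printf("Problem Specific: %.1fx\n", problem_specific / pc);  // ~186.4x
        std::printf("Program Based:    %.1fx\n", program_based / pc);     // ~93.0x
        return 0;
    }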

Chapter 5 Conclusions

5.1

Summary

Cellular Automata (CA) theory has been studied for decades, with most of the work done on modeling natural systems. Recently, the use of CA theory has been extended to provide a system for structural analysis and design optimization. This work has proved to be successful, but the calculations are very slow on traditional general-purpose processors. The parallel nature of these CA models makes them a good candidate for implementation on a reconfigurable computer. The work presented in this thesis shows the initial steps toward making an automated tool to perform structural design optimization accelerated by a reconfigurable computer. The contribution of this thesis was to design and implement two models for the analysis phase of the CA structural design optimization cycle. Both designs take advantage of the parallel nature of cellular automata by using a distributed array of processing elements. For the Problem Specific implementation, these processing elements are customized to each problem. The Program Based implementation has a more flexible processing unit that is controlled by a program designed to simulate a specific cellular automata model.


The Program Based implementation also has the built-in capability to use a residual-update method to accelerate calculations and improve accuracy.

5.2

Results

The results show that the Problem Specific design and the Program Based design were able to generate cell updates at rates of 9.12 and 4.55 billion per second, respectively. Though the Problem Specific design proved able to generate updates more rapidly, this increase in speed came at the expense of precision and flexibility. The Program Based model's competitive speed, improved accuracy, and ability to handle a range of update rules make it the architecture with the most potential for an automated system. Both hardware implementations of the CA model for structural analysis were very successful in terms of performance. When compared to a 1.7 GHz Pentium 4 processor, the Problem Specific design proved to be 186 times faster. The Program Based design, while slower, was still 93 times faster than the general-purpose processor version. These speed-ups are a step toward making a CA system for structural design optimization that significantly outperforms traditional methods.

5.3

Future Work

There are a number of interesting areas that need to be studied in order to design an automated tool for performing structural design optimization using CA. The most immediate may be the need for a translator and compiler for the programs used by the Program Based design. For the work in this thesis, the programs were all written by hand, a process that was very difficult and time-consuming. A compiler is needed to take a higher-level abstraction and generate machine-level instructions. The end product should be a compiler that can take a problem specified by a design engineer and produce the necessary instructions without requiring detailed knowledge of the hardware implementation.

Additionally, an efficient method for design calculations must be implemented. In the current system, the results are downloaded to a host computer where the design calculations can be performed. However, this is an inefficient technique. A number of possibilities exist for executing the design calculations on the board, such as using partial reconfiguration or on-board processors. These possibilities need to be investigated to identify the best method. Another area that needs work is implementing multi-grid on the system. The multi-grid approach to these CA problems would be to calculate results while varying the resolution of the grid; in other words, the number of cells representing the system would increase and decrease based on certain algorithms. Multi-grid could also be used to blend the analysis and design steps into a single cycle. These methods have the potential for large reductions in the number of calculations needed.
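As a rough illustration of what varying the grid resolution might look like for the one-dimensional beam model, the sketch below coarsens and refines a field of cell values by averaging and linear interpolation. These operators are a common choice in multi-grid methods but are only an assumption here; the thesis does not specify them.

    #include <cstddef>
    #include <vector>

    // Coarsen a 1-D field by averaging pairs of neighboring cells (restriction).
    std::vector<double> coarsen(const std::vector<double>& fine) {
        std::vector<double> coarse;
        for (std::size_t i = 0; i + 1 < fine.size(); i += 2)
            coarse.push_back(0.5 * (fine[i] + fine[i + 1]));
        return coarse;
    }

    // Refine a 1-D field by linear interpolation between coarse cells (prolongation).
    std::vector<double> refine(const std::vector<double>& coarse) {
        std::vector<double> fine;
        for (std::size_t i = 0; i < coarse.size(); ++i) {
            fine.push_back(coarse[i]);
            if (i + 1 < coarse.size())
                fine.push_back(0.5 * (coarse[i] + coarse[i + 1]));
        }
        return fine;
    }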

Appendix A

This appendix gives an example of how the programs for the Program Based model are written. The Processing Elements (PEs) in the Program Based model were developed to perform high- and low-precision arithmetic and convert between the two forms. The control logic needed to simulate a cellular automata model is complex, so programs are used to set the control signals. The program is stored in an internal BlockRAM contained in the Control Unit (CU). As the PEs were designed, the control signals for each unit were assigned to particular bits of the BlockRAM in the CU. The positions of the bits and a short description of their function were recorded in an Excel spreadsheet. Figure A.1 shows a screenshot of this spreadsheet. Each phase of the analysis cycle was written as a separate program. There are control signals for each functional unit of the PE, but only one functional unit is in use during each phase. The first step in writing a program was to identify the signals needed for the particular phase. The pertinent signals were placed across the top of a spreadsheet and the value of each signal was specified below. Each horizontal line represents a clock cycle step. Figure A.2 shows an example of a program. This particular program calculates the update during the inner iteration of the analysis cycle. The signals on the left are used to determine the order in which instructions are executed.



Figure A.1: Spreadsheet with position of control signals and short description.

The program counter increments by one unless a loop is specified. The remaining signals control the PEs. Signals with only two options are specified as Y or N; signals with more options are specified as a number or letter from a certain set. A second spreadsheet determines the numerical value for each signal. The values of the signals are then converted into an intermediate form, which is the numerical value of the signal multiplied by two to the power of its bit position. The final value is the sum of all the intermediate values (see Figure A.3); this is the number that is loaded into the CU BlockRAM. These final values are then put in a form that can be read into memory (see Figure A.4).
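The conversion from signal values to the final control word is simple binary packing, as the following sketch shows. The signal fields and bit positions used in the example are invented for illustration; the real assignments are recorded in the spreadsheet of Figure A.1.

    #include <cstdint>
    #include <vector>

    // A control signal occupies a field starting at a given bit position.
    // Names and positions here are hypothetical; the real ones come from the
    // spreadsheet that documents the CU BlockRAM layout (Figure A.1).
    struct Signal {
        int      bit_position;   // lowest bit of the field in the control word
        uint64_t value;          // numerical value assigned for this clock cycle
    };

    // Intermediate form: value * 2^bit_position; the final control word for one
    // clock cycle is the sum (equivalently, the OR) of all intermediate values.
    uint64_t control_word(const std::vector<Signal>& signals) {
        uint64_t word = 0;
        for (const Signal& s : signals)
            word += s.value << s.bit_position;   // value * 2^bit_position
        return word;
    }

    // Example: three hypothetical signals packed into one word.
    // uint64_t w = control_word({{0, 1}, {1, 0}, {4, 3}});  // = 1 + 0 + 48 = 49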


Figure A.2: Spreadsheet containing update program

Figure A.3: Spreadsheet converting signals to the form used by the Program Based model


Figure A.4: Spreadsheet containing the data values in a form that can be loaded into memory.

Vita

Thomas Hartka was born in June 1980 in Baltimore, Maryland. He attended Archbishop High School in Severn, Maryland. Thomas enrolled in the College of Engineering at Virginia Tech in the fall of 1998. He graduated Cum Laude with a Bachelor of Science in Computer Engineering. Thomas chose to remain at Virginia Tech to pursue his Master's degree and became involved in research at the Virginia Tech Configurable Computing Lab. After graduating, Thomas will attend Johns Hopkins' Post-Baccalaureate Premedical Program.
