IEICE TRANS. ELECTRON., VOL.E85–C, NO.3 MARCH 2002


PAPER

Special Issue on Signals, Systems and Electronics Technology

A Custom VLSI Architecture for the Solution of FDTD Equations

Pisana PLACIDI†a), Leonardo VERDUCCI†, Guido MATRELLA††, Luca ROSELLI†, and Paolo CIAMPOLINI††, Nonmembers

Manuscript received August 3, 2001. Manuscript revised October 30, 2001.
† The authors are with the Dipartimento di Ingegneria Elettronica e dell’Informazione, University of Perugia, Italy.
†† The authors are with the Dipartimento di Ingegneria dell’Informazione, University of Parma, Italy.
a) E-mail: [email protected]

SUMMARY   In this paper, the characteristics of a digital system dedicated to the fast execution of the FDTD algorithm, widely used for electromagnetic simulation, are presented. Such a system is conceived as a module communicating with a host personal computer via a PCI bus, and is based on a VLSI ASIC which implements the “field-update” engine. The system structure is defined by means of a hardware description language, allowing the high-level system specification to be kept independent of the actual fabrication technology. A virtual implementation of the system has been carried out by mapping such a description, in a standard-cell style, onto a commercial 0.35 µm technology. Simulations show that a significant speed-up can be achieved with respect to state-of-the-art software implementations of the same algorithm.
key words: FDTD method, digital arithmetic, very-large-scale integration, VHDL language

1. Introduction

In recent years, the FDTD method [1] has become one of the most widely used computational techniques for the full-wave analysis of electromagnetic phenomena. A number of inherent characteristics make such an algorithm almost ideal for the analysis of a wide class of microwave and high-frequency circuits [2]–[4]; its application to practical, three-dimensional problems, however, is often limited by the demand for very large computational resources. In its basic formulation, in fact, the discretization procedure exploits an orthogonal lattice featuring N = Nx × Ny × Nz mesh cells, onto the edges of which the vector components of the electric and magnetic fields are mapped. The total memory required to store the fields and the update coefficients thus increases as O(N), while, to first order, one can roughly assume the total number of time iterations to increase as O(N^{1/3}) [2]. The number of floating-point operations of the FDTD algorithm thus increases as O(N^{4/3}). With respect to different discretization schemes, such a figure is definitely a favorable one; nevertheless, N still comes from a fully 3D volume discretization, so that the computational requirements quickly increase with the domain complexity.
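To make the scaling concrete (a worked instance of the estimates above, with illustrative numbers of our own): doubling the mesh resolution along each axis multiplies N by 8, so that

$$
\text{memory} \propto N \;\rightarrow\; 8\times, \qquad
\text{operations} \propto N^{4/3} \;\rightarrow\; 8^{4/3} = 16\times .
$$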

Thanks to microelectronic technologies, inexpensive RAMs and high-speed processors have become available, allowing practical engineering applications of the FDTD method to be tackled. Recent microprocessors exploit RISC (Reduced Instruction Set Computer) architectures: by using short, fixed-length instructions and relatively simple addressing modes, pipelined and superscalar architectures can effectively be implemented, allowing large data throughputs to be achieved. Deep-submicron technologies allow fast and powerful floating-point units to be embedded within the processor, resulting in high computational speed. However, the actual performance which can be achieved by means of high-level programming depends strongly on the operating system, the language compiler and the program structure [5].

On the other hand, the FDTD algorithm, taking advantage of the simplicity and symmetry of the discretized Maxwell’s equations, exhibits some features which make a hardware implementation appealing. The algorithm core, which takes care of the time-varying computation of the components of the electric and magnetic fields, consists of six triple-nested loops. Field-update loops are independent of each other, and can be formulated without any branch, so that parallelization and pipelining techniques can be straightforwardly exploited to obtain better performance. Moreover, due to the intrinsic nature of a finite-difference algorithm, updating a field component at a given location only involves a limited number of variables, located at nearest-neighboring positions. By using suitable numbering techniques, simple and regular coefficient patterns can be obtained, so that the data-communication bandwidth can be kept under control.

Vector and massively-parallel computers can thus be exploited to achieve fast implementations of the FDTD algorithm [5]. The basic drawback of such an approach consists of the huge hardware costs, which limit the development and diffusion of supercomputing facilities. The same characteristics recalled above, however, can be exploited to implement an efficient (yet scalar) custom processor, at a fraction of the cost of a general-purpose supercomputer.

In this paper, the architecture of a digital system is described, dedicated to the solution of the FDTD algorithm on a 3D domain. The system architecture is


built around a custom VLSI chip, which takes care of the floating-point operations needed to carry out the field updates: a superscalar, pipelined FPU has been designed for this purpose. The system is conceived as a PCB module, communicating with a host personal computer via a PCI bus. To speed up operations, communications with the host computer are kept to a minimum, and dedicated synchronous DRAM banks are exploited for the storage of variables and coefficients: thanks to the highly regular data pattern coming from the structured discretization grid, efficient memory-scan protocols can be exploited.

A virtual implementation of the system has been carried out by specifying the whole architecture through the VHDL hardware description language; in particular, the physical implementation of the custom FPU has been carried out by mapping the network onto a commercial technology. By adopting a standard-cell design style, a reliable timing characterization of the circuit can be obtained by means of simulation, and the overall performance of the system can thus be estimated. Expectations are that a significant speed-up, with respect to state-of-the-art software implementations of the FDTD algorithm, can be achieved.

This paper is organized as follows: in Sect. 2 below, the basics of the FDTD algorithm, applied to the solution of Maxwell’s equations, are reviewed; Sect. 3 illustrates the system architecture and details the structure of the custom floating-point unit. Some results of the system simulation are shown in Sect. 4, aimed at estimating the actual computational performance. Conclusions are eventually drawn in Sect. 5.

2. The FDTD Algorithm

Maxwell’s equations in an isotropic medium can be written as follows:

$$
\frac{\partial \vec{B}}{\partial t} + \nabla \times \vec{E} = 0, \qquad
\frac{\partial \vec{D}}{\partial t} - \nabla \times \vec{H} = -\vec{J}, \qquad
\vec{B} = \mu \vec{H}, \qquad
\vec{D} = \varepsilon \vec{E}
\tag{1}
$$

where the symbols have their usual meaning, and J, µ, and ε are assumed to be given functions of space and time. According to the FDTD discretization method, the propagation domain is divided into cells, each cell being independently characterized in terms of material properties. Electromagnetic field components are mapped onto cell edges as shown in Fig. 1, and Eqs. (1) are discretized, in both time and space, at each cell, using the (central) finite-difference method. Referring to a given magnetic-field component (for instance Hx), the update equation reads:

Fig. 1   Mapping of field components on Yee’s mesh.

$$
H_x\big|^{\,n+\frac{1}{2}}_{i,\,j+\frac{1}{2},\,k+\frac{1}{2}} =
H_x\big|^{\,n-\frac{1}{2}}_{i,\,j+\frac{1}{2},\,k+\frac{1}{2}}
+ c_1 \left( E_z\big|^{\,n}_{i,\,j+1,\,k+\frac{1}{2}} - E_z\big|^{\,n}_{i,\,j,\,k+\frac{1}{2}} \right)
+ c_2 \left( E_y\big|^{\,n}_{i,\,j+\frac{1}{2},\,k} - E_y\big|^{\,n}_{i,\,j+\frac{1}{2},\,k+1} \right)
\tag{2a}
$$

whereas the update equation for an electric-field component (for instance Ex) becomes:

$$
E_x\big|^{\,n+1}_{i+\frac{1}{2},\,j,\,k} =
E_x\big|^{\,n}_{i+\frac{1}{2},\,j,\,k}
+ c_3 \left( H_z\big|^{\,n+\frac{1}{2}}_{i+\frac{1}{2},\,j+\frac{1}{2},\,k} - H_z\big|^{\,n+\frac{1}{2}}_{i+\frac{1}{2},\,j-\frac{1}{2},\,k} \right)
+ c_4 \left( H_y\big|^{\,n+\frac{1}{2}}_{i+\frac{1}{2},\,j,\,k-\frac{1}{2}} - H_y\big|^{\,n+\frac{1}{2}}_{i+\frac{1}{2},\,j,\,k+\frac{1}{2}} \right)
\tag{2b}
$$

Here, the standard Yee notation has been adopted: the superscripts refer to the time iteration, whereas the subscript triplets indicate the spatial location of the field components. In Eqs. (2a) and (2b), the coefficients c1, c2, c3 and c4 are kept constant throughout the simulation: their values depend only on the actual material, on the discretization mesh size and on the time step adopted [1].

The same discretization procedure is applied to all components at each cell, eventually resulting in a large set of algebraic linear equations. However, thanks to the so-called “leapfrog” scheme, a computationally effective way to solve such an algebraic system can be found, which makes the inversion of huge, sparse matrices unnecessary. Basically, such a scheme involves evaluating the time derivatives of the electric- and magnetic-field components at alternate time intervals (denoted by integer and fractional superscripts, respectively, in Eqs. (2) above): according to such a scheme, field updates depend only on quantities computed at previous time iterations, so that the update equations pertaining to a given time step can be decoupled and independently solved (i.e., sequentially, in any arbitrary order). Six scalar equations per cell are eventually obtained, which share the same structure:

$$
y = a + k_1 (b - c) + k_2 (d - e) + j_s
\tag{3}
$$



In the above equation, y represents the field component to be updated, a is its current value (i.e., that computed at the previous time step), k1 and k2 are material-related constant coefficients, b, c, d and e are known field components (i.e., previously computed at neighboring grid points) and js represents the field source, if any.
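To make the structure of the computation explicit, the sketch below shows what one of the six update loops looks like in software. It is only an illustration written for this discussion (the array layout, names and the IDX macro are our assumptions, not the authors’ code), following the Hx update of Eq. (2a) with per-cell coefficients. Note the absence of branches and the purely nearest-neighbor accesses; the remaining five components are updated by analogous loops, executed in leapfrog order (all magnetic components first, then all electric ones).

```c
/* Illustrative sketch only (not the authors' code): Hx update loop,
 * following Eq. (2a).  Arrays are flat, row-major; c1 and c2 hold the
 * per-cell material/mesh coefficients, precomputed before the time loop. */
#include <stddef.h>

#define IDX(i, j, k) (((size_t)(i) * ny + (j)) * nz + (k))

void update_hx(float *hx, const float *ey, const float *ez,
               const float *c1, const float *c2,
               size_t nx, size_t ny, size_t nz)
{
    for (size_t i = 0; i < nx; i++)
        for (size_t j = 0; j + 1 < ny; j++)
            for (size_t k = 0; k + 1 < nz; k++) {
                size_t n = IDX(i, j, k);
                /* y = a + k1*(b - c) + k2*(d - e), cf. Eq. (3) */
                hx[n] += c1[n] * (ez[IDX(i, j + 1, k)] - ez[n])
                       + c2[n] * (ey[n] - ey[IDX(i, j, k + 1)]);
            }
}
```

In the custom architecture described next, this loop body is the scalar operation the pipelined FPU evaluates in a single pass, while the control units stream the operands from SDRAM.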

3. System Architecture

The overall architecture is depicted in Fig. 2, and includes a dedicated floating-point unit (FPU), some register banks, a PCI Target Interface Circuit (PCI TARGET), a set of control units (CU) and several external SDRAM memories. Five 32-bit data buses have been considered, to allow for parallel data fetching from the SDRAM banks: actually, all data required to update a field component are obtained from the on-board SDRAM in just two read operations; if a general-purpose processor were used, eight operations would instead have been necessary. Ideally, implementing eight buses (plus eight sets of SDRAM control signals) would have allowed all of the data to be fetched in a single read cycle. However, this would have had a prohibitive impact on the I/O pin count, and on the chip and board signal routing. By limiting ourselves to five buses, a reasonable trade-off between the total throughput of the system and the number of chip I/O pins is obtained; due to the sequential data fetch, this solution also requires some additional registers for temporary data storage. No data buffering is instead needed for output data, which are directly written back into the SDRAM banks.

Fig. 2   Overall system architecture.

The design has been driven by performance and scalability concerns. The architecture has been specified in the VHDL language [6]; by means of such a formalism, the system is described within a standard, industrial design flow, allowing simulation tools to be exploited to estimate performance and, in a further perspective,

to keep the design relatively independent of the actual fabrication technology.

Equation (3) actually involves only arithmetic addition and multiplication operations, so that the FPU datapath can be hierarchically designed, starting from a reduced set of modular operators. The designed FPU, illustrated in Fig. 3, is a fully pipelined unit, based on the IEEE-754 standard floating-point number representation, and includes adders and multipliers. Registers are used as delay units (d), to keep data aligned within the pipeline. The floating-point unit has seven 32-bit parallel inputs and is capable of working out one updated field component per clock cycle. General-purpose microprocessors, instead, generally include a smaller number of FP operators, and hence require the update computation to be split over several clock cycles.

Fig. 3   FPU architecture (field source input not shown, i.e., js = 0).

As stated above, the FPU has been mapped onto a deep-submicron commercial technology (ALCATEL MTC45000, 0.35 µm), available for multi-project wafer runs. Propagation delays throughout the datapath have thus been taken into account: in this first version, the design timing has been targeted to a 166 MHz operating frequency; this relatively loose requirement comes from the consideration that, as detailed below, the system throughput is actually limited by the communication with the SDRAM modules. Propagation times of 3.5 and 4 ns were obtained for the adder and the multiplier, respectively.

The PCI TARGET module manages the data flow between the FDTD system and the host PC: it complies with the standard PCI specifications [7], so that high-level control routines, running on the host processor, can take care of initial data loading, output data management and exception handling. High-speed communication between the FPU and the RAM modules, instead, is managed by the on-chip Control Unit (CU). Such a unit is actually split into a set of parallel sub-units, which take care of the data flow between the SDRAM, the FPU and the PCI interface: by issuing proper instructions to the memory and the datapath, the subsequent operations needed by the FDTD algorithm


(loading of the constant coefficients, initialization of the system variables, acquisition of the external field stimuli, the actual field-component updates, and write-back of the results to the host system) can be executed.

In order to avoid stalls of the pipelined FPU, fresh data should be accessed from the SDRAM in a synchronous fashion (i.e., fetching a new set of consistent data per clock cycle). This goal can be achieved by exploiting the streaming-mode (“burst”) access of SDRAMs. To this purpose, however, the contents of the SDRAM memories need to be properly organized, taking advantage of the regular structure of the discretization mesh: by fixing two out of the three spatial coordinates, a straight row of cells is selected; for each cell in the row, due to the translational symmetry of the FDTD mesh, the field coefficients appearing in homologous positions within Eq. (3) refer to aligned mesh cells as well. Hence, it is convenient to store the field components relative to aligned cells in the same memory row, as illustrated in Fig. 4: the main update loop then scans aligned cells, so that the required coefficients come, in sequence, from adjacent RAM locations, as shown in Fig. 5. This, thanks also to the parallel-bus architecture, allows RAM communication to be optimized: taking advantage of burst access, fully synchronous operation can be achieved, and no pipeline stall due to SDRAM memory-row data latency occurs while scanning a row; just two read cycles are needed to access the input data, and a single cycle is required to store the result. Pipeline stalls may only occur when jumping from one row to the next: it is therefore convenient to scan the mesh along the longest edge (if any) of the device under inspection.

Fig. 4   Storage of the field components into SDRAM memories.
Fig. 5   Field update and memory update scheme.

In the test case at hand, we assumed 8-Mbyte SDRAM chips, which allow problems featuring up to two million field components to be handled. This is quite a reasonable figure for a wide range of practical cases; nevertheless, much larger SDRAMs are being designed and produced, and could be exploited almost straightforwardly (only minor changes would be needed in the memory control section, to manage a larger address space). In any case, the SDRAM physical size does not necessarily represent an upper limit on the mesh size: since, due to the discretization algorithm, the equations are decoupled, the simulation domain can easily be partitioned into separate segments. At the expense of some overhead and of increased complexity in the control software (i.e., a routine would have to manage the swapping of the mesh segments), each segment could be handled independently by the FDTD engine.
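As a quick sanity check on the capacity figure above (our own arithmetic, not taken from the paper), an 8-Mbyte chip storing 32-bit words holds

$$
\frac{8 \times 2^{20}\ \text{bytes}}{4\ \text{bytes/word}} = 2^{21} \approx 2.1 \times 10^{6}\ \text{words},
$$

which is consistent with the quoted limit of roughly two million single-precision field components.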

4. Performance Estimate

To evaluate the performance of the proposed architecture, and to allow for a comparison with a software implementation on a general-purpose architecture, a fairly simple case has been considered: a cubic resonant cavity has been simulated, discretized by means of a 21 × 21 × 21 cell mesh. After a pulsed stimulus is injected at a given location (accounting for a non-zero initial field), the transient simulation is carried out. Given such an elementary structure, it makes little sense, here, to look at the simulation results in terms of the actual time and space distribution of the electric and magnetic fields; as soon as the fabrication of the physical board is completed, and the software control routines on the host PC are developed, a simple post-processing program will easily allow for this. Here, instead, we have limited ourselves to checking for consistency between the “hardware” and the “software” simulation results by direct inspection. Such a check has been completely successful: differences between the output vectors are negligible and due to the different impact of round-off (the sequence of arithmetic operations is not necessarily the same in the hardware and software solutions).

Performance has been estimated by looking at the time required to compute a given number of time iterations: for instance, in order to compute 10,000 time steps, an estimate of 15 seconds has been obtained for the “hardware” solver, whereas 64 seconds were needed by the “software” one. The latter figure was obtained by using a highly-optimized FORTRAN code, running on a 500 MHz Pentium-class machine, equipped with 128 Mbyte of RAM (sufficient to avoid virtual-memory page swapping, given the size of the problem at hand), under the Linux operating system. Such a rough estimate therefore indicates that a significant speed-up factor seems achievable; a more accurate evaluation depends on several factors and is currently under way.

Increasing the clock frequency (either of the CPU running the software implementation, or of the FPU of the dedicated solver) is not necessarily expected to translate into a proportional performance improvement. The problem, in fact, is more severely constrained by the speed and width of the memory bus: to this purpose, it is worth observing that, at larger problem sizes, an increasing overhead is to be expected for the software implementation, whereas, for the hardware solver, the per-element computational cost stays virtually constant. Some architectural options, neglected for the sake of simplicity in this first design, should allow for larger speed-up margins: for instance, the implementation of so-called “absorbing” boundary conditions (which mimic field propagation in a semi-infinite medium [8]) represents a performance-critical issue for the software implementation, whereas they could be taken into account at no additional computing-time cost (but clearly at the expense of increased chip area) by the hardware solver.
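A rough plausibility check of the quoted figures (our own back-of-envelope estimate, not from the paper): assuming the three memory cycles per component update mentioned in Sect. 3 (two burst reads plus one write) dominate the per-cell cost,

$$
21^{3}\ \text{cells} \times 6\ \text{components} \times 3\ \text{cycles} \times 10^{4}\ \text{steps}
\approx 1.7 \times 10^{9}\ \text{cycles}
\;\Rightarrow\;
\frac{1.7 \times 10^{9}}{166\ \text{MHz}} \approx 10\ \text{s},
$$

of the same order as the 15 s reported, the remainder being plausibly attributable to row-change stalls, source injection and control overhead.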

5. Conclusions

In this paper, a hardware architecture for the solution of the FDTD algorithm on a 3D domain has been presented. The architecture is built around a custom VLSI chip implementing the “field-update” engine, and is conceived as a module communicating with a host personal computer via a PCI bus. A virtual implementation of the system has been carried out by means of a VHDL description of its architecture, and a performance estimate has been extracted from simulation.

From these results it emerges that, even at relatively low system clock frequencies, the time required for the solution of the FDTD equations compares favorably with software implementations on general-purpose microprocessors. Apart from possible design refinements, future work will therefore include the actual technology mapping of the architecture, chip fabrication and the assembly of the system board. Although much work is still to be done, we consider the approach presented so far a promising option to significantly reduce the computational cost of FDTD simulations of practical interest.

Acknowledgments

The authors would like to thank Dr. Paolo Mezzanotte, Dipartimento di Ingegneria Elettronica e dell’Informazione (DIEI), University of Perugia (Italy), for his support and helpful discussions.

References

[1] K.S. Yee, “Numerical solution of initial boundary value problems involving Maxwell’s equations in isotropic media,” IEEE Trans. Antennas & Propag., vol.14, pp.302–307, 1966.
[2] S.D. Gedney, “Efficient implementation of the FDTD algorithm on high performance computers,” Time-Domain Methods for Microwave Structures, pp.313–322, IEEE Press, 1997.
[3] M. Picket-May, A. Taflove, and J. Baron, “FD-TD modeling of digital signal propagation in 3-D circuits with passive and active loads,” IEEE Trans. Microwave Theory & Tech., vol.42, pp.1514–1523, 1994.
[4] P. Ciampolini, P. Mezzanotte, L. Roselli, and R. Sorrentino, “Efficient simulation of high-speed digital circuits using time adaptive FD-TD technique,” 25th European Microwave Conference, pp.636–640, Bologna, Italy, Sept. 1995.
[5] S.D. Gedney, “Finite-difference time-domain analysis of microwave circuit devices on high performance vector/parallel computers,” Time-Domain Methods for Microwave Structures, pp.344–344, IEEE Press, 1997.
[6] “IEEE Standard VHDL Reference Manual,” IEEE Standard 1076, 1987.
[7] E. Solari and G. Willse, PCI Hardware and Software, Annabooks, 1998.
[8] G. Mur, “Absorbing boundary conditions for the finite-difference approximation of the time-domain electromagnetic-field equations,” IEEE Trans. Electromagn. Compat., vol.EMC-23, no.4, pp.377–382, 1981.

Pisana Placidi received the “laurea” degree cum laude and the Ph.D. degree in electronic engineering from the University of Perugia, Perugia, Italy, in 1994 and 2000, respectively. Her research interests are mainly connected with mixed IC design and the modeling of HF electronic devices.


Leonardo Verducci is a Ph.D. student at the University of Perugia, Italy. He received the “laurea” degree cum laude in 2000. His research interests concern mixed IC design.

Guido Matrella was born in Foggia, Italy, on December 5, 1973. He graduated in Electrical Engineering in 1999 at the University of Parma, discussing a thesis on FIR filter design for audio applications by means of tools for DSP code generation. Since January 2000 he has been a Ph.D. student at the Dipartimento di Ingegneria dell’Informazione (DII) of the University of Parma. His main research interests include digital system design with VHDL for FPGA applications.

Luca Roselli received the “laurea” degree in electronic engineering from the University of Florence, Florence, Italy, in 1988. Presently, he is an Assistant Professor at the Department of Electronic Engineering of the University of Perugia. Since 1996 he has been on the reviewer list of the Microwave and Guided Wave Letters (MGWL), and since 1998 on the reviewer list of the Transactions on Microwave Theory and Techniques (MTT). His research interests include the design and development of microwave and millimeter-wave active and passive circuits by numerical techniques.

Paolo Ciampolini graduated (with honors) in Electronic Engineering in 1983, in Bologna. He received the Ph.D. degree in Electronics and Computer Sciences in 1989. From 1990 to 1992 he was with DEIS, at the University of Bologna, as a research assistant. From 1992 to 1998 he was with the University of Perugia as an associate professor. Since 1998 he has been with the Dept. of Information Engineering at the University of Parma, where he is now a full professor of Electronics. His research interests include the numerical modeling of semiconductor devices and circuits, solid-state silicon sensors, and digital system design.
