JOURNAL OF EMBEDDED COMPUTING, VOL. 1, NO. 11, NOVEMBER 2005


Design and Architecture for an Embedded 32-bit QueueCore

Abderazek Ben Abdallah, Sotaro Kawata, and Masahiro Sowa
Graduate School of Information Systems
The University of Electro-Communications
1-5-1 Chofu-gaoka, Chofu-shi, Tokyo 182-8585
Email: [ben,kawata,sowa]@is.uec.ac.jp

Abstract— A queue-based instruction set architecture processor offers an attractive option in the design of embedded systems by providing high performance for a specific application. This work describes the design results and methodology of a queue processor core, named QueueCore, as a starting point for application-specific processor (ASP) design. By using a simple and common base queue instruction set, design space exploration can focus on the application-specific aspects of performance. An important aspect of any new architecture is verification, which usually requires complicated and lengthy software simulation of an emulated model. We show how cooperative hardware emulation, based on programmable logic, can be integrated into a co-design flow to evaluate the performance of the novel QueueCore processor and to verify the functional correctness of a specific benchmark on the system core. To avoid duplication of the design effort, different target implementations are derived from a common source. We present the evaluation results of the QueueCore for three different platforms.

Index Terms— QueueCore, Queue computing, design, modular design, simple.

I. INTRODUCTION

WITH VLSI systems becoming more and more commonplace, and with myriads of new applications for VLSI systems, there is a growing need for the design of novel systems that will perform given tasks efficiently. In our previous work, we have researched a queue computing infrastructure as a starting point for cooperative design space exploration for general and embedded applications [2], [4]. The choice of this architecture was based on a number of factors, such as its suitability for embedded applications. A parallel Queue processor architecture [4] exploits instruction-level parallelism without considerable effort and without the need for heavy run time data dependence analysis, resulting in a simple hardware organization when compared with a conventional superscalar processor. This allows the inclusion of a large number of functional units on a single chip, increasing parallelism exploitation. The key ideas of the QueueCore are its operand and result manipulation schemes. The Queue computing scheme adopted in this work stores intermediate results into so-called queue registers (QREG). In this model, a given instruction implicitly reads its first operand from the head of the QREG, its second

Manuscript received Jul. 2005; revised Sept. 2005. This work was partially supported by the Graduate School of Information Systems.

operand from a location explicitly addressed with an integer offset from the first operand location, and stores the computed result at the tail of the QREG. An important feature of this scheme is that write-after-read false data dependencies do not occur [2]. Furthermore, since there is no explicit referencing to the QREG, it is easy to add extra storage locations to the QREG when needed. Since the operand and result addresses of a given static instruction (compiler generated) are implicitly computed during run time, an efficient and fast hardware mechanism is needed for parallel execution of instructions. The designed QueueCore implements a so-named queue computation mechanism that calculates the operand and result addresses for each instruction. The mechanism used for calculating these addresses is described in a coming section. On the other hand, an important aspect of developing any new architecture is verification, which usually requires complicated and lengthy software simulation of an emulated model. Event-based or cycle level simulation becomes increasingly inadequate to verify a significant execution trace for a given problem. These software-based simulation approaches are not capable of predicting all micro-architectural issues related to the final physical design [1], [18], [19], [20]. To this end, we consider prototyping-based emulation that substitutes real time hardware emulation for slow simulator-based execution. Prototyping-based design itself is not a new idea, as several designers have adopted this method [21], [22]. However, what is important for us is to investigate how to describe the novel QueueCore architecture to achieve good synthesis results for FPGA implementation with sufficient performance to support realistic evaluation. Using a hardware description language, we have created a synthesizable model of the QueueCore for the integer subset of the parallel Queue processor (PQP) architecture [2], [4].
A prototype implementation is produced by synthesizing the high-level model for the Stratix FPGA device [14], [16]. As a result, we were able to evaluate relative circuit area and speed metrics for design alternatives. This paper has two contributions:

• The first contribution is the description of the hardware design, implementation and evaluation of a 32-bit producer order parallel Queue processor (QueueCore).

• The second contribution arises from the observation that high-level modeling is most beneficial when it opens up new possibilities for organizing designs. Thus, we introduce a design strategy, enabled by the high modular-design level, which we refer to as hot modular design.

We describe a quantitative study of the impact of synthesis and tools on performance and design quality achieved. We also show how cooperative hardware emulation, based on programmable logic, can be integrated into a co-design flow to evaluate the QueueCore performance and verify the functional correctness of a specific benchmark on the QueueCore system. This design methodology is used for our QueueCore system; however, it can also be used for other newly developed architectures.

The rest of this paper is organized as follows: Section II gives prior work about Queue architectures. To guide the reader, we present an overview of the producer order queue execution model in Section III. We also give an example that illustrates how operands and results are manipulated in the QueueCore architecture. In Section IV, we describe the QueueCore architecture in a fair amount of detail. As we adopted some conventional design techniques during the development of the QueueCore, such as the fetch, decode and execution schemes, we will not focus on these issues, which can be found elsewhere; we rather focus on the key hardware features of the QueueCore. In Section V we present the QueueCore design approach and pipeline control description. The QueueCore validation description is given in Section VI. Section VII gives the FPGA synthesis of the QueueCore. We investigate how the processor coding style affects hardware resource usage in Section VIII. Modeling style with special FPGA circuits and storage structure for the QueueCore is described in the following two sections. We conclude our work with a discussion of the results.

II. QUEUE ARCHITECTURES PRIOR WORK

This section presents an overview of previous work carried out by other research groups and our research team.
This should help justify and establish the context of the latest design reported in this paper. Historically, the queue computing idea traces back several decades. In [6], Bruno investigated an indexed queue machine architecture that uses a FIFO as the underlying mechanism for operand and result manipulation. At the execution stage, each instruction removes the required number of operands from the front of the queue, performs its computation and stores the result back into the operand queue at a specified offset from the front of the queue. A major problem with the above indexed queue architecture is that it requires the relocation of a potentially large number of operands. In addition, the operand queue within the above architecture is implemented in the main memory. Another research work was introduced in [23]. This work investigated issues related to the design constraints of a superscalar processor based on a queue computation model. In [2], [4] we proposed a produced order parallel Queue processor architecture that solves the problems in the earlier proposed queue architectures. The work described here presents the hardware

2

[Fig. 1(a) and (b): original and augmented (producer/consumer annotated) data flow graphs for e = ab/c and f = ab(c+d); see the caption below.]
ld a    # load variable "a"
ld b    # load variable "b"
ld c    # load variable "c"
ld d    # load variable "d"
mul     # multiply the first two variables from the front of the Queue
add     # add the next two variables
div -2  # divide the entry pointed to by QH by the variable "c",
        # which is located at a negative offset "-2" from QH
mul -1  # multiply the entry pointed to by QH by the entry at offset "-1"
st e    # store (ab/c) into "e"
st f    # store a*b*(c+d) into "f"

(d) Queue-register contents at each execution state:

State 1 (after ld a, ld b, ld c, ld d): QREG holds a, b, c, d; LQH and QH point to a; QT points to the next empty slot.
State 2 (after mul, add): QREG holds a*b and c+d.
State 3 (after div -2, mul -1): QREG holds ab/c and ab*(c+d).
State 4 (after st e, st f): QREG empty; LQH, QH and QT point to the same empty location.

Fig. 1. Sample data flow graph and queue-register contents for the expressions e = ab/c and f = ab(c + d). (a) Original sample program, (b) translated (augmented) sample program, (c) generated instruction sequence, (d) queue-register contents at each execution state.

design and evaluation of a produced order parallel Queue processor based on our earlier work published in [2], [4]. To the best of our knowledge, this is the first work on the actual hardware design and evaluation of a Queue processor.

III. QUEUE COMPUTING OVERVIEW

The produced order queue computing model uses first-in-first-out (FIFO) queue registers instead of random access registers to store intermediate results. A datum is inserted in the queue in produced order and can be reused. This feature has profound implications in the areas of parallel execution, program compactness, hardware simplicity and high execution speed [2], [4]. This section gives a brief overview of the produced order Queue programming model.

We show in Fig. 1(a) a sample data flow graph for the expressions e = ab/c and f = ab(c+d). Data are loaded with the load instruction (ld), computed with the multiply (*), add (+), and divide (/) instructions. The result is stored back to memory with the store instruction (st). In Fig. 1(b), producer and consumer nodes are shown. For example, A2 is a producer node and A4 is a consumer node (A4 is also a producer for node m6). The instruction sequence for the queue execution model is correctly found when we traverse the data flow graph (shown in Fig. 1(a)) from left to right and from the highest to the lowest level [4].

In Fig. 1(b), the augmented data flow graph that can be correctly executed in the proposed queue execution model is given. The generated instruction sequence from the augmented graph is shown in Fig. 1(c) and the content of the QREG at each execution stage is shown in Fig. 1(d). A special register, called the queue head pointer (QH), points to the first datum in the QREG. The queue tail pointer (QT) points to the location of the QREG in which the next result datum is stored. A live queue head pointer (LQH) marks the data that were already used and may be reused. Data between the QH and LQH pointers are called live data. The live entries in the queue-register are statically controlled: two special instructions are used to stop or release the LQH pointer. Immediately after using a datum, the QH is incremented so that it points to the datum for the next instruction. QT is also incremented after the result is stored.

The four load instructions load a, b, c and d in parallel and place them into the QREG. At this state, QH and LQH point to datum a and QT points to an empty location, as shown in Fig. 1(d) (State 1). The fifth and sixth instructions are also executed in parallel. The mul refers to a and b, then inserts a*b into the QREG; QH is incremented by two and QT is incremented by one. The add refers to c and d, then inserts c+d into the QREG. At this state, the QH, QT and LQH are incremented as shown in Fig. 1(d) (State 2). The seventh instruction (div -2) divides the datum pointed to by QH (in this case a*b) by the datum located at the negative offset -2 from QH (in this case c). The QH is incremented and points to c+d. The eighth instruction multiplies the datum pointed to by QH (in this case c+d) by the datum located at -1 from QH (in this case a*b). After parallel execution of these two instructions, the QREG contents become as Fig. 1(d) (State 3) shows. The last two instructions store the results back to the data memory.
Since the QREG becomes empty, LQH, QH and QT point to the same empty location (State 4).
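The walkthrough above can be condensed into a small executable model. The following Python sketch is an illustrative model, not the authors' hardware; the function names and the tuple-based instruction form are our own. It mimics the produced order semantics: QH advances as operands are consumed, results are appended at QT, and a signed offset addresses a second operand relative to QH while consumed entries stay live for reuse.

```python
# Illustrative simulator of the produced order queue execution model.
# The QREG is modeled as a linear buffer with a queue-head index (qh);
# entries behind qh correspond to live data between LQH and QH.

def run(program, memory):
    qreg, qh = [], 0                       # queue buffer and QH index
    for inst in program:
        op = inst[0]
        arg = inst[1] if len(inst) > 1 else None
        if op == "ld":
            qreg.append(memory[arg])       # insert at QT
        elif op == "st":
            memory[arg] = qreg[qh]         # store datum at QH
            qh += 1
        elif arg is None:                  # plain binary op: consume two
            a, b = qreg[qh], qreg[qh + 1]
            qh += 2
            qreg.append(_apply(op, a, b))  # produce result at QT
        else:                              # binary op with QH-relative offset
            a, b = qreg[qh], qreg[qh + arg]
            qh += 1                        # second operand stays live
            qreg.append(_apply(op, a, b))
    return memory

def _apply(op, a, b):
    return {"add": a + b, "sub": a - b, "mul": a * b, "div": a / b}[op]
```

Running the sample program of Fig. 1(c) with a=2, b=3, c=4, d=5 leaves e = 1.5 and f = 54 in memory, matching ab/c and ab(c+d).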

Fig. 2. QueueCore architecture block diagram. During RTL description, the core is broken into small and manageable modules using a modular approach for easy verification, debugging and modification.

IV. QUEUECORE ARCHITECTURE

The QueueCore block diagram is shown in Fig. 2. The system has six pipeline stages:

(1) Fetch (FU): at each cycle the fetch counter is sent to the PROG/DATA memory and eight bytes (four instructions) are fetched and inserted into a fetch buffer (FB).

(2) Decode (DU): this unit reads eight bytes from the FB every cycle, decodes them, then inserts the decoded information into a decode buffer. This stage also calculates the number of consumed (CNBR) and produced (PNBR) data for each instruction. The CNBR and PNBR are used by the next pipeline stage (the queue computing stage) to calculate the source and destination locations for each instruction.

(3) Queue computation (QCU): calculates the first operand (source1) and destination addresses for each instruction. The mechanism used for calculating the source1 address is given in Fig. 3. The QCU unit keeps track of the current values of the QH and QT pointers. Four instructions arrive at the QCU unit each cycle. For the first instruction, the number of consumed operands (CNBR, an 8-bit field) is added to the current QH value (QH0) to find the address of the first operand, and the number of produced results (PNBR, an 8-bit field) is added to the current QT value (QT0) to find the result address (QT1) of the first instruction. The first operand and result addresses of the other three instructions are calculated similarly, as indicated in the lower part of Fig. 3. Notice that the calculation is performed sequentially.

Fig. 3. Address calculation mechanism for source1 and destination. Legend: PNBR: number of produced data; CNBR: number of consumed data; QH0: initial queue head value; QT0: initial queue tail value; NQH: next queue head value; NQT: next queue tail value; QHn+1: next queue head value; QTn+1: next queue tail value.

(4) Barrier: inserts barrier flags for dependency resolution.

(5) Issue: four instructions are issued for execution each cycle. In this stage, the second operand (source2) of a given instruction is first calculated by adding the source1 address to the displacement that comes with the instruction. The second operand address could have been calculated earlier, in the QCU stage; however, for timing balance considerations, source2 is calculated in the IS stage. The hardware mechanism used for calculating the second operand (source2) address is shown in Fig. 4. An instruction is ready to be issued if its data operands and its corresponding functional unit are available.

(6) Execution (EXE): the QueueCore's execution unit computes the effective address for loads and stores and executes ALU instructions. During the EXE stage, the machine also determines whether a conditional branch is taken and, if so, updates the program counter and sends the QH and QT of the branch instruction to the QCU unit.

Fig. 4. Source two (source2) calculation mechanism. Legend: OFFSET: positive/negative integer value that indicates the location of SRC2(n-1) relative to QH(n-1); QTn: queue tail value of instruction n; DESTn: destination location of instruction n; SRC1(n-1): source data 1 of instruction (n-1); SRC2(n-1): source data 2 of instruction (n-1).
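The address arithmetic of Figs. 3 and 4 can be summarized in a few lines. The sketch below is our interpretation of the mechanism, not the RTL; the wrap-around at the QREG size (128 entries, per the hardware parameters reported later) is an assumption for illustration.

```python
# Illustrative model of the QueueCore address calculation:
# source1 and destination come from the running QH/QT pointers
# (advanced by CNBR and PNBR per instruction, as in Fig. 3);
# source2 is source1 plus the instruction's displacement (Fig. 4).

def compute_addresses(instructions, qh0, qt0, qreg_size=128):
    """Each instruction is a tuple (cnbr, pnbr, offset)."""
    qh, qt = qh0, qt0
    out = []
    for cnbr, pnbr, offset in instructions:
        src1 = qh                            # first operand at current QH
        src2 = (qh + offset) % qreg_size     # displacement relative to QH
        dest = qt                            # result goes to current QT
        qh = (qh + cnbr) % qreg_size         # consume cnbr operands
        qt = (qt + pnbr) % qreg_size         # produce pnbr results
        out.append((src1, src2, dest))
    return out
```

Note how each instruction's QH/QT values depend on the previous instruction's CNBR/PNBR, which is why the hardware performs the four per-cycle additions sequentially.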

A. Instruction Set Design Considerations

The QueueCore supports a subset of the produced order queue processor architecture (PQP) [2]. All instructions are 16-bit wide, allowing simple instruction fetch and decode stages and facilitating pipelining of the processor. The QueueCore instruction format is illustrated in Fig. 5. Several examples showing the operation of five different instructions are also given in Fig. 5(a)-(e). In the current version of our implementation, we target our QueueCore at small applications, where our concerns focus on the ability to execute Queue code on a processor core with small die size and acceptable power consumption characteristics. However, the short instructions may limit the memory addressing space, as only 8 bits are left for offsets and base addresses. To cope with this shortage, we designed a special "covop" instruction that extends load and store instruction base addresses. The QueueCore compiler outputs full addresses and full constants, and it is the duty of the QueueCore assembler to detect and expand a "covop" instruction whenever an address or a constant exceeds the limit imposed by the instruction's field sizes. Conditional branches are handled in a particular way since the compiler does not handle target addresses; instead, it generates target labels. When the assembler detects a target label, it checks whether the label has been previously read and fills the instruction with the corresponding value and a "covop" instruction if needed. For the case of forward references on target labels, the assembler emits a safety "covop" instruction, since it must guarantee the correctness of the program. A backpatch pass in the assembler resolves all missing forward referenced instructions. This issue is discussed elsewhere in [2].

V. QUEUECORE IMPLEMENTATION

A. Design Approach

To the best of our knowledge, this is the first hardware design of a Queue processor architecture. Therefore, there are no related works in the literature that give design issues or methodology related to a hardware Queue processor. To make the QueueCore design easy to debug, modify and adapt, we decided to use a high-level description, as was also done by other system designers, such as the works in [21] and [22]. We have developed the QueueCore system in Verilog HDL. After synthesising the HDL code, the designed processor gives us the ability to investigate the actual hardware performance and functional correctness. It also gives us the possibility to study the effect of coding style and instruction set architecture over various optimizations using rapid prototyping. For the processor to be useful for these purposes, we identified the following requirements: (1) High-level description: the format of the QueueCore description should be easy to understand and modify; (2) Modular: to add or remove new instructions, only the relevant parts should have to be modified. A monolithic design would make experiments difficult; and (3) The processor description should be synthesisable to derive actual implementations.
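As an aside on the assembler's "covop" expansion described in the instruction set subsection, the following sketch splits a full address into an 8-bit instruction field and a "covop" prefix carrying the upper bits. The field widths and the returned tuple form are illustrative assumptions, not the actual QueueCore encoding.

```python
# Hypothetical sketch of "covop" expansion: when an address or constant
# does not fit the 8-bit instruction field, emit a "covop" prefix that
# extends the base address with the upper bits.

def expand_address(addr, field_bits=8):
    low = addr & ((1 << field_bits) - 1)       # bits that fit in the field
    high = addr >> field_bits                  # overflow bits, if any
    if high == 0:
        return [("ld", low)]                   # fits: no prefix needed
    return [("covop", high), ("ld", low)]      # prefix extends the address
```

A forward-referenced branch label would always get the "covop" slot reserved (the "safety" covop), since the assembler cannot yet know whether the target fits the field.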
The QueueCore has been designed with a distributed controller to facilitate debugging and future adaptation to specific application requirements, since we target embedded applications. This distributed controller approach replaces a monolithic controller, which would be difficult to adapt. The distributed controller is responsible for pipeline flow management and consists of communicating state machines found in each pipeline stage. In our work, we used Altera Quartus II professional edition [16] as the synthesis tool. We have synthesised the processor core for Stratix FPGA and HardCopy devices.

B. QueueCore System Pipeline Control

In many conventional processors, the control unit is centralized and controls all central processing core functions. This scheme introduces pipeline stalls, bubbles, etc. Moreover, especially for a pipelined architecture, this control unit is one of the most complex parts of the design, even for processors with fixed functionality [12].


The instruction groups and the encoding examples of Fig. 5 are as follows:

ALU: add, addu, sub, subo, subu, subuo, and, or, sru, slu, sr, rol, ror, xor, neg, not, com, comu, comc, comcu, inc, lda. Example (a): add with displacement 0000111, encoded 1000000010000111, computes QREG(SRC1adr) + QREG(SRC1adr+00000111) => QREG(DESTadr).

MLT (signed 32-bit multiply, divide and mod instructions): mult, mulu, div, divo, divu, divuo, mod, modo, modu, moduo. Example (b): mod with displacement 0010111, encoded 1011111010010111, computes QREG(SRC1adr) % QREG(SRC1adr+0010111) => QREG(DESTadr).

SET: setHH, setHL, setLH, setLL, ldil, setr, mv, dup. Example (c): ldil with value 11111110, encoded 0100000011111110 (the value is stored in bits [7:0]), computes 11111110 => QREG(DESTadr).

Branch: bge, jump, call, rfc, b, beq, blt, ble, bgt. Example (d): call with target 11111110, encoded 0001101111111110, computes [a0 + 11111110] => PC.

LOAD/STORE: stb, sts, stw, ldb, ldbu, lds, ldsu, ldw, ldwu. Example (e): stw with offset 11101110, computes QREG(SRC1adr) => MEM(a0 + 11101110).
Fig. 5. QueueCore instruction format and computing examples: (a) add instruction, (b) mod instruction, (c) load immediate (ldil) instruction, (d) call instruction, and (e) store word (stw) instruction.
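A minimal decoder for the 16-bit format of Fig. 5 might look as follows, assuming an 8-bit opcode field in the upper byte and an 8-bit operand field in the lower byte (example (c) notes that the immediate value sits in bits [7:0]; the exact field boundaries of the real encoding may differ):

```python
# Hedged sketch: splitting a 16-bit QueueCore-style instruction word
# into an assumed 8-bit opcode and 8-bit operand (displacement, value,
# target, or offset depending on the instruction group).

def decode(word):
    assert 0 <= word < (1 << 16), "instructions are 16-bit wide"
    return {"opcode": word >> 8, "operand": word & 0xFF}
```

For instance, the add example (a), encoded 1000000010000111, splits into an upper byte 10000000 and a lower byte 10000111 under this assumed layout.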

Fig. 8. Snapshot of the graphical validation tool. The memory, QREG, and general purpose registers (GPR) are displayed. This view of the processor state was generated automatically by feeding the processor with a clock signal after loading the program memory with a sample binary program.


TABLE I
QUEUECORE HARDWARE PARAMETERS USED DURING SYNTHESIS STAGE.

Item    Configuration   Description
IW      16-bit          instruction width
FW      8 bytes         fetch width
DW      8 bytes         decode width
SI      75              supported instructions
QREG    128             queue-registers
ALU     4               arithmetic logical units
LD/ST   2               load/store units
BRAN    1               branch unit
SAS     4               set units
GPR     16              general purpose registers
MEM     2048 words      PROG/DATA memory

Fig. 6. Control unit: (a) in a conventional processor design, (b) in the QueueCore processor design.

In this design, we have decided to break up the unstructured control unit into small, manageable units. Each unit is described in a separate HDL module. That is, instead of a centralized control unit, the control unit is integrated with the pipeline data path. Thus, each pipeline stage is mainly controlled by its own simple control unit, as illustrated in Fig. 6. In this scheme, each distributed state machine corresponds to exactly one pipeline stage, and this stage is controlled exclusively by its corresponding state machine. Overall flow control of the processor is implemented by cooperation of the control units in each stage based on communicating state machines.
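The communicating state machine that controls each pipeline stage can be sketched as a three-state controller. The transition conditions below are our interpretation of the ACP/SUP/CPT conditions named in the pipeline synchronization figure, not the authors' HDL:

```python
# Illustrative per-stage pipeline controller. Conditions:
#   ACP: the next stage can accept data
#   SUP: the previous stage can supply data
#   CPT: this is the last cycle of the current computation

IDLE, PROCEED, STALL = "IDLE", "PROCEED", "STALL"

def next_state(state, acp, sup, cpt):
    if state == IDLE:
        return PROCEED if sup else IDLE        # start when data arrives
    if state == PROCEED:
        if cpt and not acp:
            return STALL                       # done, but next stage is busy
        if cpt and acp and not sup:
            return IDLE                        # result handed off, no new data
        return PROCEED                         # keep computing or take new data
    # STALL: hold the result until the next stage accepts it
    if acp:
        return PROCEED if sup else IDLE
    return STALL
```

Extending PROCEED into substates (as done for the Queue computation stage: reading initial addresses, calculating next addresses, address fixup) would replace the single PROCEED state with a small sub-machine.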

Each pipeline stage is connected to its immediate neighbours and indicates whether it is able to supply or accept new instructions. Communication with adjacent pipeline stages is performed using two asynchronous signals, AVAILABLE and PROCEED. When a stage has finished processing, it asserts the AVAILABLE signal to indicate that data is available to the next pipeline stage. The next stage will then indicate whether it can process these data by using the PROCEED signal. Since all fields necessary to determine what actions are to be taken next are available in the pipeline stage (for example, operation status, ready bits, and synchronization signals from adjacent stages), computing the next state is simple. The state transition of a pipeline stage in the QueueCore is illustrated in Fig. 7. This basic state machine is extended to cover the operational requirements of each stage by dividing the PROCEED state into substates as needed. An example is the implementation of the Queue computation stage, where PROCEED is divided into substates for reading initial address values, calculating next address values, and address fixup (when needed).

Fig. 7. Finite state machine transition (states IDLE, PROCEED, STALL) for QueueCore pipeline synchronization. The following conditions are evaluated: next stage can accept data (ACP), previous pipeline stage can supply data (SUP), last cycle of computation (CPT).

VI. QUEUECORE VALIDATION

Validation loosely refers in this paper to the process of determining that the designed processor is correct. It is performed at multiple abstraction levels by integrating several approaches, performed in three phases and based on the same HDL that describes the processor core. The hardware design parameters of the QueueCore processor are shown in Table I. The validation covers:

• state machine synchronization between pipeline stages;
• basic computation;
• queue and general purpose registers as well as data memory states.

I. Functional Verification: To simplify functional verification we have developed a front end tool (FET), which displays the internal state of the QueueCore processor. The states displayed include: the state of each pipeline stage, the QREG contents, the data memory contents and some special/general purpose register contents. We can easily extend the FET tool to display several other states at each pipeline stage. For visibility, we only show the state of the QREG and the data memory, the two states that are enough for checking the correctness of a simulated benchmark program. This approach allows monitoring program execution as each instruction passes through the pipeline and identifies functional problems by tracing processor state changes. In addition, we note that this method does not replace other verification techniques or methods, but reduces the amount of trace data which has to be scanned to identify problems in the novel QueueCore design. Fig. 8 shows a snapshot of the graphical validation tool that shows the validated output ports when the source clock changes. The right part of Fig. 8 shows the QREG


and the general purpose registers (GPR) states (GPR states are not shown in the figure). A summary of the memory status and the instruction processing counters is shown in the bottom left part of the figure.

II. Rapid Prototyping: While simulation allows very accurate tracing of each individual signal in a design, simulation speeds are too slow to simulate significant programs. Thus, we decided to develop an FPGA prototype in order to allow real time testing of the QueueCore before actual fabrication. In addition to higher simulation speeds, FPGA prototyping allows the verification of functionality which is otherwise difficult and time consuming to verify using simulation. A further well known advantage of FPGA-based hardware emulation is that actual peripheral components can be attached to the processor core [11], [12]. Not all processor designs are amenable to this type of verification, due to limitations in the complexity of available FPGA devices. Currently available field-programmable devices offer logic capacities of more than 100k gate equivalents, and this number is increasing rapidly. One of the most challenging problems in high-level synthesis is how to explore a wide range of designs to achieve a high-quality design within a short time [7], [8], [9]. Prototyping of the QueueCore is described in the next several sections.

III. Gate Level Simulation: High-level verification is mainly used to verify functional correctness. Low-level problems such as timing violations cannot be verified directly with high-level verification tools. To ensure correctness of low-level implementation details, interfaces and timing, we use gate-level simulation to check compliance with the design specifications. We specified timing and other constraints using a unified user constraints file with the core modules. With gate level simulation, we validate the actual timing and obtain detailed information about the QueueCore.

VII. FPGA SYNTHESIS OF THE QUEUECORE

FPGAs consist of two resource types: logic resources, which implement the logic functions in a design, and interconnect resources, which combine logic resources to form more sophisticated functions. In the Stratix FPGA family, which was used to prototype the QueueCore system, the basic logic resources are contained in the Logic Array Blocks (LABs) [9]. Each LAB contains 10 logic elements (LEs), and each LE comprises a single 4-input look-up table (LUT) and a flip-flop. The Stratix device provides a bit-oriented architecture, where each logic resource is programmed individually. This approach offers maximum versatility but incurs considerable cost for data path oriented applications, where control information is duplicated for each bit of the data path [9]. When targeting a design to a data path oriented architecture, resource efficiency is a function of how well a description can be mapped to the multibit data path structure. Using a Verilog HDL-based (or other HDL) design, a prototype implementation can be obtained from the same HDL description as the final implementation. Logic synthesis is targeted at FPGAs to exercise the design for verification purposes. The final implementation is performed using logic synthesis to an ASIC technology.

In order to estimate the impact of the description style on the target FPGA efficiency, we have explored logic synthesis for FPGAs. The idea of this experiment was to optimize critical design parts for speed or for resource usage. The optimizations of the Verilog HDL models for FPGAs can be arranged into two distinctive approaches:

1) Coding style. Because the hardware structure is directly inferred from the HDL description, the HDL coding style has a significant impact on the type, size and number of resources required to model the QueueCore. Design components typically affected by coding style are sequential elements and finite state machines. Coding style also determines whether a description can be successfully mapped onto multibit data path logic resources in data path oriented architectures.

2) Usage of special-purpose devices on the FPGA. Many FPGA devices include special-purpose components to improve device effectiveness in some application area. Typical examples of such components are dedicated wide decoders, RAM functionality and arithmetic functions. For an HDL description to be effective, these special purpose components have to be exploited wherever possible. In FPGAs, many of these features are accessed by using parameterizable megafunction cores (the Altera MegaWizard tool is an example).

Optimizing the Verilog HDL description to exploit the strengths of the target technology is of paramount importance to achieve an efficient implementation. This is particularly true for FPGA targets, where a fixed amount of each resource is available and choosing the appropriate description style can have a high impact on the final resource efficiency [1], [3], [5]. For typical FPGA features, choosing the right implementation style can cause a difference in resource utilization of more than an order of magnitude [9], [10].
Synthesis efficiency is influenced significantly by the match between the resources implied by the HDL description and the resources present in a particular FPGA architecture. When an HDL description implies resources not found in a given FPGA architecture, those elements have to be emulated using other resources at significant cost. Such emulation can be performed automatically by EDA tools in some cases, but in the worst case may require changes in the HDL description, counteracting the aim of a common HDL source code base. In this work, our experiments and the reported results are based on the Altera Stratix architecture [14]. We selected a Stratix FPGA device because it offers a good tradeoff between routability and logic capacity. In addition, it has internal embedded memory that eliminates the need for an external memory module and offers up to 10 Mbits of embedded memory through the TriMatrix memory feature. We also used Altera Quartus II professional edition [16] for simulation, placement, and routing. Simulations were also performed with the Cadence Verilog-XL tool [15].
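A concrete case of such resource mismatch is a description that implies internal tristate drivers. The sketch below (our own illustration, not QueueCore source code) maps directly onto fabrics with internal tristate resources, but must be emulated with multiplexors on devices that lack them:

```verilog
// Hypothetical sketch: an internal shared bus with two tristate
// drivers. On a fabric with internal tristate resources this maps
// directly; on devices without them, the EDA tool must emulate the
// bus with multiplexor logic, which may cost significantly more.
module shared_bus (input         en_a, en_b,
                   input  [31:0] a, b,
                   output [31:0] bus);
  assign bus = en_a ? a : 32'bz;  // driver A, high-Z when disabled
  assign bus = en_b ? b : 32'bz;  // driver B, high-Z when disabled
endmodule
```

Whether this construct is cheap or expensive therefore depends entirely on the target architecture, which is exactly the mismatch problem described above.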

VIII. QUEUECORE CODING STYLE AND FPGA RESOURCE USAGE

The quality of the final QueueCore hardware and the resource usage of the target FPGA device depend very much on the description style used at a higher level. To account for this, the high-level HDL description has to be adapted to guide the synthesis tool toward the appropriate implementation method for the functionality.

Signals are often selected in Verilog HDL from a number of sources by constructs such as conditional statements and array indexing. From such source code, logic synthesis normally generates multiplexors to select the inputs. While this implementation is appropriate for ASIC processors, wide multiplexors can result in enormous resource usage in FPGAs, requiring a large number of logic array blocks (LABs). For FPGAs supporting tristate resources, these expensive multiplexors can often be replaced by tristate buses. This alternative implementation uses considerably fewer logic resources on the target FPGA device.

The coding style also affects the design of the finite-state machines (FSMs) that are common in the QueueCore. A poor choice of state codes will result in an inefficient state machine design. Many advanced techniques have been developed for choosing an optimal state assignment; typical approaches use the minimum number of state bits or assume a two-level logic implementation [17], [13]. Fig. 9 compares the encoding tradeoffs for a simple FSM on the Altera STRATIX-EP1S FPGA and a Structured ASIC library using three encoding techniques: gray code encoding, binary encoding, and onehot encoding of states. When ASIC models are the target, fully encoded representations such as binary or gray code encoding of states lead to space-efficient designs, whereas the onehot encoding scheme consumes more resources. This is different for FPGA devices with a large number of flip-flops, where a decoded representation reduces decoding logic at the cost of typically under-utilized flip-flops.
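The encoding tradeoff above can be sketched in Verilog as follows. This is our own minimal illustration, not the QueueCore FSM itself; state and port names are hypothetical:

```verilog
// Hypothetical sketch: a three-state FSM with an onehot state
// assignment. Binary encoding would need only 2 flip-flops but
// wider next-state decode logic; onehot spends one flip-flop per
// state and keeps decoding shallow, which suits flip-flop-rich
// FPGA fabrics such as Stratix.
module fsm_onehot (input  clk, rst, start, done,
                   output busy);
  // Onehot assignment: exactly one bit set per state.
  localparam [2:0] IDLE  = 3'b001,
                   RUN   = 3'b010,
                   FLUSH = 3'b100;
  reg [2:0] state;

  always @(posedge clk or posedge rst)
    if (rst)
      state <= IDLE;
    else case (state)
      IDLE:    state <= start ? RUN   : IDLE;
      RUN:     state <= done  ? FLUSH : RUN;
      FLUSH:   state <= IDLE;
      default: state <= IDLE;  // recover from illegal states
    endcase

  // With onehot codes, an output decodes from a single flip-flop.
  assign busy = state[1];  // asserted only in RUN
endmodule
```

Switching this module to binary codes (2'b00, 2'b01, 2'b10) would change only the localparam values and register width, which is what makes the Fig. 9 style of comparison straightforward to set up.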
Since the resource distribution of the onehot-encoded FSM maps well to the resource distribution of the target architecture, it is a very compact representation.

IX. MODELING STYLE WITH SPECIAL FPGA CIRCUITS

Stratix and other FPGA devices include special circuits (sometimes called building blocks) to improve device effectiveness in some application areas. For an effective design, this special logic has to be used wherever possible to achieve maximum resource efficiency and speed. A typical example of such special circuitry is the provision of optimized arithmetic functions to simplify the implementation of fast adders, subtractors, counters, and other related function blocks. When such special logic is available, an optimal implementation is based on the usage of that hardware. In general, no algorithmic advantage can be gained by substituting a supposedly superior description for such a function block, as the special-purpose hardware is implemented directly at the circuit level. Thus, using high-level operators which map to module generators optimized for the target platform offers superior results. Using such module generators makes the HDL description independent of the
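The point about high-level operators can be sketched as follows (a minimal illustration under our own naming, not QueueCore source code):

```verilog
// Hypothetical sketch: describing the adder with the high-level
// '+' operator lets the synthesis tool map it onto the device's
// dedicated arithmetic resources (e.g. carry chains) or a
// parameterized megafunction. A hand-built gate-level ripple
// adder describing each full adder explicitly would block that
// optimization without any algorithmic benefit.
module add32 (input  [31:0] a, b,
              output [32:0] sum);    // extra bit keeps the carry out
  assign sum = a + b;  // maps to dedicated arithmetic resources
endmodule
```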

 

[Fig. 9. Encoding tradeoffs for a simple FSM on the Altera STRATIX-EP1S FPGA and a Structured ASIC library, comparing gray code, binary, and onehot state encodings.]
