Fast Parallel FFT on a Reconfigurable Computation Platform

Amir H. Kamalizad, Chengzhi Pan, Nader Bagherzadeh
EECS Department, University of California, Irvine
Irvine, CA
{akamaliz;panc;[email protected]}

Abstract
This paper presents the implementation of a very fast parallel complex FFT on M2, the second generation of the MorphoSys reconfigurable computation platform, which targets streamed applications such as multimedia and DSP. The proposed mapping comprises fast presorting, cascaded radix-2 stages, and post-reordering. Data and twiddle factors use 16-bit real and 16-bit imaginary parts in 2's complement format, and scaling is performed to avoid overflow. The mapping is tested on our cycle-accurate simulator, "Mulate", and the performance compares favorably with other architectures such as Imagine and VIRAM. Moreover, the performance scales with FFT size. Since no functionality is specifically tailored to FFT, the results demonstrate the capability of the MorphoSys architecture to extract parallelism from streamed applications. Further rationale is given based on the concepts of scalar operand networks and memory hierarchy.

1. Introduction
As the billion-transistor era approaches, the design of today's computation platforms has already foreseen the end of the road for conventional micro-architectures [4], and numerous new approaches have appeared, such as EPIC (Itanium 2) [5], RAW [6], Imagine [7], and VIRAM [8][17]. Many of them target streamed applications, which already consume more than 90% of total computing cycles [7]. The biggest challenge in architecture design is scalability, without which one cannot keep pace with Moore's Law. Scalability is made difficult by the fact that wire transmission delay decreases more slowly than transistor switching delay. This discrepancy requires a new design philosophy for the scalar operand network [9] and the memory hierarchy.

Among existing architectures, some focus on a fast scalar operand network but have a flatter memory hierarchy, e.g., RAW; others take advantage of a deep memory hierarchy while sacrificing scalar operand network capability, e.g., Imagine and VIRAM. The former category has stronger interconnection power for extracting fine-grained Instruction Level Parallelism (ILP) and for facilitating parallelizing compiler design [6], but a flat memory hierarchy reduces the chance to fully exploit data locality. The latter tries to get the most out of the data locality typical of streamed applications, although the limited interconnection either makes some code segments un-vectorizable, as shown in VIRAM, or requires a hardware-specific language, like StreamC for Imagine. In this paper we present a fast, efficient, and scalable implementation of the FFT algorithm on a reconfigurable computation platform called MorphoSys Version 2, or M2 [1-3]. It has a scalar operand network bandwidth even higher than RAW and a memory hierarchy with no less bandwidth than VIRAM and Imagine. The key idea is to limit the memory size within each RAW-like tile, allowing more tiles and a faster single-cycle edge-to-edge scalar operand network, while adding another level of memory hierarchy to accommodate data locality at a larger scale. The platform is described briefly in section 2, together with a comparison against other architectures. To demonstrate the capability of M2, a complex fixed-point Fast Fourier Transform (FFT) algorithm of different sizes is mapped onto M2 in section 3. Section 4 addresses related work and performance comparisons. Conclusions are drawn in the last section.

2. MorphoSys Reconfigurable Computation Platform
A reconfigurable computation platform is an intermediate architectural solution between ASICs and general-purpose processors, aimed mainly at streamed multimedia applications. While such platforms offer almost the same flexibility as general-purpose processors, they approach the performance of fixed-function Application-Specific Integrated Circuits (ASICs). Reconfigurable architectures are predicted to dominate the DSP market, a multi-billion-dollar market, because ASIC design is expensive and the design process is slow.

2.1. Introduction to MorphoSys
MorphoSys is a reconfigurable computation platform targeting computation-intensive, data-parallel applications, including streamed applications. M1, the first prototype of MorphoSys, has been used as a platform for many applications such as multimedia, wireless communication, signal processing, and computer graphics. M2, the second generation of MorphoSys, follows the basic concepts of MorphoSys; however, it has been redesigned in both its scalar operand network and its memory hierarchy, and is thus greatly enhanced in performance. Feedback from numerous kernel and system mappings pointed out M1's bottlenecks, which have been revisited in M2. The M2 architecture consists of three main subsystems: a core processor called TinyRISC, an 8x8 array of Reconfigurable Cells (RCs) organized in SIMD fashion, and a special data-movement unit called the Frame Buffer (FB). The programming model is simple: TinyRISC takes charge of the whole Data Control Flow (DCF); the RC array and Frame Buffer are only triggered by TinyRISC and execute their own configurations (called contexts) continuously for a number of cycles specified by TinyRISC. The main differences between M1 and M2 are described in [3]. Figure 1 shows the block diagram. Main Memory can be either on-chip or off-chip without a significant difference in the connection interface, as long as Main Memory is also composed of 64 banks. The main changes in M2 that contribute to the better performance are improved inter-RC communication, more data bandwidth through the high-bandwidth Frame Buffer, and increased depth of programmability.

Figure 1. MorphoSys block diagram: TinyRISC core processor, Main Memory, DMA, Frame Buffer, Context Memory, and the 8x8 RC Array.

2.2. Comparison of M2 with others

Table 1 compares M2 with several representative architectures in terms of parallelism model, capability, scalar operand network [9], and memory hierarchy. First-order simulation results using Synopsys tools show a critical-path delay of about 1.8-2.3 ns; hence, 450 MHz is used in our simulation. The other data are either based on the M1 implementation and M2 post-synthesis results (current status), or projected as our design targets, which are chosen to be no more aggressive than commercial processors. From the table we can clearly see that M2 scores best in most criteria: more computation power than Imagine, the same scalar operand network throughput as RAW but with shorter latency, and the deepest memory hierarchy (three levels), to list a few. Though independently designed, M2 combines the structural advantages of the other three architectures: the overall host-processor/slave-computation-fabric structure of VIRAM, the controlled size of distributed memory within each ALU cluster as in Imagine, and a powerful scalar operand network as in RAW. First, the host-processor/slave-computation-fabric structure eases management of data flow and handling of sequential code, which is hard to parallelize. Second, the modest size of distributed memory enables more ALUs to be integrated in the array and, more importantly, reduces the latency of the scalar operand network, which is essential for extracting the finest ILP. Finally, a powerful scalar operand network helps extract fine ILP, gives the compiler flexibility (although there has already been some effort in this direction [10-11], compiler design is our next major step) to adopt a wide range of optimization techniques, both coarse-grain and fine-grain, and avoids the use of hardware-specific languages such as StreamC.

2.3. Programming model
One potential drawback of M2 is the SIMD model of the RC array. However, mapping experience on MorphoSys, VIRAM, Imagine, and SIMD extensions of general-purpose processors like Pentium MMX all shows that SIMD is sufficient for streamed multimedia applications, not to mention its much smaller code size compared to the MIMD model. For example, mapping the DVB-T system onto M2 does not need MIMD generality. Although an SPMD feature might make mapping an MPEG-2 decoder easier, the other subsystems mostly use affine array accesses, with a small portion of non-affine but static accesses involving random communication, which the scalar operand network can handle efficiently. Since SIMD programming has long been a mature field in general, it is reasonable to expect that an efficient C compiler for streamed applications is feasible, building on previous experience [10-11].


Table 1. Comparison of M2 with other architectures

| Category | Criterion | VIRAM | Imagine | RAW | M2 (with/without on-chip memory) |
|---|---|---|---|---|---|
| Parallelism | Parallelism model | Vector | SIMD | MIMD | SIMD |
| Capability | Peak 16-bit OPS | 6.4G | 23.7G | 3.6G (32-bit) | 28.8G |
| | Clock speed | 200 MHz | 296 MHz | 225 MHz | 450 MHz |
| | Chip area | 15x18 mm2 | 12x12 mm2 | 18x18 mm2 | 8x8 mm2 / 16x16 mm2 |
| | # Transistors | 120M | 21M | 122M | 120M / 20M |
| | Power | 2 W (average) | 4 W | 25 W | 4 W (peak MAC) |
| | Technology | 0.18 µm | 0.15 µm | 0.15 µm | 0.13 µm |
| Scalar Operand Network | # Network nodes | 8 (banks) | 8 | 16 | 64 |
| | Total bandwidth | 51.2 Gbps | 75.8 Gbps | 922 Gbps | 922 Gbps |
| | Latency* | | | | |
| | Full permutation | No | Yes | Yes | Yes |
| Memory Hierarchy | 1st level size | 64 Kb (VRF) | 96 Kb (LRF) | 16.4 Kb (RF) | 16.4 Kb (RF) |
| | 2nd level size | 104 Mb (DRAM) | 1 Mb (SRF) | 16 Mb (local mem.) | 2 Mb (local mem.) |
| | 3rd level size | Off-chip | Off-chip | Off-chip | 16 Mb / off-chip |
| | 1st level bandwidth | 205 Gbps | 758 Gbps | 230 Gbps | 1382 Gbps |
| | 2nd level bandwidth | 51.2 Gbps | 829 Gbps** | 334 Gbps | 461 Gbps |
| | 3rd level bandwidth | N/A | 12.4 Gbps | 200 Gbps | 115 Gbps / 56.7 Gbps |

* The latency is analyzed as a 5-tuple [9].
** Average usage is 152 Gbps.
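As a rough consistency check on the M2 scalar operand network entry (our reading, not stated in the table): if each of the 64 network nodes sources one 32-bit operand per 450 MHz cycle, the total matches the quoted figure:

$$64 \times 32\,\text{bit} \times 450\,\text{MHz} = 921.6\,\text{Gbps} \approx 922\,\text{Gbps}.$$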

Moreover, parallel programming on M2 can adopt either domain decomposition or function decomposition, the former being more straightforward, whereas Imagine has to stick with the more cumbersome function decomposition, which often suffers from load imbalance. Figure 2 illustrates a real code segment for the FFT butterfly operation, which also shows the programming model of MorphoSys.

Figure 2. Programming model in MorphoSys: sequential functions (control and data) are expressed as TinyRISC assembly, which drives the Frame Buffer, DMAC, and Context Memory, while data- and computation-parallel functions are expressed as RC contexts executed on the RC array. (The TinyRISC assembly and RC context listings of the original figure are not reproduced here.)
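Since the assembly listing of Figure 2 does not reproduce well here, the division of labor it illustrates can be summarized by the following hedged, C-like sketch. Every function name, the chunk count, and the cycle count below are hypothetical stand-ins (not a real MorphoSys API); the real program consists of TinyRISC instructions such as ldctxt, ldfb, cbcast/dbcbc, and stfb, plus RC contexts.

```c
#include <stdio.h>

/* Conceptual sketch only: TinyRISC owns all control flow, while the RC array
 * and Frame Buffer merely execute what TinyRISC triggers, for a specified
 * number of cycles.  All names below are hypothetical stand-ins for TinyRISC
 * instructions, NOT an actual API. */
static void load_contexts(const char *kernel)            { printf("ldctxt %s\n", kernel); }
static void fb_load_chunk(int chunk)                     { printf("ldfb   chunk %d\n", chunk); }
static void rc_array_run(const char *kernel, int cycles) { printf("cbcast %s (%d cycles)\n", kernel, cycles); }
static void fb_store_chunk(int chunk)                    { printf("stfb   chunk %d\n", chunk); }

int main(void)
{
    const int chunks = 64;          /* 8192-point FFT = 64 chunks of 128 points */
    const int kernel_cycles = 100;  /* placeholder; the real count is set by the context length */

    load_contexts("fft_butterfly");                   /* contexts into context memory       */
    for (int c = 0; c < chunks; c++) {
        fb_load_chunk(c);                             /* stream one 128-point chunk into FB */
        rc_array_run("fft_butterfly", kernel_cycles); /* all 64 RCs run the same context    */
        fb_store_chunk(c);                            /* write results back through the FB  */
    }
    return 0;
}
```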

2.4. Scalability
There is a common misunderstanding that scalability equals an even distribution of resources. This is not true at large scale. For example, as the tile count of the RAW processor grows, the scalar operand network latency increases and, more severely, the traffic load per tile increases, which eventually destroys scalability. To avoid this load increase, one should keep the atomic 8x8 basic array and instead use an additional level of inter-array communication network to accommodate the incremental load. This leads to hierarchical scaling behavior, with both distributed and centralized resources at each level. A true billion-transistor architecture, arriving around 2007, could be built from eight M2 arrays together with a counterpart of TinyRISC and an operand network at the higher level.

3. Mapping FFT on M2
To exploit the maximum parallelism of MorphoSys and the parallelism available in an application like FFT, an efficient mapping approach is essential. In this section the FFT algorithm is described briefly, its parallelism is pointed out, and an efficient mapping scheme is then presented.

3.1. FFT algorithm
The Discrete Fourier Transform (DFT) is defined as [12]:

$$X[k] = \sum_{n=0}^{N-1} x[n]\,W_N^{kn}, \qquad k = 0, 1, \ldots, N-1, \qquad (1)$$

in which $W_N = e^{-j 2\pi/N}$. The Fast Fourier Transform (FFT) is a fast algorithm for computing the DFT that reduces the number of multiplications from $N^2$ to $N \lg N$. Extensive Matlab fixed-point simulations were done to determine the scaling factors and resolve bit-precision issues. As a rule of thumb, one bit is needed for each stage of the FFT; therefore at least 13 bits are needed to obtain a good SNR.
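As a concrete reference for equation (1), the C sketch below evaluates the DFT directly from the definition. It is not the M2 kernel (which is written as TinyRISC code and RC contexts); it is only a floating-point reference against which a fixed-point FFT implementation can be validated, and the function name is ours.

```c
#include <complex.h>
#include <math.h>

/* Direct O(N^2) evaluation of equation (1):
 *   X[k] = sum_{n=0}^{N-1} x[n] * W_N^{k*n},  with  W_N = exp(-j*2*pi/N).
 * Floating-point reference only, used to check a fixed-point radix-2 FFT. */
void dft_reference(const double complex *x, double complex *X, int N)
{
    const double two_pi = 6.283185307179586;
    for (int k = 0; k < N; k++) {
        double complex acc = 0.0;
        for (int n = 0; n < N; n++) {
            double angle = -two_pi * (double)k * (double)n / (double)N;
            acc += x[n] * cexp(I * angle);   /* x[n] * W_N^{kn} */
        }
        X[k] = acc;
    }
}
```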

3.2. Overview of Mapping FFT on M2


We map onto M2 an all-radix-2 Decimation-In-Time (DIT) FFT algorithm. The DIT algorithm requires bit-reversed presorting before the FFT. We do not use radix-4 because of its higher data-communication overhead and programming difficulty, even though it requires fewer stages. Figure 3 shows the simplified Data Flow Graph (DFG) for computing an 8-point FFT; the diagram generalizes easily to larger N. From the DFG, we can identify two different types of parallelism in different stages: the first several stages involve no communication between RCs if the data are decomposed evenly across RCs, which leads to perfect parallelism; the last several stages involve a butterfly pattern of communication. As shown later, the butterfly permutation can be implemented very efficiently on the M2 network. Different applications require different FFT sizes: IEEE 802.11a needs a 64-point FFT, while DVB-T can require 8192 points. Considering M2's memory hierarchy, specifically the total number of registers, FFTs of length up to 128 can be handled entirely in registers, without using local memory. For FFTs longer than 128 points, our mapping scheme mainly exploits the second level of storage, i.e., the local memory, whose total size of 2 Mb is enough for almost all practical cases. The different steps of the FFT mapping are detailed below.

Figure 3. Data flow graph of an 8-point decimation-in-time FFT: three butterfly stages operating on the bit-reversed inputs x[0], x[4], x[2], x[6], x[1], x[5], x[3], x[7], with twiddle factors W_N^k, producing X[0] through X[7] in natural order.

3.3. Bit-reversed presorting
The M2 Frame Buffer (FB) is capable of performing the presorting with a degree of parallelism (DOP) of 64. As a module, the data delivered to the FFT subsystem is in the second order of the FB, i.e., data is stored in continuous address order within banks. The presorting algorithm we present is a two-level relocation. Denoting the FFT length N = 2^v:
1. Read the data in row i of the FB (first order).
2. Write it back to row bit-reverse(i), computed over v-6 bits, while the FB crossbar switch is set to a bit-reversing order.
Step 1 takes care of reversing the v-6 bits, and step 2 bit-reverses the first 6 bits, if we use first-order FB addressing. The problem is that the M2 FB crossbar (note that this is not the scalar operand network mentioned above) is not a complete crossbar, and the 6-bit presorting permutation sends all banks of one column to the same row, which the FB crossbar cannot realize directly. However, this permutation is easily achieved in two cycles: in the first cycle the data of one column are distributed over different rows and columns using a temporary row, and in the second cycle the desired permutation is completed. Details of these two crossbar switch configurations are not shown here but can be found in [13]. The overhead is the crossbar configuration data, which is 64*6 = 384 bits per permutation. Presorting can be done in 512 cycles. A complete crossbar would accelerate presorting to 256 cycles, but the hardware cost is not worthwhile.
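The index arithmetic behind this two-level relocation can be sketched in C as below. The sketch says nothing about the FB crossbar itself; it only illustrates (and checks) the fact that a full v-bit bit reversal factors into a (v-6)-bit row reversal and a 6-bit bank reversal with the two fields swapped. The function names are ours.

```c
#include <assert.h>
#include <stdint.h>

/* Reverse the lowest `bits` bits of x. */
static uint32_t bit_reverse(uint32_t x, int bits)
{
    uint32_t r = 0;
    for (int b = 0; b < bits; b++) {
        r = (r << 1) | (x & 1u);
        x >>= 1;
    }
    return r;
}

/* Two-level view of the presorting index map (our illustration, not the FB
 * crossbar configuration): for an N = 2^v point FFT laid out as 64 banks
 * (6 address bits) by N/64 rows (v-6 bits), the full v-bit reversal equals
 * a reversed 6-bit bank field placed on top of a reversed (v-6)-bit row field. */
static uint32_t bit_reverse_split(uint32_t index, int v)
{
    uint32_t row  = index >> 6;     /* upper v-6 bits */
    uint32_t bank = index & 0x3Fu;  /* lower 6 bits   */
    return (bit_reverse(bank, 6) << (v - 6)) | bit_reverse(row, v - 6);
}

int main(void)
{
    const int v = 13;               /* 8192-point FFT */
    for (uint32_t i = 0; i < (1u << v); i++)
        assert(bit_reverse_split(i, v) == bit_reverse(i, v));
    return 0;
}
```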

3.4. FFT butterfly stages
The FFT butterfly of stage k is formulated as:

$$X_k[n] = X_{k-1}[n] + \mathrm{Twiddle}\cdot X_{k-1}[n+d]$$
$$X_k[n+d] = X_{k-1}[n] - \mathrm{Twiddle}\cdot X_{k-1}[n+d] \qquad (2)$$

for $d = 2^{k-1}$ and $n \in [0, 2^{k-1})$ within each group.

To perform the FFT, consecutive chunks of 128 data points are loaded into the RC registers. The data layout in the RC array is such that data 0 goes to RC(0,0), data 1 to RC(1,0), and so on, until data 64 goes to another register pair (one register for the real part and one for the imaginary part) of RC(0,0). As equation (2) states, for stages 1, 2, and 3 the FFT butterfly needs communication between RCs in the same row, and for stages 4, 5, and 6 it needs communication between RCs in the same column. For stages after 6, d is a multiple of 64, so the butterfly operates on data within the same RC, with no communication. Figure 4 shows these butterfly stages. Thus, data communication within the 2-D array is reduced to 1-D communication only. As a basic operation, a complex multiplication takes 4 MAC operations of 3 cycles each. Right-shift scaling is performed for precision control. A redundant twiddle-factor table is used in the FB to accelerate feeding the RC array with the appropriate coefficients by eliminating communication; the cost is storing 8448 twiddle factors instead of 4096. If the FFT size is larger than 128, the results are saved in local memory after the first 7 stages are performed. The next steps are then to load the data from local memory back into registers and to perform the twiddle multiplications and butterfly operations with a DOP of 64; in these steps all 64 RCs work in pure SIMD.
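The per-butterfly arithmetic of equation (2) in 16-bit fixed point can be sketched in C as follows. The Q15 format, the rounding, and the right shift by one on both outputs are our illustrative choices, consistent with the four-MAC complex multiply and the one-bit-per-stage scaling rule of section 3.1; the actual kernel encodes the same operations as RC context MAC instructions.

```c
#include <stdint.h>

typedef struct { int16_t re, im; } cplx16;   /* 16-bit real + 16-bit imaginary */

/* Complex multiply in Q15: four real multiplies (the four MACs mentioned in
 * the text), followed by a rounding shift back to 16 bits. */
static cplx16 cmul_q15(cplx16 a, cplx16 w)
{
    int32_t re = (int32_t)a.re * w.re - (int32_t)a.im * w.im;
    int32_t im = (int32_t)a.re * w.im + (int32_t)a.im * w.re;
    cplx16 r = { (int16_t)((re + (1 << 14)) >> 15),
                 (int16_t)((im + (1 << 14)) >> 15) };
    return r;
}

/* One radix-2 DIT butterfly of equation (2), with a right shift by one bit
 * on both outputs to absorb the per-stage growth (one bit per stage). */
static void butterfly(cplx16 *x_n, cplx16 *x_nd, cplx16 twiddle)
{
    cplx16 t = cmul_q15(*x_nd, twiddle);
    cplx16 a = *x_n;
    x_n->re  = (int16_t)(((int32_t)a.re + t.re) >> 1);
    x_n->im  = (int16_t)(((int32_t)a.im + t.im) >> 1);
    x_nd->re = (int16_t)(((int32_t)a.re - t.re) >> 1);
    x_nd->im = (int16_t)(((int32_t)a.im - t.im) >> 1);
}
```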

Figure 4. First stages of the FFT on the 8x8 RC array (rows 0-7, columns 0-7, holding data 1-64): the first 3 stages communicate along rows, the second 3 stages communicate along columns, and stages 7, 8, ... stay inside each RC.


3.5. Reordering
After the FFT is performed, the data are in the first order, i.e., addresses are continuous across banks. However, the module following the FFT needs data in the second order, so the data must be reordered in the FB. The M2 FB can accomplish this task very quickly using a reordering buffer. This concludes our scheme for mapping the 8K-point FFT onto M2.

3.6. Mapping statistics
The TinyRISC code and RC contexts for this mapping were written by hand. The contexts needed for the FFT comprise 42 row-context planes and 30 column-context planes. The TinyRISC code for the computation part of the 8K-point FFT is 498 lines of 32 bits each. The whole program is simulated on a cycle-accurate simulator called "Mulate". Cycles used for twiddle-factor and context loading are a one-time overhead and are therefore not included in the processing time.

3.7. Scalability according to FFT size
A very useful feature of the proposed FFT engine is its scalability, which also demonstrates the scalability of M2's memory hierarchy. An FFT of any size can be mapped onto M2 with the same methodology, as long as there is enough space in local memory for the twiddle factors and data. Table 2 shows the cycle counts and processing times for different FFT sizes on the M2 platform.

Table 2. Cycle counts and processing times for FFT

| FFT Size | Cycle Count | Processing Time (µs) |
|---|---|---|
| 8192 | 26692 | 59.3 |
| 4096 | 12321 | 27.4 |
| 2048 | 5729 | 12.7 |
| 1024 | 2613 | 5.8 |
| 512 | 1222 | 2.7 |
| 256 | 520 | 1.15 |
| 128 | 225 | 0.5 |
| 64 | 111 | 0.25 |
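The processing times follow directly from the cycle counts and the 450 MHz simulation clock of section 2.2; for the 8192-point case, for example:

$$\frac{26692\ \text{cycles}}{450\ \text{MHz}} \approx 59.3\ \mu\text{s}.$$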

To change the FFT size, only a few loop parameters need to be modified. It is worth mentioning that an IFFT of the same size requires minor modification of the TinyRISC code, while the contexts, twiddle factors, and crossbar configuration data can all be reused.

4. Benchmark comparisons

Streamed multimedia applications are inherently computation-intensive and benefit from data-level parallelism. Multimedia processors incorporate large numbers of processing units and huge memory bandwidth to achieve high performance. Very Long Instruction Word (VLIW), vector processing, SIMD extensions, and superscalar execution are the main design themes for DSP processors; for instance, Texas Instruments' TMS320C62x™ is based on VelociTI™, an advanced 8-slot VLIW architecture [15] (TI's C64x family performs better than the C62x, but radix-2 FFT benchmarks are not available for it). We compare the results on M2 with these state-of-the-art architectures. ASIC implementations and IP cores for FPGAs are also available for multimedia applications and for FFT in particular [14][16]. Since few benchmarks exist for the 8K-point FFT, 1024-point FFT results on different platforms are compared in Table 3 to provide more data points. Note that the VIRAM ISA uses special instructions ("vhalfdn" and "vhalfup") to accelerate FFT butterfly operations. Table 4 compares benchmarks for the 8192-point FFT (from a cycle-count point of view, M2 computes the 8192-point FFT in 17988 cycles, whereas TI's C62x takes 215165 cycles). Some may argue that the TI DSP uses fewer transistors, which is true, but VLSI technology is delivering many more resources per chip these days, and the challenge is to design an architecture that uses those resources to achieve better performance while considering power. As the FFT size grows, M2's performance approaches that of its ASIC counterparts. Mapping experience with other applications on MorphoSys shows its ability to extract parallelism from a wide range of algorithms, especially in multimedia, computer graphics, wireless communication, and DSP.

Table 3. Processing time comparison for 1024-point FFT

| Platform | Processing Time (µs) |
|---|---|
| M2 | 5.8 |
| VIRAM | 26.4 |
| Imagine (floating-point) | 7.4 |
| TMS320C6201 | 104 |
| Virtex II | 7.31 |

Table 4. Processing time comparison for 8192-point FFT

| Platform | Processing Time (µs) |
|---|---|
| M2 | 59.3 |
| TMS320C62x | 1064 |
| Amphion CS 2420TK (TSMC 0.18 µm; 153 MHz) | 159.8 |

5. Conclusion
A new generation of the MorphoSys architecture has been described, a comparison with other architectures has been given, and the mapping of a very fast parallel complex FFT has been presented. The FFT engine is designed and tested for sizes between 64 and 8192 points and is scalable to even larger sizes. We also present a fast presorting scheme. M2 outperforms other streamed multimedia processors, and its processing time is comparable to ASIC implementations for large FFT sizes, while M2 offers lower power, lower cost, and higher flexibility. Although the better process technology is a factor, the performance boost of M2 over M1 is mainly due to the more extensive interconnection network, the better memory hierarchy, and design improvements such as multi-cycle MAC operations and the wide, interleaved frame buffer.

Acknowledgments
This work was sponsored by DARPA (DoD) under contract F-33615-97-C-1126, by the National Science Foundation (NSF) under grant CCR-0083080, and by State of California CoRe-funded research in cooperation with Broadcom Corporation. We would also like to thank the other members of the MorphoSys research group.

References
[1] H. Singh, Lee, Lu, Bagherzadeh, Kurdahi, "MorphoSys: An Integrated Reconfigurable System for Data-Parallel and Computation-Intensive Applications," IEEE Transactions on Computers, vol. 49, no. 5, May 2000, pp. 465-481.
[2] Lee, Singh, Lu, Bagherzadeh, Kurdahi, "Design and Implementation of the MorphoSys Reconfigurable Computing Processor," Journal of VLSI Signal Processing Systems for Signal, Image, and Video Technology, vol. 24, no. 2-3, Kluwer Academic Publishers, Mar. 2000, pp. 147-164.
[3] Pan, Bagherzadeh, Kamalizad, Koohi, "Design and Analysis of a Programmable Single-Chip Architecture for DVB-T Base-Band Receiver," Proceedings of Design, Automation and Test in Europe Conference and Exhibition, Mar. 2003, pp. 468-473.
[4] Agarwal, Hrishikesh, Keckler, Burger, "Clock rate versus IPC: the end of the road for conventional microarchitectures," Proceedings of the 27th International Symposium on Computer Architecture, 2000, pp. 248-259.
[5] Schlansker, Rau, "EPIC: Explicitly Parallel Instruction Computing," IEEE Computer, vol. 33, no. 2, Feb. 2000, pp. 37-45.
[6] Taylor, et al., "The Raw microprocessor: a computational fabric for software circuits and general-purpose programs," IEEE Micro, vol. 22, no. 2, Mar./Apr. 2002, pp. 25-35.
[7] Rixner, et al., "A bandwidth-efficient architecture for media processing," Proceedings of the 31st Annual ACM/IEEE International Symposium on Microarchitecture, Nov./Dec. 1998, pp. 3-13.
[8] Kozyrakis, Patterson, "Vector vs. superscalar and VLIW architectures for embedded multimedia benchmarks," Proceedings of the 35th Annual IEEE/ACM International Symposium on Microarchitecture, 2002, pp. 283-293.
[9] Taylor, Lee, Amarasinghe, Agarwal, "Scalar Operand Networks: On-Chip Interconnect for ILP in Partitioned Architectures," Proceedings of the Ninth International Symposium on High-Performance Computer Architecture, Feb. 2003, pp. 341-353.
[10] Venkataramani, Najjar, Kurdahi, Bagherzadeh, Bohm, "A Compiler Framework for Mapping Applications to a Coarse-grained Reconfigurable Computer Architecture," CASES 2001, Atlanta, GA, Nov. 2001.
[11] Maestre, Kurdahi, Bagherzadeh, Singh, Hermida, Fernandez, "Kernel scheduling in reconfigurable computing," Proceedings of Design, Automation and Test in Europe Conference and Exhibition, Mar. 1999, pp. 90-96.
[12] Alan V. Oppenheim, Ronald W. Schafer, "Discrete-Time Signal Processing," Prentice Hall, New Jersey, 1989.
[13] Kamalizad, "Several DVB-T Cores Mapping into MorphoSys Architecture," M.Sc. Thesis, University of California, Irvine, Jan. 2003.
[14] www.amphion.com
[15] www.ti.com
[16] www.xilinx.com
[17] Kozyrakis, "Scalable Vector Media Processors for Embedded Systems," Ph.D. Thesis, Technical Report UCB-CSD-02-1183, University of California at Berkeley, May 2002.

