A New Scalable DSP Architecture for Mobile Communication Applications

Matthias H. Weiss and Gerhard P. Fettweis*

Mobile Communications Systems Chair, Department of Electrical Engineering, Dresden University of Technology, 01062 Dresden, Germany
[email protected]
Abstract. Digital Signal Processors (DSPs) are employed in all fields of mobile communication applications, e.g. in speech, audio, modem, video, or graphics processing. Due to different processing requirements, today a different DSP has to be employed for each application. Thus, in contrast to existing DSP solutions, our work is targeted towards designing expandable and customizable DSP architectures. To achieve this goal, the DSP memory and register files must be scalable. Instead of using large register files or matrix memories, our DSP architecture employs a group memory. In this memory, several data elements are treated as one group and share one common address. This allows us to keep the address generation unit simple. By employing a register file which is optimized for digital signal processing algorithms, our DSP architecture becomes fully scalable. Using this architecture, DSP design can be substantially facilitated and design cycles can be widely reduced. The usefulness of this paradigm is demonstrated by showing its performance on major DSP algorithms such as filters or FFTs.
1 Introduction

Mobile communication systems often employ Digital Signal Processors (DSPs) to carry out software-driven digital signal processing tasks; examples are sound, video, modem, or speech applications. In these systems, DSPs are added either separately or directly as a System-on-Chip (SoC) component. In this SoC domain, system tasks can be partitioned onto different SoC components such as microcontrollers, dedicated Application Specific Integrated Circuits (ASICs), or DSPs.
* This work has been sponsored in part by the Deutsche Forschungsgemeinschaft (DFG) within the Sonderforschungsbereich (SFB) 358.
Proceedings of the Fourth Australasian Computer Architecture Conference, Auckland, New Zealand, January 18-21, 1999. Copyright Springer-Verlag, Singapore. Permission to copy this work for personal or classroom use is granted without fee provided that: copies are not made or distributed for profit or personal advantage; and this copyright notice, the title of the publication, and its date appear. Any other use or copying of this document requires specific prior permission from Springer-Verlag.
If the DSP needs to provide more performance but cannot be extended, tasks have to be supported by an ASIC, although DSP flexibility is often desired. Thus, a DSP architecture is required which can be adapted to system needs.
Requirements  This becomes obvious in applications with high data rates, such as high-data-rate modems, video codecs, or graphics codecs. Here, data rates are in a similar range as the processor clock. Thus, processing has to be sped up by exploiting parallelism. This can be achieved by exploiting coarse-grain parallelism, e.g. by implementing different algorithm blocks in different hardware units, so that the data flow graph is imitated by the hardware implementation. However, if the application changes, the hardware has to be changed as well. A different approach is to exploit fine-grain parallelism, e.g. by employing a processor architecture with parallel datapaths. This provides a software-programmable solution and can even be more efficient than a hardwired solution [Kim et al., 97]. Since parallelism is exploited at the instruction level, different algorithms can exploit one common architecture. Thus, in the SoC domain a programmable processor architecture is needed which provides scalability and expandability.
Current DSP architectures  One way to expand DSP architectures is to duplicate the complete core. An example of such a multi-core solution is TI's C80x. This architecture comprises four DSP units, four memories, and one RISC core, connected by a crossbar switch. However, this leads to an expensive communication network and requires a complicated programming paradigm. TI even reduced this architecture down to two cores by providing the C82x architecture. A different approach is taken by media processors. These processors, such as TI's C6x or Philips' TriMedia, are typically based on a Very Long Instruction Word (VLIW) Instruction Set Architecture (ISA), employ register files, and provide packed arithmetic. Here, the communication problem is solved via an expensive global register file, which even had to be split into two parts in TI's C6x case. Another way to solve this communication problem is to use a matrix memory as suggested by [Wittenburg et al., 97] or [Trenas et al., 98]. However, memory is a crucial system issue and often standard memory components are preferable. Thus, MicroUnity employed a wide standard memory which contains grouped data [Hansen, 96]. Here, several data elements share one common address and are accessed in groups only. The processing is based on the Single Instruction Multiple Data (SIMD) principle. However, MicroUnity's MediaProcessor employs a special communication unit which accesses the group register file. If this architecture is extended, this communication unit has to be redesigned. In this paper we present a scalable DSP architecture which avoids the problems mentioned above. By using a group memory, slicing the architecture, exploiting SIMD properties, and employing a modular VLIW ISA, this architecture can easily be adapted to an application's needs. In Section 2 we introduce our scalable DSP architecture and present methods to support different algorithms. Furthermore, algorithm classes are introduced which employ the grouping principle.
In Section 3, the impact of exploiting parallelism on our DSP architecture is examined for these classes. Finally, in Section 4, results for variations of this DSP architecture and benchmarks are presented.
2 Algorithm-Architecture Interaction

To satisfy the target application's needs, the DSP architecture must be easily adaptable. Thus, the basic architecture must be expandable by additional datapaths or function units. To provide this property, we applied the method of orthogonalization. Both algorithms and architecture are orthogonalized into a data transfer part (Section 2.1) and a data manipulation part [Fettweis, 97]. While data transfer includes all memory-register, register-register, or memory-memory transfers, data manipulation includes datapath functionality such as fused vs. divided Multiply/Accumulate (MAC), Galois vs. integer arithmetic [Drescher et al., 98], or the Add-Compare-Select (ACS) functionality to support the Viterbi algorithm. This approach allows us to classify an algorithm's data transfer behaviour independently of its data manipulation. Thus, we can concentrate on data transfer to find a common memory architecture which can be optimized towards the target application's needs (Section 3.5).
2.1 Data Transfer

One of the major differences between DSP architectures is the way data is read from memory to the manipulation units and vice versa. The Multiply/Accumulate (MAC) unit, for instance, requires three data words at its input and provides one output. If this MAC unit is duplicated, six inputs and two outputs must be covered. In general, register files are introduced to avoid eight memory accesses per cycle. However, a register file is a crucial design issue for power- and area-efficient implementations [Faraboschi et al., 98]. Thus, TI splits its register file into two parts, while in the HiPAR-DSP the generality of the register file is reduced [Wittenburg et al., 97]. In contrast to this general register file approach, our register file comprises only connections targeted towards DSP applications. These connections support the following data transfer classes, which are described in detail in Section 3:
- Vector Data Transfer (VDT),
- Sliding Window Data Transfer (SWDT), and
- Shuffle Data Transfer (SDT).
Each of these classes has different requirements on the connectivity of our register file. Thus, to provide a scalable architecture, the register file has to support these classes and still has to remain scalable.
2.2 Scalable DSP Architecture

In general, a DSP architecture comprises function units for data processing, such as datapaths (ALUs, MACs) or memory units, and for program processing, such as a Program Control Unit, a Loop Unit, or an Address Generation Unit.
Fig. 1. Sliced architecture: a group memory with simple address generation feeds several slices; each slice contains local registers (Regs), a local communication unit, and a MAC with accumulator; the slices are connected by an Inter Communication Unit and controlled by a SIMD PCU.
To provide a scalable DSP architecture, data processing units must be customizable to allow adaptation to system needs, while program processing units must be widely independent of architectural changes to avoid expensive redesigns between two DSP architectures. We solved the first issue by slicing the architecture, while the latter was achieved by our new modular Instruction Set Architecture (ISA) called Tagged Very Long Instruction Word (TVLIW) [Weiss and Fettweis, 96]. In our sliced architecture (Fig. 1), each slice contains one element of the group memory, several local registers, and datapaths, e.g. a MAC and a shifter. Several slices can be attached as required by system needs. Obviously, this sliced architecture is advantageous for vector processing: each slice computes one element of a vector. However, not all digital signal processing algorithms can be computed efficiently by vector processing; examples are filters or transformations. These algorithms require communication between elements. Thus, our scalable register file consists of a communication unit and local registers.
3 Parallelization and Algorithm/Architecture Mapping

3.1 Grouping Principle and Notation

To provide a scalable memory architecture we employ a group memory. As depicted in Fig. 2, one group consists of a number of elements and has a fixed width. The notation is as follows: a block of data $y_k$ can be divided into groups Gr-{group#}.{element#}, with element numbers ranging from 1 to the number of slices. The expression $y_k^{\text{Gr-1.1}}$ denotes the first element of the first group in the block $y_k$. Thus, by accessing one element of a group, all elements which are part of this group are accessed as well. Several groups build a block, which can be processed by block floating-point mechanisms [Kobayashi and Fettweis, 98]. Groups can be manipulated by a number of parallel function units denoted by {function}-{element#}. Thus, MAC-1(a,b,c) denotes the Multiply-ACcumulate functionality $a + b \cdot c$ applied to element number 1.
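To make the notation concrete, the following C sketch models a group memory and the SIMD MAC operation. The group width of eight elements, the 16-bit element type, and the function names are our own illustrative assumptions, not part of the architecture definition.

  #include <stdint.h>

  #define SLICES 8                    /* assumed group width (elements per group) */

  typedef int16_t elem_t;             /* assumed element type                     */
  typedef struct { elem_t e[SLICES]; } group_t;   /* one group, one shared address */

  /* Reading one group of a block: all SLICES elements arrive with a single access.
   * y_k^Gr-1.1 from the text corresponds to block[0].e[0] here (0-based indices). */
  static group_t read_group(const group_t *block, int g) {
      return block[g];
  }

  /* MAC-i(a,b,c): the multiply-accumulate a + b*c applied to element i only.      */
  static void mac_i(group_t *res, const group_t *a, const group_t *b,
                    const group_t *c, int i) {
      res->e[i] = a->e[i] + b->e[i] * c->e[i];
  }

  /* SIMD view: every slice applies the same MAC to its own element of the groups. */
  static void mac_group(group_t *res, const group_t *a, const group_t *b,
                        const group_t *c) {
      for (int i = 0; i < SLICES; i++)
          mac_i(res, a, b, c, i);
  }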
Fig. 2. Structure of the group memory: (a) a data block of single data elements (single-datapath DSP) vs. (b) a data block of data groups (multiple-datapath DSP).
To reduce the number of memory accesses, groups can be stored in group registers. One group register contains one local register in every slice. Thus, group registers have the same width as the group memory. To permute elements within a group, we provide two units:
- Permutation of neighbouring elements only, via a Local Communication Unit (LCU);
- General permutation of elements, via an Inter Communication Unit (ICU).
In general, the ICU is a crossbar network. This can be implemented in a general manner as in MicroUnity's case [Hansen, 96]. In our approach, we further minimize the ICU's complexity, as described in Section 3.5. This method exploits the fact that not all algorithms require inter-slice communication (a simple software model of both units is sketched below). Thus, in a first step, all algorithms are classified into one of the following data transfer classes.
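As a rough software model of the two permutation units (again with an assumed group width of eight and assumed data types; the real LCU and ICU are hardware structures, not function calls), neighbour permutation and crossbar routing might be pictured as follows.

  #include <stdint.h>

  #define SLICES 8                                  /* assumed group width       */
  typedef int16_t elem_t;                           /* assumed element type      */
  typedef struct { elem_t e[SLICES]; } group_t;

  /* LCU: permutes neighbouring elements only, e.g. a one-element slide.
   * Each slice takes the element of its right-hand neighbour; a new element
   * enters at the edge.                                                          */
  static group_t lcu_slide(group_t g, elem_t new_edge_element) {
      group_t out;
      for (int i = 0; i < SLICES - 1; i++)
          out.e[i] = g.e[i + 1];
      out.e[SLICES - 1] = new_edge_element;
      return out;
  }

  /* ICU: arbitrary element-to-slice routing, modelled here as a full crossbar.
   * src[i] names the source element delivered to slice i; broadcasting element 0
   * to all slices (src = {0,0,...,0}) needs only one bus in hardware.            */
  static group_t icu_permute(group_t g, const int src[SLICES]) {
      group_t out;
      for (int i = 0; i < SLICES; i++)
          out.e[i] = g.e[src[i]];
      return out;
  }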
3.2 Vector Data Transfer

The most common usage of the Vector Data Transfer class are matrix operations. An example is the vector-matrix multiplication $y_k = \sum_{i=0}^{N-1} c_i\, x_{k,i}$. By duplicating this equation, parallelism can easily be exploited:
\[
y_k = \sum_{i=0}^{N-1} c_i\, x_{k,i}, \qquad
y_{k+1} = \sum_{i=0}^{N-1} c_i\, x_{(k+1),i}.
\]
By grouping the elements $x_{k,i}$ and $x_{k+1,i}$, one group access per parallel computation is required. If the group size is a multiple of the vector size, communication is needed only to distribute the shared coefficient $c_i$. This principle is typically exploited by vector computers.
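As an illustration of Vector Data Transfer, the following C sketch computes the two duplicated sums with one group access per index $i$. The group width of two, the data types, and the broadcast of $c_i$ are assumptions chosen to mirror the equations above.

  #include <stdint.h>

  #define SLICES 2                         /* assumed: one slice per output y_k, y_{k+1} */
  typedef int32_t elem_t;                  /* assumed element type                        */
  typedef struct { elem_t e[SLICES]; } group_t;

  /* Vector Data Transfer: xg[i] is the group (x_{k,i}, x_{k+1,i}); the shared
   * coefficient c_i is broadcast to all slices over a single ICU bus.           */
  static group_t vector_mac(const group_t *xg, const elem_t *c, int N) {
      group_t y = {{0}};
      for (int i = 0; i < N; i++)          /* one group access per iteration     */
          for (int s = 0; s < SLICES; s++) /* SIMD: all slices in lock-step      */
              y.e[s] += c[i] * xg[i].e[s]; /* y_{k+s} += c_i * x_{k+s,i}         */
      return y;
  }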
3.3 Sliding Window Data Transfer

A common example of Sliding Window Data Transfer is the Finite Impulse Response (FIR) filter: $y_k = \sum_{i=0}^{N-1} c_i\, x_{k-i}$.
Table 1. Ratio between parallel execution and pipeline fill for a 5-tap FIR filter

                               2 MACs   4 MACs   8 MACs
  Pipeline fill (t_pf)              1        3        7
  Parallel execution (t_pe)         5        5        5
  t_pf / t_pe                     0.2      0.6      1.4
Here, a common way to exploit parallelism in a DSP architecture is to compute several filter outputs $y[\ldots]$ in parallel:
\[
y_k = \sum_{i=0}^{N-1} c_i\, x_{k-i}, \qquad
y_{k+1} = \sum_{i=0}^{N-1} c_i\, x_{(k+1)-i}. \qquad (1)
\]
With this approach, the coefficient $c_i$ can be shared. By inserting a delay register into the architecture, two MACs can be kept busy with only two memory accesses per iteration. This method is applied in almost all dual-MAC DSPs such as TI's C6x, Lucent's 16k, or Siemens' Carmel DSP. In [Weiss et al., 97] we extended this principle even to a four-MAC DSP architecture. However, if the number of MACs is increased further, filling and flushing the chain of delay registers becomes a dominant factor, as shown in Table 1. This problem can be avoided by reading all input data $x[\ldots]$ at once. In our memory architecture both data and coefficients are grouped, i.e. several elements are combined into one group and accessed in parallel. Thus, before starting the first iteration $i = 0$ of (1), two groups (Gr-2 and Gr-3) must be read. While group Gr-3 can be used directly as a vector, in the coefficient group Gr-2 one element has to be distributed to all slices. This requires one bus in the ICU.
\[
\begin{aligned}
y_k^{\text{Gr-1.1}} &= \text{MAC-1}(y_k^{\text{Gr-1.1}},\, c_0^{\text{Gr-2.1}},\, x_{k-0}^{\text{Gr-3.1}}) && \text{(slice 1)}\\
y_k^{\text{Gr-1.2}} &= \text{MAC-2}(y_k^{\text{Gr-1.2}},\, c_0^{\text{Gr-2.1}},\, x_{(k+1)-0}^{\text{Gr-3.2}}) && \text{(slice 2)} \qquad (2)\\
y_k^{\text{Gr-1.3}} &= \text{MAC-3}(y_k^{\text{Gr-1.3}},\, c_0^{\text{Gr-2.1}},\, x_{(k+2)-0}^{\text{Gr-3.3}}) && \text{(slice 3)}\\
&\;\;\vdots
\end{aligned}
\]
With this approach, pipeline loss can be avoided. The next iteration $i = 1$ requires the coefficient $c_1$, which is an element of the previously read group Gr-2. For the input data, one additional element ($x_{k-1}^{\text{Gr-4.8}}$, assuming that a group contains eight elements) is needed, while the remaining elements slide to the left (or right) and can therefore be reused. Thus, for iteration $i = 1$ only one additional group (Gr-4) must be read. During the following iterations the memory can be switched off until a new group of coefficients or elements has to be read. With this approach, the number of memory accesses is independent of the degree of parallelism (here: the number of MACs). To support sliding-window algorithms such as filters, local communication is required. Registers holding groups Gr-3/$x[\ldots]$ must be connected locally to support Sliding Window Data Transfer.
Fig. 3. Flowgraph of an 8-point FFT: (a) Singleton vs. (b) Cooley-Tukey.
For distributing the coefficient $c_i$ to all MACs, one bus in the ICU is required. Finally, the group register holding the backup data Gr-4/$x[\ldots]$ must provide one element per iteration to the right or left edge of the sliding window register. This can be achieved either by an additional bus in the ICU or by a local connection in the LCU, which provides Sliding Window Data Transfer as well. The implementation can be chosen depending on the needs of the other algorithms. With this approach, both Sliding Window Data Transfer and Vector Data Transfer are independent of the architecture's parallelism. Thus, the DSP architecture can be extended, e.g. by adding slices, without modifying the fundamental architecture.
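A minimal C sketch of one sliding-window FIR iteration follows; the group width of eight, the 32-bit accumulators, and the assumption that the new sample has already been fetched from the backup group Gr-4 are illustrative choices, and the sketch only mirrors the data movement described above, not the actual hardware.

  #include <stdint.h>

  #define SLICES 8                         /* assumed group width                 */
  typedef int32_t elem_t;
  typedef struct { elem_t e[SLICES]; } group_t;

  /* One iteration i of the grouped FIR (cf. (2)):
   *   acc  - accumulator group Gr-1, acc.e[s] holds y_{k+s}
   *   win  - sliding-window group Gr-3, win.e[s] holds x_{k+s-i}
   *   c_i  - coefficient broadcast from group Gr-2 over one ICU bus
   *   next - next-older sample x_{k-(i+1)}, provided from the backup group Gr-4  */
  static void fir_iteration(group_t *acc, group_t *win, elem_t c_i, elem_t next) {
      for (int s = 0; s < SLICES; s++)     /* SIMD MAC in every slice             */
          acc->e[s] += c_i * win->e[s];    /* y_{k+s} += c_i * x_{k+s-i}          */
      for (int s = SLICES - 1; s > 0; s--) /* the window slides by one element    */
          win->e[s] = win->e[s - 1];
      win->e[0] = next;                    /* only one new element per iteration  */
  }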
3.4 Shuffle Data Transfer

Common algorithms of the Shuffle Data Transfer class are transformations such as the FFT, DCT, or DST, the Viterbi algorithm, or sorting algorithms [Fettweis and Bitterlich, 91]. In contrast to Vector Data Transfer, this data transfer class inherently requires communication, which is typically solved by dedicated networks such as hypercubes [Gupta and Kumar, 93], by vectorizing techniques for large FFTs [Swarztrauber, 1982], by partitioning [Ju, 97], or by employing higher radixes [Bi and Jones, 89], [Gold and Bially, 73]. These techniques are typically targeted towards large FFTs (1024 and more complex points) and assume local memory. In the following we derive techniques to compute small FFTs as well by employing the DSP's group memory.
FFT based on a de Bruijn network  Similar to the FIR example (Section 3.3), parallelism in the FFT can be exploited by duplicating equations. We start from the FFT suggested by Singleton [Singleton, 67]:
\[
y_k = x_{2k} + x_{2k+1}\, W_{k,m,l}, \qquad
y_{k+n/2} = x_{2k} - x_{2k+1}\, W_{k,m,l},
\]
where $m$ denotes the number of stages, $l = 1, 2, \ldots, m$ the current stage, $k = 1, 2, \ldots, n/2$ the position in this stage, and $W_{k,m,l} = e^{\,i\,(k\,2^{m-l})/2^{l-1}}$ the twiddle factor. Here, all operations are complex. Singleton's FFT representation leads to a de Bruijn network as depicted in Fig. 3a.
Fig. 4. Element distribution when accessing groups: the elements of the input groups Gr-1 and Gr-2 are routed through the communication network to MAC-1 and MAC-2.
By duplicating these equations, two butterflies can be computed in parallel (3):
\[
\begin{aligned}
y_k &= x_{2k} + x_{2k+1}\, W_{k,m,l} & y_{k+n/2} &= x_{2k} - x_{2k+1}\, W_{k,m,l}\\
y_{k+1} &= x_{2k+2} + x_{2k+3}\, W_{k+1,m,l} & y_{k+1+n/2} &= x_{2k+2} - x_{2k+3}\, W_{k+1,m,l}
\end{aligned}
\qquad (3)
\]
Here, four complex input values are needed: $x_{2k+m},\ \forall m \in (0, \ldots, 3)$. To exploit our memory architecture, these values can be grouped. However, the output values are not adjacent. In this radix-2 case, two output groups can be built: $y_{k+m},\ \forall m \in (0, 1)$ and $y_{k+n/2+m},\ \forall m \in (0, 1)$. Thus, to achieve similarly sized groups, the input values are also distributed onto two groups: $x_{2k+m},\ \forall m \in (0, 1)$ and $x_{2k+m},\ \forall m \in (2, 3)$. With the help of the group memory, the number of memory accesses is independent of the degree of parallelism, e.g. independent of the parallel computation of butterflies within one stage. Now, however, elements within a group have to be rearranged. If these equations are distributed over two MACs (MAC-1 and MAC-2), the following groups (Gr-{group#}.{element#}) must be accessed:
\[
\begin{aligned}
y_k^{\text{Gr-1.1}} &= \text{MAC-1}(x_{2k}^{\text{Gr-1.1}},\, x_{2k+1}^{\text{Gr-1.2}},\, W_{k,m,l}^{\text{Gr-3.1}})\\
y_{k+n/2}^{\text{Gr-2.1}} &= \text{MAC-1}(x_{2k}^{\text{Gr-1.1}},\, x_{2k+1}^{\text{Gr-1.2}},\, W_{k,m,l}^{\text{Gr-3.1}})\\
y_{k+1}^{\text{Gr-1.2}} &= \text{MAC-2}(x_{2k+2}^{\text{Gr-2.1}},\, x_{2k+3}^{\text{Gr-2.2}},\, W_{k+1,m,l}^{\text{Gr-3.2}})\\
y_{k+1+n/2}^{\text{Gr-2.2}} &= \text{MAC-2}(x_{2k+2}^{\text{Gr-2.1}},\, x_{2k+3}^{\text{Gr-2.2}},\, W_{k+1,m,l}^{\text{Gr-3.2}})
\end{aligned}
\qquad (4)
\]
Here, the first group Gr-1 contains the elements $x_{2k}^{\text{Gr-1.1}}$ and $x_{2k+1}^{\text{Gr-1.2}}$. When this group is read, its elements have to be delivered to MAC-1, while the next group Gr-2 must be delivered to MAC-2 (Fig. 4). Thus, a global communication network at the input is required. Writing, on the other hand, does not need a communication network. This way of exploiting parallelism makes use of the Singleton FFT's main advantage: each stage has the same structure. Thus, only one fixed ICU (communication network) structure is needed and the programming stays the same for each stage. However, besides the Singleton FFT's disadvantage of not having the in-place property (in contrast to the original Cooley-Tukey FFT), this approach
has one major drawback: if shuffling is done at the input, i.e. the ICU communicates input values, and decimation-in-time is employed, the input data at the first stage must reside in bit-reversed order. Thus, with this structure it cannot be chosen whether the input or the output is in bit-reversed order. This is a major drawback, since the FFT is typically employed in pairs (FFT and iFFT). Here, the FFT may have bit-reversed order at its output and the iFFT at its input, so that the data before and after the FFT-iFFT block is in normal order.
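The grouped butterfly computation of (4) can be illustrated with a small C sketch; the complex data type and the routing of whole input groups to MAC-1 and MAC-2 are illustrative assumptions, not the actual datapath implementation.

  #include <complex.h>

  typedef float complex cpx;               /* assumed complex data type           */

  /* One pair of radix-2 Singleton butterflies distributed over two MACs, cf. (4).
   * Input group Gr-1 = (x_{2k}, x_{2k+1}) is routed to MAC-1, input group
   * Gr-2 = (x_{2k+2}, x_{2k+3}) to MAC-2; w holds the twiddle group (W_k, W_{k+1}).
   * Each MAC writes one element of output group Gr-1 and one of output group Gr-2. */
  static void butterfly_pair(const cpx in1[2], const cpx in2[2], const cpx w[2],
                             cpx out1[2], cpx out2[2]) {
      /* MAC-1 (slice 1) */
      out1[0] = in1[0] + in1[1] * w[0];    /* y_k          */
      out2[0] = in1[0] - in1[1] * w[0];    /* y_{k+n/2}    */
      /* MAC-2 (slice 2) */
      out1[1] = in2[0] + in2[1] * w[1];    /* y_{k+1}      */
      out2[1] = in2[0] - in2[1] * w[1];    /* y_{k+1+n/2}  */
  }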
FFT based on a perfect shuffle network  By giving up the Singleton FFT's consistent structure property, the duplication of the Singleton equations (Section 3.4) can be performed in a different way. Instead of computing the $(i+1)$th equation, the $(i+n/4)$th equation can be computed in parallel (5):
\[
\begin{aligned}
y_k^{\text{Gr-1.1}} &= \text{MAC-1}(x_{2k}^{\text{Gr-1.1}},\, x_{2k+1}^{\text{Gr-1.2}},\, W_{k,m,l}^{\text{Gr-3.1}})\\
y_{k+n/2}^{\text{Gr-2.1}} &= \text{MAC-1}(x_{2k}^{\text{Gr-1.1}},\, x_{2k+1}^{\text{Gr-1.2}},\, W_{k,m,l}^{\text{Gr-3.1}})\\
y_{k+n/4}^{\text{Gr-1.2}} &= \text{MAC-2}(x_{2k+n/2}^{\text{Gr-2.1}},\, x_{2k+n/2+1}^{\text{Gr-2.2}},\, W_{k+n/4,m,l}^{\text{Gr-3.2}})\\
y_{k+3n/4}^{\text{Gr-2.2}} &= \text{MAC-2}(x_{2k+n/2}^{\text{Gr-2.1}},\, x_{2k+n/2+1}^{\text{Gr-2.2}},\, W_{k+n/4,m,l}^{\text{Gr-3.2}})
\end{aligned}
\qquad (5)
\]
This structure needs a shuffle at the input similar to the Singleton case (4) but creates different groups. These groups consist of the elements $y_k^{\text{Gr-1.1}}$ and $y_{k+n/4}^{\text{Gr-1.2}}$, or $y_{k+n/2}^{\text{Gr-2.1}}$ and $y_{k+3n/4}^{\text{Gr-2.2}}$, respectively. By duplicating the equations in the next stage differently, using the equation $n/8$ positions ahead, a group Gr-4 with elements $x_{2k}^{\text{Gr-4.1}}$ and $x_{2k+n/4}^{\text{Gr-4.2}}$ is required at the input. This group Gr-4 contains the same elements in the same order as the stored group Gr-1.
\[
\begin{aligned}
y_k^{\text{Gr-4.1}} &= \text{MAC-1}(x_{2k}^{\text{Gr-4.1}},\, x_{2k+1}^{\text{Gr-5.1}},\, W_{k,m,l}^{\text{Gr-6.1}})\\
y_{k+n/2}^{\text{Gr-5.1}} &= \text{MAC-1}(x_{2k}^{\text{Gr-4.1}},\, x_{2k+1}^{\text{Gr-5.1}},\, W_{k,m,l}^{\text{Gr-6.1}})\\
y_{k+n/8}^{\text{Gr-4.2}} &= \text{MAC-2}(x_{2k+n/4}^{\text{Gr-4.2}},\, x_{2k+n/4+1}^{\text{Gr-5.2}},\, W_{k+n/8,m,l}^{\text{Gr-6.2}})\\
y_{k+5n/8}^{\text{Gr-5.2}} &= \text{MAC-2}(x_{2k+n/4}^{\text{Gr-4.2}},\, x_{2k+n/4+1}^{\text{Gr-5.2}},\, W_{k+n/8,m,l}^{\text{Gr-6.2}})
\end{aligned}
\]
For this stage no communication at the input is required. Of course, this type of duplication by $n/\text{split}$ can only be performed if $n > \text{split}$. Thus, this method is similar to using the Cooley-Tukey FFT and partitioning the flowgraph as shown in Fig. 3b.
3.5 Scheme for Minimizing Inter Communication
The main advantage of this approach is the possibility to minimize the ICU's complexity. For instance, if eight butterflies are computed in parallel for a 64 complex-point FFT, the ICU is employed only 50 % of the time. Thus, a reduction of the ICU's complexity, namely of the number of global busses, may reduce the computation performance; however, only half of the FFT's computation time is affected. This effect becomes even more pronounced if the overall application is taken into account. If, for instance, a target application consists of only 30 % of algorithms with Shuffle Data Transfer, while the remaining algorithms use Vector or Sliding Window Data Transfer, the ICU can be halved while losing only 15 % of the overall performance needed to compute the application. Thus, algorithms of the Shuffle Data Transfer class fall, depending on their size, in part into the Vector Data Transfer class as well.
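As a rough, purely illustrative estimate based on the figures quoted above: if a fraction $p$ of the application's cycles is spent in algorithms that need the ICU, and only a fraction $q$ of those cycles actually occupies the ICU busses, then halving the busses (roughly doubling the ICU-bound cycles) increases the overall cycle count by about
\[
\Delta t \approx p \cdot q, \qquad p = 0.3,\; q = 0.5 \;\Rightarrow\; \Delta t \approx 15\,\%.
\]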
Fig. 5. (a) Absolute cycle counts (16 MAC with 8, 4, or 2 ICU busses; 8 MAC with 4 busses) and (b) additional cycle requirement in % relative to the 16 MAC/8-bus configuration, both plotted over the FFT size in complex points.
4 Application and Results

To show the applicability of the minimization scheme of Section 3.5, this section gives some results for variations of our sliced architecture. Since both Vector and Sliding Window Data Transfer are already independent of the number of parallel slices, we investigated variations of the FFT as an example of a shuffle algorithm. As a basis we employ a DSP architecture with 16 parallel slices, each consisting of one MAC unit, four input registers, and one accumulator. This is the structure of the first derivative of this processor architecture, called the Mobile Multimedia Modem ($M^3$) DSP [Fettweis et al., 98]. This DSP must be able to provide up to 3000 MMAC/s @ 100 MHz. It allows the complete software implementation of a new OFDM-based 25 Mbit/s wireless ATM modem. For implementing the FFT we chose a radix-2 implementation to allow easier pipelining. This pipelined implementation computes one radix-2 butterfly in 3 cycles, with 3 overlapped read and 2 write cycles. With this hardware we computed different-sized FFTs while varying the number of busses in the ICU, as depicted in Fig. 5a. Furthermore, we compared the 16-MAC implementation with eight busses to the eight-MAC implementation with four busses. For shorter FFTs, the latter has a performance similar to the 16-MAC solution with only two busses in the ICU. In Fig. 5b these limited solutions are directly compared against the full 16 MAC/8-bus solution. Here, the 16 MAC/4-bus solution requires between 15 and 25 percent more cycles, depending on the FFT size.
However, in contrast to an eight-MAC solution, the 16 MAC/4-bus solution is far more advantageous, especially for bigger FFTs. Thus, even for algorithms with Shuffle Data Transfer our DSP architecture is scalable without changing the hardware, i.e. the ICU. With these properties, our $M^3$-DSP implementation outperforms current DSP solutions such as TI's C6x and high-end media processors such as the HiPAR-DSP, and achieves performance similar to customized high-end ASICs such as the Butterfly DSP (Table 2).

Table 2. Benchmarks for the M^3-DSP vs. TI's C6x, HiPAR, and Butterfly DSP

                               M^3-DSP        TI C6x          HiPAR       Butterfly DSP
                               @ 100 MHz      @ 200 MHz       @ 100 MHz   @ 50 MHz
  1024 complex-point FFT       2200 cycles /  20815 cycles /
  (radix-2)                    22 µs          104 µs          42 µs       54 µs
  complex FIR, 32 coeff.,      1204 cycles /  6410 cycles /   N.A.        N.A.
  100 samples                  12 µs          32 µs
  BCH code (216,124,25)        244 cycles /   N.A.            N.A.        N.A.
                               2.4 µs
5 Conclusions

In this paper we presented a new scalable DSP architecture for mobile communication applications. By employing a group memory and group registers, the DSP architecture can be split into slices. Thus, the number of slices can be adapted according to system needs, which allows this DSP architecture to suit different system requirements. This adaptability cannot be obtained with conventional DSP architectures. Furthermore, we showed that this architecture efficiently supports major algorithm classes, such as vector, sliding window, or shuffle algorithms. These classes cover a wide range of DSP algorithms such as filters, transformations, Viterbi algorithms, or matrix operations. Finally, we presented benchmarks for a derivative of this architecture containing 16 slices, the Mobile Multimedia Modem DSP, which shows significant performance improvements over current DSP solutions while avoiding complex hardware requirements. In our future work we will develop an integrated design environment which supports DSP design at the system level. This should enable system designers to explore the DSP's design space in an early phase of system specification.
6 Acknowledgement

We would like to thank our colleagues for their support, in particular Volker Aue, Wolfram Drescher, Frank Engel, Shiro Kobayashi, Thomas Richter, Attila Roemer, Paul Schwann, and Ulrich Walther.
References

Bi, G. and Jones, E. V. (89). A pipelined FFT processor for word-sequential data. IEEE Trans. on Acoustics, Speech and Signal Processing, 37(12):1981-1985.
Drescher, W., Mennenga, M., and Fettweis, G. (98). An architectural study of a digital signal processor for block codes. In Proc. of ICASSP '98, volume 5, pages 3129-3133, Seattle, WA, USA.
Faraboschi, P., Desoli, G., and Fisher, J. A. (98). The latest word in digital and media processing. IEEE Signal Processing Magazine.
Fettweis, G. (97). Design methodology for digital signal processing. In Proc. of ASAP '97, Zurich, Switzerland.
Fettweis, G. and Bitterlich, S. (91). Optimizing computation and communication in shuffle-exchange processors. In Proc. of IEEE Midwest Symposium on Circuits and Systems.
Fettweis, G., Weiss, M., Drescher, W., Walther, U., Engel, F., and Kobayashi, S. (98). Breaking new grounds over 3000 MOPS: A broadband mobile multimedia modem DSP. In Proc. of ICSPAT '98, Toronto, Canada.
Gold, B. and Bially, T. (73). Parallelism in Fast Fourier Transform hardware. IEEE Trans. on Audio and Electroacoustics, AU-21(1):5-16.
Gupta, A. and Kumar, V. (93). The scalability of FFTs on parallel computers. IEEE Trans. on Parallel and Distributed Systems, 4(8):922-931.
Hansen, C. (96). MicroUnity's MediaProcessor architecture. IEEE Micro, 16(4):34-40.
Ju, C.-J. (97). FFT-based parallel systems for array processing with low latency. In Proc. of ICSPAT '97, pages 1064-1070.
Kim, K., Karri, R., and Potkonjak, M. (97). Synthesis of application specific programmable processors. In Proc. of DAC '97, pages 353-358.
Kobayashi, S. and Fettweis, G. P. (98). A block-floating-point system for multiple datapath DSP. In Proc. of SiPS '98, Boston, MA, USA.
Singleton, R. C. (67). A method for computing the Fast Fourier Transform with auxiliary memory and limited high-speed storage. IEEE Trans. on Audio and Electroacoustics, AU-15(22):91-98.
Swarztrauber, P. N. (1982). Vectorizing the FFTs. In Parallel Computations, pages 51-83. Academic Press.
Trenas, M. A., Lopez, J., and Zapata, E. L. (98). A memory system supporting the efficient SIMD computation of the two-dimensional DWT. In Proc. of ICASSP '98.
Weiss, M. H. and Fettweis, G. P. (96). Dynamic codewidth reduction for VLIW instruction set architectures in digital signal processors. In 3rd Int. Workshop in Signal and Image Processing (IWSIP '96), pages 517-520. IEEE.
Weiss, M. H., Walther, U., and Fettweis, G. P. (97). A structural approach for designing performance enhanced DSPs: 1-MIPS GSM fullrate vocoder case study. In Proc. of ICASSP '97, volume 5, pages 4085-4088. IEEE.
Wittenburg, J. P. et al. (97). HiPAR-DSP: A parallel VLIW RISC processor for real-time image processing applications. In Proc. of ICA3PP '97, pages 155-162.