DESIGNING PERFORMANCE ENHANCED DIGITAL SIGNAL PROCESSORS USING LOOP TRANSFORMATIONS

Matthias H. Weiss, Dirk Fimmel, Renate Merker, and Gerhard P. Fettweis

Department of Electrical Engineering, Dresden University of Technology, 01062 Dresden, Germany

fweissm,[email protected], ffimmel,[email protected]

ABSTRACT

Digital Signal Processors (DSPs) are employed in all fields of multimedia applications such as speech, audio, modem, video, or graphics. Often, they are embedded into multimedia systems as a System on Chip (SoC) component. Although DSPs could still be modified in the SoC domain, they are mainly employed as fixed DSP cores. Possible adaptations to the embedding system are not carried out. Thus, if a multimedia system requires more hardware performance, either additional DSP cores or dedicated ASICs have to be added, which limits flexibility. The main reason is that today's DSP cores are not designed for extensions. Our work therefore aims to design a family of scalable DSP architectures. One class of this DSP family comprises an extendable architecture with standard DSP components such as one MAC unit and one dual port memory. In this paper we present methods to extend this DSP core automatically without changing the core itself. With these methods, the DSP architecture can be reconfigured and, thus, easily adapted to different system requirements.

1. INTRODUCTION

Several multimedia systems employ Digital Signal Processors (DSPs) to carry out software driven digital signal processing tasks. Examples are sound or video cards, modems, or speech processing units [1]. Often, DSPs are added as a component to a System On Chip (SoC). Although DSPs could still be modified in this SoC domain, they are mainly employed as fixed DSP cores. Thus, if the DSP needs to provide more performance but cannot be extended, tasks have to be supported by an Application Specific Integrated Circuit (ASIC) [2], although the DSP's flexibility would be needed. Hence, a reconfigurable DSP architecture is required that can be adapted to system needs.

One possible way is to parameterize the overall DSP architecture [3]. However, here the hardware has to be fully synthesized, which increases both chip size and system cost. Furthermore, the Instruction Set Architecture (ISA) has to be retained to allow the reuse of software tools and programs. However, even standard DSPs such as TI's C60 (vs. C54x) or Lucent's DSP16K (vs. DSP1600) differ in their ISA when performance enhanced architectures are provided. Thus, a DSP architecture is required that can be tailored to different system needs without changing the complete hardware and ISA.

Our work aims to design a family of scalable DSP architectures. One class of this DSP family is an architecture comprising only standard components such as a dual port memory, an Arithmetic Logic Unit (ALU), and a Multiply Accumulate (MAC) datapath. This DSP core, as depicted in Fig. 1, can be extended by further datapaths and registers depending on system needs. By using a scalable and flexible VLIW based ISA [4], a common software interface can be provided. To ensure software reuse, algorithms have to be compiled for tailored DSPs with different hardware extensions. To provide this, we applied the method of orthogonalization: both algorithms and architecture are separated into a data transfer part and a data manipulation part [5]. On the algorithm level, data transfer can be handled by using loop transformations, which is the focus of this paper.

This work has been sponsored in part by the Deutsche Forschungsgemeinschaft (DFG) within the Sonderforschungsbereich (SFB) 358.

Fig. 1: Extendable DSP core (block labels: extendable register file and bus system, memory, ALU, MAC, accumulators; regions: DSP core, extensions)

Fig. 2: (a) 4 port vs. (b) dual port memory architecture (block labels: memory, address inputs Addr.1 to Addr.4, MAC, accumulators, delay)

Fig. 3: Example of a Reduced Dependency Graph (RDG) (node labels: S0, S1, S2, S3; edge labels: (1 1), (0 2), (0 1), (1 0))

In Section 2, procedures are presented to support the DSP design from system level down to a tailored DSP architecture. Taking a speech compression application as an example, we apply these methods in Section 3 to derive a 4 MAC DSP architecture.

2. SYSTEM EXPLORATION AND DSP DESIGN

To define an appropriate DSP architecture for a given application, several procedures must be applied to analyze system requirements (Section 2.1), extract major algorithms from a common representation (Section 2.2), and define suitable hardware extensions and exploit parallelism (Section 2.3). Finally, memory accesses between succeeding algorithms may be reduced (Section 2.4).

2.1. First Step: System Exploration

In a first step, the system requirements must be analyzed. To meet these requirements, major target algorithms may be supported in hardware. This can be achieved by supplying dedicated ASICs or by extending the DSP core. Since the latter choice is software accessible, the same hardware extension can be employed by different algorithms and thus used in a general manner. This common usage must be taken into account when defining common hardware extensions. Hence, in this step the major algorithms and the data manipulation they require are extracted and, depending on the system requirements, the degree of parallelism is defined.

2.2. Second Step: Common Representation

To define common hardware extensions, the extracted target algorithms must be described in a common way. Therefore, we split the algorithm description into a data transfer part, which represents the required I/O behaviour, and a data manipulation part, which represents the supported arithmetic. This approach allows us to describe even algorithms that require Galois field arithmetic [6]. However, in this paper we focus on data transfer.
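The split into a data transfer part and a data manipulation part can be illustrated with a minimal Python sketch. It is an assumption-laden illustration rather than the representation used in this work: the helper names `transfer_pattern` and `accumulate` are hypothetical, the sum of products $y_k = \sum_{i=1}^{N} a_i x_{k-i}$ serves as the example algorithm, and GF(2) (AND as multiplication, XOR as addition) stands in for Galois field arithmetic in general.

```python
from typing import Callable, Iterator, Sequence, Tuple


def transfer_pattern(n_taps: int, k: int) -> Iterator[Tuple[int, int]]:
    """Data transfer part (hypothetical helper): which operands meet in each step.

    Yields (coefficient position, sample position) for
    y_k = sum_{i=1}^{N} a_i * x_{k-i}, with a_i stored at a[i - 1].
    """
    for i in range(1, n_taps + 1):
        yield i - 1, k - i


def accumulate(
    pairs: Iterator[Tuple[int, int]],
    a: Sequence[int],
    x: Sequence[int],
    mul: Callable[[int, int], int],
    add: Callable[[int, int], int],
    zero: int,
) -> int:
    """Data manipulation part: the arithmetic is passed in as mul/add."""
    acc = zero
    for pos_a, pos_x in pairs:
        acc = add(acc, mul(a[pos_a], x[pos_x]))
    return acc


# Same transfer pattern, ordinary integer arithmetic.
y_int = accumulate(transfer_pattern(3, 3), [1, 2, 3], [4, 5, 6, 7],
                   lambda p, q: p * q, lambda p, q: p + q, 0)

# Same transfer pattern, GF(2) arithmetic (assumed stand-in for Galois fields).
y_gf2 = accumulate(transfer_pattern(3, 3), [1, 0, 1], [1, 1, 0, 1],
                   lambda p, q: p & q, lambda p, q: p ^ q, 0)
```

Only the `mul`/`add` arguments change between the two calls; the index pattern, and hence the memory traffic, stays the same.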
To demonstrate the use of this common representation, we use the FIR filter

$$y_k = \sum_{i=1}^{N} a_i \, x_{k-i}$$

as an example. Different methods to exploit parallelism can be applied, each having a different impact on the DSP core. To employ several datapaths, the algorithm must be distributed, e.g. over 2 (or more) Multiply Accumulate units (MACs). First, we try to split the summation into 2 halves, each carried out by 1 MAC:

$$y_k = y_k^{(1)} + y_k^{(2)} = \sum_{i=1}^{N/2} a_i \, x_{k-i} + \sum_{i=N/2+1}^{N} a_i \, x_{k-i}.$$
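The split summation can be sketched in Python to make the memory traffic visible. This is an illustration under stated assumptions, not the paper's implementation: the function names are hypothetical, $N$ is assumed even, coefficient $a_i$ is stored at `a[i - 1]`, and every operand fetch is counted as one memory access.

```python
from typing import List, Sequence, Tuple


def fir_direct(a: Sequence[int], x: Sequence[int], k: int) -> int:
    """Reference: y_k = sum_{i=1}^{N} a_i * x_{k-i}, with a_i stored at a[i - 1]."""
    n = len(a)
    return sum(a[i - 1] * x[k - i] for i in range(1, n + 1))


def fir_split(a: Sequence[int], x: Sequence[int], k: int) -> Tuple[int, List[int]]:
    """Two accumulators, each summing one half of the taps (N assumed even).

    Returns the filter output and the number of memory reads in each step.
    """
    n = len(a)
    half = n // 2
    acc1 = acc2 = 0
    reads_per_step = []
    for j in range(1, half + 1):
        # MAC 1 handles tap j, MAC 2 handles tap j + N/2 in the same step.
        a1, x1 = a[j - 1], x[k - j]                    # 2 reads for MAC 1
        a2, x2 = a[j + half - 1], x[k - j - half]      # 2 more reads for MAC 2
        acc1 += a1 * x1
        acc2 += a2 * x2
        reads_per_step.append(4)
    return acc1 + acc2, reads_per_step
```

In this sketch the two MACs never share an operand, so each step fetches four distinct values.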

However, to keep both MACs busy, at least 4 memory accesses are necessary (Fig. 2a). To overcome this problem, 2 filter outputs can be computed at once:

$$y_k = \sum_{i=1}^{N} a_i \, x_{k-i}, \qquad y_{k+1} = \sum_{i=1}^{N} a_i \, x_{(k+1)-i}.$$

Here, coefficient $a_i$ can be shared. By inserting a delay register into the architecture, 2 MACs can be kept busy with only 2 memory accesses per iteration (Fig. 2b). These equations can also be expressed as a loop program combining both:

for k = 0 to K step 2
  for i = 1 to N
    if (i >= 1) y(k, i) = y(k, i-1) + a(i) x(k-i)
    if (i
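The shared-coefficient scheme with a delay register can be sketched in Python as follows. This is a minimal illustration, not the loop program of the paper: the function name `fir_two_outputs` is hypothetical, coefficient $a_i$ is stored at `a[i - 1]`, and the delay register is assumed to be preloaded with $x_k$ before the loop starts (this preload is not counted among the per-iteration accesses).

```python
from typing import Sequence, Tuple


def fir_two_outputs(a: Sequence[int], x: Sequence[int], k: int) -> Tuple[int, int, int]:
    """Compute y_k and y_{k+1} together, sharing a_i and reusing samples.

    Returns (y_k, y_{k+1}, memory reads inside the loop).
    """
    n = len(a)
    acc_k = 0      # accumulator of MAC 1, builds y_k
    acc_k1 = 0     # accumulator of MAC 2, builds y_{k+1}
    delay = x[k]   # assumed preload: x_{(k+1)-1} = x_k
    reads = 0
    for i in range(1, n + 1):
        coeff = a[i - 1]     # read 1: a_i, used by both MACs
        sample = x[k - i]    # read 2: x_{k-i}
        reads += 2
        acc_k += coeff * sample    # MAC 1: a_i * x_{k-i}
        acc_k1 += coeff * delay    # MAC 2: a_i * x_{(k+1)-i}, taken from the delay register
        delay = sample             # x_{k-i} is what MAC 2 needs in step i + 1
    return acc_k, acc_k1, reads
```

Because $x_{(k+1)-i} = x_{k-(i-1)}$, the sample read for MAC 1 in one step is exactly the sample MAC 2 needs in the next, so each iteration fetches only one coefficient and one new sample.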
