DSP Architectures for Next-Generation Wireless Communications

Ingrid Verbauwhede, Department of Electrical Engineering, University of California Los Angeles

Chris Nicol, Bell Laboratories Australia, Lucent Technologies

ISSCC 2000, DSP Tutorial. Ingrid Verbauwhede, Chris Nicol

Mobile Wireless Trends

[Figure: subscribers (000), 0 to 1,600,000, versus year, 1995 to 2010, for Global Wireline and Global Wireless.]
• Wireline: CAGR 5%, global penetration (2010) 20%
• Wireless (Cellular + PCS + WLAS + Other): CAGR 21%, global penetration (2010) 21%
• Global population: 7 billion, CAGR 1995-2010 1.4%

World-wide deployment of mobile communications is exceeding expectations.

DSP Evolution and Markets

DSP Market: $2B market, 30% growth rate (source: Forward Concepts 1996).
• Segments shown: Disk; Modem (V.34, V.90, xDSL); Wireless (Cellular Infrastructure, Mobile Handsets, Cordless, GPS); Consumer & Automotive; Other
• Segment values shown: $1.01B, $727 M, $270 M

[Figure: power (mW/MIP), logarithmic scale from 10 to 10K, for M68000 ($200), 80286 ($200), 80386 ($300), Pentium ($300), Pentium (MMX) ($700), DSP-1 ($150), DSP-32C ($250), DSP16A ($15), DSP1600.]

Lode Core Architecture

Domain specific instruction set
• Basic instruction set for general purpose DSP, e.g. MAC, min, max, etc.
• Extra instructions for performance with every new generation, e.g. “square distance and accumulate”:

$$D = \sum_{i=0}^{N-1} \| x(i) - y(i) \|^2$$

One 32 bit instruction: a3 = abs (*r0 - *r1 < asr), a0 = a0 + sqr(a3), r0++, r1++;

Bus network and instruction set design go together.
CISC, thus compiler unfriendly.
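As a rough illustration of what the “square distance and accumulate” instruction computes, the sketch below spells out the sum D one element at a time. The function name `square_distance` and the plain-list inputs are assumptions made for illustration, and the shift operand of the instruction is left out.

```python
def square_distance(x, y):
    """Accumulate D = sum_i (x[i] - y[i])**2, one element per loop pass."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    acc = 0                      # plays the role of accumulator a0
    for xi, yi in zip(x, y):     # r0++, r1++: both pointers advance together
        diff = abs(xi - yi)      # a3 = abs(*r0 - *r1)
        acc += diff * diff       # a0 = a0 + sqr(a3)
    return acc
```

Each pass of the Python loop corresponds to one issue of the single 32 bit instruction, which is why such an instruction pays off in tight distance-search loops.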

Control & Pipeline for DSPs

RISC: load/store machine, memory access with load/store instructions (DLX, MIPS, D10V)
Pipeline: Fetch -> Decode -> Execute (execution / address generation) -> Memory Access (memory access / branch) -> Write Back
Excellent for complex decision making!

DSP: register-memory architecture (TI, Lucent, HX, Lode)
Pipeline: Fetch -> Decode -> Memory Access -> Execute -> Write Back
Excellent for number crunching!

Pipeline RISC compared to DSP

Example:
    r0 = *p0;        // load data
    a0 = a0 + r0;    // execute

RISC example (too expensive for DSP):
    instr 1:  Fetch  Decode  Execute  MemAccess
    instr 2:         Fetch   Decode   Execute    MemAccess
    instr 3:                 Fetch    Decode     Execute    MemAccess

DSP, for memory intensive applications:
    instr 1:  Fetch  Decode  MemAccess  Execute
    instr 2:         Fetch   Decode     MemAccess  Execute
    instr 3:                 Fetch      Decode     MemAccess  Execute
    instr 4:                            Fetch      Decode     MemAccess  Execute

Penalty: data dependent branch is expensive

Other control features

Hardware looping:
• Because software branch is expensive
• “Zero overhead hardware loops” (for tight FIR loops), hardware supported

Interrupts: hardware with shadow registers for extremely fast context switching.

Special instruction cache:
• Single instruction “repeat” buffer
• Multiple instruction cache: under programmer's control!
• E.g. Lucent DSP16210: 31 x 32 instruction cache

Predictable worst case execution time!

Low Power DSPs

                  DSP 1600 Core                            C54x 1V DSP
Source            Lucent - 1609 low cost consumer 16-bit   Texas Instruments - ISSCC 1997
Process           0.35µ 3LM CMOS                           0.25µ 3LM CMOS, dual Vt process
Performance       80 M 16b MAC/s at 3.3V                   65 M 16b MAC/s at 1.0V
Power             1.4 mW/MHz at 3.3V                       0.21 mW/MHz at 1.0V
Stand-by power    30 µW                                    4.0 mW

BUT: DSP Software Development
• Complex DSP architecture not amenable to compiler technology
• Algorithms are modeled in high level language (e.g. C++)
• Solutions are implemented and debugged in hand-optimized assembler - large development effort with minimal tool support

Flow: HLL algorithmic model -> hand coded assembler prototype code -> optimize & debug -> production code

Long, frustrating time to market. Fragile legacy code.

Part II

Still used in handhelds, but change in basestations,


Mobile Wireless Evolution

First Generation (Past)
• Service: Mobile Telephone Service: Carphone
• Technology: Analog Cellular Technology; Macrocellular Systems
• Standards: NMT, TACS, Analog AMPS

Second Generation (Now)
• Service: Digital Voice + Messaging/Data Services; Fixed Wireless Loop
• Technology: Digital Cellular Technology + IN emergence; Microcellular & Picocellular: capacity, quality; Enhanced Cordless Technology
• Standards: GSM, IS-54/136 TDMA, IS-95/cdmaOne, PDC, DECT

Third Generation (Year 2000-2005)
• Service: Integrated High Quality Audio and Data; Narrowband and Broadband Multimedia Services + IN integration
• Technology: Broader Bandwidth Efficient Radio Transmission; Information Compression; Higher Frequency Spectrum Utilization; IN + Network Management integration
• Standards: WCDMA, UWC-136 TDMA, cdma2000

Fourth Generation (Year 2010?)
• Service: TelePresencing; Education, training and dynamic information access
• Technology: Wireless-Wireline and Broadband Transparency; Knowledge-Based Network Operations; Unified Service Network

Global roaming

We are entering the decade of wireless data communications - and World-War 3G

Mobile Data Services
• Carriers invest >$500 per subscriber but subscriber voice calls (and therefore revenues) are reducing.
• Data currently 3% of wireless traffic - projected to >50% by 2005
• Wireless Internet: average internet connection 30 mins
• Text Messaging: saturating 2G voice networks

2.5 Generation Mobile Standards [1]
• GPRS: packet data over GSM - timeslot multiplexing, multi-slots per user.
• EDGE: 8-PSK modulation + GPRS, 384 kbps max to 1 user.

3G - IMT2000 Proposals
• 144 kbps automobile, 384 kbps pedestrian, 2 Mbps stationary.
• Several proposals: UWC-136 (200 kHz, TDMA, 8-PSK = EDGE). UMTS and CDMA-2000 are both CDMA proposals.

Evolution of Mobile Wireless Network Architecture

[Figure: 2G network versus IP-based 3G network.
2G network: PSTN, MSC (mobile switches), BSC, base stations.
IP-based 3G network: Internet / advanced services and PSTN; network servers, comprising packet mode servers (high speed data, multimedia, voice over IP, etc.), wireless control servers (feature control, network management, billing, etc.) and circuit mode servers (voice, low speed data, etc.); packet connectivity (ATM / IP); radio clients.]

Mobile networks are being upgraded in preparation for the delivery of high speed data services.

Mobile Wireless Infrastructure

[Photographs: Macro-cell GSM basestation (6-12 TRX); Micro-cell GSM basestation (2 TRX).]


2G Basestation Baseband Processing
• Multiple DSPs used for baseband processing.
• RISC microcontroller for timing, framing, I/O control
• Software upgradable over the network
• DSPs dominate cost and power consumption

[Figure: Tx/Rx baseband processing board for 2-carrier GSM basestation. Each carrier has an AFE (Tx, Rx), five DSPs and RAM for channel equalization, channel de/coding and encryption; the board also carries a RISC microcontroller, an ASIC, I/O and a T1/E1 interface.]

Future trend - integrate baseband processing: low cost Pico BTS

3G Basestation Baseband Processing

Increased DSP performance needed in next-generation basestation
• Increased receiver algorithm sensitivity
• Antenna arrays - smart antennas
• Multi-standard basestations using software radio architecture
• 3G - constraint length 9, rate 1/2 convolutional coding for voice.
• 3G - constraint length 4, Turbo codes for data

High performance DSPs + custom logic needed for 3G (Viterbi decoding and Turbo decoding)

[Figure: 3G receiver blocks and where each is implemented.
• Code generator: channelisation code, scrambling code (ASIC)
• Synchronisation: cell search, slot syn., frame syn. (DSP)
• Sliding correlator: despreading (ASIC)
• RAKE combiner: reassemble multipath (DSP, ASIC)
• Code tracking: delay-lock-loop (ASIC, DSP)
• SIR measurement and power control: fast power control (DSP)
• Channel estimation (DSP)
• Path search (ASIC)
• Deinterleaver (DSP)
• Decoder: Viterbi algorithm, Turbo decoding (DSP, ASIC)
Courtesy: Bing Xu, Bell Labs Australia]


Receiver Algorithms for GSM Basestation
• Enhanced receiver sensitivity
• Larger cells in suburban areas = reduced network cost
• Mobile transmits with less power = increased battery life

Existing receiver: Estimating Wireless Channel -> Equalizing Multi-path Effects -> Channel Decoding -> Speech Decoding

New iterative receiver: the same four stages (Estimating Wireless Channel, Equalizing Multi-path Effects, Channel Decoding, Speech Decoding), with Speech Statistics as an added block. 1.3 dB improvement.

Challenge - requires 6x DSP MIPS of existing receiver in basestation

Courtesy: Magnus Sandell, Bell Labs UK

Smart Antennas
• A multiple antenna element system
• Combined with a base station architecture and signal processing techniques designed to dynamically select or form the “optimum” beam pattern per user

[Figure: Omnidirectional Cell Site; Three Sector Cell Site; Intelligent Antenna Cell Site]

Increased cost in RF electronics and enhanced DSP requirements.


Fixed Multi-Beam Versus Adaptive Beam

Fixed Multi-Beam: select from, or use, multiple “fixed” antenna beams to optimize performance.

Adaptive Beam: adaptively “weight” and combine multiple antenna elements to optimize performance.

[Figure: for each scheme, Mobile 1 and Mobile 2 reach the antenna by direct and reflected rays in the presence of an interferer.]
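To make “weight and combine multiple antenna elements” concrete, here is a minimal sketch for a uniform linear array. Everything in it is an assumption for illustration: the half-wavelength element spacing, the function names, and in particular the weights, which are simple conjugate (delay-and-sum) weights rather than the adaptive weight computation that the slide leaves unspecified.

```python
import cmath
import math

def steering_vector(n_elements, angle_deg, spacing_wavelengths=0.5):
    """Phase seen at each element of a uniform linear array for a plane wave."""
    phase_step = (2 * math.pi * spacing_wavelengths
                  * math.sin(math.radians(angle_deg)))
    return [cmath.exp(1j * phase_step * n) for n in range(n_elements)]

def beam_weights(n_elements, look_deg, spacing_wavelengths=0.5):
    """Conjugate (delay-and-sum) weights that point the beam at look_deg."""
    sv = steering_vector(n_elements, look_deg, spacing_wavelengths)
    return [v.conjugate() / n_elements for v in sv]

def array_gain(weights, arrival_deg, spacing_wavelengths=0.5):
    """Magnitude of the weighted sum for a unit wave arriving from arrival_deg."""
    sv = steering_vector(len(weights), arrival_deg, spacing_wavelengths)
    return abs(sum(w * v for w, v in zip(weights, sv)))
```

The weighted sum is one complex multiply-accumulate per element per sample, which is where the enhanced DSP requirement of smart antennas comes from; an adaptive scheme would additionally recompute the weights as the mobiles and interferers move.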


Digital Radio Trends - Software Radio

[Figure: multi-standard basestation signal chain: Antennas -> RF/Analog Processing (AMP) -> A/D -> Digital Processing -> Network Interface -> Network]
• Antennas: linear amplification, combining
• RF/Analog processing (RF/IF): amplifiers, mixers, filters, ...; higher dynamic range; smaller
• Digital processing: DSPs - higher speed, more powerful; filtering, demodulation, rake receiver, channel coding, diversity, modulation, equalization, correlator, encryption, ...


Wideband Receiver Architecture

[Figure: channels CH1, CH2, CH3, ..., CHM shown against frequency at fRF, fIF and fBB. Signal chain: RF-IF & Filter -> High Speed A/D -> Digital Channeliser -> Baseband Processing for each of CH1 ... CHM.]

Increased DSP performance needed for Software Radio


Turbo Codes

For 3G Wireless (UMTS and CDMA2000)
• Voice service: BER requirement 10^-3
• Data service: BER requirement 10^-5
• Parallel concatenation of convolutional codes is used to give the codes structure so they can be decoded
• Pseudorandom interleaving is used to give the codes performance which approaches that for random coding
• Resulting encoder structure: two Recursive Systematic Convolutional (RSC) codes

[Figure: the input goes to the systematic output, to Encoder #1, and through the interleaver to Encoder #2; a MUX combines the two encoder outputs into the parity output.]
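The encoder structure can be sketched in a few lines. This is an illustration under stated assumptions, not the encoder of any particular standard as given on the slide: the generator polynomials (feedback 1 + D^2 + D^3, feedforward 1 + D + D^3) are assumed, chosen to match the constraint length 4 quoted earlier in the tutorial for 3G data, and the names `rsc_encode`, `turbo_encode` and the caller-supplied `interleaver` permutation are hypothetical. Trellis termination and puncturing are left out.

```python
def rsc_encode(bits):
    """Parity stream of a recursive systematic convolutional (RSC) encoder.

    Assumed polynomials (not given on the slide): feedback 1 + D^2 + D^3,
    feedforward 1 + D + D^3, i.e. constraint length 4.
    """
    s1 = s2 = s3 = 0                 # shift register, s1 holds the newest bit
    parity = []
    for u in bits:
        fb = u ^ s2 ^ s3             # recursive feedback: 1 + D^2 + D^3
        parity.append(fb ^ s1 ^ s3)  # feedforward: 1 + D + D^3
        s1, s2, s3 = fb, s1, s2
    return parity

def turbo_encode(bits, interleaver):
    """Parallel concatenation: systematic bits plus two parity streams."""
    if sorted(interleaver) != list(range(len(bits))):
        raise ValueError("interleaver must be a permutation of the bit indices")
    parity1 = rsc_encode(bits)                            # Encoder #1
    parity2 = rsc_encode([bits[i] for i in interleaver])  # Encoder #2
    return list(bits), parity1, parity2
```

A single 1 at the input keeps producing parity bits long after it has passed, because of the feedback; the interleaver makes it unlikely that both encoders see a low-weight pattern at the same time.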


Turbo Decoding
• Key idea: iterative decoding (up to 10 iterations for 3G)
• There is one decoder for each elementary encoder.
• Each decoder estimates the a-posteriori probability (APP) of each data bit.
• The APPs are used as a priori information by the other decoder.

[Figure: blocks DeMUX, Decoder #1, Decoder #2, two interleavers and a deinterleaver; signals systematic data, parity data, APP (exchanged between the decoders) and hard bit decisions.]

Soft-Output Decoding Algorithms

Requirements for Turbo:
– Accept soft inputs in the form of a priori probabilities
– Produce a posteriori probability (APP) estimates of the data
– “Soft-Input Soft-Output”

Trellis-based estimation algorithms:
• Sequence estimation: Viterbi Algorithm -> SOVA -> Improved SOVA
• Symbol-by-symbol estimation: MAP Algorithm -> max-log-MAP -> log-MAP

Today's high-performance DSPs are highly MAC-focussed (for filtering in modem applications). Some DSPs provide hardware support for efficient implementation of Viterbi; none support SOVA or log-MAP.

Iterative channel estimation also uses Soft-Input Soft-Output decoders.

SOVA and log-MAP use modified Add-Compare-Select operations: they not only select the maximum path metric but also need to keep the difference.
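The modified Add-Compare-Select step can be sketched as follows. The function name, the argument layout and the convention that larger metrics are better are assumptions for illustration.

```python
def acs_with_difference(metric_a, branch_a, metric_b, branch_b):
    """Add-compare-select that also returns the metric difference.

    Returns (survivor_metric, decision, difference), where decision is 0 when
    path A wins and 1 when path B wins. Larger metrics are assumed better.
    """
    cand_a = metric_a + branch_a           # add
    cand_b = metric_b + branch_b
    difference = cand_a - cand_b           # compare, keeping the signed result
    if difference >= 0:                    # select
        return cand_a, 0, difference
    return cand_b, 1, difference
```

A plain Viterbi decoder throws `difference` away after the comparison. SOVA uses it as a reliability value for the decision, and log-MAP uses its absolute value to look up a correction term, so hardware that only exposes the selected maximum forces the difference to be recomputed.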

The Maximum A Posteriori (MAP) Algorithm

Log-likelihood ratio:

$$L(d) = \ln\frac{\Pr[d=1]}{\Pr[d=0]}$$

$$L(d \mid y) = \ln\frac{\Pr(d=1 \mid y)}{\Pr(d=0 \mid y)} = \ln\frac{p(y \mid d=1)}{p(y \mid d=0)} + L(d)$$

• $L(d)$ is the a priori value, from Pr[d=1] and Pr[d=0]
• Output of decoder contains additional extrinsic information
• The sum of the a priori information and the extrinsic information will be the a priori information for the next stage of decoding, whether that is the 2nd decoder or the 1st decoder in the next iteration

$$L(u_k) = \ln\frac{\Pr[u_k = 1 \mid y]}{\Pr[u_k = 0 \mid y]} = \ln\frac{\sum_{\{s',s:\,u_k=1\}} p(s',s,y)}{\sum_{\{s',s:\,u_k=0\}} p(s',s,y)}$$

where 1) $u_k$ is the kth bit of the desired data sequence, 2) $y$ is the observed sequence, 3) the state transitions from state $s'$ at time k-1 to state $s$ at time k, 4) we want to evaluate this LLR for every k.

Break the probability computation into:

$$p(s',s,y) = p(s', y_{j<k}) \cdot p(s, y_k \mid s') \cdot p(y_{j>k} \mid s)$$

Gamma: $\gamma_k(s',s) = p(s, y_k \mid s')$
Alpha: $\alpha_{k-1}(s') = p(s', y_{j<k})$
Beta: $\beta_k(s) = p(y_{j>k} \mid s)$

Gamma, Alpha and Beta Calculations

Gamma: calculated from known bits up to k, needs to be stored

$$\gamma_k(s',s) = p(s, y_k \mid s') = P(s \mid s') \cdot p(y_k \mid s, s') = P(u_k) \cdot p(y_k \mid u_k)$$

where $P(u_k)$ is calculated from the a priori information and $p(y_k \mid u_k)$ is calculated from the received bits.

Alpha: calculated by a forward recursion through the trellis based on Gamma

$$\alpha_k(s) = \sum_{s'} \gamma_k(s',s) \cdot \alpha_{k-1}(s')$$

Beta: calculated by a backward recursion from the end of the trellis

$$\beta_{k-1}(s') = \sum_{s} \gamma_k(s',s) \cdot \beta_k(s)$$

[Figure: window algorithm, showing the Gamma, Alpha and Beta computations and the dummy Betas.]
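The two recursions can be sketched directly in the probability domain. This is an illustrative sketch under assumptions, not the windowed implementation the figure refers to: the whole block is processed at once, the trellis is assumed to start in state 0 and to end in an unknown state, and the names `forward_backward` and `gammas` are hypothetical. The code index is shifted by one relative to the formulas: `alpha[k + 1]` holds what the slide calls alpha at step k.

```python
def forward_backward(gammas, n_states):
    """Run the alpha and beta recursions over a trellis.

    gammas[k][sp][s] is the branch probability gamma for the transition from
    state sp to state s at step k (0.0 for transitions that do not exist).
    Assumptions: the trellis starts in state 0 and may end in any state.
    """
    n_steps = len(gammas)
    alpha = [[0.0] * n_states for _ in range(n_steps + 1)]
    beta = [[0.0] * n_states for _ in range(n_steps + 1)]
    alpha[0][0] = 1.0                        # known start state
    beta[n_steps] = [1.0] * n_states         # unknown end state

    for k in range(n_steps):                 # forward recursion
        for s in range(n_states):
            alpha[k + 1][s] = sum(gammas[k][sp][s] * alpha[k][sp]
                                  for sp in range(n_states))
    for k in range(n_steps - 1, -1, -1):     # backward recursion
        for sp in range(n_states):
            beta[k][sp] = sum(gammas[k][sp][s] * beta[k + 1][s]
                              for s in range(n_states))
    return alpha, beta
```

A useful sanity check is that the sum over states of `alpha[k][s] * beta[k][s]` is the same for every k, since it always equals the probability of the whole observed sequence. A practical decoder would normalise or work in the log domain to avoid underflow on long blocks.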

Log MAP and MAX-log MAP

Compute logarithms of alpha, beta and gamma, which means we compute $\ln(e^{\delta_1} + e^{\delta_2})$:

Log-MAP: $\ln(e^{\delta_1} + e^{\delta_2}) = \max(\delta_1, \delta_2) + f_c(|\delta_1 - \delta_2|)$
MAX-Log-MAP: $\ln(e^{\delta_1} + e^{\delta_2}) \approx \max(\delta_1, \delta_2)$

$f_c$ is the correction function (implemented as a table).

• MAX-log MAP suffers approx. 0.5 dB from log MAP.
• For log-MAP, small correction table needed (approx. 6 non-zero values).
• Absolute difference used as table look-up. We need the difference!

[Figure: BER, 10^-1 down to 10^-6, for MaxlogAPP and LogAPP; horizontal axis from 2.2 to 2.9. Courtesy: Bing Xu, Bell Labs Australia]
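The relationship between the two approximations is easy to check numerically. In the sketch below the exact correction term ln(1 + e^(-|d1 - d2|)) is standard; the table step of 0.5 and its eight entries are assumptions for illustration, not the table used for the curves above.

```python
import math

def max_log_map(d1, d2):
    """MAX-log-MAP: keep only the maximum."""
    return max(d1, d2)

def log_map_exact(d1, d2):
    """Log-MAP with the exact correction term ln(1 + exp(-|d1 - d2|))."""
    return max(d1, d2) + math.log1p(math.exp(-abs(d1 - d2)))

# Hypothetical lookup table: correction sampled at |d1 - d2| = 0, 0.5, 1.0, ...
STEP = 0.5
TABLE = [math.log1p(math.exp(-i * STEP)) for i in range(8)]

def log_map_table(d1, d2):
    """Log-MAP with the correction read from TABLE (zero beyond its range)."""
    index = int(abs(d1 - d2) / STEP + 0.5)      # absolute difference as index
    correction = TABLE[index] if index < len(TABLE) else 0.0
    return max(d1, d2) + correction
```

The correction is largest (ln 2) when the two inputs are equal and falls off quickly, which is why a handful of table entries is enough and why the table is indexed by the absolute difference.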

High Performance DSP Requirements
• Very high levels of DSP integer performance
• Support for complex real-time synchronous applications (latency, predictable throughput, synchronization)
• Large memory and I/O bandwidth.
• Scalability to meet wide range of cost, power, performance.
• Cost & power efficient solution.
• Friendly, compiler driven, programming environment.

[Figure: Some DSP Applications. MOPS, logarithmic scale from 10 to 100K, versus year, 1997 to 2001, with a “traditional DSP” trend line. Applications plotted: V.34, 16 HR GSM, GSM term, PCS term, K56, ADSL 500k, 24 ch. modem, ADSL 6M, DAB rcvr, MPEGII encode, set-top box, 1G eth. xcvr, Soft radio, 3G Wireless, 3-D graphics?]


Compiler Driven VLIW

Instruction format: cond/branch | ex1 | ex2 | ex3 | ... | exn

[Figure: data memory and register array connected through an interconnect to execution units ex1 (alu), ex2 (alu), ex3 (mpy), ex4 (ld/st), ..., exn (ld/st).]

• Large orthogonal register set, regular interconnect
• Atomic RISC-like operations => heavily pipelined, high freq. clock

Explicitly Parallel Instruction Computing

Execution clusters
[Figure: data memory; two clusters, each with its own register array and interconnect. Cluster 1: ex1 (alu), ex2 (alu), ex3 (ld/st). Cluster 2: ex4 (alu), ex5 (mpy), ex6 (ld/st).]

Execution sets
[Figure: a fetch set with per-instruction bits 1 1 1 0 1 0 1 0; exec. set.]
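One common way to carve a fetch set into execution sets is a chaining bit per instruction. The sketch below assumes that convention (a 1 means the next instruction issues in the same cycle, a 0 closes the set); the slide shows the bit pattern but does not spell out its meaning, so the convention and the function name are assumptions.

```python
def split_execution_sets(fetch_set):
    """Group a fetch set into execution sets using one chaining bit per slot.

    fetch_set is a list of (instruction, chain_bit) pairs. Assumed convention:
    chain_bit 1 means the next instruction issues in the same cycle as this
    one, and chain_bit 0 closes the current execution set.
    """
    sets, current = [], []
    for instruction, chain_bit in fetch_set:
        current.append(instruction)
        if chain_bit == 0:
            sets.append(current)
            current = []
    if current:                       # unterminated tail still forms a set
        sets.append(current)
    return sets
```

With the eight-slot pattern 1 1 1 0 1 0 1 0 this yields execution sets of four, two and two instructions, so one fetch feeds three issue cycles and no slots are wasted on NOP padding.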

Explicitly Parallel Instruction Computing

Predication (guarded) exec.: cond | any instruction
- eliminates branches
- improves compiler efficiency
- removes pipeline bubbles
- fill delayed branch slots with predicated instructions

Instruction modifiers: modifier | instr1 | instr2 | instr3 | instr4
- allows shorter instruction length
- extend register addressing
- predication
- execution set identifier
- looping
- extended operations

Texas Instruments ‘C6201

[Figure: Program Memory (16K x 32) with a 256-bit path to Instruction Dispatch & Decode; Register Bank A (16 x 32) and Register Bank B (16 x 32), each with ALU, shift, mpy and add units; Data Memory (32K x 16).]

• 8-way VLIW with two execution clusters
• 256 bit (8x32) instruction fetch with variable length execute set
• Each 32 bit instruction individually predicated
• 11 stage pipeline
• 1600 MIPS, 400 MMACs @ 200 MHz


FIR Filter on TI ‘C6x

Hand-coded assembly: 32-tap FIR filter

    loop:   ldw   .d1t1  *a4++,a5
    ||      ldw   .d2t2  *b4++,b5
    ||[b0]  sub   .s2    b0,1,b0
    ||[b0]  b     .s1    loop
    ||      mpy   .m1x   a5,b5,a6
    ||      mpyh  .m2x   a5,b5,b6
    ||      add   .l1    a7,a6,a7
    ||      add   .l2    b7,b6,b7

• Outer loop: 23 cycles, 180 bytes
  – 1 cycle in inner loop
• All 8 exec units used in inner loop - maximum efficiency
  – 2 MACs per cycle

Assembly syntax more difficult to learn. Hard to get full use of all 8 execution units at once. Software pipelining difficult to implement, and requires longer prolog/epilog (larger code size).

Courtesy: Gareth Hughes, Bell Labs Australia
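The arithmetic behind “2 MACs per cycle” can be sketched at a high level. The function below loosely mirrors the inner loop: each pass handles one pair of samples and one pair of coefficients and feeds two separate accumulators, standing in for the mpy/add and mpyh/add pairs. The function name and the list-based inputs are assumptions, and nothing here models the software pipelining of the real loop.

```python
def fir_dual_mac(samples, coeffs):
    """One FIR output computed with two accumulators, two taps per pass."""
    if len(samples) != len(coeffs) or len(coeffs) % 2:
        raise ValueError("need equal, even-length sample and coefficient lists")
    acc_even = 0                       # like a7: sum of even-index products
    acc_odd = 0                        # like b7: sum of odd-index products
    for i in range(0, len(coeffs), 2):
        acc_even += samples[i] * coeffs[i]          # mpy  + add
        acc_odd += samples[i + 1] * coeffs[i + 1]   # mpyh + add
    return acc_even + acc_odd          # the two partial sums are combined once
```

For a 32-tap filter the loop body runs 16 times, consistent with a one-cycle inner loop that retires two multiply-accumulates per cycle; the remaining cycles of the outer loop go to prolog, epilog and the final combination.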

Viterbi on TI ‘C6x
• 16-state Viterbi decoder for GSM
• 3-cycle 2-ACS inner loop
  – 3 cycles per butterfly
  – 32 cycles per GSM timeslot (8 butterflies)
• MPY instructions used to move data

    LOOP:
      [b1]   b      .s1   LOOP
    ||[b1]   sub    .s2   b1,1,b1
    ||[!a2]  sth    .d1   b12,*+a6[8]
    ||[!a2]  add    .d2   b0,b14,b14
    ||       cmpgt  .l1   a11,a10,a1
    ||       cmpgt  .l2   b11,b10,b0
    ||       mpy    .m1x  1,b5,a4

      [a2]   sub    .s1   a2,1,a2
    ||[!a2]  sth    .d1   a12,*a6++
    ||[a1]   add    .s2   2,b0,b0
    ||[b0]   mpy    .m2   1,b11,b12
    ||       mpy    .m1   1,a10,a12
    ||       sub    .l2x  a7,b5,b10
    ||       ldh    .d2   *++b9,b5

             shl    .s2   b14,2,b14
    ||[a1]   mpy    .m1   1,a11,a12
    ||       add    .s1   a7,a4,a10
    ||       sub    .l1x  b13,a4,a11
    ||       add    .l2   b13,b5,b11
    ||       mpy    .m2   1,b10,b12
    ||       ldh    .d2   *b4++[2],a7
    ||       ldh    .d1   *a5++[2],b13
    ; end of LOOP (x8)

from TI WWW site: ftp://ftp.ti.com/pub/tms320bbs/c62xfiles/vitgsm.asm

[Figure: Utilization of execution units in Viterbi decoder. Rows: .D1, .D2, .M1, .M2, .L1, .L2, .S1, .S2; columns: cycles 0 to 31, showing the LDH, STH, MPY, CMPGT, ADD, SUB, SHL, MVK and B instructions issued on each unit.]


Lucent / Motorola Star*Core SC140

[Figure: Program / Data Memory, Program Sequencer, Instruction Dispatcher, Data Registers (16), Address Registers (27), four data execution units (each with MAC, ALU and BFU) and two AAUs.]

• 6-way VLIW with 128 bit (8x16) instruction fetch
• Prefix instructions for high performance without sacrificing code density
• Each execution set (parallel instructions + prefix) predicated
• 5 stage pipeline
• 1800 MIPS, 1200 MMACs @ 300 MHz

Viterbi on Star*Core

GSM (K=5, 16 states)

    [ move.2l (r0)+,d0:d1    move.2l (r1)+,d1:d2 ]
    [ add2 d0,d4    sub2 d6,d2    sub2 d4,d0    add2 d2,d6 ]
    [ max2vit d4,d2    max2vit d0,d6 ]
    [ vsl.4w d2:d6:d1:d3,(r2)+n0    vsl.4f d2:d6:d1:d3,(r3)+n0 ]
    (x4)

• Hardware support for Viterbi algorithm:
  – max2vit instruction
  – vsl instruction
• 1 cycle per butterfly through software-pipelining
• Decision bits are manually stored using the Viterbi Shift Left (VSL) instruction

[Figure: max2vit d4,d2 and max2vit d0,d6 feed SR; vsl.4w d2:d6:d1:d3,(r2)+n0 operates on D1 and D3 (decisions) and D2 and D6 (path metrics); results written to memory.]

Courtesy: Gareth Hughes, Bell Labs Australia

Log-MAP on Star*Core

Star*Core code for log-MAP butterfly (inputs d0: a, d1: b, d6: x):

    Cycle 1    move.w (r0)+,d0    move.w (r1)+,d1
    Cycle 2    add d0,d6,d0       sub d6,d0,d5         d0: a+x, d5: a-x
    Cycle 3    sub d6,d1,d4       add d1,d6,d1         d4: b-x, d1: b+x
    Cycle 4    sub d0,d4,d2       sub d1,d5,d3         d2: d0-d4, d3: d1-d5
    Cycle 5    max d0,d4          max d1,d5            d4: max(d0,d4), d5: max(d1,d5)
    Cycle 6    abs d2             abs d3
    Cycle 7    move.l d2,n0                            n0: |d2|
    Cycle 8    move.l d3,n0       move.w (r6+n0),d2    n0: |d3|, d2: table entry at r6
    Cycle 9    add d4,d2,d4       move.w (r6+n0),d3    d4: d4+d2, d3: table entry at r6
    Cycle 10   add d5,d3,d5                            d5: d5+d3
    Cycle 11   move.2w d4:d5,(r2)+

Courtesy: Gareth Hughes, Bell Labs Australia

This code uses 2 of the 4 ALUs and can be software pipelined to achieve 6 cycles per log-MAP butterfly.
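The register dataflow of the butterfly can be followed step by step in a short sketch. Variable names follow the register labels; the correction table contents (exact values of ln(1 + e^(-d)) at integer differences 0 to 5), the integer-metric assumption and the function names are illustrative assumptions, since the contents of the table addressed through r6 are not given.

```python
import math

# Hypothetical correction table indexed by the integer absolute difference.
CORRECTION = [math.log1p(math.exp(-d)) for d in range(6)]

def lookup(abs_diff):
    """Table read; differences past the end of the table get no correction."""
    return CORRECTION[abs_diff] if abs_diff < len(CORRECTION) else 0.0

def log_map_butterfly(a, b, x):
    """Follow the register dataflow of the butterfly for integer metrics."""
    d0 = a + x                     # cycle 2
    d5 = a - x
    d4 = b - x                     # cycle 3
    d1 = b + x
    d2 = d0 - d4                   # cycle 4: differences for the table index
    d3 = d1 - d5
    d4 = max(d0, d4)               # cycle 5
    d5 = max(d1, d5)
    d2, d3 = abs(d2), abs(d3)      # cycle 6
    d4 = d4 + lookup(d2)           # cycles 7 to 10: table lookup and correction
    d5 = d5 + lookup(d3)
    return d4, d5                  # cycle 11: both new metrics are stored
```

Most of the cycles go to computing the differences, moving them into an address register and reading the table, which is the work a plain max instruction cannot absorb; this is the cost of needing the difference as well as the maximum.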


Parallel DSP Architectures

    Arch.       Parallelism
    S/scalar    Dynamic instruction level
    VLIW        Static instruction level
    SIMD        Highly regular, data dependent
    MIMD        Task level

[Columns “Compile?” and “Power?”: rating marks not recoverable.]

MIMD with VLIW / SIMD provides high order parallel execution.

The future of high performance DSPs is MIMD.

Daytona: A Multiprocessor DSP Architecture

[Figure: external I/O interfaces and memory connect through a buffered I/O subsystem to the chip's split transaction bus (128 bits), with arbitration and synchronization; on the bus sit programmable processing elements (PE) and a hardware accelerator.]

• Scalable architecture - multiple programmable DSPs on a single chip
• One bus supports different programmable DSPs and microcontrollers

Split Transaction Bus
• Separate address and data busses - each with pipelined protocol
• Multiple outstanding transactions - varying size/priority
• Separate bus arbitration

[Figure: PE and memory controller joined by an address bus (100 MHz) and a data bus (128 bits, 100 MHz), each with its own round-robin arbiter; address and data transfers carry IDs.]
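Round-robin arbitration, as used for each of the two busses, can be sketched as a small class. This is a generic textbook arbiter written for illustration; the class name, the boolean request list and the choice to give requester 0 priority first are assumptions, and it does not model transaction IDs, sizes or priorities.

```python
class RoundRobinArbiter:
    """Grant one requester per cycle, rotating priority after each grant."""

    def __init__(self, n_requesters):
        self.n = n_requesters
        self.last = n_requesters - 1     # so requester 0 has priority first

    def grant(self, requests):
        """requests is a list of booleans; returns the granted index or None."""
        for offset in range(1, self.n + 1):
            candidate = (self.last + offset) % self.n
            if requests[candidate]:
                self.last = candidate    # the winner gets lowest priority next
                return candidate
        return None
```

Because the most recent winner drops to the back of the queue, no requester can be starved, and with separate arbiters an address phase and an unrelated data phase can be granted in the same cycle.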

Memory Hierarchy in MIMD DSPs

Multiple copies of 1 application (e.g. odd/even slot channel equalisation)
• Multiple copies of same software - shared memory multiprocessing

Mix of different applications (e.g. equalisation, convolutional decoding)
• Heterogeneous mix of applications

Flat Memory Architecture vs. Hierarchical Memory Architecture
[Figure: flat: each DSP has its own SRAM, giving 2 copies of software (inefficient). Hierarchical: each DSP has its own cache in front of a shared DRAM, giving 1 copy of software.]

Shared Memory Multiprocessing
• 64 semaphores provided for process synchronization
• L-1 cache coherency using a snoopy protocol (modified MESI used)

Coherent transaction: access to shared data uses a coherent transaction. Caches “snoop” the address and query their tag RAMs. A cache hit prevents the memory controller from servicing the request.

[Figure: one DSP issues an access to shared data; the caches of the other three DSPs snoop the address (miss, hit, miss) and the hit is signalled to the memory controller.]


Daytona Multiprocessor DSP Chip

Bell Laboratories research chip for 3G wireless base-stations / head-end xDSL

[Figure: four processing elements, each a 32-b RISC with a 64-b 4-MAC SIMD DSP and cache memory, on a 128-b split transaction bus with arbiter, semaphore, host interface, I/O & memory controller, and test & JTAG port.]

Chip characteristics:
    Core Area   120 mm^2
    Speed       100 MHz
    Power       4 W
    Tech        0.25 um

Paper 4.2, ISSCC2000

Photomicrograph of Daytona Test Chip

[Photomicrograph labels: Processing Element (PE) (three labels), I/O Subsystem, 8KB Re-configurable Memory, Split Transaction Bus, LRU, BUS INT, HDS, DLL, SPARC, Arbiter, Semph, Vector Unit (RVU).]

Paper 4.2, ISSCC2000


Acknowledgements

The following people contributed to the work in this tutorial:

Low Power DSPs for Wireless
• Wanda Gass: Texas Instruments
• Mihran Touriguian: Atmel

High Performance DSPs for Wireless Infrastructure
• Bryan Ackland: Bell Labs US - High Perf. DSP Architecture
• Gareth Hughes: Bell Labs Australia - LU DSP16210, ‘C6x and Starcore benchmarks
• Bing Xu: Bell Labs Australia - SOVA, MAP, LOG-MAP
• Ran-Hong Yan: Bell Labs UK - 3G Wireless
• Daytona Team (J. Williams, K.J. Singh, J. Othmer, B. Ackland): Bell Labs US

References

[1] P. Lapsley, J. Bier, A. Shoham, E. Lee, “DSP Processor Fundamentals,” IEEE Press, New York, 1997.
[2] D. Skillicorn, “A Taxonomy for Computer Architectures,” Computer Magazine, Nov. 1988.
[3] H. Kabuo, M. Okamoto, I. Tanaka, H. Yasoshima, S. Marui, M. Yamasaki, T. Sugimura, K. Ueda, T. Ishikawa, H. Suzuki, R. Asahi, “An 80 MOPS-Peak High-Speed and Low-Power-Consumption 16-b Digital Signal Processor,” IEEE Journal of Solid-State Circuits, Vol. 31, No. 4, pp. 494-503, April 1996.
[4] E. A. Lee, D. G. Messerschmitt, Digital Communication, Kluwer Academic Publishers, Boston, 1988.
[5] W. Lee et al., “A 1V DSP for Wireless Communications,” Proceedings IEEE International Solid-State Circuits Conference, pp. 92-93, February 1997.
[6] S. Lin, J. Costello Jr., Error Control Coding: Fundamentals and Applications, Prentice Hall, New Jersey, 1983.
[7] Lucent 16000, http://www.lucent.com/micro/ or http://www.lucent.dk/micro/dsp16000/
[8] Thomas Parsons, Voice and Speech Processing, McGraw-Hill Book Company, New York, 1987.
[9] TMS320C54x User’s Guide, available from the Texas Instruments Literature Response Center.
[10] I. Verbauwhede, M. Touriguian, “A Low Power DSP Engine for Wireless Communications,” Journal of VLSI Signal Processing 18, pp. 177-186, Kluwer Academic Publishers, 1998.
[11] I. Verbauwhede, M. Touriguian, “Wireless digital signal processors,” chapter in Digital Signal Processing for Multimedia Systems, edited by K.K. Parhi, T. Nishitani, Marcel Dekker, New York, 1999.
[12] M. Okamoto, K. Stone, T. Sawai, H. Kabuo, S. Marui, M. Yamasaki, Y. Uto, Y. Sugisawa, Y. Sasagawa, T. Ishikawa, H. Suzuki, N. Minamida, R. Yamanaka, K. Ueda, “A High Performance DSP Architecture for Next Generation Mobile Phone Systems,” 1998 IEEE DSP Workshop.
[13] Lode specifications, available from www.atmel.com
[14] M.W. Oliphant, “The Mobile Phone meets the Internet,” IEEE Spectrum, pp. 20-28, Aug. 1999.
[15] L. C. Godara, “Application of Antenna Arrays to Mobile Communications: Part 1,” Proc. IEEE, Vol. 85, No. 7, pp. 1031-1060, July 1997.
[16] G. D. Forney, Jr., “Maximum Likelihood Sequence Estimation of Digital Sequences in the Presence of Intersymbol Interference,” IEEE Trans. Inform. Theory, Vol. IT-18, pp. 363-378, May 1972.
[17] C. Berrou, A. Glavieux, P. Thitimajshima, “Near Shannon Limit Error-Correcting Coding and Decoding: Turbo-Codes (1),” Proc. ICC’93, May 1993.
[18] J. Hagenauer, P. Hoeher, “A Viterbi Algorithm with Soft-Decision Outputs and its Applications,” Proc. Globecom 89, pp. 47.1.1-47.1.7, Nov. 1989.
[19] L. Bahl, J. Cocke, F. Jelinek, J. Raviv, “Optimal Decoding of Linear Codes for Minimizing Symbol Error Rate,” IEEE Trans. Inform. Theory, Vol. IT-20, pp. 284-287, Mar. 1974.
[20] J. Turley, H. Hakkarainen, “TI’s new ‘C6x DSP Screams at 1600 MIPS,” Microprocessor Report, Vol. 11, No. 2, p. 14, Feb. 1997.
[21] “Starcore Launched First Architecture,” Microprocessor Report, Vol. 12, No. 14, p. 22, Oct. 1998.
[22] B. Ackland, P. D’Arcy, “A New Generation of DSP Architectures,” Proc. IEEE CICC99, Paper 25.1.1.
[23] J. Williams, K.J. Singh, C.J. Nicol, B. Ackland, “A 3.2 GOPs Multiprocessor DSP for Communication Applications,” Proc. IEEE ISSCC2000, Paper 4.2.

