DSP Architectures for Next-Generation Wireless Communications
Ingrid Verbauwhede, Department of Electrical Engineering, University of California, Los Angeles
Chris Nicol, Bell Laboratories Australia, Lucent Technologies
ISSCC 2000, DSP Tutorial. Ingrid Verbauwhede, Chris Nicol
Mobile Wireless Trends
[Figure: global wireline and global wireless subscribers (in thousands, axis 0 to 1,600,000) by year, 1995 to 2010.]
• Wireline: CAGR 5%, global penetration (2010) 20%
• Wireless: CAGR 21%, global penetration (2010) 21% (Cellular + PCS + WLAS + Other)
• Global population: 7 billion, CAGR 1995-2010 1.4%
Worldwide deployment of mobile communications is exceeding expectations.
DSP Evolution and Markets
[Figure: DSP market pie chart. $2B market, 30% growth rate. Segments: Disk; Wireless (cellular infrastructure, mobile handsets, cordless, GPS); Modem (V.34, V.90, xDSL); Consumer & Automotive; Other. Segment values shown: $270M, $1.01B, $727M. Source: Forward Concepts 1996.]
[Figure: power (mW/MIP, log scale from 10 to 10K) for processors over time: M68000 ($200), 80286 ($200), 80386 ($300), DSP-1 ($150), Pentium ($300), DSP-32C ($250), Pentium (MMX) ($700), DSP16A ($15), DSP1600.]
Lode Core Architecture
Domain specific instruction set
• Basic instruction set for general purpose DSP, e.g. MAC, min, max, etc.
• Extra instructions for performance with every new generation, e.g. "square distance and accumulate":
$D = \sum_{i=0}^{N-1} \lVert x(i) - y(i) \rVert^2$
One 32 bit instruction: a3 = abs (*r0 - *r1 < asr), a0 = a0 + sqr(a3), r0++, r1++;
• Bus network and instruction set design go together
• CISC, thus compiler unfriendly
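As a rough illustration of what the "square distance and accumulate" instruction computes per loop iteration, here is a minimal Python sketch. It assumes scalar samples, and the function name is hypothetical; the comments map each step to the register names in the slide's instruction.

```python
def square_distance_accumulate(x, y):
    """Return D = sum_i (x[i] - y[i])**2 for two equal-length sequences."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    acc = 0  # plays the role of accumulator a0
    for xi, yi in zip(x, y):
        diff = abs(xi - yi)  # a3 = abs(*r0 - *r1)
        acc += diff * diff   # a0 = a0 + sqr(a3)
    return acc


print(square_distance_accumulate([1, 4, 2], [3, 1, 2]))  # 4 + 9 + 0 = 13
```

On the DSP the whole loop body, including both pointer increments, is a single 32 bit instruction; the sketch only shows the arithmetic.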
Control & Pipeline for DSPs
RISC: load/store machine, memory access with load/store instructions (DLX, MIPS, D10V)
  Fetch | Decode | Execute | Memory Access | Write Back
  (Execute: execution / address generation. Memory Access: memory access / branch.)
  Excellent for complex decision making!
DSP: register-memory architecture (TI, Lucent, HX, Lode)
  Fetch | Decode | Memory Access | Execute | Write Back
  (Memory Access: memory access. Execute: execution.)
  Excellent for number crunching!
Pipeline RISC compared to DSP
RISC example:
  r0 = *p0;      // load data
  a0 = a0 + r0;  // execute

RISC pipeline (too expensive for DSP):
  Cycle     1      2       3        4            5            6
  instr 1   Fetch  Decode  Execute  Mem. Access
  instr 2          Fetch   Decode   Execute      Mem. Access
  instr 3                  Fetch    Decode       Execute      Mem. Access

DSP pipeline (memory intensive applications):
  Cycle     1      2       3            4            5            6            7
  instr 1   Fetch  Decode  Mem. Access  Execute
  instr 2          Fetch   Decode       Mem. Access  Execute
  instr 3                  Fetch        Decode       Mem. Access  Execute
  instr 4                               Fetch        Decode       Mem. Access  Execute

Penalty: data dependent branch is expensive
Other control features
Hardware looping:
• Because software branch is expensive
• "Zero overhead hardware loops" (for tight FIR loops), hardware supported
Interrupts: hardware with shadow registers for extremely fast context switching.
Special instruction cache:
• Single instruction "repeat" buffer
• Multiple instruction cache: under programmer's control!
• E.g. Lucent DSP16210: 31 x 32 instruction cache
Predictable worst case execution time!
Low Power DSPs
                  DSP 1600 Core                                  C54x 1V DSP
  Source          Lucent - 1609 low cost consumer 16-bit         Texas Instruments - ISSCC 1997
  Process         0.35µ 3LM CMOS                                 0.25µ 3LM CMOS, dual Vt process
  Performance     80 M 16b MAC/s at 3.3V                         65 M 16b MAC/s at 1.0V
  Power           1.4 mW/MHz at 3.3V                             0.21 mW/MHz at 1.0V
  Stand-by power  30 µW                                          4.0 mW
BUT: DSP Software Development
• Complex DSP architecture not amenable to compiler technology
• Algorithms are modeled in high level language (e.g. C++)
• Solutions are implemented and debugged in hand-optimized assembler - large development effort with minimal tool support
Flow: HLL algorithmic model -> hand coded assembler prototype code -> optimize & debug -> production code
Long, frustrating time to market. Fragile legacy code.
Still used in handhelds, but changing in basestations (Part II).
Mobile Wireless Evolution
First Generation (Past)
• Service: Mobile Telephone Service: Carphone
• Technology: Analog Cellular Technology; Macrocellular Systems
• Standards: NMT, TACS, Analog AMPS
Second Generation (Now)
• Service: Digital Voice and Messaging/Data Services; Fixed Wireless Loop
• Technology: Digital Cellular Technology + IN emergence; Microcellular & Picocellular: capacity, quality; Enhanced Cordless Technology
• Standards: GSM, IS-54/136 TDMA, IS-95/cdmaOne, PDC, DECT
Third Generation (Year 2000-2005)
• Service: Integrated High Quality Audio and Data; Narrowband and Broadband Multimedia Services + IN integration
• Technology: Broader Bandwidth; Efficient Radio Transmission; Information Compression; Higher Frequency Spectrum Utilization; IN + Network Management integration
• Standards: WCDMA, UWC-136 TDMA, cdma2000
Fourth Generation (Year 2010?)
• Service: TelePresencing; Education, training and dynamic information access
• Technology: Wireless-Wireline and Broadband Transparency; Knowledge-Based Network Operations; Unified Service Network
Global roaming
We are entering the decade of wireless data communications - and World-War 3G
Mobile Data Services
• Carriers invest >$500 per subscriber, but subscriber voice calls (and therefore revenues) are reducing.
• Data currently 3% of wireless traffic - projected to >50% by 2005
• Wireless Internet: average internet connection 30 mins
• Text Messaging: saturating 2G voice networks
2.5 Generation Mobile Standards [1]
• GPRS: Packet Data over GSM - timeslot multiplexing, multi-slots per user.
• EDGE: 8-PSK modulation + GPRS, 384 Kbps max to 1 user.
3G - IMT2000 Proposals
• 144 Kbps automobile, 384 Kbps pedestrian, 2 Mbps stationary.
• Several proposals - UWC 136 (200 kHz, TDMA, 8-PSK = EDGE). UMTS and CDMA-2000 are both CDMA proposals.
Evolution of Mobile Wireless Network Architecture
[Figure: a 2G network (PSTN, MSC mobile switches, BSC, base stations) evolving into an IP-based 3G network (radio clients, packet connectivity over ATM / IP, network servers, Internet / advanced services, PSTN).]
Network servers in the IP-based 3G network:
• Packet Mode Servers: high speed data, multimedia, voice over IP, etc.
• Wireless Control Servers: feature control, network management, billing, etc.
• Circuit Mode Servers: voice, low speed data, etc.
Mobile networks are being upgraded in preparation for the delivery of high speed data services.
Mobile Wireless Infrastructure
[Figure: photographs of a macro-cell GSM basestation (6-12 TRX) and a micro-cell GSM basestation (2 TRX).]
2G Basestation Baseband Processing
• Multiple DSPs used for baseband processing (channel equalization, channel de/coding, encryption).
• RISC microcontroller for timing, framing, I/O control
• Software upgradable over the network
• DSPs dominate cost and power consumption
[Figure: Tx/Rx baseband processing board for a 2-carrier GSM basestation. Each carrier (Tx, Rx) has an AFE, several DSPs and RAM; the board also carries a RISC microcontroller, an ASIC, I/O blocks and a T1/E1 interface.]
Future trend - integrate baseband processing for a low cost Pico BTS
3G Basestation Baseband Processing
Increased DSP performance needed in next-generation basestation:
• Increased receiver algorithm sensitivity
• Antenna arrays - smart antennas
• Multi-standard basestations using software radio architecture
• 3G - constraint length 9, rate 1/2 convolutional coding for voice
• 3G - constraint length 4, Turbo codes for data
High performance DSPs + custom logic needed for 3G (Viterbi decoding and Turbo decoding)
[Figure: 3G receiver block diagram with the implementation target of each block:
• Synchronisation: cell search, slot sync, frame sync (DSP)
• Code generator: channelisation code, scrambling code (ASIC)
• Sliding correlator: despreading (ASIC)
• RAKE combiner: reassemble multipath (DSP, ASIC)
• Code tracking: delay-lock-loop (ASIC, DSP)
• SIR measurement, power control: fast power control (DSP)
• Channel estimation (DSP)
• Path search (ASIC)
• Deinterleaver (DSP)
• Decoder: Viterbi algorithm, Turbo decoding (DSP, ASIC)]
Courtesy: Bing Xu: Bell Labs Australia
Receiver Algorithms for GSM Basestation
• Enhanced receiver sensitivity
• Larger cells in suburban areas = reduced network cost
• Mobile transmits with less power = increased battery life
Existing receiver: Estimating Wireless Channel -> Equalizing Multi-path Effects -> Channel Decoding -> Speech Decoding
New iterative receiver: the same four stages, with speech statistics fed back iteratively. 1.3 dB improvement.
Challenge - requires 6x the DSP MIPS of the existing receiver in the basestation
Courtesy: Magnus Sandell: Bell Labs UK
Smart Antennas
• A multiple antenna element system
• Combined with a base station architecture and signal processing techniques designed to dynamically select or form the "optimum" beam pattern per user
[Figure: omnidirectional cell site, three sector cell site, intelligent antenna cell site.]
Increased cost in RF electronics and enhanced DSP requirements.
Fixed Multi-Beam Versus Adaptive Beam
Fixed Multi-Beam: select from, or use, multiple "fixed" antenna beams to optimize performance.
Adaptive Beam: adaptively "weight" and combine multiple antenna elements to optimize performance.
[Figure: both schemes serving Mobile 1 and Mobile 2 in the presence of an interferer, with direct and reflected rays.]
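To make "weight and combine" concrete, the following Python sketch forms a beam with a uniform linear array. It is only an illustration under stated assumptions: half-wavelength element spacing, narrowband plane waves, and fixed conjugate (delay-and-sum) weights. An adaptive system would update the weights from measurements; that step is not shown, and all function names are hypothetical.

```python
import cmath
import math


def steering_vector(n_elements, angle_deg, spacing_wavelengths=0.5):
    """Phase of a plane wave from angle_deg at each element of a uniform linear array."""
    phase_step = 2 * math.pi * spacing_wavelengths * math.sin(math.radians(angle_deg))
    return [cmath.exp(1j * phase_step * n) for n in range(n_elements)]


def beam_weights(n_elements, look_deg):
    """Conjugate (delay-and-sum) weights that point the beam at look_deg."""
    return [v.conjugate() / n_elements for v in steering_vector(n_elements, look_deg)]


def array_gain(weights, angle_deg):
    """Magnitude of the weighted sum of element signals for a wave from angle_deg."""
    sv = steering_vector(len(weights), angle_deg)
    return abs(sum(w * s for w, s in zip(weights, sv)))


w = beam_weights(8, 20.0)
print(array_gain(w, 20.0))   # close to 1.0 toward the wanted mobile
print(array_gain(w, -40.0))  # much smaller toward an interferer at -40 degrees
```

Every received sample needs one complex multiply-accumulate per antenna element, which is one reason smart antennas raise the DSP requirement.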
Digital Radio Trends - Software Radio
[Figure: multi-standard basestation chain: Antennas -> RF/Analog Processing (AMP, linear amplification, combining) -> A/D -> Digital Processing -> Network Interface -> Network.]
• RF/IF: amplifiers, mixers, filters, ...
• Digital processing: filtering, demodulation, rake receiver, channel coding, diversity, modulation, equalization, correlator, encryption, ...
• Trends: DSPs - higher speed, more powerful; higher dynamic range; smaller
Wideband Receiver Architecture
[Figure: channels CH1, CH2, CH3, ..., CHM around fRF pass through "RF-IF & Filter" to fIF, then a high speed A/D, then a digital channeliser that separates CH1 ... CHM to baseband (fBB), each followed by its own baseband processing.]
Increased DSP performance needed for Software Radio
Turbo Codes
For 3G Wireless (UMTS and CDMA2000)
• Voice service: BER requirement 10^-3
• Data service: BER requirement 10^-5
• Parallel concatenation of convolutional codes is used to give the codes structure so they can be decoded
• Pseudorandom interleaving is used to give the codes performance which approaches that for random coding
• Resulting encoder structure: two Recursive Systematic Convolutional (RSC) codes
[Figure: the input feeds the systematic output and Encoder #1 directly, and Encoder #2 through the interleaver; a MUX combines the two encoder outputs into the parity output.]
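The encoder structure above can be sketched in a few lines of Python. This is a simplified illustration, not the standardised encoder: the RSC polynomials (feedback 1 + D^2 + D^3, feedforward 1 + D + D^3, constraint length 4) are an assumption, the interleaver is an arbitrary permutation supplied by the caller, and trellis termination and puncturing are omitted.

```python
def rsc_parity(bits):
    """Parity stream of an 8-state recursive systematic convolutional encoder.

    Assumed polynomials: feedback 1 + D^2 + D^3, feedforward 1 + D + D^3.
    """
    s1 = s2 = s3 = 0  # shift register, s1 is the most recent bit
    parity = []
    for u in bits:
        a = u ^ s2 ^ s3             # recursive feedback bit
        parity.append(a ^ s1 ^ s3)  # feedforward taps
        s1, s2, s3 = a, s1, s2      # shift
    return parity


def turbo_encode(bits, interleaver):
    """Return (systematic, parity from encoder #1, parity from encoder #2)."""
    interleaved = [bits[i] for i in interleaver]
    return list(bits), rsc_parity(bits), rsc_parity(interleaved)


systematic, parity1, parity2 = turbo_encode([1, 0, 0, 0], [2, 0, 3, 1])
print(systematic, parity1, parity2)
```

Because the encoders are recursive, a single 1 at the input keeps the parity stream active indefinitely, which is what makes the interleaved copy useful to the second decoder.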
Turbo Decoding
• Key idea: iterative decoding (up to 10 iterations for 3G)
• There is one decoder for each elementary encoder.
• Each decoder estimates the a-posteriori probability (APP) of each data bit.
• The APPs are used as a priori information by the other decoder.
[Figure: a DeMUX splits the received stream into systematic data and parity data. Decoder #1 passes its APP output through an interleaver to Decoder #2; Decoder #2 passes its APP output through a deinterleaver back to Decoder #1 and produces the hard bit decisions.]
Soft-Output Decoding Algorithms
Requirements for Turbo:
– Accept soft inputs in the form of a priori probabilities
– Produce a posteriori probability (APP) estimates of the data
– "Soft-Input Soft-Output"
Trellis-Based Estimation Algorithms:
• Sequence estimation: Viterbi Algorithm -> SOVA -> Improved SOVA
• Symbol-by-symbol estimation: MAP Algorithm -> max-log-MAP -> log-MAP
Today's high-performance DSPs are highly MAC-focussed (for filtering in modem applications). Some DSPs provide hardware support for efficient implementation of Viterbi - none support SOVA or log-MAP.
Iterative channel estimation also uses Soft-Input Soft-Output decoders.
SOVA and log-MAP use modified Add-Compare-Select operations: they not only select the maximum path metric but also need to keep the difference between the two candidates.
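A minimal Python sketch of such a modified Add-Compare-Select step follows. The function name and argument layout are hypothetical; the point is only that the operation returns the metric difference in addition to the survivor and the decision bit that a plain Viterbi ACS produces.

```python
def acs_with_difference(metric_a, branch_a, metric_b, branch_b):
    """Add-compare-select that also keeps the difference of the two candidates.

    Returns (survivor metric, decision bit, absolute metric difference).
    """
    cand_a = metric_a + branch_a                # add
    cand_b = metric_b + branch_b
    decision = 1 if cand_b > cand_a else 0      # compare
    survivor = cand_b if decision else cand_a   # select
    return survivor, decision, abs(cand_a - cand_b)


print(acs_with_difference(10, 3, 12, -2))  # candidates 13 and 10
```

A DSP whose Viterbi support discards the difference cannot reuse that hardware for SOVA or log-MAP.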
The Maximum A Posteriori (MAP) Algorithm
Log-Likelihood Ratio:
$L(d) = \ln \frac{\Pr[d = 1]}{\Pr[d = 0]}$
$L(d \mid y) = \ln \frac{\Pr(d = 1 \mid y)}{\Pr(d = 0 \mid y)} = \ln \frac{p(y \mid d = 1)}{p(y \mid d = 0)} + L(d)$
• L(d) is the a priori value, from Pr[d=1] and Pr[d=0]
• Output of decoder contains additional extrinsic information
• The sum of the a priori information and the extrinsic information will be the a priori information for the next stage of decoding, either the 2nd decoder or the 1st decoder in the next iteration
$L(u_k) = \ln \frac{\Pr[u_k = 1 \mid y]}{\Pr[u_k = 0 \mid y]} = \ln \frac{\sum_{\{s', s \,:\, u_k = 1\}} p(s', s, y)}{\sum_{\{s', s \,:\, u_k = 0\}} p(s', s, y)}$
where 1) $u_k$ is the kth bit of the desired data sequence, 2) y is the observed sequence, 3) the state transitions from state s' at time k-1 to state s at time k, 4) we want to evaluate this LLR for every k.
Break the probability computation into:
$p(s', s, y) = p(s', y_{j<k}) \cdot p(s, y_k \mid s') \cdot p(y_{j>k} \mid s)$
Alpha: $\alpha_{k-1}(s') = p(s', y_{j<k})$
Gamma: $\gamma_k(s', s) = p(s, y_k \mid s')$
Beta: $\beta_k(s) = p(y_{j>k} \mid s)$
Gamma, Alpha and Beta Calculations
Gamma: calculated from known bits up to k, needs to be stored
$\gamma_k(s', s) = p(s, y_k \mid s') = P(s \mid s') \cdot p(y_k \mid s, s') = P(u_k) \cdot p(y_k \mid u_k)$
where $P(u_k)$ is calculated from the a priori information and $p(y_k \mid u_k)$ is calculated from the received bits.
Alpha: calculated by a forward recursion through the trellis based on Gamma
$\alpha_k(s) = \sum_{s'} \gamma_k(s', s) \cdot \alpha_{k-1}(s')$
Beta: calculated by a backward recursion from the end of the trellis
$\beta_{k-1}(s') = \sum_{s} \gamma_k(s', s) \cdot \beta_k(s)$
[Figure: window algorithm, showing the Gamma, Alpha and Beta computations over time, with dummy Betas used to start the backward recursion within each window.]
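The two recursions can be sketched in Python as follows. This is an illustrative probability-domain version under several assumptions: the trellis starts in a known state, it is not terminated (all final betas equal 1), there is no normalisation or windowing, and the data layout (`gammas` as a list of dictionaries keyed by state pairs) is hypothetical. Indexing is 0-based, so `gammas[k]` covers the transition from time k to time k+1.

```python
def forward_backward(gammas, n_states, start_state=0):
    """Return (alpha, beta) tables for gammas[k][(s_prev, s)] = gamma of s_prev -> s."""
    n_steps = len(gammas)
    alpha = [[0.0] * n_states for _ in range(n_steps + 1)]
    beta = [[0.0] * n_states for _ in range(n_steps + 1)]
    alpha[0][start_state] = 1.0
    beta[n_steps] = [1.0] * n_states  # assumption: unterminated trellis
    for k in range(n_steps):  # forward recursion
        for (s_prev, s), g in gammas[k].items():
            alpha[k + 1][s] += g * alpha[k][s_prev]
    for k in range(n_steps - 1, -1, -1):  # backward recursion
        for (s_prev, s), g in gammas[k].items():
            beta[k][s_prev] += g * beta[k + 1][s]
    return alpha, beta


gammas = [
    {(0, 0): 0.5, (0, 1): 0.25},
    {(0, 0): 0.5, (0, 1): 0.5, (1, 0): 0.25, (1, 1): 0.75},
]
alpha, beta = forward_backward(gammas, n_states=2)
print(alpha[2], beta[0])
```

A useful sanity check is that the sum over states of alpha times beta is the same at every time index. Practical decoders work in the log domain instead, which leads to the log-MAP and max-log-MAP forms.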
Log MAP and MAX-log MAP
Compute logarithms of alpha, beta and gamma, which means we compute $\ln(e^{\delta_1} + e^{\delta_2})$:
Log-MAP: $\ln(e^{\delta_1} + e^{\delta_2}) = \max(\delta_1, \delta_2) + f_c(|\delta_1 - \delta_2|)$
MAX-Log-MAP: $\ln(e^{\delta_1} + e^{\delta_2}) \approx \max(\delta_1, \delta_2)$
$f_c$ is the correction function (implemented as a table).
• MAX-log-MAP loses approx. 0.5 dB relative to log-MAP.
• For log-MAP, a small correction table is needed (approx. 6 non-zero values).
• The absolute difference is used as the table look-up index. We need the difference!
[Figure: BER (10^-1 down to 10^-6) for MaxlogAPP and LogAPP, horizontal axis from 2.2 to 2.9.]
Courtesy: Bing Xu: Bell Labs Australia
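The relation between the two approximations can be checked numerically with a short Python sketch. The exact correction term is $f_c(x) = \ln(1 + e^{-x})$; the table version below uses a hypothetical step of 0.5 and 8 entries, chosen only for illustration and not taken from the tutorial.

```python
import math


def max_log(d1, d2):
    """MAX-log-MAP: drop the correction term."""
    return max(d1, d2)


def max_star(d1, d2):
    """Log-MAP: exact ln(e^d1 + e^d2) as max plus correction."""
    return max(d1, d2) + math.log1p(math.exp(-abs(d1 - d2)))


STEP = 0.5  # hypothetical quantisation of |d1 - d2|
TABLE = [math.log1p(math.exp(-i * STEP)) for i in range(8)]


def max_star_lut(d1, d2):
    """Log-MAP with the correction read from a small table, zero beyond its range."""
    index = int(abs(d1 - d2) / STEP)
    correction = TABLE[index] if index < len(TABLE) else 0.0
    return max(d1, d2) + correction


exact = math.log(math.exp(1.0) + math.exp(2.5))
print(exact, max_star(1.0, 2.5), max_star_lut(1.0, 2.5), max_log(1.0, 2.5))
```

The correction is largest (ln 2) when the two metrics are equal and vanishes quickly as they separate, which is why a handful of table entries is enough.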
High Performance DSP Requirements
• Very high levels of DSP integer performance
• Support for complex real-time synchronous applications (latency, predictable throughput, synchronization)
• Large memory and I/O bandwidth
• Scalability to meet wide range of cost, power, performance
• Cost & power efficient solution
• Friendly, compiler driven, programming environment
[Figure: "Some DSP Applications", required MOPS (log scale, 10 to 100K) versus year (1997, 1999, 2001), with a "traditional DSP" trend line. Applications shown: V.34, K56, GSM term, PCS term, 16 HR GSM, ADSL 500k, ADSL 6M, DAB rcvr, 24 ch. modem, set-top box, MPEGII encode, 1G eth. xcvr, 3G wireless, soft radio, 3-D graphics?]
Compiler Driven VLIW
Instruction format: cond/branch | ex1 | ex2 | ex3 | ... | exn
[Figure: data memory and a register array connected through an interconnect to execution units ex1 (alu), ex2 (alu), ex3 (mpy), ex4 (ld/st), ..., exn (ld/st).]
• Large orthogonal register set, regular interconnect
• Atomic RISC-like operations => heavily pipelined, high frequency clock
Explicitly Parallel Instruction Computing
Execution Clusters
[Figure: data memory feeding two clusters, each with its own register array and interconnect. Cluster 1: ex1 (alu), ex2 (alu), ex3 (ld/st). Cluster 2: ex4 (alu), ex5 (mpy), ex6 (ld/st).]
Execution Sets
[Figure: a fetch set whose instructions carry marker bits (1 1 1 0 1 0 1 0 in the example) that delimit the execution sets.]
Explicitly Parallel Instruction Computing
Predication (guarded) execution: cond | any instruction
- eliminates branches
- improves compiler efficiency
- removes pipeline bubbles
- fill delayed branch slots with predicated instructions
Instruction modifiers: modifier | instr1 | instr2 | instr3 | instr4
- allows shorter instruction length
- extend register addressing
- predication
- execution set identifier
- looping
- extended operations
Texas Instruments 'C6201
[Figure: program memory (16K x 32) with 256 bit fetch into instruction dispatch & decode; register bank A (16 x 32) and register bank B (16 x 32), each with ALU, shift, mpy and add units; data memory (32K x 16).]
• 8-way VLIW with two execution clusters
• 256 bit (8x32) instruction fetch with variable length execute set
• Each 32 bit instruction individually predicated
• 11 stage pipeline
• 1600 MIPS, 400 MMACs @ 200 MHz
FIR Filter on TI 'C6x
Hand-coded assembly: 32-tap FIR filter

  loop:     ldw   .d1t1   *a4++,a5
    ||      ldw   .d2t2   *b4++,b5
    ||[b0]  sub   .s2     b0,1,b0
    ||[b0]  b     .s1     loop
    ||      mpy   .m1x    a5,b5,a6
    ||      mpyh  .m2x    a5,b5,b6
    ||      add   .l1     a7,a6,a7
    ||      add   .l2     b7,b6,b7

• Outer loop: 23 cycles, 180 bytes
  – 1 cycle in inner loop
• All 8 exec units used in inner loop - maximum efficiency
  – 2 MACs per cycle
Assembly syntax more difficult to learn. Hard to get full use of all 8 execution units at once. Software pipelining difficult to implement, and requires longer prolog/epilog (larger code size).
Courtesy: Gareth Hughes: Bell Labs Australia
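The inner loop above keeps two independent accumulators (a7 and b7) so that both multipliers can work every cycle. A small Python sketch of that dataflow, under the assumption that each 32 bit load delivers one even and one odd 16 bit sample, is shown below; the function name is hypothetical and the sketch models the arithmetic only, not the packing or the software pipeline.

```python
def fir_dual_mac(samples, coeffs):
    """Dot product computed with two accumulators, two MACs per iteration."""
    if len(samples) != len(coeffs) or len(samples) % 2:
        raise ValueError("need equal, even-length inputs")
    acc_even = 0  # corresponds to a7
    acc_odd = 0   # corresponds to b7
    for i in range(0, len(samples), 2):
        acc_even += samples[i] * coeffs[i]          # mpy  + add on side A
        acc_odd += samples[i + 1] * coeffs[i + 1]   # mpyh + add on side B
    return acc_even + acc_odd  # final combine happens outside the inner loop


print(fir_dual_mac([1, 2, 3, 4], [5, 6, 7, 8]))  # 5 + 12 + 21 + 32 = 70
```

Splitting the sum this way removes the dependency between consecutive MACs, which is what allows 2 MACs per cycle in a 1-cycle loop.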
Viterbi on TI 'C6x
• 16-state Viterbi decoder for GSM
3-cycle 2-ACS inner loop (repeated x8), from TI WWW site: ftp://ftp.ti.com/pub/tms320bbs/c62xfiles/vitgsm.asm

  LOOP:
    [b1]    b      .s1    LOOP
  ||[b1]    sub    .s2    b1,1,b1
  ||[!a2]   sth    .d1    b12,*+a6[8]
  ||[!a2]   add    .d2    b0,b14,b14
  ||        cmpgt  .l1    a11,a10,a1
  ||        cmpgt  .l2    b11,b10,b0
  ||        mpy    .m1x   1,b5,a4

    [a2]    sub    .s1    a2,1,a2
  ||[!a2]   sth    .d1    a12,*a6++
  ||[a1]    add    .s2    2,b0,b0
  ||[b0]    mpy    .m2    1,b11,b12
  ||        mpy    .m1    1,a10,a12
  ||        sub    .l2x   a7,b5,b10
  ||        ldh    .d2    *++b9,b5

            shl    .s2    b14,2,b14
  ||[a1]    mpy    .m1    1,a11,a12
  ||        add    .s1    a7,a4,a10
  ||        sub    .l1x   b13,a4,a11
  ||        add    .l2    b13,b5,b11
  ||        mpy    .m2    1,b10,b12
  ||        ldh    .d2    *b4++[2],a7
  ||        ldh    .d1    *a5++[2],b13
  ; end of LOOP

– 3 cycles per butterfly
– 32 cycles per GSM timeslot (8 butterflies)
– MPY instructions used to move data
[Figure: utilization of execution units in Viterbi decoder. A schedule of cycles 0 to 31 against the units .D1, .D2, .M1, .M2, .L1, .L2, .S1 and .S2, showing the LDH, STH, MPY, CMPGT, ADD, SUB, SHL and B operations issued in each cycle.]
Lucent / Motorola Star*Core SC140
[Figure: program / data memory, program sequencer and instruction dispatcher; data registers (16) and address registers (27); four data ALUs, each with a MAC, ALU and BFU; two AAUs.]
• 6-way VLIW with 128 bit (8x16) instruction fetch
• Prefix instructions for high performance without sacrificing code density
• Each execution set (parallel instructions + prefix) predicated
• 5 stage pipeline
• 1800 MIPS, 1200 MMACs @ 300 MHz
Viterbi on Star*Core
GSM (K=5, 16 states), inner loop repeated x4:

  [ move.2l (r0)+,d0:d1           move.2l (r1)+,d1:d2 ]
  [ add2 d0,d4    sub2 d6,d2      sub2 d4,d0    add2 d2,d6 ]
  [ max2vit d4,d2                 max2vit d0,d6 ]
  [ vsl.4w d2:d6:d1:d3,(r2)+n0    vsl.4f d2:d6:d1:d3,(r3)+n0 ]

• Decision bits are manually stored using the Viterbi Shift Left (VSL) instruction
• Hardware support for Viterbi algorithm:
  – max2vit instruction
  – vsl instruction
• 1 cycle per butterfly through software-pipelining
[Figure: max2vit d4,d2 and max2vit d0,d6 write their decisions to SR; vsl.4w d2:d6:d1:d3,(r2)+n0 takes decisions in D1 and D3 and path metrics in D2 and D6, and the results are written to memory.]
Courtesy: Gareth Hughes: Bell Labs Australia
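For readers unfamiliar with the butterfly that the add2/sub2/max2vit sequence implements, here is a loose Python sketch. It assumes the common symmetric form in which the two branch metrics of a butterfly are +branch and -branch, and it models decision storage as shifting two bits into a word. The exact semantics of max2vit and vsl are not reproduced; all names are hypothetical.

```python
def butterfly_with_decisions(old0, old1, branch, decision_word):
    """One Viterbi butterfly: two new path metrics plus two decision bits.

    Returns (new metric 0, new metric 1, updated decision word).
    """
    upper0, lower0 = old0 + branch, old1 - branch  # candidates for new state 0
    upper1, lower1 = old0 - branch, old1 + branch  # candidates for new state 1
    bit0 = 1 if lower0 > upper0 else 0
    bit1 = 1 if lower1 > upper1 else 0
    decision_word = (decision_word << 2) | (bit0 << 1) | bit1  # shift in decisions
    return max(upper0, lower0), max(upper1, lower1), decision_word


print(butterfly_with_decisions(10, 7, 2, 0b1))
```

Four additions, two compares and two selects per butterfly is the work that the Star*Core code retires in one cycle once software-pipelined.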
Log-MAP on Star*Core
Star*Core code for log-MAP butterfly (register contents after each cycle on the right):

  Cycle 1   move.w (r0)+,d0    move.w (r1)+,d1       d0: a, d1: b (d6: x)
  Cycle 2   add d0,d6,d0       sub d6,d0,d5          d0: a+x, d5: a-x
  Cycle 3   sub d6,d1,d4       add d1,d6,d1          d4: b-x, d1: b+x
  Cycle 4   sub d0,d4,d2       sub d1,d5,d3          d2: d0-d4, d3: d1-d5
  Cycle 5   max d0,d4          max d1,d5             d4: max(d0,d4), d5: max(d1,d5)
  Cycle 6   abs d2             abs d3
  Cycle 7   move.l d2,n0                             n0: |d2|
  Cycle 8   move.l d3,n0       move.w (r6+n0),d2     d2: table entry at r6+n0, n0: |d3|
  Cycle 9   add d4,d2,d4       move.w (r6+n0),d3     d4: d4+d2, d3: table entry at r6+n0
  Cycle 10  add d5,d3,d5                             d5: d5+d3
  Cycle 11  move.2w d4:d5,(r2)+

This code uses 2 of the 4 ALUs and can be software pipelined to achieve 6 cycles per log-MAP butterfly.
Courtesy: Gareth Hughes: Bell Labs Australia
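The dataflow of the listing (add/subtract, max, absolute difference, table look-up, add) can be mirrored in Python with integer metrics. This is a sketch under assumptions: the fixed-point scale of 8 units per nat and the 32-entry table are hypothetical choices, and the register names in the comments only indicate which step of the listing each line corresponds to.

```python
import math

SCALE = 8  # hypothetical fixed-point scale: 8 integer units per nat
TABLE = [round(SCALE * math.log1p(math.exp(-d / SCALE))) for d in range(32)]


def corrected_max(p, q):
    """max(p, q) plus the table correction indexed by |p - q| (zero beyond the table)."""
    diff = abs(p - q)  # abs d2 / abs d3
    correction = TABLE[diff] if diff < len(TABLE) else 0  # move.w (r6+n0),...
    return max(p, q) + correction  # max ..., then add ...


def log_map_butterfly(a, b, x):
    """Two outputs of one butterfly from metrics a, b and branch metric x."""
    out0 = corrected_max(a + x, b - x)  # d0 = a+x, d4 = b-x
    out1 = corrected_max(b + x, a - x)  # d1 = b+x, d5 = a-x
    return out0, out1


print(log_map_butterfly(5, 5, 0))
print(log_map_butterfly(100, 0, 10))
```

The table look-up through an address register is the step with no direct hardware support, which is why the butterfly takes 6 cycles here compared with 1 cycle for the Viterbi butterfly.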
Parallel DSP Architectures
  Arch.         Parallelism
  Superscalar   Dynamic instruction level
  VLIW          Static instruction level
  SIMD          Highly regular, data dependent
  MIMD          Task level
(The slide also rates each architecture under "Compile?" and "Power?".)
MIMD with VLIW / SIMD provides high order parallel execution
The future of high performance DSPs is MIMD
Daytona: A Multiprocessor DSP Architecture
[Figure: programmable processing elements (PEs) and a hardware accelerator on a split transaction bus (128 bits), together with arbitration, synchronization, and an I/O subsystem providing buffered I/O, external memory and I/O interfaces off chip.]
• Scalable architecture - multiple programmable DSPs on a single chip
• 1 bus supports different programmable DSPs and microcontrollers
Split Transaction Bus
• Separate address and data busses - each with pipelined protocol
• Multiple outstanding transactions - varying size/priority
• Separate bus arbitration
[Figure: PEs and the memory controller attached to an address bus (100 MHz) and a data bus (128 bits, 100 MHz), each with its own round-robin arbiter; addresses and data are tagged with an ID so that replies can be matched to outstanding requests.]
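Round-robin arbitration, as used for both busses, can be sketched in a few lines of Python. This is a generic illustration, not the Daytona arbiter: the request vector, the return convention and the function name are assumptions.

```python
def round_robin_grant(requests, last_granted):
    """Grant the first requester after last_granted, wrapping around.

    requests is a list of booleans, one per bus master; returns the index
    of the granted master, or None when nobody is requesting.
    """
    n = len(requests)
    for offset in range(1, n + 1):
        candidate = (last_granted + offset) % n
        if requests[candidate]:
            return candidate
    return None


print(round_robin_grant([True, False, True, True], last_granted=2))  # 3
print(round_robin_grant([True, False, True, True], last_granted=3))  # 0
```

Because the search always starts one position after the previous winner, no requester can be starved, which helps keep worst case bus latency predictable.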
Memory Hierarchy in MIMD DSPs
• Multiple copies of 1 application (e.g. odd/even slot channel equalisation): multiple copies of the same software - shared memory multiprocessing
• Mix of different applications (e.g. equalisation, convolutional decoding): heterogeneous mix of applications
Flat Memory Architecture vs. Hierarchical Memory Architecture
• Flat: each DSP has its own SRAM, so 2 copies of the software are stored (inefficient)
• Hierarchical: each DSP has a cache in front of a shared DRAM, so 1 copy of the software is stored
Shared Memory Multiprocessing
• 64 semaphores provided for process synchronization
• L-1 cache coherency using a snoopy protocol (modified MESI used)
Coherent transaction:
• Access to shared data uses a coherent transaction.
• Caches "snoop" the address and query their tag RAMs.
• A cache hit prevents the memory controller from servicing the request.
[Figure: one DSP issues an access to shared data; the other DSP caches snoop it (miss, hit, miss), and the hit is signalled to the memory controller.]
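The coherent transaction described above can be modelled very roughly in Python. This sketch deliberately ignores the MESI states, write handling and timing; caches and memory are plain dictionaries, and every name in it is hypothetical.

```python
def coherent_read(address, requester, caches, memory):
    """Serve a shared read: a snoop hit in another cache overrides main memory.

    Returns (value, source) where source is the owning cache name or "memory".
    """
    for owner, cache in caches.items():
        if owner != requester and address in cache:  # snoop: tag lookup
            return cache[address], owner             # hit: cache supplies the data
    return memory[address], "memory"                 # all misses: memory controller


caches = {"dsp0": {}, "dsp1": {0x40: 7}, "dsp2": {}}
memory = {0x40: 3, 0x80: 9}
print(coherent_read(0x40, "dsp0", caches, memory))  # served by dsp1's cache
print(coherent_read(0x80, "dsp0", caches, memory))  # served by memory
```

In the example the copy in memory at 0x40 is stale, so letting the snoop hit suppress the memory controller is what keeps the requester from reading old data.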
Daytona Multiprocessor DSP Chip
Bell Laboratories research chip for 3G wireless base-stations / head-end xDSL (Paper 4.2, ISSCC2000)
[Figure: four processing elements, each a 32-b RISC with a 64-b 4-MAC SIMD DSP and cache memory, on a 128-b split transaction bus with the host interface, I/O & memory controller, arbiter, semaphore, and test & JTAG port.]
Chip characteristics:
  Core area   120 mm^2
  Speed       100 MHz
  Power       4 W
  Tech        0.25 um
Photomicrograph of Daytona Test Chip
[Figure: die photo with labelled blocks: Processing Elements (PE), I/O Subsystem, Split Transaction Bus, Arbiter, Semph, DLL, and within a PE: SPARC, Vector Unit (RVU), 8KB Re-configurable Memory, LRU, BUS INT, HDS.]
Paper 4.2, ISSCC2000
Acknowledgements
The following people contributed to the work in this tutorial:
Low Power DSPs for Wireless
• Wanda Gass: Texas Instruments
• Mihran Touriguian: Atmel
High Performance DSPs for Wireless Infrastructure
• Bryan Ackland: Bell Labs US - High Perf. DSP Architecture
• Gareth Hughes: Bell Labs Australia - LU DSP16210, 'C6x and Starcore benchmarks
• Bing Xu: Bell Labs Australia - SOVA, MAP, LOG-MAP
• Ran-Hong Yan: Bell Labs UK - 3G Wireless
• Daytona Team (J. Williams, K.J. Singh, J. Othmer, B. Ackland): Bell Labs US
References
[1] P. Lapsley, J. Bier, A. Shoham, E. Lee, "DSP Processor Fundamentals," IEEE Press, New York, 1997.
[2] D. Skillicorn, "A Taxonomy for Computer Architectures," Computer Magazine, Nov. 1988.
[3] H. Kabuo, M. Okamoto, I. Tanaka, H. Yasoshima, S. Marui, M. Yamasaki, T. Sugimura, K. Ueda, T. Ishikawa, H. Suzuki, R. Asahi, "An 80 MOPS-Peak High-Speed and Low-Power-Consumption 16-b Digital Signal Processor," IEEE Journal of Solid-State Circuits, Vol. 31, No. 4, pp. 494-503, April 1996.
[4] E. A. Lee, D. G. Messerschmitt, Digital Communication, Kluwer Academic Publishers, Boston, 1988.
[5] W. Lee et al., "A 1V DSP for Wireless Communications," Proceedings IEEE International Solid-State Circuits Conference, pp. 92-93, February 1997.
[6] S. Lin, D. J. Costello Jr., Error Control Coding: Fundamentals and Applications, Prentice Hall, New Jersey, 1983.
[7] Lucent 16000, http://www.lucent.com/micro/ or http://www.lucent.dk/micro/dsp16000/
[8] Thomas Parsons, Voice and Speech Processing, McGraw-Hill Book Company, New York, 1987.
[9] TMS320C54x User's Guide, available from the Texas Instruments Literature Response Center.
[10] I. Verbauwhede, M. Touriguian, "A Low Power DSP Engine for Wireless Communications," Journal of VLSI Signal Processing 18, pp. 177-186, 1998, Kluwer Academic Publishers.
[11] I. Verbauwhede, M. Touriguian, "Wireless digital signal processors," chapter in Digital Signal Processing for Multimedia Systems, edited by K.K. Parhi, T. Nishitani, Marcel Dekker, New York, 1999.
[12] M. Okamoto, K. Stone, T. Sawai, H. Kabuo, S. Marui, M. Yamasaki, Y. Uto, Y. Sugisawa, Y. Sasagawa, T. Ishikawa, H. Suzuki, N. Minamida, R. Yamanaka, K. Ueda, "A High Performance DSP Architecture for Next Generation Mobile Phone Systems," 1998 IEEE DSP Workshop.
[13] Lode specifications, available from www.atmel.com
[14] M.W. Oliphant, "The Mobile Phone meets the Internet," IEEE Spectrum, pp. 20-28, Aug. 1999.
[15] L. C. Godara, "Application of Antenna Arrays to Mobile Communications: Part 1," Proc. IEEE, Vol. 85, No. 7, pp. 1031-1060, July 1997.
[16] G. D. Forney, Jr., "Maximum Likelihood Sequence Estimation of Digital Sequences in the Presence of Intersymbol Interference," IEEE Trans. Inform. Theory, Vol. IT-18, pp. 363-378, May 1972.
[17] C. Berrou, A. Glavieux, P. Thitimajshima, "Near Shannon Limit Error-Correcting Coding and Decoding: Turbo-Codes (1)," Proc. ICC'93, May 1993.
[18] J. Hagenauer, P. Hoeher, "A Viterbi Algorithm with Soft-Decision Outputs and its Applications," Proc. Globecom 89, pp. 47.1.1-47.1.7, Nov. 1989.
[19] L. Bahl, J. Cocke, F. Jelinek, J. Raviv, "Optimal Decoding of Linear Codes for Minimizing Symbol Error Rate," IEEE Trans. Inform. Theory, Vol. IT-20, pp. 284-287, Mar. 1974.
[20] J. Turley, H. Hakkarainen, "TI's new 'C6x DSP Screams at 1600 MIPS," Microprocessor Report, Vol. 11, No. 2, pp. 14, Feb. 1997.
[21] "Starcore Launched First Architecture," Microprocessor Report, Vol. 12, No. 14, pp. 22, Oct. 1998.
[22] B. Ackland, P. D'Arcy, "A New Generation of DSP Architectures," Proc. IEEE CICC99, Paper 25.1.1.
[23] J. Williams, K.J. Singh, C.J. Nicol, B. Ackland, "A 3.2 GOPs Multiprocessor DSP for Communication Applications," Proc. IEEE ISSCC2000, Paper 4.2.