Hardware Support to Reduce Overhead in Fine-Grain Media Codes

Deependra Talla, Lizy K. John, and Doug Burger
Technical Report
Laboratory for Computer Architecture
The University of Texas at Austin

Abstract

The growing importance of media and media-like codes has caused general-purpose processors to incorporate SIMD-like extensions, such as MMX, SSE, and AltiVec. While these media extensions do improve performance, significant parallelism in these codes remains unexploited. In this paper, we propose and evaluate a programmable loop engine (PLE for short) that executes media codes and kernels efficiently by moving most of the overhead associated with the media programs into hardware. We study a range of media codes to determine which features and characteristics of these programs are sufficiently frequent to merit implementation in hardware. The common classes of operations that we find are multiple nested loop control, address generation, data transformations, and streaming memory operations. We quantify the fraction of instructions that can be eliminated for our benchmarks, showing that it is over 65% on average. We evaluate a design of one PLE, showing how it can be integrated cleanly with a general-purpose pipeline, defining an instruction set interface, and quantifying its area and power overhead with a VHDL implementation. We find that the PLE requires a mere 10% of the area required by the MMX and SSE extensions. Furthermore, the PLE improves performance of a set of media kernels, over that of a 4-way SIMD processor, by more than 7X on average, and accelerates a set of five media applications by an average of 54%. For all kernels and 3 of 5 applications, a 4-way processor with the PLE outperforms an 8-way SIMD processor without the PLE. For all the kernels and 2 of the 5 applications, a 4-way processor enhanced with the PLE outperforms even a 16-way processor.
This research was supported in part by a State of Texas Advanced Technology program grant. L. K. John is also partially supported by the National Science Foundation under grants EIA-9807112 and ECS-0113105 and by Dell, Intel, Microsoft, Tivoli, Motorola and IBM Corporations. D. Burger is supported by a grant from the Intel Research Council, an NSF CAREER Award, an IBM University Partnership Award, and a Sloan Foundation Fellowship.

1 Introduction

The growing importance of audio and video applications has resulted in media-specific support in most major instruction set architectures. Since general-purpose processors are not well suited to executing codes with copious fine-grain parallelism, media extensions permit a higher rate of execution within a mostly unmodified general-purpose pipeline. Examples of media extensions are Intel's MMX, SSE1, and SSE2, Sun's VIS, HP's MAX, Compaq's MVI, MIPS's MDMX, and Motorola's AltiVec [1][2]. Such SIMD (single instruction multiple data) extensions have been implemented not only in commercial general-purpose processors, but also in DSP processors such as the TMS320C64 processor from Texas Instruments [3] and the TigerSharc processor from Analog Devices [4].

For many codes, however, the overhead associated with the loops themselves limits the parallelism that can be achieved through SIMD-like media extensions. In this paper, we focus on using hardware support to reduce the time spent on executing the instructions that support the SIMD instructions, rather than focusing on accelerating the computational, fine-grain parallel instructions themselves.

We first analyze a range of media workloads to determine the amount of serial overhead in these applications. We find that more than 75% of instructions that are executed in the media instruction stream are not the computation instructions, but are instructions that are managing data and control for the computation. Amdahl's law dictates that these overhead instructions will limit the fine-grain parallelism that may be extracted from these media codes. We show an example of this phenomenon in Table 1, which shows execution time of a one-dimensional discrete cosine transform (1-D DCT) kernel that is used in the JPEG image standard and the MPEG video standard. The 1-D DCT is essentially a multiply of two 8x8 matrices. In a non-MMX implementation, if all of the compute units were fully utilized, the multiplication would require 512 cycles. With MMX, the full utilization of the functional units would complete the DCT in 128 cycles. We ran this code on a Pentium III processor both with and without prefetching, and with and without MMX. We see that while prefetching and MMX both provide performance boosts, the level of parallelism that is exploited is far from ideal. Without MMX, a potential five-fold speedup remains even after prefetching, and with MMX, more than 13-fold improvement in performance is possible, if the computations were to be executed at the peak computation rate.
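The ideal cycle counts quoted for the 8x8 matrix multiply follow from counting multiply-accumulate operations. The sketch below is only a rough illustration: it assumes one multiply-accumulate per lane per cycle and four 16-bit lanes in a 64-bit MMX register. The function name and this one-operation-per-cycle model are our illustrative assumptions, not the method used to obtain the measured numbers in Table 1.

```python
def ideal_matmul_cycles(n: int, simd_lanes: int = 1) -> int:
    """Cycles for an n-by-n matrix multiply at full utilization.

    Assumes (hypothetically) one multiply-accumulate per lane per cycle.
    """
    macs = n * n * n  # n*n outputs, each a dot product of length n
    # Ceiling division, in case the lane count does not divide the work.
    return -(-macs // simd_lanes)


scalar_cycles = ideal_matmul_cycles(8)             # no SIMD lanes
mmx_cycles = ideal_matmul_cycles(8, simd_lanes=4)  # 4 x 16-bit lanes
print(scalar_cycles, mmx_cycles)
```

Under these assumptions the two values correspond to the full-utilization figures of 512 and 128 cycles cited in the text.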
Since we are studying media workloads running on general-purpose processors, any added hardware support should support efficient execution of a broad range of media loops. The workload analysis we present in this paper breaks overhead instructions and necessary supporting mechanisms down into four major classes, which cover the vast majority of overhead instructions contained in these media loops: loop processing, address generation, data transformations, and streaming memory accesses. Loop processing involves maintaining the loop indices and issuing the branches that control the loops. Address generation consists of computing addresses for multi-dimensional arrays with various constant strides for each dimensional index. Data transformations reorder data accessed from memory for fast consumption by the SIMD instructions. The major transformations that we found include packing, unpacking, and multicasting of operands.

We evaluate a hardware design called a Programmable Loop Engine (PLE) intended to reduce the time spent executing general media loops. It is a hardware unit that takes a single 140-byte command. It can then execute up to five nested loops, accessing up to four streams of data (three input and one output) with different strides, transforming the data in ways required by the computation algorithm and writing the result stream back. The PLE contains hardware looping, hardware address generation and address transformation, and fetch mechanisms to stream data in and out. The PLE can be cleanly integrated with a superscalar pipeline, requiring minimal control to initiate and terminate hardware-supported loop execution. The single macro-instruction needs to be fetched and decoded only once per nested loop structure, resulting in a substantial reduction of fetch/decode activity and corresponding power savings.
Table 1. 1-D DCT without and with MMX on a Pentium III processor

                                Maximum compiler   Perfect memory        Full compute
                                optimizations      system/prefetching    utilization
Pentium III            Cycles   3500               2737                  ~512
                       IPC      1.47               1.88                  -
Pentium III with MMX   Cycles   2375               1578                  ~128
                       IPC      1.04               1.56                  -
In Section 2 of this paper, we analyze a suite of media applications and kernels to quantify the loop overhead required to support media computation. We decompose that overhead into discrete classes, with the goal of designing a single, simple hardware engine that can provide broad coverage of the applications' media loops. In Section 3, we describe our PLE design, and show how it is programmed by a single large instruction. We also describe how that instruction can be incorporated into a conventional out-of-order processor pipeline. In Section 4, we measure the performance improvements that the PLE provides over a pipeline that already uses SIMD ISA extensions. For a 4-way baseline, we show speedups of 13x, 11x, 8x and 2.3x on the four kernels we evaluate, and performance improvements of 3.5x, 8%, 94%, 61%, and 1% on the five applications we measure. Similar or higher speedups are obtained with 8-way and 16-way baseline configurations. In Section 5 we quantify the area and power overhead of the proposed PLE by analyzing its VHDL, and we validate the timing assumptions that we used in our simulation environment. In Section 6, we discuss some of the copious related work, and we conclude the paper in Section 7.
2 Analyzing Loop Overheads in Media Code

We list the nine benchmarks that we used to study the characteristics of media code in Table 2. We include the dynamic instruction counts to show that the counts are sufficiently high to amortize cold-start effects in the simulation environment. Our suite includes several common media applications (g711, aud, jpeg, ijpeg, and decrypt) and four common kernels (cfa, dct, motest, and scale). The kernels used are major components in image and video processing standards such as JPEG, MPEG, H.263, etc. Several of these benchmarks are also part of media benchmark suites such as MediaBench [10]. For those benchmarks that are not included in the MediaBench suite, we have made the source code available on-line [9].

Table 2. Description of the multimedia benchmarks

Benchmark   Instruction count   Description
cfa         349 million         Color filter array interpolation of a 2 million pixel image with a 5x5 filter (16-bit data)
dct         160 million         2-D discrete cosine transform of a 2 million pixel image (16-bit data)
motest      136 million         Motion estimation routine on a frame of 2 million pixels (8-bit data)
scale       3 million           Linear scaling of an image of 2 million pixels (8-bit data)
g711        63 million          G.711 speech coding standard (A-law to u-law and vice-versa) conversions on 2 million audio samples (8-bit data)
aud         283 million         Audio effects on 2 million audio samples (echo, signal mixing and filtering) (16-bit data)
jpeg        208 million         JPEG image compression on an 800-by-600 pixel image
ijpeg       136 million         JPEG image de-compression resulting in an 800-by-600 pixel image
decrypt     125 million         IDEA decryption on 192,000 bytes of data
To provide hardware support that can accelerate a wide range of media codes with a simple design, we must first understand what is common across these media codes. An analysis of our benchmark suite led to the following four categories of "computational support" operations: (1) loop processing, (2) address generation, (3) data transformation, and (4) streaming memory accesses. To show an example of the relative frequencies of these categories in a media loop, we show the assembly code of a one-dimensional discrete cosine transform (1-D DCT) routine in Fig. 1. This assembly code is for a Pentium III processor based on the P6 microarchitecture [17], and was compiled using maximum compiler optimizations (including loop unrolling) and intrinsics by Intel C/C++ compiler version 4.5. In the assembly code, we have marked the true computation instructions in boldface. These instructions perform the computational essence of the 1-D DCT, which are the multiply and the accumulate operations. The rest of the instructions in the loop are overhead/supporting instructions, such as the instructions necessary to perform address computations or to accomplish multiple levels of looping as required by the algorithm. We have broken them down into the four classes of overhead as defined in the previous paragraph. We note that some instructions may fall into more than one of our defined categories, since an operation on a register may be used for both loop management and address calculation. It is clear that, in DCT at least, the majority of the instructions in the loop are overhead instructions. The high fraction of these instructions in the loop is partly due to the programming conventions of general-purpose processors, abstractions and control flow structures used in programming, and mismatch between how data is used in computations versus the sequence in which data is stored in memory.

Pentium III assembly instruction (MMX code)     Comment                             Category of overhead
        lea     ebx, DWORD PTR [ebp+128]        ; load/address overhead             2
        mov     DWORD PTR [esp+28], ebx         ; load/address overhead             2
$B1$2:  xor     eax, eax                        ; address overhead                  2
        mov     edx, ecx                        ; address overhead                  2
        lea     edi, DWORD PTR [ecx+16]         ; load/address overhead             2
        mov     DWORD PTR [esp+24], ecx         ; load/address overhead             2
$B1$3:  movq    mm1, MMWORD PTR [ebp]           ; load overhead                     4
        pxor    mm0, mm0                        ; initialization overhead           3
        pmaddwd mm1, MMWORD PTR [eax+esi]       ; true data parallel computation    Not overhead
        movq    mm2, MMWORD PTR [ebp+8]         ; load overhead                     4
        pmaddwd mm2, MMWORD PTR [eax+esi+8]     ; true data parallel computation    Not overhead
        add     eax, 16                         ; address overhead                  2
        paddw   mm1, mm0                        ; true data parallel computation    Not overhead
        paddw   mm2, mm1                        ; true data parallel computation    Not overhead
        movq    mm0, mm2                        ; load related overhead             4
        psrlq   mm2, 32                         ; SIMD reduction overhead           3
        movd    ecx, mm0                        ; SIMD load overhead                4
        movd    ebx, mm2                        ; SIMD load overhead                4
        add     ecx, ebx                        ; SIMD conv. overhead               3
        mov     WORD PTR [edx], cx              ; store overhead                    4
        add     edx, 2                          ; loop overhead                     1
        cmp     edi, edx                        ; branch related overhead           1
        jg      $B1$3                           ; loop branch overhead              1
$B1$4:  mov     ecx, DWORD PTR [esp+24]         ; load/address overhead             2
        add     ebp, 16                         ; loop/address overhead             1/2
        add     ecx, 16                         ; address overhead                  2
        mov     eax, DWORD PTR [esp+28]         ; load/address overhead             2
        cmp     eax, ebp                        ; branch related overhead           1
        jg      $B1$2                           ; loop branch overhead              1

Fig. 1. Optimized assembly code for the 1-D DCT routine (essentially an 8x8 matrix multiply).

The major component of discrete cosine transform is an 8x8 matrix multiply. In this example, we have not included the instructions necessary for transposing the second matrix, which would have increased the supporting overhead instructions further.

If the SIMD units are widened, some of the loop instructions can perform more operations per instruction. In Fig. 1, however, only the instructions in bold can benefit from wider arithmetic units. As one sees, the vast majority of instructions in the instruction stream are just supporting the core computation instructions. Hence it is essential to improve their performance if the overall performance is to get better. It is clear from Amdahl's law that, given the fraction of overhead in the loop, wider SIMD units will provide only incremental performance benefits.
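The Amdahl's law argument can be made concrete with a small calculation. The sketch below is illustrative only. It assumes that 75% of the work is overhead, in line with the fraction quoted in the introduction, that the overhead fraction of the dynamic instruction stream can stand in for the fraction of execution time, and that only the remaining computation scales with SIMD width. The function name is hypothetical.

```python
def amdahl_speedup(overhead_fraction: float, compute_speedup: float) -> float:
    """Overall speedup when only the non-overhead part is accelerated."""
    compute_fraction = 1.0 - overhead_fraction
    return 1.0 / (overhead_fraction + compute_fraction / compute_speedup)


# Assumed overhead fraction of 0.75; SIMD units made 2x, 4x and 1000x wider.
for widening in (2, 4, 1000):
    print(widening, round(amdahl_speedup(0.75, widening), 3))
```

Under these assumptions, even arbitrarily wide SIMD units cannot deliver more than 1/0.75, or about 1.33x, which is why the overhead instructions themselves must be attacked.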
We showed the 1-D DCT example to illustrate the breakdown of computation versus overhead. In Fig. 2, we show a similar breakdown for six of the nine benchmarks we studied. We did not include jpeg, ijpeg, and decrypt in this experiment because the source code for these three benchmarks includes initialization routines and file I/O. We ran the six benchmarks on a Pentium III processor with MMX support. In five of the benchmarks (g711 was the exception), all core computation was executed using SIMD instructions. In Fig. 2 we decomposed the execution into the four defined classes, except that we were unable to separate address and loop arithmetic. For these codes, the overhead and supporting instructions account for 75%-85% of the dynamic instruction stream; the true computation instructions are a small percentage.

Fig. 2. Breakdown of dynamic instructions into various classes (streaming memory, address transformation, loop branches, address/loop arithmetic, and true computation) for cfa, dct, scale, motest, aud, and g711; the vertical axis runs from 0% to 100%.
2.1 Sources of overhead instructions

In this subsection we discuss the reasons that such a high fraction of these media loops are overhead. Essentially, media applications use nested loops and use multiple strides. Current general-purpose processors (GPPs) have little support to handle multiple loops or the abundant address generation efficiently. Hardware to generate multiple address sequences is not overly complicated, but current ISAs require a large number of instructions to produce them, as the available addressing modes are limited. Furthermore, there is not enough support for keeping track of multiple indices/strides efficiently in GPPs. Keeping track of multiple loop nests/bounds involves a combination of several addressing modes and instructions. Thus, even though GPPs are enhanced with SIMD extensions to extract data-level parallelism in multimedia programs, there is a mismatch between the requirements of media applications (for address generation and nested loops) and the capabilities of GPPs to support the nested loops, memory operations, and address generation efficiently. We elaborate on the loop management and address generation overhead below.

Loop management: Desktop/workstation multimedia applications such as streaming video encoding/decoding (MPEG1/2/4 and Motion JPEG), audio encoding/decoding (ADPCM, G.711, MP3, etc), video conferencing (H.323, H.261, etc), 3D games, and image processing (JPEG, filtering) typically operate on sub-blocks in a large 1- or 2-dimensional block of data. Audio applications operate on chunks of one-dimensional data samples at a time. Image and video applications operate on sub-blocks of two-dimensional data at a time. For example, the DCT algorithm operates on 8x8 segments of data in images of sizes like 1600x1200 pixels. These algorithms typically use multiple nested loops with statically known loop boundaries to process the sub-blocks and the data streams. To determine the depth of loop nests, we analyzed our benchmark suite and a number of other common media codes, and show the results in Table 3.
Table 3. Loop nest depth and common addressing sequences in key media applications

Multimedia/DSP algorithm                              Nested loops   Addressing sequences
Discrete Cosine Transform (JPEG & MPEG coding)        5              Sequential and sequential with multiple offsets/strides
Motion Est./Comp. (MPEG, H.263, etc)                  5              Sequential and sequential with multiple offsets/strides
Wavelet Transform (JPEG2000)                          >5             Sequential and sequential with multiple offsets/strides
Color Space Conversion (JPEG, MPEG, 3D graphics)      >4             Sequential, sequential with offsets, and shuffled
Scaling and matrix operations (image/video)           3              Sequential and sequential with multiple offsets/strides
Fast Fourier transform                                >3             Shuffled and bit-reversed
Color Filter Array, median filtering, correlation     2-5            Sequential and sequential with multiple offsets/strides
Convolution, FIR, and IIR filtering                   3-4            Sequential, sequential with offsets, and reflected
Edge detection, alpha saturation (image/video)        2-5            Sequential and sequential with multiple offsets/strides
Up/Down sampling, 3-D transformation (graphics)       3-5            Sequential and sequential with multiple offsets/strides
Quantization (JPEG, MPEG)                             2-4            Sequential and sequential with multiple offsets/strides
ADPCM, G.711 (speech)                                 2-3            Sequential and sequential with multiple offsets/strides
Address generation and stride management: The division of data into sub-blocks results in the data being accessed with different strides at various instances in the algorithm. Managing multiple strides results in numerous software instructions. In general, addressing sequences in media programs may be classified into sequences shown in Fig. 3. We have included the prevalent addressing sequences in the analyzed media applications in Table 3. By supporting sequential accesses, multiple strided accesses, reflected and shuffled data transformations in hardware, we can cover the bulk of the addressing and data streaming across this spectrum of media codes.
2.2 Effect of prevalent overhead instructions

The net effect of the overhead instructions is that the SIMD execution units are often idle. In this section, we evaluate what utilization of the SIMD execution units is achieved for a subset of our media applications. Peak throughput can be achieved if the required operands are available as soon as the SIMD execution units are available. In Table 4, we show how well the SIMD units are utilized on the Pentium III baseline with MMX. The utilization achieved by the media extensions is low (1-12%), despite the abundance of data-level parallelism in these codes, because the supporting units are not able to feed the right data, in the right packed form, at the required rate. The results of this section show both that the properties of the overhead instructions are common across a broad range of media applications, and that significant potential for improved performance exists due to the low utilization of the SIMD units. In the next section, we describe the Programmable Loop Engine (PLE), which implements those common overhead operations in hardware, reducing the serial overhead and allowing a greater utilization of the SIMD units.

Table 4. Execution statistics and utilization of media programs (Pentium III, MMX & SSE)

Benchmark   Inst. count    Cycle count    Actual fraction of peak throughput
cfa         404,290,544    231,616,932    5.16 %
dct         188,798,806    123,944,326    6.2 %
scale       2,170,274      20,756,929     2.31 %
motest      156,734,613    113,623,185    3.38 %
aud         220,320,505    150,386,375    11.97 %
g711        59,066,806     64,006,729     1.12 %

Given a sequence of length L, if $A_m$ is address m in the range $0 \le m \le L-1$, most multimedia and DSP kernels can be considered to be composed of primitive addressing sequences such as the following:
(i) Sequential addressing: $A_0, A_1, A_2, \ldots, A_{N-1}$
(ii) Sequential with offset (k)/stride addressing: $A_{0+k}, A_{1+k}, A_{2+k}, \ldots, A_{N-1+k}$
(iii) Shuffled addressing (base r, N/r = p): $A_0, A_p, A_{2p}, \ldots, A_1, A_{p+1}, A_{2p+1}, \ldots, A_2, A_{p+2}, A_{2p+2}, \ldots, A_{N-1}$
(iv) Bit-reversed addressing (e.g. N = 8): $A_0, A_4, A_2, A_6, A_1, A_5, A_3, A_7$
(v) Reflected addressing: $A_0, A_{N-1}, A_1, A_{N-2}, \ldots, A_m, A_{N-1-m}, \ldots, A_{N/2-1}, A_{N/2}$

Fig. 3. Typical access patterns in multimedia and DSP kernels [13]
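The primitive addressing sequences of Fig. 3 are easy to state as index generators. The following is a minimal software sketch, not a description of the PLE hardware. The function names are ours, indices stand in for the addresses $A_m$, and the shuffled pattern assumes that p divides N.

```python
def sequential(n, offset=0):
    """(i)/(ii): 0+k, 1+k, ..., n-1+k for offset k (k = 0 gives plain sequential)."""
    return [m + offset for m in range(n)]


def shuffled(n, p):
    """(iii): 0, p, 2p, ..., then 1, p+1, 2p+1, ..., ending at n-1."""
    if p <= 0 or n % p != 0:
        raise ValueError("p must be positive and divide n")
    return [start + step * p for start in range(p) for step in range(n // p)]


def bit_reversed(n):
    """(iv): each index with its log2(n)-bit representation reversed."""
    bits = n.bit_length() - 1
    if n < 1 or (1 << bits) != n:
        raise ValueError("n must be a power of two")
    order = []
    for m in range(n):
        rev = 0
        for b in range(bits):
            if m & (1 << b):
                rev |= 1 << (bits - 1 - b)
        order.append(rev)
    return order


def reflected(n):
    """(v): 0, n-1, 1, n-2, ..., pairing index m with n-1-m."""
    order = []
    for m in range(n // 2):
        order += [m, n - 1 - m]
    if n % 2:
        order.append(n // 2)  # middle element when n is odd
    return order


print(bit_reversed(8))
print(reflected(8))
```

For N = 8 the bit-reversed generator reproduces the order listed in item (iv) of Fig. 3.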
3 Hardware Support for Efficient Media Processing

The media application characterization presented in the previous sections, especially Fig. 2, presents a compelling case for providing hardware support for multiple levels of looping and address generation. Such facilities, common in DSP processors, can be integrated in a cost-effective fashion into general-purpose cores. While a variety of implementations are possible, we present an example architecture to study the effectiveness of the proposed hardware acceleration mechanisms in GPPs. We refer to this as the PLE (programmable loop engine) architecture. Fig. 4 illustrates the proposed scheme. The shaded units are the new additions. Essentially, we add a new instruction (called Programmable Loop or PL instruction) equivalent to a multidimensional vector instruction, which indicates the multiple nested loops that are required in the computation, the different address strides, loop bounds, etc. Once the PL instruction is encountered, program execution proceeds in the shaded region without using the fetch, decode, rename and issue blocks of the original processor. The details of the added instruction and the added units are provided below. While we have made some choices regarding the number of levels of looping or simultaneous data streams, our objective is only to prove the effectiveness of the general concept of providing hardware support for these functionalities. Simple ASICs can perform these tasks efficiently; however, loss of programmability and flexibility is a weakness of that approach, and hence we favor a programmable processor. But the point is that programmability does not need to extend to every aspect of the functionality. Programmability can be limited for these structured media applications, to improve efficiency.

Fig. 4. A superscalar processor enhanced with the PLE architecture. Blocks shown: Fetch, Decode, Rename, Issue, Read Registers, Execute, Memory Access, Writeback, Hardware Loop Instr. Mem. & Dec., Hardware Looping, Address Generation, Data Reorganization.

The new hardware units (shaded blocks in Fig. 4) are the address generation units, hardware looping, and PL instruction memory & decoder. Their functions are described below:

• Loop processing: To eliminate branch instruction overhead, PLE employs zero-overhead branch processing using dedicated hardware loop control and supports up to five levels of loop nesting. We chose to have five levels since that would be sufficient to handle the most common algorithms and routines used in media processing. All branches and instructions related to loop increments are handled by this technique. This approach is fairly simple and straightforward to implement and has been implemented in many conventional DSP processors such as the Motorola 56000 and TMS320C5x from Texas Instruments [18].

• Address calculation: The current PLE allows for three input data structures/streams and produces one output structure. The choice was made because many media algorithms can benefit from this capability (current SIMD execution units sometimes operate on three input registers to produce one output value). A dedicated hardware unit performs the address arithmetic, generating all input and output address streams/data structures concurrently with the computations.

• Data Reorganization: In many algorithms, the logical access sequence of data is vastly different from the physical storage pattern. Various permute operations, including pack and unpack instructions, are used. For example, the first element in eight columns of a matrix needs to be packed into a single row (or SIMD register). Similarly, at times, a single element (16-bits wide) needs to be broadcast into all the four sub-words of a SIMD register (64-bits wide). PLE efficiently handles the task of reordering data with explicit hardware support. As an example of operations supported, there is multicasting of data into selected registers in programmable patterns. Multicasting eliminates the need for transposing data structures, allows for reordering of the computations, and increases reuse of data items soon after fetch by exploiting DLP in outer level loops. Multicasting means copying one/many data items into several registers or buffers at the same time. For example, a data value A may be copied into 8 registers (or 8 sections of a big SIMD register) resulting in a pattern A,A,A,A,A,A,A,A or two items A and B may be copied to 8 registers in the pattern A,A,B,B,A,A,B,B or A,B,A,B,A,B,A,B or another such pattern.

• Hardware loop instruction storage and decoding: In order to program/control the hardware units in the PLE architecture, a special instruction called the PL instruction is formulated. The PL instruction is equivalent to a multidimensional vector instruction. The PL instruction memory stores these instructions once they enter the processor.
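The multicast patterns mentioned under data reorganization can be modelled in a few lines of software. This is a hedged sketch of the resulting lane contents only. The function `multicast` and its `repeat` parameter are hypothetical names introduced for illustration and do not correspond to any field of the PL instruction.

```python
def multicast(items, lanes, repeat=1):
    """Fill `lanes` slots by cycling through `items`, each repeated `repeat` times."""
    if lanes <= 0 or repeat <= 0 or not items:
        raise ValueError("need at least one item, one lane, and repeat >= 1")
    return [items[(lane // repeat) % len(items)] for lane in range(lanes)]


print(multicast(["A"], 8))                 # one value broadcast to all lanes
print(multicast(["A", "B"], 8, repeat=2))  # pairs: A,A,B,B,...
print(multicast(["A", "B"], 8))            # interleaved: A,B,A,B,...
```

The three calls correspond to the three example patterns given in the text for one item and for two items copied into eight registers.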
3.1 PLE implementation details

The major hardware additions, the looping circuitry and address generation circuitry, are described below:

Loop support: Figure 5 illustrates the looping circuitry. Loop index values are produced every clock cycle based on the loop bound for each level of nesting (bounds for each of the five loops are specified in the PL instruction). The value of a loop index varies from 1 (lower bound) to the corresponding loop bound (upper bound), and resets to its lower bound once the upper bound is reached in the previous cycle. The execution of the PL instruction ends when the outermost loop (loop1 in Fig. 5) reaches its upper bound. On encountering either an exception or a stall, the loop indices are stored and the increment logic is halted; the counting process is started once the exception/stall is serviced. Each of the five comparators (32-bit wide) operates in parallel to generate flag (1-bit wide) signals that are priority encoded to determine which one of the five loop counters to increment. When a loop counter is incremented-by-1 (circuit for incrementing a 32-bit value by 1), all the loop counters belonging to its inner level are reset (for example, if loop3 is incremented-by-1, then loop4 and loop5 are reset to their lower bound).

Fig. 5. Block diagram of the hardware looping circuitry. Signals and blocks shown: Loop(1-5)-count, index-(1-5), comparator-(1-5), flag-(1-5), priority encoder, End-of-all-loops, incL1 to incL4, Increment-by-1.

Address generation: The PLE architecture supports three input and one output data structures/streams. Each of the four data streams has a dedicated address generation hardware unit. Address arithmetic on each stream is performed based on the strides and mask values indicated in the PL instruction. For each clock cycle, depending on the mask bits and loop index counts, one of the five possible strides is selected. The new address value is then computed based on the selected stride and the previous address value. Fig. 6 depicts the block diagram of the address generation circuitry for a single data stream/structure. The last_val comparators determine which of the four inner level loop counters have reached their upper bound. The outermost loop comparison is not necessary because the PL instruction finishes execution at the instant when the outermost loop counter reaches its upper bound. The inc-cond and inc-combine blocks generate flag signals based on the output from the last_val comparators and mask values from the PL instruction. If none of the flag signals are true, then stride-5 is used to update the prev-address; otherwise, the appropriate stride-(1-4) is selected depending on flag-(1-4). The address-generate block uses a 32-bit adder to add the selected stride to the previous address. On either an exception or a stall, only the prev-address value needs to be stored, as the loop counters are stored by the hardware looping circuitry. For each of the four data structures/streams, the last_val comparators portion of the logic is shared, but the remaining hardware needs to be replicated.

Loop instruction decoder: A stand-alone instruction decoder for the PL instructions eliminates the need to modify the conventional instruction decoder of current GPPs. A PL instruction needs to be decoded only once since various control parameters are stored in hardware registers after the decoding process. The implementation of the PL instruction decoder was merged into the address generation and looping circuitry.

Fig. 6. Block diagram of address generation hardware (per data stream). Signals and blocks shown: Loop(2-5)-count, index-(2-5), last_val comparators, lastval-(2-5), mask-(1-4), inc-cond(1-4), inc-combine(1-4), flag-(1-4), stride-(1-5), address-generate, prev-address, updated-address.
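The behaviour of the looping and address generation circuitry can be approximated by a small software model. The sketch below is a simplification under stated assumptions, not the VHDL design. Loop counters are 1-based as described above, the counter to increment is the innermost one that has not yet reached its bound, and iteration stops when every counter is at its bound (our reading of the End-of-all-loops signal). The stride added on each step is the stride of the level that was incremented, gated by a per-level mask. The function names and the list-based interface are hypothetical.

```python
def step_loops(indices, counts):
    """Advance a nest of 1-based loop counters by one iteration.

    indices[0] is the outermost loop. Returns the 0-based level that was
    incremented, or None when every counter is already at its bound.
    """
    for level in reversed(range(len(counts))):
        if indices[level] < counts[level]:    # flag not raised: increment here
            indices[level] += 1
            for inner in range(level + 1, len(counts)):
                indices[inner] = 1            # reset all inner-level counters
            return level
    return None


def address_stream(start, counts, strides, masks=None):
    """Yield one address per innermost iteration for a single data stream."""
    if masks is None:
        masks = [True] * len(counts)
    indices = [1] * len(counts)
    address = start
    while True:
        yield address
        level = step_loops(indices, counts)
        if level is None:
            return
        if masks[level]:
            address += strides[level]         # prev-address + selected stride


# Walk a 2x3 sub-block of a row-major array whose rows are 10 elements apart:
# inner stride 1, outer stride 10 - 2 = 8 (relative to the previous address).
print(list(address_stream(0, counts=[2, 3], strides=[8, 1])))
```

Because each stride is applied relative to the previous address, the outer-level stride must account for the distance already covered by the inner loop, as in the example.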
3.2 Required ISA support

The major ISA addition is a new instruction, the PL instruction, which conveys information on the various levels of loops and their strides to the hardware. The structure of the added PL instruction is explained in Fig. 7. The PL instruction facilitates multiple strides (one at each level of loop nesting, i.e., a total of five strides) for each of the three input streams and one output stream. The strides indicate address increment/decrement values based on the loop-nest level. Depending on the mask values for each stream (indicated in the PL instruction) and the loop-nest level, one of the five possible strides is used to update the address pointer. If an application does not need five levels of nesting, non-constant strides can be generated with the extra levels of looping [19].

Data types of each stream/structure are also indicated in the PL instruction. Current SIMD extensions provide data reorganization instructions for solving the problem of having different element sizes across the data structures (packing, unpacking, and permute) and introduce additional instruction overhead. By providing this information in the PL instruction, special hardware in the PLE performs this function. The PLE performs reduction operations and this is also indicated in the PL instruction (for example, multiple independent results in a single SIMD register are combined together in a dot product, which requires additional instructions in current DLP techniques). Support for signed/unsigned arithmetic, saturation, and shifting/scaling of final results is all indicated in the PL instruction. This eliminates additional instructions that are otherwise needed for conventional RISC processors.

Fig. 7. Structure of the PL instruction (fields laid out in 32-bit words):
  Loop1-count, Loop2-count, Loop3-count, Loop4-count, Loop5-count
  Starting address of IS-1, IS-2, IS-3, and OS
  OPR / RedOp / Shift / LL
  Stride-1 to Stride-5 for IS-1
  Stride-1 to Stride-5 for IS-2
  Stride-1 to Stride-5 for IS-3
  Stride-1 to Stride-5 for OS
  Masks for IS-1 and IS-2
  Masks for IS-3 and OS
  Multicast and data types of each stream, with remaining bits unused
Legend: IS, input stream; OS, output stream; OPR, operation code; RedOp, reduction operation; LL, loop level to write results.

With the support for multiple levels of looping and multiple strides, the PL instruction is a complex instruction and decoding such an instruction is a complex process in current RISC processors. PLE instead handles the task of decoding the PL instruction. PLE has its own instruction memory to hold a PL instruction. Two additional 32-bit instructions are also added to the ISA of the general-purpose processor for marking the START and STOP of the PLE section of the code. These 32-bit instructions (fetched and decoded by the traditional instruction issue logic) indicate the start and the length of the PL instruction. Whenever a PL instruction is encountered in the dynamic instruction stream, the dynamic instructions prior to the PL instruction are allowed to finish, after which the PLE instruction decoder decodes the PL instruction. In our current implementation, we halt the superscalar pipeline until the execution of the PL instruction is completed because the PLE uses existing hardware units. Otherwise, arbitration of resources is necessary to allow for overlap of the PL instruction and other superscalar instructions.

Encoding all the overhead/supporting operations along with the SIMD true/core computation instructions has the advantage that the PL instruction can potentially replace millions of dynamic RISC instructions that have to be fetched, decoded, and issued every cycle in a normal superscalar processor. SIMD instructions in GPPs themselves reduce the number of instruction fetches because one instruction operates on multiple data. The PL instruction additionally captures all the overhead operations along with the SIMD computation operations, thereby drastically reducing repeated (and unnecessary) fetch and decode of the same instructions. This results in giving the PLE architecture advantages similar to ASIC-based acceleration in [20].

It is possible that an exception or interrupt occurs while a PL instruction is in progress. The state of all five loops, their current counts, and loop bounds are saved and restored when the instruction returns. This is similar to the handling of exceptions during move instructions with REP (Repeat Prefix) in x86. The PLE has registers to hold the loop parameters for all the loops.
3.2.1 PL instruction encoding example
The PL instruction is a densely encoded instruction, and hence most media algorithms can be processed in just a few PL instructions. Fig. 8 illustrates the actions during the execution of a PL instruction using pseudo-code. In a scenario in which all the loop nests and data streams are processed, the PLE executes (in hardware) the following equivalent number of dynamic software instructions (in conventional ILP processors):
• five branches
• three loads and one store
• four address value generations (one on each stream, with each address generation representing multiple RISC instructions)
• one SIMD operation (2-way to 16-way parallelism depending on each data element size)
• one accumulation of SIMD result and one SIMD reduction operation
• four SIMD data reorganization (pack/unpack, permute, etc.) operations
• shifting & saturation of SIMD results
Common kernels such as the DCT, color space conversion, motion estimation, and filtering can be mapped to either one or two PL instructions. Fig. 9 illustrates the PL instruction mapping of the 1-D DCT assuming an 8-way SIMD for 16-bit data. For the 1-D DCT routine, only four of the five possible loop nests are needed, with the loop boundaries indicated in the PL instruction. The starting address of each stream is represented by the starting address of each of the arrays. The third input stream is not used for this algorithm. The value of the strides is computed based on the loop indices and the value of the address pointer in the previous cycle. The address pointer is updated each clock cycle, choosing one stride depending on the nesting level of the loops.

    IS1 = start_address_IS1;
    IS2 = start_address_IS2;
    IS3 = start_address_IS3;
    OS  = start_address_OS;

    increment_address(level) {
        if (mask_IS1[level]) IS1 += stride_IS1[level];
        if (mask_IS2[level]) IS2 += stride_IS2[level];
        if (mask_IS3[level]) IS3 += stride_IS3[level];
        if (mask_OS[level])  OS  += stride_OS[level];
    }

    if      ((i_5 + 1) == loop1_count) increment_address(4);
    else if ((i_4 + 1) == loop2_count) increment_address(3);
    else if ((i_3 + 1) == loop3_count) increment_address(2);
    else if ((i_2 + 1) == loop4_count) increment_address(1);
    else                               increment_address(5);

    SIMD_data_reorganization(R1, R2);
    SIMD_compute(MAC, R1, R2, R4);
    SIMD_data_reorganization(R4);

Fig. 8. Pseudo-code illustrating operations during execution of a PLE compute instruction
    1D_DCT( image[1200][1600], dct_coef[8][8], output[8][8] )
    {
        for (i = 0; i < 1200/8; i++)
          for (j = 0; j < 1600/8; j++)
            for (k = 0; k < 8; k++) {
                temp_simd_vector = 0;
                for (l = 0; l < 8; l++)
                    /* Since there is 8-way SIMD parallelism, the innermost loop
                       folds into one iteration and is not required */
                    temp_simd_vector += multicast(dct_coef[k][l] * image[i*8+k][j*8+l]);
                output[i*8][k*8] = temp_simd_vector >> s_bits;
            }
    }

[Fig. 9 diagram: fields of the PL instruction for the 1-D DCT. Loop-count fields: 0, 1200/8, 1600/8, 8, 8. Stream start-address fields: starting address of image, starting address of dctcoeff, NONE, starting address of output. Stride fields shown: 16 bytes, 2 bytes, -126 bytes, 3200 bytes, -22384 bytes, -22400 bytes, and NONE for unused entries. Masks: IS-1 = 01111, IS-2 = 01111, IS-3 = 00000, OS = 01100. OPR = MAC, Shift = s_bits, LL = 4. Multicast is used for the DCT coefficients; the data type of each stream is set to 16-bit data.]

Fig. 9. PL instruction mapping of 1D-DCT
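The stride fields in Fig. 9 can be read as byte deltas that are added to a stream pointer when a given loop level advances. The Python sketch below is a software model under that reading, not the PLE hardware: `stream_addresses`, `BASE`, and the other names are hypothetical, the per-level mask bits of Fig. 8 are omitted, and the assignment of 16, -22384, and 3200 bytes to the i, j, and k levels of the image stream is an inference from the figure. It checks that these deltas reproduce the addresses of image[i*8+k][j*8] for 16-bit data in a 1600-element row.

```python
def stream_addresses(base, counts, deltas):
    """Yield one byte address per iteration of a loop nest.

    counts: trip count of each loop, outermost first.
    deltas: deltas[d] is added to the pointer when loop d advances,
            i.e. when every loop nested inside d has just wrapped.
    """
    idx = [0] * len(counts)
    addr = base
    while True:
        yield addr
        level = len(counts) - 1
        # Wrap every inner loop that has reached its bound.
        while level >= 0 and idx[level] + 1 == counts[level]:
            idx[level] = 0
            level -= 1
        if level < 0:
            return  # the outermost loop has finished
        idx[level] += 1
        addr += deltas[level]


ELEM = 2        # bytes per 16-bit element
WIDTH = 1600    # elements per image row
BASE = 0x1000   # assumed start address of the image stream

# Loops i, j, k; only two block rows are walked to keep the check short.
counts = (2, WIDTH // 8, 8)
deltas = (16, -22384, 3200)

generated = list(stream_addresses(BASE, counts, deltas))

# Reference addresses computed directly from the array indices.
expected = []
for i in range(counts[0]):
    for j in range(counts[1]):
        for k in range(counts[2]):
            expected.append(BASE + ELEM * (WIDTH * (i * 8 + k) + j * 8))

print(generated == expected)
```

With this encoding each stream needs only one addition per cycle: which stride is added depends on which loop level advanced, which is the selection the if/else chain in Fig. 8 performs.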
4. Performance Evaluation and Results
To measure performance of the PLE architecture, we modified the PISA version of Simplescalar-3.0 (sim-outorder) and simulated PL instructions using instruction annotations. We use the same SIMD execution units' configuration as in a Pentium III processor (two 64-bit SIMD ALUs and one 64-bit SIMD multiplier). Table 5 shows the speedup obtained for each of the benchmarks using the PLE architecture, with a 4-way processor with SIMD extensions as the baseline. The baseline as well as the PLE enhanced architecture incorporate prefetching.

Table 5. Speedups of 4-way, 8-way and 16-way processors enhanced with PLE

                    cfa    dct    mot    scale  aud    g711   jpeg   ijpeg  decrypt
4-way baseline      1.00   1.00   1.00   1.00   1.00   1.00   1.00   1.00   1.00
4-way + PLE         13.30  11.70  8.30   2.30   3.53   1.08   1.94   1.61   1.01
8-way baseline      1.24   1.29   1.33   1.96   1.47   1.68   1.18   1.40   1.18
8-way + PLE         26.00  22.14  16.00  4.58   5.49   1.83   2.23   2.09   1.19
16-way baseline     1.24   1.66   1.35   3.35   2.11   2.50   1.91   1.77   1.55
16-way + PLE        51.03  42.07  31.00  9.12   6.13   2.74   2.25   2.29   1.56
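The summary speedups quoted for the kernels are geometric means of per-benchmark ratios taken from Table 5. The short Python sketch below recomputes them as a worked example. It assumes that cfa, dct, mot, and scale are the kernel columns (an inference: the text does not label them, but their geometric means match the quoted 7.3x, 10x, and 15.9x), and `geomean` is a helper name introduced here.

```python
import math


def geomean(values):
    """Geometric mean of a sequence of positive numbers."""
    values = list(values)
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Table 5 entries for cfa, dct, mot, scale (all relative to the 4-way baseline).
baseline = {
    4: [1.00, 1.00, 1.00, 1.00],
    8: [1.24, 1.29, 1.33, 1.96],
    16: [1.24, 1.66, 1.35, 3.35],
}
with_ple = {
    4: [13.30, 11.70, 8.30, 2.30],
    8: [26.00, 22.14, 16.00, 4.58],
    16: [51.03, 42.07, 31.00, 9.12],
}

kernel_speedup = {}
for width in (4, 8, 16):
    ratios = []
    for ple, base in zip(with_ple[width], baseline[width]):
        ratios.append(ple / base)  # speedup over the same-width baseline
    kernel_speedup[width] = geomean(ratios)
    print(f"{width}-way + PLE over {width}-way baseline: "
          f"{kernel_speedup[width]:.2f}x")
```

Dividing each enhanced entry by the baseline of the same issue width, rather than by the 4-way baseline, is what isolates the contribution of the PLE from the contribution of the wider machine.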
Incorporation of hardware looping and address generation activity, coupled with savings in repeated fetch/decode, results in significant speedup in all the kernels and several applications. One application that does not benefit is decrypt. G711 and decrypt are the applications that have the smallest fraction of SIMD instructions (i.e., it is the superscalar pipeline that accounts for the bulk of the execution time rather than the PLE pipeline). Considering the geometric mean of speedups, incorporating the PLE enhances a 4-way processor by 7.3x in kernels and 1.54x in applications. We also computed the speedups of the 8-way enhanced architecture over the 8-way baseline and of the 16-way enhanced architecture over the 16-way baseline architecture. In the case of the 8-way processor, the speedup in kernels is 10x and the speedup in applications is 1.63x (over the 8-way baseline). In the case of the 16-way architecture, the speedup over the 16-way baseline is 15.9x in kernels and 1.37x in applications.
One may also note that incorporating PLE support for looping and address generation makes a 4-way processor perform better than an 8-way processor in all the kernels and 3 of 5 applications. For all the kernels and 2 of the 5 applications, a 4-way processor enhanced with the PLE outperforms even a 16-way processor. Hence, if one is searching for complexity-effective enhancements, incorporating hardware looping, address generation, and address transformation is an effective way to achieve it. We discuss the area, time and power savings associated with our proposal in the next section.
Power savings: Since the PL instruction is densely encoded, few PL instructions are needed for any media-processing algorithm. The number of dynamic instructions that need to be fetched and decoded shrinks tremendously, leading to a reduced use of the instruction fetch, decode, and issue logic in a superscalar processor. The instruction fetch and issue logic are a significant consumer of power in speculative out-of-order processors. Once the PL instruction is interpreted, the instruction fetch, decode, and issue logic in the superscalar processor can be shut down for the duration of the loop nest. An indication of the power savings can be obtained by examining the savings in fetch and decode. As shown in Fig. 10, the use of the vector-style PL instruction can eliminate more than half of the instructions from the original program (65% on average). The instructions required to implement looping, address computations, and transformations are removed. Each eliminated instruction results in energy savings in the fetch, decode and register renaming stages.
[Fig. 10 bar chart: % reduction in dynamic instructions (% eliminated instructions, axis 0 to 100) for cfa, dct, motest, scale, g711, aud, jpeg, ijpeg, and decrypt. Bar labels: 99.90, 99.90, 99.90, 99.90, 91.00, 42.60, 41.70, 11.30, 0.20.]

Fig. 10. Reduction in dynamic instructions by using the PLE architecture. Power savings proportional to the instruction count savings can be expected for fetch, decode and renaming energy.
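As a quick arithmetic check, the 65% average quoted in the text is the arithmetic mean of the nine bar labels of Fig. 10. The Python lines below recompute it; the values are the labels as they appear in the figure, and no per-benchmark assignment is assumed.

```python
# Bar labels of Fig. 10: percent of dynamic instructions eliminated.
reductions = [99.90, 99.90, 99.90, 99.90, 91.00, 42.60, 41.70, 11.30, 0.20]

average = sum(reductions) / len(reductions)
print(f"average reduction: {average:.1f}%")
```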
5. Hardware Cost of the PLE Architecture
In this section, we describe a VHDL implementation of the PLE, which we developed to estimate the PLE's area and power overhead, as well as to validate the timing assumptions that we used in our simulation environment. Using Synopsys synthesis tools [21], we used a cell-based methodology to target the VHDL models to two ASIC cell libraries from LSI Logic [22][23]. Table 6 lists the libraries and technologies used for evaluating the implementation cost.
Table 6. Cell-based libraries (LSI Logic) used in synthesis

Library name        Description
lcbg12-p (G12-p)    A 0.18-micron L-drawn (0.13-micron L-effective) CMOS process. Highest performance solution at 1.8 V with high drive cells optimized for long interconnects associated with large designs.
lcbg11-p (G11-p)    A 0.25-micron L-drawn (0.18-micron L-effective) CMOS process. Highest performance solution at 2.5 V.

We use the default wire load models provided by LSI Logic's ASIC libraries. The Synopsys synthesis tools compute timing information based on the cells in the design and their corresponding parameters defined in the ASIC technology library. The area information provided by the synthesis tools is prior to layout and is computed based on the wire load models of the associated cells in the design. Average power consumption is measured based on the switching activity of the nets in the design. In our experiments, the switching activity factor originates from the RTL models gathered by the tool from simulation. The area, power, and timing estimates are obtained after performing maximum optimizations for performance in the synthesis tools. More information about the details of these tools can be found elsewhere [21].
Table 7 shows the composite estimates of timing, area, and power consumption for the hardware looping and address generation circuitry when implemented using the cell-based methodology. The power and area estimates in Table 7 correspond to a clock frequency of 1 GHz. The hardware cost of commercial SIMD implementations [25][26] is also shown in Table 7. We discuss each of the three categories below.
Area: The overall chip area required for implementing the hardware loops, address generation (for four data streams), and the PL instruction decoder (merged into the looping and address generation logic) is approximately 0.31 mm² in the 0.18-micron library. In a 0.29-micron process, the increase in chip area for implementing the Visual Instruction Set (VIS) hardware into the Sparc processor family was 4 mm², MMX into the Pentium family was 15 mm², and AltiVec into the PowerPC family was 30 mm² [25]. In a 0.25-micron process, the AltiVec hardware was expected to occupy 15 mm². In a 0.18-micron technology, the die size of a Pentium III processor was 106 mm², with the MMX and SSE execution units requiring approximately 3.6 mm² [26]. Thus, the increase in area due to the added hardware units is less than 10% of the SIMD-related hardware, and the overall increase in chip area is less than 0.3%.
Table 7. Timing, area, and power estimates for hardware looping and address generation
(The instruction decoder was merged into the looping and address generation)

                  Hardware Looping (5 loops)              Address Generation (per stream)
Library           Time (ns)  Area (µm²)  Power (mW)       Time (ns)  Area (µm²)  Power (mW)
G12-p (0.18 µ)    1.00       72830       88.57            1.74       57398       85.16
G11-p (0.25 µ)    1.49       273249      249.30           2.60       165099      193.20

Area of commercial SIMD and GPP units for comparison [25][26]
VIS                                     4 mm² in a 0.29-micron process
MMX                                     15 mm² in a 0.29-micron process
AltiVec                                 15 mm² in a 0.25-micron process
Pentium III processor                   106 mm² in a 0.18-micron process
MMX + SSE in a Pentium III processor    3.6 mm² in a 0.18-micron process

Power: The power consumed by the looping, address generation (all four streams), and the PL instruction decoder is approximately 430 mW in the 0.18-micron library. General-purpose processors with speeds over 1 GHz typically consume between 50 W and 150 W; thus, the PLE hardware would increase power by less than 1%. The overall energy consumption of the PLE architecture would be clearly less than that of a superscalar processor with SIMD extensions, since the PL instruction significantly reduces the total instruction count, as explained in Fig. 10.
Timing: The PLE hardware can be incorporated into a high-speed processor without elongating the critical path of the processor (after appropriate pipelining). In Table 7, some of the units in our implementation do not support a 1 GHz clock; however, pipelining the hardware looping logic into two stages (in a 0.18-micron technology) would permit gigahertz frequencies. Similarly, the address generation stage needs to be divided into three pipe stages to achieve frequencies greater than 1 GHz.
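The area, power, and timing claims above follow from the G12-p (0.18-micron) estimates in Table 7 by simple arithmetic. The Python sketch below reproduces that arithmetic as a worked check; the variable names are introduced here, the 50 W figure is the lower end of the power range quoted in the text, and the per-stage delays assume an even split that ignores pipeline register overhead.

```python
# Table 7 estimates for the G12-p (0.18-micron) library.
LOOP_AREA_UM2, LOOP_POWER_MW, LOOP_TIME_NS = 72830, 88.57, 1.00
AGEN_AREA_UM2, AGEN_POWER_MW, AGEN_TIME_NS = 57398, 85.16, 1.74
STREAMS = 4  # three input streams and one output stream

# One looping unit plus one address generator per stream.
area_mm2 = (LOOP_AREA_UM2 + STREAMS * AGEN_AREA_UM2) / 1e6
power_mw = LOOP_POWER_MW + STREAMS * AGEN_POWER_MW

simd_share = area_mm2 / 3.6              # vs. MMX + SSE units in a Pentium III
die_share = area_mm2 / 106.0             # vs. the whole Pentium III die
power_share = (power_mw / 1000.0) / 50.0 # vs. a 50 W processor

# Idealized per-stage delay with the pipelining proposed in the text.
loop_stage_ns = LOOP_TIME_NS / 2
agen_stage_ns = AGEN_TIME_NS / 3

print(f"area   : {area_mm2:.3f} mm^2 "
      f"({simd_share:.1%} of SIMD units, {die_share:.2%} of die)")
print(f"power  : {power_mw:.2f} mW ({power_share:.2%} of 50 W)")
print(f"stages : looping {loop_stage_ns:.2f} ns, "
      f"address generation {agen_stage_ns:.2f} ns")
```

The summed area comes to about 0.30 mm², in line with the approximately 0.31 mm² quoted, and the summed power to about 429 mW, in line with the approximately 430 mW quoted.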
6. Related Work
Corbal et al. [36] proposed to exploit DLP in two dimensions instead of one dimension as in current SIMD extensions. Vassiliadis et al. [37][38] have concurrently proposed the Complex Streamed Instruction set (CSI), which can exploit two levels of looping. Though they are able to eliminate some overhead because each of their complex instructions can eliminate two loops, our results show that more overhead can be eliminated with more loops. Lee and Stoodley [39] proposed simple vector microprocessors for media applications, but they used in-order simple processors for scalar processing and vectors for media processing. Such an architecture can perform well over limited domains only, because the scalar processor is in-order. Ranganathan et al. [5] observe that out-of-order execution is beneficial to media applications. There are several components in many multimedia applications that cannot exploit DLP, but require good branch prediction and speculation to exploit ILP, and hence we also favor the use of the out-of-order processor.
Rixner et al. [40] developed the Imagine architecture for bandwidth-efficient media processing. This architecture is based on clusters of ALUs processing large data streams and is built as a co-processor for a high-end multimedia system. The methodology adopted is to put in additional computation units, while the PLE approach improves the utilization of the existing computation units by reducing looping and address generation overhead. Another related effort is the reconfigurable PipeRench coprocessor [41].
There are a few research efforts in identifying the bottlenecks in exploiting sub-word parallelism using SIMD extensions. Fridman discusses approaches to data alignment for sub-word parallelism in the TigerSharc processor using four sub-word MAC units in [28]. Thakkar and Huff discuss the need for data alignment for SSE extensions in [29]. The Burroughs Scientific Processor (BSP) [42] was a pure-SIMD array processor that had special-purpose hardware (called alignment networks) for packing and unpacking data. There has also been research in specialized access processors and address generation coprocessors [13][35] and decoupled access execute processors [30,31,32,19,33,34], which also tried to accelerate the overhead component of the instruction stream.
Vermeulen et al. [20] described how DCT, Reed-Solomon code and other similar media-oriented operations could be enhanced with a hardware accelerator that works in conjunction with a GPP. However, the accelerator has to be designed for each algorithm. Retargeting the accelerator to another algorithm incurs significant effort, while, in our case, we still have a fully programmable engine.
7. Conclusions
While SIMD extensions have improved media application performance, there are additional bottlenecks that exist in the media instruction stream. We study a range of media codes to determine which features and characteristics of these programs are sufficiently frequent to merit implementation in hardware. The common classes of operations that we find are multiple nested loop control, address generation, data transformations, and streaming memory operations. We note that:
• Approximately 75-85% of instructions in the dynamic instruction stream of media workloads are only supporting the actual/core computations. These instructions are mostly performing address generation, data rearrangement, loop branches, and loads/stores.
• The utilization of the SIMD computation units in current SIMD extensions is very low because of the copious overhead/supporting instructions. Our measurements on a Pentium III processor with a variety of media kernels and applications illustrate SIMD utilization ranging from 1% to 12%.
Then, we propose and evaluate a programmable loop engine (a PLE) that executes media codes and kernels efficiently by moving most of the overhead associated with the media loops into hardware.
• On the average, 65% of all instructions in the instruction stream can be eliminated with the addition of the proposed hardware. This leads to performance and power savings.
• Incorporating hardware looping and address generation into a 4-way processor with SIMD extensions results in a speedup of up to 1.54X in applications and 7.3X in kernels.
• For all kernels and 3 of 5 applications, a 4-way processor with the PLE outperforms an 8-way SIMD processor without the proposed hardware. For all the kernels and 2 of the 5 applications, a 4-way processor enhanced with the PLE outperforms even a 16-way processor.
• The cost of adding the PLE hardware to a SIMD GPP is negligible compared to the performance improvements. We find that the PLE hardware units occupy less than 0.3% of the overall processor area, consume less than 1% of the total processor power, and, with appropriate pipelining, do not elongate the critical path of a GPP.
Our solution essentially approximates the performance of hardware solutions, but retains the flexibility of software solutions. The necessity to run a variety of workloads including desktop, database, media, Java, scientific and technical applications justifies not abandoning the aggressive general-purpose core in favor of a media-specific solution. The right solution is to append simple hardware support for tasks that can be done efficiently and elegantly in hardware.
References
[1] R. B. Lee, "Multimedia extensions for general-purpose processors," Proc. IEEE Workshop on Signal Processing Systems, pp. 9-23, Nov. 1997.
[2] K. Diefendorff, P. K. Dubey, R. Hochsprung, and H. Scales, "AltiVec extension to PowerPC accelerates media processing," IEEE Micro, vol. 20, no. 2, pp. 85-95, Mar/Apr 2000.
[3] TMS320C64x DSP Technical Brief. Available: http://www.ti.com/sc/docs/products/dsp/c6000/c64xmptb.pdf.
[4] J. Fridman and Z. Greenfield, "The TigerSHARC DSP architecture," IEEE Micro, vol. 20, no. 1, pp. 66-76, Jan/Feb. 2000.
[5] P. Ranganathan, S. Adve, and N. Jouppi, "Performance of image and video processing with general-purpose processors and media ISA extensions," Proc. IEEE/ACM Sym. on Computer Architecture, pp. 124-135, May 1999.
[6] E. Salami, J. Corbal, M. Valero, and R. Espasa, "An Evaluation of different DLP alternatives for the embedded domain," Proc. Workshop on Media Processors and DSPs in conjunction with Micro-32, Nov. 1999.
[7] R. Bhargava, L. K. John, B. L. Evans, and R. Radhakrishnan, "Evaluating MMX technology using DSP and multimedia applications," Proc. IEEE/ACM Sym. on Microarchitecture, pp. 37-46, Dec. 1998.
[8] H. V. Nguyen, and L. K. John, "Exploiting SIMD parallelism in DSP and multimedia algorithms using the AltiVec technology," Proc. ACM Int. Conf. on Supercomputing, pp. 11-20, Jun. 1999.
[9] Sample source code for the Benchmarks. Link suppressed for BLIND review.
[10] C. Lee, M. Potkonjak and W. H. Smith, "MediaBench: A Tool for Evaluating and Synthesizing Multimedia and Communications Systems", Proc. of 30th IEEE/ACM Sym. on Microarchitecture, pp. 330-335, Dec 1997.
[11] D. Burger, and T. M. Austin, "The SimpleScalar toolset," Version 2.0. Technical Report 1342, Univ. of Wisconsin-Madison, Comp. Sci. Dept, 1997.
[12] J. Fritts, and W. Wolf, "Dynamic parallel media processing using speculative broadcast loop (SBL)," Proc. Workshop on Parallel and Distributed Computing in Image Processing, Video Processing, and Multimedia (held in conjunction with IPDPS'01), Apr. 2001.
[13] P. T. Hulina, L. D. Coraor, L. Kurian, and E. John, "Design and VLSI implementation of an address generation coprocessor," IEE Proc. on Computers and Digital Techniques, vol. 142, No. 2, pp. 145-151, Mar. 1995.
[14] J. E. Smith, "Decoupled access/execute computer architectures," ACM Trans. on Computer Systems, vol. 2, No. 4, pp. 289-308, Nov. 1984.
[15] J. E. Smith, S. Weiss, and N. Y. Pang, "A simulation study of decoupled architecture computers," IEEE Trans. on Computers, vol. C-35, No. 8, pp. 692-701, Aug. 1986.
[16] J. Corbal, R. Espasa, and M. Valero, "On the efficiency of reductions in micro-SIMD media extensions," Proc. Intl. Conf. on Parallel Architectures and Compilation Techniques, Sep. 2001.
[17] Intel Architecture Optimization Reference Manual. Available: http://developer.intel.com/design/pentiumii/manuals/245127.htm.
[18] P. Lapsley, J. Bier, A. Shoham, and E. A. Lee. DSP Processor Fundamentals: Architectures and Features, Chapter 8, IEEE Press series on Signal Processing, ISBN 0-7803-3405-1, 1997.
[19] A. R. Pleszkun, and E. S. Davidson, "Structured memory access architecture," Proc. IEEE Intl. Conf. on Parallel Processing, pp. 461-471, 1983.
[20] F. Vermeulen, L. Nachtergaele, F. Catthoor, D. Verkest, and H. De Man, "Flexible hardware acceleration for multimedia oriented microprocessors," Proc. IEEE/ACM Sym. on Microarchitecture, pp. 171-177, Dec. 2000.
[21] Synopsys SOLD Documentation, version 2000-0.5-1. Distributed with Synopsys CAD tools.
[22] LSI Logic ASIC technologies. Available: http://www.lsilogic/products/asic/technologies/index.html.
[23] LSI Logic ASKK Documentation System. Distributed with LSI Logic CAD tools.
[24] H. G. Cragon, and W. J. Watson, "The TI advanced scientific computer," IEEE Computer Magazine, pp. 55-64, Jan. 1989.
[25] L. Gwennap, "AltiVec vectorizes PowerPC," Microprocessor Report, vol. 12, no. 6, May 11, 1998.
[26] Pentium III implementation (IA-32). Available: http://www.sandpile.org/impl/p3.htm.
[27] K. Wilcox and S. Manne, "Alpha processors: A history of power issues and a look at the future," Cool Chips Tutorial in conjunction with IEEE/ACM Sym. on Microarchitecture, Nov. 1999.
[28] J. Fridman, "Sub-word parallelism in digital signal processing," IEEE Signal Processing Magazine, vol. 17, no. 2, pp. 27-35, Mar. 2000.
[29] S. Thakkar and T. Huff, "Internet streaming SIMD extensions," IEEE Computer Magazine, vol. 32, no. 12, pp. 26-34, Dec. 1999.
[30] J. E. Thornton, "Parallel operation in the Control Data 6600," Fall Joint Computers Conference, vol. 26, pp. 33-40, 1961.
[31] R. R. Shively, "Architecture of a programmable digital signal processor," IEEE Trans. Computers, vol. C-31, pp. 16-22, Jan. 1978.
[32] J. R. Goodman, T. J. Hsieh, K. Liou, A. R. Pleszkun, P. B. Schechter, and H. C. Young, "PIPE: A VLSI decoupled architecture," Proc. IEEE Sym. on Computer Architecture, pp. 20-27, Jun. 1985.
[33] Wm. A. Wulf, "Evaluation of the WM architecture," Proc. IEEE/ACM Sym. on Computer Architecture, pp. 382-390, May 1992.
[34] Y. Zhang, and G. B. Adams, "Performance modeling and code partitioning for the DS architecture," Proc. IEEE/ACM Sym. on Computer Architecture, pp. 293-304, Jun. 1998.
[35] A. S. Berrached, P. T. Hulina, and L. D. Coraor, "Specification of a coprocessor for efficient access of data structures," Proc. Ann. Hawaii Int. Conf. on System Sciences, pp. 496-505, Jan. 1992.
[36] J. Corbal, M. Valero, and R. Espasa, "Exploiting a new level of DLP in multimedia applications," Proc. IEEE/ACM Sym. on Microarchitecture, pp. 72-79, Nov. 1999.
[37] S. Vassiliadis, B. Juurlink, and E. A. Hakkennes, "Complex streamed instructions: introduction and initial evaluation," Proc. IEEE Euromicro Conf., vol. 1, pp. 400-408, Sep. 2000.
[38] B. Juurlink, D. Tcheressiz, S. Vassiliadis, and H. Wijshoff, "Implementation and evaluation of the complex streamed instruction set," Proc. Int. Conf. on Parallel Architectures and Compilation Techniques, Sep. 2001.
[39] C. G. Lee, and M. G. Stoodley, "Simple vector microprocessors for multimedia applications," Proc. 31st IEEE/ACM Sym. on Microarchitecture, pp. 25-36, Dec. 1998.
[40] S. Rixner, W. J. Dally, U. J. Kapasi, B. Khailany, A. Lopez-Lagunas, P. R. Mattson, and J. D. Owens, "A bandwidth-efficient architecture for media processing," Proc. 32nd IEEE/ACM Sym. on Microarchitecture, pp. 3-13, Dec. 1998.
[41] S. C. Goldstein, H. Schmit, M. Moe, M. Nudiu, S. Cadambi, R. R. Taylor, and R. Laufer, "PipeRench: A coprocessor for streaming multimedia acceleration," Proc. 26th IEEE/ACM Sym. on Computer Architecture, pp. 28-39, May 1999.
[42] D. J. Kuck, and R. A. Stokes, "The Burroughs scientific processor (BSP)," IEEE Trans. on Computers, vol. 31, no. 5, pp. 363-376, 1982.
[43] T. M. Conte, P. K. Dubey, M. D. Jennings, R. B. Lee, A. Peleg, S. Rathnam, M. Schlansker, P. Song, and A. Wolfe, "Challenges to combining general-purpose and multimedia processors," IEEE Computer Magazine, pp. 33-37, Dec. 1997.
[44] P. Ranganathan, S. Adve, and N. Jouppi, "Reconfigurable caches and their application to media processing," Proc. IEEE/ACM Sym. on Computer Architecture, pp. 214-224, Jun. 2000.
[45] A. S. Mckee, "Maximizing memory bandwidth for streamed computations," Ph.D. Thesis, School of Engineering and Applied Science, University of Virginia, May 1995.
[46] Z. A. Ye, A. Moshovos, S. Hauck, and P. Banerjee, "CHIMAERA: A high-performance architecture with a tightly coupled reconfigurable functional unit," Proc. IEEE/ACM Sym. on Computer Architecture, pp. 225-235, Jun. 2000.
[47] H. Lieske, J. Wittenburg, W. Hinrichs, H. Kloos, M. Ohmacht, P. Pirsch, "Enhancements for a Second Generation Parallel Multimedia-DSP," Proc. Workshop on Media Processors and DSPs in conjunction with Micro-32, Nov. 1999.
[48] Tech report, Link suppressed for blind review.