
Optimizing Sparse Linear Algebra for Large-Scale Graph Analytics

Daniele Buono, John A. Gunnels, Xinyu Que, Fabio Checconi, and Fabrizio Petrini, IBM T.J. Watson Research Center
Tai-Ching Tuan, University of Maryland
Chris Long, US Department of Defense

Emerging data-intensive applications attempt to process and provide insight into vast amounts of online data. A new class of linear algebra algorithms can efficiently execute sparse matrix-matrix and matrix-vector multiplications on large-scale shared-memory multiprocessor systems, enabling analysts to more easily discern meaningful data relationships, such as those in social networks.

The variety and volume of data collected by today's computing systems, and the rate at which it is collected, far outpace the abilities of current systems to execute complex analytics on that data and provide meaningful insights. Because of their flexibility and ability to provide intuitive mappings for a range of applications, graphs have emerged as the abstraction of choice, especially when the collected data does not have an obvious structure. Graph vertices can represent users or events, while edges can denote the relationships between them. Many research efforts have explored the mapping between graphs and linear algebra, because the ability to represent graph algorithms as linear algebraic operations can greatly simplify data analysis and allow for more uniform data treatment. Recent studies show the feasibility of recasting important graph algorithms as a sequence of linear algebraic operations, such as generalized sparse matrix-matrix multiplication.
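As a concrete illustration of this graph-to-matrix mapping, a minimal sketch (the graph here is invented for illustration): one step of breadth-first search, i.e., expanding a frontier of visited vertices to their neighbors, is exactly a matrix-vector product with the adjacency matrix.

```python
# Sketch: a small graph as an adjacency matrix, with one BFS frontier
# expansion expressed as a matrix-vector product. Graph is invented.

n = 4
edges = [(0, 1), (0, 2), (1, 3)]  # undirected edges over vertices 0..3

# Dense adjacency matrix A: A[i][j] = 1 if (i, j) is an edge.
A = [[0] * n for _ in range(n)]
for i, j in edges:
    A[i][j] = 1
    A[j][i] = 1

# Frontier vector x: currently-visited vertices (just vertex 0 here).
x = [1, 0, 0, 0]

# y = A * x counts, for each vertex, its neighbors in the frontier;
# the nonzero entries of y form the next BFS frontier.
y = [sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
next_frontier = [i for i in range(n) if y[i] > 0]
print(next_frontier)  # [1, 2] -- the neighbors of vertex 0
```

The same pattern, with the matrix stored sparsely, is the SpMV kernel discussed throughout the article.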

COMPUTER

algorithm design and performance. Indeed, optimized algorithms often require specialized data structures, and performance is heavily influenced by how data structures and access patterns interact with the execution platform's memory hierarchy. Figure 1 shows some of the most frequently used representations for sparse matrices. Unlike the dense matrix in Figure 1a, a sparse matrix stores only the significant elements, the nonzero entries. Because only a subset of elements is stored and the nonzero structure varies across matrices, we cannot identify an entry's coordinates (i and j) from its position in memory. One option, the coordinate format (COO) shown in Figure 1b, uses three separate arrays to store i, j, and the value of each significant element. A more compact representation, the compressed sparse row (CSR) format shown in Figure 1c, involves creating a compressed row index that identifies the beginning of each matrix row; the column index and the value array remain the same. A notable variation of the CSR format, the compressed sparse column (CSC), involves storing the matrix elements ordered by column and compressing the column index instead of the row index.

SPARSE MATRIX-VECTOR MULTIPLICATION (SPMV)

The matrix-vector product represents one of the most important operations in linear algebra, and is used primarily to solve linear equations and represent linear transformations. It is applied between an m × n matrix (A) and a column vector (x) of size n, and produces a new vector (b) of size m. The matrix-vector product, as defined in Equation 1, is valid for any kind of matrix. If the matrix is represented with a sparse format, however, we refer to the operation as SpMV.

b_i = Σ_{j=0..n-1} A_{i,j} · x_j,   for i = 0, ..., m-1.   (1)
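As an illustration, the sketch below builds the CSR arrays described earlier for a small example matrix (the matrix values are ours, since Figure 1 is not reproduced here) and then computes b = Ax exactly as in Equation 1, visiting only the stored nonzeros:

```python
def dense_to_csr(dense):
    """Build CSR arrays (row_ptr, col_idx, values) from a dense matrix.

    Row i's nonzeros occupy positions row_ptr[i]:row_ptr[i+1] of
    col_idx (their column coordinates j) and values (their A[i][j]).
    """
    row_ptr, col_idx, values = [0], [], []
    for row in dense:
        for j, v in enumerate(row):
            if v != 0:
                col_idx.append(j)
                values.append(v)
        row_ptr.append(len(values))
    return row_ptr, col_idx, values

def spmv_csr(row_ptr, col_idx, values, x):
    """Compute b = A x as in Equation 1, skipping the zero entries."""
    m = len(row_ptr) - 1
    b = [0] * m
    for i in range(m):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            b[i] += values[k] * x[col_idx[k]]
    return b

# Example matrix (invented): [[5, 0, 0], [0, 8, 3], [0, 0, 6]].
row_ptr, col_idx, values = dense_to_csr([[5, 0, 0], [0, 8, 3], [0, 0, 6]])
print(row_ptr, col_idx, values)  # [0, 1, 3, 4] [0, 1, 2, 2] [5, 8, 3, 6]
print(spmv_csr(row_ptr, col_idx, values, [1, 1, 1]))  # [5, 11, 6]
```

Note that the COO form of the same matrix would simply keep the row coordinates as a third explicit array ([0, 1, 1, 2]) instead of compressing them into row_ptr.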

Sparse implementations are commonly used for solving partial differential equations (PDEs) in scientific and engineering applications.


IRREGULAR APPLICATIONS

In graphs, simple derivatives of the adjacency matrix, such as the Laplacian matrix, are important in extracting information such as balanced approximate minimum cuts and a double centering of the well-known commute-time distance between nodes in the graph. Furthermore, computing the eigenvalues and eigenvectors of the Laplacian matrix can be used in clustering, partitioning, community detection, and anomaly detection. Another application of the adjacency matrix is computing the PageRank algorithm. All of these examples require the resolution of a linear system, which in turn relies on SpMV.

Traditional SpMV approaches
In general-purpose architectures, the CSR format allows for a very simple, optimized SpMV implementation: the row index in Figure 1c enables the separation of each matrix-vector product's components, and the storage format keeps the nonzero elements of each matrix row stored contiguously.
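Because each output component b[i] reads only row i of the matrix, the row index lets disjoint row blocks be computed independently, which is the basis for parallelizing CSR SpMV across cores. A minimal single-threaded sketch of this separation (the chunking scheme is ours, not the article's):

```python
def spmv_csr_rows(row_ptr, col_idx, values, x, lo, hi):
    """Compute components b[lo:hi] of b = A x, touching only rows lo..hi-1."""
    out = []
    for i in range(lo, hi):
        acc = 0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[k] * x[col_idx[k]]
        out.append(acc)
    return out

# Matrix [[5,0,0],[0,8,3],[0,0,6]], x = [1,1,1]: computing row blocks
# [0,2) and [2,3) separately and concatenating yields the full product,
# illustrating that each block could run on its own core with no
# synchronization on the output vector.
row_ptr = [0, 1, 3, 4]
col_idx = [0, 1, 2, 2]
values = [5, 8, 3, 6]
x = [1, 1, 1]
b = spmv_csr_rows(row_ptr, col_idx, values, x, 0, 2) + \
    spmv_csr_rows(row_ptr, col_idx, values, x, 2, 3)
print(b)  # [5, 11, 6]
```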

Centaur memory-buffer chip. The memory channel between POWER8 and Centaur provides up to 28 Gbytes per second of memory bandwidth. Each socket supports up to eight channels for a total of 224 Gbytes per second of memory bandwidth. For our experiments, we used a two-socket POWER8 running at 3.3 GHz, with a total of 512 Gbytes of RAM. Each socket is equipped with a dual-chip module containing two six-core chips for a total of 24 cores. The total amount of L3 cache is 192 Mbytes.

Testing graph algorithms of unprecedented scale is difficult because real-world data for such graphs is unavailable. To ensure that the proposed algorithms will work efficiently on graphs of various scale, structure, and sparsity, the research community has devoted considerable effort to developing synthetic graph generators that allow the creation of realistic models for generally evaluating graphs. In our experiments, we use the recursive matrix (R-MAT) and block two-level Erdős-Rényi (BTER) generators, which produce graphs with properties of large-scale, real-world graphs.
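For reference, the R-MAT generator mentioned above draws each edge by recursively descending into one of the four quadrants of the adjacency matrix with fixed probabilities. A minimal sketch, assuming the conventional parameter values a = 0.57, b = 0.19, c = 0.19, d = 0.05 (these are the commonly cited defaults, not necessarily the settings used in the article):

```python
import random

def rmat_edge(scale, a=0.57, b=0.19, c=0.19, d=0.05, rng=random):
    """Draw one edge of a 2**scale-vertex R-MAT graph.

    At each of `scale` levels, choose a quadrant of the adjacency
    matrix with probabilities (a, b, c, d) and recurse into it;
    the chosen path determines the (row, col) endpoints bit by bit.
    """
    row = col = 0
    for _ in range(scale):
        r = rng.random()
        row <<= 1
        col <<= 1
        if r < a:                # top-left quadrant
            pass
        elif r < a + b:          # top-right
            col |= 1
        elif r < a + b + c:      # bottom-left
            row |= 1
        else:                    # bottom-right
            row |= 1
            col |= 1
    return row, col

# A small graph: 2**10 vertices with edge factor 32, as in the article.
random.seed(42)
n, edge_factor = 2 ** 10, 32
edges = [rmat_edge(10) for _ in range(n * edge_factor)]
```

The skew of (a, b, c, d) toward the top-left quadrant is what produces the heavy-tailed degree distributions typical of social networks.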

Jaccard similarity is heavily used in algorithms that address important classes of real-world problems, including finding similar documents (duplicates or near duplicates) in a large body such as the Web or a collection of news articles. Other problem classes include query refinement for search engines that suggest alternate formulations, plagiarism detection for a search system that identifies documents from the same source, and collaborative filtering for recommender systems, which make recommendations by identifying users with similar tastes. For an undirected graph G = (V, E), the Jaccard similarity between a pair of vertices (i, j) is the ratio between the cardinalities of two sets: the intersection and the union of their neighbor sets, respectively. Because of this symmetry, matrix multiplication can be done only for pairs for which the row vertex id is not greater than the column vertex id, operating on a triangular matrix portion.
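The Jaccard similarity of a vertex pair, as discussed above, can be computed directly from neighbor sets. A naive set-based sketch (the graph is invented; the article's scalable formulation instead casts the computation as sparse matrix-matrix multiplication):

```python
def jaccard(adj, i, j):
    """Jaccard similarity of vertices i and j in an undirected graph:
    |N(i) & N(j)| / |N(i) | N(j)|, where N(v) is v's neighbor set."""
    ni, nj = adj[i], adj[j]
    union = len(ni | nj)
    return len(ni & nj) / union if union else 0.0

# Small example graph (invented): neighbor set per vertex.
adj = {
    0: {1, 2, 3},
    1: {0, 2},
    2: {0, 1, 3},
    3: {0, 2},
}

# jaccard(adj, i, j) == jaccard(adj, j, i), so it suffices to evaluate
# pairs with i < j -- the triangular restriction noted in the text.
print(jaccard(adj, 1, 3))  # N(1) = N(3) = {0, 2}, so similarity is 1.0
```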

PERFORMANCE EVALUATION

We tested our SpMV and SpGEMM versions on the IBM POWER8 processor, an architecture designed for big data and complex analytics. Each POWER8 core is equipped with private L1 and L2 caches of 64 Kbytes and 512 Kbytes, respectively. The L3 cache is shared, and each core has fast access to an 8-Mbyte slice. The on-chip memory controller is not directly connected to DRAM modules, but rather to the Centaur memory-buffer chip.

SpMV results
We compare the execution times required to compute SpMV for different matrix classes, increasing the number of nonzero elements. We use a stencil (scientific) matrix to solve elliptic PDEs (dashed lines), and two synthetic generators, which maintain the properties common to social network graphs, for the graph domain (solid and dashed/dotted lines). The R-MAT and BTER generators use the parameters suggested by their respective authors. Graphs have a constant edge factor of 32. By having the same average number of nonzero


4, and 210 times slower than for a scientific matrix, depending on the generator and graph size. Our improved algorithm (red lines) offers a more consistent multiplication time. We notice some differences, but the variability is greatly reduced. With our

[Figure: multiplication time as a function of matrix size (nnz).]