On-Chip Network-Enabled Multicore Platforms Targeting Maximum ...

The range of hardware acceleration architectures tried so far offers a limited degree of fine-grained parallelism. Network-on-chip (NoC) is an emerging paradigm that ...
On-Chip Network-Enabled Multi-Core Platforms Targeting Maximum Likelihood Phylogeny Reconstruction
Turbo Majumder, PhD Candidate, School of EECS

[Figure: multi-core layout with processing elements PE01, PE02, PE03, PE10, PE11, PE12, PE13 and labels 0 to 23.]

Torus node numbers (0 to 15) with Hilbert curve node numbers (i to xvi):

 0 (i)      1 (iv)     2 (v)      3 (vi)
 4 (ii)     5 (iii)    6 (viii)   7 (vii)
 8 (xv)     9 (xiv)   10 (ix)    11 (x)
12 (xvi)   13 (xiii)  14 (xii)   15 (xi)
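The pairing of torus node numbers with Hilbert curve node numbers in the figure labels (0 with i, 1 with iv, 2 with v, and so on) matches a standard order-2 Hilbert curve laid over a 4×4 grid numbered row by row. The sketch below is a minimal illustration, not the poster's allocation algorithm: the function name `hilbert_index`, the row-major node numbering and the 4×4 size are assumptions made for the example. It uses the widely known coordinate-to-distance conversion for Hilbert curves and compares the result with the labels recovered from the figure.

```python
def hilbert_index(n, x, y):
    """0-based position of cell (x, y) along a Hilbert curve on an n x n grid.

    n must be a power of two. This is the usual iterative conversion,
    not code taken from the poster.
    """
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate or flip so the next, finer level sees a canonical orientation.
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d


SIDE = 4  # assumed 4 x 4 example, torus nodes numbered row by row

# torus node number -> 1-based Hilbert curve node number
hilbert_label = {
    SIDE * row + col: hilbert_index(SIDE, row, col) + 1
    for row in range(SIDE)
    for col in range(SIDE)
}

# Labels recovered from the figure (roman numerals written as integers).
POSTER_LABELS = {
    0: 1, 1: 4, 2: 5, 3: 6,
    4: 2, 5: 3, 6: 8, 7: 7,
    8: 15, 9: 14, 10: 9, 11: 10,
    12: 16, 13: 13, 14: 12, 15: 11,
}

print(hilbert_label == POSTER_LABELS)
```

Consecutive Hilbert numbers always fall on neighbouring grid cells, which is the property that makes a run of Hilbert indices a compact candidate partition.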

[Figure: allocated partition, contiguous on Hilbert and torus.]

[Figure labels, computation tree: left, x1, x2, EV, "up to 8 sums-of-four-products", x3[0], x3[1], x3[2], x3[3].]
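The computation-tree labels mention "up to 8 sums-of-four-products" feeding the outputs x3[0] to x3[3]. The sketch below shows one way such a tree could be arranged for four-state (DNA) data; it is an illustration under stated assumptions, not the poster's datapath. The names `sum_of_four_products` and `newview_site` are hypothetical, the `right` matrix does not appear among the recovered labels and is assumed by symmetry with `left`, and the final stage against `EV` is likewise an assumption.

```python
def sum_of_four_products(a, b):
    """One tree primitive: a[0]*b[0] + a[1]*b[1] + a[2]*b[2] + a[3]*b[3]."""
    assert len(a) == 4 and len(b) == 4
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2] + a[3] * b[3]


def newview_site(left, right, x1, x2, ev):
    """Combine two child vectors x1, x2 into x3 for one site (assumed form).

    left, right and ev are 4 x 4 matrices given as lists of rows.
    """
    # Eight sums-of-four-products: four for each child.
    from_left = [sum_of_four_products(left[k], x1) for k in range(4)]
    from_right = [sum_of_four_products(right[k], x2) for k in range(4)]
    # Element-wise combination of the two children.
    combined = [a * b for a, b in zip(from_left, from_right)]
    # Four further sums-of-four-products against the columns of ev give x3[0..3].
    return [
        sum_of_four_products(combined, [ev[j][k] for j in range(4)])
        for k in range(4)
    ]


IDENTITY = [[1 if r == c else 0 for c in range(4)] for r in range(4)]
x3 = newview_site(IDENTITY, IDENTITY, [1, 2, 3, 4], [2, 2, 2, 2], IDENTITY)
print(x3)
```

With identity matrices the result reduces to the element-wise product of x1 and x2, which makes the data flow easy to check by hand. Each sum-of-four-products is independent of the others at its level, which is the kind of fine-grained parallelism a PE could exploit.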

Traffic types and routing

Note: PEij denotes PE j in NoC node i, with j = 0, 1, 2, 3 and i = 0, …, N-1.

[Figure: allocated partition, non-contiguous on Hilbert and torus.]

Legend:
• Interconnect
• Hilbert curve
• 0, 1, 2, 3, 4, … : torus node numbers
• i, ii, iii, iv, v, … : Hilbert curve node numbers
• A-type partition, A-type traffic
• B-type partition, B-type traffic
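The partition figures distinguish three cases: contiguous on Hilbert and torus, non-contiguous on Hilbert but contiguous on torus, and non-contiguous on both. The poster text does not define contiguity, so the sketch below assumes two simple definitions: a partition is contiguous on Hilbert if its Hilbert curve node numbers form one unbroken run, and contiguous on the torus if its nodes are connected through 4-neighbour links with wrap-around. The 4×4 size, the function names and both definitions are assumptions for illustration; the node-to-Hilbert mapping is the one recovered from the figure labels.

```python
from collections import deque

SIDE = 4  # assumed 4 x 4 torus, nodes numbered row by row

# torus node number -> Hilbert curve node number (from the figure labels)
HILBERT_LABEL = {
    0: 1, 1: 4, 2: 5, 3: 6,
    4: 2, 5: 3, 6: 8, 7: 7,
    8: 15, 9: 14, 10: 9, 11: 10,
    12: 16, 13: 13, 14: 12, 15: 11,
}


def contiguous_on_hilbert(nodes):
    """True if the Hilbert numbers of the nodes form one unbroken run."""
    labels = sorted(HILBERT_LABEL[n] for n in set(nodes))
    return labels[-1] - labels[0] + 1 == len(labels)


def torus_neighbours(node):
    """4-neighbourhood of a node with wrap-around in both dimensions."""
    row, col = divmod(node, SIDE)
    return {
        ((row - 1) % SIDE) * SIDE + col,
        ((row + 1) % SIDE) * SIDE + col,
        row * SIDE + (col - 1) % SIDE,
        row * SIDE + (col + 1) % SIDE,
    }


def contiguous_on_torus(nodes):
    """True if the nodes form a connected region on the torus (BFS)."""
    nodes = set(nodes)
    start = next(iter(nodes))
    seen = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for neighbour in torus_neighbours(current) & nodes:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen == nodes


def classify(nodes):
    """Return (contiguous on Hilbert, contiguous on torus)."""
    return contiguous_on_hilbert(nodes), contiguous_on_torus(nodes)


print(classify({0, 4, 5}))   # Hilbert numbers i, ii, iii
print(classify({0, 1}))      # Hilbert numbers i, iv
print(classify({0, 10}))     # Hilbert numbers i, ix
```

Under these definitions a partition that is contiguous on Hilbert is always contiguous on the torus as well, because consecutive Hilbert numbers sit on adjacent nodes; that is consistent with the figures showing only three of the four possible combinations.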

Results

Network architectures:
• 2-D folded torus (2D_serial, 2D_parallel)
• 3-D folded torus (3D_torus)
• 3-D stacked torus (3D_sttorus)

[Figure: NoC node. Labels: node of folded torus (upper level), crossbar, from/to crossbar of subnet, computation core, PE0, PE1, PE2, N.]

[Figure: Variation of partition dispersion and function communication latency across different NoC architectures. Function labels: newviewGTRCAT (f2), coreGTRCAT (f3), newviewGTRGAMMA (f6). Series: average dispersion of each partition (avg. dispersion in 2D_serial, avg. XYZ dispersion in 3D_sttorus); total and residual communication latency in 2D_serial, 2D_parallel, 3D_torus and 3D_sttorus.]

[Figure: Function-level speedup across different NoC architectures (2D_serial, 2D_parallel, 3D_torus, 3D_sttorus). Function labels: coreGTRCAT (f3), newviewGTRGAMMA (f6).]

[Figure: Average aggregate speedup of the accelerated kernels across different NoC architectures. Legend: test cases with larger number (avg. 23.33) of partitions.]

Total run-times for different inputs using different NoC-based platforms vis-à-vis only software

Columns (all times in s):
(a) Unaccelerated software run-time
(b) Time spent in accelerated kernels
(c) Allocation time
(d) PCIe interface time
(e) Total run-time using NoC platform as hardware accelerator
(f) Total 4T software run-time

Input data (DNA)  Platform     (a)          (b)        (c)       (d)       (e)          (f)
50_5000           2D_serial    292.000444   0.515478   0.130387  0.145065  292.791374   924.052039
50_5000           2D_parallel  292.000444   0.481303   0.104805  0.145065  292.731617   924.052039
50_5000           3D_torus     292.000444   0.433625   0.050889  0.145065  292.630024   924.052039
50_5000           3D_sttorus   292.000444   0.474657   0.050889  0.145065  292.671056   924.052039
500_5000          2D_serial    7038.847538  19.1142    8.467062  8.273363  7074.702162  37124.7233
500_5000          2D_parallel  7038.847538  18.04733   6.805803  8.273363  7071.974034  37124.7233
500_5000          3D_torus     7038.847538  16.766102  3.304655  8.273363  7067.191658  37124.7233
500_5000          3D_sttorus   7038.847538  18.102936  3.304655  8.273363  7068.528491  37124.7233
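The run-time figures can be cross-checked with a few lines of arithmetic. The sketch below assumes a particular reading of the flattened table: for each input and platform the six values are taken, in order, as unaccelerated software run-time, time in accelerated kernels, allocation time, PCIe interface time, total run-time on the NoC platform, and total 4T software run-time. That column assignment is an interpretation of the extracted text, and the names `RUNS` and `summarize` are hypothetical. The code recomputes each total from its four components and derives the run-time reduction relative to the 4T software figure.

```python
# (input, platform) -> (unaccelerated software, accelerated kernels, allocation,
#                       PCIe interface, reported NoC total, 4T software), seconds
RUNS = {
    ("50_5000", "2D_serial"): (292.000444, 0.515478, 0.130387, 0.145065, 292.791374, 924.052039),
    ("50_5000", "2D_parallel"): (292.000444, 0.481303, 0.104805, 0.145065, 292.731617, 924.052039),
    ("50_5000", "3D_torus"): (292.000444, 0.433625, 0.050889, 0.145065, 292.630024, 924.052039),
    ("50_5000", "3D_sttorus"): (292.000444, 0.474657, 0.050889, 0.145065, 292.671056, 924.052039),
    ("500_5000", "2D_serial"): (7038.847538, 19.1142, 8.467062, 8.273363, 7074.702162, 37124.7233),
    ("500_5000", "2D_parallel"): (7038.847538, 18.04733, 6.805803, 8.273363, 7071.974034, 37124.7233),
    ("500_5000", "3D_torus"): (7038.847538, 16.766102, 3.304655, 8.273363, 7067.191658, 37124.7233),
    ("500_5000", "3D_sttorus"): (7038.847538, 18.102936, 3.304655, 8.273363, 7068.528491, 37124.7233),
}


def summarize(row):
    """Return (total recomputed from components, reduction vs. 4T software)."""
    software, kernels, allocation, pcie, reported_total, software_4t = row
    recomputed_total = software + kernels + allocation + pcie
    return recomputed_total, software_4t / reported_total


for (dataset, platform), row in RUNS.items():
    total, reduction = summarize(row)
    print(f"{dataset:9s} {platform:12s} total={total:12.6f} s  reduction={reduction:.2f}x")
```

Under this reading the recomputed totals agree with the reported ones to within rounding, and the 500_5000 input shows a reduction of a little over 5x on every platform. The unaccelerated software portion dominates each total, so the differences between NoC architectures barely move the overall run-time.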

Objective
• Finding a phylogenetic tree that best explains the evolutionary relationship among a given set of species is computationally complex because of the greater-than-exponential search space (real, multi-dimensional) and floating-point arithmetic computation.
• Phylogenetic tree reconstruction is data/computation intensive. Use parallelism: divide into a large number of smaller semi-independent sub-problems that can be computed concurrently.

Why Network-on-Chip?
• A large number of cores to deal with sub-problems
• Low-latency inter-core communication

Multi-core System Design
[Figure: NoC and NoC node. Labels: network switch, subnet (lower level), PE00, PE3, S, E, W.]

Dynamic Node Allocation
[Figure: Different kinds of allocated partitions using Hilbert curve, including a partition that is non-contiguous on Hilbert, contiguous on torus.]
[Figure: Schematic representation of computation tree of newviewGTRCAT (f2).]

Experimental Setup
Computation core
• Datapath: 64 bit
• Number representation accuracy of 2^-52 using Fixed-Point Hybrid Number System
• All components designed with Verilog HDL and synthesized with 65 nm standard libraries
Multi-core System
• Interconnects laid out, parasitics (resistance, capacitance) extracted to determine physical parameters (power dissipation, delay)
• N=16 and N=64 system sizes simulated using TreeSim
• 32-lane PCIe 2.0 interface (5 Gbps)
Software
• RAxML-VI-HPC (version 7.0.4) on three inputs sourced from 2,177-taxon 68-gene mammalian dataset
• Pentium IV 3.2 GHz dual-core CPU; GNU gprof utility for profiling
• Best software runtime used as the baseline
• Functions coreGTRCAT (f3) (48%), newviewGTRGAMMA (f6) (21%) and newviewGTRCAT (f2) (17%) collectively account for more than 85% of the total runtime

Results (cont’d.)
[Figure: Total dispersion across different NoC architectures. Axis: total dispersion. Further series and axis labels from the dispersion and latency plots: avg. dispersion in 2D_parallel, avg. XYZ dispersion in 3D_torus, average communication latency.]
[Figure: Total system energy consumption across different NoC architectures. Axis: total system energy (µJ). Legend: test cases with fewer (avg. 15.67) partitions; test cases with larger number (avg. 23.33) of partitions.]
[Figure: Function-level speedup for newviewGTRCAT (f2) across 2D_serial, 2D_parallel, 3D_torus and 3D_sttorus.]
[Figure: Aggregate speedup of accelerated kernels. Legend: test cases with fewer (avg. 15.67) partitions; test cases with larger number (avg. 23.33) of partitions.]

Summary
• Network-on-Chip (NoC) based multi-core platform for accelerating Maximum Likelihood (ML) based phylogeny reconstruction
• Chief contributions
→ design of a fine-grained parallel PE architecture
→ novel algorithm to dynamically allocate nodes to tasks based on Hilbert space-filling curves
→ design and extensive evaluation of different 2-D and 3-D NoC architectures
• Function-level speedup of ~847x, aggregate speedup of the accelerated portion up to ~6,500x, and overall run-time reduction of more than 5x over multithreaded software; exceeds the performance of all state-of-the-art hardware accelerators for this application class.
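The gap between the kernel-level speedups (hundreds to thousands of times) and the overall run-time reduction (a little over 5x) is what Amdahl's law would predict when only part of the program is accelerated. The sketch below is an illustration added here, not an analysis from the poster: it takes the profiled shares of coreGTRCAT (f3), newviewGTRGAMMA (f6) and newviewGTRCAT (f2) as the accelerated fraction, assumes those shares carry over to the baseline run, and ignores allocation and PCIe overheads. The function name `amdahl_speedup` is hypothetical.

```python
def amdahl_speedup(accelerated_fraction, kernel_speedup):
    """Overall speedup when a fraction of run-time is sped up by kernel_speedup."""
    return 1.0 / ((1.0 - accelerated_fraction) + accelerated_fraction / kernel_speedup)


# Profiled run-time shares quoted in the experimental setup.
PROFILE = {
    "coreGTRCAT (f3)": 0.48,
    "newviewGTRGAMMA (f6)": 0.21,
    "newviewGTRCAT (f2)": 0.17,
}

fraction = sum(PROFILE.values())      # 0.86 of the run-time is accelerated
upper_bound = 1.0 / (1.0 - fraction)  # limit as the kernel speedup grows without bound

for kernel_speedup in (10, 100, 6500):
    print(kernel_speedup, round(amdahl_speedup(fraction, kernel_speedup), 2))
print("upper bound:", round(upper_bound, 2))
```

With 86% of the run-time accelerated, the bound is about 7.1x however fast the kernels become, so an overall reduction of more than 5x sits close to what this simple model allows. The baselines differ (the profile comes from gprof, while the reported reduction is relative to multithreaded software), so the comparison is indicative only.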