On-Chip Network-Enabled Multi-Core Platforms Targeting Maximum Likelihood Phylogeny Reconstruction
Turbo Majumder, PhD Candidate, School of EECS ([email protected])
Objective
•Finding a phylogenetic tree that best explains the evolutionary relationship among a given set of species is computationally complex because of the greater-than-exponential search space (real, multi-dimensional) and the floating-point arithmetic computation involved.

Why Network-on-Chip?
•Phylogenetic tree reconstruction is data- and computation-intensive. Use parallelism: divide the problem into a large number of smaller, semi-independent sub-problems that can be computed concurrently.
•This requires a large number of cores to deal with the sub-problems, together with low-latency inter-core communication.

Multi-core System Design
•Each NoC node hosts four processing elements (PE0, PE1, PE2, PE3) connected through a crossbar to the network switch.
•Network architectures evaluated: 2-D folded torus (2D_serial, 2D_parallel), 3-D folded torus (3D_torus), and 3-D stacked torus (3D_sttorus), in which each subnet (lower level) connects through its crossbar to a node of the folded torus (upper level).
•Computation core: the PE datapath evaluates up to 8 sums-of-four-products in parallel; the figure below shows this for newviewGTRCAT (f2), whose inputs x1 and x2 are combined and projected through the eigenvector matrix EV to produce x3[0]..x3[3]. A code sketch follows below.
[Figure: Schematic representation of the computation tree of newviewGTRCAT (f2)]
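The poster gives only the datapath structure, not code. Below is a minimal C sketch (my own reconstruction, mirroring the corresponding inner loop of RAxML's newviewGTRCAT; all names are illustrative, not the poster's) of the per-site conditional-likelihood update the tree above computes. For DNA data every intermediate term is a sum of four products, and the eight sums feeding uL[] and uR[] are mutually independent, which is what the "up to 8 sums-of-four-products" datapath exploits.

/* Per-site conditional-likelihood update for DNA (4 states).
 * leftP/rightP are the branch transition-probability matrices,
 * EV is the eigenvector matrix, x1/x2 are the child likelihood
 * vectors, and x3 is the parent vector being computed. */
void newview_site(const double leftP[4][4], const double rightP[4][4],
                  const double EV[4][4],
                  const double x1[4], const double x2[4], double x3[4])
{
    double uL[4], uR[4], prod[4];

    /* Eight independent sums-of-four-products: the PE hardware
     * can evaluate all of them in parallel. */
    for (int j = 0; j < 4; j++) {
        uL[j] = leftP[j][0]*x1[0] + leftP[j][1]*x1[1]
              + leftP[j][2]*x1[2] + leftP[j][3]*x1[3];
        uR[j] = rightP[j][0]*x2[0] + rightP[j][1]*x2[1]
              + rightP[j][2]*x2[2] + rightP[j][3]*x2[3];
        prod[j] = uL[j] * uR[j];
    }

    /* Four more sums-of-four-products project the result back
     * through the eigenvector matrix EV, yielding x3[0]..x3[3]. */
    for (int i = 0; i < 4; i++)
        x3[i] = prod[0]*EV[0][i] + prod[1]*EV[1][i]
              + prod[2]*EV[2][i] + prod[3]*EV[3][i];
}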
Dynamic Node Allocation
•NoC nodes are allocated to tasks (partitions) dynamically using a Hilbert space-filling curve, so that nodes that are consecutive in allocation order stay close together on the torus (a sketch of the curve's index-to-coordinate mapping follows below).
[Figure: Different kinds of allocated partitions using the Hilbert curve. A partition may be contiguous on both the Hilbert curve and the torus, non-contiguous on the Hilbert curve but contiguous on the torus, or non-contiguous on both.]
Legend: 0, 1, 2, 3, 4, … are torus node numbers; i, ii, iii, iv, v, … are Hilbert-curve node numbers.

Traffic types and routing
[Figure: A-type partitions generate A-type traffic; B-type partitions generate B-type traffic. Note: PEij denotes PE j in NoC node i, where j = 0, 1, 2, 3 and i = 0, …, N-1.]
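The allocation algorithm itself is not reproduced on the poster. As background, here is the standard iterative conversion from a Hilbert-curve index d to (x, y) grid coordinates, sketched in C; an allocator could give a k-node partition the nodes at k consecutive Hilbert indices and map each index to its torus node this way. The function names are illustrative, not the poster's.

/* Rotate/flip a quadrant so the curve's orientation is correct. */
static void rot(int n, int *x, int *y, int rx, int ry)
{
    if (ry == 0) {
        if (rx == 1) {
            *x = n - 1 - *x;
            *y = n - 1 - *y;
        }
        int t = *x; *x = *y; *y = t;   /* swap x and y */
    }
}

/* Map Hilbert index d (0 .. n*n-1) to (x, y) on an n x n grid,
 * n a power of two; n = 4 covers the 16-node layout shown above. */
static void d2xy(int n, int d, int *x, int *y)
{
    int rx, ry, t = d;
    *x = *y = 0;
    for (int s = 1; s < n; s *= 2) {
        rx = 1 & (t / 2);
        ry = 1 & (t ^ rx);
        rot(s, x, y, rx, ry);
        *x += s * rx;
        *y += s * ry;
        t /= 4;
    }
}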
Experimental Setup
Computation core
•Datapath: 64-bit; number representation accuracy of 2^-52 using the Fixed-Point Hybrid Number System (see the note below)
•All components designed in Verilog HDL and synthesized with 65 nm standard libraries
Multi-core System
•Interconnects laid out and parasitics (resistance, capacitance) extracted to determine physical parameters (power dissipation, delay)
•N = 16 and N = 64 system sizes simulated using TreeSim
•32-lane PCIe 2.0 interface (5 Gbps)
Software
•RAxML-VI-HPC (version 7.0.4) run on three inputs sourced from a 2,177-taxon, 68-gene mammalian dataset
•Pentium IV 3.2 GHz dual-core CPU; GNU gprof utility used for profiling; best software runtime used as the baseline
•Functions coreGTRCAT (f3) (48%), newviewGTRGAMMA (f6) (21%) and newviewGTRCAT (f2) (17%) collectively account for more than 85% of the total runtime
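For context (my note, not the poster's): 2^-52 is exactly the machine epsilon of IEEE-754 double precision, so the Fixed-Point Hybrid Number System datapath matches the accuracy of the 64-bit floating-point arithmetic RAxML uses in software. A two-line C check:

#include <float.h>
#include <stdio.h>

int main(void)
{
    /* DBL_EPSILON is 2^-52 for IEEE-754 doubles; 0x1p-52 is the
     * C99 hexadecimal floating literal for the same value. */
    printf("%.17g %.17g\n", DBL_EPSILON, 0x1p-52);  /* both print 2.2204460492503131e-16 */
    return 0;
}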
Results
[Figure: Variation of partition dispersion and function communication latency across different NoC architectures, plotted for newviewGTRCAT (f2), coreGTRCAT (f3) and newviewGTRGAMMA (f6). Series: average dispersion of each partition (avg. dispersion in 2D_serial and 2D_parallel; avg. XYZ dispersion in 3D_torus and 3D_sttorus), together with the total and residual communication latency for each architecture.]
[Figure: Total dispersion across different NoC architectures.]
[Figure: Function-level speedup across different NoC architectures.]
Results (cont'd.)
[Figure: Average aggregate speedup of the accelerated kernels across different NoC architectures, shown separately for test cases with fewer (avg. 15.67) partitions and test cases with a larger number (avg. 23.33) of partitions.]
[Figure: Total system energy consumption (uJ) across different NoC architectures, for the same two groups of test cases.]

Total run-times for different inputs using different NoC-based platforms vis-à-vis only software (all times in seconds):

Input data (DNA) | NoC         | Software run-time | Accelerated kernels | Allocation | PCIe interface | Total using NoC accelerator | Unaccelerated software
50_5000          | 2D_serial   | 292.000444        | 0.515478            | 0.130387   | 0.145065       | 292.791374                  | 924.052039
50_5000          | 2D_parallel | 292.000444        | 0.481303            | 0.104805   | 0.145065       | 292.731617                  | 924.052039
50_5000          | 3D_torus    | 292.000444        | 0.433625            | 0.050889   | 0.145065       | 292.630024                  | 924.052039
50_5000          | 3D_sttorus  | 292.000444        | 0.474657            | 0.050889   | 0.145065       | 292.671056                  | 924.052039
500_5000         | 2D_serial   | 7038.847538       | 19.1142             | 8.467062   | 8.273363       | 7074.702162                 | 37124.7233
500_5000         | 2D_parallel | 7038.847538       | 18.04733            | 6.805803   | 8.273363       | 7071.974034                 | 37124.7233
500_5000         | 3D_torus    | 7038.847538       | 16.766102           | 3.304655   | 8.273363       | 7067.191658                 | 37124.7233
500_5000         | 3D_sttorus  | 7038.847538       | 18.102936           | 3.304655   | 8.273363       | 7068.528491                 | 37124.7233

(The "Total using NoC accelerator" column is the sum of the four preceding time columns.)
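As a quick sanity check (mine, not from the poster), the snippet below recomputes the totals and the overall speedups from two rows of the table; it confirms that the "Total using NoC accelerator" column is the sum of the four preceding time columns, and that the 500_5000 input sees a better-than-5x overall run-time reduction (37124.7233 / 7067.191658 is about 5.25).

#include <stdio.h>

int main(void)
{
    /* One row per platform, copied from the table above:
     * software, kernels, allocation, PCIe, unaccelerated baseline. */
    struct row { const char *noc; double sw, kern, alloc, pcie, base; };
    struct row r[] = {
        { "2D_serial (50_5000)",  292.000444,  0.515478, 0.130387, 0.145065,   924.052039 },
        { "3D_torus (500_5000)", 7038.847538, 16.766102, 3.304655, 8.273363, 37124.7233   },
    };
    for (int i = 0; i < 2; i++) {
        double total = r[i].sw + r[i].kern + r[i].alloc + r[i].pcie;
        /* Prints 292.791374 s (3.16x) and 7067.191658 s (5.25x). */
        printf("%-22s total %.6f s, speedup %.2fx\n",
               r[i].noc, total, r[i].base / total);
    }
    return 0;
}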
Summary
•Network-on-Chip (NoC) based multi-core platform for accelerating Maximum Likelihood (ML) based phylogeny reconstruction
•Chief contributions:
→design of a fine-grained parallel PE architecture
→a novel algorithm to dynamically allocate nodes to tasks based on Hilbert space-filling curves
→design and extensive evaluation of different 2-D and 3-D NoC architectures
•Function-level speedup of ~847x, aggregate speedup of the accelerated portion of up to ~6,500x, and an overall run-time reduction of more than 5x over multithreaded software; this exceeds the performance of all state-of-the-art hardware accelerators for this application class.