A Programmable Parallel Accelerator for Learning and Classification
Hari Cadambi, Abhinandan Majumdar, Michela Becchi, Srimat Chakradhar and Hans Peter Graf
NEC Laboratories America, Princeton, New Jersey, USA
www.nec-labs.com
Our Target Application Domain: Examples
Intelligent processing of large amounts of data. Learning and classification are used increasingly.
[Images: SEMANTIC SEARCH; FACE RECOGNITION IN A CROWD]
Motivation for Accelerator
• Wider range of apps: digital pathology, cognitive databases…
• Increasing data → increasing computational load
• Stringent performance constraints
– Semantic Search → search millions of documents in a few milliseconds
– Crowd face tracking → analyze VGA+ moving images in real time
• These trends justify investigating specialized architectural support for these workloads
This Paper
• Proposed MAPLE, an accelerator with new architectural features for learning/classification
• Proposed a programming model and mapping strategy
• What MAPLE is not:
– A general-purpose engine
– A replacement for GPUs or any other processor
• What this paper tries to do:
– Suggest and evaluate new architectural features for these workloads, and show how they could be programmed
– Could an existing processor (CPU or GPU) use these features? Maybe…
How We Went About Designing MAPLE
Profiled representative workloads → identified computational bottlenecks → identified structure and a common set of primitives → architected MAPLE → simulator and architectural exploration → prototype
Workload Analysis
Applications: Semantic Search; Image Segmentation, Recognition; Digital Pathology

Algorithm   Computational Bottleneck
SSI         Dot-products and array ranking: 99%
CNN         1D, 2D, 3D convolutions: 99%
K-means     Minimum Euclidean distance: 96%
SVM         Large matrix-vector multiplication: 85-95%
GLVQ        Minimum Euclidean distance: 99%
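To make these bottlenecks concrete, here is a minimal NumPy sketch of two of the profiled kernels; the function names and array shapes are illustrative assumptions, not from the slides.

```python
import numpy as np

def ssi_bottleneck(docs, queries, k):
    """SSI hotspot: dot-products of every document with every query,
    then array ranking to keep the top-k documents per query."""
    scores = docs @ queries                  # dense dot-products, N x Q
    return np.argsort(-scores, axis=0)[:k]   # top-k doc indices per query

def kmeans_bottleneck(points, centroids):
    """K-means hotspot: minimum Euclidean distance from each point
    to every centroid."""
    diffs = points[:, None, :] - centroids[None, :, :]   # N x C x F
    return np.linalg.norm(diffs, axis=2).argmin(axis=1)  # nearest centroid
```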
But is there a common structure?
Common Structure: Example
• Semantic Search: Given N documents and Q concurrent queries, find the top K matching documents
• Computational bottleneck:

[Diagram: DOC 1 … DOC N and QUERY 1 … QUERY Q feed a MATMUL; its large INTERIM RESULT (LARGE: 1-2 GB for 2-4M docs) feeds an ARRAY RANK that produces OUT 1 … OUT Q]

• Common structure: dense LA → large intermediate result → second operation (array rank, min/max, sum)
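A rough sketch of why the interim result gets so large (NumPy, with sizes chosen to echo the 2M-document experiment quoted later; this is an illustration, not the paper's implementation):

```python
import numpy as np

N, Q, F, K = 2_000_000, 32, 128, 128      # docs, queries, features, top-K
docs = np.random.rand(N, F).astype(np.float32)      # the document matrix
queries = np.random.rand(F, Q).astype(np.float32)   # the query matrix

scores = docs @ queries                    # first op: dense LA, N x Q interim
# at N = 2M, Q = 32, fp32 this interim alone is ~256 MB; with more
# queries it reaches the 1-2 GB range quoted above
top_k = np.argsort(-scores, axis=0)[:K]    # second op: array rank per query
```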
MAPLE Architecture
[Diagram: the common structure (operand A and operand B → DENSE LA → LARGE INTERIM RESULT → SECOND OP: array rank, find min/max, sum → REDUCED RESULT) mapped onto MAPLE: operand A is streamed from off-chip memory, operand B is distributed into on-chip memory, the dense LA runs on a 2D PE array, the second op runs in distributed smart memory, and the reduced result is streamed back to off-chip memory]

Chaining the first and second ops:
1. In-memory processing of the second op
2. Operates in parallel with the PE array
3. Reduces off-chip writes
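A software approximation of the chaining idea (my sketch, not the hardware): the first op produces the interim result one row at a time, and each row is folded immediately into an incremental second op, so the large interim result never has to be written off-chip.

```python
import heapq
import numpy as np

def chained_matmul_rank(docs, queries, k):
    """Stream operand A (docs) a row at a time; keep operand B (queries)
    resident on-chip; fold each interim row into running top-k heaps
    (the smart-memory role) so the full N x Q interim never exists."""
    num_queries = queries.shape[1]
    heaps = [[] for _ in range(num_queries)]   # one top-k heap per query
    for i, doc in enumerate(docs):             # stream operand A
        row = doc @ queries                    # first op: one interim row
        for q in range(num_queries):           # second op, "in memory"
            heapq.heappush(heaps[q], (row[q], i))
            if len(heaps[q]) > k:
                heapq.heappop(heaps[q])        # drop everything below top-k
    return [sorted(h, reverse=True) for h in heaps]
```

In MAPLE the reduction runs in the smart memory while the PE array computes the next row, so the two ops overlap instead of running back to back.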
PE Array and Smart Memory
• Simple vector processing elements, each paired with a local memory (MEM)
• All PEs operate in SIMD mode
• PEs are organized into chains, each terminating in a SMART MEMORY
• Operand A streams in from off-chip memory; results stream back out

[Diagram: a 2D array of PE/MEM pairs, with each row of the array feeding a smart memory block]
An Example: Semantic Search
Setup: 4 documents, 4 concurrent queries, need the best 2 matches. Each PE's local memory holds one query column; the documents sit in MAPLE's off-chip memory; the smart memory keeps the top 2 scores per query.

[Diagram sequence, one slide per step:]
1. BROADCAST DOC 1 to all PEs.
2. GENERATE ROW 1 of the interim result (each PE computes the dot-product of DOC 1 with its query).
3. SEND ROW 1 to the smart memory; BROADCAST DOC 2.
4. GENERATE ROW 2.
5. BROADCAST DOC 3; the smart memory starts ranking the rows it has received, in parallel with the PE array.
6. After DOC 4 is processed, the smart memory holds the top-2 matches for every query.
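A toy simulation of this walkthrough (my sketch; the matrix values on the original slides are not fully recoverable, so the data here is made up): four PEs each hold one query column, documents are broadcast one per step, and every finished row is handed to a smart-memory ranker.

```python
import numpy as np

rng = np.random.default_rng(0)
docs = rng.integers(1, 10, size=(4, 2))      # 4 documents, 2 features each
queries = rng.integers(1, 10, size=(2, 4))   # 4 concurrent queries

top2 = [[] for _ in range(4)]                # smart memory: top-2 per query
for step, doc in enumerate(docs):            # broadcast one document per step
    row = [doc @ queries[:, pe] for pe in range(4)]   # SIMD: one score per PE
    for pe in range(4):                      # smart memory ranks the new row
        top2[pe] = sorted(top2[pe] + [(int(row[pe]), step)], reverse=True)[:2]

print(top2)   # best 2 (score, doc index) pairs for each query
```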
Alternate Mappings: More Resources
[Diagram: the same 4-document, 4-query, top-2 problem mapped onto two chains, with the query columns split between them and the documents streamed to both]
Performance doubled: each chain produces 2 output columns.
Alternate Mappings: Fewer Resources
[Diagram: the same problem with fewer resources: a query column does not fit in a single PE's memory, so each query column is split across two PE memories and the partial results are combined]
Specifying all this splitting, etc. could be a nightmare!
Typical Conversation with ML Domain Expert
• Us: "Do you want to program our accelerator?"
• Them: "Why would we do that?"
• Us: "Performance."
• Them: "We don't even like CUDA (or even C, for that matter). Why should we program your accelerator?"
• Us: "Um, because we're colleagues…?"
Programmers don't easily accept new accelerators!
Specifying Semantic Search

First Op = matmul
  Streaming matrix = A
  Streaming matrix rows = D    // number of documents
  Streaming matrix cols = …
  On-chip matrix = B
  On-chip matrix rows = …
  On-chip matrix cols = Q      // number of concurrent queries
Second Op = arrayRank(K)

MAPPING → ASSEMBLY GENERATION → MAPLE ASSEMBLY
Given the problem size and architecture parameters, the compiler automatically does the data mapping. The compiler can also explore the design space.
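As a sketch of what driving this specification might look like from the host, assuming a thin Python wrapper: MapleSpec and its methods are invented names for illustration, not a real API, and the numeric sizes are example values since the slide elides them.

```python
# Hypothetical host-side wrapper around the specification above;
# 'MapleSpec' and its methods are invented names, not a real API.
class MapleSpec:
    def __init__(self, first_op, second_op):
        self.first_op, self.second_op = first_op, second_op
        self.streaming, self.on_chip = {}, {}

    def set_streaming(self, name, rows, cols):
        self.streaming = {"name": name, "rows": rows, "cols": cols}

    def set_on_chip(self, name, rows, cols):
        self.on_chip = {"name": name, "rows": rows, "cols": cols}

# Semantic search: matmul chained with arrayRank(K)
spec = MapleSpec(first_op="matmul", second_op=("arrayRank", 128))
spec.set_streaming("A", rows=2_000_000, cols=128)   # D documents (example values)
spec.set_on_chip("B", rows=128, cols=32)            # Q queries (example values)
# a compiler pass would take (spec, architecture parameters), perform
# the data mapping automatically, and emit MAPLE assembly
```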
Architectural Design Choices
[Diagrams: three example configurations, each streaming operand A from off-chip memory into chains of PE/MEM pairs ending in smart memory]
• 2 chains, 2 PEs per chain
• 1 chain, 4 PEs per chain
• 2 off-chip memory banks, 2 chains per core, 2 PEs per chain
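The design space the compiler can explore is essentially the cross-product of these parameters. A minimal enumeration sketch follows; only the 512-PE budget (from the prototype slide) is grounded, and the candidate parameter values are assumed.

```python
from itertools import product

TOTAL_PES = 512   # PE budget of the prototype (next slide)

# Enumerate (off-chip banks, chains, PEs per chain) design points that
# fit the PE budget; a real compiler pass would rank them with a cost model.
design_points = [
    (banks, chains, pes)
    for banks, chains, pes in product((1, 2), (1, 2, 4, 8, 16, 32, 64), (2, 4, 8, 16))
    if chains * pes <= TOTAL_PES
]
print(len(design_points), "candidate configurations")
```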
Prototype for Experiments
• MAPLE on a Virtex-5 FPGA: 512 PEs at 125 MHz, 64 chains, 2 memory banks
• Host: 2.5 GHz quad-core Xeon, with an API for transferring assembly, matrix data, …
• Comparison GPU: Tesla, 1.3 GHz, 128 cores
Results for Semantic Search
[Chart: milliseconds per query vs. number of documents (256K, 512K, 1M), for 32 concurrent queries ranking the top 32; 2.5 GHz quad-core Xeon (4 threads) vs. the MAPLE prototype (125 MHz)]

At 2M documents, 32 concurrent queries, top K = 128:
• 2.5 GHz 4-core Xeon: 52 ms/query
• C870 Tesla GPU: 11.4 ms/query
• MAPLE prototype: 3.76 ms/query

Why is MAPLE faster?
• PE-smart memory chaining: performs the first and second ops in parallel
• In-memory processing: fewer off-chip accesses
Results for Conv. Neural Networks
[Chart: milliseconds per frame for a face-recognition CNN; 2.5 GHz quad-core Xeon (4 threads) vs. the MAPLE prototype (125 MHz)]
• 2.5 GHz 4-core Xeon: 6 fps
• C870 Tesla GPU: 9.5 fps
• MAPLE prototype: 13 fps
Results for SVM Training
• Compared to an optimized GPU implementation from UCB, the MAPLE prototype is 2-6x slower
• Why?
– SVM training is dominated by a large matrix-vector multiplication, which is memory-bandwidth limited: roughly 1 compute op per fetch
– MAPLE cannot match the GPU's memory bandwidth (GDDR5, etc.)
– But MAPLE is only an FPGA prototype, while the GPU is a custom processor
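A back-of-the-envelope sketch of the bandwidth argument (my arithmetic; the bandwidth figures are illustrative, not measured):

```python
# Matrix-vector multiply y = A @ x: each element of A is fetched once and
# used in exactly one multiply-accumulate, so the arithmetic intensity is
# ~2 flops per 4-byte fetch = 0.5 flops/byte.
FLOPS_PER_BYTE = 2 / 4

# Achievable throughput is then capped by memory bandwidth, not PE count.
# Bandwidth numbers below are illustrative, not measured.
for name, gb_per_s in [("FPGA prototype DRAM", 8), ("GPU GDDR", 80)]:
    print(f"{name}: <= {gb_per_s * FLOPS_PER_BYTE:.0f} GFLOP/s for MVM")
```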
Summary and Conclusions
• Looked into new architectural features for learning and classification
– Systematically analyzed representative workloads
– Identified bottlenecks and a common structure
• Prototyped the accelerator system; showed promising speedups
• Future:
– Use of such accelerators in low-power systems (e.g., with Atom as the host processor)
– Embedded learning and classification
Thank You!
Questions
• Memory model
• The GPU has 128 PEs, MAPLE has 512 – fair comparison?
• Specification holes?
• What about other apps besides SSI? K-means? CNN? Other?
• Could I not rewrite my application (on a GPU) to avoid interim storage?
• How much of the perf win comes from computation, and how much from reducing memory accesses?
• What fraction of peak is achieved in each case?
• How were the CPU and GPU optimized?
• Reviewer comments