A Programmable Parallel Accelerator for Learning and Classification
Hari Cadambi, Abhinandan Majumdar, Michela Becchi, Srimat Chakradhar and Hans Peter Graf
NEC Laboratories America, Princeton, New Jersey, USA
www.nec-labs.com
Our Target Application Domain: Examples
Intelligent processing of large amounts of data. Learning and classification are used increasingly.
[Images: SEMANTIC SEARCH; FACE RECOGNITION IN A CROWD]
Motivation for Accelerator
• Wider range of apps: digital pathology, cognitive databases…
• Increasing data → increasing computational load
• Stringent performance constraints
– Semantic Search → search millions of documents in a few milliseconds
– Crowd face tracking → analyze VGA+ moving images in real time
• These trends justify investigating specialized architectural support for these workloads
This Paper
• Proposed MAPLE, an accelerator with new architectural features for learning/classification
• Proposed a programming model and mapping strategy
• What MAPLE is not:
– A general-purpose engine
– A replacement for GPUs or any other processor
• What this paper tries to do:
– Suggest and evaluate new architectural features for these workloads, and show how they could be programmed
– Could an existing processor (CPU or GPU) use these features? Maybe…
How We Went About Designing MAPLE
Profiled representative workloads → identified computational bottlenecks → identified structure and a common set of primitives → architected MAPLE → simulator and architectural exploration → prototype
Workload Analysis
Applications: Semantic Search; Image Segmentation, Recognition; Digital Pathology

Algorithm   Computational Bottleneck
SSI         Dot-products and array ranking: 99%
CNN         1D, 2D, 3D convolutions: 99%
K-means     Minimum Euclidean distance: 96%
SVM         Large matrix-vector multiplication: 85-95%
GLVQ        Minimum Euclidean distance: 99%
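To make these bottlenecks concrete, here is a minimal NumPy sketch of two of the profiled kernels; the function names and array shapes are illustrative assumptions, not from the slides.

```python
import numpy as np

def ssi_bottleneck(docs, queries, k):
    """SSI hotspot: dot-products of every document with every query,
    then array ranking to keep the top-k documents per query."""
    scores = docs @ queries                  # dense dot-products, N x Q
    return np.argsort(-scores, axis=0)[:k]   # top-k doc indices per query

def kmeans_bottleneck(points, centroids):
    """K-means hotspot: minimum Euclidean distance from each point
    to every centroid."""
    diffs = points[:, None, :] - centroids[None, :, :]   # N x C x F
    return np.linalg.norm(diffs, axis=2).argmin(axis=1)  # nearest centroid
```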
But is there a common structure?
Common Structure: Example
• Semantic Search: Given N documents and Q concurrent queries, find the top K matching documents
• Computational bottleneck:

[Diagram: DOC 1 … DOC N and QUERY 1 … QUERY Q feed a MATMUL; its large INTERIM RESULT (LARGE: 1-2 GB for 2-4M docs) feeds an ARRAY RANK that produces OUT 1 … OUT Q]

• Common structure: dense LA → large intermediate result → second operation (array rank, min/max, sum)
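A rough sketch of why the interim result gets so large (NumPy, with sizes chosen to echo the 2M-document experiment quoted later; this is an illustration, not the paper's implementation):

```python
import numpy as np

N, Q, F, K = 2_000_000, 32, 128, 128      # docs, queries, features, top-K
docs = np.random.rand(N, F).astype(np.float32)      # the document matrix
queries = np.random.rand(F, Q).astype(np.float32)   # the query matrix

scores = docs @ queries                    # first op: dense LA, N x Q interim
# at N = 2M, Q = 32, fp32 this interim alone is ~256 MB; with more
# queries it reaches the 1-2 GB range quoted above
top_k = np.argsort(-scores, axis=0)[:K]    # second op: array rank per query
```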
MAPLE Architecture
[Diagram: the common structure (operand A and operand B → DENSE LA → LARGE INTERIM RESULT → SECOND OP: array rank, find min/max, sum → REDUCED RESULT) mapped onto MAPLE: operand A is streamed from off-chip memory, operand B is distributed into on-chip memory, the dense LA runs on a 2D PE array, the second op runs in distributed smart memory, and the reduced result is streamed back to off-chip memory]

Chaining the first and second ops:
1. In-memory processing of the second op
2. Operates in parallel with the PE array
3. Reduces off-chip writes
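A software approximation of the chaining idea (my sketch, not the hardware): the first op produces the interim result one row at a time, and each row is folded immediately into an incremental second op, so the large interim result never has to be written off-chip.

```python
import heapq
import numpy as np

def chained_matmul_rank(docs, queries, k):
    """Stream operand A (docs) a row at a time; keep operand B (queries)
    resident on-chip; fold each interim row into running top-k heaps
    (the smart-memory role) so the full N x Q interim never exists."""
    num_queries = queries.shape[1]
    heaps = [[] for _ in range(num_queries)]   # one top-k heap per query
    for i, doc in enumerate(docs):             # stream operand A
        row = doc @ queries                    # first op: one interim row
        for q in range(num_queries):           # second op, "in memory"
            heapq.heappush(heaps[q], (row[q], i))
            if len(heaps[q]) > k:
                heapq.heappop(heaps[q])        # drop everything below top-k
    return [sorted(h, reverse=True) for h in heaps]
```

In MAPLE the reduction runs in the smart memory while the PE array computes the next row, so the two ops overlap instead of running back to back.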
PE Array and Smart Memory
• Simple vector processing elements, each paired with a local memory (MEM)
• All PEs operate in SIMD mode
• PEs are organized into chains, each terminating in a SMART MEMORY
• Operand A streams in from off-chip memory; results stream back out

[Diagram: a 2D array of PE/MEM pairs, with each row of the array feeding a smart memory block]
An Example: Semantic Search
Setup: 4 documents, 4 concurrent queries, need the best 2 matches. Each PE's local memory holds one query column; the documents sit in MAPLE's off-chip memory; the smart memory keeps the top 2 scores per query.

[Diagram sequence, one slide per step:]
1. BROADCAST DOC 1 to all PEs.
2. GENERATE ROW 1 of the interim result (each PE computes the dot-product of DOC 1 with its query).
3. SEND ROW 1 to the smart memory; BROADCAST DOC 2.
4. GENERATE ROW 2.
5. BROADCAST DOC 3; the smart memory starts ranking the rows it has received, in parallel with the PE array.
6. After DOC 4 is processed, the smart memory holds the top-2 matches for every query.
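A toy simulation of this walkthrough (my sketch; the matrix values on the original slides are not fully recoverable, so the data here is made up): four PEs each hold one query column, documents are broadcast one per step, and every finished row is handed to a smart-memory ranker.

```python
import numpy as np

rng = np.random.default_rng(0)
docs = rng.integers(1, 10, size=(4, 2))      # 4 documents, 2 features each
queries = rng.integers(1, 10, size=(2, 4))   # 4 concurrent queries

top2 = [[] for _ in range(4)]                # smart memory: top-2 per query
for step, doc in enumerate(docs):            # broadcast one document per step
    row = [doc @ queries[:, pe] for pe in range(4)]   # SIMD: one score per PE
    for pe in range(4):                      # smart memory ranks the new row
        top2[pe] = sorted(top2[pe] + [(int(row[pe]), step)], reverse=True)[:2]

print(top2)   # best 2 (score, doc index) pairs for each query
```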
Alternate Mappings: More Resources
[Diagram: the same 4-document, 4-query, top-2 problem mapped onto two chains, with the query columns split between them and the documents streamed to both]
Performance doubled: each chain produces 2 output columns.
Alternate Mappings: Fewer Resources
[Diagram: the same problem with fewer resources: a query column does not fit in a single PE's memory, so each query column is split across two PE memories and the partial results are combined]
Specifying all this splitting, etc. could be a nightmare!
Typical Conversation with ML Domain Expert
• Us: "Do you want to program our accelerator?"
• Them: "Why would we do that?"
• Us: "Performance."
• Them: "We don't even like CUDA (or even C, for that matter). Why should we program your accelerator?"
• Us: "Um, because we're colleagues…?"
Programmers don't easily accept new accelerators!
Specifying Semantic Search

First Op = matmul
  Streaming matrix = A
  Streaming matrix rows = D    // number of documents
  Streaming matrix cols = …
  On-chip matrix = B
  On-chip matrix rows = …
  On-chip matrix cols = Q      // number of concurrent queries
Second Op = arrayRank(K)

MAPPING → ASSEMBLY GENERATION → MAPLE ASSEMBLY
Given the problem size and architecture parameters, the compiler automatically does the data mapping. The compiler can also explore the design space.
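As a sketch of what driving this specification might look like from the host, assuming a thin Python wrapper: MapleSpec and its methods are invented names for illustration, not a real API, and the numeric sizes are example values since the slide elides them.

```python
# Hypothetical host-side wrapper around the specification above;
# 'MapleSpec' and its methods are invented names, not a real API.
class MapleSpec:
    def __init__(self, first_op, second_op):
        self.first_op, self.second_op = first_op, second_op
        self.streaming, self.on_chip = {}, {}

    def set_streaming(self, name, rows, cols):
        self.streaming = {"name": name, "rows": rows, "cols": cols}

    def set_on_chip(self, name, rows, cols):
        self.on_chip = {"name": name, "rows": rows, "cols": cols}

# Semantic search: matmul chained with arrayRank(K)
spec = MapleSpec(first_op="matmul", second_op=("arrayRank", 128))
spec.set_streaming("A", rows=2_000_000, cols=128)   # D documents (example values)
spec.set_on_chip("B", rows=128, cols=32)            # Q queries (example values)
# a compiler pass would take (spec, architecture parameters), perform
# the data mapping automatically, and emit MAPLE assembly
```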
Architectural Design Choices
[Diagrams: three example configurations, each streaming operand A from off-chip memory into chains of PE/MEM pairs ending in smart memory]
• 2 chains, 2 PEs per chain
• 1 chain, 4 PEs per chain
• 2 off-chip memory banks, 2 chains per core, 2 PEs per chain
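The design space the compiler can explore is essentially the cross-product of these parameters. A minimal enumeration sketch follows; only the 512-PE budget (from the prototype slide) is grounded, and the candidate parameter values are assumed.

```python
from itertools import product

TOTAL_PES = 512   # PE budget of the prototype (next slide)

# Enumerate (off-chip banks, chains, PEs per chain) design points that
# fit the PE budget; a real compiler pass would rank them with a cost model.
design_points = [
    (banks, chains, pes)
    for banks, chains, pes in product((1, 2), (1, 2, 4, 8, 16, 32, 64), (2, 4, 8, 16))
    if chains * pes <= TOTAL_PES
]
print(len(design_points), "candidate configurations")
```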
Prototype for Experiments
• MAPLE on a Virtex-5 FPGA: 512 PEs at 125 MHz, 64 chains, 2 memory banks
• Host: 2.5 GHz quad-core Xeon, with an API for transferring assembly, matrix data, …
• Comparison GPU: Tesla, 1.3 GHz, 128 cores
Results for Semantic Search
[Chart: milliseconds per query vs. number of documents (256K, 512K, 1M), for 32 concurrent queries ranking the top 32; 2.5 GHz quad-core Xeon (4 threads) vs. the MAPLE prototype (125 MHz)]

At 2M documents, 32 concurrent queries, top K = 128:
• 2.5 GHz 4-core Xeon: 52 ms/query
• C870 Tesla GPU: 11.4 ms/query
• MAPLE prototype: 3.76 ms/query

Why is MAPLE faster?
• PE-smart memory chaining: performs the first and second ops in parallel
• In-memory processing: fewer off-chip accesses
Results for Conv. Neural Networks
[Chart: milliseconds per frame for a face-recognition CNN; 2.5 GHz quad-core Xeon (4 threads) vs. the MAPLE prototype (125 MHz)]
• 2.5 GHz 4-core Xeon: 6 fps
• C870 Tesla GPU: 9.5 fps
• MAPLE prototype: 13 fps
Results for SVM Training
• Compared to an optimized GPU implementation from UCB, the MAPLE prototype is 2-6x slower
• Why?
– SVM training is dominated by a large matrix-vector multiplication, which is memory-bandwidth limited: roughly 1 compute op per fetch
– MAPLE cannot match the GPU's memory bandwidth (GDDR5, etc.)
– But MAPLE is only an FPGA prototype, while the GPU is a custom processor
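A back-of-the-envelope sketch of the bandwidth argument (my arithmetic; the bandwidth figures are illustrative, not measured):

```python
# Matrix-vector multiply y = A @ x: each element of A is fetched once and
# used in exactly one multiply-accumulate, so the arithmetic intensity is
# ~2 flops per 4-byte fetch = 0.5 flops/byte.
FLOPS_PER_BYTE = 2 / 4

# Achievable throughput is then capped by memory bandwidth, not PE count.
# Bandwidth numbers below are illustrative, not measured.
for name, gb_per_s in [("FPGA prototype DRAM", 8), ("GPU GDDR", 80)]:
    print(f"{name}: <= {gb_per_s * FLOPS_PER_BYTE:.0f} GFLOP/s for MVM")
```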
Summary and Conclusions
• Looked into new architectural features for learning and classification
– Systematically analyzed representative workloads
– Identified bottlenecks and a common structure
• Prototyped the accelerator system; showed promising speedups
• Future:
– Use of such accelerators in low-power systems (e.g., with Atom as the host processor)
– Embedded learning and classification
Thank You!
Questions
• Memory model
• The GPU has 128 PEs, MAPLE has 512 – fair comparison?
• Specification holes?
• What about other apps besides SSI? K-means? CNN? Other?
• Could I not rewrite my application (on a GPU) to avoid interim storage?
• How much of the perf win comes from computation, and how much from reducing memory accesses?
• What fraction of peak is achieved in each case?
• How were the CPU and GPU optimized?
• Reviewer comments