Application-Specific Logic-in-Memory for Polar Format Synthetic Aperture Radar
Qiuling Zhu, Eric L. Turner, Christian R. Berger, Larry Pileggi, Franz Franchetti
September 22, 2011
Application-Specific Logic-in-Memory
Can we push memory-intensive computational logic into, or close to, the memory by constructing a smart and efficient "logic-in-memory" block?
[Figure: memory hierarchy (CPU, local memory, main memory) in the traditional design and in the logic-in-memory design, where "logic" blocks are added]
Slide 2
Enabling Technology: Regular Patterns
Implementing sub-22nm designs using a limited set of pattern constructs can enable robust compilation of smart memories.
[Figure: regular patterns; SRAM bitcell and compatible logic cells forming an application-specific "magic" memory]
Reference: D. Morris et al., "Design of Embedded Memory and Logic Based on Pattern Constructs", Symp. VLSI Technology, June 2011.
Slide 3
Tool Chain: Chip Generator and Memory Compiler
[Figure: chip generator and smart memory compiler producing the application-specific logic-in-memory (local memory plus logic) from SRAM bitcells and compatible logic cells]
Chip Generator
• Generates designs from high-level parameterization and specification
• Utilizes Stanford's chip generator platform (Genesis 2)
Smart Memory Compiler
• Maps memory and logic onto a set of pre-characterized pattern constructs
• Allows flexible synthesis of logic and memory functionalities in place of hard IP
Slide 6
Big Question: Impact on Algorithms
Logic-in-memory changes the relative cost of operations, requiring new types of algorithms.
Traditional
• Data storage and processing are logically and physically split
• Algorithms are optimized w.r.t. cost measures such as operation count, minimum number of memory accesses, reuse, … (e.g. FFT: O(log n), matrix multiplication: O(n))
Logic-in-memory
• Local data dependency
• Regular memory access pattern
• Simple computational logic
• Cost measure changes
Slide 7
Case Study: Interpolation Memory
• Ex 1: FFT twiddle factor
• Ex 2: Image pyramid memory (levels k, k-1, k-2)
• Ex 3: Geometry transformation
• Ex 4: Tomography backprojection
[Figure: ALU blocks, image pyramid levels, and the original phantom image]
Slide 8
Outline
• SAR Polar Format Algorithms for Logic-in-Memory
• Extension: Partial Reconstruction
• Implementation and Design Automation
• Experimental Results
• Summary
Slide 9
Synthetic Aperture Radar (SAR)
• Data acquisition
• Image formation: interpolation and 2D FFT
[Figure: SAR data acquisition and SAR image formation]
Slide 11
FFT Upsampling Based Polar Reformatting
[Figure: polar reformatting pipeline, grid interpolation followed by inverse 2D FFT; array dimensions labeled m1 and n2]
SAR image formation:
• Range interpolation (FFT upsampling based)
• Cross range interpolation
• 2D inverse FFT
Computational cost:
• Interpolation: $10 \cdot l \cdot m_1 \cdot (m \log_2 m + n \log_2 n)$
• 2D IFFT: $10 \cdot n_2^2 \cdot \log_2 n_2$
Here l is the number of segments per range line, m is the input segment size, and n is the size of the upsampled output segment.
Data transferring cost:
[Figure: interpolation data moving between memory and CPU]
Logic-in-Memory Interpolation
• Needs new algorithm
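As a rough illustration of the two cost formulas above, the following Python sketch evaluates them directly. The function names and the sample values in the usage note are hypothetical; only the formulas themselves come from the slide.

```python
import math

def interpolation_ops(l, m1, m, n):
    """Operation count of FFT-upsampling-based interpolation:
    10 * l * m1 * (m*log2(m) + n*log2(n)), with the symbols as on the slide."""
    return 10 * l * m1 * (m * math.log2(m) + n * math.log2(n))

def ifft2d_ops(n2):
    """Operation count of the n2 x n2 inverse 2D FFT: 10 * n2^2 * log2(n2)."""
    return 10 * n2 ** 2 * math.log2(n2)
```

For instance, `ifft2d_ops(1024)` evaluates 10 · 1024² · 10, and comparing it with `interpolation_ops(...)` for a chosen (assumed) segmentation shows which stage dominates the operation count.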
Slide 14
Local Interpolation Based Polar Reformatting
Approach: direct local interpolation
• Finding neighbors is expensive
• sqrt and atan operations are expensive in logic-in-memory
[Figure: output point P(x,y); grid points in the curvilinear grid (measurements) and grid points in Cartesian space (outputs)]
Slide 15
Local Interpolation Based Polar Reformatting
[Figure: output point P(x,y) with offsets dx and dy; grid points in the curvilinear grid (measurements) and grid points in Cartesian space (outputs); operation labels (+, -, ×, …) and sqrt, atan, …]
Steps:
Coordinate transformation
• Four-corner image perspective geometric transformation
• Avoid sqrt and atan
2D surface interpolation
• Simple logic computation
• Bilinear, bicubic, …
Slide 16
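One common way to realize a four-corner perspective mapping, as in the coordinate-transformation step above, is a projective transform (homography) fitted to four corner correspondences. The sketch below assumes that form; the slides do not give the exact equations, and the function names are hypothetical. Once the eight coefficients are known, mapping a point needs only multiplies, adds, and one division, with no sqrt or atan.

```python
import numpy as np

def homography_from_corners(src, dst):
    """Solve the 3x3 projective matrix H (with h33 = 1) that maps the
    four src corners (x, y) onto the four dst corners (u, v)."""
    A, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        # u = (h11*x + h12*y + h13) / (h31*x + h32*y + 1), same pattern for v
        A.append([x, y, 1, 0, 0, 0, -u * x, -u * y]); b.append(u)
        A.append([0, 0, 0, x, y, 1, -v * x, -v * y]); b.append(v)
    h = np.linalg.solve(np.array(A, dtype=float), np.array(b, dtype=float))
    return np.append(h, 1.0).reshape(3, 3)

def apply_homography(H, x, y):
    """Map one output coordinate to source coordinates
    using multiplies, adds, and a single division."""
    u, v, w = H @ np.array([x, y, 1.0])
    return u / w, v / w
```

In a tiled setting, one such matrix could be computed per tile from its four corners, so the expensive exact coordinate computation is needed only at tile corners.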
2D Interpolation
[Figure: neighborhoods of the output point P(x,y) with offsets dx and dy. Nearest neighbor uses the point (i, j); bilinear interpolation uses the 2×2 block (i, j) to (i+1, j+1); bicubic interpolation uses the 4×4 block (i-1, j-1) to (i+2, j+2)]
Dividable 2D interpolation
• Bilinear: (2 horizontal + 1 vertical) 1D interpolations
• Bicubic: (4 horizontal + 1 vertical) 1D interpolations
• 1D interpolation: polynomial interpolation in Newton divided-difference form
Suitable for Logic in Memory
• Localized computation: outputs are determined only by their neighbors
• Regular memory access: continuous or block data array access
• Simple computational logic: adders, subtractors, Boolean operations, …
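The decomposition above can be sketched in Python as follows. This is a minimal illustration, not the authors' hardware design: `newton_interp` is a textbook Newton divided-difference evaluation, and `interp2d_separable` (a hypothetical name) runs `order` horizontal 1D interpolations followed by one vertical one. With `order=4` it fits a cubic polynomial through four samples per direction, which is one possible reading of "bicubic" here.

```python
import math

def newton_interp(xs, ys, x):
    """Evaluate the polynomial through the points (xs, ys) at x
    using Newton's divided-difference form."""
    coef = list(ys)
    n = len(xs)
    # Build the divided-difference coefficients in place.
    for j in range(1, n):
        for i in range(n - 1, j - 1, -1):
            coef[i] = (coef[i] - coef[i - 1]) / (xs[i] - xs[i - j])
    # Horner-style evaluation of the Newton form.
    result = coef[-1]
    for i in range(n - 2, -1, -1):
        result = result * (x - xs[i]) + coef[i]
    return result

def interp2d_separable(grid, x, y, order=2):
    """Interpolate grid[row][col] at column x, row y.
    order=2: 2 horizontal + 1 vertical 1D interpolations (bilinear).
    order=4: 4 horizontal + 1 vertical 1D interpolations (cubic per axis).
    The caller must keep the neighborhood inside the grid."""
    i0 = math.floor(y) - (order // 2 - 1)
    j0 = math.floor(x) - (order // 2 - 1)
    rows = list(range(i0, i0 + order))
    cols = list(range(j0, j0 + order))
    horizontal = [newton_interp(cols, [grid[i][j] for j in cols], x) for i in rows]
    return newton_interp(rows, horizontal, y)
```

Each output depends only on a small block of neighboring samples, which is the localized, regular access pattern the slide identifies as suitable for logic in memory.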
Slide 17
Tiling: Accurate Geometry Approximation
Geometry approximation conditions:
• deltawidth is small enough
• RL is large enough
Solution: Image tiling
• Tile in the Cartesian grid
• Output-oriented tiling
• Easy to identify boundary and tile overlap
[Figure: geometry approximation error with labels K, RL, and deltawidth; output image split into Tile1 to Tile4]
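Output-oriented tiling can be sketched as a simple partition of the Cartesian output grid. The helper below is a hypothetical illustration: the name, the square tile layout, and the `overlap` parameter are assumptions, not details from the slides.

```python
def output_tiles(height, width, tiles_per_side, overlap=0):
    """Split a height x width output grid into tiles.
    Returns (row_start, row_end, col_start, col_end) per tile, end-exclusive,
    with each tile grown by `overlap` pixels and clipped to the image."""
    tile_h = -(-height // tiles_per_side)  # ceiling division
    tile_w = -(-width // tiles_per_side)
    tiles = []
    for r in range(0, height, tile_h):
        for c in range(0, width, tile_w):
            tiles.append((max(r - overlap, 0), min(r + tile_h + overlap, height),
                          max(c - overlap, 0), min(c + tile_w + overlap, width)))
    return tiles
```

Because the tiles are defined on the regular output grid, their boundaries and overlaps are plain index ranges; the geometry approximation is then fitted per tile.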
Slide 18
SAR Partial Reconstruction
Scenario: big image, small screen, pan-and-zoom (e.g. handheld device)
• Bad approach: reconstruct everything, display only the region of interest
• Better: reconstruct only what will be displayed; this requires sophisticated filtering before reconstruction
[Figure: image data 10,000 × 10,000; display 800 × 600]
Partial image formation: interpolation + filtering, 2D FFT
Slide 20
Partial Reconstruction I
Reconstructs and displays a low-resolution full-size image
• Traditional: interpolate all, full-size large IFFT, then decimation
• Alternative: partial interpolation, then smaller-size IFFT
• Theory behind: multiplication in the frequency domain is equivalent to convolution in the spatial domain
[Figure: cutting off high frequencies in Fourier space corresponds to low-pass filtering in the spatial domain; only the pixels that are required are computed, using smaller-size interpolation and a smaller-size IFFT]
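The equivalence behind Partial Reconstruction I can be checked numerically. The NumPy sketch below assumes a square spectrum whose side N is divisible by the decimation factor D, with N/D even, and uses an ideal (brick-wall) low-pass filter; the function names are hypothetical. Cropping the low-frequency block and running a small IFFT gives the same pixels as zeroing the high frequencies, running the full IFFT, and keeping every D-th pixel, up to a constant factor of D².

```python
import numpy as np

def small_ifft_lowres(spectrum, D):
    """Crop the centred low-frequency block and run an IFFT that is
    D times smaller along each axis."""
    N = spectrum.shape[0]
    K = N // D
    lo, hi = N // 2 - K // 2, N // 2 + K // 2
    centred = np.fft.fftshift(spectrum)          # zero frequency at index N/2
    return np.fft.ifft2(np.fft.ifftshift(centred[lo:hi, lo:hi]))

def full_ifft_then_decimate(spectrum, D):
    """Reference path: zero the high frequencies, run the full-size IFFT,
    then keep every D-th pixel."""
    N = spectrum.shape[0]
    K = N // D
    lo, hi = N // 2 - K // 2, N // 2 + K // 2
    centred = np.fft.fftshift(spectrum)
    kept = np.zeros_like(centred)
    kept[lo:hi, lo:hi] = centred[lo:hi, lo:hi]
    full = np.fft.ifft2(np.fft.ifftshift(kept))
    return full[::D, ::D]
```

With NumPy's IFFT normalization, `small_ifft_lowres(S, D) / D**2` matches `full_ifft_then_decimate(S, D)`, while the small path only ever touches an (N/D) × (N/D) array.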
Slide 21
Partial Reconstruction II
Reconstructs and displays a high-resolution image portion
• Traditional: full-size large IFFT, reconstruct all, then cut off the unnecessary region
• Alternative: decimation filtering, then smaller-size IFFT
• Theory behind: multiplication in the spatial domain is equivalent to convolution in the Fourier domain; a displacement in one domain is equivalent to a phase shift in the other
[Figure: FFT, sample, ROI; logic in memory performs the interpolation and decimation filtering, followed by the smaller IFFT]
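A 1D toy example in NumPy illustrates the two facts this approach relies on. It is a sketch under simplifying assumptions (signal length divisible by D, integer shifts, hypothetical function names), not the authors' pipeline. First, keeping only every D-th frequency sample and running a smaller IFFT returns the sum of D image segments folded on top of each other, which is why a decimation filter must suppress everything outside the ROI beforehand. Second, a phase ramp in the frequency domain moves the ROI to the desired position.

```python
import numpy as np

def subsampled_ifft(x, D):
    """Keep every D-th frequency sample, then run an IFFT that is D times smaller."""
    return np.fft.ifft(np.fft.fft(x)[::D])

def folded(x, D):
    """Sum of the D consecutive segments of x (spatial aliasing)."""
    M = len(x) // D
    return np.asarray(x).reshape(D, M).sum(axis=0)

def shift_by_phase(x, s):
    """Circularly shift x by s samples via a phase ramp in the frequency domain."""
    N = len(x)
    k = np.arange(N)
    return np.fft.ifft(np.fft.fft(x) * np.exp(-2j * np.pi * k * s / N))
```

Here `subsampled_ifft(x, D)` equals `folded(x, D)`, and `shift_by_phase(x, s)` equals `np.roll(x, s)` up to rounding error.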
Slide 22
Decimation Filter Implementation
• FIR polyphase filter is expensive at high decimation factors
• Cascaded integrator-comb (CIC) filter is more economical: it suits large decimation factors and needs no multiplication
• CIC compensation is required
[Figure: CIC filter structure (M=1, N=4): input inp, integrator stages (z^-1), downsampling by R, comb stages (z^-M), output outp]
[Figure: frequency response, magnitude (dB) vs. frequency (Hz), for the CIC filter, the CIC compensator (ciccomp), and their cascade]
CIC spec: decimation factor = 16; N = 4; M = 1
CIC compensator spec: Fp = 0.45; Fst = 0.55; Ap = 0.1 dB; Ast = 35 dB; 45 stages; downsample = 2; total decimation factor = 32
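The CIC structure described above (N integrators, downsampling by R, N combs with differential delay M) can be modeled in a few lines of Python. This is a behavioral sketch with a hypothetical function name, using the slide's spec values as defaults; it uses 64-bit integers rather than the fixed-width wraparound arithmetic of a hardware implementation, and it omits the compensation filter.

```python
import numpy as np

def cic_decimate(x, R=16, N=4, M=1):
    """Behavioral model of a CIC decimator: only additions and subtractions.
    R: decimation factor, N: number of stages, M: differential delay (M >= 1)."""
    y = np.asarray(x, dtype=np.int64)
    for _ in range(N):                 # integrator stages at the input rate
        y = np.cumsum(y)
    y = y[R - 1::R]                    # keep every R-th sample
    for _ in range(N):                 # comb stages at the output rate
        delayed = np.concatenate((np.zeros(M, dtype=np.int64), y[:-M]))
        y = y - delayed
    return y
```

For a constant input of 1, the output settles to the DC gain (R·M)^N, which is 16⁴ = 65536 for the spec above; this gain is usually removed by a shift or scaling stage.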
Slide 23
Design Automation and Optimization
Hardware Structure
Design Automation Flow:
[Figure: flow diagram with the blocks customized parameters, code generator, design space exploration, RTL design (memory/logic mixed), smart memory compiler, target + budget, performance model, performance/cost report, and regular pattern]
Slide 25
Chip Generator http://genesis.web.ece.cmu.edu/gui/scratch/mydesign-10545.php
Reference: O. Shacham, O. Azizi, M. Wachs, et al., "Rethinking Digital Design: Why Design Must Change", IEEE Micro, Dec 2010.
Slide 26
Reconstruction Quality vs. FFT SAR
[Figure panels: perfect reconstruction of point targets (original, hermitian image); actual reconstruction algorithms (FFT-based, linear, cubic)]
Is FFT-based SAR better than interpolation-based SAR?
Slide 28
Can FFT and Interpolation Be Distinguished?
[Figure panels: nearest neighbor interpolation, FFT interpolation, bilinear interpolation, bicubic interpolation]
Answer: Hypothesis Testing
• Hypothesis testing for linear and FFT: P(Error) = 0.495
• Random guessing: P(Error) = 0.5
Results are statistically indistinguishable. Interpolation is as good as FFT.
Slide 29
Accuracy Improvement Through Tiling
[Figure: mean square error relative to the gold standard method (0 to 0.02) vs. interpolation method (nearest neighbor, bilinear, bicubic) for one tile, 4 tiles, and 16 tiles]
MSE decreases with more tiling and higher interpolation order
Slide 30
Energy Saving for Logic-in-Memory
[Figure: energy saving for SAR PFA grid interpolation; energy (nJ, log scale from 1.00E+00 to 1.00E+12) vs. SAR image size (32×32 to 512×512) for the CPU-centric and logic-in-memory designs]
Energy saving increases with problem size
Slide 31
Accurate Region-of-Interest by Sacrificing Border
[Figure: decimation filter hardware cost with ROI factors; area (1000 µm²) vs. region of interest (ROI, 0.5 to 1) at decimation factor = 2, for ast = 15, 20, 25, 30, and 35 dB, with the image error annotated]
ast: decimation filter stopband attenuation (dB)
The imperfect image edge results from the non-steep filter transition region
Slide 32
Partial Reconstruction: Operation Saving vs. Cost
[Figure: 2D IFFT computational cost vs. decimation factor; operation count (log scale, 1.00E+04 to 1.00E+10) vs. decimation factor (0 to 140), SAR image size = 4K×4K]
[Figure: logic-in-memory hardware cost; logic area / memory area (0 to 4.00E-04) vs. decimation factor (0 to 140) for grid interpolation alone and for grid interpolation + decimation filter with (Beta=0.3, Ast=25dB), (Beta=0.3, Ast=35dB), and (Beta=0.2, Ast=35dB)]
Beta: filter rolloff factor; Ast: decimation filter stopband attenuation (dB)
• IFFT operation count decreases exponentially with increasing decimation factor
• Logic hardware cost is negligible compared with memory cost
• Decimation filter cost increases slightly with increasing decimation factor
Slide 33
Summary
• Logic in memory and its applications for interpolation
• Logic in memory for SAR PFA and partial reconstruction
• Evaluation and integration with Genesis2
[Figures repeated from earlier slides: local memory with logic, image tiling (Tile1 to Tile4), CIC filter structure and magnitude response, and decimation filter hardware cost (area in 1000 µm² vs. decimation factor)]
Slide 35