Deep Learning for Mobile Part II Instructor - Simon Lucey 16-623 - Designing Computer Vision Apps
Today • Motivation • XNOR Networks • YOLO
State of the Art Recognition Methods
State of the Art Recognition Methods
• Very expensive: Memory, Computation, Power
Convolutional Neural Networks
Common Deep Learning Packages • Deep learning packages used include: • Caffe (out of Berkeley, the first popular package). • MatConvNet (MATLAB interface, very easy to use). • Torch (based on Lua, used by Facebook). • TensorFlow (based on Python, used by Google).
TensorFlow
TensorFlow
Number of Operations:
• AlexNet → 1.5B FLOPs
• VGG → 19.6B FLOPs
Inference time on CPU:
• AlexNet → ~3 fps
• VGG → ~0.25 fps
GPU!
Rastegari, Mohammad, et al. "XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks." ECCV 2016
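As a rough back-of-the-envelope check (my own numbers, not from the slide): dividing the per-image FLOP counts above by an assumed effective CPU throughput reproduces frame rates in the same ballpark as those quoted.

```python
# Rough sanity check of the FLOPs vs. CPU frame-rate numbers above.
# The ~5 GFLOP/s effective throughput is an assumption for illustration,
# not a measured value from the lecture or the paper.
flops_per_image = {"AlexNet": 1.5e9, "VGG": 19.6e9}
effective_throughput = 5e9  # assumed sustained FLOP/s on a mobile-class CPU

for net, flops in flops_per_image.items():
    fps = effective_throughput / flops
    print(f"{net}: ~{fps:.2f} fps")  # AlexNet ~3.3 fps, VGG ~0.26 fps
```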
TensorFlow iOS
Accelerate Framework
• “image operations”
• “signal processing”
• “matrix operations”
• “misc math”
• BNNS (2016): “basic neural network subroutines”
(Taken from https://www.bignerdranch.com/blog/neural-networks-in-ios-10-and-macos/)
Deep Learning Kit
http://deeplearningkit.org/
Today • Motivation • XNOR Networks • YOLO
Lower Precision
Pruning the network [Han et al. 2016]: Train on Dense (D) → Train on Sparse (S) → re-train Dense (D).
[Figure (Han et al. 2016): weight-distribution histograms (count vs. weight value) for the original GoogleNet, the sparsity-constrained (pruned) network, and the re-trained dense network. The final dense step re-initializes the previously pruned connections and re-trains with a reduced learning rate.]
Reducing Precision
• Saving memory
• Saving computation
32-bit → 8-bit → 1-bit: with weights in {−1, +1} (stored as {0, 1} bits), MUL becomes XNOR and ADD/SUB becomes Bit-Count (popcount).
Rastegari, Mohammad, et al. "XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks." ECCV 2016
Why Binary?
• Binary instructions: AND, OR, XOR, XNOR, PopCount (Bit-Count)
• Low-power devices
Rastegari, Mohammad, et al. "XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks." ECCV 2016
Why Binary?

Binary Weight Networks:  I ∗ W ≈ (I ⊕ B) α   (multiplications removed; only +, − remain)
XNOR-Networks:  I ∗ W ≈ (sign(I) ⊛ sign(W)) ⊙ Kα   (+, −, × replaced by XNOR and Bit-count)

Operations, memory saving, and computation saving:
• Standard convolution: +, −, ×; 1x memory; 1x computation
• Binary Weight Networks: +, −; ~32x memory; ~2x computation
• XNOR-Networks: XNOR, Bit-count; ~32x memory; ~58x computation
(Storing each weight as a single bit instead of a 32-bit float is what gives the ~32x memory saving.)

Rastegari, Mohammad, et al. "XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks." ECCV 2016
Reminder: XNOR
Rastegari, Mohammad, et al. "XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks." ECCV 2016
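To make the "XNOR + bit-count" idea concrete, here is a tiny illustrative sketch (my own example, not from the slides): the dot product of two {−1, +1} vectors packed into machine words reduces to one XNOR followed by a popcount.

```python
# Dot product of two {-1,+1} vectors via XNOR + popcount (illustrative sketch).
# Encode +1 as bit 1 and -1 as bit 0, pack into an integer, then
# dot(a, b) = (matching bits) - (differing bits) = 2 * popcount(XNOR(a, b)) - n

def pack(values):
    """Pack a list of {-1,+1} values into an integer bitmask (+1 -> 1, -1 -> 0)."""
    word = 0
    for i, v in enumerate(values):
        if v > 0:
            word |= 1 << i
    return word

def binary_dot(a_word, b_word, n):
    xnor = ~(a_word ^ b_word) & ((1 << n) - 1)  # XNOR restricted to n bits
    matches = bin(xnor).count("1")              # popcount
    return 2 * matches - n

a = [+1, -1, +1, +1, -1, -1, +1, -1]
b = [+1, +1, -1, +1, -1, +1, +1, -1]
assert binary_dot(pack(a), pack(b), len(a)) == sum(x * y for x, y in zip(a, b))
```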
Binary Weight Networks

The input to layer l of a CNN is a tensor I ∈ R^{c×w_in×h_in} and each weight filter is W ∈ R^{c×w×h}. To constrain the network ⟨I, W, ∗⟩ to have binary weights, estimate each real-valued filter W with a binary filter B ∈ {+1, −1}^{c×w×h} and a scaling factor α ∈ R+ such that W ≈ αB. The convolution can then be approximated by
  I ∗ W ≈ (I ⊕ B) α
where ⊕ indicates a convolution without any multiplication: since the weight values are binary, the convolution is implemented with additions and subtractions only. Binary weight filters reduce memory usage by a factor of ~32x compared to single-precision filters.
Optimal Scaling Factor

α*, W^B* = argmin over α, W^B of J(α, W^B),   where J(α, W^B) = ||W − α W^B||²₂

Writing B = W^B and expanding:
  J(α, B) = α² BᵀB − 2α BᵀW + WᵀW = α² n − 2α BᵀW + constant
(since BᵀB = n for B ∈ {+1, −1}ⁿ, and WᵀW does not depend on α or B).
Simple Example (scalar case)

argmin over α, b of  α² − 2α · b · w + constant,   s.t. b ∈ {+1, −1}

• Since we know that α is always positive,
  b = +1 if w > 0,
  b = −1 if w < 0.
• Or more simply, b = sign(w).
Optimal Scaling Factor

• Since W^B = sign(W),
  ||W||ℓ1 = Wᵀ sign(W)
• Therefore
  J(α) = α² n − 2α ||W||ℓ1 + constant.

Setting the derivative of J with respect to α to zero gives the optimal scaling factor:
  α* = Wᵀ W^B* / n = Wᵀ sign(W) / n = (Σᵢ |Wᵢ|) / n = ||W||ℓ1 / n.

Strategy for computing α (AlexNet top-1, Table 3 of the paper): using this closed-form solution → 53.8; learning α as a separate layer → 46.2.
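A quick numerical check of this closed-form solution (my own sketch, not from the lecture): brute-forcing α over a grid never beats α* = ||W||ℓ1 / n with B* = sign(W).

```python
# Verify the closed-form optimum of J(alpha, B) = ||W - alpha*B||^2
# for B in {-1,+1}^n: B* = sign(W), alpha* = ||W||_1 / n.
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=1000)

B_star = np.sign(W)
alpha_star = np.abs(W).sum() / W.size
J_star = np.sum((W - alpha_star * B_star) ** 2)

# Compare against a dense grid of alternative alphas (with the optimal B).
for alpha in np.linspace(0.01, 2.0, 200):
    assert np.sum((W - alpha * B_star) ** 2) >= J_star - 1e-9
print("alpha* =", alpha_star, " J(alpha*, B*) =", J_star)
```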
How to train a CNN with binary filters?

Naive Solution
1. Train a network with real parameters.
2. Binarize the weight filters.
[Diagram: real-valued weight filters W (R … R) → Binarization → binary filters W^B (B … B).]
Naive Solution: AlexNet Top-1 (%) ILSVRC2012
• Full Precision: 56.7
• Naïve binarization: 0.2
Binary Weight Network

Train for binary weights:
1. Randomly initialize W
2. For iter = 1 to N
3.   Load a random input image X
4.   W^B = sign(W)
5.   α = ||W||ℓ1 / n
6.   Forward pass with α, W^B
7.   Compute loss function C
8.   ∂C/∂W = Backward pass with α, W^B
9.   Update W (W = W − ∂C/∂W)

[Diagram: the real-valued filters W are kept throughout training; each iteration binarizes them into W^B with scale α for the forward and backward pass, computes the loss L, and applies the gradients G_W back to the real-valued W.]

Gradients of Binary Weights [Hinton et al. 2012]
  g(w) = f(sign(w)) = f(w^b)
  ∂g/∂w = ∂f(w^b)/∂w^b · ∂sign(w)/∂w
Since sign(x) has zero gradient almost everywhere, ∂sign(x)/∂x is approximated with the straight-through estimator: ≈ 1 for |x| ≤ 1 and 0 otherwise (the plotted G_x is a box of height 1 between −1 and +1).
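The listing above maps naturally onto a short training loop. Below is a hedged PyTorch sketch of the idea (module names, shapes, and hyper-parameters are my own, and the handling of α in the backward pass is simplified, not exactly as in the paper): the optimizer updates the real-valued weights, while the forward pass sees α·sign(W).

```python
# Sketch of binary-weight training with a straight-through estimator (STE).
# Illustrative only: model, shapes, and hyper-parameters are invented here.
import torch
import torch.nn as nn
import torch.nn.functional as F

class BinarizeSTE(torch.autograd.Function):
    @staticmethod
    def forward(ctx, w):
        ctx.save_for_backward(w)
        alpha = w.abs().mean()          # alpha = ||W||_1 / n
        return alpha * torch.sign(w)    # W^B scaled by alpha

    @staticmethod
    def backward(ctx, grad_out):
        (w,) = ctx.saved_tensors
        # Straight-through estimator: pass the gradient where |w| <= 1
        # (ignoring the dependence of alpha on w, as a simplification).
        return grad_out * (w.abs() <= 1).float()

class BinaryConv2d(nn.Conv2d):
    def forward(self, x):
        wb = BinarizeSTE.apply(self.weight)
        return F.conv2d(x, wb, self.bias, self.stride, self.padding)

# Real-valued weights are kept and updated by the optimizer.
layer = BinaryConv2d(3, 16, kernel_size=3, padding=1)
opt = torch.optim.SGD(layer.parameters(), lr=0.01)
x, target = torch.randn(8, 3, 32, 32), torch.randn(8, 16, 32, 32)
loss = F.mse_loss(layer(x), target)
opt.zero_grad(); loss.backward(); opt.step()
```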
Binary Weight Network: AlexNet Top-1 (%) ILSVRC2012
• Full Precision: 56.7
• Binary Weight: 56.8
• Naïve: 0.2
Reminder

I ∗ W ≈ (sign(I) ⊛ sign(W)) ⊙ Kα

• Binary Weight Networks: operations +, −; ~32x memory saving, ~2x computation saving
• XNOR-Networks: operations XNOR, Bit-count; ~32x memory saving, ~58x computation saving
Binary Weight - Binary Input Network (XNOR-Net)

Now binarize both operands: W^B = sign(W) and X^B = sign(X).

For a dot product, define Y ∈ Rⁿ with Yᵢ = XᵢWᵢ and solve
  Y^B*, γ* = argmin ||Y − γ Y^B||²   ⇒   Y^B* = sign(Y) = sign(X) ⊙ sign(W),   γ* = ||Y||ℓ1 / n.
Assuming |Xᵢ| and |Wᵢ| are independent, γ* ≈ β* α* with β* = ||X||ℓ1 / n and α* = ||W||ℓ1 / n, so
  XᵀW ≈ β α (sign(X)ᵀ sign(W)).

The convolution becomes
  I ∗ W ≈ (sign(I) ⊛ sign(W)) ⊙ Kα,
where ⊛ is a convolution implemented with XNOR and bit-count operations, and K holds the input scaling factors β for every sub-tensor of I.
Binary Weight - Binary Input Network

(1) Binarizing weights: W ≈ α sign(W), as before.

(2) Binarizing input: each w×h sub-tensor of the input X needs its own scaling factor β, but neighbouring sub-tensors overlap, so computing β independently at every location repeats a lot of work (inefficient). Efficient alternative: average the absolute values across channels, A = Σᵢ |X:,:,i| / c, then convolve A with an averaging filter k ∈ R^{w×h}, kᵢⱼ = 1/(w·h), to get K = A ∗ k; Kᵢⱼ is the scaling factor for the sub-tensor centred at location ij.

(3) Convolution with XNOR-Bitcount: I ∗ W ≈ (sign(I) ⊛ sign(W)) ⊙ Kα.
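A minimal numpy sketch of step (2), assuming an input tensor X of shape (c, H, W) and a w×h weight filter; the function name and shapes are my own:

```python
# Efficient computation of the input scaling factors K = A * k
# (channel-wise mean of |X|, then a box filter the size of the weight filter).
import numpy as np
from scipy.signal import convolve2d

def input_scaling_factors(X, w, h):
    """X: input tensor of shape (c, H, W); returns K of shape (H-h+1, W-w+1)."""
    A = np.abs(X).mean(axis=0)              # A = sum_i |X[i, :, :]| / c
    k = np.full((h, w), 1.0 / (w * h))      # averaging filter, k_ij = 1/(w*h)
    return convolve2d(A, k, mode="valid")   # K_ij = scaling factor at location ij

X = np.random.randn(3, 8, 8)
K = input_scaling_factors(X, w=3, h=3)
print(K.shape)  # (6, 6): one beta per 3x3 sub-tensor location
```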
Binary Convolution
A standard convolution needs cN_W N_I operations, where c is the number of channels, N_W the number of elements in the filter W, and N_I the number of elements in the input feature map I. Modern CPUs can perform 64 binary operations in one clock cycle, so replacing the regular convolution with a binary convolution needs roughly (1/64) cN_W N_I binary operations plus N_I non-binary operations (for the scaling). The speedup is therefore
  S = cN_W N_I / ((1/64) cN_W N_I + N_I) = 64 cN_W / (cN_W + 64).
It depends on the channel and filter size, but not on the input size. With c = 256 and a 3×3 filter (as in most ResNet convolutions) the theoretical speedup is 62.27x; with all implementation overheads, about 58x is achieved in practice.
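Plugging numbers into the speedup formula (a quick check; the snippet and the second example are my own):

```python
# Theoretical XNOR-convolution speedup S = 64*c*N_W / (c*N_W + 64).
def speedup(c, n_w):
    return 64 * c * n_w / (c * n_w + 64)

print(speedup(c=256, n_w=3 * 3))  # ~62.27x for a 256-channel 3x3 convolution
print(speedup(c=1, n_w=3 * 3))    # ~7.9x with a single input channel
                                  # (cf. Fig. 4b: speedup vs. number of channels)
```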
Binary Weight - Binary Input Network (continued)

Binary dot product: approximate XᵀW ≈ β Hᵀ α B with H = sign(X) and B = sign(W), H, B ∈ {+1, −1}ⁿ; defining Y ∈ Rⁿ with Yᵢ = XᵢWᵢ and γ = βα reduces this to the same optimization solved earlier.

Training follows the same procedure as the Binary Weight Network: binarize on the forward pass (now both inputs and weights), and update the stored real-valued weights with the straight-through gradient.
XNOR Networks: AlexNet Top-1 (%) ILSVRC2012
• Full Precision: 56.7
• Binary Weight: 56.8
• XNOR-Net: 30.5
• Naïve: 0.2
Problem with Pooling

A typical block in a CNN: Conv → BNorm → Activ → Pool.
If max-pooling is applied to binarized activations (values in ±1), most pooled outputs are +1 and there are multiple identical maximums, so information is lost.
A block in XNOR-Net therefore reorders the layers: BNorm → BinActiv → BinConv → Pool, so that pooling operates on the real-valued convolution outputs instead.
(Fig. 3 of Rastegari et al. contrasts the four-layer block structure of the XNOR-Network with a typical CNN block.)
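To make the reordering concrete, here is a hedged PyTorch-style sketch of the two block orderings (layer choices and sizes are my own; `BinActiv` and the plain `nn.Conv2d` stand in for the binarized activation and binary convolution described above):

```python
# Typical CNN block vs. XNOR-Net block ordering (illustrative sketch only).
import torch
import torch.nn as nn

class BinActiv(nn.Module):
    """Sign activation; stands in for the binary activation described above."""
    def forward(self, x):
        return torch.sign(x)

channels = 64  # made-up size for illustration

# Typical CNN block: Conv -> BNorm -> Activ -> Pool
typical_block = nn.Sequential(
    nn.Conv2d(channels, channels, 3, padding=1),
    nn.BatchNorm2d(channels),
    nn.ReLU(),
    nn.MaxPool2d(2),
)

# XNOR-Net block: BNorm -> BinActiv -> BinConv -> Pool,
# so max-pooling sees real-valued conv outputs instead of +/-1 values.
xnor_block = nn.Sequential(
    nn.BatchNorm2d(channels),
    BinActiv(),
    nn.Conv2d(channels, channels, 3, padding=1),  # stand-in for a binary conv
    nn.MaxPool2d(2),
)
```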
XNOR Network

Binary dot product: XᵀW ≈ β Hᵀ α B, with H = sign(X), B = sign(W) ∈ {+1, −1}ⁿ; defining Y ∈ Rⁿ such that Yᵢ = XᵢWᵢ and γ = βα, the optimal solution is C* = sign(Y) and γ* = ||Y||ℓ1 / n, as derived above.
[Slide diagram: a BNorm → Activ → Conv → Pool block whose real-valued filters are approximated (≈) by binary filters with scaling factors; training reuses the binary-weight procedure.]
Refer to this as an “XNOR” Network
Results

✓ ~32x smaller model
Required memory for the weights, double precision vs. binary (Fig. 4a): VGG-19 ~1 GB → ~16 MB, ResNet-18 ~100 MB → ~1.5 MB, AlexNet ~475 MB → ~7.4 MB. (Rastegari et al.)

✓ ~58x less computation
[Plots (Fig. 4b, c): speedup by varying the channel size (1 to 1024 channels) and by varying the filter size (up to 20×20).]

AlexNet Top-1 (%) ILSVRC2012: Full Precision 56.7, Binary Weight 56.8, XNOR-Net 44.2 (30.5 before the block reordering), Naïve 0.2.

Fig. 4: This figure shows the efficiency of binary convolutions in terms of memory (a) and computation (b-c). (a) contrasts the required memory for binary and double-precision weights.

[Second results slide: AlexNet Top-1 & Top-5 (%) ILSVRC2012 bar chart.]
Today • Motivation • XNOR Networks • YOLO
You Only Look Once (YOLO)
Ali Farhadi, University of Washington
The YOLO Detection System:
1. Resize the input image to 448 × 448.
2. Run a single convolutional network on the image.
3. Threshold the resulting detections by the model's confidence (non-max suppression).
[Figure 1 of the paper: example detections such as Dog: 0.30, Person: 0.64, Horse: 0.28.]
Redmon, Joseph, et al. "You only look once: Unified, real-time object detection." CVPR 2016.
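Step (3) above relies on non-max suppression. A minimal sketch of greedy NMS (my own illustration, not the paper's exact implementation):

```python
# Greedy non-max suppression: keep the highest-scoring box, drop boxes that
# overlap it by more than an IoU threshold, repeat. Illustrative sketch only.
def iou(a, b):
    """Intersection-over-union of boxes (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def nms(boxes, scores, iou_thresh=0.5):
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_thresh]
    return keep

boxes = [(10, 10, 60, 60), (12, 12, 58, 62), (100, 100, 150, 150)]
scores = [0.9, 0.8, 0.6]
print(nms(boxes, scores))  # -> [0, 2]: the near-duplicate box 1 is suppressed
```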
You Only Look Once (YOLO)

Each predicted box carries a confidence score that encodes both the probability of that class appearing in the box and how well the predicted box fits the object.

Figure 2 of the paper (The Model): the system models detection as a regression problem. It divides the image into an S × S grid and, for each cell, simultaneously predicts B bounding boxes with confidences and C conditional class probabilities, encoded as an S × S × (B·5 + C) tensor.

Redmon, Joseph, et al. "You only look once: Unified, real-time object detection." CVPR 2016.
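For PASCAL VOC the paper uses S = 7, B = 2, C = 20, so the output is a 7 × 7 × 30 tensor. A tiny sketch of how one grid cell's slice could be decomposed (the index layout here is my own bookkeeping, not the paper's exact memory layout):

```python
# Decompose one grid cell of a YOLO-style S x S x (B*5 + C) output (7 x 7 x 30).
import numpy as np

S, B, C = 7, 2, 20
output = np.random.rand(S, S, B * 5 + C)   # stand-in for the network output

cell = output[3, 4]                        # one grid cell: 30 values
boxes = cell[:B * 5].reshape(B, 5)         # per box: x, y, w, h, confidence
class_probs = cell[B * 5:]                 # C conditional class probabilities

# class-specific confidence = P(class | object) * box confidence
scores = class_probs[None, :] * boxes[:, 4:5]
print(scores.shape)                        # (2, 20): a score per box and class
```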
You Only Look Once (YOLO)
Redmon, Joseph, et al. "You only look once: Unified, real-time object detection." CVPR 2016.
You Only Look Once (YOLO)
Redmon, Joseph, et al. "You only look once: Unified, real-time object detection." CVPR 2016.
YOLO on Nature
Redmon, Joseph, et al. "You only look once: Unified, real-time object detection." CVPR 2016.
YOLO on Nature
Redmon, Joseph, et al. "You only look once: Unified, real-time object detection." CVPR 2016.