Deep Learning for Mobile Part II Instructor - Simon Lucey 16-623 - Designing Computer Vision Apps
Today • Motivation • XNOR Networks • YOLO
State of the Art Recognition Methods
State of the Art Recognition Methods
• Very expensive: Memory, Computation, Power
Convolutional Neural Networks
Common Deep Learning Packages • Deep learning packages used include: • Caffe (out of Berkeley, the first popular package). • MatConvNet (MATLAB interface, very easy to use). • Torch (based on Lua, used by Facebook). • TensorFlow (based on Python, used by Google).
TensorFlow
TensorFlow
Number of Operations:
• AlexNet → 1.5B FLOPs
• VGG → 19.6B FLOPs
Inference time on CPU:
• AlexNet → ~3 fps
• VGG → ~0.25 fps
GPU!
Rastegari, Mohammad, et al. "XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks." ECCV 2016
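As a rough back-of-the-envelope check (my own numbers, not from the slide): dividing the per-image FLOP counts above by an assumed effective CPU throughput reproduces frame rates in the same ballpark as those quoted.

```python
# Rough sanity check of the FLOPs vs. CPU frame-rate numbers above.
# The ~5 GFLOP/s effective throughput is an assumption for illustration,
# not a measured value from the lecture or the paper.
flops_per_image = {"AlexNet": 1.5e9, "VGG": 19.6e9}
effective_throughput = 5e9  # assumed sustained FLOP/s on a mobile-class CPU

for net, flops in flops_per_image.items():
    fps = effective_throughput / flops
    print(f"{net}: ~{fps:.2f} fps")  # AlexNet ~3.3 fps, VGG ~0.26 fps
```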
TensorFlow iOS
Accelerate Framework
• “image operations”
• “signal processing”
• “matrix operations”
• “misc math”
• BNNS (2016): “basic neural network subroutines”
(Taken from https://www.bignerdranch.com/blog/neural-networks-in-ios-10-and-macos/)
Deep Learning Kit
http://deeplearningkit.org/
Today • Motivation • XNOR Networks • YOLO
Lower Precision
Pruning the network [Han et al. 2016]: Train on Dense (D) → Train on Sparse (S) → re-train Dense (D).
[Figure (Han et al. 2016): weight-distribution histograms (count vs. weight value) for the original GoogleNet, the sparsity-constrained (pruned) network, and the re-trained dense network. The final dense step re-initializes the previously pruned connections and re-trains with a reduced learning rate.]
Reducing Precision
• Saving memory
• Saving computation
32-bit → 8-bit → 1-bit: with weights in {−1, +1} (stored as {0, 1} bits), MUL becomes XNOR and ADD/SUB becomes Bit-Count (popcount).
Rastegari, Mohammad, et al. "XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks." ECCV 2016
Why Binary?
• Binary instructions: AND, OR, XOR, XNOR, PopCount (Bit-Count)
• Low-power devices
Rastegari, Mohammad, et al. "XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks." ECCV 2016
Why Binary?

Binary Weight Networks:  I ∗ W ≈ (I ⊕ B) α   (multiplications removed; only +, − remain)
XNOR-Networks:  I ∗ W ≈ (sign(I) ⊛ sign(W)) ⊙ Kα   (+, −, × replaced by XNOR and Bit-count)

Operations, memory saving, and computation saving:
• Standard convolution: +, −, ×; 1x memory; 1x computation
• Binary Weight Networks: +, −; ~32x memory; ~2x computation
• XNOR-Networks: XNOR, Bit-count; ~32x memory; ~58x computation
(Storing each weight as a single bit instead of a 32-bit float is what gives the ~32x memory saving.)

Rastegari, Mohammad, et al. "XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks." ECCV 2016
Reminder: XNOR
Rastegari, Mohammad, et al. "XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks." ECCV 2016
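To make the "XNOR + bit-count" idea concrete, here is a tiny illustrative sketch (my own example, not from the slides): the dot product of two {−1, +1} vectors packed into machine words reduces to one XNOR followed by a popcount.

```python
# Dot product of two {-1,+1} vectors via XNOR + popcount (illustrative sketch).
# Encode +1 as bit 1 and -1 as bit 0, pack into an integer, then
# dot(a, b) = (matching bits) - (differing bits) = 2 * popcount(XNOR(a, b)) - n

def pack(values):
    """Pack a list of {-1,+1} values into an integer bitmask (+1 -> 1, -1 -> 0)."""
    word = 0
    for i, v in enumerate(values):
        if v > 0:
            word |= 1 << i
    return word

def binary_dot(a_word, b_word, n):
    xnor = ~(a_word ^ b_word) & ((1 << n) - 1)  # XNOR restricted to n bits
    matches = bin(xnor).count("1")              # popcount
    return 2 * matches - n

a = [+1, -1, +1, +1, -1, -1, +1, -1]
b = [+1, +1, -1, +1, -1, +1, +1, -1]
assert binary_dot(pack(a), pack(b), len(a)) == sum(x * y for x, y in zip(a, b))
```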
Binary Weight Networks

The input to layer l of a CNN is a tensor I ∈ R^{c×w_in×h_in} and each weight filter is W ∈ R^{c×w×h}. To constrain the network ⟨I, W, ∗⟩ to have binary weights, estimate each real-valued filter W with a binary filter B ∈ {+1, −1}^{c×w×h} and a scaling factor α ∈ R+ such that W ≈ αB. The convolution can then be approximated by
  I ∗ W ≈ (I ⊕ B) α
where ⊕ indicates a convolution without any multiplication: since the weight values are binary, the convolution is implemented with additions and subtractions only. Binary weight filters reduce memory usage by a factor of ~32x compared to single-precision filters.
Optimal Scaling Factor

α*, W^B* = argmin over α, W^B of J(α, W^B),   where J(α, W^B) = ||W − α W^B||²₂

Writing B = W^B and expanding:
  J(α, B) = α² BᵀB − 2α BᵀW + WᵀW = α² n − 2α BᵀW + constant
(since BᵀB = n for B ∈ {+1, −1}ⁿ, and WᵀW does not depend on α or B).
Simple Example (scalar case)

argmin over α, b of  α² − 2α · b · w + constant,   s.t. b ∈ {+1, −1}

• Since we know that α is always positive,
  b = +1 if w > 0,
  b = −1 if w < 0.
• Or more simply, b = sign(w).
Optimal Scaling Factor

• Since W^B = sign(W),
  ||W||ℓ1 = Wᵀ sign(W)
• Therefore
  J(α) = α² n − 2α ||W||ℓ1 + constant.

Setting the derivative of J with respect to α to zero gives the optimal scaling factor:
  α* = Wᵀ W^B* / n = Wᵀ sign(W) / n = (Σᵢ |Wᵢ|) / n = ||W||ℓ1 / n.

Strategy for computing α (AlexNet top-1, Table 3 of the paper): using this closed-form solution → 53.8; learning α as a separate layer → 46.2.
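A quick numerical check of this closed-form solution (my own sketch, not from the lecture): brute-forcing α over a grid never beats α* = ||W||ℓ1 / n with B* = sign(W).

```python
# Verify the closed-form optimum of J(alpha, B) = ||W - alpha*B||^2
# for B in {-1,+1}^n: B* = sign(W), alpha* = ||W||_1 / n.
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=1000)

B_star = np.sign(W)
alpha_star = np.abs(W).sum() / W.size
J_star = np.sum((W - alpha_star * B_star) ** 2)

# Compare against a dense grid of alternative alphas (with the optimal B).
for alpha in np.linspace(0.01, 2.0, 200):
    assert np.sum((W - alpha * B_star) ** 2) >= J_star - 1e-9
print("alpha* =", alpha_star, " J(alpha*, B*) =", J_star)
```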
How to train a CNN with binary filters?

Naive Solution
1. Train a network with real parameters.
2. Binarize the weight filters.
[Diagram: real-valued weight filters W (R … R) → Binarization → binary filters W^B (B … B).]
Naive Solution: AlexNet Top-1 (%) ILSVRC2012
• Full Precision: 56.7
• Naïve binarization: 0.2
Binary Weight Network

Train for binary weights:
1. Randomly initialize W
2. For iter = 1 to N
3.   Load a random input image X
4.   W^B = sign(W)
5.   α = ||W||ℓ1 / n
6.   Forward pass with α, W^B
7.   Compute loss function C
8.   ∂C/∂W = Backward pass with α, W^B
9.   Update W (W = W − ∂C/∂W)

[Diagram: the real-valued filters W are kept throughout training; each iteration binarizes them into W^B with scale α for the forward and backward pass, computes the loss L, and applies the gradients G_W back to the real-valued W.]

Gradients of Binary Weights [Hinton et al. 2012]
  g(w) = f(sign(w)) = f(w^b)
  ∂g/∂w = ∂f(w^b)/∂w^b · ∂sign(w)/∂w
Since sign(x) has zero gradient almost everywhere, ∂sign(x)/∂x is approximated with the straight-through estimator: ≈ 1 for |x| ≤ 1 and 0 otherwise (the plotted G_x is a box of height 1 between −1 and +1).
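The listing above maps naturally onto a short training loop. Below is a hedged PyTorch sketch of the idea (module names, shapes, and hyper-parameters are my own, and the handling of α in the backward pass is simplified, not exactly as in the paper): the optimizer updates the real-valued weights, while the forward pass sees α·sign(W).

```python
# Sketch of binary-weight training with a straight-through estimator (STE).
# Illustrative only: model, shapes, and hyper-parameters are invented here.
import torch
import torch.nn as nn
import torch.nn.functional as F

class BinarizeSTE(torch.autograd.Function):
    @staticmethod
    def forward(ctx, w):
        ctx.save_for_backward(w)
        alpha = w.abs().mean()          # alpha = ||W||_1 / n
        return alpha * torch.sign(w)    # W^B scaled by alpha

    @staticmethod
    def backward(ctx, grad_out):
        (w,) = ctx.saved_tensors
        # Straight-through estimator: pass the gradient where |w| <= 1
        # (ignoring the dependence of alpha on w, as a simplification).
        return grad_out * (w.abs() <= 1).float()

class BinaryConv2d(nn.Conv2d):
    def forward(self, x):
        wb = BinarizeSTE.apply(self.weight)
        return F.conv2d(x, wb, self.bias, self.stride, self.padding)

# Real-valued weights are kept and updated by the optimizer.
layer = BinaryConv2d(3, 16, kernel_size=3, padding=1)
opt = torch.optim.SGD(layer.parameters(), lr=0.01)
x, target = torch.randn(8, 3, 32, 32), torch.randn(8, 16, 32, 32)
loss = F.mse_loss(layer(x), target)
opt.zero_grad(); loss.backward(); opt.step()
```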
Binary Weight Network: AlexNet Top-1 (%) ILSVRC2012
• Full Precision: 56.7
• Binary Weight: 56.8
• Naïve: 0.2
Reminder

I ∗ W ≈ (sign(I) ⊛ sign(W)) ⊙ Kα

• Binary Weight Networks: operations +, −; ~32x memory saving, ~2x computation saving
• XNOR-Networks: operations XNOR, Bit-count; ~32x memory saving, ~58x computation saving
Binary Weight - Binary Input Network (XNOR-Net)

Now binarize both operands: W^B = sign(W) and X^B = sign(X).

For a dot product, define Y ∈ Rⁿ with Yᵢ = XᵢWᵢ and solve
  Y^B*, γ* = argmin ||Y − γ Y^B||²   ⇒   Y^B* = sign(Y) = sign(X) ⊙ sign(W),   γ* = ||Y||ℓ1 / n.
Assuming |Xᵢ| and |Wᵢ| are independent, γ* ≈ β* α* with β* = ||X||ℓ1 / n and α* = ||W||ℓ1 / n, so
  XᵀW ≈ β α (sign(X)ᵀ sign(W)).

The convolution becomes
  I ∗ W ≈ (sign(I) ⊛ sign(W)) ⊙ Kα,
where ⊛ is a convolution implemented with XNOR and bit-count operations, and K holds the input scaling factors β for every sub-tensor of I.
Binary Weight - Binary Input Network

(1) Binarizing weights: W ≈ α sign(W), as before.

(2) Binarizing input: each w×h sub-tensor of the input X needs its own scaling factor β, but neighbouring sub-tensors overlap, so computing β independently at every location repeats a lot of work (inefficient). Efficient alternative: average the absolute values across channels, A = Σᵢ |X:,:,i| / c, then convolve A with an averaging filter k ∈ R^{w×h}, kᵢⱼ = 1/(w·h), to get K = A ∗ k; Kᵢⱼ is the scaling factor for the sub-tensor centred at location ij.

(3) Convolution with XNOR-Bitcount: I ∗ W ≈ (sign(I) ⊛ sign(W)) ⊙ Kα.
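A minimal numpy sketch of step (2), assuming an input tensor X of shape (c, H, W) and a w×h weight filter; the function name and shapes are my own:

```python
# Efficient computation of the input scaling factors K = A * k
# (channel-wise mean of |X|, then a box filter the size of the weight filter).
import numpy as np
from scipy.signal import convolve2d

def input_scaling_factors(X, w, h):
    """X: input tensor of shape (c, H, W); returns K of shape (H-h+1, W-w+1)."""
    A = np.abs(X).mean(axis=0)              # A = sum_i |X[i, :, :]| / c
    k = np.full((h, w), 1.0 / (w * h))      # averaging filter, k_ij = 1/(w*h)
    return convolve2d(A, k, mode="valid")   # K_ij = scaling factor at location ij

X = np.random.randn(3, 8, 8)
K = input_scaling_factors(X, w=3, h=3)
print(K.shape)  # (6, 6): one beta per 3x3 sub-tensor location
```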
Binary Convolution
A standard convolution needs cN_W N_I operations, where c is the number of channels, N_W the number of elements in the filter W, and N_I the number of elements in the input feature map I. Modern CPUs can perform 64 binary operations in one clock cycle, so replacing the regular convolution with a binary convolution needs roughly (1/64) cN_W N_I binary operations plus N_I non-binary operations (for the scaling). The speedup is therefore
  S = cN_W N_I / ((1/64) cN_W N_I + N_I) = 64 cN_W / (cN_W + 64).
It depends on the channel and filter size, but not on the input size. With c = 256 and a 3×3 filter (as in most ResNet convolutions) the theoretical speedup is 62.27x; with all implementation overheads, about 58x is achieved in practice.
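Plugging numbers into the speedup formula (a quick check; the snippet and the second example are my own):

```python
# Theoretical XNOR-convolution speedup S = 64*c*N_W / (c*N_W + 64).
def speedup(c, n_w):
    return 64 * c * n_w / (c * n_w + 64)

print(speedup(c=256, n_w=3 * 3))  # ~62.27x for a 256-channel 3x3 convolution
print(speedup(c=1, n_w=3 * 3))    # ~7.9x with a single input channel
                                  # (cf. Fig. 4b: speedup vs. number of channels)
```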
Binary Weight - Binary Input Network (continued)

Binary dot product: approximate XᵀW ≈ β Hᵀ α B with H = sign(X) and B = sign(W), H, B ∈ {+1, −1}ⁿ; defining Y ∈ Rⁿ with Yᵢ = XᵢWᵢ and γ = βα reduces this to the same optimization solved earlier.

Training follows the same procedure as the Binary Weight Network: binarize on the forward pass (now both inputs and weights), and update the stored real-valued weights with the straight-through gradient.
XNOR Networks: AlexNet Top-1 (%) ILSVRC2012
• Full Precision: 56.7
• Binary Weight: 56.8
• XNOR-Net: 30.5
• Naïve: 0.2
Problem with Pooling

A typical block in a CNN: Conv → BNorm → Activ → Pool.
If max-pooling is applied to binarized activations (values in ±1), most pooled outputs are +1 and there are multiple identical maximums, so information is lost.
A block in XNOR-Net therefore reorders the layers: BNorm → BinActiv → BinConv → Pool, so that pooling operates on the real-valued convolution outputs instead.
(Fig. 3 of Rastegari et al. contrasts the four-layer block structure of the XNOR-Network with a typical CNN block.)
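To make the reordering concrete, here is a hedged PyTorch-style sketch of the two block orderings (layer choices and sizes are my own; `BinActiv` and the plain `nn.Conv2d` stand in for the binarized activation and binary convolution described above):

```python
# Typical CNN block vs. XNOR-Net block ordering (illustrative sketch only).
import torch
import torch.nn as nn

class BinActiv(nn.Module):
    """Sign activation; stands in for the binary activation described above."""
    def forward(self, x):
        return torch.sign(x)

channels = 64  # made-up size for illustration

# Typical CNN block: Conv -> BNorm -> Activ -> Pool
typical_block = nn.Sequential(
    nn.Conv2d(channels, channels, 3, padding=1),
    nn.BatchNorm2d(channels),
    nn.ReLU(),
    nn.MaxPool2d(2),
)

# XNOR-Net block: BNorm -> BinActiv -> BinConv -> Pool,
# so max-pooling sees real-valued conv outputs instead of +/-1 values.
xnor_block = nn.Sequential(
    nn.BatchNorm2d(channels),
    BinActiv(),
    nn.Conv2d(channels, channels, 3, padding=1),  # stand-in for a binary conv
    nn.MaxPool2d(2),
)
```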
XNOR Network

Binary dot product: XᵀW ≈ β Hᵀ α B, with H = sign(X), B = sign(W) ∈ {+1, −1}ⁿ; defining Y ∈ Rⁿ such that Yᵢ = XᵢWᵢ and γ = βα, the optimal solution is C* = sign(Y) and γ* = ||Y||ℓ1 / n, as derived above.
[Slide diagram: a BNorm → Activ → Conv → Pool block whose real-valued filters are approximated (≈) by binary filters with scaling factors; training reuses the binary-weight procedure.]
Refer to this as an “XNOR” Network
Results

✓ ~32x smaller model
Required memory for the weights, double precision vs. binary (Fig. 4a): VGG-19 ~1 GB → ~16 MB, ResNet-18 ~100 MB → ~1.5 MB, AlexNet ~475 MB → ~7.4 MB. (Rastegari et al.)

✓ ~58x less computation
[Plots (Fig. 4b, c): speedup by varying the channel size (1 to 1024 channels) and by varying the filter size (up to 20×20).]

AlexNet Top-1 (%) ILSVRC2012: Full Precision 56.7, Binary Weight 56.8, XNOR-Net 44.2 (30.5 before the block reordering), Naïve 0.2.

Fig. 4: This figure shows the efficiency of binary convolutions in terms of memory (a) and computation (b-c). (a) contrasts the required memory for binary and double-precision weights.

[Second results slide: AlexNet Top-1 & Top-5 (%) ILSVRC2012 bar chart.]
Today • Motivation • XNOR Networks • YOLO
You Only Look Once (YOLO)
Ali Farhadi, University of Washington
The YOLO Detection System:
1. Resize the input image to 448 × 448.
2. Run a single convolutional network on the image.
3. Threshold the resulting detections by the model's confidence (non-max suppression).
[Figure 1 of the paper: example detections such as Dog: 0.30, Person: 0.64, Horse: 0.28.]
Redmon, Joseph, et al. "You only look once: Unified, real-time object detection." CVPR 2016.
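Step (3) above relies on non-max suppression. A minimal sketch of greedy NMS (my own illustration, not the paper's exact implementation):

```python
# Greedy non-max suppression: keep the highest-scoring box, drop boxes that
# overlap it by more than an IoU threshold, repeat. Illustrative sketch only.
def iou(a, b):
    """Intersection-over-union of boxes (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def nms(boxes, scores, iou_thresh=0.5):
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_thresh]
    return keep

boxes = [(10, 10, 60, 60), (12, 12, 58, 62), (100, 100, 150, 150)]
scores = [0.9, 0.8, 0.6]
print(nms(boxes, scores))  # -> [0, 2]: the near-duplicate box 1 is suppressed
```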
You Only Look Once (YOLO)

Each predicted box carries a confidence score that encodes both the probability of that class appearing in the box and how well the predicted box fits the object.

Figure 2 of the paper (The Model): the system models detection as a regression problem. It divides the image into an S × S grid and, for each cell, simultaneously predicts B bounding boxes with confidences and C conditional class probabilities, encoded as an S × S × (B·5 + C) tensor.

Redmon, Joseph, et al. "You only look once: Unified, real-time object detection." CVPR 2016.
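For PASCAL VOC the paper uses S = 7, B = 2, C = 20, so the output is a 7 × 7 × 30 tensor. A tiny sketch of how one grid cell's slice could be decomposed (the index layout here is my own bookkeeping, not the paper's exact memory layout):

```python
# Decompose one grid cell of a YOLO-style S x S x (B*5 + C) output (7 x 7 x 30).
import numpy as np

S, B, C = 7, 2, 20
output = np.random.rand(S, S, B * 5 + C)   # stand-in for the network output

cell = output[3, 4]                        # one grid cell: 30 values
boxes = cell[:B * 5].reshape(B, 5)         # per box: x, y, w, h, confidence
class_probs = cell[B * 5:]                 # C conditional class probabilities

# class-specific confidence = P(class | object) * box confidence
scores = class_probs[None, :] * boxes[:, 4:5]
print(scores.shape)                        # (2, 20): a score per box and class
```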
You Only Look Once (YOLO)
Redmon, Joseph, et al. "You only look once: Unified, real-time object detection." CVPR 2016.
You Only Look Once (YOLO)
Redmon, Joseph, et al. "You only look once: Unified, real-time object detection." CVPR 2016.
YOLO on Nature
Redmon, Joseph, et al. "You only look once: Unified, real-time object detection." CVPR 2016.
YOLO on Nature
Redmon, Joseph, et al. "You only look once: Unified, real-time object detection." CVPR 2016.