Regression-based Learning of Human Actions from Video using HOF-LBP Flow Patterns
Binu M Nair, Vijayan K Asari
Motivation and Objectives
• Motivation: To recognize a human action from a surveillance video feed at long distance.
• Objectives: To develop a human action recognition framework
  – which is invariant to sequence-length normalization
  – which can classify human actions from 10-15 frames (for real-time operation)
  – which accounts for variation in the speed of an action (different people wave at different speeds)
  – which is invariant to the initialization of the starting/ending points of an action cycle
Overview of the Proposed Algorithm
• Define and extract suitable motion descriptors based on the optical flow at each frame.
• Using the extracted motion descriptors, define an action manifold for each class.
  – The manifold contains the variations of motion across the sequence.
• Learn a neural network to characterize each action manifold.
• Classify the test sequence using the learned neural networks.
Proposed Methodology
1. Motion Representation using Histogram of Oriented Flow and Local Binary Flow Patterns (HOF-LBP)
   – Motion descriptor computed from the optical flow of each frame of the video sequence
2. Computation of the Reduced Posture Space using PCA
   – An action manifold is computed for each action class using Principal Component Analysis
3. Modeling of Action Manifolds using Generalized Regression Neural Networks
Motion Representation using HOF-LBP Flow Patterns

Motion Representation using Histogram of Flow Patterns
• Gives information about the extent of motion on a local scale and the direction of motion.
• Algorithm
  – Compute the optical flow <v_x, v_y> between consecutive frames at each location (x, y).
  – Compute the magnitude and direction images from the optical flow.
  – Divide them into K blocks.
  – At each block, compute a histogram of flow: a weighted histogram of the flow direction, with the weights being the corresponding magnitudes.
  – Concatenate across blocks to get the HOF descriptor.
• These local distributions change during the course of an action sequence.
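The per-block weighted histogram above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' implementation: the (5, 5) block grid matches the HOF(5,5) setting used later, but the number of orientation bins is an assumption the slides do not fix.

```python
import numpy as np

def hof_descriptor(vx, vy, blocks=(5, 5), bins=8):
    """HOF sketch: per-block histogram of flow direction, weighted by
    flow magnitude. blocks=(5,5) follows HOF(5,5); bins=8 is assumed."""
    mag = np.sqrt(vx ** 2 + vy ** 2)
    ang = np.mod(np.arctan2(vy, vx), 2 * np.pi)  # direction in [0, 2*pi)
    H, W = mag.shape
    by, bx = blocks
    hs, ws = H // by, W // bx
    feats = []
    for i in range(by):
        for j in range(bx):
            m = mag[i * hs:(i + 1) * hs, j * ws:(j + 1) * ws].ravel()
            a = ang[i * hs:(i + 1) * hs, j * ws:(j + 1) * ws].ravel()
            # weighted histogram: each pixel votes for its direction bin
            # with a weight equal to its flow magnitude
            hist, _ = np.histogram(a, bins=bins, range=(0, 2 * np.pi), weights=m)
            feats.append(hist)
    return np.concatenate(feats)  # length = by * bx * bins
```

Concatenating the block histograms keeps the descriptor local: motion in one body region only affects the bins of its own block.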
Motion Representation using Local Binary Flow Patterns
• Extracts the relationship between the flow vectors in different regions of the body.
• This "textural" context is extracted by applying the Local Binary Pattern encoding to the optical flow magnitude and direction images:

      LBP_{P,R}(θ_c) = Σ_{i=1}^{P} sgn(θ_c − θ_i) · 2^{i−1}

• [Figure: LBP encoding of a flow-direction neighborhood — the center value θ_c is compared against its neighbors θ_0 … θ_7, and the resulting 0/1 bits form the binary pattern.]
• A sampling grid of (P, R) = (16, 2) is used, where P refers to the number of neighbors and R refers to the radius of the neighborhood.
• The concatenation of HOF and LBP constitutes the action feature set.
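A minimal sketch of the LBP encoding on a flow-direction image follows. It is an assumption-laden illustration: the (P, R) = (16, 2) grid and the sgn(θ_c − θ_i) comparison come from the slides, but nearest-neighbor sampling of the circular neighborhood (instead of interpolation) is a simplification.

```python
import numpy as np

def lbp_flow(theta, P=16, R=2):
    """Local Binary Flow Pattern sketch: at each interior pixel of the
    flow-direction image theta, threshold P circularly sampled neighbors
    at radius R against the center value and pack the sign bits into an
    integer code. Nearest-neighbor sampling is an assumed simplification."""
    H, W = theta.shape
    codes = np.zeros((H, W), dtype=np.int64)
    angles = 2 * np.pi * np.arange(P) / P
    offsets = [(int(round(R * np.sin(a))), int(round(R * np.cos(a))))
               for a in angles]
    for p, (dy, dx) in enumerate(offsets):
        # shift the image so each pixel lines up with its p-th neighbor
        neigh = np.roll(np.roll(theta, -dy, axis=0), -dx, axis=1)
        codes += ((theta - neigh) >= 0).astype(np.int64) << p
    return codes[R:H - R, R:W - R]  # drop the border where np.roll wraps
```

The same encoding, applied to the flow-magnitude image, yields the second half of the LBP part of the feature.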
Feature Extraction - Optical Flow
[Figure: feature-extraction pipeline — optical flow → HOF(5,5) + LBP(16,2) → action feature.]
Computation of Reduced Posture Space

Computation of Reduced Posture Space using PCA
• Aim: perform regression analysis on the set of action features.
  – The action features are the regressors/input variables to a regression function.
  – The response/output variable should
    • bring out the variations in the regressors with respect to time, and
    • be invariant to time: selecting time itself is not the solution.
• [Figure: frames 1 … k of a sequence plotted as points in a two-dimensional reduced posture space.]
• A multivariate time series of (regressor, response) pairs for each action class corresponds to an action manifold (the reduced posture space).
• The frames of an action sequence are then considered as points on a particular manifold.
• One method to treat multivariate time-series data:
  – Principal Component Analysis, also known as Empirical Orthogonal Function (EOF) Analysis.
  – The time-series data is represented as a linear combination of time-independent orthogonal basis functions (eigenvectors) with time-varying amplitudes (eigencoefficients).
Computation of Reduced Posture Space using PCA for action class m
• EOF Analysis
  – Let P(t) = [p_1(t)  p_2(t)  …  p_M(t)]^T ∈ R^M be observed at times t_1, t_2, t_3, …, t_N. Then

      p_m(t_i) = Σ_{k=1}^{M} Y_km · Q_k(t_i),    where Y_km are basis functions and Q_k(t_i) are coefficients.

  – [Figure: PCA maps the feature matrix X_{K(m)×D} to eigenvectors V_{D×d} and coefficients Y_{K(m)×d}.]
  – Extending this to our motion feature set X = [x_1, x_2, …, x_K(m)] of action class m, with a total of K(m) frames and x_k ∈ R^D:
    • We get time-independent basis functions, which are the eigenvectors V = [v_1, v_2, …, v_d].
    • We get time-dependent coefficients Y = [y_1, y_2, …, y_K(m)], with y_k ∈ R^d.
• This establishes a one-to-one correspondence between the motion feature set X and the coefficients Y.
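The PCA/EOF decomposition above can be sketched with an SVD of the centered feature matrix. A minimal sketch, assuming nothing beyond standard PCA; the number of retained dimensions d is a free parameter the slides do not fix.

```python
import numpy as np

def posture_space(X, d=3):
    """PCA/EOF sketch for one action class. X is the K(m) x D feature
    matrix (one HOF-LBP feature per frame). Returns the mean, the
    time-independent eigenvector basis V (D x d), and the time-varying
    coefficients Y (K(m) x d), one coefficient vector per frame."""
    mean = X.mean(axis=0)
    Xc = X - mean
    # SVD of the centered data gives the principal axes directly
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    V = Vt[:d].T       # D x d eigenvectors (basis functions)
    Y = Xc @ V         # K(m) x d coefficients (points on the manifold)
    return mean, V, Y
```

Each row of Y is a point on the action manifold, giving the one-to-one (x_k, y_k) pairs the regression stage needs.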
Modeling the Action Posture Space using GRNN

Modeling of Action Manifolds using Generalized Regression Neural Networks
• Generalized Regression Neural Networks
  – Used to learn the functional mapping between x_k and y_k for an action class m.
  – Based on the radial basis function network.
  – Faster training scheme: a one-pass algorithm.
  – The number of input nodes depends on the number of training samples.
  – K-means clustering is applied before training to reduce the training sample size.
• S_D(m) = { x_k : 1 ≤ k ≤ K(m) }
• S_d(m) = { y_k : 1 ≤ k ≤ K(m) }
• GRNN model m learns the mapping F: S_D(m) → S_d(m).
• The neural network models

      y = Σ_i y_i · radbasis(x − x_i) / Σ_i radbasis(x − x_i)
Modeling of Action Manifolds using Generalized Regression Neural Networks
• If there are L(m) clusters from the training pairs (x_k, y_k),

      y = Num_y / Den_y = [ Σ_{i=1}^{L(m)} y_i,m · exp(−D²_i,m / 2σ²) ] / [ Σ_{i=1}^{L(m)} exp(−D²_i,m / 2σ²) ],

      D²_i,m = (x − x_i,m)^T (x − x_i,m)

  where (x_i,m, y_i,m) is the set of clusters for action class m.
• [Figure: GRNN architecture — inputs x_1 … x_D feed L(m) radial basis units exp(−D²_i,m / 2σ²); their outputs, weighted by y_i,m, are summed into Num_y and Den_y, and the ratio gives y.]
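The GRNN estimate above is just a normalized kernel-weighted average and fits in one function. A sketch under stated assumptions: the cluster centers (x_i,m, y_i,m) are taken as given, and the smoothing width σ = 1.0 is an assumed value, not one fixed by the slides.

```python
import numpy as np

def grnn_predict(x, Xc, Yc, sigma=1.0):
    """GRNN sketch: y = Num_y / Den_y, where each cluster output y_i is
    weighted by exp(-D_i^2 / (2 sigma^2)) and D_i^2 = ||x - x_i||^2.
    Xc: L x D cluster inputs; Yc: L x d cluster outputs; sigma assumed."""
    D2 = np.sum((Xc - x) ** 2, axis=1)        # squared distances D_i^2
    w = np.exp(-D2 / (2.0 * sigma ** 2))      # radial basis activations
    return (w[:, None] * Yc).sum(axis=0) / w.sum()  # Num_y / Den_y
```

Because the cluster pairs are stored directly as the network's weights, "training" is the single pass over the data that the slides describe.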
Classification of a Test Sequence
• Algorithm (Testing)
  – Compute the HOF-LBP motion feature for each frame of the test sequence (partial: 15 frames, or full: 60-80 frames).
  – Project the test features X_r onto the eigenbasis of each action class m.
  – Estimate the projections for each action class m by applying the feature set to the trained GRNN model.
  – Correct class: m* = argmin_m || projections_m − estimations_m ||
• The model which gives the smallest difference between the eigenspace projections and the GRNN estimations indicates the correct class.
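The argmin rule above can be sketched as follows. The per-class model container (mean, eigenbasis V, GRNN callable) is a hypothetical structure chosen for this illustration; the slides only fix the criterion itself.

```python
import numpy as np

def classify(X_test, models):
    """Testing sketch: pick the class whose GRNN best reproduces the
    eigenspace projections of the test frames. `models` maps a class id
    to (mean, V, grnn), where grnn(x) returns the estimated coefficient
    vector; this container layout is an assumption."""
    best, best_err = None, np.inf
    for m, (mean, V, grnn) in models.items():
        proj = (X_test - mean) @ V                   # eigenspace projections
        est = np.array([grnn(x) for x in X_test])    # GRNN estimations
        err = np.linalg.norm(proj - est)             # ||proj_m - est_m||
        if err < best_err:
            best, best_err = m, err
    return best
```

Since the criterion is a distance between two trajectories on the class manifold, it works on partial (15-frame) sequences as well as full ones.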
Results (Weizmann database: 10 actions, 9 individuals)
• Testing strategy: leave 9 sequences out of training.
• Partial sequence: 15 frames, with an overlap of 10 frames.
• Actions: a1-bend, a2-jump in place, a3-jumping jack, a4-jump forward, a5-run, a6-side, a7-wave1, a8-skip, a9-wave2, a10-walk.
• [Figure: confusion matrix over a1-a10; most classes are recognized at or near 100%, with the remaining confusions concentrated between pairs of actions.]
Robustness Test (Test for Deformity)
[Figure: walking variants — with bag, legs occluded, with dog, normal walk, knees up, with briefcase, limping, with pole, moonwalk, with skirt.]

Test Seq               | 1st Best   | 2nd Best   | Median to all actions
Swinging a bag         | Walk 2.508 | Skip 3.094 | 3.9390
Carrying a briefcase   | Walk 1.866 | Skip 2.170 | 3.6418
Walking with a dog     | Walk 1.806 | Skip 2.338 | 3.8249
Knees Up               | Walk 2.894 | Side 3.270 | 4.0910
Limping Man            | Walk 2.224 | Skip 2.922 | 3.8217
Sleepwalking           | Walk 1.892 | Skip 2.132 | 3.6633
Occluded Legs          | Walk 1.883 | Skip 2.594 | 2.6249
Normal Walk            | Walk 1.886 | Skip 2.624 | 3.6338
Occluded by a pole     | Walk 2.149 | Skip 2.945 | 3.8801
Walking in a skirt     | Walk 1.855 | Skip 2.159 | 3.5401
Robustness Test (View Invariance)

Test Seq | 1st Best    | 2nd Best    | Median to all actions
Dir. 0   | Walk 1.7606 | Skip 2.3435 | 3.6550
Dir. 9   | Walk 1.6975 | Skip 2.3138 | 3.6286
Dir. 18  | Walk 1.7342 | Skip 2.2600 | 3.6066
Dir. 27  | Walk 1.7314 | Skip 2.3225 | 3.5359
Dir. 36  | Walk 1.7721 | Skip 2.3296 | 3.5050
Dir. 45  | Walk 1.7750 | Skip 2.2099 | 3.4217
Dir. 54  | Walk 1.7796 | Skip 2.1169 | 3.3996
Dir. 63  | Walk 1.9683 | Skip 2.3181 | 3.2095
Dir. 72  | Walk 2.2900 | Skip 2.4930 | 3.3460
Dir. 81  | Side 2.6917 | Side 2.8095 | 3.7771
Conclusions/Inferences
• Motion information is used.
• Misclassifications are not spread across action classes.
  – They occur between at most two actions.
• Does not rely heavily on the silhouette mask; only an approximate mask is required.
• Can identify actions from a set of 10-15 frames.
• Can be used in a higher-level activity recognition system where scores for the primitive actions are available.
Thank You Questions?