Robust Recognition in Sequential Images

Xiang Xiang, 5th-year PhD candidate in Computer Science, Johns Hopkins University

Vision is hard

● Problems
● Models
● Next

Here, "sequential images" refers to videos (2D images over time), hyperspectral images (2D images over channels), medical 3D scans (2D images over slices), and so on.

High-level: action & event

Datasets

Small, video-level labels (action recognition):
● UCF101 (101 classes, ~13,000 short videos)
● HMDB51 (51 classes, ~7,000 short videos)
● ActivityNet (200 classes, ~20,000 videos, some with frame-level labels)

Large, noisy video-level labels (video classification):
● Sports-1M (487 classes, ~1.2M YouTube videos)
● YouTube-8M (4,800 classes, 8 million YouTube videos)

Mid-level: tracking

Under short-term occlusion

Multiple Instance Learning (MIL) tracker + Particle Filter (PF). ACCV 2012: 134-146, Daejeon, Korea.
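The particle-filter side of such a tracker can be sketched in one dimension. This is a minimal, illustrative bootstrap filter under assumed Gaussian motion and observation models, not the paper's implementation; all names and parameters are made up for the sketch.

```python
import numpy as np

def particle_filter_track(observations, n_particles=500, proc_std=1.0,
                          obs_std=1.0, seed=0):
    """Minimal bootstrap particle filter for a 1-D target position.
    Each step: diffuse particles (random-walk motion model), weight
    them by a Gaussian observation likelihood, resample, and report
    the posterior mean. `observations` is a 1-D sequence of noisy
    position measurements (e.g. per-frame detector outputs)."""
    rng = np.random.default_rng(seed)
    particles = np.full(n_particles, observations[0], dtype=float)
    estimates = []
    for z in observations:
        # Predict: random-walk motion model
        particles = particles + rng.normal(0.0, proc_std, n_particles)
        # Update: Gaussian observation likelihood around measurement z
        w = np.exp(-0.5 * ((particles - z) / obs_std) ** 2)
        w /= w.sum()
        # Resample (multinomial) according to the weights
        particles = rng.choice(particles, n_particles, p=w)
        estimates.append(particles.mean())
    return np.array(estimates)
```

In a real appearance tracker the scalar state would be a bounding-box state (position, scale) and the likelihood would come from the MIL classifier's score rather than a Gaussian around a measurement.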

Tracking-Learning-Detection (TLD) tracker

Mid-level: segmentation

Hopkins 155 dataset

Low-level: robust feature matching

MICCAI 2014 Workshop on Computer-Assisted and Robotic Endoscopy, Boston, USA. LNCS 8899, 88-98.

Dynamics to the rescue

● Problems
● Models
● Next

A 157-second video from YouTube.

High-level: keyframe picking for storytelling

Keyframes, skims, storyboards, time-lapses, montages, or video synopses.

Robust PCA
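Robust PCA decomposes a data matrix M into a low-rank part L (e.g. background, or the dominant appearance) plus a sparse part S (e.g. moving objects or corruptions). The sketch below follows the standard inexact augmented Lagrange multiplier (ALM) solver for Principal Component Pursuit; it is a minimal illustrative version, with all parameter defaults taken from the common convention (λ = 1/√max(m, n)), not a tuned implementation.

```python
import numpy as np

def robust_pca(M, lam=None, rho=1.5, tol=1e-7, max_iter=200):
    """Principal Component Pursuit via inexact ALM:
    decompose M = L + S with L low-rank and S sparse."""
    m, n = M.shape
    lam = lam if lam is not None else 1.0 / np.sqrt(max(m, n))
    norm_M = np.linalg.norm(M)
    mu = 1.25 / np.linalg.norm(M, 2)   # penalty parameter
    S = np.zeros_like(M)
    Y = np.zeros_like(M)               # Lagrange multipliers
    for _ in range(max_iter):
        # Low-rank step: singular value thresholding at 1/mu
        U, sig, Vt = np.linalg.svd(M - S + Y / mu, full_matrices=False)
        L = (U * np.maximum(sig - 1.0 / mu, 0.0)) @ Vt
        # Sparse step: elementwise soft thresholding at lam/mu
        T = M - L + Y / mu
        S = np.sign(T) * np.maximum(np.abs(T) - lam / mu, 0.0)
        # Dual update and penalty growth
        Z = M - L - S
        Y = Y + mu * Z
        mu *= rho
        if np.linalg.norm(Z) / norm_M < tol:
            break
    return L, S
```

Applied to a video, each column of M is a vectorized frame: L recovers the static background and S the moving foreground.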

Robust Sparse Representation

OMP (Orthogonal Matching Pursuit): a greedy forward sequential selection method.
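OMP greedily builds the support one atom at a time: pick the dictionary column most correlated with the current residual, then re-fit all selected atoms by least squares. A minimal sketch (function name and argument layout are illustrative):

```python
import numpy as np

def omp(D, y, k):
    """Orthogonal Matching Pursuit.
    D: (m, n) dictionary with unit-norm columns; y: (m,) signal;
    k: target sparsity. Returns a k-sparse coefficient vector x
    with y ≈ D @ x."""
    residual = y.copy()
    support = []
    x = np.zeros(D.shape[1])
    for _ in range(k):
        # Greedy selection: atom most correlated with the residual
        j = int(np.argmax(np.abs(D.T @ residual)))
        if j not in support:
            support.append(j)
        # Orthogonal step: least-squares refit on the active support
        coef, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        residual = y - D[:, support] @ coef
    x[support] = coef
    return x
```

The least-squares refit is what distinguishes OMP from plain Matching Pursuit: after each selection, the residual is orthogonal to every atom already chosen.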

IEEE ICASSP 2015, Brisbane, Australia. In preparation for IEEE Trans. Affective Computing. Code available at https://github.com/eglxiang/icassp15_emotion

Pose-Robust Deep Representation

Challenges:
● Identification involves one-to-many similarities.
● Pose variation in uncontrolled environments confuses identity.
● Processing a video stream is computationally expensive.

Approach:
● K-means clustering of the poses, estimated as rotation angles.
● Frame selection using distances to the K-means centroids.
● Pros: reduces the number of frames from tens or hundreds to K while still preserving the overall diversity.
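The selection step above can be sketched with plain Lloyd's K-means over per-frame pose angles; this is an illustrative minimal version (function name, arguments, and the tie-breaking are assumptions, not the paper's code):

```python
import numpy as np

def select_keyframes(poses, k, n_iter=50, seed=0):
    """Pick k representative frames by K-means clustering the
    per-frame head poses (e.g. yaw/pitch/roll angles) and keeping,
    for each cluster, the frame closest to its centroid.
    `poses` is an (n_frames, d) float array."""
    rng = np.random.default_rng(seed)
    # Initialize centroids with k distinct frames
    centroids = poses[rng.choice(len(poses), k, replace=False)]
    for _ in range(n_iter):
        # Assign each frame to its nearest centroid
        d = np.linalg.norm(poses[:, None] - centroids[None], axis=2)
        labels = d.argmin(axis=1)
        # Update centroids (keep the old one if a cluster empties)
        for j in range(k):
            if (labels == j).any():
                centroids[j] = poses[labels == j].mean(axis=0)
    # Keyframe = the frame nearest to each final centroid
    d = np.linalg.norm(poses[:, None] - centroids[None], axis=2)
    return sorted(set(d.argmin(axis=0)))
```

Downstream face matching then runs the expensive CNN descriptor on only these K frames instead of every frame of the stream.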

YouTube Faces: 3,425 videos of 1,595 subjects. Benchmark tests: an official list of 5000 pairs of videos.

Pipeline:
● Face detection: AdaBoost (OpenCV / DLib).
● Pose estimation via landmark detection: DLib.
● Face alignment (OpenCV): center eyes & mouth; affine warping.
● Face feature descriptor: pre-trained CNN (VGG-Face).
● Face similarity metric: correlation (max correlation).
Code available at https://github.com/eglxiang/ytf
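The affine-warping step in the alignment stage amounts to solving a small least-squares problem: find the 2x3 affine transform that maps the detected landmarks (eyes, mouth) onto canonical template positions. A numpy-only sketch of that computation (the surrounding pipeline uses OpenCV/DLib; these helper names are illustrative):

```python
import numpy as np

def alignment_affine(src_pts, dst_pts):
    """Least-squares 2x3 affine transform mapping detected
    landmarks `src_pts` to canonical template positions `dst_pts`.
    Both are (n, 2) arrays with n >= 3 non-collinear points."""
    n = len(src_pts)
    # Homogeneous coordinates: each row is [x, y, 1]
    X = np.hstack([src_pts, np.ones((n, 1))])
    A, *_ = np.linalg.lstsq(X, dst_pts, rcond=None)
    return A.T  # shape (2, 3): dst ≈ A @ [x, y, 1]

def apply_affine(A, pts):
    """Apply a 2x3 affine transform to (n, 2) points."""
    X = np.hstack([pts, np.ones((len(pts), 1))])
    return X @ A.T
```

In the actual pipeline the resulting matrix would be handed to an image-warping routine (e.g. OpenCV's warpAffine) to resample the face crop.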

Mid-level: tracking

On the YouTube dataset with 50 sequences. A motion model helps!

(a) On-line boosting with fixed scale and Haar-like, HOG & LBP features.
(b) Semi-supervised on-line boosting with fixed scale and Haar-like, HOG & LBP features.
(c) MIL-Track with fixed scale and Haar-like features.
(d) MIL-Track with fixed scale and HOG features.
(e) TLD tracker with adaptive scale and a simply designed feature.
(f) Basic template matching.
(g) Basic mean shift.
(h) Frag-Track.
(i) KLT optical flow; each corner point is drawn with an arrow (motion vector).
(j) Particle filter with all samples drawn.
(k) Frame-by-frame detection results of a state-of-the-art human detector, the Deformable Part Model.

Xiang, Xiang. "A brief review on visual tracking methods." Third Chinese Conference on Intelligent Visual Surveillance (IVS), IEEE, 2011.

Mid-level: segmentation

J. Niebles, B. Han, A. Ferencz, and L. Fei-Fei. Extracting moving people from Internet videos. In ECCV, 2008. J. Niebles, B. Han, and L. Fei-Fei. Efficient extraction of human motion volumes by tracking. In CVPR, 2010.

Low-level: motion feature

From image to video: from dense SIFT to dense optical flow.
Image matching: SIFT Flow. Bottom-up approach: low-level cues.

video matching?

Motion: optical flow.
Occurrence of motion: histogram (bag of visual words).
Spatiotemporal feature: flow words. SIFT-style descriptors computed on patches over time, on the optical flow (velocity, motion gradient) rather than on image gradients.
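The "occurrence of motion" histogram can be sketched by quantizing each per-pixel flow vector into orientation bins, plus a "no motion" bin for small magnitudes. This is a minimal illustrative binning, not the codebook used in any particular paper:

```python
import numpy as np

def flow_word_histogram(flow, n_orient=8, mag_thresh=1.0):
    """Bag-of-flow-words sketch: quantize each flow vector (u, v)
    into one of `n_orient` orientation bins (bin 0 is reserved for
    near-zero motion) and return the normalized histogram.
    `flow` is an (H, W, 2) array of per-pixel (u, v) velocities."""
    u, v = flow[..., 0].ravel(), flow[..., 1].ravel()
    mag = np.hypot(u, v)
    ang = np.arctan2(v, u)  # in [-pi, pi]
    bins = ((ang + np.pi) / (2 * np.pi) * n_orient).astype(int) % n_orient
    hist = np.zeros(n_orient + 1)
    moving = mag >= mag_thresh
    hist[0] = (~moving).sum()             # static pixels
    np.add.at(hist[1:], bins[moving], 1)  # moving pixels by direction
    return hist / hist.sum()
```

Concatenating such histograms over a spatiotemporal grid of patches gives a simple bag-of-visual-words video descriptor.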

Horn and Schunck. Determining Optical Flow. MIT, 1981.
Michael Black. Robust Incremental Optical Flow. Yale, 1992.
Pickup, L. C., Pan, Z., Wei, D., Shih, Y., Zhang, C., Zisserman, A., and Freeman, W. T. Seeing the Arrow of Time. CVPR 2014.
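The classic Horn-Schunck formulation cited above (brightness constancy plus a smoothness term) can be sketched as a fixed-point iteration; this is a minimal illustrative version for two grayscale frames, not a production solver (no pyramid, no robust penalty):

```python
import numpy as np

def horn_schunck(I1, I2, alpha=0.5, n_iter=100):
    """Minimal Horn-Schunck optical flow between two grayscale
    float images. alpha weights the smoothness term; returns the
    per-pixel horizontal (u) and vertical (v) flow components."""
    Ix = np.gradient(I1, axis=1)   # spatial gradients
    Iy = np.gradient(I1, axis=0)
    It = I2 - I1                   # temporal gradient
    u = np.zeros_like(I1)
    v = np.zeros_like(I1)

    def avg(f):
        # Edge-padded 4-neighbor average (the smoothness coupling)
        p = np.pad(f, 1, mode="edge")
        return (p[:-2, 1:-1] + p[2:, 1:-1] + p[1:-1, :-2] + p[1:-1, 2:]) / 4

    for _ in range(n_iter):
        ua, va = avg(u), avg(v)
        common = (Ix * ua + Iy * va + It) / (alpha**2 + Ix**2 + Iy**2)
        u = ua - Ix * common
        v = va - Iy * common
    return u, v
```

The smoothness term is what lets the flow propagate into textureless regions where brightness constancy alone is under-constrained.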

Perform optical flow computation for vertical and horizontal image gradients separately

[Figure: vertical flow component for frame t and frame t+1.]

Patch size (spatial bin size): 6 pixels (4×4 grid).

Motion gradient



Moving forward: high-level video analytics is still largely an open problem.

Questions? Thanks for your attention! [email protected]
