Pose Invariant Object Recognition Using a Bag of Words Approach


Pose Invariant Object Recognition Using a Bag of Words Approach
Carlos M. Costa, Armando Sousa, Germano Veiga
INESC TEC and Faculty of Engineering, University of Porto, Portugal

ROBOT’2017: Third Iberian Robotics Conference



PRESENTATION OUTLINE

INTRODUCTION
  Context
  Objectives
  Dataset

RECOGNITION SYSTEM ARCHITECTURE
  Recognition System Overview
  Preprocessing
  Feature Detection and Description
  Visual Vocabulary
  Classifier Training
  Sliding Window Object Recognition

RECOGNITION RESULTS

SUMMARY



CONTEXT

▶ Object recognition is a challenging task that can be tackled with computer vision and machine learning algorithms
▶ It has a growing set of applications:
  ▷ Tagging the environment for assisting autonomous driving cars
  ▷ Recognizing and tracking objects for picking tasks
  ▷ Monitoring the environment for early warning systems

Fig. 1: Examples of object recognition and tracking applications


OBJECTIVES

▶ Development of an image recognition system capable of:
  ▷ Learning the characteristic shape of the target objects from a database
    - The database contains a set of possible variations of the target objects when observed from several points of view
  ▷ Tagging the target objects when observed from multiple perspectives and scales
    - Translation, rotation and scale invariant
    - Robust against occlusions and environment clutter



DATASET

▶ Manually annotated dataset (Graz-02¹) containing:
  ▷ 354 images with a resolution of 640 × 480
  ▷ Each image has an associated segmentation mask identifying which pixels belong to instances of cars
  ▷ 50% of the images were used for training the machine learning classifier
  ▷ 50% of the images were used for testing the recognition system

Fig. 2: Camera image (left) and its segmentation mask (right) with cars in red and the environment in black

¹ https://lear.inrialpes.fr/people/marszalek/data/ig02/
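Loading the annotated images and their masks could look roughly like the sketch below, using OpenCV's Python bindings; the directory layout and file naming are assumptions about the downloaded dataset, not the authors' actual loader.

```python
# Hypothetical loader for the Graz-02 car subset; file naming is assumed.
import glob
import cv2

image_paths = sorted(glob.glob("ig02/cars/*.image.png"))  # assumed naming
mask_paths = sorted(glob.glob("ig02/cars/*.mask.0.png"))  # assumed naming

pairs = []
for image_path, mask_path in zip(image_paths, mask_paths):
    image = cv2.imread(image_path, cv2.IMREAD_COLOR)
    mask = cv2.imread(mask_path, cv2.IMREAD_GRAYSCALE)
    pairs.append((image, mask > 0))  # non-zero mask pixels belong to car instances

# 50/50 split between training and testing, as described above
split_index = len(pairs) // 2
train_pairs, test_pairs = pairs[:split_index], pairs[split_index:]
```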



RECOGNITION SYSTEM OVERVIEW

▶ Setup of the visual vocabulary
  ▷ Image preprocessing
  ▷ Detection of image features and extraction of their descriptors
  ▷ Clustering of image features using k-means
  ▷ Classifier training
▶ Image tagging pipeline
  ▷ Image preprocessing
  ▷ Detection of image features and extraction of their descriptors
  ▷ Multiscale sliding window with classifier evaluation
  ▷ Tagging of the target objects by segmenting the sliding window voting mask



PREPROCESSING

All images are preprocessed before usage with:
▶ Bilateral filter
  ▷ Reduces image noise
  ▷ Preserves the edges of the image blobs
▶ Contrast Limited Adaptive Histogram Equalization (CLAHE)
  ▷ Improves the image contrast over local image patches
  ▷ Useful for images with very dark or very bright regions

Fig. 3: Effect of removing noise and improving contrast and brightness (original image on the left, preprocessed on the right)
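A minimal sketch of this preprocessing step with OpenCV's Python bindings; the filter and CLAHE parameters are illustrative assumptions, not the values used by the authors, and the contrast equalization is shown here on the grayscale image even though the authors may have equalized the color image.

```python
import cv2

def preprocess(image_bgr):
    # Bilateral filter: reduces noise while preserving blob edges
    smoothed = cv2.bilateralFilter(image_bgr, 9, 75, 75)
    # CLAHE works on a single channel; features are later detected on a
    # grayscale image, so the contrast is equalized on the grayscale version
    gray = cv2.cvtColor(smoothed, cv2.COLOR_BGR2GRAY)
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    return clahe.apply(gray)
```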



FEATURE DETECTION AND DESCRIPTION

▶ Image features are detected using one of the available detectors present in OpenCV:
  ▷ STAR, ORB, SURF, SIFT, GFTT, MSER, FAST or BRISK
▶ The feature descriptors are computed using one of the following algorithms:
  ▷ SIFT, SURF, BRIEF, ORB, FREAK or BRISK

Fig. 4: Example of image features detected by the STAR algorithm (green circles for features belonging to car instances and red circles for features belonging to the background)
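A hedged sketch of pairing a detector with a descriptor extractor in OpenCV follows; the exact factory functions depend on the OpenCV version (SIFT, SURF and STAR live in the opencv-contrib xfeatures2d module in several releases), and the combination chosen here is only an example.

```python
import cv2

# Detector and descriptor can be chosen independently (e.g. STAR + SIFT);
# ORB is used for both here because it ships with the core OpenCV build.
detector = cv2.ORB_create(nfeatures=1000)
extractor = cv2.ORB_create()

gray = cv2.imread("example_image.png", cv2.IMREAD_GRAYSCALE)  # hypothetical path
keypoints = detector.detect(gray, None)
keypoints, descriptors = extractor.compute(gray, keypoints)
print(len(keypoints), descriptors.shape)
```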


VISUAL VOCABULARY

The image features are analyzed with a Bag of Words approach:
▶ They are segmented into target and background using the dataset masks
▶ Visual words are computed by identifying clusters of features using a k-means algorithm
▶ The normalized histograms associated with the features of these clusters are useful for training a machine learning classifier with characteristic image structures
▶ Typical feature clusters found in cars:
  ▷ Wheels
  ▷ Head lights
  ▷ Grills
  ▷ License plates
  ▷ Windows
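OpenCV ships Bag of Words helpers that match this description; a minimal sketch is given below, assuming a SIFT detector/extractor, a FLANN matcher and an illustrative vocabulary of 100 visual words (the slides do not state the actual cluster count). The separation of target and background features via the masks is omitted for brevity.

```python
import cv2
import numpy as np

detector = cv2.SIFT_create()  # cv2.xfeatures2d.SIFT_create() on older OpenCV builds
bow_trainer = cv2.BOWKMeansTrainer(100)  # number of visual words is an assumption

# train_pairs: (image, car mask) tuples from the dataset loading sketch above
for image, mask in train_pairs:
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    keypoints, descriptors = detector.detectAndCompute(gray, None)
    if descriptors is not None:
        bow_trainer.add(np.float32(descriptors))

vocabulary = bow_trainer.cluster()  # k-means centroids are the visual words

# Turns the features of an image (or image patch) into a normalized
# visual-word histogram, used later as the classifier input
bow_extractor = cv2.BOWImgDescriptorExtractor(detector, cv2.FlannBasedMatcher())
bow_extractor.setVocabulary(vocabulary)
```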


CLASSIFIER TRAINING

After having computed the visual words associated with the target object, a machine learning classifier is trained using one of the following approaches:
▶ Support Vector Machines
▶ Artificial Neural Networks
▶ Normal Bayes Classifiers
▶ Decision Trees
▶ Boosting
▶ Gradient Boosting Trees
▶ Random Trees
▶ Extremely Randomized Trees
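As an illustration, training one of the listed classifiers (a Support Vector Machine) on the visual-word histograms could look like the sketch below, which reuses the detector and bow_extractor from the vocabulary sketch; labeling whole training images by whether their mask contains any car pixels is a simplification of the patch-based training described in the slides.

```python
import cv2
import numpy as np

samples, labels = [], []
for image, mask in train_pairs:
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    keypoints = detector.detect(gray, None)
    histogram = bow_extractor.compute(gray, keypoints)  # normalized BoW histogram
    if histogram is not None:
        samples.append(histogram[0])
        labels.append(1 if mask.any() else 0)  # 1 = contains a car, 0 = background

svm = cv2.ml.SVM_create()
svm.setType(cv2.ml.SVM_C_SVC)
svm.setKernel(cv2.ml.SVM_RBF)
svm.train(np.float32(samples), cv2.ml.ROW_SAMPLE, np.int32(labels))
```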



SLIDING WINDOW OBJECT RECOGNITION

▶ For pinpointing which regions of an image contain instances of the target objects, it is necessary to analyze image patches with different Regions of Interest (ROI) configurations
▶ To achieve this goal, a sliding window algorithm was implemented
▶ It scans the image left to right and top to bottom, several times, with ROIs of increasing size
▶ For the Graz-02 dataset, each image was analyzed with 482 ROIs of 9 different sizes



SLIDING WINDOW OBJECT RECOGNITION

▶ The image ROI starts at the top left corner of the image with 20% of the image size
▶ The ROI moves with increments of 25% of its own size until the whole image is analyzed
▶ The machine learning classifier is run on each ROI, and if its confidence that there is a target in the image patch is higher than 75%, the voting mask pixels associated with the ROI are incremented
▶ After a full image scan, the ROI grows 10% in size
▶ The image analysis is complete when the ROI size reaches the size of the full image
▶ The pixels in the voting mask with a number of votes higher than 5% of the total number of ROIs are considered as belonging to the target object
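A sketch of the multiscale sliding window with a voting mask, following the percentages stated above; classify_roi stands in for running the feature extraction, BoW histogram computation and classifier on a patch and returning a confidence in [0, 1].

```python
import numpy as np

def sliding_window_recognition(image, classify_roi):
    height, width = image.shape[:2]
    votes = np.zeros((height, width), dtype=np.int32)
    roi_h, roi_w = int(height * 0.20), int(width * 0.20)  # initial ROI: 20% of the image
    total_rois = 0

    while roi_h <= height and roi_w <= width:
        step_y = max(1, int(roi_h * 0.25))  # ROI moves by 25% of its own size
        step_x = max(1, int(roi_w * 0.25))
        for y in range(0, height - roi_h + 1, step_y):
            for x in range(0, width - roi_w + 1, step_x):
                total_rois += 1
                if classify_roi(image[y:y + roi_h, x:x + roi_w]) > 0.75:
                    votes[y:y + roi_h, x:x + roi_w] += 1
        roi_h, roi_w = int(roi_h * 1.10), int(roi_w * 1.10)  # grow the ROI by 10% per scan

    # Pixels voted for by more than 5% of all ROIs are tagged as target object
    return votes > 0.05 * total_rois
```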


RECOGNITION RESULTS

▶ The recognition system managed to achieve an accuracy of 87% when applied to the Graz-02 dataset
▶ The next images show the camera images with the detected features and the recognized cars on the left and the sliding window voting masks on the right

Fig. 5: Results obtained with ORB detector, ORB extractor, FLANN matcher and ANN classifier


RECOGNITION RESULTS

Fig. 6: Results obtained with STAR detector, SIFT extractor, FLANN matcher and ANN classifier


RECOGNITION RESULTS

Fig. 7: Results obtained with STAR detector, SURF extractor, FLANN matcher and SVM classifier


RECOGNITION RESULTS

Fig. 8: Results obtained with STAR detector, SIFT extractor, FLANN matcher and SVM classifier


RECOGNITION RESULTS

Fig. 9: Results obtained with SURF detector, SURF extractor, FLANN matcher and ANN classifier


RECOGNITION PERFORMANCE RESULTS

Test ID | Feature detector | Feature descriptor | Feature matcher | Classifier | Accuracy | Precision | Recall
1  | STAR  | SIFT  | FLANN     | Neural Network          | 0.874 | 0.234 | 0.162
2  | STAR  | SURF  | FLANN     | Support Vector Machine  | 0.855 | 0.271 | 0.214
3  | STAR  | SURF  | BFMatcher | Support Vector Machine  | 0.854 | 0.299 | 0.234
4  | STAR  | SIFT  | FLANN     | Support Vector Machine  | 0.847 | 0.306 | 0.362
5  | STAR  | BRIEF | FLANN     | Support Vector Machine  | 0.841 | 0.276 | 0.277
6  | ORB   | ORB   | FLANN     | Neural Network          | 0.839 | 0.206 | 0.195
7  | STAR  | FREAK | FLANN     | Support Vector Machine  | 0.815 | 0.274 | 0.279
8  | SURF  | SURF  | FLANN     | Neural Network          | 0.815 | 0.168 | 0.202
9  | SIFT  | SIFT  | BFMatcher | Neural Network          | 0.794 | 0.217 | 0.296
10 | SIFT  | SIFT  | BFMatcher | Support Vector Machine  | 0.784 | 0.242 | 0.385
11 | SIFT  | FREAK | FLANN     | Support Vector Machine  | 0.605 | 0.191 | 0.717
12 | SIFT  | BRIEF | BFMatcher | Support Vector Machine  | 0.601 | 0.191 | 0.732
13 | BRISK | FREAK | FLANN     | Support Vector Machine  | 0.579 | 0.191 | 0.801
14 | SURF  | SURF  | FLANN     | Decision Tree           | 0.578 | 0.175 | 0.648
15 | SURF  | SURF  | FLANN     | Random Tree             | 0.503 | 0.172 | 0.847
16 | SURF  | SURF  | FLANN     | Boosting Tree           | 0.499 | 0.171 | 0.845
17 | SURF  | SURF  | FLANN     | Extremely Random Tree   | 0.469 | 0.167 | 0.864
18 | ORB   | ORB   | FLANN     | Normal Bayes Classifier | 0.446 | 0.165 | 0.886
19 | SURF  | SURF  | FLANN     | Gradient Boosting Tree  | 0.423 | 0.161 | 0.897
20 | SIFT  | BRISK | FLANN     | Support Vector Machine  | 0.421 | 0.159 | 0.889
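The slides do not state how accuracy, precision and recall were computed; one plausible reading is a pixel-wise comparison between the thresholded voting mask and the ground-truth segmentation mask, sketched below for illustration only.

```python
import numpy as np

def pixel_metrics(predicted_mask, ground_truth_mask):
    # Boolean masks: True where a pixel is tagged as belonging to a car
    tp = np.logical_and(predicted_mask, ground_truth_mask).sum()
    tn = np.logical_and(~predicted_mask, ~ground_truth_mask).sum()
    fp = np.logical_and(predicted_mask, ~ground_truth_mask).sum()
    fn = np.logical_and(~predicted_mask, ground_truth_mask).sum()
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    return accuracy, precision, recall
```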



RECOGNITION PERFORMANCE RESULTS

Test ID | Vocabulary build time | Training samples build time | Classifier training time | Classifier test time
1  | 00m31.204s | 00m44.265s | 00m00.028s | 15m14.323s
2  | 00m21.251s | 00m17.901s | 00m38.217s | 03m02.452s
3  | 00m20.932s | 00m17.985s | 00m37.934s | 03m33.083s
4  | 00m31.204s | 00m44.265s | 00m36.318s | 09m43.652s
5  | 00m20.131s | 00m20.105s | 00m35.184s | 03m46.283s
6  | 01m25.694s | 00m43.962s | 00m00.188s | 17m04.451s
7  | 00m20.824s | 00m24.739s | 00m36.273s | 05m22.562s
8  | 00m37.574s | 00m35.434s | 00m00.201s | 13m03.423s
9  | 01m46.338s | 01m32.902s | 00m00.234s | 43m00.362s
10 | 01m40.631s | 01m30.025s | 00m49.265s | 41m43.748s
11 | 01m06.325s | 01m00.102s | 00m53.349s | 35m35.147s
12 | 01m05.877s | 00m38.599s | 00m50.382s | 25m04.586s
13 | 00m30.058s | 00m29.093s | 00m45.131s | 11m03.882s
14 | 00m37.188s | 00m34.271s | 00m00.064s | 18m05.666s
15 | 00m37.073s | 00m43.967s | 00m00.199s | 16m17.609s
16 | 00m37.495s | 00m43.962s | 00m00.956s | 15m41.621s
17 | 00m35.759s | 00m43.969s | 00m00.491s | 18m33.911s
18 | 01m24.585s | 00m26.650s | 00m05.779s | 27m22.274s
19 | 00m37.207s | 00m43.964s | 00m04.295s | 17m23.841s
20 | 01m08.126s | 01m00.105s | 00m49.559s | 45m40.242s



SUMMARY

▶ We presented a configurable object recognition system capable of segmenting the target objects from images even when they are observed from multiple perspectives
▶ It managed to achieve a recognition accuracy of 87% when tagging cars present in the Graz-02 dataset
▶ This system can be used as the initialization phase for an object tracking algorithm
▶ It can be further improved with more advanced image processing techniques for retrieving the objects' pose in relation to the camera

