Transfer Report: Object Recognition for Wearable Visual Robots
Robert Castle
Wolfson College
Robotics Research Group
Department of Engineering Science
University of Oxford
Michaelmas Term 2006
This transfer report is submitted to the Department of Engineering Science, University of Oxford. This transfer report is entirely my own work, and, except where otherwise indicated, describes my own research.
Abstract

An autonomous wearable visual robotic assistant is a robot that is worn by a user, and it consists of several parts. Mechanically it comprises at least one camera that may be free to move via a motor assembly, possibly some other sensors such as an inertial sensor, some way of interacting with the wearer, such as a display or audio, and a portable computer. It can be used to complete a wide range of designated tasks, from simple sensor recording to providing timely, useful information about a user's environment, and helping to guide a user to a goal or to complete a task. A wearable robot would be of great use in many fields, such as the military, maintenance, emergency relief work, and tourism. A wearable visual robot that can move semi-independently of its wearer can complete a wider range of tasks than a fixed camera, such as focusing on what a user is doing or looking around a user's workspace for objects while the user continues to work.

To enable a robot to provide a user with useful information about the world, the robot needs to be able to identify objects that are of interest. It then needs to track its location in the world and maintain a map of where the objects are so that it can direct a user to them. To identify objects in an image we review the work done in the field of object detection and select SIFT (scale invariant feature transform) as an object detector. To enable the robot to keep track of its position in the world and create a map of features it has observed, we review the literature on SLAM (simultaneous localization and mapping), and in particular the EKF (extended Kalman filter) used in the single camera SLAM system that we adopt for our system. Finally we review the work that has been done on visual wearables, including work on analysing the optimal location of an active camera to enable it to be useful in a wide range of tasks.

In our work we show that by integrating an object detector into the single camera SLAM system, planar objects from a database can be identified in an image, then localized in 3D relative to the camera and added to a map. This allows the system to keep track of multiple objects and label them on a display as the camera is moved freely. This work provides the foundation for building an autonomous wearable robotic assistant that can provide assistance by identifying and tracking objects in the world and informing a user about them. From here, a wearable will be constructed, improvements to the tracking system using objects will be implemented, and interaction with the user using audio will be added.
Contents

1 Introduction
2 Literature review
  2.1 Object detection
    2.1.1 Feature detectors
    2.1.2 Feature descriptors
    2.1.3 Descriptor evaluation
    2.1.4 Matching
    2.1.5 Randomised trees
    2.1.6 Applying object detection
  2.2 Simultaneous localization and mapping
    2.2.1 Varieties of SLAM
    2.2.2 Extended Kalman filter based SLAM
    2.2.3 Single camera SLAM
    2.2.4 Building on single camera SLAM
  2.3 Visual wearable robots
  2.4 Summary
3 Object detection using SIFT
  3.1 Detection using SIFT
  3.2 Timings
  3.3 Summary
4 Integrating object detection with single camera SLAM
  4.1 Using objects with SLAM
    4.1.1 The object database
    4.1.2 Object detection
  4.2 Single view localization
    4.2.1 Adding recognized object locations to the SLAM map
  4.3 Experimental evaluation
  4.4 Summary
5 Research Direction
  5.1 Wearable system design
  5.2 Using object geometry
  5.3 Using object labels
  5.4 Planning
6 Conclusions
A Courses and seminars attended
  A.1 Summer Schools
  A.2 Courses
  A.3 Robotics Research Group Seminars
  A.4 Computer Vision Reading Group
  A.5 Active Vision Reading Groups
Acknowledgements

Many thanks to Professor David Murray for his help and support. Thanks to Professor David Lowe for providing some source code and for helpful conversations. This work has benefited greatly from many insightful conversations with members of the Active Vision Lab. I gratefully acknowledge the financial support of the UK Engineering and Physical Sciences Research Council (grants GR/S97774 and EP/D037077).
1 Introduction

Humans use vision as their dominant sense, and, as such, our world is built up around visual cues. Enabling computers to access this information allows improved human-computer interaction, such as gesture recognition, and applications such as special effects in films. Computer vision algorithms can be used to extract a vast array of useful information from images. For example, they can recognise faces, individuals, objects, and classes of objects; they can extract the geometry of a scene, enabling reconstruction in 3D and therefore allowing the scene to be digitally altered to create special effects in films; and they can enable a camera to locate itself in space and keep track of its position as it moves.

Digital cameras are becoming ever more ubiquitous: they appear increasingly on personal devices such as mobile phones, portable computers, and games consoles; they have replaced traditional film cameras; they are installed in buildings and cities as security and enforcement cameras; and they act as remote eyes for operators on exploration missions to other planets or in hazardous environments. Cameras are passive sensors, making them unobtrusive and suitable for a wide range of applications, especially as cheap, small, light, low power digital cameras have become available. Combined with the great increase in portable computing power and the reduction in power consumption per computation over recent years, this leads to the possibility of wearable visual computing.

Wearable visual computing generally involves a person having at least one camera attached to them which is also connected to a portable computer. Wearable visual computing opens up a wide range of applications, such as the recording of information for transmission back to a base or for local storage on the portable computer, allowing further analysis of a situation. Another application is the use of a wearable camera as a memory augmentation device, such as the SenseCam [28] being developed by Microsoft Research to record images of significant events throughout a wearer's day.

To leverage the power of wearable visual computing, an element of autonomy from the user is required in the wearable. This would allow the user to work as normal, but have an assistant that can provide relevant, timely, contextual information about the surroundings. A camera that can look around the world on its own and extract relevant information from it is much more valuable than just a passive recorder. Anything that is identified as useful could be conveyed to the user using speech, a laser pointer, a visual display, or augmentation of a video image of the scene. Examples of applications for such a system are: first, as part of a maintenance system, showing what each part is, what work was previously done, and how parts go together, and integrating external data sources such as blueprints to show where pipes and cables are behind walls; second, as a training system to show how a system is operated; third, as a visual guide in a city or a building such as an art gallery, for tourism or for emergency workers; or lastly, for scene reconstruction for offline exploration and labelling.

Mayol-Cuevas [23] showed that the optimal placement of a wearable robotic assistant is on the shoulder, and designed a shoulder-mounted pan-tilt-roll active camera for use as a wearable robotic assistant, shown in Fig. 1.1. This system incorporated Davison's [5] method of simultaneous localization and mapping (SLAM) for a single camera to enable the wearable to keep track of its position in relation to features it had detected in the world. The features detected by this system are points found at corners in the image. A user was able to manually label mapped feature points and have the camera re-fixate between them. This labelling system worked well for enabling the system to re-fixate on features, but the system could not identify "objects", only a sparse set of points. The limitations of using a single point to identify an object are that it may not be detected on successive runs, that another point may be mistaken for it because a single point is not particularly distinctive, and that it may not be detected over large viewpoint changes.

Figure 1.1: The left image shows Mayol-Cuevas' [23] wearable robotic assistant being worn and used to fixate on features. The image on the right shows the camera fixating on a hand-labelled point on the telephone. The feature has been surrounded with a wire-frame cube to identify its location.

In my work I am interested in the idea of automatically recognizing, rather than hand labelling, objects in the scene in which the camera is localized, and in how the known geometry of the recognized objects can be used to reinforce the estimated geometry of the scene. This is a challenging problem, and it draws on the computer vision and robotics literature reviewed in chapter 2.

This year I have taken a particular path through the problem, building the system illustrated in Fig. 1.2. The system first acquires an image from the camera. The single camera SLAM part of the system detects new feature points and measures the location of previously detected ones. These results enable an update to the current estimate of the camera's location to be formed. The newly added object detection part runs the SIFT algorithm by Lowe [20], which is reviewed in detail in chapter 3, on the image to find SIFT features and then compares these with ones in a database of known objects.
Figure 1.2: An overview of the system flow for detecting objects and integrating their measurements into the single camera SLAM system.

Matching objects have their 3D position calculated and are either added to the SLAM system or update previous estimates of their location, improving the location estimate of the camera as well as making the SLAM system more robust in its tracking. Whereas point features require successive measurements from multiple views to get an accurate location relative to the camera, detecting an object with known dimensions allows the object's location relative to the camera to be accurately measured with a single measurement. The detected objects are also labelled on a display for the user. The results in Fig. 1.3 show a frame from a sequence of a scene and the objects that were detected and tracked in a 3D map representation of the scene. The integration of object detection into the single camera SLAM framework is described in chapter 4. This work [3] has been submitted to the IEEE International Conference on Robotics and Automation (ICRA) 2007 and is under review.

Chapter 5 discusses where the research will be taken over the next two years and how the available time will be managed. It will encompass the construction of a wearable robot and
its control with a single camera SLAM system; improved integration of the object detection system, with the goal of frame-rate performance; use of the active camera to search for new objects intelligently and recover from tracking failure by looking for previously tracked objects; running the system with two independent cameras, one the active camera on the wearable and the other a camera on a user's tablet PC for labelled views; and finally building an interface to allow the user to add new objects to the database and label them and provide interactive feedback between the wearable and the user.

Figure 1.3: A view of the scene on the left and a 3D reconstruction of the SLAM map with the identified objects on the right.
2 Literature review

This review is split into the three main strands of research that are required for a wearable robotic assistant that can detect and track objects. The first section covers object detection, the second section covers simultaneous localization and mapping, and the final section covers wearable robotic assistants.
2.1 Object detection

To be able to detect objects in an image, features on the object have to be repeatably detectable, and the geometric constraints between the features must be consistent with those in a reference image. The work in this area can be divided into three main areas: feature detection, descriptor creation, and matching.
2.1.1 Feature detectors

Features can come from corners, edges, blobs, or anything else that can be repeatably localized in an image. Ideally, features on an object should be identifiable regardless of changes in the orientation of that object or its distance from the camera. There are many methods for determining features, each with their associated benefits and weaknesses; we now examine some well known methods.
Harris
The Harris [13] corner detector uses pixel gradients to locate corners, combining them in a matrix which defines the shape of the auto-correlation surface. When both eigenvalues of this matrix are large, this indicates that there has been a significant intensity change in two orthogonal directions, and hence the point may correspond to an area of the image (a corner) that can be reliably located. To avoid the need to explicitly compute the eigenvalues of this matrix, Harris proposed using a simpler expression to infer the eigenvalues, which takes a large value when both eigenvalues are large. While an approximation, this method generally gives good results in practice. It is a relatively slow method requiring convolutions with a Gaussian smoothing mask and derivative masks, but the principal disadvantage is that it is not scale invariant. This means that the same features will not be detected as the position of the object varies within the scene. While it is suitable for frame to frame tracking of individual features it is not desirable for object detection, as objects may appear anywhere in a scene.

Shi-Tomasi
The Shi-Tomasi [31] interest point detector is very similar to the Harris detector. Instead of using an approximate expression to determine when both eigenvalues are large, they are explicitly computed. A point is chosen as a feature when the smaller of the two eigenvalues is above a threshold. The Shi-Tomasi detector suffers from the same problems as Harris, but again is suitable for frame to frame tracking and is used in the SLAM work of Davison [7, 5] described later in this chapter.

FAST
The FAST detector [29, 30] is a method whereby feature points can be identified in an image much more rapidly than by either the Harris or Shi-Tomasi method. For each pixel, a ring of pixels at a certain radius around that pixel is considered, and it is flagged as a feature if at
least n contiguous pixels are all brighter, or all darker, than the centre pixel. The value of n can vary, but for a ring of 16 pixels it is generally taken to be between 9 and 12. This method is able to operate very rapidly for two reasons: firstly, there is no need for any convolutions to be applied to the image, and secondly, many pixels can be rapidly rejected with just a few operations. By judicious choice of the order in which to compare pixels within the ring to the centre pixel, most pixels in the image can be rejected as not being features after just a few (two or three) comparisons. The key advantages of this method are that features may be found computationally very cheaply, and that they are invariant to in-plane rotation. The main limitations are that it does not enable features to be found when there is any blur within the image (perhaps caused by rapid motion of the camera), and, as with the Harris method, it can detect features at one scale only.

Laplacian of Gaussian
The Laplacian of Gaussian (LoG) was originally used by Marr and Hildreth [21] to detect edges at the zero values of the convolution of the image I with the Laplacian (second derivative) of a Gaussian, ∇²G,

    L(x, y, σ) = ∇²G(x, y, σ) ∗ I(x, y),    (2.1)

where ∗ is the convolution operation in the image dimensions, x and y, and

    G(x, y, σ) = (1 / (2πσ²)) e^(−(x² + y²) / (2σ²)).    (2.2)
The top left image of Fig. 2.1 shows a 1D Gaussian, and the top right shows the Laplacian of this Gaussian with its distinctive "Mexican hat" shape. Lindeberg [19] used the LoG as a "blob" detector, where a blob is a locally maximal interest point on the convolved image. Adjusting the Gaussian's σ to blur the image moves the resulting image through a scale-space. Applying a blur to the image is equivalent to the image being further away from the camera, because as an image moves away from a camera less detail is visible, the same as when the image is blurred.
Figure 2.1: The top left graph shows a 1D Gaussian with σ = 1.0 and its Laplacian in the top right. The bottom left shows another Gaussian with σ = 1.4, and the bottom right shows this Gaussian subtracted from the top left. It can be clearly seen how the difference of Gaussian is an approximation of the Laplacian of Gaussian.

From a given image, a scale-space can be created by blurring the image by successive amounts. The scale-space is then a stack of blurred images, with the original at the bottom and the most blurred at the top. Features are found on a LoG image by looking for maxima and minima (extrema) in the scale-space. A point is considered to be an extremum if its magnitude is larger than those of its neighbouring pixels on the same image and on the images above and below it in scale. The LoG is good because it can identify objects as they change in scale in a scene, but its disadvantages are that it is slow, due to the multiple convolutions required to produce a scale-space, and that local extrema can be detected in regions containing contours or straight edges. These extrema are less stable as they are more sensitive to noise and changes in texture.
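As a concrete illustration of the scale-space extrema search just described, the short Python sketch below computes scale-normalised LoG responses (eqs 2.1 and 2.2) over a handful of σ values and keeps points that are extrema in their 3×3×3 space-and-scale neighbourhood. It is only a sketch, assuming NumPy and SciPy are available; the σ list, the threshold, and the function name log_extrema are illustrative choices rather than part of the cited work.

    import numpy as np
    from scipy import ndimage


    def log_extrema(image, sigmas=(1.6, 2.0, 2.5, 3.2, 4.0), threshold=0.02):
        """Return (row, col, sigma) triples at local extrema of the LoG scale-space."""
        image = image.astype(np.float32)
        # Scale-normalised LoG response at each sigma (eqs 2.1-2.2); the sigma^2
        # factor makes responses comparable across scales.
        stack = np.stack([(s ** 2) * ndimage.gaussian_laplace(image, sigma=s)
                          for s in sigmas])
        # A point is an extremum if it is the largest (or smallest) value in its
        # 3x3x3 neighbourhood spanning image position and adjacent scales.
        maxima = (stack == ndimage.maximum_filter(stack, size=3))
        minima = (stack == ndimage.minimum_filter(stack, size=3))
        extrema = (maxima | minima) & (np.abs(stack) > threshold)
        scale_idx, rows, cols = np.nonzero(extrema)
        return [(r, c, sigmas[s]) for s, r, c in zip(scale_idx, rows, cols)]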
Difference of Gaussian
The difference of Gaussian (DoG) used by Lowe [20] is an approximation of the LoG. Instead of applying Gaussians with a successively larger σ to an image, a Gaussian with a smaller σ is applied to the previously convolved image,

    L₁(x, y, σ₁) = G(x, y, σ₁) ∗ I(x, y),    (2.3)
    L₂(x, y, σ₂) = G(x, y, σ) ∗ L₁(x, y, σ₁),    (2.4)

to create a scale space. Each doubling of σ is known as an octave, and with each doubling the Gaussian image is re-sampled to half its size. This is done to greatly reduce the computation time. Subtracting two neighbouring blurred images,

    D(x, y, σ) = (G(x, y, σ₂) − G(x, y, σ₁)) ∗ I(x, y)    (2.5)
               = L₂(x, y, σ₂) − L₁(x, y, σ₁),    (2.6)
is an approximation of the LoG, as can be seen in Fig. 2.1. The top right image is the LoG, and the bottom right image is the result of subtracting two Gaussians with different values of σ. Extrema are found in a similar manner to the LoG by comparing the pixels in the difference images to the neighbouring pixels on the same image and on the images above and below it. The DoG has the same scale invariance benefit as the LoG, but is faster due to the increased efficiency of applying the Gaussians and, mainly, due to image re-sampling. It also has the same extrema instability problem as the LoG. The DoG is described in more detail in chapter 3.

Other detectors
Maximally stable extremal regions, by Matas et al. [22], are regions that contain pixels that are either all brighter or all darker than all of the pixels on the outer boundary. They are robust to a wide range of geometric changes. The Harris-Affine detector by Mikolajczyk and Schmid [24] is a version of the Harris detector that is invariant to scale and affine transformations and
is one of a range of scale and affine invariant interest point detectors that are robust to a wide range of geometric changes.
2.1.2 Feature descriptors

Descriptors describe information relating to a feature. Typically this is gathered from the area around a feature, and this information is stored and used for comparison against other descriptors. There are a variety of qualities that are required to make a good descriptor for object detection. Descriptors need to be distinctive, so that they do not get confused with other features, causing mismatches and false detections. They need to be robust to changes in viewing conditions such as illumination, rotation, and viewing angle. Being fast to compute and match is also an advantage when the system needs to run at frame rate, particularly as the number of objects to be matched increases. The most basic descriptor is just a patch of pixels around the feature, which is fast to compute, but it is unlikely to be very distinctive if the patch is small. It is also not robust to changes in illumination, rotation, viewing angle or scale. Hence we consider more sophisticated approaches below.
2.1.3 Descriptor evaluation

Mikolajczyk and Schmid performed an evaluation of the state of the art descriptors [25]. They tested ten different descriptors on their ability to cope with rotation, scale change, viewpoint change, image blur, JPEG compression, and illumination. A strong performance in all of these tests, except compression and blur, is vital to object detection with a wearable camera running at frame rate. However, a good result in the image blur test would show that objects may still be detectable when the camera is moving very rapidly, which would be a benefit to the tracking system used by the wearable camera. They tested two main types of scene: structured scenes containing distinctive boundaries such as buildings, and scenes with repeated texture. They used a variety of feature detectors
that detect different types of features to remove any bias that may arise if a particular descriptor works better or worse with a particular method of finding features. Their evaluation criterion was based on the number of correct and incorrect matches for an image pair. The viewpoint change evaluation tested the ability of the descriptors to cope with a viewpoint change (i.e. an out-of-image-plane rotation) of approximately 50◦. The rotation evaluation tested the ability to cope with in-plane rotations of between 30◦ and 45◦. Scale was changed by a factor of 2-2.5, and JPEG compression was set so that the quality of the transformed image was 5% of the original. Image blur and illumination change were introduced by changing the settings of the camera. Remarkably, in all of the tests the descriptors based on and including SIFT (scale invariant feature transform) performed better than the other types. GLOH (gradient location and orientation histogram), Mikolajczyk and Schmid's modified SIFT descriptor, obtained the best results in most tests, always closely followed by the original SIFT descriptor by Lowe. SIFT also performed well on both textured and structured scenes. Because SIFT and its derivatives performed well, SIFT is investigated further in chapter 3 for use as an object detector.
2.1.4 Matching

To be able to detect the presence of an object in a scene, features in a reference image and the current image need to be paired up, and a geometric consensus between them established. The method of matching varies with the type of descriptor used, but is always some form of similarity measure. For the SIFT descriptor, nearest-neighbour or second-nearest-neighbour matching is performed by finding the Euclidean distance between points. These methods are described in detail in chapter 3. However, merely finding the closest matches is not enough, as there will most likely be some mismatches. One method for identifying correct matches and removing incorrect ones is random sample consensus (RANSAC) [10, 14]. A random sample of the minimum
number of data points required to generate a hypothesis is drawn, and the number of data points supporting this hypothesis is counted as the "inliers". This is repeated N times, and the final fit is made using the largest set of inliers. For example, Fig. 2.2 shows RANSAC fitting a line to 2D data. Two points are chosen at random each sampling time and a line constructed between them. Points within a specified threshold are inliers and support the hypothesis, and the points outside are outliers and do not support the hypothesis. In the example the hypothesis on the right has more inliers than the one on the left, and so will be accepted in preference. Once a hypothesis is accepted all of the supporting points are used to recalculate the final fit. RANSAC does not find the exact solution, only the best one from N trials. The trials do not need to cover every possible combination of samples, only enough that, with probability p, at least one sample of s points is free from outliers. The number of trials required is

    N = log(1 − p) / log(1 − (1 − ε)^s),    (2.7)

where p is usually chosen to be 0.99 and ε is the probability that any given data point is an outlier. It can be hard to know how many outliers to expect in the data, so ε can be found adaptively. This is done by setting ε to a worst-case estimate, and then, as samples are taken and the consensus shows fewer outliers, ε is changed to the new estimate, reducing the number of trials required. On each trial ε is calculated as

    ε = 1 − (number of inliers / total number of points).    (2.8)
The same technique is used to determine correct feature correspondence pairs between two images, except that instead of fitting a straight line to the data, a homography is fitted using a sample of four corresponding feature pairs. A homography is a plane-to-plane transform in the form of a 3×3 matrix that transforms points from one image to points in another image. Here a homography H is used to transform the matched feature locations in the reference image x_r to the corresponding feature locations in the current image x_c,

    x_c ≡ H x_r,    (2.9)
where the features are in 2D homogeneous coordinates, x = [x, y, 1]⊤. If enough of the remaining reference features transform to their corresponding current image feature within a set threshold (a Euclidean distance from the feature) then the homography is accepted. Once N trials have been completed, the homography with the greatest number of inliers is recalculated using all of the supporting data. If the reference image is of a planar object then the homography identifies where the object is in the current image and allows the object to have its outline drawn around it or its location in 3D found, as shown in chapter 4.

Figure 2.2: A 2D example of the RANSAC algorithm. In the left graph two points have been chosen at random and a line fitted. It can be seen that this is a poor choice as only one extra point falls inside the threshold. The right graph shows another iteration with two different points selected; here a better fit is obtained as all of the points are inliers except one outlier.
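To make the adaptive trial count of equation (2.7) and the outlier-ratio update of equation (2.8) concrete, the sketch below applies RANSAC to the line-fitting problem of Fig. 2.2. It is a minimal NumPy illustration rather than code from this system; the threshold values and the function name are arbitrary. Fitting a homography instead only changes the minimal sample size (s = 4), the model-fitting step, and the distance function.

    import numpy as np


    def ransac_line(points, inlier_thresh=1.0, p=0.99, max_trials=1000, rng=None):
        """Fit a 2D line to an (N, 2) point array with adaptive RANSAC (eqs 2.7-2.8)."""
        rng = np.random.default_rng() if rng is None else rng
        s = 2                                      # minimal sample for a line
        best_inliers = np.zeros(len(points), dtype=bool)
        n_trials, trial = max_trials, 0
        while trial < n_trials:
            trial += 1
            a, b = points[rng.choice(len(points), size=s, replace=False)]
            d = b - a
            if np.allclose(d, 0):                  # degenerate sample, skip it
                continue
            normal = np.array([-d[1], d[0]]) / np.linalg.norm(d)
            inliers = np.abs((points - a) @ normal) < inlier_thresh
            if inliers.sum() > best_inliers.sum():
                best_inliers = inliers
                eps = 1.0 - inliers.sum() / len(points)       # eq (2.8)
                if eps == 0:
                    break                                      # every point fits
                # eq (2.7): trials needed for confidence p of one outlier-free sample
                n_trials = min(max_trials,
                               int(np.ceil(np.log(1 - p) /
                                           np.log(1 - (1 - eps) ** s))))
        x, y = points[best_inliers].T
        slope, intercept = np.polyfit(x, y, deg=1)            # refit on all inliers
        return slope, intercept, best_inliers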
2.1.5 Randomised trees

Lepetit and Fua [18] adopt a different approach to feature descriptors and matching using randomised trees. They use a method that draws on work from classification and machine learning. Here they achieve detection of objects at 25 Hz in 640 × 480 images, and are able to handle large out-of-plane rotations. A set of "key" features {k_i}, i = 1, . . . , K, is chosen and the pixel patch p_i around each is stored. During training each key feature is characterised by inputting it into a set of randomised trees {t_j}, j = 1, . . . , T.
Figure 2.3: Randomised trees. Each node compares a pair of randomly chosen pixels in the patch and branches left or right accordingly, giving 2^Q possible leaf nodes per tree.

At each of the q = 1, . . . , Q levels in tree t_1 a question is asked of two randomly chosen pixels in the patch, as shown in Figure 2.3. Although randomly chosen, once the random pair (m_q, n_q) and (m′_q, n′_q) is chosen for level q, it is fixed for that particular tree. Each patch p_i will result in one of the 2^Q leaf nodes being occupied, with its occupancy incremented from 0 to 1. Each of the remaining trees asks a further set of Q random questions, resulting in a vector of T × 2^Q nodes, which can be viewed as a histogram of T × 2^Q bins. Now many new views of the same key feature are generated, the new patches are fed into the same randomised trees, and the resulting leaf node or bin occupancies are incremented. The histogram can be normalised to form a probability distribution. At run time, any located feature f has its surrounding patch fed into the same set of trees, and the resulting leaf node occupancy is compared against each of the i = 1, . . . , K key features' histograms to evaluate the likelihood of the feature f being a key feature k_i. RANSAC is then used on the potential matches to find a homography. If a consistent homography is found then the object is said to have been detected. As stated before, the method is able to detect an object at video rates, but its scalability to multiple objects is unexplored.
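The following toy sketch captures the flavour of the randomised-tree classifier described above, though not Lepetit and Fua's actual implementation: each tree fixes Q random pixel-pair comparisons, training views of each key feature increment the histogram of the leaf they reach, and at run time a patch is assigned to the key feature whose normalised leaf histograms best explain it. The class layout, the sizes, and the add-one prior are assumptions made purely for illustration.

    import numpy as np


    class RandomisedTrees:
        """Toy patch classifier in the spirit of Lepetit and Fua's randomised trees."""

        def __init__(self, n_classes, patch_size=32, n_trees=20, depth=10, rng=None):
            rng = np.random.default_rng() if rng is None else rng
            self.n_classes, self.n_trees, self.depth = n_classes, n_trees, depth
            # Each tree fixes Q = depth random pixel-pair tests; a patch answers
            # them to select one of 2^Q leaves per tree.
            self.tests = rng.integers(0, patch_size, size=(n_trees, depth, 4))
            # Per-tree, per-leaf histograms over the K key-feature classes.
            self.hist = np.ones((n_trees, 2 ** depth, n_classes))   # add-one prior

        def _leaf(self, tree, patch):
            leaf = 0
            for r1, c1, r2, c2 in self.tests[tree]:
                leaf = (leaf << 1) | int(patch[r1, c1] > patch[r2, c2])
            return leaf

        def train(self, patch, class_id):
            """Drop one (possibly warped) training view of key feature `class_id`."""
            for t in range(self.n_trees):
                self.hist[t, self._leaf(t, patch), class_id] += 1

        def classify(self, patch):
            """Return (best class, posterior) by combining normalised leaf histograms."""
            log_post = np.zeros(self.n_classes)
            for t in range(self.n_trees):
                row = self.hist[t, self._leaf(t, patch)]
                log_post += np.log(row / row.sum())
            best = int(np.argmax(log_post))
            return best, np.exp(log_post[best] - np.logaddexp.reduce(log_post))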
2.1.6 Applying object detection

The above sections have reviewed methods of feature detection and description. Although there is a wide array of detectors and descriptors that could be used, Lowe's SIFT descriptor
was chosen for use in this work because it is a robust and proven detector that performed well in the tests done by Mikolajczyk and Schmid, and also the other detectors that performed well were based on SIFT. Lowe’s DoG detector has also been used in this work. It is not the fastest detector and will most likely be replaced in later work, but it too is proven to work robustly and pairs well with the SIFT descriptor. In chapter 3 the DoG detector and the SIFT descriptor are described in more detail.
2.2 Simultaneous localization and mapping

The second major technical ingredient in the work reported here is simultaneous localization and mapping (SLAM). SLAM is a method to allow a robot to localize itself and construct a map of features as it traverses the world, beginning with little or no knowledge. Any form of data input can be used as a measurement as long as meaningful information can be gleaned from it. For example, many mobile robots use some combination of laser scanners, sonar, odometry, vision, and GPS.
2.2.1 Varieties of SLAM

Attempts to solve the SLAM problem have produced a variety of methods; the main ones are briefly described here.

Extended Kalman filter
The extended Kalman filter (EKF) is widely used in attempts to solve the SLAM problem. Smith and Cheeseman [33] were the first to use the EKF in SLAM, Dissanayake et al. [8] provided the theoretical justification for the use of the EKF, and Davison and Murray [6] were the first to use vision with SLAM, also adopting the EKF. Davison [5] went on to use the EKF in his single camera SLAM system, which is described later in this chapter. As this system is used in our work, the EKF is explained in detail in section 2.2.2.
The EKF is the non-linear extension of the Kalman filter [15]. The Kalman filter is a recursive Bayesian estimator that is used to recursively update the information known about the world with newly measured information in a probabilistic manner that minimises the mean of the squared error. It also assumes that all measurements have a Gaussian uncertainty distribution, which can be a limiting factor for methods based on the Kalman filter. The Kalman filter cannot be applied directly to the SLAM problem, because there is almost always a non-linear relationship between the measurements and the state vector. However, using a linearisation process the extended Kalman filter can be applied. Using an EKF to implement a SLAM system has been the traditional method used in the robotics community for many years. One of its key disadvantages is the requirement to store a fully populated covariance matrix associated with the state, requiring an O(n²) update at every time-step, where n is the number of features in the map. This means that for large maps the computational cost is very large, and to keep the system running in real time careful management of the map size is required. Other methods have been proposed to alleviate this problem and have superior run-time bounds.

FastSLAM
FastSLAM [26, 37] uses a particle filter to keep an estimate of the robot's location. Each particle is an estimate of the robot's location, each with a different weight depending on how certain the system is of that estimate. Each particle contains its own Kalman filter and its own map, which are updated with each measurement. The advantage of this is that the map positions are uncorrelated, allowing the updates to be performed faster than the EKF, at around O(m log n), where m is the number of particles and n is the number of features in the map. Its disadvantage is that as the number of particles grows the system takes longer to update, as each particle requires a complete update, and typically this method does require a large number of particles.
Information filter
Thrun et al. [36, 35] and Walter et al. [39] applied the information filter to SLAM, which is a modification of the Kalman filter and EKF in which the covariance matrix of the features is inverted. This inverted matrix is the information matrix, which in the SLAM problem tends to be highly sparse and diagonal, allowing it to handle larger maps than the EKF with an update bound that is close to O(n). The disadvantage is that the map is not immediately interpretable, as the information matrix needs to be inverted to recover the covariance matrix, which is computationally expensive. This makes it unusable in Davison's single camera SLAM system, as his "active search" algorithm requires the covariance at each step to select which features to measure.
2.2.2 Extended Kalman filter based SLAM

The EKF is adopted in this work as it is used in the single camera SLAM system developed by Davison, which is used as a foundation for this work. The EKF, like the Kalman filter, has three main stages: the prediction, the measurement, and the update stages. The prediction stage predicts where the robot is, with an uncertainty, using its current knowledge of the world, itself, and how it has moved. The measurement stage takes sensor readings, and then in the update stage these are fused with the prediction to obtain a new estimate of the robot's state (its position and orientation) and uncertainty (its covariance). The measurements and the prediction are fused in such a way that the one with the smaller covariance has the greater effect. This process of prediction and update repeats continuously. An example to illustrate how the Kalman filter works is shown in Fig. 2.4; the top left graph shows the estimate of the state x(k − 1|k − 1) with an uncertainty of P (k − 1|k − 1) at step k − 1, which is the posterior from the previous step k − 2. The prediction for step k is made by applying the motion model, and the estimate is predicted to move to the prior x(k|k − 1) with a higher uncertainty, P (k|k − 1), in the top right graph. In the bottom right graph the measurement z(k)
with measurement noise R(k) is made. The update is performed and the prediction x(k|k − 1) is combined with the measurement z(k) to give the new posterior estimate of the state x(k|k) with an uncertainty of P (k|k). It can be seen that the new estimate is more similar to the measurement, as the measurement had a lower uncertainty than the prediction, but the update equations use both to obtain a more accurate estimate. This new estimate is then used on the next iteration. With the Kalman filter any measurement reduces the uncertainty of the system.

A robot using the EKF begins by using its current state and how it believes it has moved to predict its new location and the location of the measured features relative to itself. The vector that stores the information about the robot's position and the position of all the features (the map) that have been measured is known as the state vector, x. Map features are added to the end of the state vector. A plant model f predicts how the robot has moved at step k with a given control input, based on the old state estimate at step k − 1, giving a prediction of the new state,

    x_{k|k−1} = f(x_{k−1|k−1}, u_k, k),    (2.10)

where x_{k|k−1} is the new predicted state, x_{k−1|k−1} is the old state estimate, and u_k is the control input.
The state prediction has an associated covariance,

    P_{k|k−1} = ∇F_x P_{k−1|k−1} ∇F_x^⊤ + ∇F_u U ∇F_u^⊤    (2.11)

(the second term being the process noise),
indicating how certain the prediction of the new state is. The plant Jacobian ∇F is the derivative of f () with respect to either the state x or the control u depending on whether it is being multiplied with the old state covariance or the control noise U. The control noise matrix describes the strength of the noise and the interrelation of the noise on each element with the others; these values have to be obtained experimentally. The control noise increases the uncertainty in the system with each step, so if no measurements are made then the robot will become more and more uncertain of its location.
Figure 2.4: The Kalman filter prediction, measurement, and update cycle. The top left graph shows the estimate of the state x(k − 1|k − 1) with an uncertainty of P (k − 1|k − 1) at step k − 1. The prediction for step k is made and the estimate is predicted to move to x(k|k − 1) with an uncertainty of P (k|k − 1) in the top right graph. In the bottom right graph the measurement z(k) with measurement noise R(k) is made. The update is performed and the prediction x(k|k − 1) is combined with the measurement z(k) to give the new estimate of the measurement x(k|k) with an uncertainty of P (k|k).
The final part of the prediction step is to predict the observations. The predicted observation,

    z_{k|k−1} = h(x_{k|k−1}),    (2.12)
is where the robot believes the features should be, given the predicted state and h(·), the model of the observation process. Now that a prediction of where the robot is and where the features are, together with a covariance for these predictions, is available, measurements z_k are taken with the robot's sensors. The innovation,

    ν_k = z_k − z_{k|k−1},    (2.13)
is the difference between the measurement and the predicted observation. If the prediction z_{k|k−1} is the same as the measurement z_k then the robot and the map points are where it predicted them to be; when they are not the same, there is a corrective factor to apply. The innovation also has a covariance,

    S = ∇H_x P_{k|k−1} ∇H_x^⊤ + R,    (2.14)
as we have uncertainty regarding our measurement and our prediction. The observation model Jacobian ∇H_x is the derivative of h() with respect to x. When a feature or multiple features are observed they have an entry in ∇H_x. The unobserved ones have an entry of zero as they have no effect on this update. R is the observation covariance and specifies the uncertainty of the measurements and how the uncertainties of the measurements are interrelated. The last matrix that needs to be calculated before the update can be done is the Kalman gain,

    W = P_{k|k−1} ∇H_x^⊤ S^{−1}.    (2.15)
This matrix gives a weight to how important the discrepancy between the estimate and the measurement is. Pk|k−1 can be thought of as the confidence in prediction and S−1 the confidence in measurement. When W is small there is a higher confidence in the prediction than in the measurement and when W is large the opposite is true.
Now that the prediction and the measurements have been made, the new state and covariance estimates can be found. The new estimate,

    x_{k|k} = x_{k|k−1} + W ν_k,    (2.16)
is the old estimate plus the corrective factor formed by the innovation weighted by the Kalman gain. If the confidence in the measurement is high then the gain is large and the innovation has a large effect, whereas if the confidence in the measurement is low, W is small and the innovation has a small effect. The new covariance estimate,

    P_{k|k} = P_{k|k−1} − W S W^⊤,    (2.17)
follows the same idea. The covariance always decreases with a measurement, because some information can always be gained. The covariance only increases when no measurements are made, due to the process noise added in the covariance prediction step. The new estimate of the robot’s state and its covariance are now known. These are now used in the next iteration as the old state estimate and covariance.
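The whole cycle of equations (2.10)-(2.17) can be condensed into a few lines of linear algebra. The sketch below is a generic EKF step written with NumPy; the plant model f, the observation model h, and their Jacobians are supplied by the caller, and none of the names are taken from Davison's implementation.

    import numpy as np


    def ekf_step(x, P, u, z, f, h, Fx, Fu, Hx, U, R):
        """One EKF prediction-measurement-update cycle (eqs 2.10-2.17).

        x, P : prior state estimate and covariance
        u, z : control input and measurement vector
        f, h : plant and observation models (callables)
        Fx, Fu, Hx : callables returning the Jacobians of f and h
        U, R : control (process) noise and observation noise covariances
        """
        # Prediction (2.10, 2.11)
        x_pred = f(x, u)
        A, B = Fx(x, u), Fu(x, u)
        P_pred = A @ P @ A.T + B @ U @ B.T

        # Predicted observation and innovation (2.12, 2.13)
        nu = z - h(x_pred)

        # Innovation covariance and Kalman gain (2.14, 2.15)
        H = Hx(x_pred)
        S = H @ P_pred @ H.T + R
        W = P_pred @ H.T @ np.linalg.inv(S)

        # Update (2.16, 2.17)
        x_new = x_pred + W @ nu
        P_new = P_pred - W @ S @ W.T
        return x_new, P_new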
2.2.3 Single camera SLAM

Davison developed an EKF based SLAM system that only used a single camera [5]. The camera is allowed to move generally in 3D, which is a challenge because neither single nor multiple views of a single point yield depth when the motion is unknown. Information comes from points collectively, but as processing has to be completed in a fixed time, a limit must be imposed on the feature map size. For some applications of wearables, such a constraint is quite compatible with a restricted workspace around a user. An example of the single camera SLAM system is shown in Fig. 2.5. In the image on the left the ellipses are the uncertainties for the detected features and the rectangles are the outlines of the patches around the features that are used for matching. The image on the right is a 3D representation of the map; here the camera can be seen near the centre of the image and the detected features around the camera in 3D.
Figure 2.5: The single camera SLAM system running. The image on the left is the camera view with features that have been detected highlighted. The image on the right shows the 3D view of the map with the points and their uncertainty ellipses.

The features are surrounded by their covariance ellipsoids, where the large ellipsoids are features that are not localized well and the small ones that look like points are localized very well. The single camera SLAM system requires information about the scale of the world to be able to work. To this end a calibration plate with known features is required to be in view at a known depth when the system is started. This plate is a black rectangle on a white background and can be seen on the right hand side of the left hand image of Fig. 2.5. In the extended Kalman filter formulation of Davison, the state,

    χ = (c_t, X_1, . . . , X_n)^⊤,    (2.18)

comprises two parts: X_i, the fixed 3D locations of map features, and c_t, which consists of the time-dependent camera position, orientation, translational velocity and angular velocity, all defined in a fixed world frame. The associated covariance,

    P = [ P_cc    P_cX1   · · ·  P_cXn  ;
          P_X1c   P_X1X1  · · ·  P_X1Xn ;
           ...     ...    . . .   ...   ;
          P_Xnc   P_XnX1  · · ·  P_XnXn ],    (2.19)

is fully populated. The state prediction, measurement process, and update cycle are as de-
scribed above in the EKF description, and assume constant translational and angular velocities in the world frame. The choice of features for visual SLAM For Davison’s implementation, features for (potential) addition to the 3D map are detected with the Shi-Tomasi feature detector as described in section 2.1.1, and features that are eventually added to the map are stored with an 11×11 pixel appearance template. Given a predicted camera position, each feature Xi is projected into the new image along with its associated uncertainty region derived at the 3σ limit from the prior covariance projected into the image. Searches are made within the region for correspondence using normalised sum-of-squared difference correlation. Adding and managing features Davison’s original method of feature initialization [5] initializes a new feature in the map using a uniform particle distribution from a depth of 0.5 m to 5 m along a 3D ray from the camera. At subsequent time steps, the likelihoods of hypotheses are tested by projection into the image, and the particles redistributed using importance sampling. Only when the posterior density is Gaussian-like is the feature transmuted from a line to a 3D point. The map management criteria aim to restrict the number of features observable in any one view to be compatible with both the desired localization accuracy and maintaining video frame-rate on portable CPUs. A feature is predicted to be observable from some particular viewpoint if its projection into a model of the camera lies well within the image, and if both angular and range differences between the viewpoint and the feature’s initial viewpoint are not so great as to make the image template valueless. New features are added if the number of those actually observable falls to five or fewer, and a feature is deleted if its long-term ratio of actual observability after search to predicted observability falls below 1:2.
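As a rough paraphrase of these map-management rules in code, the sketch below tests whether a feature is predicted to be observable and applies the add and delete criteria quoted above. The attribute names, the image border margin, and the angle and range limits are invented for illustration; only the "five or fewer visible features" and "1:2 observability ratio" rules come from the text.

    import numpy as np


    def predicted_observable(feature, camera, image_size,
                             border=20, max_angle_deg=45.0, max_range_ratio=1.4):
        """Crude test of whether a mapped feature should be measurable from `camera`.

        `feature` carries its 3D position, the viewpoint it was initialised from,
        and counters from previous searches; all attribute names are hypothetical.
        """
        u, v = camera.project(feature.position)            # project into the image
        w, h = image_size
        if not (border < u < w - border and border < v < h - border):
            return False                                    # too near the image edge
        view_now = feature.position - camera.position
        view_init = feature.position - feature.init_camera_position
        cos_angle = view_now @ view_init / (np.linalg.norm(view_now) *
                                            np.linalg.norm(view_init))
        range_ratio = np.linalg.norm(view_now) / np.linalg.norm(view_init)
        # Reject if the appearance template would have warped or scaled too much.
        return (cos_angle > np.cos(np.radians(max_angle_deg)) and
                1 / max_range_ratio < range_ratio < max_range_ratio)


    def manage_map(features, camera, image_size):
        visible = [f for f in features if predicted_observable(f, camera, image_size)]
        need_new_feature = len(visible) <= 5                # add when too few in view
        # Delete features whose correlation search keeps failing (ratio below 1:2).
        keep = [f for f in features
                if f.times_predicted == 0 or f.times_found / f.times_predicted >= 0.5]
        return keep, need_new_feature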
2.2.4 Building on single camera SLAM

The single camera SLAM system provides a robust foundation for further research. Recently Montiel et al. [27] have shown that features can be initialized over much wider depth ranges than with the particle method by using inverse depth in the state. This is possible because the probability distribution for an inverse depth feature is similar to a Gaussian, so it can be used with the EKF without the need for a particle filter. This method also has the advantage of not requiring a calibration plate to initialize the system, allowing the system to start with no predefined knowledge of the world. Smith et al. [32] showed that lines could be extracted from the scene using FAST corners and used for tracking at frame-rate. Combining line tracking with inverse depth feature points yielded a system that was more robust to tracking failure than either type of feature on its own. In chapter 4 we show how detected objects can be localized and added to the SLAM map.
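As a small illustration of the inverse depth idea, the sketch below converts a six-parameter inverse-depth feature into a Euclidean 3D point. The particular angle convention used for the ray direction is one common choice and is not necessarily the exact parametrization of Montiel et al. [27].

    import numpy as np


    def inverse_depth_to_point(x0, y0, z0, theta, phi, rho):
        """Euclidean point for an inverse-depth feature (illustrative convention).

        (x0, y0, z0) is the camera position the feature was first seen from,
        (theta, phi) encode the viewing ray direction, and rho = 1/depth.
        A near-zero rho represents a feature close to infinity, which is why a
        Gaussian on rho copes well with the large initial depth uncertainty.
        """
        m = np.array([np.cos(phi) * np.sin(theta),       # unit ray direction
                      -np.sin(phi),
                      np.cos(phi) * np.cos(theta)])
        return np.array([x0, y0, z0]) + m / rho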
2.3 Visual wearable robots

The final section of this review gives a brief overview of visual wearable robots. Much of the groundwork in wearable vision has focused on where and how to place cameras on the wearer's body, and how to supply graphically augmented video to the wearer and a remote assistant if present. Fixed cameras have been attached to the torso to recover ambient information [4], and to the head [1, 34] and wrists [38] to provide more specialized, task-oriented views. The positioning of these cameras only allows them to perform a restricted set of tasks, which is suitable for solving the proposed problems, but not for a general visual wearable robotic assistant. Fixed wearable cameras have also been used for visual memory augmentation [9, 16, 28]. These record events from the day triggered by some form of event, such as light change, temperature, or just a regular time interval. However, visual sensing can and should operate at a
more profound level, providing the wearer with autonomous advice as to what is where, where to go next, and so on. There are, of course, many shared problems here with robot navigation; but there are also sharp contrasts: principally, these are that the camera's "carrier" is highly intelligent, but geometrically sloppy, and not amenable to precise or timely control.

Mayol-Cuevas [23] analysed where an active camera should be placed on a human body to allow maximum coverage of the user's work area with minimal interference to the user. He showed that the optimal placement of a wearable robotic assistant is on the shoulder, as this allowed a full view of the user's work space; a view that is similar to that of the user; reasonable independence from the user's motion; minimal occlusion of the surroundings by the user (only by their head); and a location where the user would not find the wearable to be too much of a burden or a hindrance to their actions. He designed a shoulder-mounted pan-tilt-roll active camera for use as a wearable robotic assistant, shown in Fig. 2.6. The system used Davison's [5] single camera SLAM, as described above, to enable the wearable to keep track of its position in relation to the point features it had detected in the world. The wearable could fixate on a feature or saccade to a required feature. A user could manually label mapped feature points and have the camera re-fixate between them. This labelling system worked well for enabling the system to re-fixate on features, but the system could not identify "objects", only a sparse set of points. The limitations of using a single point to identify an object are that it may not be detected on successive runs, that another point may be mistaken for it because a single point is not particularly distinctive, and that it may not be detected over large viewpoint changes.

Another shoulder-mounted active wearable camera, by Kurata et al. [17], was not autonomous but controlled by a remote operator. In their scenario a remote expert controlling the camera was able to guide an untrained user in the completion of complex tasks by use of a laser pointer, audio communications, and graphical information shown on a heads-up display worn by the user. They showed that an active camera allowed the user to complete the tasks more quickly than a fixed camera, as the remote operator was able to look around the workspace without asking
the user to move.

Figure 2.6: The wearable designed by Mayol-Cuevas [23].
2.4 Summary

In this chapter we have reviewed the previous work in the fields of object detection, localization and mapping as applied to a single camera, and wearable visual robots. Davison's single camera SLAM system has a lot of scope for improvement to allow it to use more complex features than just points, and to use input from multiple independent cameras. The wearable visual robot built by Mayol-Cuevas also has a large scope for improvement to make it into an autonomous wearable visual robot that can interact with the user and provide useful feedback. In the work described in chapter 4 we show how the single camera SLAM framework can be extended to detect objects using SIFT (described in the next chapter) and accurately localize them in the world.
3 Object detection using SIFT

3.1 Detection using SIFT

Object detection using SIFT (scale invariant feature transform) as described by Lowe [20] is a four-stage process. In the first stage a scale-space pyramid is created by successive convolutions with Gaussians, and then adjacent images are subtracted to create a difference of Gaussian (DoG) pyramid. The second stage locates the extrema in the DoG to subpixel accuracy and filters out unsuitable extrema, and the third stage creates the SIFT descriptor for each extremum. Matching is the final stage and is done using Lowe's second-nearest-neighbour algorithm.

The scale-space pyramid for an image is created by successively convolving the image with Gaussians. To create the scale-space faster than applying Gaussians with successively larger values of σ, which would require larger masks to represent the significant parts of the Gaussian, Lowe convolves a Gaussian with a smaller σ, and hence a smaller mask, with the previously convolved image. Each doubling of σ is known as an octave, and with each doubling the Gaussian image is re-sampled to half its size. A representation of the scale pyramid is shown in Fig. 3.1 on the left; the first image is at the bottom of the pyramid. According to Lowe, by initially doubling the image the number of stable features increases by almost a factor of four. Fig. 3.2 shows the first image from each octave of the scale-space pyramid; here it can be seen how the image changes as it traverses scale space.
Figure 3.1: The scale space pyramid on the left is created by successive convolutions with a Gaussian, and at each octave (doubling of σ) the image is halved in size. The difference of Gaussian pyramid on the right is created by subtracting adjacent images in an octave.

The difference of Gaussian (DoG) is formed by subtracting adjacent images in an octave, as shown in Fig. 3.1. An octave from the scale space pyramid is shown in Fig. 3.3 along with the resulting DoG octave. The second stage locates extrema (maxima and minima) by searching the octaves of the DoG pyramid, with each pixel compared against its neighbours at the same scale and its scale neighbours above and below. If the current pixel is the largest or the smallest it is recorded as an extremum. This comparison can be quite fast as it can end as soon as both constraints fail, but it is a variable speed search, depending on the number of extrema. The extrema now need to be located to subpixel accuracy, because at the higher scales one pixel represents a large region of the original image. The subpixel localization is done using a method developed by Brown [2] that fits a 3D quadratic function around the extrema. The extrema are then filtered for contrast
to ensure that the difference between them and the surrounding pixels is large enough to be distinctive. Finally they are filtered on their edge response, as features on an edge are unstable and hard to relocalize because all positions along an edge look similar.

The third stage creates descriptors for each of the extrema identified. Each feature is assigned an orientation by forming a histogram of the orientations of the pixels around the extremum at the extremum's scale, and selecting the dominant orientation. A pixel's orientation is found by calculating the angle of the gradient across the pixel, and its magnitude is the magnitude of the gradient. The descriptor, as shown in Fig. 3.4, is formed from the orientations and magnitudes of the pixels around the keypoint by sampling over a 16×16 array aligned to the keypoint's orientation. This array is broken down into 4×4 subregions, and within each subregion the orientations are used to fill an orientation histogram of eight bins, weighted by the magnitudes of the pixels and by a Gaussian to give pixels nearer the extremum more weight than those further away.

Figure 3.2: The first image from each octave is shown. It can be seen how the detail is lost as the image is blurred and resized. The largest image on the left is the initially doubled image. The images on the right are from the remaining octaves.
The combination of these orientation histograms from the 16 subregions gives a 128-element descriptor. The descriptor is then normalised to provide invariance to illumination change, as a change in illumination results in a constant factor being applied to the whole descriptor. The middle pair of images in Fig. 3.5 shows the detected SIFT features. The arrows show the orientations of the keypoints and their lengths represent the scales at which they were found; the shortest arrows are from the doubled image and the longest from the smallest image in the DoG pyramid.

The final stage of SIFT object detection is to match the keypoints. Lowe [20] uses the second-nearest-neighbour algorithm to identify matches, an adaptation of the nearest-neighbour algorithm. The nearest-neighbour algorithm works by calculating the Euclidean distance from each feature descriptor in one image to each feature descriptor in the other image; the descriptor with the smallest Euclidean distance that falls below a predefined threshold is accepted as a match. The second-nearest-neighbour algorithm works on a similar principle, except that it finds both the nearest and the second-nearest neighbour. If the two neighbours are too close together (their distance ratio is above a predefined threshold) the match is rejected, as the neighbours are too similar to tell apart reliably, regardless of how close they are to the feature being compared. If the ratio is below the threshold the nearest neighbour is considered a match.

Figure 3.3: One octave of the scale-space pyramid is shown on the top with its corresponding DoG pyramid on the bottom.
A correspondence between features in the two images is now known, and any unmatched features are discarded. The robustness of SIFT can be seen in Fig. 3.5: the reference image of the pansy on the left can be matched to the scene image on the right, even though in the scene the pansy is at a much smaller scale, is rotated into the scene by a large amount, and is contaminated by radial distortion due to a wide-angle lens.

Figure 3.4: The left image shows an 8×8 array (instead of the full 16×16 array, for clarity) oriented around the keypoint and overlaid on the image pixels. The circle represents the Gaussian weighting that is applied and the arrow is the orientation of the keypoint. The image on the right shows the full descriptor with the orientation histograms for the 4×4 subregions.
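To make the matching rule concrete, below is a minimal sketch of the second-nearest-neighbour (ratio) test described above. It is illustrative Python rather than the C++ used for the actual system, and the 0.8 ratio threshold is simply one plausible choice for the predefined threshold.

```python
import numpy as np

def ratio_test_match(desc_a, desc_b, ratio=0.8):
    """Second-nearest-neighbour matching between two sets of SIFT descriptors.

    desc_a : (N, 128) array of descriptors from one image.
    desc_b : (M, 128) array of descriptors from the other image.
    Returns a list of (index_a, index_b) pairs that pass the ratio test.
    """
    matches = []
    for i, d in enumerate(desc_a):
        # Euclidean distance from descriptor d to every descriptor in desc_b.
        dists = np.linalg.norm(desc_b - d, axis=1)
        nearest, second = np.argsort(dists)[:2]
        # Accept only if the nearest neighbour is clearly closer than the
        # second nearest; otherwise the two are too similar to tell apart.
        if dists[nearest] < ratio * dists[second]:
            matches.append((i, int(nearest)))
    return matches
```

Because every descriptor in one image is compared against every descriptor in the other, the cost grows linearly with the number of stored features, which is the behaviour examined in the next section.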
Figure 3.5: The top pair of images are the original images to be matched. The middle pair show the SIFT features detected on the image. The arrows show the orientation of the features and their length represents the scale at which they were found. The shortest arrows are from the doubled image and the longest from the smallest image in the DoG pyramid. The bottom pair shows the second-nearest-neighbour matches between the images.
3.2 Timings

The main drawback with SIFT is the time it takes to run. When detecting a single object the majority of the time, as shown in Table 3.1, is spent creating the scale-space pyramid and DoG and locating the extrema, followed by the second-nearest-neighbour matching and finally the descriptor creation. The DoG could be replaced with another, faster scale-invariant feature detector, such as the Harris-Affine detector of Mikolajczyk and Schmid [24], using only the SIFT descriptor, but these still do not run at frame-rate. There are two simple ways of speeding up the DoG: firstly, not doubling the image initially, although this significantly reduces the number of features found, and secondly, searching only a subregion of the image rather than the whole of it. These two ideas can be used intelligently in a tracking system: if an object has already been located, only the region where the object is expected need be checked. The matching time will vary depending on how many features are found in the image and how many are in the database.

Stage                                     Time (ms)
Scale pyramid creation                    166
DoG creation and extrema localization     160
Descriptor creation                       292
Matching                                  360
Total time                                978
Table 3.1: Example timings for SIFT running on a 320×240 image generating 524 features and performing second-nearest-neighbour matching against 2904 features from another image resulting in 9 matches.
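For reference, the two stages that dominate the first two rows of Table 3.1 (building one octave of the Gaussian and DoG stacks, and the 26-neighbour extremum scan) have roughly the shape sketched below. This is illustrative Python, not the C++ implementation; the subpixel refinement and the contrast and edge filtering are omitted, and for simplicity each image is blurred directly from the input rather than incrementally from the previous image.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def gaussian_octave(image, sigma=1.6, intervals=3):
    """One octave of the scale-space pyramid: intervals + 3 blurred images, so
    that the resulting DoG stack spans a full doubling of sigma."""
    k = 2.0 ** (1.0 / intervals)
    return [gaussian_filter(image.astype(float), sigma * k ** i)
            for i in range(intervals + 3)]

def dog_stack(gaussians):
    """Difference of Gaussian: subtract adjacent images in the octave."""
    return np.stack([g1 - g0 for g0, g1 in zip(gaussians[:-1], gaussians[1:])])

def scan_extrema(dogs):
    """Record every pixel that is the largest or smallest of its 26 neighbours
    (8 at the same scale, 9 in the scale above, 9 below).  Unoptimised triple
    loop, kept simple for clarity."""
    extrema = []
    S, R, C = dogs.shape
    for s in range(1, S - 1):
        for r in range(1, R - 1):
            for c in range(1, C - 1):
                patch = dogs[s - 1:s + 2, r - 1:r + 2, c - 1:c + 2]
                if dogs[s, r, c] in (patch.max(), patch.min()):
                    extrema.append((s, r, c))
    return extrema
```

The next octave would start from the Gaussian image with twice the initial σ, re-sampled to half its size, as described in section 3.1.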
As the number of objects in the database increases, matching becomes the main factor contributing to the timing, owing to the high dimensionality of the descriptor. It can be seen in Fig. 3.6, for a database of up to five planar objects matched against a 320×240 image, that the matching time increases linearly, and that it takes only two to three objects before matching takes as long as the rest of the SIFT process. To improve scalability there are methods for searching the high-dimensional descriptor space faster than the linear second-nearest-neighbour search, such as KD-trees [11] or Lowe's modification of this called
best bin first [20]. These methods only help to reduce the searching time up to a point, before they themselves reach their limits. Another way to improve the search time is to reduce the dimensionality of the descriptors from 128, but the drawback is that the quality of the matching suffers greatly. Using a tracking system and higher-level knowledge of the world, search times can be improved by only searching for objects that should be in frame and for ones that might occur in the local space.

Figure 3.6: SIFT matching time vs. number of features. Time taken to match up to five objects from a database against a 320×240 image containing 524 SIFT features that took 620 ms to locate and create. The plot shows the matching time and the total time against the number of database features (2000 to 10000).
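As an illustration of how the tree-based search mentioned above might slot in, the sketch below replaces the linear scan with a KD-tree queried for the two nearest neighbours. It uses SciPy's cKDTree with a non-zero eps as a stand-in for approximate search; this is in the spirit of, but not identical to, Lowe's best-bin-first method, and the parameter values are assumptions.

```python
import numpy as np
from scipy.spatial import cKDTree

def kdtree_ratio_match(scene_desc, database_desc, ratio=0.8, eps=0.5):
    """Approximate second-nearest-neighbour matching via a KD-tree.

    The tree over the database descriptors need only be built once and can be
    reused every time detection is run; eps > 0 trades match quality for speed.
    """
    tree = cKDTree(database_desc)
    # Two nearest database neighbours for every scene descriptor at once.
    dists, idxs = tree.query(scene_desc, k=2, eps=eps)
    keep = dists[:, 0] < ratio * dists[:, 1]
    return [(int(i), int(idxs[i, 0])) for i in np.flatnonzero(keep)]
```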
3.3 Summary

Being able to reliably detect objects in a scene under large geometric transformations makes SIFT suitable for use as an object detector in a wearable robotic assistant. Its downside is the time it takes to run. It is unlikely to run at frame-rate in the near future, but when used alongside tracking, object detection does not need to occur every frame, only at regular intervals.
4 Integrating object detection with single camera SLAM

4.1 Using objects with SLAM

In the system currently implemented, objects are added manually to a database offline. When the system is running, this database is queried for objects matching those found in the scene. If a match is found, the object's location in space is computed and the object is added to the SLAM map; if the object is already in the map, its location is calculated and used to update the state in the EKF.
4.1.1 The object database

In our work we build a database of planar objects. To construct an entry, an image of the object is captured and, after correcting for radial distortion, SIFT descriptors are computed and stored along with their image positions x_o. The image need not be fronto-parallel, so it is also necessary to compute and store the homography H_o between the scene and the image by choosing n ≥ 4 image points whose corresponding scene points can be located in a Euclidean coordinate frame. In 2D homogeneous coordinates, points in the object plane are X_o = [X_o, Y_o, 1]^⊤, and

X_o ≡ H_o^{-1} x_o.
(4.1)
Implicitly, then, we know the set of scene points that gives rise to the salient features. As illustrated in Fig. 4.1, the database entry contains the list of SIFT descriptors and their locations X_o. In addition we store the locations of boundary points to define the object's extent and, as explained later, flag three of them for use in the SLAM map. The image of the object is rectified by the homography so that it appears as a fronto-parallel view.

Figure 4.1: The object database contains planar objects (here pictures), the list of SIFT descriptors and their locations X_o (crosses), and the locations of boundary points, three of which are marked for use in the SLAM map.
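A minimal sketch of building such an entry is given below. It is illustrative Python using OpenCV rather than the C++ of the actual system; it assumes an OpenCV build that exposes SIFT as cv2.SIFT_create, and the pixels-per-millimetre factor used for the rectified view is an arbitrary choice.

```python
import cv2
import numpy as np

def build_database_entry(image, scene_pts_mm, image_pts, px_per_mm=2.0):
    """Construct one planar-object database entry.

    image        : undistorted 8-bit greyscale image of the object
    scene_pts_mm : (n >= 4, 2) metric positions of reference points on the object plane
    image_pts    : (n, 2) corresponding pixel positions in `image`
    """
    scene_pts_mm = np.asarray(scene_pts_mm, dtype=np.float32)
    image_pts = np.asarray(image_pts, dtype=np.float32)

    # H_o maps object-plane points X_o (homogeneous) to image points x_o (eqn 4.1).
    H_o, _ = cv2.findHomography(scene_pts_mm, image_pts)

    # SIFT keypoints and descriptors, with positions kept in image coordinates.
    sift = cv2.SIFT_create()
    keypoints, descriptors = sift.detectAndCompute(image, None)
    x_o = np.float32([kp.pt for kp in keypoints])

    # Rectify to a fronto-parallel view: map image pixels into a scaled object plane.
    S = np.diag([px_per_mm, px_per_mm, 1.0])            # millimetres -> pixels
    w = int(px_per_mm * scene_pts_mm[:, 0].max()) + 1
    h = int(px_per_mm * scene_pts_mm[:, 1].max()) + 1
    rectified = cv2.warpPerspective(image, S @ np.linalg.inv(H_o), (w, h))

    # A full entry would also store the object's boundary points (Fig. 4.1).
    return {"H_o": H_o, "keypoints": x_o, "descriptors": descriptors,
            "rectified": rectified}
```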
4.1.2 Object detection

SIFT is run at regular intervals on the current video frame, the locations of the detected features are corrected for radial distortion, and the detected features are matched against the stored
keypoints of the known objects. Lowe's second-nearest-neighbour method is used to generate a set of candidate matching descriptors [20]. If the number of matched points from any given object's database entry to the current image is greater than a threshold (in our case eight), we regard that object as a candidate. Because of repeated structure or other scene confusion, some of the features may be incorrectly matched. However, as the database objects are known to be planar, the database scene points [X_o, Y_o, 1]^⊤ and the currently observed image points x_i are related by a plane-to-plane homography

x_i ≡ H_i [X_o, Y_o, 1]^⊤ ≡ H_i H_o^{-1} x_o.
(4.2)
RANSAC, as described in section 2.1.4, is used to estimate the homography H_i and, if a sufficiently large consensus set is found, we infer that the database object is visible in the current frame.
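The detection test can be sketched as follows, reusing the entry structure from the sketch in section 4.1.1 and OpenCV's RANSAC homography estimator in place of the procedure of section 2.1.4. The reprojection and inlier thresholds are assumptions; only the eight-match candidate threshold comes from the text.

```python
import cv2
import numpy as np

def verify_object(db_entry, scene_pts, matches,
                  min_matches=8, min_inliers=8, reproj_thresh=3.0):
    """Decide whether a database object is visible in the current frame.

    db_entry  : entry as built in the earlier sketch (stores H_o and keypoints x_o)
    scene_pts : (M, 2) undistorted image positions of the frame's SIFT features
    matches   : list of (database_index, scene_index) pairs from the ratio test
    Returns (H_i, inlier_mask) if the object is accepted, otherwise None.
    """
    if len(matches) < min_matches:          # candidate threshold (eight in our case)
        return None

    db_idx = [i for i, _ in matches]
    sc_idx = [j for _, j in matches]

    # Object-plane coordinates of the matched database features: X_o = H_o^{-1} x_o.
    x_o = np.hstack([db_entry["keypoints"][db_idx], np.ones((len(matches), 1))])
    X_o = (np.linalg.inv(db_entry["H_o"]) @ x_o.T).T
    X_o = X_o[:, :2] / X_o[:, 2:]

    # RANSAC homography between the object plane and the current image (eqn 4.2).
    H_i, mask = cv2.findHomography(np.float32(X_o), np.float32(scene_pts[sc_idx]),
                                   cv2.RANSAC, reproj_thresh)
    if H_i is None or int(mask.sum()) < min_inliers:
        return None
    return H_i, mask
```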
4.2 Single view localization

Having determined that an object is visible, we recover its location by decomposing the homography between the scene and the current image. In the Euclidean object-centred coordinate frame the object lies in the plane Z_o = 0, and 3D homogeneous points on the object are X_o = [X_o, Y_o, 0, 1]^⊤. In any view i, the projection can therefore be written in terms of extrinsic and intrinsic parameters as x_i ≡ K_i [R_i | t_i] X_o. Hence

x_i ≡ K_i A_i [X_o, Y_o, 1]^⊤,
(4.3)
where A_i = [r_i1 r_i2 t_i] contains the translation t_i and the first two columns of the rotation matrix R_i, all modulo a scaling factor. Using the homography H_i already computed as the output of RANSAC, and assuming known camera calibration K_i, we can recover

[r_i1 r_i2 t_i] = K_i^{-1} H_i,
(4.4)
again up to scale. Because the estimate H_i is noisy, there is no guarantee that r_1 and r_2 found as above will be orthogonal (which they are required to be, as they are columns of a rotation
matrix). The vectors r_1 and r_2 are orthogonalised using Gram-Schmidt orthogonalisation, and the third column is then found by taking their cross product. This method is not ideal as it assumes that there is no error in r_1; a preferable method would distribute the errors among all three columns. The rotation matrix and translation vector calculated in this way specify the transformation of the camera from the frame of reference of the object's canonical database image. Applying this transformation in reverse places the object in the frame of reference of the camera; applying a further transformation, determined by the camera's current position relative to the world coordinate frame defined by the SLAM map, then gives the position of the object in world coordinates.
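The decomposition and the chaining of transformations can be sketched as follows. This is illustrative Python only; the sign and normalisation conventions, and the way the camera-to-world pose is represented, are assumptions that would have to match the actual camera model and SLAM parameterisation.

```python
import numpy as np

def pose_from_homography(H_i, K):
    """Recover the object-to-camera transform from H_i (eqns 4.3 and 4.4).

    Returns (R, t) such that a point [X_o, Y_o, 0] on the object plane maps to
    R @ [X_o, Y_o, 0] + t in the camera frame, in the object-plane units.
    """
    A = np.linalg.inv(K) @ H_i                       # [r1 r2 t] up to scale
    scale = 0.5 * (np.linalg.norm(A[:, 0]) + np.linalg.norm(A[:, 1]))
    if A[2, 2] < 0:                                  # keep the object in front of the camera
        scale = -scale
    A = A / scale

    r1 = A[:, 0] / np.linalg.norm(A[:, 0])
    # Gram-Schmidt: remove the component of r2 along r1 (all the error is
    # absorbed by r2, as noted in the text).
    r2 = A[:, 1] - np.dot(A[:, 1], r1) * r1
    r2 /= np.linalg.norm(r2)
    r3 = np.cross(r1, r2)
    return np.column_stack([r1, r2, r3]), A[:, 2]

def object_points_in_world(points_obj, R, t, R_wc, t_wc):
    """Chain the object-to-camera transform with the camera-to-world pose
    (R_wc, t_wc) from the SLAM estimate to place object points, e.g. the three
    flagged boundary points with Z_o = 0, in world coordinates."""
    points_obj = np.asarray(points_obj, dtype=float)  # (N, 3), third column zero
    X_cam = (R @ points_obj.T).T + t                  # object frame -> camera frame
    return (R_wc @ X_cam.T).T + t_wc                  # camera frame -> world frame
```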
4.2.1 Adding recognized object locations to the SLAM map

A number of methods of adding recognized object locations to the SLAM map can be envisaged. In this work we choose the straightforward but effective approach of using the recovered 3D position of the planar object to define 3D point measurements. However, we do not use the feature positions themselves. Instead we use points on the object's boundary, and use the minimum of three needed to define the plane, so avoiding overburdening the SLAM process. For example, for the rectangular pictures used in the experiments, three of the four corners are used, as indicated in Fig. 4.1. There are several benefits to this approach. First, no additional mechanism is required in the SLAM process: provided reasonable values are supplied for the (typically much lower) 3D error in these points, constraints on the scene will propagate properly through the covariance matrix. Secondly, and perhaps most importantly, we do not rely on any particular SIFT features being re-measured over time; in effect, the underlying planar structure allows transfer of the chosen three boundary points, much reducing the likelihood of the mismatching which would otherwise damage the map. Thirdly, for graphical augmentation, the boundary
points provide a convenient representation of the extent of the object.

Object label    Image size    No. of keypoints
Pansy           640 × 480       964
Bluebells       640 × 480      1311
Grasshopper     640 × 480      1447
Durdle Door     640 × 480      2954
Poster          640 × 453      2904
Total                          9580

Table 4.1: Database objects, keypoints, and the image size.
4.3 Experimental evaluation

In the tests of the system reported here, a database of five planar objects was used. SIFT was run on each object and the keypoints generated; the size of the image and the number of keypoints generated for each object in the database are given in Table 4.1. The detection, localization and SLAM methods have been implemented in C++ and run on a single 3.2 GHz Pentium 4 CPU. Including operating system overheads, monocular SLAM with around 20 point features takes approximately 10 ms on a 320×240 image, leaving some 20 ms per frame to pursue detection. SIFT detects around 300 keypoints and takes some 500 ms per frame to complete on average, and matching one object takes on average 150 ms. While SLAM
runs at 30 Hz, the detection can run at around 1 Hz at best. These timings will of course vary with the number of objects in the database; with larger databases a faster searching method such as KD-trees would be required. To allow the results to be stored, the experiments reported here were run on a pre-recorded sequence, with the SIFT processing run only every 20 frames.

Fig. 4.2 shows the evolution of processing, from initial calibration of the SLAM system to a time when there are five recognized planar objects in the SLAM map. The 2D views show the overlaid identities and extents of the objects, typical of what would be useful to the user of a wearable or hand-held camera. The views on the right show the evolution of the 3D map, with textures marking the recognized areas.

Figure 4.2: The camera view is shown on the left and the map on the right. (a) Frame 0000: the initial image, with the calibration plate visible in the map; (b) Frame 0022: the first object is detected; (c) Frame 0159: the poster is re-detected; (d) Frame 0502 and (e) Frame 0713: more objects are detected and added to the map; (f) Frame 1186: all objects have been detected and successfully localized.
Fig. 4.3 shows various views around a particular 3D map in which there are five picture objects, three of which (poster, bluebells, grasshopper) should be coplanar with the calibration plate (and hence in the XY-plane), while the other two (pansy, Durdle Door) are mutually orthogonal, in the ZY and XZ planes respectively. It can be seen that all of the objects lie in their respective planes to within experimental error. Table 4.2 shows the angles between the planes recovered from the SLAM map.

Figure 4.3: (a) Perspective view of the whole 3D map. (b, c) Views along the x- and y-axes: individually recognized and located planar objects on the XY wall are recovered as coplanar to within map error. (d) Two of the objects: the right angles between the two walls and the floor are also recovered faithfully. See Table 4.2.

Although the map looks well-formed in terms of coplanarity and angles, the error covariance for the 3D measurements is not fully characterised. It depends, of course, on the image
covariance and its tortuous propagation through the estimation of homographies and their decomposition, and a subsequent Euclidean transformation. Here, as an expedient, we have assumed that the error covariance is diagonal and have examined the effect of inserting a single 3D feature on the performance of the EKF when already running "standard" monocular SLAM. Tuning the performance to the size of the covariance suggests that the lateral and depth errors are of the order of 5 mm and 10 mm respectively.

Object label    Actual angle (deg)    Measured angle (deg)
Bluebells              0                     3.0
Grasshopper            0                     9.0
Poster                 0                     3.0
Pansy                 90                    90.3
Durdle Door           90                    96.3

Table 4.2: Angles between the calibration plate and the objects.
4.4 Summary

This research has shown that objects can be reliably detected and localized in 3D, producing results that are accurate and able to aid both the SLAM system and the user. The object measurements entered into the SLAM map have several properties that greatly aid the SLAM system. They are sparse, keeping the number of entries added to the covariance matrix to a minimum; this allows more objects and other features to be tracked within the limited time budget than a dense representation would, such as adding all of the object feature locations. They are robust to partial occlusion, as they are based on multiple feature measurements on an object, so failing to measure some features does not prevent the identification and localization of the object. They are less likely to be corrupted by mismatching, as they come from a larger set of features that have been filtered for incorrect matches and that agree with a geometric consensus. They are also more accurate, as they can be localized from a single view rather than requiring the multiple measurements from different views and the particle filter method used for point features. This work will be developed further as described in the next chapter.
5 Research Direction

The focus of the research over the next two years will be on building a full wearable system and utilising object detection to improve the SLAM system and provide a better user experience. A sketch of the currently envisioned system is shown in Fig. 5.1.
5.1 Wearable system design

The first theme in my proposed research is the construction of a wearable camera combined with a tablet PC display and camera, as these will be required to carry out much of the intended research. This will be developed in two main stages. The first will involve the construction of an active camera platform, the writing of control software for it, and its integration into the SLAM framework. This will provide a foundation for some of the object detection work detailed in the next section. The active camera platform will eventually be similar to the one previously built in the Active Vision lab by Mayol-Cuevas [23], using the micro-servo motors and modified camera needed to make it portable. At this stage the wearable does not need to be mounted on a user, so a desk-bound model that uses standard servo motors and an unmodified commercial firewire camera will be built. This will allow rapid prototyping of motor layouts and testing of software. The previous wearable was mounted on a collar around the user's neck-shoulder intersection and used a pan-tilt-roll motor configuration. This configuration will be looked at
again, but so too will shoulder mounting and alternative motor configurations such as roll-pan-tilt, to allow the wearable to lean out of the way of the user's head. The active camera will be used to seek out objects and features to keep the SLAM system running and to locate objects of interest to the user, somewhat independently of whatever the user is doing.

The second stage will involve integrating into the system a camera that is entirely under the control of the user. This camera will be mounted on the back of a tablet PC and will be used to look at the surroundings and present an augmented view to the user, highlighting the locations of objects and presenting useful information about them. As the user cannot be forced to move the camera in a way that benefits the SLAM system, and will most likely move it too quickly or jerkily, or obscure its view, tracking and localization will utilise both cameras. The active camera will perform detailed tracking and mapping, and the tablet camera will track and map while it can; when it loses track it will attempt to recover using two different methods. The first, developed by Williams [40] in the Active Vision lab, uses a RANSAC-style approach with a three-point pose algorithm [12] to match features in the current image to those in the map. The other, to be developed, will relocalize the camera by detecting mapped objects in the current image, calculating the camera's position relative to them, and then locating the other visible mapped features to confirm the camera's position.

Figure 5.1: A sketch of the wearable device shown mounted on a user, along with an augmented view of the world.
Towards the end of the research period, a miniaturised version of the final active camera assembly will be created and mounted on a user for use as a wearable robotic assistant. The user-controlled camera will also be mounted on the back of a tablet PC, allowing the fully wearable system to be tested.
5.2 Using object geometry

The second principal theme in the proposed research programme is the further exploitation of object geometry in SLAM. There are a number of ways in which the approach shown in Chapter 4 can be developed. On the more geometrical side, provided care is taken when specifying the allowed acceleration noise in the EKF, it is possible to initialize the monocular SLAM (particularly that using inverse depth [27]) without calibrating the scale, and then to allow identified objects to resolve the depth/speed scaling ambiguity. When the database is of a realistic size, it seems unlikely that any object identifier will be able to run sufficiently quickly, particularly if it wastes time re-identifying areas of the image already examined. This can be offset by searching for further objects intelligently, examining only unchecked areas. Re-detecting an object is also quicker, since only the area of the image and the scale at which it is expected need checking. Once relocalization is achieved using objects, loop closure using objects will also be obtainable: when a previously detected object is found in a location that differs from where it was seen before, the system can attempt loop closure by checking that other features around the object match the previously matched features. This will allow multiple copies of an object to exist, while also closing any loops. Another area for development is the addition of objects to the database during motion, with their size determined from what is already known about the world. Planar objects will be added by the user selecting the four corners of the desired object, and the wearable then rectifying this image and adding it to the database in the manner described in chapter
4. If the newly added object later comes closer and more fronto-parallel to the camera, the system can recapture the image of the object and update the database with this improved image.
5.3 Using object labels

The third principal theme is to explore the use of object labels for wearables. The focus of the work in chapters 3 and 4 has been on the geometrical benefits of recognition, and little has been said of the value to the wearer of the camera of having meaningful labels available. By adding auditory feedback to the present system we intend to explore guiding the wearer of an active camera, to help direct them to locations or to move them in ways that help the mapping process. Information about how likely an object is to move can also be added to the object database. This is useful because not all objects will be fixed and immovable, and if an object is moved and then re-detected in a different place it would ruin the SLAM map by propagating its error through to the rest of the system. It is currently envisaged that there will be several classes of object, such as 'fixed', 'mobile' and 'semi-mobile'; examples of these classes are a poster, a book, and a monitor respectively. The wearable would then have a higher preference for re-measuring fixed objects, to improve the map, and mobile objects that are near the user, to keep the user informed of their location in case they are moved.
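One possible shape for such an extended database record is sketched below. Every name here is hypothetical, since this part of the system has not yet been designed, and the scoring rule is only a toy illustration of the stated remeasurement preference.

```python
from dataclasses import dataclass
from enum import Enum
import numpy as np

class Mobility(Enum):
    FIXED = "fixed"              # e.g. a poster
    MOBILE = "mobile"            # e.g. a book
    SEMI_MOBILE = "semi-mobile"  # e.g. a monitor

@dataclass
class ObjectEntry:
    """Hypothetical extension of a database entry with a mobility class."""
    name: str
    descriptors: np.ndarray      # SIFT descriptors
    keypoints: np.ndarray        # their image positions x_o
    H_o: np.ndarray              # scene-to-image homography
    boundary_points: np.ndarray  # object extent, three flagged for the map
    mobility: Mobility = Mobility.FIXED

def remeasure_priority(entry: ObjectEntry, near_user: bool) -> int:
    """Toy scoring: prefer fixed objects (good for map maintenance) and any
    mobile object near the user (so they can be told if it has moved)."""
    if entry.mobility is Mobility.FIXED:
        return 2
    return 2 if near_user else 1
```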
5.4 Planning

The above work is intended to be done in four stages, as detailed in Fig. 5.2. To begin with, the wearable test platform will be designed and built, and its control code written and integrated into the SLAM framework. Then the object geometry research will be carried out, to allow recovery and loop closure, manual addition of objects to the database, and scaling of the map using the geometry of detected objects. After that the use of object labels will be explored, to provide auditory feedback to the user and to add object motion information giving the user relevant information about their workspace. During this work the construction of the final wearable will
begin and be completed in tandem. It is my intention to write and refine drafts of the thesis chapters throughout the research period, so that four months to gather the final results and complete the thesis should be sufficient.

Figure 5.2: Gantt chart of the research plan: wearable design through 2006 and 2007; object geometry and object labels during 2007, with object labels continuing into 2008; and construction of the final wearable and writing up of the thesis in 2008.
6 Conclusions

In this report we have reviewed the literature required to build an autonomous wearable robotic assistant that can provide assistance by identifying and tracking objects in the world and informing the user about them. Feature detectors and descriptors that enable planar objects to be reliably detected anywhere in the scene were reviewed, and Lowe's SIFT was chosen for further examination. To enable the wearable to localize itself and map the world, the SLAM literature was reviewed with a focus on the EKF, as this is used in Davison's single camera SLAM system on which our work is founded. The single camera SLAM system enables a freely moving camera to localize itself in the world using point features and to create a map of these features. Finally, previous wearable system designs were reviewed, along with Mayol-Cuevas' work on wearable placement when using an active camera.

SIFT was then studied further for its use as an object detector, its main limitation, its running time, was analysed, and some possible ways of reducing that time were discussed. Davison's single camera SLAM system was then used as a foundation for integrating planar object detection using SIFT into the SLAM framework. A system was developed that can recognise planar objects stored in a database when they appear in an image. It uses the stored geometry of an object to localize the object in 3D relative to the camera and to add the 3D measurements into the map for updating by an EKF. The measurements entered into the SLAM
map are sparse, robust to partial occlusion, less likely to be incorrect through mismatching, and more accurate. The system is able to reliably detect a set of known objects and faithfully recover their spatial geometry.

This addition of object detection to the SLAM system forms the foundation of the work planned for the remaining two years of the DPhil. To begin with, an active camera platform will be constructed and its control software developed and integrated into the SLAM system, followed by the addition and integration of a second, user-controlled camera. Then knowledge of object geometry will be used to aid recovery from tracking failure and loop closure. The ability for users to add objects to the database in real time will be provided, along with information on how likely an object is to move, to aid the mapping process. Finally, the use of audio feedback to guide the user will be explored. At the end of the research an autonomous wearable robotic assistant that can aid the user by locating objects and directing the user to them will be realised.
A Courses and seminars attended

A.1 Summer Schools

I have attended the following summer schools related to my research:
• British Machine Vision Association Summer School, Kingston University, London, July 2006 (6 days).
• International SLAM Summer School, University of Oxford, August 2006 (4 days).
A.2 Courses

During the past year I have attended the following courses in the Engineering Science department at the University of Oxford:
• Computational Geometry; David Murray, Michaelmas Term 2005.
• Mobile Robotics; Paul Newman, Michaelmas Term 2005.
• Computer Vision; David Murray and Ian Reid, Michaelmas Term 2005.
• Estimation and Inference; Andrew Zisserman, Hilary Term 2006.
A.3 Robotics Research Group Seminars

I have attended the weekly seminars in the Robotics Research Group (http://www.robots.ox.ac.uk/seminars/).
A.4 Computer Vision Reading Group

I have attended the computer vision reading groups (http://www.robots.ox.ac.uk/~cvrg/).
A.5 Active Vision Reading Groups

I have attended the Active Vision weekly reading groups and presented the following papers at two of them:
• T. Okuma, T. Kurata, and K. Sakaue. A natural feature-based 3D object tracking method for wearable augmented reality. In Proc. 8th IEEE International Workshop on Advanced Motion Control (AMC'04), pages 451–456, 2004. (Hilary term 2006).
• Y. Ke and R. Sukthankar. PCA-SIFT: A more distinctive representation for local image descriptors. In Proc. 22nd IEEE Conf. on Computer Vision and Pattern Recognition, Washington DC, 2004, pages 506–513, 2004. (Trinity term 2006).
Bibliography [1] H. Aoki, B. Schiele, and A. Pentland. Realtime personal positioning system for a wearable computers. In Proc 3rd IEEE Int Symp on Wearable Computing, San Francisco CA, Oct 18-19, 1999, pages 37–43, 1999. [2] M. Brown and D. Lowe. Invariant features from interest point groups. In Proc 14th British Machine Vision Conference, Norwich, Sept 2003, pages 656–665, 2002. [3] R. O. Castle, D. J. Gawley, G. Klein, and D. W. Murray. Towards simultaneous recognition, localization and mapping for hand-held and wearable cameras. Submitted to Int Conf on Robotics and Automation, 2007. [4] B. Clarkson and A. Pentland. Unsupervised clustering of ambulatory audio and video. In Proc IEEE Int Conf on Acoustics, Speech and Signal Processing, Phoenix AZ, 1999, volume VI, pages 3037–3040, 1999. [5] A. J. Davison. Real-time simultaneous localisation and mapping with a single camera. In Proc 9th Int Conf on Computer Vision, Nice France, Oct 13-16, 2003, pages 1403–1410, 2003. [6] A. J. Davison and D. W. Murray. Mobile robot localisation using active vision. In Proc 5th European Conf on Computer Vision, Freiburg, volume 1407 of LNCS, pages II:809–825. Springer, 1998. [7] A. J. Davison and D. W. Murray. Simultaneous localization and map-building using active vision. IEEE Transactions on Pattern Analisis and Machine Intelligence, 24(7):865–880, 2002. [8] M.W.M.G. Dissanayake, P.M. Newman, S. Clark, H.F. Durrant-Whyte, and M. Csorba. A solution to the simultaneous localization and map building (SLAM) problem. IEEE Transactions on Robotics and Automation, 17(3):229–241, 2001. [9] J. Farringdon and Y. Oni. Visual augmented memory. In Proc 4th IEEE Int Symp on Wearable Computing, Atlanta GA, Oct 16-17, 2000, pages 167–168, 2000. [10] M. A. Fischler and R. C. Bolles. RANdom SAmple Consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24(6):381–395, 1981. [11] J.H. Friedman, J.L. Bentley, and R.A. Finkel. An algorithm for finding best matches in logarithmic expected time. ACM Transactions on Mathematical Software, 3(3):209–226, 1977. [12] R. M. Haralick, C. Lee, K. Ottenberg, and M. Nolle. Review and analysis of solutions of the three point perspective problem. International Journal of Computer Vision, 13(3):91–110, 1994. [13] C. Harris and M. Stephens. A combined corner and edge detector. In Proc. ALVEY Vision Conference, Univ. of Manchester, UK, pages 147–151, 1988.
[14] R. I. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, ISBN: 0521540518, second edition, 2004. [15] R.E. Kalman. A new approach to linear filtering and prediction problems. Transactions of the ASME–Journal of Basic Engineering, 82(Series D):35–45, 1960. [16] T. Kawamura, Y. Kono, and M. Kidode. Nice2CU: managing a person’s augmented memory. In Proc 7th IEEE Int Symp on Wearable Computing, White Plains NY, Oct 21-23, 2003, 2003. [17] T. Kurata, N. Sakata, M. Kourogi, H. Kuzuoku, and M. Billinghurst. Remote collaboration using a shoulder-worn active camera/laser. In Proc 8th IEEE Int Symp on Wearable Computing, Arlington VA, Oct 31 - Nov 3, 2004, pages 62–69, 2004. [18] V. Lepetit and P. Fua. Keypoint recognition using randomized trees. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(9):1465–1479, 2006. [19] T. Lindeberg. Feature detection with automatic scale selection. International Journal of Computer Vision, 30(2):77–116, 1998. [20] D. G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110, 2004. [21] D. Marr and E. Hildreth. Theory of edge detection. Proceedings of the Royal Society of London. Series B, Biological Sciences, 207(1167):187–217, feb 1980. [22] J. Matas, O. Chum, U. Martin, and T. Pajdla. Robust wide baseline stereo from maximally stable extremal regions. In Proc 14th British Machine Vision Conference, Norwich, Sept 2003, volume 1, pages 384–393, 2002. [23] W. W. Mayol-Cuevas. Wearable Visual Robots. PhD thesis, University of Oxford, Dept of Engineering Science, 2004. [24] K. Mikolajczyk and C. Schmid. Scale and affine invariant interest point detectors. International Journal of Computer Vision, 60(1):63–86, 2004. [25] K. Mikolajczyk and C. Schmid. A performance evaluation of local descriptors. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(10):1615–1630, 2005. [26] M. Montemerlo, S. Thrun, D. Koller, and B. Wegbreit. FastSLAM 2.0: An improved particle filtering algorithm for simultaneous localization and mapping that provably converges. In Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence, 2003. [27] J. M. M. Montiel, J. Civera, and A. J. Davison. Unified inverse depth parametrization for monocular SLAM. In Proc Conf on Robotics: Science and Systems, Philadelphia PA, Aug 16-19, 2006. [28] Microsoft Research. Sensecam — last accessed 10/10/2006. microsoft.com/sendev/project sensecam.aspx.
http://research.
[29] E. Rosten and T. Drummond. Fusing points and lines for high performance tracking. In Proc 10th Int Conf on Computer Vision, Beijing China , Oct 15-21, 2005, volume 2, pages 1508– 1511, October 2005. [30] E. Rosten and T. Drummond. Machine learning for high-speed corner detection. In Proc 9th European Conf on Computer Vision, Graz, May 7-13, May 2006.
[31] J. Shi and C. Tomasi. Good features to track. In Proc IEEE Conf on Computer Vision and Pattern Recognition, Seattle WA, June 21-23, 1994, pages 593–600, 1994. [32] P. Smith, I. Reid, and A. J. Davison. Real-time monocular SLAM with straight lines. In Proc 15th British Machine Vision Conference, Edinburgh, Sept 2006, 2006. [33] R.C. Smith and P. Cheeseman. On the representation and estimation of spatial uncertainty. International Journal of Robotics Research, 5(4):56–68, 1986. [34] T. Starner, J. Weaver, and A. Pentland. Real-time American Sign Language recognition using desk and wearable computer based video. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(12):1371–1375, 1998. [35] S. Thrun, W. Burgard, and D. Fox. Probabilistic Robotics (Intelligent Robotics and Autonomous Agents). The MIT Press, September 2005. [36] S. Thrun, Y. Liu, D. Koller, A. Ng, Z. Ghahramani, and H. Durrant-Whyte. Simultaneous localization and mapping with sparse extended information filters. International Journal of Robotics Research, 23(7-8):693–716, 2004. [37] S. Thrun, M. Montemerlo, D. Koller, B. Wegbreit, J. Nieto, and E. Nebot. FastSLAM: An efficient solution to the simultaneous localization and mapping problem with unknown data association. Journal of Machine Learning Research, 2004. [38] A. Vardy, J. Robinson, and L. Cheng. The WristCam as input device. In Proc 3rd IEEE Int Symp on Wearable Computing, San Francisco CA, Oct 18-19, 1999, pages 199–202, 1999. [39] M. Walter, R. Eustice, and J. Leonard. A provably consistent method for imposing exact sparsity in feature-based SLAM information filters. In Proceedings of the 12th International Symposium of Robotics Research (ISRR), San Francisco, CA, USA, October 2005. [40] B. Williams, I. Reid, and P. Smith. Automatic relocalisation for a single-camera simultaneous localisation and mapping system. Submitted to Int Conf on Robotics and Automation, 2007.