2011 IEEE/RSJ International Conference on Intelligent Robots and Systems, September 25-30, 2011, San Francisco, CA, USA

Novelty Detection Using Growing Neural Gas for Visuo-Spatial Memory

Dmitry Kit, Brian Sullivan, and Dana Ballard

D. Kit is a Ph.D. candidate in Computer Sciences at The University of Texas at Austin, Austin, TX 78712, USA, [email protected]
B. Sullivan is a Ph.D. candidate in Psychology at The University of Texas at Austin, Austin, TX 78712, USA, [email protected]
D. Ballard is a professor of Computer Sciences at The University of Texas at Austin, Austin, TX 78712, USA, [email protected]

Abstract— Detecting visual changes in environments is an important computation with many applications in robotics and computer vision. Security cameras, remotely operated vehicles, and sentry robots could all benefit from robust change detection capability. We conjecture that if one has a mobile camera system the number of visual scenes that are experienced is limited (compared to the space of all possible scenes) and that the scenes do not frequently undergo major changes between observations. These assumptions can be exploited to ease the task of change detection and reduce the computational complexity of processing visual information by utilizing memory to store previous computations. We demonstrate a method to learn the distribution of visual features in an environment via a self-organizing map. Additionally, the spatial distribution of these features can be learned if a positional signal is available. Our method uses a low dimensional representation of visual features to rapidly detect changes in current visual inputs. The model encodes spatially-distributed color histograms of real world visual scenes captured by a camera moved through an environment. The distribution of the color histograms is learned using a self-organizing map with location (when available) and color data. We present tests of the model on detecting changes in an indoor environment.

I. INTRODUCTION

The amount of visual input that is available in the environment is far greater than an organism's ability to parse it. Kuipers describes this as "drinking from the firehose of experience" [1]. Robots with a visuo-motor system interacting in dynamic, complex environments must deal with the problem of keeping useful and up-to-date visuo-spatial representations. A robot may explore an environment and need to remember many disparate things about the scenes it experiences. It has been suggested that to deal with such complexity, inputs must be selectively parsed, allocating resources to dynamic and/or task-relevant portions of the input [2]. We propose a computationally efficient approach that can be used to solve the change detection problem (i.e. when and how should the internal model be updated) by using image and, when available, spatial information over time. We utilize color histograms for a representation that is both fast to calculate and tolerant to small view changes (section III). To learn the distribution of visual features, a variant of self-organizing maps was used (section III-B). Our change detection mechanism is designed to act as a low-level mechanism for alerting higher-level algorithms (i.e. more computationally expensive processes, such as object recognition) about changes in the environment.


For example, sentry robots with such a mechanism could reduce power usage by requiring expensive processing only when something is out of place. This mechanism could also allow a human operator to monitor many autonomous vehicles at once, giving alerts to the operator only when a change in the environment is detected. We first present background material (section II) on prior research addressing change detection and the construction of spatial maps. Next we present an overview of our model's architecture. In Experiment 1 (section IV), we contrast our approach with a current spatial mapping algorithm and demonstrate the feasibility of using histograms in a visual representation. In Experiment 2 we demonstrate how using a visuo-spatial map (created from positional markers and histogrammed image data) can provide enhanced change detection.

II. BACKGROUND

Understanding how to represent and form expectations about spatial and visual aspects of an environment, and how to detect changes thereof, has been studied in both computer vision and robotics. The first section presents change detection research in the field of computer vision and the second presents previous work in the robotics area of simultaneous localization and mapping.

A. Change Detection

Due to its applications in robotics and image processing, change detection in video images is an active research area. One approach to detecting changes is to generate a model of a scene and use image subtraction to detect deviations. Stauffer and Grimson [3] proposed modeling every pixel, and every color channel, using a mixture of Gaussians. This mixture is updated over time to deal with changing viewing conditions. This approach leaves the model unable to function in noisy conditions (e.g. a shaking camera) and is most successful in fixed-camera surveillance (e.g. [4]). Another example of background subtraction can be found in [5], which combined several algorithms to make a Sony AIBO robot follow a person. The background subtraction algorithm was initialized with a static viewpoint. The model was able to identify moving pixels (i.e. people), which were used by a dedicated people-tracking algorithm. However, using pixel-level methods to model space for a movable camera is often costly and susceptible to noise, since to improve performance a large number of views must be stored with their associated Gaussian models.
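To make the per-pixel mixture-of-Gaussians idea concrete, the following is a minimal sketch using OpenCV's built-in MOG2 background subtractor. It illustrates the general technique discussed above rather than the exact model of [3]; the file name, parameter values, and alert threshold are illustrative assumptions.

```python
import cv2

# Minimal sketch of per-pixel mixture-of-Gaussians background subtraction,
# in the spirit of Stauffer and Grimson [3]; parameters are illustrative.
cap = cv2.VideoCapture("surveillance.avi")  # hypothetical input file
subtractor = cv2.createBackgroundSubtractorMOG2(
    history=500, varThreshold=16, detectShadows=False)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    # Pixels poorly explained by their per-pixel Gaussian mixture are
    # flagged as foreground (i.e. candidate changes).
    fg_mask = subtractor.apply(frame)
    changed_fraction = (fg_mask > 0).mean()
    if changed_fraction > 0.01:  # ad hoc alert threshold
        print("possible change detected, fraction =", changed_fraction)
cap.release()
```

As the text notes, this kind of model works best with a fixed camera; a moving camera makes nearly every pixel "foreground".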


Pickup and Zisserman [6] created an algorithm for detecting continuity errors in films. Their design leveraged knowledge of how movies are edited and the fact that scenes often contain people. Frames were clustered via color histograms and registered with each other based on sequence order. Pixels containing changes not due to a person were classified as changed. While successful, this approach relies on specific problem constraints and is not a general-purpose solution for mobile cameras. Boiman and Irani's [7] work treated the process of background modeling like a jigsaw puzzle and used prior knowledge of spatial configurations. Training examples provided the model with the available 'puzzle pieces' and knowledge of their likely spatial layout. The learned puzzle pieces were then used to reconstruct novel images. Portions of the images not explained by these templates were considered part of a change signal. While this method performs well, its running time is linear in the number of templates used, and the number of templates grows quite rapidly. Itti and Baldi [8] have proposed a Bayesian approach for detecting deviations in visual representations. Their model calculates salience (i.e. contrast within a visual feature channel) across several image features over space and time. They measured deviations in the recent input to these channels via the Kullback-Leibler divergence (KLD) between the prior distribution of features and the posterior distribution of current features. Their methodology has had some success predicting human eye movement behavior, but is limited to short time scales and does not incorporate long-term visual or spatial knowledge. The solutions above achieve their success by either exploiting problem constraints or using a fixed camera. Our approach provides a general mechanism for change detection that a mobile agent can use to simplify its interaction with a 3-D environment. Unlike the algorithm of Stauffer and Grimson [3], our model attempts to detect changes that were made to the visual scene in the agent's absence, when the agent revisits that location. Our work is also complementary to Boiman and Irani [7], as our use of spatial context provides a limit on the size of the database at any particular location. While the model described in [7] is closest to our solution, it is mainly useful for detection of novel pixels. It is not clear if their approach would detect a familiar object moved from one location to another. Furthermore, as more changes are experienced the database will grow and the model's ability to detect changes will decrease. Unlike prior research, we are most interested in change detection from an embodied perspective, i.e. data from a camera attached to a mobile robot or a human. We propose that in such situations change detection can be aided by the assumptions that 1) autonomous agents spend most of their time in specific environments performing specific tasks, 2) they can therefore afford to learn specific models of these environments' visual and spatial content over time, and 3) the representations they use must have a high degree of view invariance. To these ends, we captured first-person perspective movies in Experiment 1 and tested whether color histograms are an adequate representation for change detection.

In Experiment 2, spatial data from the camera's paths was also collected and the video was analyzed for changes.

B. Spatial Maps

As stated above, we want our system to use a spatial representation to index visual data and allow improved change detection. Here we review prior work on building spatial representations and how such systems process visual data. Creating spatial maps of environments falls into the category of the simultaneous localization and mapping (SLAM) problem. Kuipers et al. [9] proposed a mechanism called the Spatial Semantic Hierarchy (SSH), which groups local sensor measurements into abstract concepts that can be used to generate global maps. Recently, Milford and Wyeth [10] developed a biologically inspired model (RatSLAM) that can generate these maps by using visual information to extract relative motion estimates. Although these algorithms create a model of the environment, the resulting models are not well suited for change detection. Locating oneself in the environment requires a representation and deviation measure that use summary statistics as signatures for locations. In both SSH and RatSLAM, the representations may not contain the level of detail required for change detection. Experiment 1 will use RatSLAM to highlight this distinction, by replacing its visual representation with our own. While SLAM algorithms do not maintain visual features rich enough for fine-scale change detection, the spatial map they generate is an extremely useful tool that can be used as an index to enhance image representations. By modifying our change detection algorithm to contain spatial knowledge, we can test whether a new piece of image data deviates not only from prior image statistics in general, but also from the image statistics around a specific location, as demonstrated in Experiment 2.

III. MODEL OVERVIEW

The central concept behind our representation is that the spatial data structure should be view-invariant but still retain enough fidelity for change detection. We describe scenes with color histograms, which are easy to calculate and are insensitive to small view changes [11]. An online algorithm determines how to distribute finite resources to provide a summary of the previously observed histograms. While color histograms do not capture spatial features, color histogram information can be backprojected into a scene to highlight regions containing a given color (see Alg. 2). Additionally, our model is modular and could use other image feature sets in applications where color histograms are insufficient. The current model consists of first acquiring image and (if available) spatial data, calculating color histograms from the images, and finally training a self-organizing network on the histogram and spatial data. Once trained, the network may be queried to provide a predicted histogram that can then be compared against the current true image histogram to detect changes. This section details each of the model's components.
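For concreteness, below is a minimal sketch of the standard growing neural gas algorithm of Fritzke [15], the kind of self-organizing network the overview describes being trained on histogram (and, when available, location) vectors. The class name, hyperparameter names, and values are illustrative assumptions, not the paper's exact variant or settings.

```python
import numpy as np

class GrowingNeuralGas:
    """Minimal sketch of standard GNG (Fritzke [15]); hyperparameters are
    illustrative, not the paper's settings. Inputs x are 1-D numpy arrays."""

    def __init__(self, dim, max_units=500, eps_b=0.05, eps_n=0.006,
                 age_max=50, insert_every=100, alpha=0.5, decay=0.995):
        rng = np.random.default_rng(0)
        self.units = [rng.random(dim), rng.random(dim)]  # unit weight vectors
        self.error = [0.0, 0.0]
        self.edges = {}  # {frozenset({i, j}): age}
        self.max_units, self.eps_b, self.eps_n = max_units, eps_b, eps_n
        self.age_max, self.insert_every = age_max, insert_every
        self.alpha, self.decay, self.n_seen = alpha, decay, 0

    def _neighbors(self, i):
        return [j for e in self.edges for j in e if i in e and j != i]

    def fit_one(self, x):
        self.n_seen += 1
        dists = [np.linalg.norm(x - w) for w in self.units]
        s1, s2 = map(int, np.argsort(dists)[:2])          # winner and runner-up
        for e in self.edges:                               # age edges at the winner
            if s1 in e:
                self.edges[e] += 1
        self.error[s1] += dists[s1] ** 2
        self.units[s1] += self.eps_b * (x - self.units[s1])   # move winner ...
        for j in self._neighbors(s1):                           # ... and neighbors
            self.units[j] += self.eps_n * (x - self.units[j])
        self.edges[frozenset((s1, s2))] = 0                # (re)connect s1 and s2
        # Prune old edges (removal of isolated units omitted for brevity).
        self.edges = {e: a for e, a in self.edges.items() if a <= self.age_max}
        # Periodically insert a unit between the worst-explained unit and
        # its worst-explained neighbor.
        if self.n_seen % self.insert_every == 0 and len(self.units) < self.max_units:
            q = int(np.argmax(self.error))
            nbrs = self._neighbors(q)
            if nbrs:
                f = max(nbrs, key=lambda j: self.error[j])
                self.units.append(0.5 * (self.units[q] + self.units[f]))
                r = len(self.units) - 1
                self.edges.pop(frozenset((q, f)), None)
                self.edges[frozenset((q, r))] = 0
                self.edges[frozenset((f, r))] = 0
                self.error[q] *= self.alpha
                self.error[f] *= self.alpha
                self.error.append(self.error[q])
        self.error = [e * self.decay for e in self.error]  # global error decay
```

Querying such a network with a location, as the model does later, could then amount to finding the unit whose location components are nearest to the query and reading out its histogram components; that query step is our assumption about how the trained map is used, not code from the paper.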


A. Histogram Acquisition

Initially, the original movie image (Fig. 1(a)) was blurred using a 3x3 pixel averaging window. This reduced the number of colors by removing high spatial frequencies, giving objects greater color consistency. Images were then histogrammed using 10x10x10 (R,G,B) bins. The resulting histograms were dominated by the colors of the larger objects in the environment (e.g. the wall color) (Fig. 1(b)). To reduce the influence of these large counts, the log of the histogram bin counts was taken (Fig. 1(c)). Finally, each histogram was reshaped into a single column vector (1000x1).

B. Learning the distribution of histograms

To detect when a change is present, an agent must first have a model of how color is distributed in an environment. In Itti and Baldi's work [8], parameters are learned online over short time scales.

Algorithm 2 (fragment):
3: … τ then
4:   L ← pixels of P that map to Ih
5:   Set all regions in L of size λ or more green
6:   ▷ Updating P
7: end if
8: return P   ▷ P containing green pixels
9: end procedure
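A minimal sketch of the histogram acquisition steps in section III-A above (3x3 averaging blur, 10x10x10 RGB binning, log of bin counts, reshape to a 1000x1 vector), using OpenCV and NumPy. The function and variable names are our own, and using log1p (so that empty bins stay at zero) is our choice; the paper only states that a log of the bin counts was taken.

```python
import cv2
import numpy as np

def acquire_histogram(image_bgr):
    """Compute the log color histogram feature described in section III-A.
    image_bgr: HxWx3 uint8 image (OpenCV's BGR channel order)."""
    # 3x3 pixel averaging window to remove high spatial frequencies.
    blurred = cv2.blur(image_bgr, (3, 3))
    # 10x10x10 (R, G, B) bins over the full 0-255 range.
    hist = cv2.calcHist([blurred], [2, 1, 0], None,
                        [10, 10, 10], [0, 256, 0, 256, 0, 256])
    # Log of bin counts to reduce the dominance of large uniform regions
    # (e.g. wall color); log1p keeps empty bins at zero.
    log_hist = np.log1p(hist)
    # Final representation: a single 1000x1 column vector.
    return log_hist.reshape(-1, 1)
```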

IV. EXPERIMENTS

The first experiment demonstrates change detection using the KLD between the learned distribution of feature histograms and the current input. The second demonstrates the model's ability to learn a compact joint distribution of location and color histograms, and to detect and localize changes.

A. Experiment I: Learning the distribution of visual features

RatSLAM is a robust algorithm for simultaneously localizing an agent and mapping its environment. It does so by building correlations between optical flow and visual features, which are stored in a library and indexed by a unique id. These correlations are stored in a cube (δx, δy, δθ) structure. The model takes advantage of the fact that, with sparse visual input, co-occurrences between specific optical flow and visual experiences are rare and can be used as an identifier of specific locations in an environment. With new data, these correlations are used to create nodes in an experience map. New nodes are created relative to, and connected with, previously created nodes via a comparison function. In the algorithm's final step, the experience map is traversed and distances between linked experiences are shrunk. For this experiment we captured movie data from a mobile camera exploring a complex scene, where changes occurred in some of the films. Additionally, a simulated environment was used to generate images and is described later in this section. The movies were filmed with a Sony HDR-CX150 high definition camera mounted on a rolling platform. The visual features used by RatSLAM consist of the distribution of intensity values across all columns of an image. The images in our experiment were 1440x1080, yielding a feature vector 1440 elements long. This representation works well for cameras moving only in the horizontal plane, as vertical translation and rotation are ignored. The comparison function consists of finding the offset at which the difference between two intensity distributions is minimal. This offset was capped at a maximum value of 9, to limit the algorithm's tendency to overestimate angular rotation. To test the algorithm, 14168 frames were collected by moving the rolling platform with the camera in a figure-eight pattern, passing through a few specific points, 13 times. The results are shown at the top of Fig. 3. These results highlight the tradeoff between knowing where the camera is in the room and knowing whether scene content has changed.
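As a sketch of the visual-template comparison just described (column-wise intensity profiles compared by finding the horizontal offset, capped at 9 columns, that minimizes their difference), the following is an illustrative implementation. It reflects our reading of the description above, not RatSLAM's actual code, and the function names are our own.

```python
import numpy as np

def column_profile(gray):
    """Mean intensity of each image column (a width-long feature vector)."""
    return gray.mean(axis=0)

def profile_distance(p1, p2, max_offset=9):
    """Minimum mean absolute difference between two column profiles over
    horizontal shifts of up to max_offset columns in either direction."""
    best = np.inf
    for off in range(-max_offset, max_offset + 1):
        if off >= 0:
            a, b = p1[off:], p2[:len(p2) - off]
        else:
            a, b = p1[:len(p1) + off], p2[-off:]
        best = min(best, float(np.mean(np.abs(a - b))))
    return best
```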


Fig. 3. Results of running the standard RatSLAM algorithm on the dataset (Top) versus the modified version of RatSLAM that used color histograms for visual templates (Bottom). The standard algorithm generated 1295 visual templates and 2262 experiences, while the modified version created just 40 visual templates and 2503 experiences. The lower number of visual templates demonstrates the view tolerance of color histograms. Unfortunately, this tolerance reduced RatSLAM's ability to discriminate between different locations in the room, which resulted in the positional uncertainty in the lower part of the loop (bottom right vs. top right). The standard RatSLAM representation is sensitive to view changes and provides general features that allow the algorithm to reliably localize itself. Color histograms, on the other hand, do not have this property, but their tolerance to small view changes allows for better change detection.

RatSLAM's pixel intensity representation excels at discriminating viewpoints, as shown by the large number of unique visual templates generated and its superior localization performance. The view invariance of color histograms makes them an inadequate signal for localization. However, unlike RatSLAM's features, color histograms contain image information that can be used to recognize changes from a variety of viewpoints; they do not require a camera that is restricted in its orientation degrees of freedom; and they can be backprojected to localize changes in an input image. These properties yield a more robust representation for change detection. Given the feasibility of using color histograms for change detection, we used a simulated environment (an apartment scene rendered in WorldViz Vizard software) to provide a test bed where numerous trajectories through a complex environment could be recorded with and without visual changes. The paths through the environment were put into training and test categories (i.e. paths without changes and paths with changes). The model was trained on a set of paths without visual changes and then tested on paths with changes. Changes were generated by having 29 arbitrary objects (found in the environment and ranging from virtual floor lamps to a toaster) systematically moved, one at a time, to one of 66 possible locations in the environment. A novel trace through this virtual environment was then used to systematically test the change detection algorithm. Object locations were chosen to be spread out over the apartment, giving a high probability that most would appear in some of the frames during testing. Both the training and testing traces were generated from human trajectories.

After each change, frames where the change was visible were recorded and their color distributions were compared, using the KLD, to the previously stored experience at that location. This value was then compared to the baseline KLD of the corresponding unchanged frame (the black line in Fig. 4). If the value was greater than the baseline, change detection was considered successful. The KLD for the changed frames was greater than the baseline in 89% of 19054 test cases. Fig. 4 plots the KLD for the baseline versus the changed frames and illustrates how the algorithm regularly yields high KLD values for changed frames.

B. Experiment II: Localization of Changes in the Real World

Experiment 1 presented evidence that our technique can produce a suitable change signal. However, once it is determined that a change has occurred, the model should also provide the agent with information about the location of that change, both within the color space and in image coordinates. In this section, we examine the model's ability to do so. The current location of the camera can serve as an index into the histogram space learned by the GNG. Querying the GNG with a location generates a prediction for the local viewpoint's distribution of color histograms. Spatial information provides more specificity to the remembered histograms, thereby increasing change detection performance. We recorded spatial and head orientation information via an optical motion tracker (PhaseSpace Impulse System) from a human subject walking around a research lab with a head-mounted camera (Elmo UN411E) recording at a resolution of 720x480 pixels (see example frames in Fig. 5).
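A minimal sketch of the change signal used in Experiment 1: the KLD between the histogram predicted for the current location and the histogram actually observed, compared against a baseline KLD measured on an unchanged trace. The function names, smoothing constant, and normalization are our assumptions.

```python
import numpy as np

def kld(p_counts, q_counts, eps=1e-6):
    """Kullback-Leibler divergence D(P || Q) between two histograms given as
    bin-count vectors; eps smoothing avoids division by zero."""
    p = np.asarray(p_counts, dtype=float).ravel() + eps
    q = np.asarray(q_counts, dtype=float).ravel() + eps
    p /= p.sum()
    q /= q.sum()
    return float(np.sum(p * np.log(p / q)))

def change_detected(observed_hist, predicted_hist, baseline_kld):
    """Flag a change when the divergence between what is seen and what the
    map predicts for this location exceeds the unchanged-trace baseline."""
    return kld(observed_hist, predicted_hist) > baseline_kld
```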


[Fig. 4 marker labels: "Object Appears", "Object Disappears"]

Fig. 4. The model's ability to detect changes depends on how well it learns the distribution of histograms in the environment. To measure this baseline, the model was applied to a novel data trace not used for training. The baseline (black line) shows that some areas are learned better (low KLD) than others (e.g. frame 600 versus 900). Anything below the baseline is indistinguishable from normal variation, while larger values indicate a detectable change signal. Values for three traces with different changed objects (toaster, flower, and Rubik's cube) are shown. During testing, the environment was changed (one change per execution) and the model was reapplied to the data trace. The plot shows the KLD for three such changes, indicated by the three markers. Each point in the main plot represents the maximum KLD (boxed point in the inlay) of a single interaction with that change. The inlay shows a zoomed-in view of a single interaction. Out of the 19054 possible interactions, 89% had a KLD greater than the baseline.

Six objects were used in this experiment: a blue recycling bin, a red beanbag, a yellow network cable, a green water bottle placed on a chair, a monitor, and a green cable spool. A ∼14 minute recording session (25000 frames) was used to train a self-organizing map (with 497 units), and a ∼6 minute recording session (10918 frames) was used to validate the training. To test the ability of the model to localize changes in images, objects were moved around and more videos were recorded. Instead of using the KLD between histograms as a global change measure, for each histogram bin we calculated the ratio between the actual bin count and the expected bin count generated by the trained GNG (see Alg. 2). Each pixel in the incoming image was labeled by the amount of deviation from expectation for that pixel's color histogram bin. Large contiguous areas (λ = 20 contiguous pixels) of substantial deviation (greater than τ = 75% difference from the expected histogram bin count) were classified as changes and colored green by the model (see far right of Fig. 5). Through visual inspection of the results, it was determined that the algorithm reliably detected the beanbag, network cable, recycling bin, and water bottle. The chair the water bottle was on and the monitor were not detected, because they share colors with the background around the lab and therefore did not produce a significant deviation. From frame to frame the algorithm generates transient change signals that are not coherent across frames, and there were a small number of false positives (e.g. the lid of a cardboard box).
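The per-bin localization step (Alg. 2) can be sketched as follows: compare each bin's actual count with the count predicted by the trained network, backproject bins that deviate by more than τ onto the pixels that fall in them, and keep only connected regions of at least λ pixels. The use of scipy.ndimage for connected components, the exact deviation formula, and the bin-indexing scheme are our assumptions, not the paper's implementation.

```python
import numpy as np
from scipy import ndimage

def localize_changes(image_bgr, actual_counts, expected_counts,
                     tau=0.75, lam=20, bins=10):
    """Return a boolean mask of pixels in regions (>= lam pixels) whose
    color histogram bin deviates from expectation by more than tau."""
    actual = np.asarray(actual_counts, dtype=float).ravel()
    expected = np.asarray(expected_counts, dtype=float).ravel()
    # Relative deviation per bin between observed and predicted counts.
    deviation = np.abs(actual - expected) / (expected + 1e-6)
    suspicious_bins = deviation > tau
    # Map every pixel to its (R, G, B) bin index in the flattened histogram.
    r = (image_bgr[..., 2].astype(int) * bins) // 256
    g = (image_bgr[..., 1].astype(int) * bins) // 256
    b = (image_bgr[..., 0].astype(int) * bins) // 256
    bin_index = (r * bins + g) * bins + b
    pixel_mask = suspicious_bins[bin_index]
    # Keep only large contiguous regions; small blobs are treated as noise.
    labels, _ = ndimage.label(pixel_mask)
    sizes = np.bincount(labels.ravel())
    big = np.flatnonzero(sizes >= lam)
    big = big[big != 0]                       # label 0 is the background
    return np.isin(labels, big)               # e.g. image_bgr[mask] = (0, 255, 0)
```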

Using a temporal filter, providing more training time, and adding image features in addition to color should resolve these problems. A video demonstrating the results of this experiment can be found at http://www.youtube.com/watch?v=4XJW7bbKT5c&hd=1.

V. CONCLUSIONS AND FUTURE WORK

A. Conclusions

The approach described in this paper provides an online, general-purpose and computationally efficient method for learning visuo-spatial representations and solving change detection tasks in 3-D environments. We captured mobile camera training data for the algorithm in Experiment 1 and added spatial data to our representation in Experiment 2. Despite the training and testing data having a fair amount of variance in the viewpoints captured by the camera (e.g. compare Fig. 5(a) and 5(b)), the model successfully detected changes. Additionally, we demonstrated an implementation of an alert that not only detects these changes but also localizes them. Such a mechanism would provide camera systems or autonomous agents with an alert when their internal models are not synchronized with reality, at which point more computationally expensive algorithms could be engaged to detect new or missing objects. For human-monitored robots or camera systems this mechanism could be used as an alert to guide the attention of a human operator.

B. Future Work

This work has demonstrated that a GNG network using color histograms can localize changes, and that when combined with spatial data, change detection can be further enhanced for novel viewpoints.


Fig. 5. (a) An example training image. (b) A frame from the test run, containing visual changes. (c) The change map for the frame in (b): the deviation between the expected value for a specific color and the actual perceived value. The darker the pixel, the larger the deviation. (d) Green pixels representing above-threshold changes overlaid on the changed image. For visual clarity, red circles have been added to indicate the locations of changed objects in the frame.

However, a number of interesting issues remain. To begin with, the model could be improved by not treating each color bin independently. For example, in the presence of a large amount of green, a previously unseen shade of green should not be as surprising as the appearance of a never-before-seen color. It also became apparent during experimentation that the histograms tend to be sparse. Since an environment tends to be dominated by a small set of core colors (e.g. the color of the walls), it might be more appropriate to use a dynamic resolution for the color histograms in order to improve detection performance. Additionally, as shown in Experiment 2, color is not always reliable for detecting changes, and we are considering the use of additional feature sets (e.g. steerable multiscale filters). While we had success using the GNG algorithm, a more modern variant of self-organizing maps, such as ESOINN [16], could remove the need for the offline cache training used in our model. Such algorithms provide better metrics for determining which memory units should be changed and where new units should be placed during training. We also plan to extend the model to build a decision policy for when to alert the user that a change was detected. In the current implementation a static threshold is applied, but a more dynamic, adaptive policy could be implemented. This policy should depend on the amount of time spent in different portions of the environment, as this is related to the variance in the representation captured there. Finally, we plan to modify RatSLAM to work in tandem with this algorithm. In Experiment 2, spatial data was provided by a tracking system, as a GPS system would in some autonomous devices, but in principle a SLAM system could provide this information. One simple modification is to allow RatSLAM to converge to a map of an environment before estimating the joint probability. This joint probability could then be used to add a delete operator to RatSLAM's visual library, since there is currently no way to detect invalid or obsolete visual templates. Furthermore, in the case of positional uncertainty (e.g. a malfunctioning optical flow sensor), this joint probability could be queried for possible locations based on the currently observed histogram.

VI. ACKNOWLEDGMENTS

This work is supported by NSF grant 0932277 and NIH grants R01EY019174 and R01EY05729.

REFERENCES

[1] B. Kuipers, "Drinking from the firehose of experience," Artif. Intell. Med., vol. 44, no. 2, pp. 155–170, 2008.
[2] J. Stober and B. Kuipers, "From Pixels to Policies: A Bootstrapping Agent," in 7th IEEE International Conference on Development and Learning, 2008, pp. 103–108.
[3] C. Stauffer and W. Grimson, "Adaptive background mixture models for real-time tracking," in IEEE Conference on Computer Vision and Pattern Recognition, vol. 2, 1999, pp. 246–252.
[4] M. Piccardi, "Background subtraction techniques: a review," in IEEE International Conference on Systems, Man and Cybernetics, vol. 4, October 2004, pp. 3099–3104. [Online]. Available: http://dx.doi.org/10.1109/ICSMC.2004.1400815
[5] M. Liem, A. Visser, and F. Groen, "A hybrid algorithm for tracking and following people using a robotic dog," in HRI '08: Proceedings of the 3rd ACM/IEEE International Conference on Human Robot Interaction. New York, NY, USA: ACM, 2008, pp. 185–192.
[6] L. Pickup and A. Zisserman, "Automatic retrieval of visual continuity errors in movies," in CIVR '09: Proceedings of the ACM International Conference on Image and Video Retrieval. New York, NY, USA: ACM, 2009, pp. 1–8.
[7] O. Boiman and M. Irani, "Detecting irregularities in images and in video," Int. J. Comput. Vision, vol. 74, no. 1, pp. 17–31, 2007.
[8] L. Itti and P. Baldi, "Bayesian surprise attracts human attention," Vision Research, October 2008. [Online]. Available: http://dx.doi.org/10.1016/j.visres.2008.09.007
[9] B. Kuipers, J. Modayil, P. Beeson, M. MacMahon, and F. Savelli, "Local metrical and global topological maps in the hybrid spatial semantic hierarchy," in IEEE Int. Conf. on Robotics and Automation (ICRA-04), 2004, pp. 4845–4851.
[10] M. Milford and G. Wyeth, "Mapping a suburb with a single camera using a biologically inspired SLAM system," IEEE Transactions on Robotics, vol. 24, no. 5, pp. 1038–1053, Oct. 2008.
[11] M. J. Swain and D. H. Ballard, "Color indexing," International Journal of Computer Vision, vol. 7, pp. 11–32, 1991.
[12] A. Torralba, R. Fergus, and W. T. Freeman, "80 million tiny images: a large dataset for non-parametric object and scene recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, no. 11, pp. 1958–1970, 2008.
[13] T. Martinetz and K. Schulten, "A 'neural-gas' network learns topologies," Artificial Neural Networks, vol. I, pp. 397–402, 1991.
[14] T. Kohonen, "Self-organized formation of topologically correct feature maps," Biological Cybernetics, vol. 43, no. 1, pp. 59–69, January 1982. [Online]. Available: http://dx.doi.org/10.1007/BF00337288
[15] B. Fritzke, "A growing neural gas network learns topologies," in Advances in Neural Information Processing Systems 7. MIT Press, 1995, pp. 625–632.
[16] S. Furao, T. Ogura, and O. Hasegawa, "An enhanced self-organizing incremental neural network for online unsupervised learning," Neural Networks, vol. 20, pp. 893–903, October 2007. [Online]. Available: http://portal.acm.org/citation.cfm?id=1294379.1294676
