A Biologically-Motivated Developmental System Towards Perceptual Awareness in Vehicle-Based Robots

Zhengping Ji ∗, Matthew D. Luciw ∗, Juyang Weng
Dept. of Computer Science and Engineering, Michigan State University
{jizhengp, luciwmat, weng}@cse.msu.edu

Shuqing Zeng, Varsha Sadekar
Electrical Controls and Integration Laboratory, R&D Center, General Motors Inc.
{shuqing.zeng, varsha.sadekar}@gm.com

∗ Both authors contributed equally to this paper. This work is supported in part by General Motors Research and Development.

Abstract

Existing learning networks and architectures are not suited to handle autonomous driving or driver assistance in complex, human-designed environments such as city driving. Developmental learning techniques for such “vehicle-based” robots will be necessary. Motivated by neuroscience, we propose a system with a design based on the criteria of autonomous, open-ended development. The eventual goal is perceptual awareness – a conceptual and symbolic understanding of the sensed environment that can be communicated, developed and refined using a teacher-defined language. In the system proposed here, radars and a camera are integrated to localize nearby objects for further analysis. The attended areas are each transformed into a sparse representation by a layer of developed natural filters analogous to V1. Taking that layer’s response, MILN (Multilayer In-place Learning Network) integrates unsupervised and supervised learning to self-organize efficient representations for recognition of the types of the objects. We trained our system with data from 10 different city and highway road environments, and it compares favorably with other learning algorithms. Results of the comparison show that this system is the only one tested that can fit all the specified criteria of development for a general-purpose learning architecture.

1. Introduction

Due to the DARPA Grand Challenge and Urban Challenge [DARPA, 2007], many systems for autonomous driving have been created and many more are under development. Yet the constraints of the contests have not required local perceptual awareness beyond a classification of safe and non-safe areas. Skilled driving requires a rich understanding of the complex road environment, which contains many signals and cues that visually convey information, such as traffic lights and road signs, and many different types of objects, including other vehicles, pedestrians, and trash cans, to name a few. We argue that an autonomous driving system that is adept at interpreting and understanding human-designed road environments will require human-level perceptual awareness.

Previously, it has been argued that such systems will require a developmental approach [Weng et al., 2001], where a suitable developmental architecture, coupled with a nurturing and challenging environment, as experienced through sensors and effectors, allows mental capabilities and skills to emerge. This challenge therefore has implications beyond advancing the state of the art in autonomous driving. An autonomously developing system should be heavily motivated by studies in developmental psychology and neuroscience, so the challenges in building such a system may also lead to insights about biological mental development.

This paper presents a biologically-motivated system for object detection, learning, and recognition, tested in highway and urban road environments. Its design is motivated by the constraints of large-scale, open-ended development. Natural image processing for general and complex settings is beyond the limit of traditional hand-programmed image processing methods. The high-dimensional, appearance-based, developmental learning method presented here is characteristic of a “non-task specific” approach.

[Figure 1 diagram blocks: Teacher; Teaching Interface; Label Queue; Labels; Camera; Radar(s); Projection & Window Guess; Image Window Extraction; Image Queue; Innate Receptive Fields; Layer One (Derived filters); Layer Two (Sparse representation); Layer Three (Recognition)]

Figure 1: Current system’s architecture. The camera and radars work together to provide a set of image regions, possibly containing nearby objects. The teacher can communicate with the system through an interface and label the objects. A three-layer learning network is used. The first layer encodes each image using localized (small receptive fields), sparse, orientation-selective filters comparable to those in V1. Localized receptive fields will allow spatial attention selection in later versions of this system. The second-layer neurons have a classical receptive field over the entire input image and learn prototypical object features in sparse representation space. Layer-3 links layer-2’s global features with output tokens defined by the teacher.

1.1 Problem definition

Our eventual goal is to enable a vehicle-based agent to develop the ability of perceptual awareness, for applications including intelligent driver assistance and autonomous driving. Perceptual awareness is a conceptual and symbolic understanding of the sensed environment, where the concepts are defined by a common language between the system and the teachers and users. A language can be as simple as a predefined set of tokens or as complex as human spoken languages. Teachers are required to “arrange the experience” of the system so that it learns the language – e.g., a teacher points out sensory examples of a particular conceptual class and the system learns to associate a symbolic token with the sensed class members, even those that have not been exactly sensed before but instead share some common characteristics (e.g., a van can be recognized as a vehicle by the presence of a license plate, wheels and taillights). More complicated perceptual awareness beyond recognition involves abilities like counting and prediction.

The general setup and learning framework is illustrated in Figure 1. The sensors used are a video camera and short- and long-range radars. There is one long-range radar, which scans a horizontal field of 15° with a detection range of up to 150 meters. Four short-range radars cover a 180° scanning area and are able to detect objects within 30 meters. A single camera provides a 45° field of view.

The specific skills to be taught are to localize, identify and communicate the objects that the vehicle will potentially interact with, especially those that might lead to collisions. This is non-trivial to learn, since each camera image contains zero or more objects, each of which may differ from other objects of the same communicative type in position, scale and 2D rotation (affine transformations), as well as in other respects such as 3D rotation or lighting. Overall, most pixels are “background” pixels that do not correspond to any nearby object.

1.2 Requirements of a developmental architecture

A high-level formulation of a developmental system is as a function that maps the current¹ sensory input vector (including internal sensations) to the next effector output vector (including internal effects): y(t + 1) = f(x(t)). Assume x(t) ∈ X and y(t) ∈ Y, where both spaces are typically very high-dimensional, being raw sensory input and raw motor output. The goal of learning is to approximate some underlying function f′ : X → Y using the agent’s mental resources and architecture, where this function is shaped by, e.g., some set of core motivations.

¹ Assume discrete time steps, where the current time step is denoted by t.

A developmental architecture is needed when the learning problem is open-ended. Such a problem may contain tasks that are not known a priori, and it may contain tasks that are not well-defined (there is no guarantee that a dataset contains all situations). If any tasks are not well-defined (“muddy”) or unknown, the problem is considered to be open-ended, meaning that currently unavailable experience, which will occur at some indeterminate set of future times, will be needed to learn the tasks. Open-ended problems require developmental learning architectures.

Some existing and well-known artificial neural networks are the feed-forward networks trained with backpropagation (FBP), those using radial basis functions (RBF), those constructed using the cascade correlation learning architecture (CCLA), the batch and incremental support vector machines (SVM and I-SVM), and the self-organizing maps (SOM). There are many other less well-known networks. None that we know of can meet all the requirements of an architecture for an open-ended developmental system within an autonomous agent (most of them were not designed to).

Non task-specific – A common idea in supervised learning, used by FBP, RBF, CCLA, SVM, and I-SVM, is to use the available data to attempt to optimize the system’s performance (minimize error). Optimization is a “greedy” strategy in the following sense: it discards information that is not useful for accomplishing the current task (minimizing error for a particular set of data), and it only learns when there is an output. An alternative is to use any sensory information to develop the internal model, even when there is no task, in case it may be useful for learning future tasks. The brain utilizes this strategy by selectively coding the environment, as experienced through the sensors. Earlier levels of sensory representation (e.g., LGN or V1) should not be affected as much by the biological agent’s actions (from motor areas) as later areas, such as the object recognition area IT, which projects to and receives projections from the premotor cortex. Along sensorimotor pathways, we postulate there is a spectrum in learning mode from mainly unsupervised (at early levels) to supervised (at later levels). Low-level feature derivation based on the natural environment is “non task-specific” in the sense that the low-level features promote an efficient coding of the environment for any visual task.

Training time and training complexity – A key to any developmental system is that it exists in a closed feedback loop with the environment. Therefore, the architecture must be able to learn from data online and in real time. The combination of high-dimensional input, a large storage resource, and the requirement of online training is prohibitive for incremental algorithms with more than linear training time complexity for each storage element. Cell-centered learning, also called in-place learning [Weng and Zhang, 2006], has the lowest possible time complexity for training.

Model-free learning – A major requirement of developmental learning techniques is automatic generation of internal representation. This is important for complex sensing modalities such as vision, and for complex environments such as street driving, where surroundings are too varied, complex and unpredictable to hand-design a static set of feature detectors.

Teachable – Another key component is the need for a teacher: someone who can “arrange experience” so that the learning system is able to develop skills. In online learning, the performance result can be confirmed or corrected by a teacher in a timely fashion. Online learning also allows a teacher to dynamically determine the training samples according to the system’s current performance. This is critical, because it is impossible to fully predict the performance of a learning system ahead of time during a batch sample collection stage, which makes any batch collection of training data less attractive in practice. This is especially the case when the learning system only occasionally makes errors. A teacher can instead train the system with additional cases in problem areas, so as to improve the system’s performance in those areas.

Local Analysis for Attention Selection – Receptive fields must exist at all possible scales on the image plane (corresponding to the fovea). For attention selection, certain spatial locations are suppressed or enhanced for local analysis. This ability is essential for true generalization (e.g., recognition by parts). It implies that input should be segmented and processed locally, instead of globally or “monolithically”. Networks that process the input in a monolithic fashion cannot perform attention selection, and nearly all existing networks are monolithic. An internal attention selection action (choosing which receptive fields to suppress) will be needed to fully realize this capability.

Theoretically, our system can handle the above criteria, and it is designed with them in mind. For a further discussion of these and the other criteria (e.g., long-term memory, which is also crucial), see [Weng et al., 2007].

2. Design of the Vehicle-Based Developmental System

2.1 Coarse Attention

The combination of the radar and camera sensors is an efficient way to perform coarse attention selection. In uncluttered, relatively open road environments, they detect regions in the image that contain nearby objects. Radars were used to find salient image regions within the larger image, corresponding to possible nearby objects (there are false alarms). A group of target points in 3D world coordinates is detected by the radar sensors. In some cases, several target points refer to the same object. Kalman filtering is applied to the original radar returns to generate the fused target points. We discarded radar returns more than 100 meters in distance ahead or more than eight meters to the right or left.

The radar-centered coordinates are projected into the image reference system using a perspective mapping transformation. Given a 3D radar-returned fused target point, an attention window is created within the image, taking as parameters the expected maximum height (three meters) and expected maximum width (3.8 meters) of the vehicles. Figure 2 shows examples of the “innate” attention window generation. Through the first-stage attention provided by the radar, most of the non-object pixels have been identified.

For each radar window, the attended pixels are extracted as a single image. Each image is normalized in size, in this case to 56 rows and 56 columns. To avoid stretching small images, a radar window that could fit was placed in the upper left corner of the size-normalized image, and the other pixels were set to intensities of zero. These images are used as the input to the learning network. We assume that there is only a single object within each radar window. This innate attention leads to the following advantage: for each image from the camera, the desired output of the network is simplified from an indeterminate number of labels of objects in the large image to a single label for each radar window image.

[Figure 2 panels: (a), (b), (c), (d)]

Figure 2: Examples of images containing radar returned points, which are used to generate attention windows. This figure shows some examples of the different road environments in our dataset.

2.2 Early coding

Information travels along the early part of the visual pathway (from the retina to V1), and the representation of natural signals that develops within V1 is both localized and sparse [Olshausen and Field, 2004]. By localized, we mean that most V1 neurons (the so-called simple cells) respond only when a small area on the fovea is stimulated. Additionally, the neurons are orientation selective, so this area will most likely be a small, oriented slit on the retina. By sparse, we mean that only a few neurons fire for a given stimulus. Sparseness is thought to lead to better storage capacity for associative memories and to improve generalization. In addition to being sparse, V1 filters are also highly redundant and overcomplete, e.g., there is a 25:1 output-to-input ratio in V1 of the cat. As postulated in [Olshausen and Field, 2004], this overcompleteness may serve to make nonlinear problems more linear by mapping the input to a much higher-dimensional space.

[Figure 3 diagram labels: image plane; pixel; stagger distance; neural columns #1, #2, ..., #c, #c + 1, ..., #(rc − c + 1), ..., #rc]

Figure 3: Receptive field boundaries and numbering scheme of neural columns on layer-one.
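The radar gating and size-normalization steps of Section 2.1 can be illustrated with a minimal sketch. Only the numeric limits (100 m ahead, 8 m lateral, 56 × 56 output) come from the text; the function names are hypothetical, and the nearest-neighbor shrinking of windows that do not fit is an assumption of this sketch, since the paper does not state which resampling method was used.

```python
# Hypothetical helper names; only the numeric limits come from the text.
MAX_AHEAD_M = 100.0    # returns farther ahead than this are discarded
MAX_LATERAL_M = 8.0    # returns farther to the left or right are discarded
NORMALIZED_SIZE = 56   # rows and columns of the network input


def keep_radar_return(ahead_m, lateral_m):
    """Return True if a fused radar target lies inside the gated region."""
    return ahead_m <= MAX_AHEAD_M and abs(lateral_m) <= MAX_LATERAL_M


def normalize_window(window, size=NORMALIZED_SIZE):
    """Map a 2D list of intensities to a size x size image.

    A window that fits is copied into the upper-left corner and the
    remaining pixels stay zero, as the text describes. A larger window is
    shrunk with nearest-neighbor sampling (an assumption of this sketch),
    keeping its aspect ratio.
    """
    rows, cols = len(window), len(window[0])
    out = [[0.0] * size for _ in range(size)]
    if rows <= size and cols <= size:
        for r in range(rows):
            for c in range(cols):
                out[r][c] = window[r][c]
        return out
    longest = max(rows, cols)
    new_rows = max(1, rows * size // longest)
    new_cols = max(1, cols * size // longest)
    for r in range(new_rows):
        for c in range(new_cols):
            out[r][c] = window[r * longest // size][c * longest // size]
    return out
```

For example, `normalize_window([[1, 2], [3, 4]])` keeps the four values in the upper-left 2 × 2 corner and leaves every other pixel at zero, so small objects are not stretched.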

How do V1 neurons develop this localized orientation selectivity and a sparse representation? Atick and coworkers [Atick, 1992] proposed that early sensory processing decorrelates inputs to V1 and showed how a whitening filter can accomplish decorrelation. Srinivasan [Srinivasan et al., 1982] proposed that predictive coding on the retina causes this decorrelation: since the signals on the retina are highly spatially correlated, the retinal ganglion cells with center-surround receptive fields can act as a prediction of a central intensity based on surrounding intensities. Given a network that whitens the input, Weng and Zhang [Weng and Zhang, 2006] showed how the two well-known biological mechanisms of Hebbian learning and lateral inhibition led to the development of localized, orientation-selective filters that form a sparse representation of the input, from a set of natural images. That same procedure was used to generate the prototypes for each neural column within this system’s layer-one (see below).

Layer-1: neural columns of natural filters – Within V1, neurons are organized in sets of densely populated, vertical columns. Receptive fields for neurons in each column are very similar: they are distributed closely around a central point on the retina [Dow et al., 1981]. Neighboring columns overlap significantly. The first layer of this system is the visual primitive layer, which is organized into a set of neural columns, each of which contains a set of neurons. Neurons on this layer are initially only partially connected to the pixels of the normalized image window, meaning each has a different initial receptive field. A neuron’s initial receptive field depends on the neural column it belongs to. We used innate (hard-coded) square, overlapping, 16 × 16 initial receptive fields. Figure 3 shows the organization of the initial receptive fields. We used a stagger distance of 8 pixels; therefore, for the 56 × 56 images, there were 36 neural columns in total.

Developing the first layer neurons – We generated the layer-1 derived filters from real-world “natural” images using the LCA algorithm, as was done in [Weng and Zhang, 2006]. The statistics of natural images are representative of the signals we interpret through vision. An overcomplete set of 512 lobe components was developed for a 16 × 16 pixel area. To do so, 16 × 16 pixel patches were incrementally selected from random locations in 13 natural images². For decorrelation similar to what may be done in the retina, the image patch $x$ is pre-whitened by $\hat{x} = Wx$, where $\hat{x}$ is the whitened sample vector and $W = VD$ is the whitening matrix. $V$ is the matrix of principal components, generated from 50,000 16 × 16 natural image patches; it contains each principal component $v_1, v_2, \ldots, v_n$ as a column vector. $D$ is a diagonal matrix whose element at row and column $i$ is $1/\sqrt{\lambda_i}$, where $\lambda_i$ is the eigenvalue of $v_i$. Pre-whitening is necessary for localized filters to develop, due to the statistics of the non-white natural image distribution [Olshausen and Field, 2004]. Figure 4 shows the result after 10,000,000 whitened input samples. The LCA algorithm uses only biologically-plausible, cell-centered mechanisms: Hebbian learning and lateral inhibition. We discarded some neurons with low update (win) totals and kept 431 neurons. Each layer-1 neural column shares these same 431 neurons; therefore layer-1 has 431 × 36 = 15,516 neurons in total.

Layer-1 serves to map the raw pixel representation to a higher-dimensional, sparse-coded space for layer-2. It leads to a sparse representation of inputs, meaning that for a given input, very few neurons will be active. The localized orientation-selective filters are functionally similar to those found in V1.
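The layer-1 geometry can be checked with a short sketch. The image size, receptive field size, stagger distance and filter count are taken from the text; the function names and the row-wise ordering of the columns are assumptions made for illustration.

```python
IMAGE_SIZE = 56           # rows and columns of a normalized radar window
FIELD_SIZE = 16           # side of each square initial receptive field
STAGGER = 8               # offset between neighboring receptive fields
FILTERS_PER_COLUMN = 431  # developed filters shared by every column


def column_origins(image_size=IMAGE_SIZE, field=FIELD_SIZE, stagger=STAGGER):
    """Upper-left (row, col) of every receptive field, in row-wise order."""
    starts = range(0, image_size - field + 1, stagger)
    return [(r, c) for r in starts for c in starts]


def extract_field(image, origin, field=FIELD_SIZE):
    """Flatten the field x field patch of a 2D list that starts at origin."""
    r0, c0 = origin
    return [image[r][c]
            for r in range(r0, r0 + field)
            for c in range(c0, c0 + field)]


origins = column_origins()
layer1_neurons = len(origins) * FILTERS_PER_COLUMN
print(len(origins), layer1_neurons)  # 36 15516
```

Six field positions fit along each axis (offsets 0, 8, ..., 40), which gives the 36 columns and the 15,516-dimensional layer-1 output stated above.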

² From http://www.cis.hut.fi/projects/ica/data/images/ via Helsinki University of Technology.

Figure 4: The developed filters that are learned from natural images, and that were used in each neural column of the proposed system. In this figure, each patch shows the receptive field of a model neuron within a 16 × 16 pixel image patch. The patches are placed in order of highest number of updates, from the upper left, row-wise, to the bottom right.

2.3 Object representation and motor output

Layer-2: recognition area – From V1 to IT, the classical receptive field of neurons becomes larger in each area. Neurons in the second layer of this system have a classical receptive field over the entire 56 × 56 image plane, since each neuron is fully connected to all neurons in all neural columns in layer-one. This layer and layer-3 are meant to be developed in the driving environments. The bottom-up input to this layer is a radar-returned image mapped to the sparse-coded space, so the input dimensionality is 15,516. A limited-resource square grid of c neurons (values tried were 100 and 225) is used in this layer to represent the inputs. However, one neuron is not intended to represent a single object. The neurons self-organize using the MILN algorithm presented in Section 3, and each input is represented by the total population response.

Layer-3: motor area – The motor layer is an extendable layer that allows real-time supervised learning. It is made up of a single neural column. Its neural weights only update when there is top-down input from a teacher. Whichever label the teacher provides sets the output and specifies the single winning neuron, which then updates its weights to the layer-2 neurons currently active (the layer-2 population response). When the teacher provides a new (previously unexperienced) label, a new neuron is added. In this way, classes can be taught without turning off the system. New classes may take advantage of the existing layer-two representations (“soft” invariance) to be learned quickly. When there is no label given by the teacher, the output of this layer is interpreted as the system’s guess of the output token for the queried radar window.
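The extendable motor layer can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the class and method names are hypothetical, the winner's weights are updated with a plain running average of the layer-2 responses, and the guess is taken as the neuron with the largest inner product.

```python
class MotorLayer:
    """Hypothetical sketch of an extendable, teacher-driven output layer."""

    def __init__(self):
        self.labels = []   # one output token per motor neuron
        self.weights = []  # one weight vector over layer-2 per motor neuron
        self.ages = []     # number of updates each motor neuron has received

    def teach(self, label, layer2_response):
        """Supervised step: the teacher's label selects the winning neuron."""
        if label not in self.labels:
            # A previously unexperienced label adds a new neuron on the fly.
            self.labels.append(label)
            self.weights.append([0.0] * len(layer2_response))
            self.ages.append(0)
        j = self.labels.index(label)
        self.ages[j] += 1
        n = self.ages[j]
        # Assumption: running average of the layer-2 population responses.
        self.weights[j] = [((n - 1) * w + z) / n
                           for w, z in zip(self.weights[j], layer2_response)]

    def guess(self, layer2_response):
        """No teacher input: report the token of the best-matching neuron."""
        if not self.labels:
            return None
        scores = [sum(w * z for w, z in zip(ws, layer2_response))
                  for ws in self.weights]
        return self.labels[scores.index(max(scores))]
```

Because `teach` appends a neuron whenever it sees an unknown label, a third class can be introduced at any time without re-initializing the neurons already learned.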

3. Multilayer In-place Learning

Layer-one is a function that transforms an image vector into a higher-dimensional sparse-coded space. Taking this as input is a Multilayer In-place Learning Network (MILN) as formulated in [Weng et al., 2007]. We do not intend to formulate MILN in this paper, but some key elements will be discussed.

The network is called “in-place” since the self-organization of different areas occurs in a cell-centered (local) way. It is a concept motivated by the genomic equivalence principle, by which every cell of an organism shares the same genome, which can be considered as the developmental program that directs development in a cell-centered way. This concept of in-place learning implies that there cannot be a “global”, or multi-cell, goal to the learning, such as the minimization of mean-square error for a pre-collected (batch) set of inputs and outputs. In in-place learning, each neuron learns on its own, as a self-contained entity using its own internal mechanisms. The mechanisms contained in this program and the cell’s experience (stimulation) over time affect the cell’s fate (here: what feature it detects).

As a model of a cortical neuron, each MILN neuron is contained within a particular layer, and has either excitatory or inhibitory feedforward (from a lower layer), horizontal, or feedback (from a higher layer) input. It develops its connections and synapses through the activity of other neurons. The external sensors are considered to be on the bottom (layer 0) and the external motors on the top (layer N). Its synaptic conductance is modeled by three weight vectors, one for each of the three connection types: bottom-up weights $w_b$, lateral (horizontal) weights $w_h$, and top-down weights $w_e$ (e is for efferent). MILN integrates both bottom-up unsupervised and top-down supervised learning modes via the explicit weights. The top-down supervision is only active when the motor output is imposed. In this way, the supervision impacts the organization of earlier layers.

The total activity $z_i$ of neuron $i$ in response to $y$ (afferent activity), $h$ (lateral activity) and $e$ (activity from the next layer) is

$$z_i = g(w_b \cdot y - w_h \cdot h + w_e \cdot e)$$

where $g$ is its nonlinear sigmoidal function (or a piecewise linear approximation).

MILN parameters – In this application, for the $i$-th neuron, we utilize explicit bottom-up and top-down weights, but approximate lateral activity. The response $z_i$ is given as

$$z_i = g_i\left((1 - \alpha_l)\,\frac{w_{b,i}(t) \cdot y(t)}{\|w_{b,i}(t)\|\,\|y(t)\|} + \alpha_l\,\frac{w_{e,i}(t) \cdot e(t)}{\|w_{e,i}(t)\|\,\|e(t)\|}\right)$$

The weight $\alpha_l$ controls how much layer $l$ is influenced by top-down supervision from the next layer versus bottom-up unsupervised learning from the previous layer. Its value lies between 0 and 1 and is layer specific. Here, $\alpha_1 = 0$, $\alpha_2 = 0.3$, and $\alpha_3 = 1$. The values for the “top-down” weights are copied over from the corresponding bottom-up weights. To clarify: the afferent weights to layer $l$ were also used as the top-down weights to layer $l - 1$, by using a neuron’s “fan-out” weights as its top-down weights. The firing rate transfer function $g$ here is a low threshold: $g_i(z_i) = 0$ if $z_i \le \theta_l$, and $g_i(z_i) = z_i$ otherwise, where $\theta_l$ is a layer-specific low-threshold value from 0 to 1. We set $\theta_2 = 0.4$.

Instead of using explicit lateral weights, a winner-take-all approach was used as a computationally efficient approximation of lateral inhibition. We approximate lateral excitation via a 3 × 3 update. The neurons are placed (given positions) in a square grid. A winner neuron causes the neighboring neurons to also update and fire. All non-updating neurons have their response set to zero.
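The response computation of this section, together with the winner update stated later in the section, can be sketched in plain Python. The function names are hypothetical, the grid is assumed to be stored row-wise, neuron ages are assumed to start at 1, and the plasticity term `mu` is left as a caller-supplied number because its definition is given in [Weng and Zhang, 2006] rather than here.

```python
import math


def cosine(w, v):
    """Normalized inner product; 0.0 if either vector has zero length."""
    norm = math.sqrt(sum(a * a for a in w)) * math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(w, v)) / norm if norm > 0 else 0.0


def response(w_b, y, w_e, e, alpha, theta):
    """Thresholded mix of bottom-up and top-down normalized matches."""
    z = (1 - alpha) * cosine(w_b, y) + alpha * cosine(w_e, e)
    return z if z > theta else 0.0


def winners(responses, side):
    """Indices of the top-responding neuron and its 3 x 3 grid neighbors."""
    best = max(range(len(responses)), key=lambda i: responses[i])
    r0, c0 = divmod(best, side)
    return [r * side + c
            for r in range(max(0, r0 - 1), min(side, r0 + 2))
            for c in range(max(0, c0 - 1), min(side, c0 + 2))]


def update_winner(w_b, age, z, y, mu=0.0):
    """Apply w <- beta1 * w + beta2 * z * y and return (new_w, age + 1).

    beta1 = (age - 1 - mu) / age and beta2 = (1 + mu) / age, so the two
    coefficients always sum to one. Ages are assumed to start at 1.
    """
    beta1 = (age - 1 - mu) / age
    beta2 = (1 + mu) / age
    new_w = [beta1 * w + beta2 * z * v for w, v in zip(w_b, y)]
    return new_w, age + 1
```

With `alpha = 0` the response reduces to a thresholded cosine match against the bottom-up input, which corresponds to the unsupervised setting used for layer-1; with `alpha = 1` only the top-down term remains.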

Figure 5: (Left) The set of radar windows for a sequence, in both highway and city driving environments. (Right) The receptive field of the top-responding layer-2 neuron. Note that the size of the right window is less than the size of the left window containing the vehicle. This shows the further receptive field development for that layer-2 neuron.

The winning neuron (max-response) and its neighbors were allowed to fire and update: these are called the winners. For a winner cell $j$, update the weights using the lobe component updating principle, with the neuron’s own internal temporally scheduled plasticity, as

$$w_{b,j}(t) = \beta_1\, w_{b,j}(t-1) + \beta_2\, z_j\, y(t)$$

where the scheduled plasticity is determined by its two age-dependent weights:

$$\beta_1 = \frac{n(j) - 1 - \mu(n(j))}{n(j)}, \qquad \beta_2 = \frac{1 + \mu(n(j))}{n(j)},$$

with $\beta_1 + \beta_2 \equiv 1$. Finally, the cell age $n(j)$ for the winner increments: $n(j) \leftarrow n(j) + 1$. All non-winners keep their ages and weights unchanged. $\mu(n(j))$ is a plasticity function defined in [Weng and Zhang, 2006]. It is called “CCI plasticity” and is formed so that the learning rate for new data $\beta_2$

will never converge to zero as t → ∞. This allows lifetime neuron plasticity.

Table 1: Average performance and comparison of learning methods over 10-fold cross validation for pixel inputs

Learning method | Final # storage elements | Overall accuracy | “Vehicle” accuracy | “Other objects” accuracy | Training time per sample | Test time per sample
NN | 1198 | 80.32% | 72.28% | 98.57% | n/a | 455 ms
I-SVM | 85 | 71.97% | 72.49% | 70.79% | 130 ms | 2.2 ms
IHDR | 1198 | 79.51% | 71.54% | 97.61% | 2.7 ms | 4.7 ms
MILN | 100 (10 × 10) | 84.76% | 87.81% | 83.42% | 17 ms | 8.8 ms

Table 2: Average performance and comparison of learning methods over 10-fold cross validation for sparse coded inputs

Learning method | Final # storage elements | Overall accuracy | “Vehicle” accuracy | “Other objects” accuracy | Training time per sample | Test time per sample
NN | 1198 | 90.49% | 89.69% | 92.31% | n/a | 2273 ms
I-SVM | 86.6 | 77.1% | 75.51% | 80.71% | 330 ms | 7.6 ms
IHDR | 1198 | 85.36% | 86.12% | 83.69% | 12 ms | 22 ms
MILN | 225 (15 × 15) | 86.4% | 89.0% | 80.45% | 110 ms | 43 ms

4. Experiments and results

We used an equipped vehicle to capture many real-world image and radar sequences for training and testing purposes. The dataset³ is composed of 10 different “environments”: stretches of road at different-looking places and times (see Figure 2 for a few examples of different environments). From each environment, several different interesting sequences were extracted. Each sequence contains some similar but not identical images (with variation in viewpoint, illumination and scale), which were captured with a time interval of 0.2 seconds. The challenge for the learning algorithms is to classify each radar window’s contents into one of two classes: vehicles and other objects. There were 928 samples in the vehicle class and 409 samples in the other object class.

Four algorithms were compared to evaluate their potential for open-ended autonomous development, where an efficient (memory-controlled), real-time (incremental and fast), autonomous (the system cannot be turned off to change or adjust it), and extendable (the number of classes can increase) architecture is needed. We tested the following classification methods: nearest neighbor (NN) using an L1 distance metric for baseline performance, incremental SVM (I-SVM) [Cauwenberghs and Poggio, 2001]⁴, Incremental Hierarchical Discriminant Regression (IHDR) [Weng and Hwang, 2007] and MILN [Weng et al., 2007]⁵.

³ http://www.cse.msu.edu/ei/datasets.htm
⁴ Software obtained from http://bach.ece.jhu.edu/pub/gert/svm/incremental
⁵ Both the MATLAB interface to IHDR and the monolithic MILN are available at http://www.cse.msu.edu/ei/software.htm

We used a linear kernel for I-SVM, as is suggested for high-dimensional problems [Hsu et al., 2003]. We did try several settings for a RBF kernel but did not observe as good performance as the linear kernel. We used a “true disjoint” test, where the samples were left in sequence and broken into ten sequential folds. In this case (as opposed to randomly arranging samples), the problem is difficult, since there would be sequences of types of vehicles or objects in the testing fold that were never trained. This truly tests generalization. For all tests, each large image from the camera was 240 rows and 320 columns. Each radar window was size-normalized to 56 by 56 and intensity-normalized to (0, 1). As inputs to all networks were inputs from the non-transformed space (“pixel” space with input dimension of 56×56 = 3136 pixels) versus after transformation into the sparsecoded space (with dimension of 36 × 431 = 15, 516) by layer-one. Our results are summarized in Tables 1 and 2. Nearest neighbor performs fairly well, but is prohibitively slow. IHDR combines the advantage of NN with an automatically developed overlaying tree structure that organizes and clusters the data. It is useful for extremely fast retrievals. So, IHDR is a little worse than NN, but is much faster and can be used in real-time. However, IHDR typically takes a lot of memory. It allows sample merging, but in this case saved every training sample, so it did not use memory efficiently. I-SVM performed the worst with both types of input, but it uses the least memory (in terms of number of support vectors), and the number of support vectors is automatically determined by the data. A major problem with I-SVM is lack of extendability – by only saving samples to make the best two-class decision boundary, it throws out information that may be useful in distinguishing

other classes that could be added later. Of course, SVM is not formulated so that any more than two classes can be added autonomously, while IHDR and MILN, as general purpose regressors, are able to do so. MILN is able to perform better than all other methods for the “pixel” inputs using only a 10 × 10 grid with a top-down supervision parameter of 0.3, over three epochs. MILN is also fairly fast. When comparing the results of the pixel inputs to the sparse-coded inputs, it is apparent that performance improves in the sparse coded space, which follows from what was postulated in [Olshausen and Field, 2004]. We scaled MILN up in size (15 × 15 neurons) since the dimension of the data increased substantially. I-SVM’s size does not increase since it is only concerned with the boundary, where MILN tries to represent the entirety of the sensorimotor manifold well to aid later extension. Overall, MILN does not fail in any criteria, although it is rarely the “best” in any one category, as currently implemented. NN is too slow and I-SVM is not extendable for open-ended development. IHDR has problems using too much memory and does not represent information efficiently and selectively (supervised self-organization). The overall performance on this data showcases the need for attention selection, or local analysis. Our system in Fig. 1 extends MILN to have a localized layer-one. Attention selection capability will allow the system to focus analysis on subparts within the radar windows to improve generalization (e.g., recognize as a vehicle as a combination of license plate, rear window and two taillights). None of the other methods allow local analysis needed for this key upcoming extension. An incremental teaching interface was developed to experiment with the system shown in Fig. 1. 
The teacher could move through the collected images in the order of their sequence, provide a label to each radar window, train the agent with the current labels, or test the agent’s current knowledge. Some examples of results are shown in Fig. 5. Even in this non-parallelized MATLAB version, the speed is close to that needed for real-time use. The average speed of the entire system (not just the algorithm) was 5.95 samples/s for training and 6.32 samples/s for testing.
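
The “true disjoint” ten-fold test described above can be illustrated with a minimal sketch. It is not the authors’ implementation: the function names `sequential_folds` and `disjoint_splits` are hypothetical, and the near-equal fold sizes are an assumption, since the paper does not state how the ten sequential folds were sized. The point it shows is only that samples keep their recorded order, so each testing fold is one contiguous stretch of the sequence.

```python
def sequential_folds(n_samples, n_folds=10):
    """Split indices 0..n_samples-1 into contiguous folds, preserving order.

    Assumption: folds are as equal in size as possible; the first
    (n_samples % n_folds) folds receive one extra sample.
    """
    base, extra = divmod(n_samples, n_folds)
    folds, start = [], 0
    for k in range(n_folds):
        size = base + (1 if k < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def disjoint_splits(n_samples, n_folds=10):
    """Yield (train_idx, test_idx) pairs; each test fold is one contiguous block.

    No shuffling is applied, so a test fold may contain object sequences
    that never appear in the corresponding training set.
    """
    folds = sequential_folds(n_samples, n_folds)
    for k, test_idx in enumerate(folds):
        train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train_idx, test_idx


if __name__ == "__main__":
    # Hypothetical sample count, chosen only for illustration.
    for train_idx, test_idx in disjoint_splits(25, n_folds=10):
        print(len(train_idx), "train /", len(test_idx), "test,",
              "test block starts at", test_idx[0])
```

A shuffled split would instead scatter frames of the same object sequence across training and testing folds, which is the easier setting the text contrasts against.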

5. Conclusions

We now return to the stated goal of perceptual awareness. From raw data, this system incrementally learns to answer one “question”, the type of object within a radar window, with one of two answers: “A vehicle” or “Something else”. However, this is not the limit of the system. The design of the architecture presented here allows growth to learn more answers (conceptual categories). It can also continuously learn new examples without the learning rate ever converging to zero. It remains an open question how to learn different types of “questions”. There is still a long way to go towards true perceptual awareness. Fundamentally new types of learning architectures will be necessary, however, and this paper’s MILN is an initial investigation into this domain.

References

Atick, J. (1992). Could information theory provide an ecological theory of sensory processing? Network, 3:213–251.

Cauwenberghs, G. and Poggio, T. (2001). Incremental and decremental support vector machine learning. In Advances in Neural Information Processing Systems, volume 13, pages 409–415, Cambridge, MA.

DARPA (2007). DARPA urban challenge 2007: Rules. Technical report, DARPA.

Dow, B., Snyder, A., Vautin, R., and Bauer, R. (1981). Magnification factor and receptive field size in foveal striate cortex of the monkey. Exp. Brain Res., 44:213.

Hsu, C., Chang, C., and Lin, C. (2003). A practical guide to support vector classification.

Olshausen, B. and Field, D. (2004). Sparse coding of sensory inputs. Current Opinion in Neurobiology, 14:481–487.

Srinivasan, M., Laughlin, S., and Dubs, A. (1982). Predictive coding: a fresh view of inhibition in the retina. Proc. R. Soc. Lond. B Biol. Sci., 216:427–459.

Weng, J. and Hwang, W. (2007). Incremental hierarchical discriminant regression. IEEE Trans. on Neural Networks, 18(2):397–415.

Weng, J., Lu, H., Luwang, T., and Xue, X. (2007). A multilayer in-place learning network for development of general invariances. International Journal of Humanoid Robotics, 4(2).

Weng, J., McClelland, J., Pentland, A., Sporns, O., Stockman, I., Sur, M., and Thelen, E. (2001). Autonomous mental development by robots and animals. Science, 291(5504):599–600.

Weng, J. and Zhang, N. (2006). Optimal in-place learning and the lobe component analysis. In Proc. World Congress on Computational Intelligence, Vancouver, Canada.