Statistical Mechanisms and Constraints in Perceptual Learning: What can we learn?

by

Melchi M. Michel

Submitted in Partial Fulfillment of the Requirements for the Degree
Doctor of Philosophy

Supervised by Professor Robert A. Jacobs

Department of Brain and Cognitive Sciences
The College of Arts and Sciences
University of Rochester
Rochester, New York

2007


Curriculum Vitae

The author was born in East Meadow, New York on December 5th, 1976. He attended Oswego State University from 2000 to 2002, and graduated in 2002 with a Bachelor of Arts degree in psychology. He came to the University of Rochester in the Fall of 2002 and began graduate studies in Brain and Cognitive Sciences. He pursued his research in computational vision and perceptual learning under the direction of Professor Robert A. Jacobs and received the Master of Arts degree from the University of Rochester in 2005.


Acknowledgments

Thanks first of all to my advisor Robert Jacobs, for his contributions to the work described in Chapters 2 and 3, for his guidance throughout the last five years, and for his generosity and patience in meeting with me virtually any time I had a question or needed feedback on an idea. Thanks also to the other members of my committee, William Merrigan and David Williams, as well as to David Knill for helpful suggestions regarding the studies described in this dissertation. I am grateful to research assistants Anthony Shook and Katie Young for their help in scheduling subjects for the experiments in Chapters 2 and 3, and I am especially grateful to Kathy Corser and Jennifer Gillis who, by handling countless administrative tasks, sheltered me from a torrent of bureaucratic requirements and allowed me to instead focus on completing this work. Finally, I am most thankful to my wife Nicole for her patience, encouragement, and tireless support. This work was supported by NIH research grant R01-EY13149 and by AFOSR grant FA9550-06-1-0492.


ABSTRACT

Visual scientists have shown that people are capable of perceptual learning in a large variety of circumstances. Nonetheless, the mechanisms mediating such learning are poorly understood. How flexible are these mechanisms? How are they constrained? We investigated these questions in two studies of perceptual learning. In both studies, we modeled subjects as observers performing probabilistic perceptual inferences to determine how their use of the available sensory information changed as a result of training. The first study consisted of five experiments examining the mechanisms of perceptual cue acquisition. Subjects were placed in novel environments containing systematic statistical relationships among scene and perceptual variables. These relationships could be either consistent or inconsistent with the types of sensory relationships that occur in natural environments. We found that subjects' learning was biased to favor statistical relationships consistent with those found in natural environments and proposed a new constraint on early perceptual learning to account for these results, defined in terms of Bayesian networks. The second study examined the mechanisms of learning in image-based perceptual discrimination tasks. Previous studies have demonstrated that people can integrate information from multiple perceptual cues in a statistically optimal manner when judging properties of surfaces in a scene. We wanted to determine whether subjects can learn to integrate optimally across arbitrary low-level visual features when making image-based discriminations. To investigate this question, we developed a novel and efficient modification of the classification image technique and conducted two experiments that explored subjects' discrimination strategies using this improved technique. We found that, with practice, subjects modified their decision strategies in a manner consistent with optimal feature combination, giving greater weight to reliable features and less weight to unreliable features. Thus, just as researchers have previously demonstrated that people are sensitive to the reliabilities of conventionally-defined cues when judging the depth or slant of a surface, we demonstrate that they are likewise sensitive to the reliabilities of arbitrary low-level features when learning to make image-based discriminations.


Table of Contents

1 Background
  1.1 Plasticity in the Early Visual System
    1.1.1 Traditional Views
    1.1.2 Psychophysics, Low-Level Stimulus Specificity, and Electrophysiology
  1.2 Probabilistic Perceptual Inference and Learning

2 A Bayesian Network Model of Constraints on Perceptual Learning
  2.1 Introduction
  2.2 Experiment 1: Auditory cue to motion direction
    2.2.1 Methods
    2.2.2 Results
  2.3 Experiment 2: Disparity cue to motion direction
    2.3.1 Methods
    2.3.2 Results
  2.4 Experiment 3: Brightness cue to motion direction
    2.4.1 Methods
    2.4.2 Results
    2.4.3 Discussion
  2.5 Experiment 4: Disparity cue to light source direction
    2.5.1 Methods
    2.5.2 Results
  2.6 Experiment 5: Auditory cue to light source direction
    2.6.1 Methods
    2.6.2 Results
    2.6.3 Discussion and Conclusions

3 Learning Optimal Integration of Arbitrary Features in a Perceptual Discrimination Task
  3.1 Introduction
  3.2 Estimating classification images: a modified approach
  3.3 Experiment 1
    3.3.1 Methods
    3.3.2 Results
  3.4 Experiment 2
    3.4.1 Methods
    3.4.2 Results
  3.5 Discussion

4 Conclusions

References


List of Tables

3.1 Correlation between sensitivity and trial number for individual subjects
3.2 Significance statistics for results displayed in Figure 3.10


List of Figures

2.1 The data for two subjects from the vision-audition test trials. The horizontal axis of each graph gives the direction of the comparison, whereas the vertical axis gives the probability that the subject judged the direction of the comparison as clockwise relative to that of the standard. The data points indicated by circles or crosses are for the trials in which the auditory signal in the standard was offset from vertical by the amount δ or −δ, respectively. The dotted and solid lines are cumulative Normal distributions fit to these data points using a maximum likelihood procedure.

2.2 The data from the vision-audition test trials for all eight subjects combined.

2.3 Data from the motion-disparity test trials for all eight subjects combined.

2.4 Data from the motion-brightness test trials for all eight subjects combined.

2.5 A sample stimulus from Experiment 4. In this example, the bumps are illuminated from the left.

2.6 Data from the shading-disparity test trials for all eight subjects combined.

2.7 Data from the shading-auditory test trials for all eight subjects combined.

2.8 A simple Bayesian network representing a distant barking dog who may or may not have rabies. The variables corresponding to scene properties are located toward the top of this figure, whereas the variables corresponding to percepts are located toward the bottom. Scene variables do not have parents, though they serve as parents to sensory variables as indicated by the arrows.

2.9 A Bayesian network depicting the dependence assumptions underlying perceptual judgements in tasks requiring observers to make a fine discrimination along a simple perceptual dimension.

2.10 A Bayesian network representing the type of modification that might underlie the acquisition of new cue combination rules. Cue 1 represents a cue whose reliability is fixed, while Cue 2 represents a cue that has become less reliable. The solid black curves represent the final conditional cue distributions given a value (or estimated value) of the scene property. The dashed grey curve represents the conditional distribution for Cue 2 before learning.

2.11 A Bayesian network representing the type of modification that might underlie perceptual recalibration. Cue 1 represents an accurate and low-variance cue, whereas Cue 2 represents a cue whose estimates, while low-variance, are no longer accurate. The solid curves represent the final conditional cue distributions for a particular value of the scene variable. The dashed grey curve represents the conditional distribution for Cue 2 before learning.

2.12 A Bayesian network characterizing subjects' modifications of their prior distribution of the light source location in the experiment reported by Adams et al. (2004).

2.13 A Bayesian network representing the statistical relationships studied in Experiments 1-3. The solid black edges represent dependencies that exist in the natural world, whereas dashed grey edges represent dependencies that do not exist in the natural world but that we introduced in our novel experimental environments. We expect that observers started our experiments with the belief that variables connected by a black edge are potentially dependent, whereas variables connected by a grey edge are not.

2.14 A Bayesian network representing the statistical relationships studied in Experiments 4 and 5. The solid black lines represent pre-existing edges—conditional dependencies that exist in the natural world—while the dashed grey lines represent conditional dependencies that do not exist in the natural world but that we introduced in our novel experimental environments.

3.1 An illustrative stimulus set consisting of "fuzzy" square and circle prototypes. From left to right: the square (k + BµA); the circle (k + BµB); the constant image (k), which represents the parts of the image that are invariant across stimuli; and the square-circle difference image (B[µA − µB]).

3.2 Illustrations of the methods described in Equation 3.2 (top) and Equation 3.3 (bottom) for generating noise-corrupted versions of the "fuzzy square" prototype (stimulus A) introduced in Figure 3.1.

3.3 The 20 basis features used to construct the stimuli in Experiments 1 and 2. Each of these images constitutes a column of the matrix B in Equation 3.3. Mixing coefficients µAi for the vector µA representing Prototype A (see Figure 3.4) are indicated above each of the bases (µBi = −µAi). White Gaussian noise (in the subspace spanned by B) is generated by independently sampling the noise coefficients ηi from a common Gaussian distribution.

3.4 The prototypes used in Experiments 1 and 2 presented in the same format as the example stimuli in Figure 3.1. From left to right: prototype A (k + BµA), prototype B (k + BµB), the constant image (k), and the difference image (B[µA − µB] = 2BµA).

3.5 Classification images for each of the three subjects who showed learning in Experiment 1. The first column wobs1 displays the subjects' classification images calculated over the first three sessions; the second column wobs2 displays the classification images calculated over their final three sessions; and the third column wideal displays the optimal template.

3.6 Individual results for all 4 subjects who participated in Experiment 1. The horizontal axis of each plot indicates the trial number, while the vertical axis represents both the subject's discrimination efficiency (solid curve) and template efficiency (dashed curve). The correlation coefficient for the fit between these two measures and the p-value representing the significance of this correlation is indicated at the top of each subject's plot.

3.7 A schematic illustration of the effect of variance structure on the optimal template (red arrows) for a two-dimensional stimulus space. Dashed lines represent contours of equal likelihood (p(x1, x2|Ci) = k) for category A (red) and category B (green). The solid red lines and arrows represent the optimal decision surface and its normal vector (i.e., the template for category A), respectively. (Left) two prototypes embedded in isotropic noise (Σ = I2). (Center) the variance along dimension x2 is greater than that in x1. (Right) the variance along x1 is greater than that in x2.

3.8 Normalized cross-correlation for each of the four subjects in Experiment 2. The plots depict the fits between each subject's classification image (wobs) and the optimal templates for the covariance structure of the noise used in the first (solid lines) and second (dashed lines) halves of the experiment. The change in covariance structure occurred at trial 3601.

3.9 Classification images for each of the four subjects in Experiment 2. The first column displays the optimal template wideal1 calculated for the feature covariance Σ1 used in the first half of the experiment; the second column wobs1 displays the subjects' classification images calculated over the first 12 sessions; the third column wobs2 displays the classification images calculated over their final 12 sessions; and the final column displays the optimal template wideal2 calculated for the feature covariance Σ2 used in the second half of the experiment.

3.10 The differences between the template fits (wfit2 − wfit1) plotted in Figure 3.8 averaged over the first (open bars) and second (closed bars) half of trials in Experiment 2.


Chapter 1

Background

Learning in humans is critical for survival and occurs at a variety of levels of information processing. Some of this learning is explicit. When learning how to solve a word problem, for example, students obtain a set of heuristics and rules that they can recall and use when they encounter similar problems in the future. This type of learning is often contrasted with more implicit forms of learning that do not involve memorizing facts or explicit rules and do not require conscious awareness. One type of implicit learning involves learning motor tasks, such as learning to ride a bicycle. This type of learning differs from explicit forms of learning in that it requires repeated practice, and the learner typically has difficulty describing what has been learned. In this paper we will focus on another type of implicit learning: perceptual learning. Perceptual learning refers to a systematic change in the perception of sensory stimuli that occurs as the result of training or experience with perceptual tasks. It is distinct from more general cognitive forms of learning in that, like motor


learning, it is implicit and the learning often requires a great deal of training. In addition, perceptual learning does not generalize to analogous problems but is typically quite specific to the physical properties of the trained stimulus. Finally, there is a growing amount of psychophysical and neurophysiological evidence suggesting that perceptual learning is subserved at least in part by plasticity in early sensory areas of the brain.

1.1

Plasticity in the Early Visual System

1.1.1

Traditional Views

Until recently, the prevailing view among vision researchers was that the functional properties of neurons in the early stages of visual processing, while sensitive to experience early in development, are fixed in adulthood. Compelling evidence for this view was provided by Hubel and Wiesel's discovery of critical periods for the formation of ocular dominance columns in the primary visual cortex of cats and monkeys. Earlier deprivation studies, from the 1930s through the 1950s, had hinted at the existence of perceptual critical periods, since they found that animals raised from birth in visually impoverished environments (e.g., in complete darkness) showed permanent visual deficits that normal adult animals subjected to the same impoverished environments failed to develop, even when exposed to these environments for longer periods of time. A parallel to these early studies came from case studies of vision in humans following the removal of congenital and adult-acquired cataracts.


Children with congenital bilateral cataracts never seemed to acquire normal vision, while adults who had developed cataracts late in life seemed to have their vision completely restored once the cataracts were removed and replaced with artificial lenses. Hubel and Wiesel linked this pattern of results with development of the visual cortex (Hubel, 1988). They had previously discovered that cells in layer 4 of primary visual cortex (V1) were organized into alternating columns within which cells responded primarily to stimulation of either the left or the right retina. In a series of experiments, they raised cats and monkeys with one eyelid sutured closed and found that the animals failed to develop ocular dominance columns and that most cells in V1 responded only to the unclosed eye (e.g., Hubel & Wiesel, 1970; Hubel, Wiesel, & LeVay, 1977). They further found that there was a six-week critical period for this effect, after which monkeys deprived from birth could no longer recover a normal cortex, whereas older monkeys, despite being deprived for as long as a year, maintained normal cortical structure. This evidence for early critical periods, along with Hubel and Wiesel's discovery of what appeared to be a fairly tidy functional architecture in V1 (for a review, see Hubel & Wiesel, 1977) with well-defined receptive fields that seemed to remain unaffected by experimental manipulations, led many researchers to view adult V1 as a fixed array of localized filters or feature detectors. This view pervades, for example, David Marr's (1982) influential computational treatment of the early visual system, as evidenced in his assertion that "the human visual system incorporates a 'hard-wired' table of similarities by which the similarities and dissimilarities in the various parameters [of image primitives or elements] may be compared." Of course researchers realized that adults must necessarily retain some cortical plasticity, since they continue to learn throughout adulthood, learning, for example, new tasks, new objects, and new faces. The learning of such complex percepts, however, is thought to involve higher cortical areas so that, traditionally, these areas were assumed to be the locus of learning in adulthood. In the past two decades, however, researchers have amassed a rapidly growing body of evidence that neurons and neuronal circuits in the primary sensory cortices can be altered significantly by sensory experience.

1.1.2

Psychophysics, Low-Level Stimulus Specificity, and Electrophysiology

Psychophysicists have long realized that practicing certain visual tasks results in improvements in perceptual discriminations. The last two decades, however, have produced mounting evidence that this learning might involve early visual centers. Researchers have demonstrated that adults are capable of manyfold improvements in fine discriminations that likely involve early visual areas such as V1, V2, and MT, and many of these improvements show remarkable specificity—for position, orientation, and in some cases, even for the trained eye. One well-known feature of cortical architecture in the visual system is that the mean receptive field size of cells increases with an area's level in the cortical processing hierarchy. In other


words, at a given retinal eccentricity, receptive fields in area V1 are smaller, on average, than those in area V2, which are in turn smaller than those in V4 or IT. This increase in size is dramatic, with the receptive field centers of parafoveal cells subtending less than 1◦ of visual angle in area V1, and 20◦ or more in the inferotemporal cortex (IT) of rhesus macaques (Zeki, 1978). This pattern of increasing receptive field size with increasing processing stages provides a heuristic method for determining the level at which learning occurs in different perceptual learning tasks. That is, a researcher can, by training the task at one retinal location and testing it at varying locations, determine the spatial extent of the learning. Thus researchers can use tests of positional specificity to suggest where along the visual hierarchy the site of plasticity might be. Furthermore, since the responses of cells in higher visual areas become somewhat less sensitive to changes in low-level features such as the orientation and scale of stimulus elements (Maunsell & Newsome, 1987), specificity along these dimensions can also suggest, albeit to a lesser degree, the site of learning. A common procedure in a subset of perceptual learning studies (e.g., Berardi & Fiorentini, 1987; Dill & Fahle, 1997; Fahle, 1993; Karni & Sagi, 1991; O'Toole & Kersten, 1992; Schoups, Vogels, & Orban, 1995) consists of testing baseline discrimination thresholds at several different positions or orientations, training at a particular orientation and/or position until performance asymptotes, and finally testing at new locations or orientations. These results typically show learning whose specificity can be broadly characterized as


exhibiting levels of transfer ranging from no transfer to complete transfer. Using the heuristic outlined above, the levels of transfer between different low-level stimulus configurations in these studies provide varying amounts of evidence for contributions from low-level mechanisms. Learning in tasks that demonstrate little to no transfer is thus likely to involve at least some changes to low-level mechanisms, whereas tasks that demonstrate nearly complete transfer are unlikely to involve changes to low-level mechanisms. Many studies have investigated low-level visual learning, and most of these have found some degree of specificity for the particular configuration, orientation and position of the stimulus elements (for reviews, see Dill, 2002; and Gilbert, Sigman, and Crist, 2001). Of course, while the low-level specificity of learning in a task may suggest the involvement of changes in early mechanisms, it does not provide definitive evidence of such changes. For example, though average receptive field sizes tend to increase with processing level, there are some cells in higher visual areas (e.g., V4) that have small receptive fields of only a few degrees. Fortunately, a number of studies have recently provided neurophysiological evidence that some types of perceptual learning involve changes in early visual areas. In particular, researchers have demonstrated changes in the electrophysiological response properties of V1 and V2 neurons of monkeys after learning in orientation discrimination (Ghose, Yang, & Maunsell, 2002; Schoups, Vogels, Qian, & Orban, 2001), line bisection (Crist, Li, & Gilbert, 2001), and shape discrimination (Li, Piech, & Gilbert, 2004) tasks, and—of particular


relevance for the studies described in this paper—changes in the hemodynamic responses of areas V1 and MT after learning in visual texture discrimination (Schwartz, Maquet, & Frith, 2002) and motion discrimination (Vaina, Belliveau, des Roziers, & Zeffiro, 1998) tasks, respectively. Together with the psychophysical results demonstrating low-level specificity, these demonstrations of plasticity in the response properties of early visual areas render traditional feedforward views of the visual system untenable and provide strong evidence that changes in low-level visual mechanisms play a part in mediating perceptual learning. Note, however, that while these studies all demonstrate changes in the response properties of neurons in early visual areas as a result of practice, the changes observed seem to vary depending on the perceptual task learned, and the functional significance of these changes remains unclear. These studies simply demonstrate that there are low-level perceptual changes involved in perceptual learning. Determining what is learned in perceptual learning requires a different level of analysis.

1.2

Probabilistic Perceptual Inference and Learning

Modern approaches to the study of perception acknowledge that sensory

information is uncertain. This uncertainty derives from a variety of sources. In vision, for example, the signals transduced by our receptors represent the result of non-invertible transformations (e.g., 3D to 2D retinal scene projection, full electromagnetic spectrum to three bandwidth-limited cone responses) that


admit a large, potentially infinite, class of possible solutions. To this inherent ambiguity is added noise due to stimulus variability, to receptor sampling loss, to neural response variability, and to loss of information during neural transmission. This uncertainty means that no deterministic mapping is possible between a particular sensory array (i.e., the collection of signals impinging upon all of our sensory receptors over a given period of time) and a corresponding event in the world; an observer cannot logically deduce events from sensory signals. Instead, solving the problem of perception—determining what is "out there" based on the sensory array—requires probabilistic inference. Probabilistic perceptual inferences require that the observer estimate P(θ|x), the probability distribution over scene parameters θ given the data x in the sensory array. In a Bayesian framework, this distribution can be calculated from models of the sensory image formation process and of environmental structure as

P(θ|x) = P(x|θ) P(θ) / P(x)        (1.1)
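To make Equation 1.1 concrete, the following sketch computes a posterior over a small set of candidate motion directions from a single noisy measurement and reads off the maximum a posteriori (MAP) estimate. It is a minimal illustration only; the discrete grid of directions, the Gaussian likelihood, and the noise level are assumptions chosen for the example rather than values taken from any experiment reported here.

```python
import numpy as np

# Candidate scene parameters (theta): motion directions in degrees,
# a discrete stand-in for the continuous scene variable.
directions = np.array([-10.0, -5.0, 0.0, 5.0, 10.0])

# Prior P(theta): uniform here, but any prior over directions could be used.
prior = np.ones_like(directions) / directions.size

# Likelihood P(x | theta): assume the sensory measurement x is the true
# direction corrupted by Gaussian noise with standard deviation sigma.
sigma = 4.0
x = 3.2  # one noisy measurement, in degrees

likelihood = np.exp(-0.5 * ((x - directions) / sigma) ** 2)

# Posterior P(theta | x) via Bayes' rule (Equation 1.1); dividing by the
# sum plays the role of the normalizing constant P(x).
posterior = likelihood * prior
posterior /= posterior.sum()

# A MAP observer selects the direction that maximizes the posterior.
theta_map = directions[np.argmax(posterior)]
print(posterior.round(3), theta_map)
```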

Probabilistic accounts of perception and perceptual learning often make use of Bayesian ideal observers. These are theoretical observers that perform a given perceptual task optimally given the information available in the sensory array, along with some specified constraints (Geisler, 2003). Ideal observers are useful in that their performance reflects limitations in perceptual judgments due to the uncertainty inherent in a given task. For basic perceptual tasks (i.e., tasks that can be modeled as involving a single decision stage and that do not


involve significant costs in motor or general cognitive resources), researchers generally assume that the task can be modeled simply as a problem of selecting the set of scene parameters θ* that maximize the posterior density defined in Equation 1.1. Researchers define θ and x in varying ways across different tasks. For example, in characterizing perceptual judgements involving the discrimination of three-dimensional scene attributes (i.e., in spatial cue-integration tasks), researchers typically let θ represent some continuously-valued scene parameter such as surface depth (e.g., Schrater & Kersten, 2000), surface slant (e.g., Ernst, Banks, & Bülthoff, 2000; Knill, 2003; Knill & Saunders, 2003), object location (Battaglia, Jacobs, & Aslin, 2003) or object thickness (e.g., Ernst & Banks, 2002; Hillis, Ernst, Banks, & Landy, 2002); while the sensory data x are abstracted into perceptual modules, or cues, and are represented through a set of marginal posterior distributions P(θ|ci), where the ci each represent the sensory data received by one of these perceptual modules. This approach is flexible and has been used not only to account for human performance in perceptual discriminations involving 3D scene attributes, but also to characterize the functional mechanisms underlying perceptual learning in such tasks. In particular, researchers have used this approach to characterize the changes involved in perceptual cue recalibration (e.g., Atkins, Jacobs, & Knill, 2003; Epstein, Banks, & van Ee, 2001), in perceptual cue reweighting (e.g., Atkins, Fiser, & Jacobs, 2001; Ernst et al., 2000) and in the learning of visual


priors (Adams, Graf, and Ernst, 2004). In the study described in Chapter 2, we use a similar approach to describe the changes underlying perceptual cue acquisition. On the other hand, researchers investigating performance in image-based perceptual discriminations such as those involved in texture discrimination, orientation discrimination, motion direction discrimination, and vernier discrimination tend to represent θ as a set of discrete and mutually exclusive categories (e.g., pattern A or pattern B, left or right, target present or target absent), reflecting the forced-choice designs used in studying such tasks. The sensory data x are typically represented directly using the raw signal data (e.g., pixel luminance values)—although in constructing constrained ideal observers researchers sometimes filter the raw signals to emulate the attenuation of physical signals by human sensory systems (Geisler, 2003). Investigations of perceptual judgements in such tasks—in the context of perceptual learning—have been dominated by the use of black-box noise models of human observers (e.g., Lu & Dosher, 1999; Pelli, 1981). Such models seek to characterize the noise internal to a human observer in terms of equivalent input noise, the amount of external noise that must be added to signals to degrade the performance of an ideal observer to match human discrimination performance. Note that this approach is quite different from that used in the cue-integration framework described above. The cue integration studies mentioned above also estimate the noise in the observer's sensory system, but


the primary focus of those studies is to determine how observers strategically combine information from multiple noisy sources and, in particular, how they use information about the differential uncertainty of various cues in integrating across them. In contrast, characterizing the noisiness of perceptual estimates (in terms of equivalent external input noise) is the primary contribution of black-box noise models. Several authors have proposed specific (gray-box) observer models (e.g., Lu and Dosher, 1999; Gold, Sekuler, & Bennett, 2004), and attempted to use these models to characterize the source of the noise within the observer, but the class of plausible models admits many solutions, some of which can provide conflicting accounts of experimental results (e.g., see the general discussion section of Gold et al., 2004). While the goals of the two approaches—characterizing perceptual information integration strategies and measuring the amount of inherent uncertainty in sensory judgments—are both important, the cue integration approach is more informative, because it can be used to isolate to some extent the source of inefficiencies in an observer. If, for example, a human observer with two cues to a scene parameter performs worse in a discrimination task than a constrained ideal observer with equivalent uncertainty in each of the individual cues, then we can conclude that, in addition to any uncertainty due to sensory noise, some part of the human observer's inefficiency results from a suboptimal cue integration strategy. This approach is especially important when studying perceptual learning. Consider, for example, that in the scenario described above we could use the cue-integration approach to determine whether


any improvements in discrimination performance were due to improvements in the measurements provided by individual cues or to improvements in the mechanisms responsible for integrating across the cues. In Chapter 3, we introduce a method that adapts this flexible cue-integration framework to examine learning in image-based perceptual discriminations. In summary, because perceptual judgments are inherently ambiguous and non-deterministic, we model them as probabilistic perceptual inferences. Within this framework, perceptual learning reflects improved probabilistic inference with respect to the learned task. Moreover, by systematically manipulating the information available in the stimulus and comparing the performance of the human observer to that of an ideal observer across these manipulations, we can determine what mechanisms change as a result of practice, and thus discover the functional mechanisms that mediate perceptual learning. In the current paper, we use this framework—casting perceptual learning as improved probabilistic inference—to characterize the changes underlying perceptual learning in perceptual cue acquisition and in visual texture discrimination tasks. Our focus is on characterizing the flexibility of perceptual learning, and the constraints that act on it. In the following chapters we describe two studies of perceptual learning that model subjects as observers performing probabilistic perceptual inferences to determine how their use of the available sensory information changes as a result of training. In Chapter 2 we describe five experiments examining the mechanisms of perceptual cue acquisition. Subjects were placed


in novel environments containing systematic statistical relationships among scene and perceptual variables. These relationships could be either consistent or inconsistent with the types of sensory relationships that occur in natural environments. We found that subjects' learning was biased, and we propose a new constraint on early perceptual learning to account for these results. Namely, people are capable of modifying their knowledge of the prior probabilities of scene variables that are already considered to be potentially dependent, but they cannot learn new relationships among variables that are not considered to be potentially dependent, even when placed in novel environments in which these variables are strongly related. We formalize this constraint using the notation of Bayesian networks and discuss the computational considerations that make this a desirable constraint for perceptual systems. Chapter 3 examines the mechanisms of learning in image-based perceptual tasks. We introduce a novel and efficient modification of the classification image technique that defines the stimuli in terms of a low-dimensional set of relevant basis features and describe the results of two experiments that explored subjects' discrimination strategies using this improved technique. In these experiments we found that over the course of learning subjects modified their decision strategies in a manner consistent with optimal feature combination, giving greater weight to reliable features and less weight to unreliable features. Moreover, when we changed the variance structure of the noise added to the signals, subjects modified their templates to reflect the corresponding changes in the ideal template. We conclude that human observers extract information about the reliabilities of arbitrary low-level features and exploit


this information when learning to make perceptual discriminations. Finally, in Chapter 4 we discuss the implications of these findings and make suggestions for future investigation.
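Before turning to those studies, the reliability-weighted combination rule implicit in the cue-integration framework of Section 1.2 can be made concrete with a short sketch. For independent Gaussian cue likelihoods, the optimal combined estimate weights each cue in proportion to its reliability (inverse variance); the particular cue values and variances below are hypothetical and serve only to show that the combined estimate is pulled toward the more reliable cue.

```python
import numpy as np

def combine_cues(means, variances):
    """Optimally combine independent Gaussian cue estimates of a single
    scene parameter: weights are proportional to reliability (1 / variance)."""
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    reliabilities = 1.0 / variances
    weights = reliabilities / reliabilities.sum()
    combined_mean = np.sum(weights * means)
    combined_variance = 1.0 / reliabilities.sum()
    return combined_mean, combined_variance, weights

# Hypothetical example: a reliable visual cue and a noisier auditory cue,
# both estimating motion direction in degrees.
mean, var, weights = combine_cues(means=[0.0, 4.0], variances=[1.0, 4.0])
print(mean, var, weights)  # the estimate lies much closer to the visual cue
```

The same weighting logic reappears in Chapter 3, where the "cues" are arbitrary low-level image features rather than conventional depth or slant cues.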


Chapter 2

A Bayesian Network Model of Constraints on Perceptual Learning

2.1

Introduction

Acquiring new information about the world requires contributions from

both nature and nurture. These factors determine both what biological organisms can learn and also what they cannot learn. Numerous studies have shown that organisms’ learning processes are often biased or constrained. Perhaps the most famous demonstration that learning processes are biased comes from the work of Garcia and Koelling (1966) who showed that rats are predisposed to learn certain types of stimulus associations and not others. They interpreted their results in terms of a learning bias referred to as “belongingness”— organisms more easily learn associations among types of stimuli that are correlated or participate in cause-and-effect relationships in natural environments. A more recent demonstration that learning processes are biased comes from


the work of Saffran (2002). She found that people more easily learned an artificial language when the statistical relationships among sound components were consistent with the dependencies that characterize the phrase structures of natural languages. This chapter proposes a new constraint or bias on early, or low-level, perceptual learning.1

We hypothesize that people’s early perceptual processes

can modify their knowledge of the prior probabilities of scene properties, or of the statistical relationships among scene and sensory variables that are already considered to be potentially dependent. However, they cannot learn new relationships among scene and sensory variables that are not considered to be potentially dependent, even when placed in novel environments in which these variables are strongly related.

Footnote 1: We use the terms "low-level perception", "early perception", or "early perceptual learning" in the same ways as many other researchers in the perceptual sciences literature (see Fahle and Poggio, 2002, and Gilbert, 1994, for reviews). Although the exact meanings of these terms can be fuzzy—for example, the boundary between early versus late perception is not completely understood—investigators have found these terms to be highly useful.

To illustrate this idea, consider the problem of perceptual cue acquisition. Wallach (1985) proposed a theory of cue acquisition that is representative of other theories in the literature (a closely related theory was originally proposed by Brunswick, 1956; readers interested in this topic should also see Haijiang et al., 2006). He hypothesized that in every perceptual domain (e.g., perception of motion direction) there is at least one primary source of information, useable innately and not modifiable by

experience. Other perceptual cues are acquired later through correlation with the innate process. Using Wallach’s theory, we consider constraints on the learning processes underlying cue acquisition. One possibility is that these processes are general purpose, meaning they are equally sensitive to correlations between known cues and any signal. For example, let’s suppose that retinal image slip is an innate cue to motion direction, and let’s consider an observer placed in a novel environment in which retinal image slip is perfectly correlated with a novel signal, such as the temperature of the observer’s toes (e.g., leftward retinal slip is correlated with cold toes, and rightward retinal slip is correlated with hot toes). According to Wallach’s theory, it ought to be the case that the observer learns that the temperature of her toes is a perceptual cue to motion direction. For example, the observer may learn that cold toes indicate leftward motion, whereas hot toes indicate rightward motion. Alternatively, it may be that the learning processes underlying cue acquisition are biased such that they are more sensitive to some correlations than to others. In particular, we conjecture that these processes cannot learn new relationships among scene and sensory variables that are not considered to be potentially dependent. It seems likely that an observer placed in the novel environment described above would not believe that motion direction and the temperature of her toes are potentially dependent variables and, thus, the observer’s early perceptual system would fail to learn that the temperature of her toes is a cue to motion direction.


In the remainder of this chapter we report the results of five experiments. These experiments evaluate our hypothesis regarding biases in early perceptual learning. They do so in the context of Wallach’s theory of cue acquisition described above, namely that new perceptual cues can be acquired by correlating an existing cue with a novel sensory signal. We then present a simple model, described in terms of Bayesian networks, that formalizes our hypothesis, accounts for our results, and is consistent with the existing literature on perceptual learning. In Experiment 1, subjects were placed in a novel environment that resembled natural environments in the sense that it contained systematic relationships among scene and perceptual variables which normally share systematic relationships. Subjects were trained to perceive the motion direction of a field of moving dots when the visual cue to motion direction was correlated with an ambiguous auditory motion signal. When an object moves in a natural environment, this event often gives rise to correlated visual and auditory signals. In other words, perceived auditory and visual motion signals are both dependent on the motion of objects in a scene (as illustrated by the solid black edges in Figure 2.13) and, thus, people regard motion direction in a scene and visual or auditory signals as potentially dependent. We reasoned that subjects in our experiment should be able to estimate the motion direction of the moving dots based on the auditory and visual signals, and then modify their knowledge of the relationship between motion direction and auditory stimuli (i.e. perform parameter learning by modifying their conditional distribution of the perceived


auditory motion signal given the estimated motion direction). We predicted, therefore, that subjects would learn to use the hitherto ambiguous auditory stimulus as a cue to motion direction. As reported below, the experimental results are consistent with our prediction. Experiment 1 can be regarded as a control experiment in the sense that it verified that our experimental procedures are adequate for inducing observers to learn a new perceptual cue in the manner suggested by Wallach (i.e., by correlating a signal which is not currently a cue with an existing cue). In Experiments 2-5, subjects were placed in novel environments that did not resemble natural environments—they contained systematic relationships among scene and perceptual variables that do not normally share systematic relationships. Cue acquisition requires structure learning in these cases. In Experiments 2 and 3, the visual cue to motion direction was correlated with binocular disparity or brightness signals, respectively; the experimental procedures were otherwise identical to those of Experiment 1. In the natural world, neither brightness nor binocular disparity varies systematically with transverse object motion (i.e., motion in the frontoparallel plane). That is, observers should not consider motion direction and either brightness or binocular disparity as potentially dependent variables. In contrast to Experiment 1, the predictions of Wallach's hypothesis for Experiments 2 and 3 differ from those of our theory. Wallach's hypothesis suggests that correlating ambiguous signals with existing cues should be sufficient to induce cue learning. In contrast, our hypothesis claims that observers can only learn relationships between variables


that are considered to be potentially dependent. Because transverse motion direction and either brightness or binocular disparity are not considered to be potentially dependent, we predicted that subjects in Experiments 2 and 3 would fail to learn to use brightness or binocular disparity signals as cues to transverse motion direction. The experimental results are consistent with this prediction. Experiments 1-3 attempted to teach subjects a new cue to transverse motion direction. To check that there is nothing idiosyncratic about this perceptual judgement, a different task was used in Experiments 4 and 5. Subjects were trained to perceive the light source direction when the shading cue to this direction was correlated with a visual disparity or auditory signal. Because neither binocular disparity nor auditory signals share systematic relationships with light source direction in the natural world, we predicted that subjects would fail to learn that these signals were also cues to light source direction in our novel experimental environments. Again, the experimental results are consistent with this prediction. Taken as a whole, the experimental results are consistent with the hypothesis that the learning processes underlying cue acquisition are biased by prior beliefs about potentially dependent variables such that cue acquisition is possible when a signal is correlated with a cue to a scene property and the signal is potentially dependent on that property.2 If the signal is not believed to be potentially dependent on the property, cue acquisition fails. In the discussion section, we introduce a Bayesian network model formalizing this hypothesis.

Footnote 2: The term "belief" here is used in the Bayesian sense, to describe a property of the human perceptual system, and not to describe an observer's explicit beliefs.
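To give the parameter-learning idea discussed in this introduction a concrete form before turning to the experiments, the sketch below fits a conditional distribution for an auditory signal given the (visually estimated) motion direction from simulated training pairs, and then inverts it to read a direction estimate from a new auditory signal. The linear-Gaussian form and the simulated data are assumptions made for illustration; this is not the Bayesian network model developed in the discussion section. On the hypothesis advanced here, an update of this kind is available only for variable pairs the observer already treats as potentially dependent.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated training trials: motion direction estimated from the existing
# visual cue, paired with the value of the novel auditory signal.
estimated_direction = rng.uniform(-10.0, 10.0, size=500)              # degrees
auditory_signal = 0.8 * estimated_direction + rng.normal(0.0, 2.0, 500)

# Parameter learning over an existing edge: estimate the conditional
# distribution P(auditory signal | direction) as a linear-Gaussian model.
slope, intercept = np.polyfit(estimated_direction, auditory_signal, deg=1)
residual_sd = np.std(auditory_signal - (slope * estimated_direction + intercept))

# Once learned, the auditory signal can act as a cue in its own right:
# invert the conditional to obtain a direction estimate from a new signal.
new_signal = 3.0
direction_from_audition = (new_signal - intercept) / slope
print(slope, intercept, residual_sd, direction_from_audition)
```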

2.2

Experiment 1: Auditory cue to motion direction

Subjects in Experiment 1 were trained to perceive the motion direction

of a field of dots when the visual cue to motion direction was correlated with an auditory signal. The experiment examined whether subjects would learn that the auditory signal too is a cue to motion direction. Because moving objects often give rise to both visual and auditory signals in natural environments (i.e., since sounds are created by physical motion), we expected that subjects would consider motion direction and an auditory signal to be potentially dependent and, thus, would learn that the auditory signal is also a cue.3

Footnote 3: The visual and auditory stimuli in this experiment are analogous to small particles (e.g., sand grains) moving across a surface with an anisotropic texture. The sound produced by such a stimulus depends on the properties of the surface texture and, as in the current experiment, the anisotropic surface texture would cause changes in the mean direction of the moving particles to lead to systematic changes in spectral properties of the resulting sound.


2.2.1

Methods

Subjects

Subjects were eight students at the University of Rochester with normal or corrected-to-normal vision and normal hearing. All subjects were naive to the purposes of the study.

Stimuli

Visual stimuli were random-dot kinematograms (RDK) presented for a duration of 1 second. The kinematograms consisted of 309 small antialiased white dots (each subtending approximately 0.65 minutes of visual angle) moving (at a rate of 1.4 degrees per second) behind a simulated circular aperture (with a diameter of 5.72 degrees of visual angle) against a black background. Half the dots in a display moved in the same direction, referred to as the stimulus direction, whereas each of the remaining dots moved in a direction sampled from a uniform distribution. Each dot had a lifetime of approximately 150 ms, after which a new replacement dot appeared in a random position within the aperture. These stimuli were presented on a standard 19-inch CRT with a resolution of 1024 × 768 pixels and a refresh rate of 100 Hz, and were viewed from a distance of 1.5 meters. All experiments were conducted in a darkened room, with black paper obscuring the edges of the CRT.
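A minimal sketch of how dot directions for such a kinematogram might be assigned is given below. The numbers mirror the description above (a coherent half of the dots moving in the stimulus direction, the remainder drawn from a uniform distribution); dot positions, lifetimes, and rendering are omitted, and the function itself is illustrative rather than the code used to generate the stimuli.

```python
import numpy as np

def rdk_directions(n_dots=309, stimulus_direction=0.0, coherence=0.5, rng=None):
    """Assign a motion direction (in degrees) to each dot of a random-dot
    kinematogram: a coherent fraction moves in the stimulus direction and
    the remaining dots move in directions drawn from a uniform distribution."""
    rng = rng or np.random.default_rng()
    n_coherent = int(round(coherence * n_dots))
    coherent = np.full(n_coherent, stimulus_direction)
    random_dirs = rng.uniform(0.0, 360.0, size=n_dots - n_coherent)
    directions = np.concatenate([coherent, random_dirs])
    rng.shuffle(directions)
    return directions

# Example: a comparison stimulus whose coherent dots move 5 degrees clockwise.
dirs = rdk_directions(stimulus_direction=5.0)
```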


Auditory stimuli consisted of 1 second of 'notched' white noise played through a pair of headphones. We used auditory noise because we wanted to create ambiguous motion stimuli. Two stimuli defining the endpoints of a continuum, denoted A and B, were each constructed by combining two narrow bands of noise (sampled at 22 kHz). Stimulus A had approximately equal amplitude in the ranges 4000-5000 Hz and 8000-10000 Hz, whereas stimulus B had approximately equal amplitude in the ranges 1-2000 Hz and 6000-7000 Hz. Intermediate stimuli were created by linearly combining stimuli A and B, where the linear coefficients formed a unit-length vector whose endpoint lay on a circle passing through the points (1, 0) and (0, 1) [e.g., the coefficients (1, 0) produced stimulus A, the coefficients (0, 1) produced stimulus B, and the coefficients (1/√2, 1/√2) produced a stimulus midway between A and B].4 Auditory stimuli were normalized to have equal maximum amplitudes.

Footnote 4: The reader may wonder why we did not use more traditional auditory motion stimuli with interaural phase and intensity differences. There are two reasons for this. First, we wanted to make the auditory motion ambiguous; setting up systematic interaural differences would bias observers to perceive the auditory stimulus as indicating a particular motion direction. Second, the visual stimuli used in the current experiment represent field motion rather than object motion (i.e., the mean velocity of the dot fields in our random-dot kinematograms varies between stimuli, but the mean position of the dots remains constant). Interaural phase and intensity differences result from changes in the position of an object—or in the mean position of a group of objects; because the mean positions of our visual stimuli remain constant, interaural differences are inappropriate for representing their motion.
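The construction of the auditory continuum can be sketched as follows: two band-limited noise stimuli are mixed with unit-length coefficients lying on a circle, so that angles of 0, 45, and 90 degrees along the continuum yield stimulus A, the midway stimulus, and stimulus B, respectively. The band edges and sampling rate follow the description above, but the filtering and mixing functions are assumptions for illustration, not the code used in the experiment.

```python
import numpy as np

rng = np.random.default_rng(1)
fs = 22050          # sampling rate (Hz), approximating the 22 kHz described above
duration = 1.0      # seconds
n = int(fs * duration)

def band_noise(low, high):
    """White noise filtered (crudely, via the FFT) to a single frequency band."""
    spectrum = np.fft.rfft(rng.standard_normal(n))
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    spectrum[(freqs < low) | (freqs > high)] = 0.0
    return np.fft.irfft(spectrum, n)

# Endpoint stimuli: each combines two narrow noise bands.
stim_a = band_noise(4000, 5000) + band_noise(8000, 10000)
stim_b = band_noise(1, 2000) + band_noise(6000, 7000)

def morph(angle_deg):
    """Intermediate stimulus: unit-length mixing coefficients on a circle,
    so 0 deg gives stimulus A, 90 deg gives B, and 45 deg the midpoint."""
    a, b = np.cos(np.radians(angle_deg)), np.sin(np.radians(angle_deg))
    mix = a * stim_a + b * stim_b
    return mix / np.abs(mix).max()   # normalize to equal maximum amplitude

midpoint = morph(45.0)
```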


Procedure

The experiment used four tasks, referred to as the vision-only, auditiononly, and vision-audition training tasks, and the vision-audition test task. The vision-only and audition-only tasks allowed us to characterize each subject’s performances on visual and auditory discrimination tasks, respectively. The goal of the vision-audition training task was to expose subjects to an environment in which an auditory signal is correlated with a visual cue to motion direction. The goal of the vision-audition test task was to evaluate whether subjects learned that the auditory signal is also a cue to motion direction. In each trial of the vision-only training task, four visual displays were presented: a fixation square was presented for 500 ms, followed by the first RDK for 1000 ms, followed by a second fixation square for 400 ms, followed by the second RDK for 1000 ms. The stimulus direction of the first RDK, referred to as the “standard” stimulus, was always 0◦ (vertical). The second RDK, referred to as the “comparison” stimulus, had a stimulus direction different from the standard. Subjects judged whether the dots in the comparison stimulus moved to the left (anticlockwise) or to the right (clockwise) of those in the standard (vertical) stimulus. They responded by pressing the appropriate key on the keyboard. At the end of every 10 trials, subjects were informed of the number of those trials on which they responded correctly. The ease or difficulty of the task was varied over trials by varying the stimulus direction of the comparison so that difficult trials contained smaller direction differences between


the standard and comparison stimuli than did easy trials. This direction was determined using interleaved 2-up 1-down and 4-up 1-down staircases. Trials were run until there were at least 12 reversals of each staircase. A subject’s approximate 71%-correct and 84%-correct thresholds were set to the average values over the last 10 reversals of the 2-up 1-down and 4-up 1-down staircases, respectively. The audition-only training task was identical to the vision-only training task with the following exception. Instead of viewing RDK, subjects heard auditory stimuli. The standard was an auditory stimulus midway between stimuli A and B defined above, whereas the comparison was either nearer to A or nearer to B. Subjects judged whether the comparison was closer to A or B relative to the standard. Subjects were familiarized with A and B prior to performing the task. Subjects also performed a vision-audition training task in which an auditory signal is correlated with a visual cue to motion direction. Before performing this task, we formed a relationship between visual and auditory stimuli by mapping a subject’s visual thresholds onto their auditory thresholds. This was done using a log-linear function log(dv ) = m log(da ) + b

(2.1)

where dv and da are visual and auditory “directions”, respectively, m is a slope parameter, and b is an intercept parameter. The log-linear function ensured that corresponding visual and auditory stimuli were (approximately) equally


salient. The vision-audition training task was identical to the vision-only training task with the following exception. Instead of only viewing RDK, subjects both viewed RDK and heard the corresponding auditory stimuli. They were instructed to focus on the visual motion-direction discrimination task, but were also told that the auditory stimulus might be helpful. Half the subjects were run in the “no-switch” condition, meaning that the relationship between an auditory cue and a response key was the same on this task as it was on the audition-only task. The remaining subjects were run in the “switch” condition. (In other words, for half the subjects the stimulus “direction” of auditory stimulus A was anticlockwise of vertical and the “direction” of B was clockwise of vertical, whereas this relationship was reversed for the remaining subjects.) This was done so that results on the vision-audition training and test tasks could not be attributed to an association between auditory stimuli and response keys learned when subjects performed the audition-only trials. Vision-audition test trials were conducted to evaluate whether subjects learned that the auditory signal is correlated with the visual cue to motion direction and, thus, it too is a cue to motion direction. These test trials were similar to vision-audition training trials with the following differences. First, the presentation order of the standard and comparison were randomized. Subjects were instructed to judge whether the direction of the second stimulus was anticlockwise or clockwise relative to that of the first stimulus. Second, subjects never received feedback. Third, stimuli were selected according to the method of constant stimuli rather than according to a staircase. Importantly,


standard stimuli were "cue-conflict" stimuli—the direction of the RDK was vertical but the direction of the auditory stimulus was offset from vertical by either a value δ or −δ, where δ was set to a subject's 84%-correct threshold on the audition-only training trials. In contrast, the comparison stimulus was a "cue-consistent" stimulus. By comparing performances when the auditory signal in the standard had an offset of δ versus −δ, we can evaluate whether this signal influenced subjects' judgements of motion direction. Subjects performed the four tasks during two experimental sessions. In Session 1, they performed the vision-only and audition-only training tasks. Before performing these tasks, subjects performed a small number of practice trials in which they were given feedback on every trial. They also performed the vision-audition training task twice. In Session 2, they performed the vision-audition training task, and then performed the vision-audition test task twice.
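One step of this procedure, the log-linear mapping of Equation 2.1 that places visual and auditory "directions" on comparably salient scales, can be sketched as follows. The threshold values below are hypothetical placeholders; in the experiment they came from each subject's staircases.

```python
import numpy as np

# Hypothetical per-subject thresholds (71%- and 84%-correct), expressed as
# "directions" in each modality; real values came from the staircases above.
visual_thresholds = np.array([2.0, 4.0])      # degrees from vertical
auditory_thresholds = np.array([10.0, 25.0])  # position along the A-B continuum

# Fit log(d_v) = m * log(d_a) + b (Equation 2.1) so that the auditory
# thresholds map onto the corresponding visual thresholds.
m, b = np.polyfit(np.log(auditory_thresholds), np.log(visual_thresholds), deg=1)

def auditory_to_visual(d_a):
    """Visual direction of (approximately) equal salience to auditory direction d_a."""
    return np.exp(m * np.log(d_a) + b)

print(auditory_to_visual(15.0))
```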

2.2.2 Results

Two subjects’ results on the vision-audition test task are shown in Figure 2.1. The horizontal axis of each graph gives the direction of the comparison, whereas the vertical axis gives the probability that the subject judged the direction of the comparison as clockwise relative to that of the standard. The data points indicated by circles or crosses are for the trials in which the auditory signal in the standard was offset from vertical by the amount δ or −δ, respectively. The dotted and solid lines are cumulative Normal distribu-


tions fit to these data points using a maximum likelihood procedure due to Wichmann and Hill (2001a). To compare a subject’s performances when the offset of the auditory signal in the standard was δ versus −δ, we compared a subject’s point of subjective equality in each case. The point of subjective equality (PSE) is defined as the direction of the comparison at which a subject is equally likely to judge this direction as being anticlockwise or clockwise relative to that of the standard. For example, consider subject BCY whose data are shown in the left graph of Figure 2.1. This subject’s PSE is about −3◦ in the −δ case, and about 2◦ in the δ case, indicating a PSE shift of about 5◦. For each of the subjects whose data are illustrated in Figure 2.1, their PSE when the offset was −δ is significantly less than their PSE when the offset was δ (both subjects had significant PSE shifts using a significance level of p < 0.05, where the test of significance is based on a Monte Carlo procedure described by Wichmann and Hill, 2001b). Seven of the eight subjects run in the experiment had significant PSE shifts. The graph in Figure 2.2 shows the combined data for all eight subjects, along with the maximum-likelihood psychometric fits for the pooled data. The average value of the offset δ across all subjects was equivalent to a 4.30◦ rotation in motion direction. Importantly for our purposes, subjects showed a large shift in their point of subjective equality—the PSE shift for the combined data is 3.72◦ (p < 0.001). These data suggest that subjects based their judgements on information from both the visual and auditory signals. Had


subjects used only the visual signal, we would have expected no shift in their PSEs. Conversely, if subjects had used only the auditory signal, then a PSE shift of 2δ (8.6◦ on average) would have been expected. The actual PSE shift (3.72◦ on average) was smaller, consistent with the idea that subjects combined information from the visual and auditory signals. In summary, the results suggest that subjects acquired a new perceptual cue—they learned that the ambiguous auditory motion signal was correlated with the visual cue to motion direction and, thus, it too is a cue to motion direction. Furthermore, the subjects used the new cue for the purposes of sensory integration—they combined information from the new auditory cue with information from the previously existing visual cue when judging motion direction.
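One way to see why a shift between 0 and 2δ is the signature of cue combination is to treat the judgement as reliability-weighted averaging of the two signals. The sketch below illustrates this under standard Gaussian cue-combination assumptions; the noise values are hypothetical and are not fit to the reported data, so it is an illustration of the logic rather than an analysis of the experiment.

```python
# Sketch: predicted PSE shift if subjects average the visual and auditory
# "directions" with weights inversely proportional to their variances.
# Hypothetical reliabilities; not fit to the reported data.

sigma_v = 4.3   # visual direction-discrimination noise (deg), hypothetical
sigma_a = 6.0   # auditory "direction" noise (deg), hypothetical
delta = 4.3     # auditory offset used in the cue-conflict standards (deg)

w_a = (1 / sigma_a**2) / (1 / sigma_v**2 + 1 / sigma_a**2)  # auditory weight

# Standard stimulus: visual direction 0, auditory direction +delta or -delta.
pse_plus = w_a * (+delta)    # perceived standard direction, +delta condition
pse_minus = w_a * (-delta)   # perceived standard direction, -delta condition
print(f"auditory weight = {w_a:.2f}")
print(f"predicted PSE shift = {pse_plus - pse_minus:.2f} deg "
      f"(0 if audition were ignored, {2 * delta:.1f} if only audition were used)")
```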

2.3 Experiment 2: Disparity cue to motion direction

Subjects in Experiment 2 were trained to perceive the direction of moving

dots when the visual cue to motion direction was correlated with a binocular disparity signal. The experiment examined whether subjects would learn that the disparity signal too is a cue to motion direction. Because the transverse motion of objects in the natural world does not affect the binocular disparities received by observers, we reasoned that subjects in our experiment would not believe that there is a potential dependency between transverse motion and


[Figure 2.1 appears here: two panels, for subjects BCY and LSK, each plotting p(comparison > standard) against the motion direction of the comparison, with separate fits and data points for the −δ and +δ conditions.]

Figure 2.1: The data for two subjects from the vision-audition test trials. The horizontal axis of each graph gives the direction of the comparison, whereas the vertical axis gives the probability that the subject judged the direction of the comparison as clockwise relative to that of the standard. The data points indicated by circles or crosses are for the trials in which the auditory signal in the standard was offset from vertical by the amount δ or −δ, respectively. The dotted and solid lines are cumulative Normal distributions fit to these data points using a maximum likelihood procedure.


[Figure 2.2 appears here: p(comparison > standard) versus motion direction of the comparison stimulus, with fits and data points for the −δ and +δ conditions.]

Figure 2.2: The data from the vision-audition test trials for all eight subjects combined.


disparity, and would therefore be unable to learn that the disparity signal is also a cue to motion direction in our novel experimental environment.

2.3.1 Methods

Subjects

Subjects were eight students at the University of Rochester with normal or corrected-to-normal vision. All subjects were naive to the purposes of the study.

Stimuli

Motion stimuli were random-dot kinematograms (RDK) identical to those used in Experiment 1, except that, to limit “bleeding” across frames in the stereo condition, only the red gun of the CRT was used. Stimuli containing binocular disparities were created as follows. Stationary dots were placed at simulated depths (all dots in a given display were at the same depth) ranging from -23 cm to 23 cm relative to fixation (or from 127 cm to 173 cm in absolute depth from the observer), and rendered from left-eye and right-eye viewpoints. Left-eye and right-eye images were presented to subjects using LCD shutter glasses (CrystalEyes 3 from Stereographics). Stimuli with both visual motion and disparity signals were created by placing moving dots at simulated depths, and rendering the dots from left-eye and right-eye viewpoints.


Procedure

The procedure for Experiment 2 was identical to that of Experiment 1, except that the auditory signal was replaced by the binocular disparity signal. That is, subjects performed motion-only, disparity-only, and motion-disparity training trials, and motion-disparity test trials. For the motion-disparity training trials, stimuli with both motion and disparity signals were constructed as in Experiment 1, by mapping motion direction values onto disparity values based on the motion and disparity discrimination thresholds obtained in the motion-only and disparity-only training trials. The motion-disparity test trials were functionally identical to those of Experiment 1, with δ now representing offsets from the vertical “direction” in the disparity signal of the standard stimulus.

2.3.2 Results

If a subject had different PSEs when the disparity offset was −δ versus δ, then we can conclude that the subject learned to use the disparity signal as a cue to motion direction. Only one of the eight subjects had significantly different PSEs in the two conditions (at the p < 0.05 level), suggesting that subjects did not learn to use the disparity signal when judging motion direction. The data for all subjects are shown in Figure 2.3. We fit psychometric functions (cumulative Normal distributions) to the combined data from all 8 subjects when the offset in the standard was δ (solid line) and when it


was −δ (dotted line). The average value across subjects for the offset δ was equivalent to a 4.49◦ rotation in motion direction. The experimental outcome is that subjects did not learn to use the disparity signal as a cue to motion direction—the 0.04◦ shift in PSEs when the offset was δ versus −δ was not significantly different from zero at the p < 0.05 level.
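The PSE comparison used in these results can be made concrete with a small sketch. The code below fits cumulative-Normal psychometric functions to made-up response counts in the +δ and −δ conditions and tests the PSE difference with a simple label-permutation test. The Monte Carlo procedure of Wichmann and Hill (2001b) cited earlier differs in its details; this is only an illustration, and all data in it are invented.

```python
# Sketch of a Monte Carlo test for a PSE difference between the +delta and -delta
# conditions. Cumulative-Normal psychometric function; the data below are made up.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def fit_pse(x, n_cw, n_total):
    """Fit P(respond 'clockwise') = Phi((x - mu) / sigma); return mu (the PSE)."""
    def nll(params):
        mu, log_sigma = params
        p = norm.cdf((x - mu) / np.exp(log_sigma))
        p = np.clip(p, 1e-6, 1 - 1e-6)
        return -np.sum(n_cw * np.log(p) + (n_total - n_cw) * np.log(1 - p))
    return minimize(nll, x0=[0.0, np.log(5.0)], method="Nelder-Mead").x[0]

rng = np.random.default_rng(0)
x = np.array([-20, -10, -5, 0, 5, 10, 20], dtype=float)  # comparison directions (deg)
n = 20                                                   # trials per level (hypothetical)
cw_plus = np.array([1, 4, 7, 12, 16, 18, 20])            # made-up counts, +delta condition
cw_minus = np.array([0, 2, 5, 8, 13, 17, 19])            # made-up counts, -delta condition

observed_shift = fit_pse(x, cw_plus, n) - fit_pse(x, cw_minus, n)

# Null distribution: refit after randomly swapping the condition labels at each level.
null_shifts = []
for _ in range(500):
    swap = rng.random(len(x)) < 0.5
    a = np.where(swap, cw_minus, cw_plus)
    b = np.where(swap, cw_plus, cw_minus)
    null_shifts.append(fit_pse(x, a, n) - fit_pse(x, b, n))
p_value = np.mean(np.abs(null_shifts) >= abs(observed_shift))
print(f"PSE shift = {observed_shift:.2f} deg, permutation p = {p_value:.3f}")
```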

2.4

Experiment 3: Brightness cue to motion direction Subjects in Experiment 3 were trained to perceive the direction of mov-

ing dots when the visual cue to motion direction was correlated with a visual brightness signal. The experiment examined whether subjects would learn that the brightness signal too is a cue to motion direction. Because the transverse motion of objects in the natural world does not affect their brightness, we reasoned that subjects do not represent a potential dependency between transverse motion and brightness, and would therefore be unable to learn that the brightness signal is also a cue to motion direction in our novel experimental environment.


[Figure 2.3 appears here: p(comparison > standard) versus motion direction of the comparison, with fits and data points for the −δ and +δ conditions.]

Figure 2.3: Data from the motion-disparity test trials for all eight subjects combined.


2.4.1 Methods

Subjects

Subjects were eight students at the University of Rochester with normal or corrected-to-normal vision. All subjects were naive to the purposes of the study.

Stimuli

Motion stimuli were random-dot kinematograms (RDK) identical to those used in Experiments 1 and 2, except that the individual dots were assigned a neutral or pedestal brightness value. The brightness stimuli consisted of stationary random-dot images whose dots all shared a common pixel brightness value which ranged from 78 to 250 on a scale of 0-255. The pedestal pixel brightness of 164 had a luminance of 45.0 cd/m². Near this pedestal, luminance values scaled approximately linearly with pixel brightness, with 1 unit of RGB pixel brightness equivalent to 0.786 cd/m². Stimuli with both visual motion and brightness signals were created by assigning brightness pixel values to moving dots.


Procedure

The procedure for Experiment 3 was identical to those of Experiments 1 and 2, except that the auditory or disparity signals were replaced by a brightness signal.

2.4.2 Results

The motion-brightness test trials contained two conditions—the “direction” of the brightness signal in the standard stimulus was offset from vertical by either an amount −δ or δ. If a subject had different PSEs in the two conditions, then we can conclude that the subject learned to use the brightness signal as a cue to motion direction. None of the eight subjects had significantly different PSEs in the two conditions (at the p < 0.05 level), suggesting that subjects did not learn to use the brightness signal when judging motion direction. The data for all subjects are illustrated in Figure 2.4. We fit psychometric functions (cumulative Normal distributions) to the combined data from all 8 subjects when the offset in the standard was δ (solid line) and when it was −δ (dotted line). The average value across subjects for the offset δ was equivalent to a 5.55◦ rotation in motion direction. The 0.70◦ shift in PSEs in the δ versus −δ cases was not statistically significant at the p < 0.05 level, indicating that subjects did not learn to use the brightness signal as a cue to motion direction.


[Figure 2.4 appears here: p(comparison > standard) versus motion direction of the comparison, with fits and data points for the −δ and +δ conditions.]

Figure 2.4: Data from the motion-brightness test trials for all eight subjects combined.


2.4.3 Discussion

We hypothesize that our early perceptual systems are capable of learning novel statistical relationships among scene and sensory variables that are already considered to be potentially dependent, but that they cannot learn new relationships among scene and sensory variables that are not considered to be potentially dependent, even when placed in novel environments in which these variables are strongly related. Our experiments were designed to evaluate this hypothesis in the context of cue acquisition. Experiments 1-3 evaluated whether people could learn new cues to transverse motion (motion in the frontoparallel plane). In Experiment 1, subjects were exposed to an environment in which visual motion direction was correlated with an auditory signal. Because motion in natural environments often gives rise to both visual and auditory signals, it seems reasonable to assume that people believe that there is a potential dependency between motion direction and an auditory stimulus and, thus, we predicted that subjects would succeed in acquiring a new cue. The experimental results are consistent with this prediction. We can regard Experiment 1 as a control experiment—it establishes that our experimental procedures are adequate for inducing cue acquisition and our statistical analyses are adequate for detecting this acquisition. Experiments 2 and 3 exposed subjects to an environment in which visual motion direction was correlated with a binocular disparity signal or a


brightness signal, respectively. In contrast to Experiment 1, cue acquisition in these cases requires representing statistical relationships among variables that do not share dependencies in the natural world. Transverse motion in natural environments does not lead to changes in disparity or brightness and, thus, people should not believe that there is a potential dependency between motion direction and disparity or brightness. We predicted that subjects would not acquire new cues to motion direction in these experiments, and the experimental results are consistent with these predictions. There are at least two alternative explanations of our experimental results, however, that should be considered. First, perhaps there is something idiosyncratic about judgements of transverse motion. If so, one would not expect the experimental results to generalize to other perceptual judgements. Second, Experiment 1, where cue acquisition was successful, used signals from different sensory modalities, whereas Experiments 2-3, where cue acquisition was not successful, used signals from a single modality. Perhaps this difference accounts for the differences in experimental outcomes. Experiments 4-5 were designed to evaluate these alternative explanations.

2.5 Experiment 4: Disparity cue to light source direction

Subjects in Experiment 4 were trained to perceive the direction of a light

source when the visual cue to light source direction—the pattern of shading


across the visual objects—was correlated with a visual disparity signal. The experiment examined whether subjects would learn that the disparity signal too is a cue to light source direction. Because the direction of a light source has no effect on the depth of a lit object in the natural world, we reasoned that subjects should not represent a potential dependency between light source direction and disparity. Thus, we predicted that subjects would be unable to learn that the disparity signal is also a cue to light source direction in our novel experimental environment.

2.5.1 Methods

Subjects

Subjects were eight students at the University of Rochester with normal or corrected-to-normal vision. All subjects were naive to the purposes of the study.

Stimuli

Figure 2.5 depicts the stimuli used in Experiment 4. The shading stimuli consisted of 23 bumps (hemispheres) lying on a common frontoparallel plane whose pattern of shading provided information about the light source direction. Each bump subtended approximately 26 minutes of visual angle, and the bumps were scattered uniformly within a circular aperture (with a diameter


of 6.28◦). The light source was rendered as an infinite point source located 45◦ away from the frontoparallel plane along the z-axis (in the direction of the observer). The angular location of the light source varied from −90◦ (light coming from the left) to 90◦ (light coming from the right), with the light source direction in the standard stimulus always set to vertical (0◦). In the shading-only training task, subjects viewed the stimuli monocularly with their dominant eyes. In all conditions, the bumps were rendered using only the red gun of the CRT. The stimuli with binocular disparities were identical to those in the shading-only training task, except that the bumps were rendered from left-eye and right-eye viewpoints with flat lighting, so that they appeared as discs of uniform luminance and, as with the static dots in Experiment 2, the discs were placed at simulated depths ranging from -23 cm to 23 cm relative to the observer (with all discs in a given display lying at a common depth). Stimuli with both shading and disparity signals were created by rendering the shaded bumps at simulated depths. In all tasks, each stimulus was presented for one second.
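To make concrete how the pattern of shading across a bump carries information about light source direction, the sketch below shades a single hemisphere with a simple Lambertian model lit from a distant source. It is only an illustration, not the rendering code used for the experimental stimuli; the resolution, light elevation, and intensity scaling are arbitrary choices here.

```python
# Illustrative sketch: Lambertian shading of a hemispherical "bump" lit by a
# distant source. Not the authors' code; parameters are arbitrary.
import numpy as np

def shade_bump(light_azimuth_deg, size=64, light_elevation_deg=45.0):
    """Return a size x size image of a hemisphere lit from the given direction."""
    az = np.deg2rad(light_azimuth_deg)   # 0 deg = from above, +90 right, -90 left
    el = np.deg2rad(light_elevation_deg)
    # Light direction: x to the right, y upward, z toward the observer.
    light = np.array([np.sin(az) * np.cos(el), np.cos(az) * np.cos(el), np.sin(el)])

    # Surface normals of a unit hemisphere on a regular pixel grid.
    xs = np.linspace(-1, 1, size)
    x, y = np.meshgrid(xs, xs)
    r2 = x**2 + y**2
    inside = r2 <= 1.0
    z = np.sqrt(np.clip(1.0 - r2, 0.0, None))
    normals = np.stack([x, y, z], axis=-1)

    image = np.zeros((size, size))
    image[inside] = np.clip(normals[inside] @ light, 0.0, None)  # Lambert's cosine law
    return image

left_lit = shade_bump(-90.0)   # light from the left
top_lit = shade_bump(0.0)      # light from "above" (the vertical standard)
print(left_lit.shape, float(left_lit.max()))
```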

Procedure

The procedure for Experiment 4 was analogous to those of Experiments 1-3. We used shading-only and disparity-only training tasks to characterize each subject’s performance on lighting direction and depth discrimination


Figure 2.5: A sample stimulus from Experiment 4. In this example, the bumps are illuminated from the left.


tasks, respectively, and then trained subjects during shading-disparity training trials by exposing them to an environment in which disparity was correlated with shading. Finally we tested subjects during shading-disparity test trials to evaluate whether they had learned that the disparity signal is also a cue to light source direction.

2.5.2 Results

The shading-disparity test trials contained two conditions—the “direction” of the disparity signal in the standard stimulus was offset from vertical by either an amount −δ or δ. If a subject had different PSEs in the two conditions, then we can conclude that the subject learned to use the disparity signal as a cue to light source direction. None of the eight subjects had significantly different PSEs in the two conditions (at the p < 0.05 level), suggesting that subjects did not learn to use the disparity signal when judging light source direction. The data for all subjects are illustrated in Figure 2.6. We fit psychometric functions (cumulative Normal distributions) to the combined data from all 8 subjects when the offset in the standard was δ (solid line) and when it was −δ (dotted line). The average value across subjects for the offset δ was equivalent to a 21.32◦ rotation in light source direction. The 0.59◦ shift in PSEs in the δ versus −δ cases was not statistically significant at the p < 0.05


level, indicating that subjects did not learn to use the disparity signal as a cue to light source direction.

2.6 Experiment 5: Auditory cue to light source direction

Subjects in Experiment 5 were trained to perceive the direction of a light

source when the visual cue to light source direction—the pattern of shading across the visual objects—was correlated with a dynamic auditory signal. The experiment examined whether subjects would learn that the auditory signal too is a cue to light source direction. Because the direction of a light source has no effect on the motion of an object in the natural world—and thus no effect on auditory signals—we reasoned that subjects should not represent a dependency between light source direction and the auditory signal. Thus, we predicted that subjects would be unable to learn that the auditory signal is also a cue to light source direction in our novel experimental environment.

2.6.1 Methods

Subjects

Subjects were eight students at the University of Rochester with normal or corrected-to-normal vision. All subjects were naive to the purposes of the study.


[Figure 2.6 appears here: p(comparison > standard) versus motion direction of the comparison, with fits and data points for the −δ and +δ conditions.]

Figure 2.6: Data from the shading-disparity test trials for all eight subjects combined.


Stimuli

Shading stimuli consisted of 23 bumps (hemispheres) lying on a common frontoparallel plane whose pattern of shading provided information about the light source direction. Each bump subtended approximately 26 minutes of visual angle, and the bumps were scattered uniformly within a circular aperture (with a diameter of 6.28◦). The light source was rendered as a diffuse panel source (i.e., as an array of local point sources) located 45◦ away from the frontoparallel plane along the z-axis (in the direction of the observer) and with its surface normal pointing toward the center of the bump array. The angular location of the light source varied from −90◦ (light coming from the left) to 90◦ (light coming from the right), with the light source direction in the standard stimulus always set to vertical (0◦). Because we were concerned that subjects might be unable to bind the dynamic auditory signal with static visual stimuli in the combined trials, we changed the visual stimuli by jittering the light source so that the temporal microstructure of the visual stimulus seemed consistent with the dynamic (white-noise) auditory signal. To jitter the stimulus, one of the point light sources in the panel array was selected at random and turned off in each frame. This resulted in both flicker and positional jitter. Auditory stimuli used in the auditory-only and shading-auditory trials were identical to those used in Experiment 1.


Procedure

The procedure for Experiment 5 was analogous to those of Experiments 1-4. We used shading-only and auditory-only training tasks to characterize each subject’s performance on lighting direction and auditory discrimination tasks, respectively, and then trained the subjects during shading-auditory training trials by exposing them to an environment in which an auditory signal was correlated with shading. Finally, we tested subjects during shading-auditory test trials to evaluate whether they had learned that the auditory signal is also a cue to light source direction.

2.6.2 Results

The shading-auditory test trials contained two conditions—the “direction” of the auditory signal in the standard stimulus was offset from vertical by either an amount −δ or δ. If a subject had different PSEs in the two conditions, then we can conclude that the subject learned to use the auditory signal as a cue to light source direction. None of the eight subjects had significantly different PSEs in the two conditions (at the p < 0.05 level), suggesting that subjects did not learn to use the auditory signal when judging light source direction. The data for all subjects are illustrated in Figure 2.7. We fit psychometric functions (cumulative Normal distributions) to the combined data from all 8 subjects when the offset in the standard was δ (solid line) and when it


[Figure 2.7 appears here: p(comparison > standard) versus motion direction of the comparison, with fits and data points for the −δ and +δ conditions.]

Figure 2.7: Data from the shading-auditory test trials for all eight subjects combined.


was −δ (dotted line). The average value across subjects for the offset δ was equivalent to a 7.97◦ rotation in light source direction. The 0.073◦ shift in PSEs in the δ versus −δ cases was not statistically significant at the p < 0.05 level, indicating that subjects did not learn to use the auditory signal as a cue to light source direction.

2.6.3 Discussion and Conclusions

Numerous studies have shown that organisms’ learning processes are often biased or constrained. The study described in this chapter demonstrated that, like other learning processes, perceptual learning is biased, and it proposed a new constraint on early perceptual learning to account for this bias: people can modify their knowledge of the prior probabilities of scene variables, or of the statistical relationships among scene and perceptual variables that are already considered to be potentially dependent, but they cannot learn new relationships among variables that are not considered to be potentially dependent. An important goal of this chapter is to formalize these ideas using the notation of Bayesian networks, and to illustrate how previous studies of early perceptual learning can be viewed as instances of parameter learning. Bayesian networks are a tool for representing probabilistic knowledge that has been developed in the artificial intelligence community (Neapolitan, 2004; Pearl, 1988). They have proven useful for modeling many aspects of ma-


chine and biological visual perception (e.g., Freeman, Pasztor, & Carmichael, 2000; Kersten, Mamassian, & Yuille, 2004; Kersten & Yuille, 2003; Schrater & Kersten, 2000). The basic idea underlying Bayesian networks is that a joint distribution over a set of random variables can be represented by a directed acyclic graph in which nodes correspond to variables, and edges between nodes correspond to direct statistical dependencies among variables. For example, an edge from node x1 to node x2 means that the distribution of variable x2 depends on the value of variable x1 (as a matter of terminology, node x1 is referred to as the parent of x2 ). A Bayesian network is a representation of the following factorization of a joint distribution:

$P(x_1, \ldots, x_n) = \prod_{i=1}^{n} P(x_i \mid \mathrm{pa}(x_i)) \qquad (2.2)$

where $P(x_1, \ldots, x_n)$ is the joint distribution of variables $x_1, \ldots, x_n$ and $P(x_i \mid \mathrm{pa}(x_i))$ is the conditional distribution of $x_i$ given the values of its parents [if $x_i$ has no parents, then $P(x_i \mid \mathrm{pa}(x_i)) = P(x_i)$]. As an introduction to Bayesian networks, consider the following scenario: an observer pulls into a parking lot, and as she begins to exit her car she hears a Rottweiler’s bark. Looking in the direction of the bark, she sees a distant Rottweiler. The observer’s car is parked close to a building’s entrance, and the observer must decide whether to wait for the dog to leave the vicinity or to try to make it to the entrance before encountering the dog. In making this decision the observer would like to know: (1) Is the dog dangerous? and (2)


How far away is the dog? To simplify things, assume that the observer has access to only three pieces of information: the loudness of the dog’s bark and the size of the dog’s image, which are both cues to the distance of the dog, and whether the dog is foaming at the mouth, which lets the observer know whether the dog is rabid and therefore dangerous (for simplicity, assume that only rabid dogs are dangerous). Figure 2.8 shows a Bayesian network representing this situation. The variables corresponding to scene properties are located toward the top of this figure, whereas the variables corresponding to percepts are located toward the bottom. Scene variables do not have parents, though they serve as parents to sensory variables as indicated by the arrows. A Bayesian network is a useful representation of the joint distribution of scene and sensory variables because of the way it represents potential dependencies. Although statistical dependency and causality are not equivalent relationships, Bayesian networks are often interpreted as instances of “generative models” whose edges point in the direction of causality. Consider the edges in Figure 2.8. A change in the physical distance from the dog to the observer causes a change in the perceived size of the dog’s image on the observer’s retina and in the perceived loudness of the dog’s bark at the observer’s ear. Likewise, rabies may lead to the observer perceiving the dog to foam at the mouth. These relationships, however, are not deterministic; the perceived size of the dog’s image and the perceived loudness of the dog’s bark can also vary due to additional factors which are difficult


to measure, such as physical or neural noise. The conditional distributions associated with sensory variables represent these uncertainties. Bayesian networks are most useful when they represent relationships among variables in ways that are both sparse and decomposable.5 The structure of the graph in Figure 2.8 has been greatly simplified using our knowledge about causality and, thus, the graph represents potential relationships among variables in a sparse way. We understand, for example, that knowing that the dog has rabies or is foaming at the mouth tells us nothing about the dog’s distance, or its retinal image size, or the loudness of its bark. Consequently, there are no edges linking the former and latter variables. It is precisely these sorts of simplifications, or assumed independencies, that make reasoning computationally feasible. For example, an observer reasoning about whether a dog has rabies only needs to consider whether the dog is foaming at the mouth,

5 A reader might wonder why a fully-connected Bayesian network (i.e., one in which all scene variables connect to all sensory variables) is not always used. An advantage of such a network is that it could represent any relationship between scene and sensory variables. Unfortunately, as mentioned in this chapter and detailed in the literature on machine learning, there is a price to pay for such representational richness—inference and learning in fully-connected networks is prohibitively expensive in terms of computation. In fact, inference and learning are computationally feasible only in networks with sparse connectivity. Similarly, a reader might wonder why a network is not always used that initially contains no connections, but in which connections are added over time as needed. As before, this type of “structure learning” is prohibitively expensive from a computational viewpoint.
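The sparse factorization of Equation 2.2, and the conditional-independence simplification it licenses (Equations 2.3 and 2.4 below), can be illustrated with a few lines of code for the dog network. All of the distributions and numerical values in this sketch are invented for illustration; it is not a model of the actual scenario.

```python
# Sketch: the sparse factorization of the dog network and the resulting posterior
# over distance given two conditionally independent cues. All distributions here
# are invented for illustration.
import numpy as np
from scipy.stats import norm

distances = np.linspace(1.0, 50.0, 200)           # candidate distances (m)
prior = np.ones_like(distances) / distances.size  # flat prior P(distance), assumed

def p_image_size_given_distance(image_size, distance, dog_size=0.8, noise=0.1):
    # Retinal size shrinks with distance; log-normal observation noise (assumed).
    return norm.pdf(np.log(image_size), loc=np.log(dog_size / distance), scale=noise)

def p_loudness_given_distance(loudness, distance, amplitude=100.0, noise=0.15):
    # Loudness falls off with distance; log-normal observation noise (assumed).
    return norm.pdf(np.log(loudness), loc=np.log(amplitude / distance**2), scale=noise)

# Observed cues (hypothetical values).
image_size, loudness = 0.05, 0.35

# Conditional independence: the likelihood factors into a product of cue terms.
likelihood = (p_image_size_given_distance(image_size, distances)
              * p_loudness_given_distance(loudness, distances))
posterior = likelihood * prior
posterior /= posterior.sum()                      # Bayes' rule

print(f"most probable distance = {distances[np.argmax(posterior)]:.1f} m")
```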


[Figure 2.8 appears here: a directed graph in which the scene variables (size of dog, distance to dog, amplitude of bark, rabies?) are parents of the perceptual variables (image size, perceived loudness, foaming at mouth), annotated with the conditional distributions P(image size | distance to dog, size of dog), P(sound | distance, amplitude), and P(foam | rabies).]

Figure 2.8: A simple Bayesian network representing a distant barking dog who may or may not have rabies. The variables corresponding to scene properties are located toward the top of this figure, whereas the variables corresponding to percepts are located toward the bottom. Scene variables do not have parents, though they serve as parents to sensory variables as indicated by the arrows.


and can ignore the values of all other variables. Bayesian networks also represent relationships in ways that are decomposable. An observer wishing to estimate the distance to a dog on the basis of the dog’s retinal image size and the loudness of the dog’s bark can do so using Bayes’ rule:

$P(\text{distance to dog} \mid \text{image size}, \text{loudness of bark}) \propto P(\text{image size}, \text{loudness of bark} \mid \text{distance to dog})\, P(\text{distance to dog}). \qquad (2.3)$

This calculation can be simplified by noting that, according to the Bayesian network in Figure 2.8, the size of the dog’s image and the loudness of its bark are conditionally independent given the distance to the dog. Consequently, the joint distribution on the right-hand side of Equation 2.3 can be factored as follows:

$P(\text{image size}, \text{loudness of bark} \mid \text{distance to dog}) = P(\text{image size} \mid \text{distance to dog})\, P(\text{loudness of bark} \mid \text{distance to dog}). \qquad (2.4)$

The computational advantages of statistical relationships that are sparse and decomposable are difficult to appreciate in a simple scenario with seven random variables, but have enormous importance in real-world situations with hundreds of variables. Indeed, whether reasoning requires the consideration of only a small subset of variables versus the need to take into account all variables, or whether reasoning requires the calculation of high-dimensional joint distributions versus low-dimensional distributions, are typically the most


important factors making a problem solvable versus unsolvable in practice (Bishop, 1995; Neapolitan, 2004). Using the notation of Bayesian networks, we can re-state our hypothesis about constraints on early perceptual learning. Recall our hypothesis that people’s early perceptual processes can modify their knowledge of the prior probabilities of scene properties, or of the statistical relationships among scene and sensory variables that are already considered to be potentially dependent. However, they cannot learn new relationships among scene and sensory variables that are not considered to be potentially dependent. In terms of Bayesian networks, our hypothesis states that early perceptual processes can modify their prior probability distributions for scene variables, or their conditional probability distributions specifying how sensory variables depend on scene variables. However, they cannot add new nodes or new edges between scene and sensory variables in their graphical representation. In the machine learning literature, researchers make a distinction between “parameter learning”, meaning learning the prior and conditional probability distributions, and “structure learning”, meaning learning the nodes and edges of the graphical structure. Using the terminology of this literature, our hypothesis states that early perceptual processes are capable of parameter learning, but they are not capable of structure learning.6 Interestingly, parameter learning is often thought to be computationally feasible—the machine learning literature contains several maximum likelihood and Bayesian algorithms that often work well in practice. In contrast, structure learning is widely regarded as intractable—there are currently no general-purpose algorithms for structure learning that work well on moderate or large-sized problems (Rish, 2000).7

6 If people’s early perceptual knowledge can be characterized by Bayesian networks but the structures of these networks are not learned, then this raises the question of where these structures come from. We speculate that people’s network structures are innate, resulting from our evolutionary history in an environment with stationary physical laws. If this “strong” view is not strictly correct, we would not be uncomfortable with a “weaker” view in which the structures are fixed in adults, but that structural learning takes place during infancy or early childhood.

7 As a technical detail, it is worth noting that both parameter and structure learning

in Bayesian networks are often regarded as NP-hard problems (Cooper, 1990; Jordan and Weiss, 2002). To illustrate why parameter learning is regarded as NP-hard, keep in mind that inference (determining the conditional distribution of some variables given the values of other variables) is often a sub-problem of parameter learning (e.g., the E-step of the Expectation-Maximization algorithm requires inference). In the machine learning community, inference is typically performed using the junction-tree algorithm. The computational complexity of this algorithm is a function of the size of the cliques upon which message-passing operations are performed. Unfortunately, summing a clique potential is exponential in the number of nodes in the clique. This fact motivates the need to use Bayesian networks with sparse connectivity. Because such networks tend to minimize clique sizes, inference (and, thus, parameter learning) in these networks is often feasible. To illustrate why structure learning is regarded as NP-hard, keep in mind that structure learning is typically posed as a model selection problem within a hierarchical Bayesian framework. The top level of the hierarchy includes binary variables $\{M_i, i = 1, \ldots, K\}$, where $M_i$ indicates whether model $i$ is the “correct” model, and the number of such models $K$ is super-exponential in the number of nodes in the network. The middle level includes real-valued variables $\{\theta_i, i = 1, \ldots, N\}$, where $\theta_i$ is the set of parameters for model $i$. The bottom level is the data, denoted $D$. The likelihood for model $i$, $P(D \mid M_i)$, is computed as follows: $P(D \mid M_i) = \int P(D \mid \theta_i)\, P(\theta_i \mid M_i)\, d\theta_i$. Note that $P(D \mid \theta_i)$ is the likelihood for the parameters $\theta_i$ (used during parameter learning). Also note that this integral is typically not analytically solvable.
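A minimal sketch of what parameter learning (as opposed to structure learning) amounts to for a fixed scene-to-cues network is given below. The graph is fixed in advance; only the conditional noise parameters are estimated from data, and the learned variances imply reliability weights. The data, noise levels, and Gaussian assumptions are all invented for illustration.

```python
# Sketch of parameter learning in a fixed-structure network: the graph
# scene -> (cue 1, cue 2) is given, and learning only updates the conditional
# distributions P(cue_i | scene) -- here their means and variances -- from data.
# No nodes or edges are added (no structure learning). Data are synthetic.
import numpy as np

rng = np.random.default_rng(1)
n_trials = 2000
scene = rng.uniform(-10, 10, n_trials)           # e.g., motion direction (deg)
cue1 = scene + rng.normal(0.0, 2.0, n_trials)    # reliable cue (true sd = 2)
cue2 = scene + rng.normal(0.0, 6.0, n_trials)    # unreliable cue (true sd = 6)

# Maximum-likelihood estimates of the conditional noise, given the scene values.
sigma1 = np.std(cue1 - scene)
sigma2 = np.std(cue2 - scene)

# The learned variances imply reliability weights for combining the two cues.
w1 = (1 / sigma1**2) / (1 / sigma1**2 + 1 / sigma2**2)
print(f"estimated sigmas: {sigma1:.2f}, {sigma2:.2f}; weight on cue 1 = {w1:.2f}")
```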


ing fails to generalize to novel versions of the task using different stimulus positions or configurations. Figure 2.9 shows a Bayesian network depicting the dependence assumptions that might underlie performance on a task requiring observers to make fine discriminations along a simple perceptual dimension. In this figure, the physical scene gives rise to a set of conditionally independent perceptual features that the observer uses to make decisions regarding the value of some scene property. An account of how an observer improves at estimating the scene property on the basis of the perceptual features is as follows. The process of learning consists of improving the estimates of the relationships between the scene property and the values of each of the perceptual features. In terms of Bayesian networks, learning consists of improving the estimates of the conditional distributions associated with the perceptual variables; i.e. the distributions of the values of the features given a value (or an estimated value) of the scene property: P (feature i | scene = x) for i = 1, . . . , N. Of particular interest is the variances of these distributions because these variances suggest how reliable or informative each feature is in indicating the value of the scene property. Features whose values have a large variance for a fixed value of the scene property are relatively unreliable or uninformative. In contrast, features whose values have a small variance for a fixed scene value are more reliable. More accurate estimates of the conditional distributions, and thus their variances, allow an observer to more accurately weight features according to their relative reliabilities, effectively placing greater weight on more reliable features


and lesser weight on less reliable features. Most importantly for our purposes, observers’ improved performances can be accounted for solely on the basis of parameter learning; structure learning is not required. Ball and Sekuler (1987) reported an experiment in which observers improved at discriminating motion directions in random-dot kinematograms centered at a training direction, but did not show improved performance when tested with motion directions centered at a novel orthogonal direction. An account of this finding based on Bayesian networks is as follows. Consider a network in which the parent node corresponds to a scene variable representing the direction of motion in the scene, and the N child nodes correspond to sensory feature detectors sensitive to image motion in N different directions. When the observer views a display of a kinematogram containing a motion direction near the training direction, the observer first estimates this direction based on the values of her feature detectors. She then adapts her estimates of the conditional distributions of the values of the feature detectors given the estimated value of the motion direction. Over the course of many trials, the observer learns which feature detectors have a small variance for a given value of the motion direction, meaning that it learns which feature detectors are most reliable for judging motion direction for directions near the training direction.8 8

Because the observer does not modify her distributions that are

Readers familiar with the machine learning literature will recognize that we are con-

jecturing that the observer’s learning rule resembles an Expectation-Maximization (EM) algorithm, an algorithm for maximizing likelihood functions. At first blush, it may seem


[Figure 2.9 appears here: a scene-property node with arrows to feature nodes 1 through n.]

Figure 2.9: A Bayesian network depicting the dependence assumptions underlying perceptual judgements in tasks requiring observers to make a fine discrimination along a simple perceptual dimension.


conditioned on other motion directions, the observer does not show improved performance when tested with kinematograms whose motion directions are centered at a novel orthogonal direction. A demonstration that observers learn to weight features according to their relative reliabilities in a manner consistent with the account of learning described above was provided by Gold, Sekuler, & Bennett (2004). These researchers examined observers’ perceptual representations through the construction of “classification images”. Briefly, classification images are created by correlating image feature values (e.g., pixel luminances) with the values of a scene property. Image features that vary reliably with the scene property take on extreme values in the classification image, whereas unreliable features take on values near zero. Classification images are often used in the context of perceptual classification. An ideal classification image for a set of stimuli is constructed by correlating the feature values of each stimulus with its correct classification; classification images for individual observers can be created by correlating the feature values of each stimulus with the classification indicated by the observer. For difficult tasks, the ideal classification images tend to be relatively sparse, with few reliable features for discriminating between that the observer is faced with a “chicken-and-egg” problem: the observer first uses her feature detectors to estimate the motion direction, and then uses the estimated motion direction to determine the most reliable features. The EM algorithm is often used to solve such chicken-and-egg problems. The reader is referred to Dempster, Laird, and Rubin (1977) for an explanation as to why this works.


stimulus classes. Our account of learning described above predicts that naive observers should initially use a large set of features when discriminating stimuli, and then gradually reduce the influence of many features as they discover which features are most reliable for the task. Gold et al.’s (2004) results suggest that, during the course of learning, observers’ classification images move toward the ideal classification image in exactly this manner, with observers incrementally basing their decisions on a smaller, more reliable subset of the available features. A second class of phenomena studied in the perceptual learning literature is the acquisition of new cue combination rules. Several studies have found that observers modify how they combine information from two visual cues when one of the cues is made less reliable (Atkins, Fiser, and Jacobs, 2001; Ernst, Banks, and Bülthoff, 2000; Jacobs and Fine, 1999). Ernst, Banks, and Bülthoff (2000), for example, placed observers in an environment in which the slant of a surface indicated by a haptic cue was correlated with the slant indicated by one visual cue but uncorrelated with the slant indicated by another visual cue (this slant value varied randomly over trials). Observers adapted their visual cue combination rules so as to place more weight on the information derived from the visual cue that was consistent with haptics, and less weight on the information derived from the other visual cue. This class of learning phenomena can be regarded as conceptually equivalent to the first class of phenomena described above in which observers modify how they com-


bine information from multiple feature detectors. Consequently, our account of learning for this second class is very similar to our account for the first class. The top node of the Bayesian network in Figure 2.10 represents a scene variable, such as the slant of a surface, whereas the two child nodes represent corresponding sensory variables based on two perceptual cues, such as slantfrom-visual-stereo and slant-from-visual-texture. Imagine that both cues are normally good indicators of the scene property, but the observer is placed in a novel environment in which cue 1 is reliable but cue 2 is not [e.g., the variance of P (cue 1 | scene = x) is small whereas the variance of P (cue 2 | scene = x) is large]. An account of how an observer improves at estimating the scene property on the basis of the two perceptual cues is as follows. On each trial of an experiment, the observer first estimates the value of the scene property based on the values of all sensory variables. She then improves the estimates of the relationships between the scene property and the values of each of the cues; i.e. the observer modifies her estimates of P (cue 1 | scene = x) and P (cue 2 | scene = x) where x is the estimated value of the scene property. More accurate estimates of these distributions, particularly their variances, allow an observer to more accurately weight cues according to their relative reliabilities, effectively placing greater weight on more reliable cues and lesser weight on less reliable cues. As above, observers’ improved performances can be accounted for solely on the basis of parameter learning, and does not require structure learning.


[Figure 2.10 appears here: a scene-property node with arrows to Cue 1 and Cue 2, each annotated with its conditional distribution P(C_i | S = x).]

Figure 2.10: A Bayesian network representing the type of modification that might underlie the acquisition of new cue combination rules. Cue 1 represents a cue whose reliability is fixed, while Cue 2 represents a cue that has become less reliable. The solid black curves represent the final conditional cue distributions given a value (or estimated value) of the scene property. The dashed grey curve represents the conditional distribution for Cue 2 before learning.


A third class of perceptual learning phenomena is often referred to as “cue recalibration” (e.g., Atkins, Jacobs, and Knill, 2003; Bedford, 1993; Epstein, 1975; Harris, 1965; Mather and Lackner, 1981; Welch, 1986). For example, an observer may wear prisms that shift the visual world 10◦ to the right. As a result, objects visually appear at locations 10◦ to the right of the locations indicated by other sensory signals. Over time, observers notice this discrepancy, and recalibrate their interpretations of the visual cue so that the visual location is more consistent with the locations indicated by other sensory cues. Using our Bayesian network framework, we hypothesize that the observer first estimates the value of the scene property (top node of Figure 2.11) based on the values of all sensory cues (bottom nodes of Figure 2.11). She then modifies her estimates of the conditional distributions associated with the sensory variables: P (cue 1 | scene = x) and P (cue 2 | scene = x) where x is the estimated value of the scene property. Unlike the case of learning new cue combination rules, the modification is not primarily to the estimate of the variance of the distribution associated with a newly unreliable cue. Rather, it is to the estimate of the mean of the distribution associated with a newly uncalibrated cue (due, perhaps, to the shift in the visual world caused by prisms). As before, observers’ improved performances can be accounted for solely on the basis of parameter learning.


[Figure 2.11 appears here: a scene-property node with arrows to Cue 1 and Cue 2, each annotated with its conditional distribution P(C_i | S = x).]

Figure 2.11: A Bayesian network representing the type of modification that might underlie perceptual recalibration. Cue 1 represents an accurate and low-variance cue, whereas Cue 2 represents a cue whose estimates, while low-variance, are no longer accurate. The solid curves represent the final conditional cue distributions for a particular value of the scene variable. The dashed grey curve represents the conditional distribution for Cue 2 before learning.


The last class of perceptual learning phenomena that we consider here is the acquisition of visual priors. For example, consider observers viewing displays of circular patches that are lighter toward their top and darker toward their bottom. These displays are consistent with a bump that is lit from above or a dimple that is lit from below. Observers tend to assume that the light source is above a scene and, thus, prefer to interpret the object as a bump. Observers in an experiment by Adams, Graf, and Ernst (2004) viewed objects whose shapes were visually ambiguous and also touched these objects, thereby obtaining haptic information disambiguating the objects’ shapes. The shape information obtained from haptics was consistent with an interpretation of the visual display in which the estimated light source location was offset from its expected location based on observers’ prior probability distributions of the light source’s location. Adams et al. found that observers modified their prior distributions so as to reduce the discrepancy between estimated and expected light source locations. The Bayesian network in Figure 2.12 has two scene variables, corresponding to the object’s shape and the light source’s location, and a sensory variable, corresponding to the perceived visual shape of the object. Our account of learning in this setting is as follows. Based on an unambiguous haptic percept (not shown in Figure 2.12) and the ambiguous visual percept, observers estimate the object’s shape. Based on this shape and the perceived visual shape, observers then estimate the light source’s location. Learning occurs due to the discrepancy between the estimated location of the light source and


the expected location based on observers’ prior probability distribution of this location. To reduce this discrepancy, observers modify their prior distribution appropriately. Thus, as in the other classes of learning phenomena reviewed above, the acquisition of prior probability distributions can be accounted for through parameter learning, and does not require structure learning. We have reviewed four classes of early perceptual learning phenomena, and outlined how they can be accounted for solely through parameter learning. We hypothesize that all early perceptual learning is parameter learning; that is, all early perceptual learning involves the modification of knowledge of the prior probabilities of scene properties or of the statistical relationships among scene and sensory variables that are already considered to be potentially dependent. Conversely, we hypothesize that early perceptual learning processes are biased or constrained such that they are incapable of structure learning (the addition of new nodes or new edges between scene and sensory variables), meaning that these processes cannot learn new relationships among scene and sensory variables that are not considered to be potentially dependent. In the experimental part of this chapter, we reported the results of five experiments evaluating whether subjects can demonstrate cue acquisition. Figures 2.13 and 2.14 illustrate the relationships between scene and sensory variables in Experiments 1-3 and Experiments 4 and 5, respectively, in terms of Bayesian networks. Here, the solid black edges represent dependencies that exist in the natural world, while the dashed gray edges represent dependencies that do not exist in the natural world but that we introduced in our


[Figure 2.12 appears here: scene nodes for light source direction and shape, both parents of a visual-shape node annotated with the conditional distribution P(visual shape | light source direction, shape).]

Figure 2.12: A Bayesian network characterizing subjects’ modifications of their prior distribution of the light source location in the experiment reported by Adams et al. (2004).


novel experimental environments. For the reasons outlined in the Experimental section, we expected that observers started our experiments with the belief that variables connected by a black edge are potentially dependent, whereas variables connected by a grey dashed edge are not. In Experiment 1, subjects were placed in a novel environment that resembled natural environments in the sense that it contained systematic relationships among scene and perceptual variables which are normally dependent. In this case, cue acquisition requires parameter learning and, as predicted, subjects succeeded in learning a new cue. In Experiments 2-5, subjects were placed in novel environments that did not resemble natural environments— they contained systematic relationships among scene and perceptual variables that are not normally dependent. Cue acquisition requires structure learning in these cases. Consistent with our hypothesis, subjects failed to learn new cues in Experiments 2-5. Taken as a whole, our hypothesis provides a good account of the pattern of experimental results reported here. That is, it explains why people learn in some situations and fail to learn in other situations. In addition to providing an account of experimental data, our hypothesis also has the property of being motivated by computational considerations. As discussed above, machine learning researchers have found that parameter learning is a comparatively easy problem, whereas structure learning is typically intractable. Thus, there are good computational reasons why early perceptual learning processes might be constrained in the ways hypothesized here.


[Figure 2.13 appears here: a transverse-motion node with edges to the visual motion, auditory, binocular disparity, and brightness signal nodes.]

Figure 2.13: A Bayesian network representing the statistical relationships studied in Experiments 1-3. The solid black edges represent dependencies that exist in the natural world, whereas dashed grey edges represent dependencies that do not exist in the natural world but that we introduced in our novel experimental environments. We expect that observers started our experiments with the belief that variables connected by a black edge are potentially dependent, whereas variables connected by a grey edge are not.


[Figure 2.14 appears here: a light-source node with edges to the shading pattern, auditory signal, and binocular disparity signal nodes.]

Figure 2.14: A Bayesian network representing the statistical relationships studied in Experiments 4 and 5. The solid black lines represent pre-existing edges—conditional dependencies that exist in the natural world—while the dashed grey lines represent conditional dependencies that do not exist in the natural world but that we introduced in our novel experimental environments.


Our theory is limited to early perceptual learning, and is not intended to be applied to late perceptual or cognitive learning. This point can be demonstrated in at least two ways. First, it seems reasonable to believe that learning to visually recognize an object involves structure learning. Gauthier and Tarr (1997), for example, trained subjects to visually recognize objects referred to as “greebles”. A plausible account of what happens when a person learns to visually recognize a novel object as the greeble named “pimo” is that the person adds a new node (along with new edges) to their Bayesian network representation corresponding to this newly familiar object. If this speculation is correct, then it raises the question of why structure learning is computationally feasible for late perceptual learning but intractable for early perceptual learning. It may be that structure learning of higher-level knowledge becomes feasible when a pre-existing structure representing lower-level knowledge is already in place. Second, there have been several demonstrations of “contextually-dependent” perceptual learning that, we conjecture, may be accounted for via late perceptual learning processes performing structure learning. Atkins, Fiser, and Jacobs (2001), for example, trained subjects to combine depth information from visual motion and texture cues in one way when the texture elements of an object were red, and to combine information from these cues in a different way when the elements were blue. In other words, the discrete color of the elements signaled which of two contexts subjects were currently in, and these two contexts required subjects to use different cue combination rules to improve


their performance on an experimental task. Because there is no systematic relationship between color and cue combination rule in natural environments, people should not believe that color and cue combination rule are potentially dependent variables, meaning that the type of learning demonstrated here would require structure learning. Related results in the domain of cue acquisition have recently been reported by Haijiang et al. (2006). We speculate that this type of contextually-dependent perceptual learning is due to higher-level learning processes than the processes that we have focused on in this chapter.9

9 Interestingly, contextually-dependent learning is often regarded as different from most other forms of learning. For example, researchers in the animal learning theory community distinguish standard forms of learning, which they refer to as associative learning, from contextually-dependent learning, which they refer to as occasion setting (Schmajuk and Holland, 1998).

We have described a hypothesis about constraints on early perceptual learning. Admittedly, the hypothesis is speculative. Although the data favoring the hypothesis are currently sparse, its advantages include the fact that it accounts for an important subset of data about perceptual learning that would otherwise be confusing, it uses a Bayesian network formulation that is well specified (and, thus, falsifiable) and mathematically rigorous, and it leads to several theoretically interesting and important research questions. The hypothesis is meant to deal with perceptual learning in general, though the experiments in this chapter have focused on the predictions of our hypothesis for perceptual cue acquisition. This seems to us a natural place to begin, since the hypothesis's predictions with respect to cue acquisition are straightforward. Future work will focus on delineating and testing the hypothesis's predictions on other perceptual learning tasks. A primary challenge of such work will lie in developing efficient experimental methods for detecting changes (or the lack thereof) in observers' representations of variable dependencies.


Chapter 3

Learning Optimal Integration of Arbitrary Features in a Perceptual Discrimination Task

3.1

Introduction

Vision researchers have long realized that adult observers can be trained to improve their performance in simple perceptual tasks. Improvements with practice in visual acuity, hue perception, and velocity discrimination, for example, have been documented for over a century (Gibson, 1953). Such perceptual improvements, when they occur as a result of training, are called perceptual learning. Despite its long history of research, the mechanisms of perceptual learning remain poorly understood. Instances of perceptual learning typically exhibit a number of characteristics, including specificity for stimulus parameters (e.g., spatial position and orientation of the stimulus), the simplicity of the


tasks learned, and the implicit nature of the learning, that researchers have taken as evidence that perceptual learning occurs at relatively early stages of the perceptual system (Fahle & Poggio, 2002; Gilbert, 1994). Thus, many researchers interested in perceptual learning have focused on isolating changes in the neural response properties of early sensory areas following perceptual learning. While this approach has yielded results useful to understanding the neural changes underlying certain types of non-visual perceptual learning, such as vibrotactile (Recanzone, Merzenich, & Jenkins, 1992) and auditory (Recanzone, Schreiner, & Merzenich, 1993) frequency discrimination, results in visual learning tasks have been much more sparse (see Das, 1997, and Gilbert, 1994, for reviews) and difficult to interpret. Furthermore, it is important to recognize that while this approach addresses questions regarding what neural changes are associated with learning, it does not answer the more central question: what is learned in perceptual learning? One approach that has proven to be fruitful in illuminating the computational mechanisms underlying perceptual discriminations is the ideal observer framework (Geisler, 2003; Knill and Richards, 1996). This approach characterizes a given perceptual task by specifying an ideal observer, a theoretical decision-making agent described in probabilistic terms, that performs the task optimally given the available information. To determine how human observers use information in the perceptual task, researchers compare their performance with that of the ideal observer across manipulations of the task that systematically change the information available in the stimulus. This approach has been


particularly successful at characterizing the ways in which observers integrate information across different perceptual modalities (e.g., Battaglia, Jacobs, & Aslin, 2003; Ernst & Banks, 2002; Gepshtein, Burge, Ernst, & Banks, 2005), different visual modules (e.g., Jacobs, 1999; Knill, 2003; Knill & Saunders, 2003), or both (e.g., Atkins, Fiser, & Jacobs, 2001; Hillis, Ernst, Banks, & Landy, 2002) to make perceptual judgements when multiple cues are available. Briefly, in making quotidian perceptual judgements, observers usually have access to a number of perceptual cues. An observer attempting to determine the curvature of a surface, for example, may have access to cues based on visual texture, binocular disparity, and shading, as well as to haptic cues obtained by manually exploring the surface. To make an optimal judgment based on these cues, the observer must combine the curvature estimates from these different cues. Yuille and Bülthoff (1996) demonstrated that, given certain mathematical assumptions, the optimal strategy for combining estimates θ̂_1, ..., θ̂_n from a set of class-conditionally independent cues (i.e., cues c_1, ..., c_n that are conditionally independent given the scene parameter of interest, so that P(c_1, ..., c_n | θ) = ∏_i P(c_i | θ)) consists of taking a weighted average of the individual cue estimates, θ̂* = Σ_i ω_i θ̂_i (where θ̂* represents the optimal estimate based on all available cues), such that the weight for each cue is inversely proportional to the variance of the distribution of the scene parameter given the cue's value (i.e., ω_i ∝ 1/σ_i²). Researchers have found that, across a variety of perceptual tasks, human observers seem to base their perceptual judgements on just such a strategy.
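A minimal numerical sketch of this weighted-average rule is given below; the function and the example values are illustrative only and are not taken from any of the studies cited above.

```python
import numpy as np

def combine_cues(estimates, variances):
    """Minimum-variance combination of conditionally independent cue estimates:
    each cue is weighted in inverse proportion to its variance (w_i proportional to 1/sigma_i^2)."""
    estimates = np.asarray(estimates, dtype=float)
    variances = np.asarray(variances, dtype=float)
    weights = (1.0 / variances) / np.sum(1.0 / variances)
    combined = np.sum(weights * estimates)          # theta* = sum_i w_i * theta_i
    combined_var = 1.0 / np.sum(1.0 / variances)    # variance of the combined estimate
    return combined, weights, combined_var

# Hypothetical example: a low-variance disparity estimate of surface slant (32 deg)
# and a noisier texture estimate (40 deg)
theta_star, w, v = combine_cues([32.0, 40.0], [4.0, 16.0])
# w = [0.8, 0.2], theta_star = 33.6, v = 3.2
```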


While most of these cue integration studies have focused on strategies used by observers in stationary environments, several (Atkins et al., 2001; Ernst, Banks, & Bülthoff, 2000; Jacobs & Fine, 1999) have investigated how observers change their cue-integration strategies after receiving training in virtual environments in which a perceptual cue to a scene variable is artificially manipulated to be less informative with respect to that variable. In one of these studies, Ernst et al. (2000) manipulated either the texture- or disparity-specified slant of a visually presented surface to indicate a slant value that was uncorrelated with the haptically-defined orientation of the surface. The authors found that after receiving training in this environment, subjects' perceptions of slant changed such that, in a qualitatively similar fashion to the ideal observer, they gave less weight to the slant estimate of the now less reliable visual cue. This ideal observer framework has thus been useful in characterizing the mechanisms involved in learning to make certain types of perceptual discriminations. However, not all perceptual learning tasks fit neatly into the cue-combination framework described above. Many studies of perceptual learning, for example, have focused on improvements in simple tasks involving vernier acuity, texture discrimination, line bisection, orientation discrimination, and other image-based (i.e., rather than 3D scene-parameter-based) discriminations. To characterize the learning obtained in such tasks using the ideal observer cue-combination framework described above, we must first deal with several conceptual and methodological issues. The first of these issues concerns the seemingly disparate nature of 3D cue-combination tasks on the one


hand, and simple image-based discrimination tasks on the other. Consider, for example, the slant discrimination task described in the previous paragraph. In this case, the slant of the surface is defined visually by two conventional and well-understood cues to surface slant, texture foreshortening and binocular disparity. In a texture discrimination task, however, subjects are not trying to determine the value of some surface parameter such as slant. Instead, they must determine to which of two arbitrarily defined categories a presented texture belongs. What are the cues in this task? Of course, these textures will differ along some set of image features, and the subject can identify and use these features as "cues" to the texture category. But do such features function as cues in the same sense as texture foreshortening and binocular disparity? The current study was designed to address this question. We were interested in determining whether the optimal integration of cues described in cue-combination studies such as that of Ernst et al. (2000) is a special property of the limited set of conventionally defined visual cues (e.g., texture compression and disparity gradient cues for slant) or whether people are likewise sensitive to, and capable of exploiting, the relative reliabilities of arbitrarily defined cues such as the low-level features involved in image-based discriminations. To answer this question, we introduce an efficient modification of the classification image technique that allows us to analyze, over relatively fine timescales, the changes to the weights an observer gives to different features. We then report the results of two experiments that exploit this technique to examine how


observers use information about the reliabilities of low-level image features in performing simple perceptual discrimination tasks. Using our modified classification image technique, we investigate whether observers use information in a manner consistent with optimal feature combination (i.e., in a manner analogous to optimal cue combination). In both experiments, subjects viewed and classified stimuli consisting of noise-corrupted images. The stimuli used in each experiment were generated within a 20-dimensional feature space whose noise covariance structure was varied across conditions. In Experiment 1, subjects were trained to discriminate between two stimuli corrupted with white Gaussian feature noise, and their classification images were calculated over time. When we examined their classification images, we found that, with practice, they approached that of the ideal observer. In addition, this improvement in their classification images correlated highly with their increase in performance efficiency, accounting for most of the variance in their performance. In Experiment 2, the variance of the corrupting noise was made anisotropic, such that some features were noisier, and thus less reliable in determining the stimulus class, than others. In the first half of the experiment, half of the features were made reliable and the other half unreliable. In the second half of the experiment, this relationship was reversed so that the features which had heretofore been reliable were now unreliable and vice versa. When we examined the classification images calculated for each subject over time, we found that they modified their decision strategies in a manner consistent with optimal feature combination, giving higher weights to reliable features and lower weights to unreliable features. The results of Experiment 1 suggest


that subjects’ learning in these texture discrimination tasks consists primarily of improvements in the optimality of their discriminant functions, while the results of Experiment 2 suggest that in learning these discriminant functions, subjects are able to exploit information about the reliabilities of individual features.

3.2

Estimating classification images: a modified approach

Ahumada (1967, 2002) suggested a method for determining the template, or classification image, used by human observers performing a binary perceptual discrimination task. To discover this template T for an individual observer, the researcher adds random pixel noise ε^{(t)} ~ N(0, I) to the signal s^{(t)} ∈ {s_0, s_1} presented on each trial t. The researcher can then calculate the observer's classification image by simply correlating the noise added on each trial with the classification r^{(t)} ∈ {−1, 1} indicated by the observer. These classification images reveal the stimulus components used by observers in making perceptual discriminations. Over the past decade, this classification image technique has proven quite useful; researchers have used this technique (or variants thereof) to determine the templates used by observers in a variety of different tasks (e.g., Abbey & Eckstein, 2002; Ahumada, 1996; Levi & Klein, 2002; Lu & Liu, 2006), to compare these observer classification images


to those calculated for an ideal observer (optimal templates), and to investigate how these classification images change with learning (e.g., Beard & Ahumada, 1999; Gold, Sekuler, & Bennett, 2004). Despite these successes, the method does suffer from some shortcomings. Chief among these is the enormous dimensionality of the stimulus space. Calculating the classification image for a stimulus represented within a 128 × 128 pixel space, for example, requires calculating 16,385 parameters (i.e., 128² regression coefficients plus a bias term). Consequently, researchers require thousands of trials to obtain a reasonable classification image for a single observer, and the correlation of the resulting images with the optimal templates is generally quite low due to the poor sampling of the stimulus space and the concomitant paucity of data points (Gold et al., 2004). Several researchers have attempted to remedy this problem and to boost the significance of such comparisons by restricting the final analysis to select portions of the classification image (e.g., Gold et al., 2004), by averaging across regions of the image (e.g., Abbey and Eckstein, 2002; Abbey, Eckstein, & Bochud, 1999), or by using a combination of these methods (e.g., Chauvin, Worsley, Schyns, Arguin, & Gosselin, 2005). Such measures work by effectively reducing the dimensionality of the stimulus space so that instead of calculating regression coefficients for each pixel, researchers calculate a much smaller number of coefficients for various linear combinations of pixels. Essentially, these researchers add the signal-corrupting noise in pixel space but perform their analyses in terms of a lower-dimensional basis space.


In the current study, we simplify this process by specifying this lower-dimensional basis space explicitly and a priori.1 In addition to its simplicity, this approach has several advantages over traditional methods. First, by specifying the bases in advance, we can limit the added noise ε to the subspace spanned by these bases, ensuring that: 1) the noise is white and densely sampled in this subspace, and 2) only features within the spanned subspace

1 Several researchers (e.g., Olman & Kersten, 2004; Li, Levi, & Klein, 2004) have previously introduced lower-dimensional methods for calculating classification images (or classification objects). Note however that the approaches used in these papers differ from the approach used in the current paper in that they obtain this reduction in dimensionality by assuming that observers have direct access to geometric scene configurations rather than to the photometric input (e.g., pixel intensities) that subjects actually observe. In Li et al. (2004), the authors implicitly assume that observers have direct access to an array whose entries represent the positions of the elements making up a vernier stimulus and that they make decisions based on this vector of positions rather than on the pattern of luminances within the image. Similarly, Olman & Kersten (2004) assume that observers have direct access to variables describing the geometry of the scene (e.g., foot spread, tail length, tail angle, neck length). In these two studies, the stimuli are defined directly in terms of scene variables (though subjects in fact observe these variables through images) and the resulting classification images are linear in the geometrical object space, but not in image space. These approaches may be more useful than image-based approaches for investigating how observers make discriminations in tasks involving representations of three-dimensional scenes (as in Olman & Kersten, 2004) when researchers have an adequate understanding of the internal representations used by observers.


contribute to the observer’s decisions (i.e., because all stimulus variance is contained within this subspace). Second, because we specify the bases in advance, we can select these bases in an intelligent way, representing only those features that observers are likely to find useful in making discriminations, such as those features that contain information relevant to the task (i.e., features that vary across the stimulus classes).2 Finally, this approach makes it possible to manipulate the variance of the noise added to different features and thus to vary the reliabilities of these features. This allows us to investigate how observers combine information from different features using methods similar to those that have been used in studying perceptual cue-combination. Mathematically, our approach to classification images is related to Ahumada’s (2002) approach as follows: let g (t) represent the stimulus presented on trial t. Ahumada’s technique generates these stimuli as

g^{(t)} = s^{(t)} + \epsilon^{(t)},    (3.1)

2 Simoncelli, Paninski, Pillow, & Schwartz (2004) provide an extended discussion regarding the importance of stimulus selection in the white noise characterization of a signal processing system. Though they are concerned in particular with characterizing the response properties of neurons, their points apply equally well to the challenges involved in characterizing the responses of human observers in a binary discrimination task. Olman & Kersten (2004) provide a related discussion that proposes extending noise characterization techniques to deal with more abstract (i.e., non-photometric) stimulus representations.


where s^{(t)} and ε^{(t)} are defined as above. If we explicitly represent the use of pixels as bases using the matrix P, whose columns consist of the n-dimensional set of standard bases, we can rewrite Equation 3.1 in a more general form as

g^{(t)} = P(s^{(t)} + \epsilon^{(t)}).    (3.2)

This is possible because P is equivalent to the identity matrix I_n. At this point, however, it should be clear that by applying the appropriate linear transformation T : P → B to the stimuli s^{(t)} we can exchange P for an arbitrary basis set B to generate stimulus images in the space spanned by B. This is represented by our generative model

g^{(t)} = k + B(\mu^{(t)} + \eta^{(t)}),    (3.3)

where µ^{(t)} ∈ {µ_A, µ_B} represents a prototype stimulus s expressed in terms of the basis set and η^{(t)} ~ N(0, I) represents Gaussian noise added in the basis space. (Note that when B = P, Equation 3.3 is equivalent to Equation 3.2, with µ^{(t)} = s^{(t)}, k = 0, and η^{(t)} and ε^{(t)} distributed identically.) The only new term is the constant vector k, which is important here because it provides additional flexibility in choosing the bases that make up B.3

3 The constant k is used to represent any constant component of the image. In fact, because luminance values cannot be negative, traditional approaches to classification images implicitly include a k in the form of a mean luminance image (e.g., a vector of identical positive pixel luminance values).

In particular, this constant term allows us to represent constant (noiseless) features in pixel space that do not exist in the space spanned by B. Figures 3.1 and 3.2 illustrate this generative model for a pair of example stimuli. Here the task requires classifying a presented stimulus as an instance of stimulus A (square) or stimulus B (circle). All of the information relevant to this discrimination lies in the difference image (the rightmost image in Figure 3.1). The image shown to the left of this difference image (third from left) represents the part of the stimulus that remains constant across stimulus classes. Representing this part of the stimulus as k allows us to focus on selecting bases B that can adequately represent the difference image. Figure 3.2 shows example stimuli g generated for this task using the models described in Equation 3.2 (top of Figure 3.2) and Equation 3.3 (bottom of Figure 3.2).

The method developed by Ahumada for calculating classification images is, despite its successful use by many researchers, somewhat inaccurate and can potentially be quite inefficient. Ahumada's method is based on reverse correlation, a technique for determining the linear response characteristics of a signal processing system. In reverse correlation, a researcher feeds Gaussian white noise into a system, records the system's output, and then characterizes the system's linear response by correlating the input and output signals. Unfortunately, however, psychophysical experiments that use the classification image technique rarely present pure noise to observers in practice because this


Figure 3.1: An illustrative stimulus set consisting of "fuzzy" square and circle prototypes. From left to right: the square (k + Bµ_A); the circle (k + Bµ_B); the constant image (k), which represents the parts of the image that are invariant across stimuli; and the square-circle difference image (B[µ_A − µ_B]).


Figure 3.2: Illustrations of the methods described in Equation 3.2 (top) and Equation 3.3 (bottom) for generating noise-corrupted versions of the “fuzzy square” prototype (stimulus A) introduced in Figure 3.1.


tends to result in unreliable performance (but see Neri & Heeger, 2002, for a counterexample). Instead, they typically corrupt one of two signals (where one of the signals may be the null signal) with noise and have the observer determine which of the two signals was presented. As a result, the observer is actually exposed to signals from two distributions with different means rather than just one. Ahumada’s method for dealing with this problem is to subtract the means from these two distributions (Ahumada, 2002) and thereafter treat them as a common distribution. At best, ignoring the signal and considering only the noise makes for an inefficient estimate of the observer’s decision template, since it ignores available information. At its worst, ignoring the signal can lead to some rather strange results (consider, for example, that subjects who perform at 50% correct and at 100% correct are indistinguishable using this method). Since one of our goals in this study was to develop a more efficient means of estimating classification images, we calculated the maximum likelihood estimate for these images using the full stimuli (signal + noise) under a Bernoulli response likelihood model. Here, we show that the classification image for the ideal observer (the optimal template) can be expressed as the result of a logistic regression. We assume that the ideal observer knows the prior distributions P (Ci ) and likelihood functions P (x|Ci ) for both stimulus classes Ci , i ∈ {A, B}. Using Bayes’ rule, the probability that an image x belongs to class A is


P(C_A \mid x) = \frac{P(x \mid C_A)\,P(C_A)}{P(x)} = \frac{P(x \mid C_A)\,P(C_A)}{P(x \mid C_A)\,P(C_A) + P(x \mid C_B)\,P(C_B)}.    (3.4)

With some simple algebra, we can convert this expression into a logistic function of x,

P(C_A \mid x) = \frac{1}{1 + e^{-f(x)}},    (3.5)

where

f(x) = \log\left[\frac{P(x \mid C_A)\,P(C_A)}{P(x \mid C_B)\,P(C_B)}\right].    (3.6)

To express the classification image as the result of a logistic regression, however, we must also demonstrate that f(x) in Equation 3.6 is linear in x. The stimuli presented on each trial are drawn from a multivariate Gaussian representing one of the two signal categories. Therefore, we can express the likelihood terms in Equation 3.6 as:

P(x \mid C_i) = (2\pi)^{-m/2}\,|\Sigma|^{-1/2}\,e^{-\frac{1}{2}(x-\mu_i)^T \Sigma^{-1} (x-\mu_i)},    (3.7)

where µ_i, i ∈ {A, B}, is the mean (prototype) for class i, Σ is the common covariance matrix for both classes, and m is the dimensionality of the stimulus space. Plugging these likelihoods into Equation 3.6 yields

f(x) = \frac{1}{2}\left[(x - \mu_B)^T \Sigma^{-1} (x - \mu_B) - (x - \mu_A)^T \Sigma^{-1} (x - \mu_A)\right] + \log\left[\frac{P(C_A)}{P(C_B)}\right].    (3.8)

Finally, by expanding the quadratic terms and simplifying, we demonstrate that f(x) is indeed linear in x:

f(x) = w^T x + b,    (3.9)

with

w = \Sigma^{-1}(\mu_A - \mu_B),    (3.10)

b = \frac{1}{2}(\mu_B + \mu_A)^T \Sigma^{-1} (\mu_B - \mu_A) + \log\left[\frac{P(C_A)}{P(C_B)}\right].    (3.11)

Equation 3.10 shows that in the case of white Gaussian noise (i.e., when Σ = σ²I) the optimal template is proportional to the difference between the signal category prototypes. Note also the similarity of Equation 3.10 to the result w_i ∝ 1/σ_i² from optimal cue combination. We exploit this relationship in the design of Experiment 2.
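The closed-form template above is easy to compute directly. The sketch below (illustrative Python; the variable names and the toy covariance are ours, not taken from the experiments) builds the ideal-observer weights and bias from Equations 3.10 and 3.11 and evaluates the posterior of Equation 3.5.

```python
import numpy as np

def ideal_template(mu_a, mu_b, sigma, prior_a=0.5):
    """Ideal-observer weights and bias for the two-class Gaussian model:
    w = Sigma^{-1} (mu_A - mu_B)                                          (Eq. 3.10)
    b = 0.5 (mu_B + mu_A)^T Sigma^{-1} (mu_B - mu_A) + log[P(C_A)/P(C_B)] (Eq. 3.11)"""
    sigma_inv = np.linalg.inv(sigma)
    w = sigma_inv @ (mu_a - mu_b)
    b = 0.5 * (mu_b + mu_a) @ sigma_inv @ (mu_b - mu_a) + np.log(prior_a / (1.0 - prior_a))
    return w, b

def p_class_a(x, w, b):
    """Posterior probability of class A from Equation 3.5."""
    return 1.0 / (1.0 + np.exp(-(w @ x + b)))

# Toy example: 20 features, mu_B = -mu_A, and anisotropic (diagonal) feature noise
rng = np.random.default_rng(0)
mu_a = np.where(rng.random(20) > 0.5, 1.0, -1.0) * 0.2236
mu_b = -mu_a
sigma = np.diag(np.r_[np.ones(10), 25.0 * np.ones(10)])   # ten features with sigma = 5
w, b = ideal_template(mu_a, mu_b, sigma)
# With equal priors and mu_B = -mu_A, b = 0 and w_i = 2 mu_Ai / sigma_i^2
```

When the noise is white (Σ = σ²I), w reduces to a scaled copy of the difference image µ_A − µ_B, which is the sense in which the optimal template matches the signal difference.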

3.3

Experiment 1

In Experiment 1, we calculated response classification images for observers learning to perform an image-based perceptual discrimination task.


We expected that our subjects’ performances would improve over time and, based on the results of Gold et al. (2004), that improvements in a subject’s discrimination performance would be accompanied by an increased fit between the observer’s classification image and the ideal template. In addition, we expected that constructing our stimuli from a small set of bases would allow us to calculate robust classification images using a significantly smaller number of trials than are required by the traditional approach of using image pixels as bases.

3.3.1

Methods

Subjects

Subjects were four students at the University of Rochester with normal or corrected-to-normal vision. All subjects were naive to the purposes of the study.

Stimuli

The stimuli were 256 × 256 pixel (8° × 8°) greyscale images presented on a grey background whose luminance of 16.5 cd/m² matched the mean luminance of the images. All of the stimuli were constructed as linear combinations of the set of basis "features" illustrated in Figure 3.3.


[Figure 3.3 panels: the 20 basis images b_1 to b_20, each labeled with its mixing coefficient µ_Ai = ±0.2236.]

Figure 3.3: The 20 basis features used to construct the stimuli in Experiments 1 and 2. Each of these images constitutes a column of the matrix B in Equation 3.3. Mixing coefficients µ_Ai for the vector µ_A representing Prototype A (see Figure 3.4) are indicated above each of the bases (µ_Bi = −µ_Ai). White Gaussian noise (in the subspace spanned by B) is generated by independently sampling the noise coefficients η_i from a common Gaussian distribution.


The set of 20 basis features was constructed in the following manner. We created 50 images (32 × 32 pixels) of white Gaussian noise, which were bandpass filtered to contain frequencies in the range of 1-3 cycles per image. The resulting images were then iteratively adjusted using gradient descent to yield a set of orthogonal, zero-mean images that maximized smoothness (i.e., minimized the sum of the Laplacian) across each image. The images were added to the basis set one by one so that each basis provided an additional orthogonality constraint on the subsequent bases. In other words, at iteration i, image i was modified via gradient descent to be maximally smooth and to be orthogonal to images 1 through (i-1). These orthogonality constraints interacted with the smoothness constraint to produce images that were localized in spatial frequency content, such that the first bases produced by our method contained low frequencies and subsequently added bases contained increasingly higher frequencies. We randomly selected twenty of the 50 images to form the basis set that we used to construct our stimuli. Finally, we wanted to make sure that the bases were equally salient. The human visual system is known to exhibit varying sensitivity to different stimuli depending on their spatial frequency content. This differential sensitivity across spatial frequencies is often characterized through the contrast sensitivity function (CSF), which describes the amount of contrast required at different spatial frequencies to obtain a fixed sensitivity level. Thus, as a final step, we normalized the twenty basis features for saliency by setting the standard deviation of the luminance distribution in each basis image to 1 and then multiplying each image by the reciprocal of the contrast sensitivity function value at its peak spatial frequency.4
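The sketch below gives a simplified version of this construction. It is not the procedure used here: it replaces the smoothness-maximizing gradient descent with plain Gram-Schmidt orthogonalization of bandpass-filtered noise images, and the filter, image size, and frequency band are only assumptions for illustration.

```python
import numpy as np

def bandpass_noise(size=32, low=1, high=3, rng=None):
    """One white-noise image filtered to keep roughly 1-3 cycles/image."""
    rng = np.random.default_rng() if rng is None else rng
    noise = rng.normal(size=(size, size))
    fx = np.fft.fftfreq(size) * size            # frequency in cycles per image
    radius = np.hypot(*np.meshgrid(fx, fx))
    mask = (radius >= low) & (radius <= high)
    img = np.real(np.fft.ifft2(np.fft.fft2(noise) * mask))
    return img - img.mean()                     # zero-mean image

def orthogonal_bases(n_bases=20, size=32, rng=None):
    """Stack of mutually orthogonal, unit-norm basis images (Gram-Schmidt)."""
    rng = np.random.default_rng() if rng is None else rng
    bases = []
    while len(bases) < n_bases:
        v = bandpass_noise(size=size, rng=rng).ravel()
        for b in bases:                         # remove components along earlier bases
            v -= (v @ b) * b
        norm = np.linalg.norm(v)
        if norm > 1e-8:
            bases.append(v / norm)
    return np.stack(bases, axis=1)              # columns are basis images, as in B

B = orthogonal_bases()                          # a (1024, 20) matrix of basis vectors
```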


A set of two prototypes was constructed from this basis set as follows. First, a twenty-dimensional vector was formed by randomly setting each of its elements to either 1.0 or -1.0. The result was an image centered within one of the orthants of the space spanned by the basis features. This vector represented prototype A. The vector representing the second prototype, prototype B, was simply the negative of the vector representing prototype A (µ_B = −µ_A). To obtain images of the prototypes, these vectors were multiplied by the matrix representing the 20 basis features and a constant image was added, consisting of the mean luminance plus an arbitrary image constructed in the null space of the basis set (the addition of this arbitrary image prevented the prototypes from appearing simply as contrast-reversed versions of the same image). Finally, the prototypes were upsampled to yield 256 × 256 pixel images. We created only one set of prototypes and all subjects saw the same set. Test stimuli were created according to the generative model described in Equation 3.3. On each trial, one of the two prototypes (A or B) was

4 Contrast sensitivity functions were not measured directly for each subject. Instead, for the sake of expediency, we used the model of human contrast sensitivity proposed by Mannos and Sakrison (1974), which describes the sensitivity of a human observer, generically, as A(f) = 2.6\,(0.0192 + 0.114 f)\,e^{-(0.114 f)^{1.1}}.
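A small sketch of this normalization step follows; the Mannos-Sakrison formula is the one quoted in the footnote, while the way each basis image's peak frequency is identified (from its amplitude spectrum) and the unit conversion are our assumptions for illustration.

```python
import numpy as np

def csf(f):
    """Mannos & Sakrison (1974) contrast sensitivity model, f in cycles/degree."""
    return 2.6 * (0.0192 + 0.114 * f) * np.exp(-(0.114 * f) ** 1.1)

def peak_frequency(img, cpd_per_image_cycle=1.0):
    """Radial frequency at which the image's amplitude spectrum is largest,
    converted to cycles/degree with an illustrative scale factor."""
    size = img.shape[0]
    amp = np.abs(np.fft.fft2(img))
    fx = np.fft.fftfreq(size) * size
    radius = np.hypot(*np.meshgrid(fx, fx))
    amp[radius == 0] = 0.0                       # ignore the DC component
    return radius.flat[np.argmax(amp)] * cpd_per_image_cycle

def normalize_salience(basis_images):
    """Set each basis image's luminance s.d. to 1, then scale by 1 / CSF(peak freq)."""
    out = []
    for img in basis_images:
        img = img / img.std()
        out.append(img / csf(peak_frequency(img)))
    return out
```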


Figure 3.4: The prototypes used in Experiments 1 and 2 presented in the same format as the example stimuli in Figure 3.1. From left to right: prototype A (k + Bµ_A), prototype B (k + Bµ_B), the constant image (k), and the difference image (B[µ_A − µ_B] = 2Bµ_A).


selected at random and combined with a noise mask η^{(t)}. The noise masks, like the prototypes, were generated as a linear combination of the basis features. However, for the noise masks, the linear coefficients were sampled from a multivariate Gaussian distribution η ~ N(0, σ²I). Values that deviated more than 2σ from the mean were resampled. The RMS contrast of the signal and the noise mask were held constant at 5.0% and 7.5%, respectively.
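A minimal sketch of this trial-generation step, following the generative model of Equation 3.3, is shown below. The random stand-in bases, the constant image, and the dimensions are illustrative assumptions, and the sketch omits the RMS-contrast scaling described above.

```python
import numpy as np

def make_stimulus(B, k, mu, sigma=1.0, rng=None):
    """One test stimulus per Equation 3.3: g = k + B (mu + eta),
    eta ~ N(0, sigma^2 I), with coefficients beyond +/- 2 sigma resampled."""
    rng = np.random.default_rng() if rng is None else rng
    eta = rng.normal(0.0, sigma, size=mu.shape)
    while np.any(np.abs(eta) > 2.0 * sigma):
        bad = np.abs(eta) > 2.0 * sigma
        eta[bad] = rng.normal(0.0, sigma, size=int(bad.sum()))
    return k + B @ (mu + eta)

# Illustrative dimensions: 20 basis images of 32 x 32 = 1024 pixels each
rng = np.random.default_rng(1)
B = rng.normal(size=(1024, 20))                      # stand-in for the smooth orthogonal bases
k = np.full(1024, 0.5)                               # constant (mean-luminance) image
mu_a = np.where(rng.random(20) > 0.5, 1.0, -1.0)     # prototype A coefficients (+/- 1)
prototype = (mu_a, -mu_a)[rng.integers(2)]           # pick prototype A or B for this trial
g = make_stimulus(B, k, prototype, sigma=1.0, rng=rng)
```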

Procedure

Each trial began with the presentation of a fixation square, which appeared for 300ms. This was followed by a test stimulus, which was also presented for 300ms. Both the fixation square and the test stimulus were centered on the screen. 150ms after the test stimulus had disappeared, the two prototypes were faded in, laterally displaced 8◦ (256 pixels) from the center of the screen. Subjects were instructed to decide which of the two prototypes had appeared in the test stimulus, and responded by pressing the key corresponding to the selected prototype. Subjects received immediate auditory feedback after every trial indicating the correctness of their response. In addition, after every 15 trials, a printed message appeared on the screen indicating their (percent correct) performance on the previous 15 trials. Each subject performed 12 sessions of 300 trials each over 3 days and the subject’s response, the signal identity, and the noise mask were saved on each trial to allow calculation of the subject’s classification image.
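Given the saved trial data, the observer's template can be estimated by maximum likelihood under the Bernoulli response model described in Section 3.2. The sketch below is one straightforward way to do this (plain gradient ascent on the logistic log-likelihood); the actual optimizer and any regularization used are not specified in the text, so treat those details as assumptions.

```python
import numpy as np

def estimate_classification_image(X, r, lr=0.05, n_iter=5000):
    """Fit P(respond 'A' | x) = logistic(w^T x + b) by gradient ascent.
    X: (n_trials, n_features) full stimuli (signal + noise) in the basis space.
    r: (n_trials,) array with 1 where the observer responded 'A', else 0."""
    X = np.asarray(X, dtype=float)
    r = np.asarray(r, dtype=float)
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        w += lr * (X.T @ (r - p)) / n            # gradient of the log-likelihood
        b += lr * np.mean(r - p)
    return w, b

def template_efficiency(w_obs, w_ideal):
    """Squared normalized cross-correlation between observer and ideal templates."""
    c = (w_obs @ w_ideal) / (np.linalg.norm(w_obs) * np.linalg.norm(w_ideal))
    return c ** 2
```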

Subject    r(df)               p
WHS        r(10) = 0.7657      p < 0.005
RAW        r(10) = 0.8518      p < 0.001
BVR        r(10) = 0.8126      p < 0.005
SKL        r(10) = 0.3745      p > 0.05

Table 3.1: Correlation between sensitivity and trial number for individual subjects.

3.3.2

Results

Figures 3.5 and 3.6, and Table 3.1 summarize the results of this experiment. We wanted to determine: (1) Can subjects learn to discriminate texture stimuli generated in our basis space? (2) How well do improvements in discrimination performance correlate with the optimality of an observer’s classification image? (3) How efficient is our method? That is, how many trials are required to estimate a subject’s classification image? To determine whether our observers learned in this task, we correlated their sensitivity d′ in each session with the total number of trials completed at the end of that session. The results of this correlation across the 12 sessions are shown in Table 3.1. Three of the four subjects showed significant improvement between the


first and second halves of training, indicating that subjects could indeed learn to discriminate stimuli in our basis space. We calculated classification images for each session using logistic regression (see Equations 3.4-3.11). Figure 3.5 shows classification images obtained over the first and last quarter of trials for each of the three subjects who showed learning. There are clear changes to the images as a result of learning. To quantify these changes, we calculated the normalized cross-correlation w_obs^T w_ideal / (||w_obs|| ||w_ideal||) between the subject's classification image w_obs and that of the ideal observer w_ideal across time. Normalized cross-correlation is often used to represent the degree of 'fit' between two templates (e.g., Gold et al., 2004; Murray, 2002). The 'fit' in this case is indicative of the optimality of the template used by a particular subject and we thus refer to the square of the normalized cross-correlation as the subject's template efficiency (Figure 3.6, dashed curve). We also calculated subjects' discrimination efficiencies [d'_obs / d'_ideal]² (Geisler, 2003) for each session to compare the performances of subjects to that of the ideal observer. Finally, we correlated each subject's discrimination and template efficiencies across sessions to measure how improvements in discrimination performance correlate with improvements in the optimality of the subject's classification image. The resulting correlation coefficients and significance statistics appear at the top of the plots in Figure 3.6. The correlations are quite strong, indicating that increases in subjects' discrimination efficiencies are well explained by the observed improvement in their templates. This finding corroborates a qualitatively similar finding by Gold et al. (2004). Overall, the results of Experiment 1 demonstrate that our


Figure 3.5: Classification images for each of the three subjects who showed learning in Experiment 1. The first column (w_obs1) displays the subjects' classification images calculated over the first three sessions; the second column (w_obs2) displays the classification images calculated over their final three sessions; and the third column (w_ideal) displays the optimal template.


[Figure 3.6 panels (subjects BVR, WHS, RAW, and SKL): discrimination efficiency (solid) and template efficiency (dashed) plotted against trial number; the panel correlations shown are r(10) = 0.8830 (p = 0.0001), r(10) = 0.8535 (p = 0.0004), r(10) = 0.6911 (p = 0.0128), and r(10) = 0.8201 (p = 0.0011).]

Figure 3.6: Individual results for all 4 subjects who participated in Experiment 1. The horizontal axis of each plot indicates the trial number, while the vertical axis represents both the subject’s discrimination efficiency (solid curve) and template efficiency (dashed curve). The correlation coefficient for the fit between these two measures and the p-value representing the significance of this correlation is indicated at the top of each subject’s plot.


method for obtaining and calculating classification images represents a successful improvement over existing methods for studying perceptual learning. Our use of arbitrary basis features did not preclude learning. Limiting the number of features, however, allowed us to calculate subjects' classification images over short time scales (< 300 trials) and thus to track changes in subjects' templates throughout the course of learning. Additionally, the results suggest that most of the variance in subjects' discrimination performances (i.e., 66% to 81%)5 can be accounted for by improvements in their classification images, so that changes in subjects' discrimination strategies over time can largely be characterized by calculating their classification images. Together, these characteristics indicate that our method is suitable for determining how observers change their discrimination strategies as a perceptual task is modified.

5 These estimates of explained variance are obtained using the correlation between the normalized cross-correlations w_obs^T w_ideal / (||w_obs|| ||w_ideal||) and the sensitivity ratio d'_obs / d'_ideal. Unlike in Figure 3.6, these values were not squared. Squaring the sensitivity measure is necessary for an information-theoretic interpretation of efficiency, but removes information about some of the correlation between observers' template fits and sensitivities (e.g., classification images that point in the wrong direction yield sensitivities below zero). The r² values resulting from this correlation are: 0.80 (BVR), 0.76 (RAW), 0.66 (WHS), and 0.81 (SKL).


3.4

Experiment 2

Experiment 2 was designed to determine whether observers modify their templates in a manner consistent with optimal feature combination (i.e., in a manner analogous to optimal cue combination). We investigated this question by manipulating the reliabilities of different features with respect to discrimination judgments like those made by subjects in Experiment 1. Changes made to the relative reliabilities of different features result in corresponding changes to the optimal decision template. By calculating the classification images used by subjects across such manipulations, we can determine whether observers are sensitive to the reliabilities of individual features and modify their templates accordingly. The idea, illustrated in Figures 3.7B and 3.7C, is to change the optimal template across two phases of the experiment by modifying only the variance structure of the noise. If observers use information about feature variance in performing discrimination tasks, then we should observe a change in their classification images between the first and the second phases of the experiment. After the transition, observers' templates should move away from that predicted by the optimal template for the first set of reliable versus unreliable features, and toward that predicted by the optimal template for the second set. We expected that subjects would take feature reliabilities into account when making discriminations, resulting in classification images that give greater weight to reliable features and lower weight to unreliable features.


3.4.1

Methods

Subjects

Subjects were four students at the University of Rochester with normal or corrected-to-normal vision. All subjects were naive to the purposes of the study.

Stimuli and Procedure

The task for observers in this experiment was identical to the task described in Experiment 1. Observers classified a briefly presented stimulus as an instance of either stimulus A or stimulus B. Prototypes A and B were also identical to those used in Experiment 1, and the stimuli for each trial were constructed according to the generative model described in Equation 3.3, except that the noise covariance matrix Σ was not the identity matrix. Observers performed 24 sessions of 300 trials each over 6 days. The procedure for Experiment 2 differed from that of Experiment 1 in that Experiment 2 consisted of two phases, each comprising 12 sessions. Before training, 10 of the 20 basis features were selected at random to be 'unreliable' features, so that each subject had a unique set of reliable and unreliable features. We controlled the reliability of an individual feature b_i by manipulating its variance σ_i² in the noise covariance matrix Σ. Equation 3.10 establishes the relationship between the noise covariance and the optimal template.


Exploiting the facts that Σ is a diagonal matrix and that µ_B = −µ_A, we can express the individual elements of w as

w_i = \frac{2\mu_{Ai}}{\sigma_i^2},    (3.12)

where σ_i² represents the ith diagonal element of Σ. Note that this is similar to the result obtained for optimal weighting of independent cues in the literature on cue combination (e.g., Landy et al., 1995; Yuille & Bülthoff, 1996). The difference here is that instead of simply weighting each feature in proportion to its reliability (i.e., inverse variance), there is an added dependency on the class means, such that observers must weight each feature in proportion to its mean-difference-weighted reliability. In the current study, we removed this dependency by choosing the elements in µ_A such that their magnitudes are all equal (i.e., |µ_Ai| = |µ_Aj| for all i, j ≤ m) so that the weights composing the optimal template are indeed inversely proportional to the variances of their associated features.6 Figure 3.7 illustrates this dependency for a simple stimulus space consisting of two feature dimensions x1 and x2. In the first half of training (sessions 1-12), the variance of the noise added to the unreliable features was greater than the variance of the noise added to the reliable features (i.e., σ_unreliable = 5 while σ_reliable = 1). In the second half

6 In general, if the stimuli are not chosen arbitrarily, |µ_Ai| ≠ |µ_Aj|. Note, however, that since µ_B = −µ_A, such a centering can be easily accomplished by appropriately scaling the stimulus space.


[Figure 3.7 panels A-C, each plotted in the two feature dimensions x1 and x2.]

Figure 3.7: A schematic illustration of the effect of variance structure on the optimal template (red arrows) for a two-dimensional stimulus space. Dashed lines represent contours of equal likelihood (p(x1, x2 | C_i) = k) for category A (red) and category B (green). The solid red lines and arrows represent the optimal decision surface and its normal vector (i.e., the template for category A), respectively. (Left) two prototypes embedded in isotropic noise (Σ = I_2). (Center) the variance along dimension x2 is greater than that in x1. (Right) the variance along x1 is greater than that in x2.


of the experiment, the roles of these two sets of features were swapped such that the reliable features were made unreliable and the unreliable features were made reliable. Importantly, the set of reliable and unreliable features was chosen randomly for each subject, so that the pair of covariance matrices for the first (Σ_1) and second (Σ_2) halves of the experiment was unique to each subject.
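The sketch below illustrates this manipulation: it builds the two phase-specific covariance structures, derives the corresponding optimal templates via Equation 3.12, and computes the two template fits (w_fit1, w_fit2) used in the Results. The specific variances (1 and 25) follow the σ = 1 and σ = 5 values given above; everything else (names, the example observer template) is illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)

mu_a = np.where(rng.random(20) > 0.5, 1.0, -1.0) * 0.2236   # prototype A coefficients
unreliable = rng.choice(20, size=10, replace=False)          # this subject's unreliable set

var1 = np.ones(20)
var1[unreliable] = 25.0                    # phase 1: sigma = 5 on the unreliable features
var2 = np.where(var1 == 25.0, 1.0, 25.0)   # phase 2: the roles are swapped

def optimal_template(mu_a, variances):
    """Equation 3.12: w_i = 2 mu_Ai / sigma_i^2 (diagonal Sigma, mu_B = -mu_A)."""
    return 2.0 * mu_a / variances

def template_fit(w_obs, w_ideal):
    """Normalized cross-correlation between an observer template and an ideal one."""
    return (w_obs @ w_ideal) / (np.linalg.norm(w_obs) * np.linalg.norm(w_ideal))

w_ideal1 = optimal_template(mu_a, var1)
w_ideal2 = optimal_template(mu_a, var2)

w_obs = w_ideal1 + 0.3 * rng.normal(size=20)   # a hypothetical phase-1 observer template
w_fit1, w_fit2 = template_fit(w_obs, w_ideal1), template_fit(w_obs, w_ideal2)
# For an observer tracking the phase-1 statistics, w_fit1 > w_fit2; after the swap the
# prediction is that the observer's template should drift toward w_ideal2 instead.
```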

3.4.2

Results

We wanted to determine whether subjects adjusted their discrimination strategies in a manner consistent with optimal feature combination when the variance of individual features was modified. As in Experiment 1, we calculated classification images for each of our subjects and quantified the fit between these images and the templates used by an ideal observer using normalized cross-correlation. In contrast to Experiment 1, however, each subject made discriminations under two different generative models, using covariance matrices Σ_1 and Σ_2, respectively. Thus we defined two optimal templates for each subject: one for the generative model used in sessions 1-12 (w_ideal1, appropriate for Σ_1) and one for the generative model used in sessions 13-24 (w_ideal2, appropriate for Σ_2). Figure 3.8 plots the normalized cross-correlation between the calculated classification image w_obs and the templates w_ideal1 (solid lines) and w_ideal2 (dashed lines) for each of the four subjects as a function of the number of trials. Figure 3.9 displays the visible change in the classification

Subject    t(df)                 p
DLG        t(5) = −5.7661        p < 0.005
JDG        t(5) = −3.3911        p < 0.05
MSB        t(5) = −13.3369       p < 0.0001
MKW        t(5) = −27.4861       p < 0.00001

Table 3.2: Significance statistics for results displayed in Figure 3.10.

images used by subjects between the first and second halves of the experiment.

These plots demonstrate that subjects modified their decision templates in accordance with our predictions, employing templates that fit more closely with w_ideal1 when the noise covariance structure was defined by Σ_1, and modifying their templates to more closely match w_ideal2 during the second half of the experiment, when the covariance structure was defined by Σ_2. To quantify these results, we compared the average difference between the template fits w_fit2 − w_fit1 (where w_fiti represents the normalized cross-correlation between template w_ideali and a subject's classification image) across the first and second halves of the experiment using a t-test. These differences are plotted in Figure 3.10 and the corresponding significance statistics are displayed in Table 3.2. In summary, using the methods introduced in Experiment 1 for obtaining and calculating classification images, Experiment 2 examined whether human observers exploit information about the reliabilities of individual features when performing an image-based perceptual discrimination task. We manipulated the reliabilities of our features by changing the covariance structure


[Figure 3.8 panels (subjects DLG, JDG, MSB, and MKW): normalized dot-product between each subject's classification image and the two optimal templates (w_fit1, w_fit2) plotted against trial number (0-7200).]

Figure 3.8: Normalized cross-correlation for each of the four subjects in Experiment 2. The plots depict the fits between each subject's classification image (w_obs) and the optimal templates for the covariance structure of the noise used in the first (solid lines) and second (dashed lines) halves of the experiment. The change in covariance structure occurred at trial 3601.


[Figure 3.9 rows: subjects DLG, JDG, MSB, and MKW.]

Figure 3.9: Classification images for each of the four subjects in Experiment 2. The first column displays the optimal template w_ideal1 calculated for the feature covariance Σ_1 used in the first half of the experiment; the second column (w_obs1) displays the subjects' classification images calculated over the first 12 sessions; the third column (w_obs2) displays the classification images calculated over their final 12 sessions; and the final column displays the optimal template w_ideal2 calculated for the feature covariance Σ_2 used in the second half of the experiment.


[Figure 3.10 bars: subjects DLG, JDG, MSB, and MKW; vertical axis w_fit2 − w_fit1.]

Figure 3.10: The differences between the template fits (wfit2 − wfit1 ) plotted in Figure 3.8 averaged over the first (open bars) and second (closed bars) half of trials in Experiment 2.


over time. Our results show that subjects change their classification images to track changes in the optimal template, suggesting that they indeed use information about the reliabilities of individual features, giving greater weight to more reliable features in a manner analogous to optimal cue combination.

3.5

Discussion

Researchers have repeatedly demonstrated that, with practice, observers can learn to significantly improve their performance in many perceptual discrimination tasks. The nature of this learning, however, is not well understood. The two experiments described in this paper contribute to our understanding of perceptual learning by studying how observers improve their use of stimulus information as a result of practice with a discrimination task. First, we introduced a modification of the classification image technique that, through its improved efficiency, allows us to track the changes to observers' templates as the result either of learning or of experimental manipulations. We investigated whether observers use information in a manner consistent with optimal feature combination (i.e., in a manner analogous to optimal cue combination). In both experiments, subjects viewed and classified stimuli consisting of noise-corrupted images. The stimuli used in each experiment were generated within a 20-dimensional feature space whose noise covariance structure varied across conditions. In Experiment 1, subjects were trained to discriminate between two stimuli corrupted with white Gaussian feature noise


and their classification images were calculated over time. Examination of their classification images reveals that, with practice, their decision templates approached that of the ideal observer. Moreover, this improvement in their classification images correlated highly with their increase in performance efficiency, accounting for between 66% and 81% of the variance in their performance. Consistent with the findings of Gold et al. (2004), these results suggest that the learning demonstrated in these perceptual discrimination tasks consists primarily of observers improving their discriminant functions to more closely match the optimal discriminant function. But what does it mean to say that improvements in perceptual discrimination tasks result primarily from the learning of optimal discriminant functions? Discriminant functions encode information about several distinct aspects of the stimuli to be discriminated. The first of these is the prior probability over stimulus categories. If one type of signal is more likely than another, then an optimal observer judging the category membership of an ambiguous stimulus should assign a higher probability to the more likely category. The second of these is the mean signal in each category: the category prototypes. The importance of this aspect of the stimuli for the discriminant function is obvious. Deciding which of two noise-masked signals was presented is quite difficult if the observer cannot identify the signals in the absence of a noise mask. In white noise, the optimal discriminant surface between two signal categories is perpendicular to the vector describing the difference between the category prototypes. Most studies using classification image techniques,


by using white noise masks and flat category priors exclusively, have primarily examined this aspect of perceptual discriminant functions, asking how well observers represent signal prototypes in making perceptual discriminations. The third aspect of the task encoded in discriminant functions is the structure of the variance or noise in the features that define the stimuli. As demonstrated in Equation 3.10 and in Figure 3.7, changes made to the feature covariances can dramatically alter the optimal template for a discrimination task. No previous work with classification images has (to our knowledge) explored how observers use information about noise structure. Thus, in Experiment 2, we applied a framework previously used in cue integration studies to determine how observers use information about class-conditional variance across features in a perceptual discrimination task. We were particularly interested in determining whether observers can integrate optimally across noisy features in a manner consistent with optimal cue combination. Thus, emulating a procedure used in many cue-integration experiments, we manipulated the reliabilities of different features by increasing the variance in a subset of the features to make these features unreliable. As described above, this manipulation altered the optimal template for the resulting discrimination tasks. In both Experiments 1 and 2, subjects' classification images, with practice, approached the optimal template, demonstrating that human observers are sensitive to the variances of individual features (even when these features are chosen arbitrarily) and that they use information about these variances in making perceptual judgements. In addition, that subjects in Experiment 2 changed their templates in response to changes in the reliabilities of features, giving greater weight to


reliable features and less weight to unreliable features, suggests that observers use this information in a manner consistent with optimal cue combination. In summary, our results suggest that learning in image-based perceptual discrimination tasks consists primarily of changes that drive the discriminant function used by human observers nearer to that used by the ideal observer. Moreover, in learning these discriminant functions, observers seem to be sensitive to the individual reliabilities of arbitrary features, suggesting that optimal cue integration in vision is not restricted to the combination of estimates from a set of canonical visual modules (e.g., texture and disparity-based estimators for slant) in making surface-based discriminations, but is instead a more general property of visual perception that generalizes to simple image-based discrimination tasks. Though the current study only investigated feature integration in a single texture discrimination task, we believe that this task is representative of many other simple discrimination tasks. However, future research is needed to determine whether our results generalize to similar training in other tasks (e.g., vernier discrimination, motion direction discrimination, orientation discrimination). Finally, note that the current paper uses a normative approach to modeling what observers learn through practice with a perceptual discrimination task. This approach focuses on the structure of the task that an observer must solve, on the relevant information available to the observer, and on the fundamental limits that these factors place on the observer's performance. In contrast to process-level models of perceptual learning (e.g., Bejjanki, Ma,


Beck, & Pouget, 2007; Lu & Dosher, 1999; Otto, Fahle, & Zhaoping, 2006; Petrov, Dosher, & Lu, 2005; Teich & Qian, 2003; Zhaoping, Herzog, & Dayan, 2003), the normative approach used here is largely agnostic with respect to either physiological or algorithmic implementation details (Marr, 1982). Our results demonstrate that people can learn to use information about the covariance structure of a set of arbitrary low-level visual features. We leave the question of how this learning is implemented in the brain as a problem for future work.


Chapter 4

Conclusions

Practice affects performance in a wide variety of perceptual tasks, and this perceptual learning is believed to be important in helping us to adapt to changes in our physical environment. Over the past few decades researchers have discovered several interesting properties about this learning—that it tends to be specific to physical attributes of the trained stimuli, that it can result in long term changes in perceptual judgements, and that it is at least partially mediated by changes to the response properties of neurons in early sensory areas. However, our understanding of the mechanisms mediating perceptual learning remains limited. The work described in this paper contributes to our understanding of these mechanisms by describing a general statistical framework for characterizing the changes in perceptual inference that accompany instances of perceptual learning, by introducing novel experimental procedures and methods for examining learning, and by describing the results of two studies investigating the flexibility and the limits of perceptual learning.


In Chapter 2, we were interested in determining what constraints, if any, act on perceptual learning. To this end, we ran five experiments evaluating whether subjects can demonstrate cue acquisition, meaning that they can learn that a sensory signal is a cue to a perceptual judgement. In Experiment 1, subjects were placed in a novel environment that resembled natural environments in the sense that it contained systematic relationships among scene and perceptual variables which are normally dependent. In this case, subjects succeeded in learning a new cue. In Experiments 2-5, subjects were placed in novel environments that did not resemble natural environments—they contained systematic relationships among scene and perceptual variables that are not normally dependent. Subjects failed to learn new cues in Experiments 2-5. Thus, the results of the first study suggest that the mechanisms of early perceptual learning are biased such that people can only learn new contingencies between scene and sensory variables that are considered to be potentially dependent. To account for the results of these experiments, we proposed a new constraint on early perceptual learning: that people are capable of modifying their knowledge of the prior probabilities of scene variables or of the statistical relationships among scene and perceptual variables that are already considered to be potentially dependent, but they cannot learn new relationships among variables that are not considered to be potentially dependent, even when placed in novel environments in which these variables are strongly related. We formalized these ideas in terms of Bayesian networks: people's early perceptual systems are capable of parameter learning but not structure learning. We then reviewed four classes of early perceptual learn-


We then reviewed four classes of early perceptual learning phenomena described in the perceptual learning literature, and outlined how each can be accounted for solely through parameter learning. Finally, we presented the hypothesis that all early perceptual learning is parameter learning and discussed the computational motivations for this constraint.

In Chapter 3, we extended a statistical framework used for studying cue integration in three-dimensional perceptual judgements to allow us to examine learning in image-based discrimination tasks. A number of studies have demonstrated that people often integrate information from multiple perceptual cues in a statistically optimal manner when judging properties of surfaces in a scene. For example, subjects typically weight the information provided by each cue to a degree that is inversely proportional to the variance of the distribution of the scene property given the cue's value.
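Stated in its standard textbook form (given here for reference; this is the generic formulation rather than the specific model developed in Chapter 3), the reliability-weighted combination of unbiased single-cue estimates is

\[
\hat{s} \;=\; \sum_i w_i \, \hat{s}_i,
\qquad
w_i \;=\; \frac{1/\sigma_i^2}{\sum_j 1/\sigma_j^2},
\]

so that, for example, a cue whose estimate has twice the variance of another's receives half the weight.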


We wanted to determine whether subjects similarly use information about the reliabilities of arbitrary low-level visual features when making image-based discriminations, as in visual texture discrimination. To investigate this question, we developed a novel and efficient modification of the classification image technique and conducted two experiments that explored subjects' discrimination strategies using this improved technique. We created an orthonormal basis consisting of 20 low-level features, and stimuli were created by linearly combining the basis vectors; prototype signals for the visual categories were defined as specific linear combinations of elements from this basis set. In Experiment 1, subjects were trained to discriminate between two prototype signals corrupted with white Gaussian feature noise (i.e., noise that is "white" in our feature space but colored in pixel space), and their classification images, or decision templates, were estimated over time. We found that, with practice, subjects' templates approached that of the ideal observer, and this improvement in their decision templates correlated highly with their increase in performance efficiency. In Experiment 2, the variance of the corrupting noise was made anisotropic, such that some features were noisier, and thus less reliable in determining the stimulus class, than others. In the first half of the experiment, half of the features were made reliable and the other half unreliable; in the second half, this relationship was reversed. When we analyzed the subjects' classification images over time, we found that they modified their decision strategies in a manner consistent with optimal feature combination, giving greater weight to reliable features and less weight to unreliable features. As the variance structure of the noise was changed, subjects modified their templates to reflect the corresponding changes in the ideal template. The data suggest that subjects were sensitive to our features and weighted each feature according to its reliability, as defined by the feature covariance matrix.
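For intuition about what such an ideal template looks like under these assumptions, the sketch below (Python; the prototypes, noise variances, and random seed are invented for illustration and are not the stimuli used in our experiments) computes the ideal linear discriminant for two known prototypes in independent Gaussian feature noise. The template is the prototype difference weighted by the inverse of the noise covariance, so reliable (low-variance) features receive proportionally larger weights, and doubling a feature's noise variance halves its weight.

import numpy as np

rng = np.random.default_rng(0)
n_features = 20                              # dimensionality chosen for illustration
proto_a = rng.normal(size=n_features)        # hypothetical prototype signals in feature space
proto_b = rng.normal(size=n_features)

# Anisotropic feature noise: first half of the features reliable, second half unreliable.
noise_var = np.concatenate([np.full(10, 0.5), np.full(10, 4.0)])
cov = np.diag(noise_var)

# Ideal linear template for discriminating the two prototypes under Gaussian noise:
# proportional to inv(cov) @ (proto_a - proto_b), i.e., inverse-variance weighting.
template = np.linalg.solve(cov, proto_a - proto_b)

# Classify one noisy stimulus drawn from category A.
stimulus = proto_a + rng.normal(size=n_features) * np.sqrt(noise_var)
decision = "A" if template @ (stimulus - (proto_a + proto_b) / 2) > 0 else "B"
print(decision)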


We conclude that human observers extract information about the reliabilities of arbitrary low-level features and exploit this information when learning to make perceptual discriminations. More generally, our results suggest that optimal integration is not a characteristic specific to conventional visual cues or to judgements involving three-dimensional scene properties. Rather, just as researchers have previously demonstrated that people are sensitive to the reliabilities of conventionally-defined cues when judging the depth or slant of a surface, we demonstrate that they are likewise sensitive to the reliabilities of arbitrary low-level features when making image-based discriminations.

Overall, our results demonstrate a remarkable degree of flexibility in perceptual learning, mediated at least in part by an improvement in our ability to integrate information across different cues. This suggests that the human perceptual system is adept at representing its own uncertainty about individual perceptual cues and at exploiting this uncertainty information to combine the cues optimally when making perceptual judgements. On the other hand, the results of the first study (Chapter 2) also demonstrate that this learning is biased to favor statistical relationships consistent with those found in natural environments; the flexibility of perceptual learning is therefore not unlimited. We discussed in Chapter 2 the computational advantages of such constraints. Briefly, though flexibility and constraint in learning seem to be at odds, learning is in fact impossible without constraints. Constraints make learning problems tractable by reducing the dimensionality of the distributions, or the number of parameters, that must be learned. The resulting learning processes are restricted in terms of what they can represent (as the example of Chapter 2 shows, they may not be able to represent unlikely statistical relationships), but they are also more agile and should thus be better equipped to respond relatively quickly to likely changes in statistical relationships.
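As a rough illustration of the scale of this reduction (the numbers below are generic and are not drawn from either study), compare the number of free parameters in an unconstrained joint distribution over n binary variables with the number needed by a Bayesian network in which each variable has at most k parents:

\[
2^{n} - 1
\quad \text{versus} \quad
n \cdot 2^{k}.
\]

For n = 20 and k = 2, this is roughly a million parameters for the unconstrained joint versus at most 80 conditional probabilities for the constrained network.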


We realize that this work is preliminary, that it provides only a rough picture of the mechanisms involved in perceptual learning and of the constraints that act on these mechanisms, and that much research remains to be done. To determine whether perceptual learning mechanisms are truly incapable of structure learning, as suggested in Chapter 2, we must focus on testing the predictions of this constraint in other perceptual learning tasks. A major part of this work will lie in developing experimental methods that more directly detect changes in observers' representations of variable dependencies. Regarding the work described in Chapter 3, open issues include determining whether the representations of feature uncertainty learned by observers are general enough to be exploited for optimal integration in unrelated perceptual tasks that make use of the same features. In addition, both studies described in the current paper used relatively simple statistical relationships. An important question regarding perceptual learning mechanisms is how they are limited in terms of the complexity of the distributions that they can represent. Can they only learn to represent correlations coarsely? Are they capable of learning higher-order (e.g., third- and fourth-order) relationships? These are all open questions. Nonetheless, by outlining a statistical framework for examining the changes in perceptual inference that occur as the result of training, and by making explicit the goal of characterizing the limits on perceptual learning, we have taken an important first step toward characterizing the mechanisms that underlie perceptual learning.

References

Abbey, C. K., & Eckstein, M. P. (2002). Classification image analysis: Estimation and statistical inference for two-alternative forced-choice experiments. Journal of Vision, 2, 66-78.
Abbey, C. K., Eckstein, M. P., & Bochud, F. O. (1999). Estimation of human-observer templates for 2 alternative forced choice tasks. Proceedings of SPIE, 3663, 284-295.
Adams, W. J., Graf, E. W., & Ernst, M. O. (2004). Experience can change the 'light-from-above' prior. Nature Neuroscience, 7, 1057-1058.
Ahumada, A. J. (2002). Classification image weights and internal noise level estimation. Journal of Vision, 2, 121-131.
Ahumada, A. J. (1996). Perceptual classification images from vernier acuity masked by noise. Perception, 25, 18.


Ahumada, A. J. (1967). Detection of tones masked by noise: A comparison of human observers with digital-computer-simulated energy detectors of varying bandwidths. Unpublished doctoral dissertation, University of California, Los Angeles.
Atkins, J. E., Fiser, J., & Jacobs, R. A. (2001). Experience-dependent visual cue integration based on inconsistencies between visual and haptic percepts. Vision Research, 41, 449-461.
Atkins, J. E., Jacobs, R. A., & Knill, D. C. (2003). Experience-dependent visual cue recalibration based on discrepancies between visual and haptic percepts. Vision Research, 43, 2603-2613.
Ball, K., & Sekuler, R. (1987). Direction-specific improvement in motion discrimination. Vision Research, 27, 953-965.
Battaglia, P. W., Jacobs, R. A., & Aslin, R. N. (2003). Bayesian integration of visual and auditory signals for spatial localization. Journal of the Optical Society of America A, 20, 1391-1397.
Beard, B. L., & Ahumada, A. J. (1999). Detection in fixed and random noise in foveal and parafoveal vision explained by template learning. Journal of the Optical Society of America A, 16, 755-763.
Beard, B. L., & Ahumada, A. J. (1998). A technique to extract relevant image features for visual tasks. In B. E. Rogowitz & T. N. Pappas (Eds.), Human Vision and Electronic Imaging III, SPIE Proceedings Vol. 3299, pp. 79-85.


Bedford, F. (1993). Perceptual learning. The Psychology of Learning and Motivation, 30, 1-60.
Bejjanki, V. R., Ma, W. J., Beck, J. M., & Pouget, A. (2007). Perceptual learning as improved Bayesian inference in early sensory areas. Poster presented at the Computational and Systems Neuroscience Conference (CoSyNe), Salt Lake City, UT.
Berardi, N., & Fiorentini, A. (1987). Interhemispheric transfer of visual information in humans: Spatial characteristics. Journal of Physiology, 384, 633-647.
Bishop, C. M. (1995). Probability density estimation. In C. M. Bishop, Neural Networks for Pattern Recognition. New York, NY: Oxford University Press, pp. 33-76.
Braddick, O., Campbell, F. W., & Atkinson, J. (1978). Channels in vision: Basic aspects. In R. Held, H. W. Leibowitz, & H. Teuber (Eds.), Perception. Berlin: Springer-Verlag, pp. 3-38.
Brunswik, E. (1956). Perception and the Representative Design of Psychological Experiments. Berkeley, CA: University of California Press.
Chauvin, A., Worsley, K. J., Schyns, P. G., Arguin, M., & Gosselin, F. (2005). Accurate statistical tests for smooth classification images. Journal of Vision, 5, 659-667.


Cooper, G. F. (1990). The computational complexity of probabilistic inference using Bayesian belief networks. Artificial Intelligence, 42, 393-405.
Crist, R. E., Li, W., & Gilbert, C. D. (2001). Learning to see: Experience and attention in primary visual cortex. Nature Neuroscience, 4, 519-525.
Das, A. (1997). Plasticity in adult sensory cortex: A review. Network, 8, R33-R76.
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B, 39, 1-38.
Dill, M. (2002). Specificity versus invariance of perceptual learning: The example of position. In M. Fahle and T. Poggio (Eds.), Perceptual Learning. Cambridge, MA: MIT Press.
Dill, M., & Fahle, M. (1997). The role of visual field position in pattern-discrimination learning. Proceedings of the Royal Society of London B, 264, 1031-1036.
Epstein, W. (1975). Recalibration by pairing: A process of perceptual learning. Perception, 4, 59-72.
Ernst, M. O., & Banks, M. S. (2002). Humans integrate visual and haptic information in a statistically optimal fashion. Nature, 415, 429-433.


Ernst, M. O., Banks, M. S., & Bülthoff, H. H. (2000). Touch can change visual slant perception. Nature Neuroscience, 3, 69-73.
Fahle, M., Edelman, S., & Poggio, T. (1995). Fast perceptual learning in hyperacuity. Vision Research, 35, 3003-3013.
Fahle, M., & Poggio, T. (2002). Perceptual Learning. Cambridge, MA: MIT Press.
Gilbert, C. D., Sigman, M., & Crist, R. E. (2001). The neural basis of perceptual learning. Neuron, 31, 681-697.
Fine, I., & Jacobs, R. A. (2000). Perceptual learning for a pattern discrimination task. Vision Research, 40, 3209-3230.
Freeman, W. T., Pasztor, E. C., & Carmichael, O. T. (2000). Learning low-level vision. International Journal of Computer Vision, 40, 25-47.
Garcia, J., & Koelling, R. A. (1966). The relation of cue to consequence in avoidance learning. Psychonomic Science, 4, 123-124.
Gauthier, I., & Tarr, M. J. (1997). Becoming a "Greeble" expert: Exploring mechanisms for face recognition. Vision Research, 37, 1673-1682.
Geisler, W. S. (2003). Ideal observer analysis. In L. Chalupa and J. Werner (Eds.), The Visual Neurosciences. Boston: MIT Press, pp. 825-837.


Gepshtein, S., Burge, J., Ernst, M. O., & Banks, M. S. (2005). The combination of vision and touch depends on spatial proximity. Journal of Vision, 5, 1013-1023.
Ghose, G. M., Yang, T., & Maunsell, J. H. R. (2002). Physiological correlates of perceptual learning in monkey V1 and V2. Journal of Neurophysiology, 87, 1867-1888.
Gibson, E. J. (1953). Improvement in perceptual judgments as a function of controlled practice or training. Psychological Bulletin, 50, 401-431.
Gilbert, C. D. (1994). Early perceptual learning. Proceedings of the National Academy of Sciences USA, 91, 1195-1197.
Gold, J. M., Sekuler, A. B., & Bennett, P. J. (2004). Characterizing perceptual learning with external noise. Cognitive Science, 28, 167-207.
Haijiang, Q., Saunders, J. A., Stone, R. W., & Backus, B. T. (2006). Demonstration of cue recruitment: Change in visual appearance by means of Pavlovian conditioning. Proceedings of the National Academy of Sciences USA, 103, 483-488.
Harris, C. S. (1965). Perceptual adaptation to inverted, reversed, and displaced vision. Psychological Review, 72, 419-444.
Hillis, J. M., Ernst, M. O., Banks, M. S., & Landy, M. S. (2002). Combining sensory information: Mandatory fusion within, but not between senses. Science, 298, 1627-1630.


Hubel, D. H. (1988). Deprivation and development. In D. H. Hubel, Eye, Brain, and Vision. New York, NY: Scientific American Library.
Hubel, D. H., & Wiesel, T. N. (1970). The period of susceptibility to the physiological effects of unilateral eye closure in kittens. Journal of Physiology, 206, 419-436.
Hubel, D. H., & Wiesel, T. N. (1977). Ferrier lecture: Functional architecture of macaque monkey visual cortex. Proceedings of the Royal Society of London, Series B, Biological Sciences, 198, 1-59.
Hubel, D. H., Wiesel, T. N., & LeVay, S. (1977). Plasticity of ocular dominance columns in monkey striate cortex. Philosophical Transactions of the Royal Society of London, Series B, Biological Sciences, 278, 377-409.
Jordan, M. I., & Weiss, Y. (2002). Graphical models: Probabilistic inference. In M. Arbib (Ed.), The Handbook of Brain Theory and Neural Networks (2nd ed.). Cambridge, MA: MIT Press.
Jacobs, R. A. (1999). Optimal integration of texture and motion cues to depth. Vision Research, 39, 3621-3629.
Jacobs, R. A., & Fine, I. (1999). Experience-dependent integration of texture and motion cues to depth. Vision Research, 39, 4062-4075.
Karni, A., & Sagi, D. (1991). Where practice makes perfect in texture discrimination: Evidence for primary visual cortex plasticity. Proceedings of the National Academy of Sciences, 88, 4966-4970.


Kersten, D., Mamassian, P., & Yuille, A. (2004). Object perception as Bayesian inference. Annual Review of Psychology, 55, 271-304.
Kersten, D., & Schrater, P. R. (2000). How optimal depth cue integration depends on the task. International Journal of Computer Vision, 40, 71-89.
Kersten, D., & Yuille, A. (2003). Bayesian models of object perception. Current Opinion in Neurobiology, 13, 1-9.
Knill, D. C. (2003). Mixture models and the probabilistic structure of depth cues. Vision Research, 43, 831-854.
Knill, D. C., & Richards, W. (Eds.) (1996). Perception as Bayesian Inference. Cambridge: Cambridge University Press.
Knill, D. C., & Saunders, J. (2003). Do humans optimally integrate stereo and texture information for judgments of surface slant? Vision Research, 43, 2539-2558.
Landy, M. S., Maloney, L. T., Johnston, E. B., & Young, M. (1995). Measurement and modeling of depth cue combination: In defense of weak fusion. Vision Research, 35, 389-412.
Levi, D. M., & Klein, S. A. (2002). Classification images for detection and position discrimination in the fovea and parafovea. Journal of Vision, 2, 46-65.


Li, R. W., Levi, D. M., & Klein, S. A. (2004). Perceptual learning improves efficiency by re-tuning the decision 'template' for position discrimination. Nature Neuroscience, 7, 178-183.
Li, W., Piech, V., & Gilbert, C. D. (2004). Perceptual learning and top-down influences in primary visual cortex. Nature Neuroscience, 7, 651-657.
Lu, Z.-L., & Dosher, B. A. (1999). Characterizing human perceptual inefficiencies with equivalent internal noise. Journal of the Optical Society of America A, 16, 764-778.
Lu, H., & Liu, Z. (2006). Computing dynamic classification images from correlation maps. Journal of Vision, 6, 475-483.
Mannos, J., & Sakrison, D. (1974). The effects of a visual fidelity criterion on the encoding of images. IEEE Transactions on Information Theory, 20, 525-536.
Marr, D. (1982). Vision. New York, NY: W. H. Freeman and Company.
Mather, J., & Lackner, J. R. (1981). Adaptation to visual displacement: Contribution of proprioceptive, visual, and attentional factors. Perception, 10, 367-374.
Matthews, N., & Welch, L. (1997). Velocity-dependent improvements in single-dot direction discrimination. Perception & Psychophysics, 59, 60-72.


Maunsell, J. H. R., & Newsome, W. T. (1987). Visual processing in monkey extrastriate cortex. Annual Review of Neuroscience, 10, 363-401.
Murray, R. F. (2002). Perceptual organization and the efficiency of shape discrimination. Unpublished doctoral dissertation, University of Toronto, Canada.
Neapolitan, R. E. (2004). Learning Bayesian Networks. Upper Saddle River, NJ: Pearson Education.
Neri, P., & Heeger, D. J. (2002). Spatiotemporal mechanisms for detecting and identifying image features in human vision. Nature Neuroscience, 5, 812-816.
Olman, C., & Kersten, D. (2004). Classification objects, ideal observers, and generative models. Cognitive Science, 28, 227-240.
O'Toole, A. J., & Kersten, D. J. (1992). Learning to see random-dot stereograms. Perception, 21, 227-243.
Otto, T. U., Fahle, M. H., & Zhaoping, L. (2006). Perceptual learning with spatial uncertainties. Vision Research, 46, 3223-3233.
Pelli, D. G. (1981). Effects of visual noise. Unpublished doctoral dissertation, Cambridge University.
Petrov, A., Dosher, B., & Lu, Z.-L. (2005). The dynamics of perceptual learning: An incremental reweighting model. Psychological Review, 112, 715-743.


Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems. San Mateo, CA: Morgan Kaufmann.
Recanzone, G. H., Merzenich, M. M., & Jenkins, W. M. (1992). Frequency discrimination training engaging a restricted skin surface results in an emergence of a cutaneous response zone in cortical area 3a. Journal of Neurophysiology, 67, 1057-1070.
Recanzone, G. H., Schreiner, C. E., & Merzenich, M. M. (1993). Plasticity in the frequency representation of primary auditory cortex following discrimination training in adult owl monkeys. Journal of Neuroscience, 13, 87-103.
Rish, I. (2000). Advances in Bayesian learning. Proceedings of the 2000 International Conference on Artificial Intelligence. CSREA Press.
Saffran, J. R. (2002). Constraints on statistical language learning. Journal of Memory and Language, 47, 172-196.
Schmajuk, N. A., & Holland, P. C. (1998). Occasion Setting. Washington, DC: American Psychological Association.
Schoups, A. A., Vogels, R., & Orban, G. A. (1995). Human perceptual learning in identifying the oblique orientation: Retinotopy, orientation specificity and monocularity. Journal of Physiology, 483, 797-810.


Schoups, A., Vogels, R., Qian, N., & Orban, G. (2001). Practising orientation identification improves orientation coding in V1 neurons. Nature, 412, 549-553.
Schrater, P. R., & Kersten, D. (2000). How optimal depth cue integration depends on the task. International Journal of Computer Vision, 40, 71-89.
Schwartz, S., Maquet, P., & Frith, C. (2002). Neural correlates of perceptual learning: A functional MRI study of visual texture discrimination. Proceedings of the National Academy of Sciences, 99, 17137-17142.
Simoncelli, E. P., Paninski, L., Pillow, J., & Schwartz, O. (2004). Characterizing neural responses with stochastic stimuli. In M. Gazzaniga (Ed.), The Cognitive Neurosciences (3rd ed.). Boston: MIT Press, pp. 327-338.
Teich, A. F., & Qian, N. (2003). Learning and adaptation in a recurrent model of V1 orientation selectivity. Journal of Neurophysiology, 89, 2086-2100.
Vaina, L. M., Belliveau, J. W., des Roziers, E. B., & Zeffiro, T. A. (1998). Neural systems underlying the learning and representation of global motion. Proceedings of the National Academy of Sciences, 95, 12657-12662.
Wallach, H. (1985). Learned stimulation in space and motion perception. American Psychologist, 40, 399-404.


Welch, R. B. (1986). Adaptation of space perception. In K. R. Boff, L. Kaufman, & J. P. Thomas (Eds.), Handbook of Perception and Human Performance (Vol. 1). New York: Wiley-Interscience.
Wichmann, F. A., & Hill, N. J. (2001a). The psychometric function: I. Fitting, sampling, and goodness of fit. Perception & Psychophysics, 63, 1293-1313.
Wichmann, F. A., & Hill, N. J. (2001b). The psychometric function: II. Bootstrap-based confidence intervals and sampling. Perception & Psychophysics, 63, 1314-1329.
Yuille, A. L., & Bülthoff, H. H. (1996). Bayesian decision theory and psychophysics. In D. C. Knill and W. Richards (Eds.), Perception as Bayesian Inference. Cambridge: Cambridge University Press, pp. 123-161.
Zeki, S. M. (1978). Functional specialization in the visual cortex of the monkey. Nature, 274, 423-428.
Zhaoping, L., Herzog, M., & Dayan, P. (2003). Nonlinear ideal observation and recurrent processing in perceptual learning. Network: Computation in Neural Systems, 14, 223-247.
