AIM: Attentionally-Based Interaction Model for the Interpretation of Vascular Angiography
Francis K.H. Quek, Cemil Kirbas, and Fady Charbel†
Vision Interfaces and Systems Laboratory (VISLab), Electrical Engineering and Computer Science Department, The University of Illinois at Chicago
†Department of Neurosurgery, The University of Illinois at Chicago Hospitals
1 Abstract
We propose a model to interpret neurovascular XRA images interactively. This attentionally-based interactive model, AIM, exploits human interaction as part of the solution. AIM posits two channels of interaction: context (“what to look for”) and focus-of-attention (“where to look”) as the locus of spatial information exchange between the user and the machine. In an AIM system, the user specifies a context (e.g. a carotid vessel) and directs the attentional spotlight to focus machine processing. AIM engages the user as an integral partner with the computer and facilitates varying degrees of human intervention in the process. A hierarchy of context abstractions permits the system to function more autonomously (performing high-level tasks such as extracting an arterial vessel) in routine interpretation, and to require more user intervention (e.g. locating arterial wall boundaries) as the image complexity increases. This is especially important in medical imaging, where the medical professional must have ultimate control of and confidence in the system. Such technology can have significant impact on the design of radiological systems.

Index Terms - Attentionally-Based Interaction Model (AIM), X-Ray Angiograms (XRA), Machine Vision, Selective Attention, Neurovascular
2 Glossary
AIM : Attentionally-Based Interaction Model
CT : Computed Tomography (of the brain)
FOA : Focus of Attention
MRA : Magnetic Resonance Angiography
MRI : Magnetic Resonance Imaging
Neurovascular Network : Ensemble of arteries supplying the brain
Neurovascular Tree : see Neurovascular Network
Selective Attention : The ability of humans to attend to selected sensory information or cognitive activity
Stenosis : Narrowing of a blood vessel
XeCT : Xenon Computed Tomography
XRA : X-Ray Angiograms
3 Acknowledgement
This work has been supported by the Whitaker Foundation under Biomedical Engineering Research Grant 96-0458.
4 Introduction
We propose a model for the interpretation of medical images pertaining to the human neurovascular system. This attentionally-based interactive model, AIM, is founded upon human selective attention and exploits human interaction as part of the solution. AIM permits us to build systems that require little human effort in routine interpretation and greater human involvement as image complexity increases. AIM hence facilitates flexible systems that enhance interpretation.

Modern computer science recognizes that the way in which a user interacts with a computer has significant impact on productivity and on the frequency of errors in computer work. While much attention has been paid to the analysis and interpretation of radiological medical images, scant research resources have been allocated to the interface for such systems.

Two complementary factors motivate our interaction model. First, we need to alleviate the human radiological image interpreter’s cognitive and perceptual load. While humans are capable of expert, high-level interpretation of such images, the tedium of repetition and the strain of precise hand-eye coordination lead to productivity losses and errors in analysis. Furthermore, fatigue and interpersonal differences introduce undesired variables in the interpretation of neurovascular images and in the measurement of vessel diameters, flow density, and lengths. Second, we need to overcome the shortcomings of machine perception and intelligence. While computer algorithms can produce very accurate and repeatable interpretations and measurements, they very often lack ‘global’ knowledge and context. An inherent uncertainty exists in the analysis, segmentation, and interpretation of image data. High-level contextual information is necessary before one can apply specific processes to different portions of the data. Conversely, a system has to depend on the output of these processes to obtain the high-level information. Our AIM paradigm couples computation tightly with high-level human input to address both factors. We provide the psycholinguistic motivation for our model in section 5, and detail our model in section 6.1.
5 Model of Human Spatial Perceptual Communication
Key to the development of an effective human-machine interaction strategy is an understanding of what is “intuitive” to the human user. In order to define a model for human-computer communication, we first need to determine the channels for such communication. Second, we need to assign the level(s) of abstraction at which this communication takes place. In our quest for the appropriate channels for human-machine discourse of spatial and perceptual information, the questions we ask are: How do humans exchange such information? What native mechanisms of the human perceptual system respond to such intercourse? What kinds of feedback are necessary to assist the ‘speaker’ (here used loosely to represent the originator of the information)? And how might we apply computation to approximate this interaction? In the course of this paper, we define our model, provide psychological/biological and linguistic motivation for it, describe our architecture for supporting the model, and outline the machine vision/image processing applied.
5.1 Motivational Example
Consider the interaction in which one map reader provides directions for another. The first person may say, “We shall pick up the highway here [pointing roughly to a portion of a highway], and get off at this exit [pointing again].” This intercourse embodies the gist of our model. Two kinds of information are conveyed: context (what to look for) and focus of attention (FOA) (where to look). The level of abstraction employed in the discussion is that of task-level objects (highways and exits). We posit that these elements of communication are necessary and sufficient for the exchange of spatial and perceptual information. Our model adopts these two channels and provides for several levels of context abstraction. Figure 1 depicts the two channels of interaction in the context of MRA interpretation. The speaker designates the focus of attention by pointing and specifies the context (a middle cerebral artery). The interlocutor receives the focus information and directs attention to the region of focus. Context information is used to filter the region of focus to locate the designated object. It is obvious from this example that the use of context requires knowledge on the part of the communicant.
Figure 1: Our model of human communication of spatial-perceptual information. (The communicator points to a middle cerebral artery; the communicant locates it.)
5.2 Motivational Basis
“The idea of a limited capacity attentional system” is a “central feature” of most human performance models [1, 2, 3]. A necessary corollary is that this attentional mechanism must be able to filter out unwanted information while attending to the attentional target [4]. Modern psycholinguistic theories of discourse stress the importance of modeling and directing the attentional space of one’s interlocutor for coherent conversation [5, 6, 7]. We posit that spatial and semantic attention provide a well-founded basis for interactive systems in the perceptually intensive task of interpreting medical images. While it is not our intention here to perform a detailed analysis of visual attention, we shall nonetheless provide a brief overview of the more foundational works. The reader is referred to the excellent and exhaustive review of the selective attention literature provided by Allport [8].

Sagi and Julesz [9] report experiments in which they measured the processing time of human subjects during target detection and target discrimination as a function of the number of targets in the test. It was found that processing time increases with the number of targets for discrimination, but remains constant for detection of feature orientation. While humans perform visual detection and localization in parallel, discrimination has to be done serially: “Knowing ‘what’ even a single feature is requires time-consuming search by focal attention.”

Treisman et al. [10, 11, 12, 13] posit in their feature integration theory (FIT) that when ‘separable features’ exist in conjunction, such conjunctions can be correctly fused only with directed
attention. These separable features are detected by preattentional mechanisms. According to FIT, spatially directed attention is required to combine such features in the recognition process. When the requisite attentional resources cannot be brought to bear, ‘illusory conjunctions’ arise.

Posner et al. [14, 1] investigated the effects of ‘orienting’ of attention on target detection time (where detection is defined as being “aware or conscious of the stimulus”). They found that precues to the ‘attentional spotlight’ aided discrimination and detection significantly. Posner makes a distinction between orienting that is elicited by central cues and the drawing of attention by peripheral cues. The former is under ‘strategic control’ while the latter is ‘relatively automatic’. Posner’s baseline experiments were designed to evaluate the effect of central cues on the detection of stimuli. In these experiments, Posner et al. ascertained the effect of precues for three different tasks. The first was a simple reaction time test in which subjects were required to push a button regardless of stimulus position. In the second, a choice spatial experiment, subjects were required to respond whether the target was above or below the cue. In the third, a choice symbolic experiment, subjects were required to report whether a target was a character or a digit. They reported that precuing facilitated all three tasks. More recent research has shown that object-level cues also facilitate effective recognition [15].

Eriksen et al. [16, 17] proposed extending the properties of Posner’s spotlight metaphor in such a way that makes a camera zoom lens a ‘more apt analogy’. Their experiments recorded the reaction times (RT) of subjects searching eight-letter circular displays for one of two target letters. One to four adjacent character locations were precued by underlines at varying stimulus onset asynchronies (SOA). They showed that speed in identifying a target increases with increasing SOA of the precue, with the asymptote at 200 ms SOA. It was also shown [17] that the size of the visual attention focus can be manipulated by precuing, that attentional resources are evenly distributed within the focus region, and that reaction time within the focal region varies with the size of the focus only when the system is saturated. This supports their principle of optimal attention allocation, by which they posit that a subject ‘makes good judgment’ about the amount of resources a task requires and allocates sufficient resources. Hence performance reductions (increases in RT) may not result from increased focus size as more resources are brought to bear; reduction occurs when resources become fully utilized. Significant cost is incurred, however, when the attentional precue is erroneous. When the stimulus appeared in a non-cued location, there was an appreciable increase in RT over the
non-cued condition. It appears that the subjects persisted in focus mode and serially scanned the display [16] (as opposed to temporarily expanding the focus area). To determine the boundary effects at the ‘edge’ of the focus area, they measured the disruptive effect of ‘incompatible’ stimuli at the periphery of the focus area. This disruption decreased progressively between 0.5° and 1.5° of visual angle irrespective of the size of the cued region (1 to 3 characters). With a change of focus, the precision of the new focus increases over time.

Desimone and Duncan [4] emphasize the neural mechanisms of attention. They assert that the phenomenon observed as attention is an emergent behavior resulting from competing processes rather than a monolithic spotlight mechanism controlled from some central focus. Within this paradigm, the observable selective nature of attention is the result of a control mechanism that favors the process of interest (e.g. look for red things) among a set of bottom-up competing processes. Desimone and Duncan also posit a bottom-up bias mechanism that may permit non-target processes to compete in visual search. The advantage of this paradigm is that non-target processes may still emerge as key items of visual interest even if they were not the control targets. Desimone and Duncan also address the interplay between the object and spatial attention mechanisms. It is possible to bias a subject to attend to a subpart of the visual field. This can negate the effect of the top-down control to acquire the target object if it is not in the attended region. This spatial bias can be thought of as contributing to the bias in the bottom-up process competition.

Lavie and Driver [18] addressed the relationship between spatial-based and object-based control of attention. The former holds that spatial precuing and proximity dominate visual attention, while the latter holds that object inclusion (features belonging to the ‘same object’) dominates the attentional mechanism. This recent controversy in attention research circles has found prominent proponents for each position. They advance the divided attention paradigm, which seeks to unify these competing schools. They designed a set of four experiments to address the interaction between these attentional controls by manipulating both spatial proximity and object factors as variables at the same time. They conclude that object-based constraints play a role in attention even if the contributing features (i.e. parts of the object) are far apart. Such object-based constraints are, however, heavily influenced by spatial constraints. This is especially true for the ‘initial spatial setting of attention’. If a subject is precued to a very specific and narrow spatial focus, object-based constraints can become irrelevant. Hence, attention can be seen as an interaction between spatial and object
controls.

Hinrichs and Polanyi [19] examined a corpus of Spatial Planning Protocols to study how speakers construct plans interactively. In their experiment, subjects played a “Traveling through Europe” game. Two subjects played against a researcher on a game board showing a map of nine cities. The goal of the game was to devise an itinerary of travel visiting all the cities in the smallest number of moves along legal routes. The subjects could update and change plans at any time during the game. It was found that without considering the pointing gestures made by the subjects, the interactive discourse was incomplete, but when gestures were factored in, the elliptical, fragmentary discourse became intelligible plans of action.

Hurlbert and Poggio [20] suggest that computer vision needs mechanisms for focus of spatial attention in recognition tasks. They cite several psychophysical experiments which indicate that human vision possesses a parallel feature-based pre-attention mechanism as well as a serial search mechanism. The former mechanism detects (but does not identify) separable features and provides direction for visual focus. The latter is a ‘searchlight’-based serial mechanism for feature conjunctions and recognition. The authors further suggest that the focus of attention may serve to suppress processing in irrelevant domains and enhance processing in the salient region. They further suggest that the power of attentional mechanisms is neither fully understood nor tapped in machine vision research.

Recently, efforts that include attentional mechanisms in computer vision have been reported [21, 22]. These concentrate on “automatic” attentional mechanisms to direct the processing spotlight, using artificial intelligence or neural network approaches to determine the new objects or regions of interest. We instead apply the attentional mechanism as the conduit of interaction between the user and the system.
6 Attentionally-Based Interaction Model (AIM)
For an interactive approach to be effective, it must exploit the facility of humans to operate or communicate within the bounds of the approach. We advance a selective-attention-inspired interaction paradigm for the analysis of multimodal (MRA, CT, XeCT, and XRA) neurovascular images. Our proposed AIM approach addresses the challenges posed in section 4 by exploiting the strengths of both humans and machines to overcome each other’s weaknesses.
6.1 AIM Interaction Model
In section 5.2 we reviewed research that highlights the role of human attention (both spatial and semantic) in perception and communication. While many aspects of human attention are yet to be ascertained, we do know the following:

1. Attention facilitates the direction of limited cognitive resources;
2. Spatial cues (or prompts) are effective in directing the attentional ‘spotlight’ for recognition;
3. Central semantic cues at both the feature and object levels facilitate performance in visual attention tasks;
4. Attention may involve a collection of competing ‘smart’ low-level bottom-up processes driven by high-level top-down targeting controls;
5. Both the spatial-based and object-based mechanisms function in unison to manage resource allocation; and,
6. A model of what is in the attentional space of one’s interlocutor is critical to maintaining effective communication.

As we intimated in the example in section 5.1, AIM defines two interaction channels: a semantic context (what to look for) and a focus of attention, FOA (where to look). These correspond to the object-based and spatial-based mechanisms respectively. In an AIM system (AIMS), the user selects the context from a menu (or a schematic diagram of the neurovascular system) and manipulates a FOA cursor (FOAC) through the data using a 2D or 3D pointing device. The AIMS processes the data in the region of the FOAC to locate and highlight entities matching the selected context in real time as the cursor moves. When the correct entity is highlighted, the user acknowledges this and the system extends the dialog by tracing the entity (e.g. the highlighted vessel), all the while providing feedback via animated highlighting. In essence, we seek to replace the communicant in figure 1 with a machine interpreter. In this way, the contexts and the FOA drive the top-down control mechanisms in both the user and AIMS, keeping the communication between the user and machine situated. The AIMS provides the low-level bottom-up processes and highlighting to minimize the attentional load of the user.
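To make this interaction concrete, the sketch below shows one way the context/FOA loop could be wired up in code. It is illustrative only: the names (VesselContext, onFoaMoved, highlight, and so on) are hypothetical rather than taken from our implementation, and the detection and tracing operators are stubbed out.

// Illustrative sketch of the AIM context/FOA interaction loop (hypothetical names).
#include <optional>

struct Point { int x = 0, y = 0; };
struct Image { int width = 0, height = 0; /* pixel data omitted */ };

// Stub: draw a highlight overlay at the given location in the displayed image.
void highlight(const Image& /*img*/, Point /*at*/) { /* drawing omitted */ }

// A context encapsulates "what to look for" under the FOA cursor.
class VesselContext {
public:
    // Look for a vessel in the FOA subimage; return its location if found.
    std::optional<Point> detect(const Image& foaSubimage) const {
        (void)foaSubimage;              // tuned detection operator omitted
        return Point{0, 0};             // placeholder result
    }
    // Extend the dialog: trace the accepted vessel through the parent image.
    void trace(const Image& parentImage, Point seed) const {
        (void)parentImage; (void)seed;  // tracing with animated highlighting omitted
    }
};

// Called whenever the FOA cursor moves ("where to look" changes).
void onFoaMoved(const VesselContext& context, const Image& parent,
                const Image& foaSubimage, Point foaOrigin) {
    if (auto hit = context.detect(foaSubimage))
        highlight(parent, {foaOrigin.x + hit->x, foaOrigin.y + hit->y});
}

// Called when the user presses the pointer button to accept the highlight.
void onAccept(const VesselContext& context, const Image& parent, Point seed) {
    context.trace(parent, seed);
}

Because the detector is re-run at every cursor position, a wrong highlight costs little: the user simply moves the pointing device and the next invocation retargets the detector.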
[Figure 2 block labels: Imaging Data, Domain Knowledge, Contexts, Data Information, User Interface, Operator Selector, Operator Library, Extraction Process Knowledge, Interpretation Status, Interpretation Results]
Figure 2: Block Diagram of AIM Architecture

Since the system knows what it is looking for, the problem becomes one of detecting a specific entity. Hence, very specifically tuned detectors may be used. It is not critical if the detector highlights the wrong entity, because the user can retarget the detector by simply moving the pointing device. AIM defines an abstraction hierarchy of contexts. A context may vary in abstraction from full scene interpretation through object-level, feature-level, and point-level contexts. To illustrate this, consider a system to extract and measure vascular entities in a neurovascular system imaged using an X-ray angiogram (XRA). The scene-level context may be the structure of the entire neurovascular tree, the object-level context may be a segment of the carotid artery, the feature-level context may be the edge or boundary of an artery segment, and the point-level context may be a specific data point (e.g. the user traces the boundary points on an XRA by hand). This hierarchy provides robustness for a mission-critical task in the process of patient care: it guarantees the ability to obtain the necessary vessel extraction and measurement even if the higher-level recognition algorithms fail (e.g. owing to patient pathology or data quality).
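As a rough illustration of how such a hierarchy might be represented in the system (the types below are hypothetical, not the actual AIMS classes), each context could advertise its abstraction level so that interpretation can fall back to a lower, more manual level when a higher-level detector fails:

// Hypothetical sketch of the context abstraction hierarchy for XRA interpretation.
enum class AbstractionLevel { Scene, Object, Feature, Point };

class XraContext {
public:
    virtual ~XraContext() = default;
    virtual AbstractionLevel level() const = 0;   // position in the hierarchy
};

// Scene level: structure of the entire neurovascular tree.
class NeurovascularTreeContext : public XraContext {
public:
    AbstractionLevel level() const override { return AbstractionLevel::Scene; }
};

// Object level: a segment of the carotid artery.
class CarotidSegmentContext : public XraContext {
public:
    AbstractionLevel level() const override { return AbstractionLevel::Object; }
};

// Feature level: the edge or boundary of an artery segment.
class VesselBoundaryContext : public XraContext {
public:
    AbstractionLevel level() const override { return AbstractionLevel::Feature; }
};

// Point level: boundary points traced on the XRA by hand.
class HandTracedPointContext : public XraContext {
public:
    AbstractionLevel level() const override { return AbstractionLevel::Point; }
};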
6.2 AIM System Architecture
Given our AIM model, we need a system and software architecture on which to integrate the system elements. This is important if our AIMS is to be extensible to accommodate new algorithms, imaging modalities, and entities of interest (e.g. other brain structures such as the interhemispheric fissure) without requiring extensive ad hoc reengineering for each upgrade. Figure 2 is a block diagram of the preliminary architecture to realize our interaction model. Our object-oriented architecture comprises two distinct components: the user interface and the domain
knowledge representation. Such modularity is important if our system is to be portable across platforms and display/interaction technologies (e.g. for both 2D and 3D interpretation environments). In the system domain knowledge, the Context database maintains knowledge about the entities of interest (vessels of different types, and the operators necessary to extract them from the different kinds of image data). The Data Information component maintains meta-information about the data (e.g. file name sequences, where data is stored, the resolution of each image type, etc.). The Operator Selector selects the appropriate FOAC operator from the Operator Library. The Process Knowledge database maintains knowledge of the order of analysis and extraction for the neurovascular tree. This permits the system to predict the next entity to be extracted so that the user does not have to remember the order, thus reducing process errors (e.g. missed or mixed-up readings).
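A minimal sketch of how the Operator Selector might use this knowledge follows. It is an assumption for illustration, not the actual AIMS data structure: the library pairs each (context, imaging modality) combination with the specifically tuned FOAC operator needed to extract that entity.

// Hypothetical sketch of operator selection from the domain knowledge component.
#include <functional>
#include <map>
#include <stdexcept>
#include <string>
#include <utility>

struct Image;                                            // provided elsewhere
using FoacOperator = std::function<void(const Image&)>;  // runs on the FOA subimage

enum class Modality { XRA, MRA, CT, XeCT };

class OperatorLibrary {
public:
    void add(const std::string& context, Modality m, FoacOperator op) {
        ops_[{context, m}] = std::move(op);
    }
    // Operator Selector: pick the operator tuned for the selected context and data type.
    const FoacOperator& select(const std::string& context, Modality m) const {
        auto it = ops_.find({context, m});
        if (it == ops_.end()) throw std::runtime_error("no operator for this context");
        return it->second;
    }
private:
    std::map<std::pair<std::string, Modality>, FoacOperator> ops_;
};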
6.3 Interface Architecture
Figure 4 shows the screen layout of the AIM interface used in the interpretation of complex images. The Working Area is the primary area in which the operator interacts with the system in the interpretation task. The Focus of Attention (FOA) is directed by manipulating the Smart Cursor over the displayed subject image in the Working Area with a pointing device. Feedback for the interpretation process is provided in the Working Area by highlighting the objects under the Smart Cursor which satisfy the current context. The Schematic Map, which represents the vessels of the neurovascular network, is used to provide object-level context information (e.g. a specific carotid artery) to the system. It also provides graphical feedback for the measurement process. The Graphical Dialog Box provides an additional avenue for graphical feedback. This box serves two purposes. First, it serves as a magnification window for the Smart Cursor. Second, it facilitates ‘graphical prompting’. The status windows at the bottom left of the screen provide coordinate and other status information to the user in textual form. Similarly, the status window at the bottom right of the screen provides vessel and measurement process status information in textual form. The pulldown menu items at the top of the screen permit alteration of the system parameters and selection of the data sets on which to operate.
Figure 3: AIM Processing Architecture
7 Processing Architecture
AIM adopts a client-server processing strategy to manage the relation between the system services and particular contexts. Figure 3 shows this client-server interface. Everything on the right of the figure pertains to the system. These services are constant across different client contexts. All screen management, menu organization, and coordinate transformations among the FOA, Working Area, Scaled View, and image data are performed by the system. The system also provides a library for drawing various geometrical entities (lines, points, rectangles, and run-length regions), moving the smart cursor, screen management (such as scrolling and scaling the screens), and menu management (such as adding contexts and context environments).

Each client context must handle six calls from the system: Context Select, Context Deselect, FOA Track, FOA Enter, FOA Exit, and Button Handler (as shown on the left side of figure 3). Context Select and Context Deselect are triggered when the operator activates and deactivates a context button respectively. This permits the client context to perform any necessary initialization and clean-up. The FOA Track function is called every time the smart cursor moves within the Working Area. This function is passed the FOA subimage, a pointer to the parent image, and the necessary image coordinates. The client context is free to draw to the Working Area using the system library, since the system maintains a refresh raster map to restore the screen under the smart cursor when the cursor moves again. The FOA Enter and FOA Exit functions are called when the user moves the system cursor into and out of the Working Area. When the cursor is in the Working Area, the system changes the cursor icon into the smart cursor icon. The Button Handler function is activated whenever a pointer (usually a mouse) button is pressed. This usually means that the operator accepts the interpretation suggested by the graphical feedback provided by the FOA Track function.
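Expressed as code, the six required entry points might form an abstract interface along the lines of the sketch below. This is a sketch under the assumption of a C++ class-based design; the exact signatures in AIMS may differ, and the Image and event types are placeholders.

// Hypothetical sketch of the six calls a client context must handle.
struct Image;           // image type supplied by the system services
struct Point { int x = 0, y = 0; };
struct ButtonEvent { Point position; int button = 0; };

class ClientContext {
public:
    virtual ~ClientContext() = default;

    // Triggered when the operator activates or deactivates the context button:
    // perform any necessary initialization and clean-up here.
    virtual void contextSelect() = 0;
    virtual void contextDeselect() = 0;

    // Called every time the smart cursor moves within the Working Area; receives
    // the FOA subimage, a pointer to the parent image, and the image coordinates.
    virtual void foaTrack(const Image& foaSubimage,
                          const Image* parentImage,
                          Point imageCoordinates) = 0;

    // Called when the system cursor enters or leaves the Working Area.
    virtual void foaEnter() = 0;
    virtual void foaExit() = 0;

    // Called when a pointer button is pressed, usually accepting the
    // interpretation suggested by the FOA Track feedback.
    virtual void buttonHandler(const ButtonEvent& event) = 0;
};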
[Figure 4 panel labels: Working Area, Graphical Context Representation, System Information, Smart Cursor, Graphical Dialog Box, Process Status]
Figure 4: AIM Interface

Since the system is built within the object-oriented paradigm (our system is implemented in C++), contexts are objects in the system. New contexts can therefore be added to the system as objects which handle these six functions. Contexts may be active, inactive, or stacked. When a context is selected from the Context Palette, it becomes the active context and its Context Select function is activated. If there is a previously active context, its Context Deselect function is called and it is made inactive. In the event that an abstraction shift is triggered (e.g. an object context triggers a component feature context), the original context is pushed onto a stack along with its associated working memory (or resources), and the new context is activated. We call the tuple of a context and its resources an interpretation environment. The system therefore maintains a stack of such interpretation environments. When the new context completes its task, its interpretation environment is popped off the stack and it is made inactive. At this point, the original context once again becomes the active context. The resources allocated to a context include an array to hold the FOA subimage as well as the refresh raster map. AIM maintains a central resource pool of these arrays, which it apportions to contexts when they are activated and retrieves when the contexts terminate.
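The stacking behavior described above might be sketched as follows. The types and resource fields are assumptions for illustration (the actual AIMS resource pool is not shown here); only the push/pop behavior described in the text is represented.

// Hypothetical sketch of the interpretation-environment stack.
#include <memory>
#include <stack>
#include <utility>
#include <vector>

class ClientContext;                       // the six-call interface sketched earlier

struct Resources {                         // working memory apportioned on activation
    std::vector<unsigned char> foaSubimage;
    std::vector<unsigned char> refreshRasterMap;
};

// The tuple of a context and its resources.
struct InterpretationEnvironment {
    std::shared_ptr<ClientContext> context;
    Resources resources;
};

class EnvironmentStack {
public:
    // Abstraction shift: suspend the current environment and start a nested one.
    void push(InterpretationEnvironment env) { stack_.push(std::move(env)); }

    // Nested task complete: discard its environment; the environment now on top
    // (the original context) becomes active again.
    void pop() { if (!stack_.empty()) stack_.pop(); }

    InterpretationEnvironment* active() {
        return stack_.empty() ? nullptr : &stack_.top();
    }

private:
    std::stack<InterpretationEnvironment> stack_;
};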
8 AIM in Interpretation of Vascular Angiography
Owing to the strong man-machine interaction model behind it and the user interface design principles used in its development, the AIM interface we developed is simple, yet effective and easy to use. Figure 4 shows the AIM interface developed for the interpretation of XRA images. The XRA image is displayed in the Working Area. This area is scrollable, so XRA images of any size can be displayed. The user can increase or decrease the size of the image at any time by choosing the appropriate item from the menubar. The Focus of Attention (FOA) is represented by the box shown on the image and is directed by the mouse.

The Schematic Map is in the upper left corner of the screen. As mentioned previously, this map gives object-level context to the system and gives the user graphical feedback. When the user points to a vessel on the Schematic Map with the mouse, information about this vessel is displayed in the window in the lower right corner of the screen. This information contains the vessel number (used as an index), the vessel name, the width, and any comment written by the user about the measurement process for this vessel. If no measurement has been taken for this vessel, the vessel has the default width stored in the database. When the user wants to select a vessel to be interpreted by the system, he simply clicks a mouse button while the pointer is on the vessel. The vessels are highlighted in different colors to give feedback on the selection and measurement processes. The Graphical Dialog Box is in the lower left corner of the screen. The part of the image under the FOA, the detected vessel edges, and magnified views of these two sub-images are displayed in the Graphical Dialog Box.

An important feature of the AIM system is the similarity between the interpretation task and the method currently used by the experts. In our system, every patient’s XRA images are stored on the storage device in a separate directory. The system creates thumbnail images of the original images at 128×128 and 256×256 pixels. The user can see all of the 128×128 thumbnail images at the same time in a window similar to the lightboard used by the doctors to inspect the XRAs. When the pointer is held over one of the thumbnail images for a few seconds, the larger 256×256 version of that image is displayed on the screen. This allows the user to inspect the image more carefully. When the user decides to choose one image for interpretation, he simply clicks the mouse button and the system brings the original version of this image into the Working
Area. Another important feature of the system is that the interpretation process is done in real time. Instead of processing the entire image, only the area under the FOA is processed, resulting in very fast processing speeds. The strong feedback mechanisms, both graphical and textual, during the interpretation process make the system effective and easy to use.

We believe another feature, the History of Measurement, will be very useful. Information about the measured vessels, including the image name, the coordinates of the measured points, the width, and the FOA size, is stored in a database for every measurement process. We can think of this database as a collection of snapshots of the measurement process. At any time the user can retrieve this information from the database and see the snapshot of that vessel’s measurement moment. The selection of the vessels is done on the Schematic Map. In the History of Measurement mode, when the user clicks the mouse button while the pointer is on a vessel, the snapshot is brought to the screen. This feature allows different experts to see the exact measurement locations for each vessel when they inspect measurements done by others.
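A measurement snapshot of this kind might be stored as a record like the one sketched below; the field names are assumptions for illustration, since the actual database schema is not given in this paper.

// Hypothetical sketch of a History of Measurement record.
#include <string>
#include <vector>

struct ImagePoint { int x = 0, y = 0; };

struct MeasurementRecord {
    std::string imageName;                   // XRA image on which the vessel was measured
    int vesselNumber = 0;                    // index of the vessel on the Schematic Map
    std::vector<ImagePoint> measuredPoints;  // coordinates of the measured points
    double width = 0.0;                      // measured vessel width
    int foaSize = 0;                         // FOA size used for this measurement
    std::string comment;                     // free-form note written by the user
};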
9 Conclusion
Neurovascular disease (e.g. aneurysms and vessel stenosis) contributes significantly to the number of patients who undergo neurosurgery, and the ability to extract and measure the vessels of the neurovascular tree is essential for the treatment of such patients. Hence tools which help neuroscientists obtain more accurate and detailed information on the neurovasculature are necessary. So far, little attention has been paid to the user interface in medical image interpretation systems. In recent years, the computer industry has recognized the importance of the interface to user productivity and effectiveness. We expect the same trend to be mirrored in medical and radiological systems, where the emphasis is still on capability and not on user efficiency, effectiveness, and productivity. The importance of such factors to the quality and cost of medical care is evident.

The AIM model we developed, which is motivated by linguistic and cognitive psychology, combines the human’s high-level reasoning with the computer’s computational power to extract brain vessels interactively. The abstraction hierarchy facilitates varying degrees of human intervention in the process. This is especially important in medical imaging, where the medical professional must
have ultimate control of and confidence in the system. This system will help specialized users, e.g. radiologists, physicians, and neurosurgeons, obtain more accurate measurements of the vessels with less effort, saving time and leading to better treatment planning for neurovascular disease.

While our current system operates on 2D XRA images, the theory may be extended to operate in 3D. In 2D, the FOA cursor (FOAC) points to an active square area in the data. In 3D, this area becomes a volume (a cube) in the image stack. The user moves the FOAC in 3D and the system locates entities satisfying the active context within the focus cube. The challenge in building such a system lies not in the volumetric image processing but in the complexity of interacting with and viewing volumetric information. Since AIMS is a real-time interpretation system, a 3D AIMS requires fast volumetric rendering to navigate through the volume. While traditional surface-based visualization methods are able to produce visually pleasing renderings, they require prior segmentation of the data. Since the purpose of AIM is to interpret and segment the data volume, fast volumetric rendering approaches are required for the user to interact directly with the unsegmented 3D data. Furthermore, navigating through the 3D data volume is more complex than in 2D. Intuitive 3D manipulation mechanisms are needed for medical professionals.

Another system extension suggested by our theory is the use of speech-based controls to provide greater flexibility in modifying the parameters of a particular context. This is especially needed as the contexts become more complex and transitions between contexts become more fluid. Consider the situation in which a radiologist is required to extract the surface of a tumor in a data volume. Tumor densities in magnetic resonance images may vary with respect to their contrast with the surrounding tissue. The surface of the tumor may curve and interact with blood vessels. In such a situation, the system may need to transition among contexts and modify the parameters of the smart cursor operators on the fly. By incorporating speech recognition technology, the radiologist may provide contextual information by ‘telling’ the AIMS that the surface contrast in a particular 3D FOA is ‘lighter’ or ‘darker’. She may tell the system to ‘switch to tumor-blood vessel boundary’ to switch the contextual model.

Such an extension is congruent with our AIM model for two important reasons. First, it enhances
the conversation model advanced by AIM. Second, it separates the modes for specifying context and FOA into vocal and motor control respectively. This addresses the problem of overloading the motor-control capacity of the user (the mouse being used for both context specification and FOA control). While we did not incorporate speech technology into our current system for logistical reasons, the current state of the art in speech recognition provides reliable performance in recognizing sets of keywords pertinent to the interpretation of classes of images.
References

[1] M.I. Posner, “Orienting of attention”, Quarterly Journal of Experimental Psychology, vol. 32, pp. 3–25, 1980.
[2] Deborah Mayhew, Principles and Guidelines in Software User Interface Design, Prentice-Hall Inc., 1992.
[3] J. Preece, Y. Rogers, H. Sharp, D. Benyon, S. Holland, and T. Carey, Human-Computer Interaction, Addison-Wesley Pub. Co., 1995.
[4] R. Desimone and J. Duncan, “Neural mechanisms of selective visual attention”, Annual Review of Neuroscience, vol. 18, pp. 193–222, 1995.
[5] Judy Delin, “Presupposition and shared knowledge in it-clefts”, Language and Cognitive Processes, vol. 10, pp. 97–120, 1995.
[6] Susan E. Brennan, “Centering attention in discourse”, Language and Cognitive Processes, vol. 10, pp. 137–167, 1995.
[7] Peter C. Gordon, Barbara J. Grosz, and Laura A. Gilliom, “Pronouns, names, and the centering of attention in discourse”, Cognitive Science, vol. 17, pp. 311–347, 1993.
[8] A. Allport, “Attention and control: Have we been asking the wrong questions? A critical review of twenty-five years”, in Attention and Performance XIV, pp. 1238–1251, MIT Press, 1993.
[9] Dov Sagi and Bela Julesz, ““Where” and “what” in vision”, Science, vol. 228, pp. 1217–1219, 1985.
[10] A. Treisman, “Search, similarity, and integration of features between and within dimensions”, Journal of Experimental Psychology: Human Perception and Performance, vol. 17, pp. 652–676, 1991.
[11] A. Treisman, “Properties, parts, and objects”, in K. R. Boff, L. Kaufman, and J. P. Thomas, editors, The Handbook of Perception and Human Performance, Wiley-Interscience, New York, 1987.
[12] A. Treisman, “Perceptual grouping and attention in visual search for features and for objects”, Journal of Experimental Psychology: Human Perception and Performance, vol. 8, pp. 194–214, 1982.
[13] A. Treisman and G. Gelade, “A feature integration theory of attention”, Cognitive Psychology, vol. 12, pp. 97–136, 1980.
[14] M.I. Posner, C.R.R. Snyder, and B.J. Davidson, “Attention and the detection of signals”, Journal of Experimental Psychology: General, vol. 109, pp. 160–174, 1980.
[15] V. Nougier, B. Rossi, C. Alain, and F. Taddei, “Evidence of strategic effects in the modulation of orienting of attention”, Ergonomics, vol. 39, pp. 1119–1133, 1996.
[16] C.W. Eriksen and Y.Y. Yeh, “Allocation of attention in the visual field”, Journal of Experimental Psychology, vol. 11, pp. 583–597, 1985.
[17] C.W. Eriksen and J.D. St. James, “Visual attention within and around the field of focal attention: A zoom lens model”, Perception and Psychophysics, vol. 40, pp. 225–240, 1986.
[18] N. Lavie and J. Driver, “On the spatial extent of attention in object-based visual selection”, Perception and Psychophysics, vol. 58, pp. 1238–1251, 1996.
[19] Erhard Hinrichs and Livia Polanyi, “Pointing the way: A unified treatment of referential gesture in interactive discourse”, in Papers from the Parasession on Pragmatics and Grammatical Theory, Chicago Linguistics Society, 22nd Meeting, pp. 71–78, 1986.
[20] Anya Hurlbert and Tomaso Poggio, “Visual information: Do computers need attention?”, Nature, vol. 321, pp. 651–652, 1986.
[21] J.K. Tsotsos, S.M. Culhane, W.Y.K. Wai, Y.H. Lai, N. Davis, and F. Nuflo, “Modeling visual attention via selective tuning”, Artificial Intelligence, vol. 78, pp. 507–545, 1995.
[22] J. Basak and S.K. Pal, “PSYCOP – A psychologically motivated connectionist system for object perception”, IEEE Transactions on Neural Networks, vol. 6, pp. 1337–1354, 1995.