HyperImages: Using object recognition for navigation through images in multimedia

David Lowe, Athula Ginige
School of Electrical Engineering, University of Technology, Sydney
P.O. Box 123, Broadway, 2007, Australia
ABSTRACT

Multimedia involves the use of multiple forms of communication media in an interactive and integrated manner. At present, textual data is the media predominantly used to provide the interactivity, due to the ease with which discrete semantic elements are identified. It is common practice to follow links from words or phrases within text to associated information elsewhere in the database. To achieve a similar degree of functionality with visual information typically requires that each image (or video sequence) be processed by hand, indicating the objects and locations within the image - a process that is excessively expensive and time-consuming for large databases. This paper describes the implementation of a simple object recognition system that allows the specification of 3-dimensional models that can then be used to recognise objects within any image, in an analogous fashion to words within text. This enables image data to become a truly active media within a multimedia database. It provides a significantly enhanced level of functionality while keeping the authoring effort to a minimum. The basic algorithms are described and then an example application is outlined, along with feedback from users of the system.

Keywords: multimedia, authoring, active images, object recognition, computer vision, navigation
1. INTRODUCTION

Multimedia is a rather eclectic technology, drawing on a number of enabling technologies - information theory, man-machine interfaces, information technologies, and database handling, to name but a few. These technologies are combined to create an application which purportedly integrates a number of different media into an interactive whole. Note the key points in this description: integrates, media, and interactive. To date most multimedia systems have not successfully reached this goal. Although they tend to be interactive and integrated, and include multiple forms of media, these media are not combined into a cohesive whole - i.e. all forms of media are not integrated. In fact most existing applications which call themselves multimedia would be more appropriately called multiple media hypertext systems. They tend to be hypertext systems (i.e. the textual information provides the interactivity) with the additional media added on, but not truly integrated into the application.

We can understand this a little better by considering the use of textual data within multimedia, an area that has received considerable research attention. The text can be stored, analysed, manipulated, and generated synthetically. Essentially the text can be treated as consisting of discrete entities - hypercomponents - (words, sentences, paragraphs, etc.) which obey a series of syntactic rules describing the inter-relationships. Within the multimedia application these hypercomponents can be used to create nodes, anchors, links, etc. These elements provide the navigation functionality which supplies the interactivity - the core of any multimedia application. A number of authoring tools exist which assist in the conversion of textual information into an appropriate structure.

The evolution of visual information is still at a much lower level, and it is predominantly treated as a passive media. Visual information in its raw form is highly unstructured and yet very commonplace (documentary video tapes, image and photographic collections and databases, etc.). Many visual information applications (such as medical imaging, robotics, and interactive multimedia) make use of the visual information in a highly structured format. Conceptually, visual information can be treated in the same way as text: as consisting of discrete entities which obey certain syntactic rules. The main problem lies in identifying these entities and the associated rules, and then interpreting them. When this is achieved, visual information can become an active media as powerful as (and in many cases more powerful than) textual information. It is this premise which has driven this work. Although there has been considerable work on structuring textual information, there has been minimal work performed on structuring visual information to suit multimedia applications. Yet the development of practical multimedia systems requires the use of both suitable information and the appropriate structuring of this information. This information structuring is critical to the development of high-quality visual information applications. One of the major obstacles hindering the advancement and commercial acceptance of these applications is the cost of structuring the vast amount of visual information. At present most existing visual information databases have been handcrafted; a process which is excessively expensive and time consuming.

This paper discusses the role of active visual information in multimedia. Section 2 introduces the concept of multimedia and shows that most traditional multimedia systems incorporate only active textual data. Active visual information is introduced, and it is shown why this is beneficial. Methods of enabling active visual information in multimedia are then considered. Section 3 discusses an application which implements one of these methods of providing active visual information. This application is based on an object recognition system that allows the specification of 3-dimensional models that can then be used to recognise objects within images, in an analogous fashion to words within text. This enables image data to become a truly active media within a multimedia database. It provides a significantly enhanced level of functionality while keeping the authoring effort to a minimum. Section 4 describes the results of using the application, and Section 5 provides relevant conclusions and ideas for further research. The focus of this paper is on the use of active image data within multimedia, and the ways in which it can be achieved, rather than the specifics of the particular object recognition scheme used in this study.
2. MULTIMEDIA SYSTEMS AND ACTIVE VISUAL INFORMATION

2.1. Traditional Multimedia Applications

Multimedia is a technology which has been enjoying significant attention within the last few years. It involves multiple forms of communication media (such as text, audio, video, and images) being used interactively. At present the most common forms of multimedia applications are surrogate travel, telemarketing, telemedicine and interactive learning. These applications can all make extensive use of a number of different media, including text, audio, and visual information. One of the key elements of any multimedia application is interactivity - the ability to interact with the media in a way that improves the accessibility, useability and presentation of the information. The level of interactivity which can occur is related to the way in which we treat each of the media.
Figure 1: Typical linking of information in a multimedia application: The links are restricted to the textual information, and other forms of information act as annotations to the text. [Figure: two sample text nodes, labelled 'Node', joined by labelled 'Links' from a word 'Anchor'; an image within one node serves only as an annotation.]
Traditional forms of electronic information (text, numerical data, etc.) are highly structured. This structure is typically critical to the effective use of this information. For example, in multimedia systems text can be readily stored, analysed, and manipulated. Essentially textual information is grouped into 'nodes' (where each node is a piece of information on a specific topic). Each node consists of discrete entities (words, sentences, paragraphs etc.). Appropriate entities can be used as specialised components. For example, a word can act as an anchor point for links to other nodes (e.g. if the user 'clicks' the mouse on the word, the link is traversed, and the destination node is displayed). The effective use of this textual information relies strongly on appropriate structuring of the information. This is also true for most alternative uses of similar information.

The evolution of visual information (such as images and video) is still at a much lower level. In general it is poorly structured. For example, in traditional multimedia systems it is predominantly treated as a passive media - essentially acting as an annotation to the textual information. This is illustrated in Figure 1.

A number of tools exist which assist in the process of structuring the raw textual information into a form suitable for multimedia [6]. This requires the partitioning of the information into appropriate nodes, the identification of the hypercomponents, and the specification of appropriate links. Recent authoring tools (such as HART [11]) have begun to focus on providing a greater degree of support for the author during this process. Support is typically provided through both procedural guidance (assisting in the process of converting the information) and intelligent assistance (providing context-dependent choices for the author - such as suggesting appropriate keywords from the text). To date, these tools still focus almost entirely on textual information. At present there is minimal support for assisting the authoring process of visual information. By improving this support the development of visual information applications becomes much simpler, and available to a much larger audience.
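To make the node, anchor and link structure concrete, the following is a minimal illustrative sketch (in Python, with hypothetical names; it is not the representation used by any particular authoring tool) of how textual hypercomponents can provide navigation:

    # Illustrative sketch only: a minimal node/anchor/link model.
    from dataclasses import dataclass, field

    @dataclass
    class Anchor:
        start: int          # character offsets of the anchor word
        end: int
        target_node: str    # identifier of the destination node

    @dataclass
    class Node:
        node_id: str
        text: str
        anchors: list = field(default_factory=list)

        def follow(self, offset):
            """Return the destination node for a 'click' at a text offset."""
            for a in self.anchors:
                if a.start <= offset < a.end:
                    return a.target_node
            return None

    # A word acting as an anchor point for a link to another node:
    node = Node("groceries", "Select the cereal for more detail.")
    node.anchors.append(Anchor(11, 17, "cereal"))
    assert node.follow(13) == "cereal"   # a click on 'cereal' traverses the link

The point of this sketch is simply that the anchors are trivially located within the text; the remainder of this paper is concerned with obtaining the equivalent anchors within images.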
2.2. Active Visual Information Multimedia Applications

Conceptually, visual information should be able to be treated in the same way as text. For example, we would like the user to be able to 'click' the mouse button on appropriate objects within the image or video sequence, and then have suitable information about that object presented. This is illustrated in Figure 2.
Figure 2: Multimedia application incorporating active visual information: The elements of the visual information now become hypercomponents which can be used for linking in the same way as the textual hypercomponents. [Figure: three sample nodes, with links anchored both on words and on objects within an image.]

The main problem with visual information lies in identifying the visual entities within the information to be used in the structuring process. When this is achieved, visual information can be treated as an active media as powerful as (and in many cases more powerful than) textual data. By active, we mean that the visual information is used interactively within the application, rather than simply being an annotation to the other media (such as text). Identifying entities and rules within text is relatively straightforward, being simply a matter of identifying words, then sentences, etc. Identifying objects in visual information is significantly more difficult.
2.3. Creating active visual information

The volume of research that has been performed on the role and representation of image data within multimedia systems is relatively small. The primary research focuses for visual information in multimedia have been on information management, data modelling, database issues and standardisation issues [12]. In general, image and video data, when used, has been stored in the form that was most convenient given the hardware and software tools provided: tools that were typically not designed with multimedia in mind. Alternatively, hardware that was developed for other imaging applications has been adapted for use. An example of this is the use of the JPEG coding scheme (with the associated hardware CODEC) to code and decode the image data. Almost all these schemes suffer from the same limitation: the image representation is not particularly appropriate for a multimedia application. Edgar, Steffen and Newman [4] look at current technologies in this area and the effect that these have had. This includes advances in image handling capabilities, the development of image compression standards, new storage techniques and network applications. The more fundamental issue of the image representation and structuring itself is, however, not raised.

In order to adapt image data to a multimedia environment and make the visual information truly active, the usual techniques to date have involved adding additional information to the representation without modifying the underlying structure. An example of this approach is manually specifying the location and orientation of objects within the image (for example, drawing around them using the mouse) and storing this data with the image to create a 'marked-up' image [3]. An alternative approach is to manually add a textual description of the image. This description can then be processed along the same lines as normal textual information [1]. Both of these methods are time consuming, cumbersome, and prone to error - especially for large image sets, as each image needs to be independently manually annotated. Some research has been attempting to make image data active, rather than passive. In almost all cases [3] this work has focussed on unrealistically simplistic images, usually computer generated, two-dimensional or sketches. Nevertheless, this work does indicate the validity and possible applications of this research area.

If we wish visual information to be widely used as an active media in multimedia applications then we need to improve our ability to handle visual information during the authoring process. As mentioned above, a number of multimedia authoring tools provide both procedural guidance and intelligent assistance for textual information. The first of these two forms of assistance involves identifying the individual hypercomponents within the media. For text, this involves extracting words, sentences, paragraphs, and other suitable blocks of text. For images this will involve identifying regions (foreground, background, surfaces, etc.), objects, and other relevant information. For video this involves identifying episodes and shots, in addition to foreground, background and objects. The intelligent assistance involves structuring the extracted components in a suitable fashion - identifying appropriate anchor points, links, keywords (or keyobjects) etc. In order to handle visual information effectively during the authoring process we need to investigate methods for providing these two forms of assistance.
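To illustrate why the manual mark-up approach described above scales poorly, the sketch below (hypothetical Python, not the scheme of [3]) stores hand-drawn object regions alongside an image; every image in the database must carry its own annotations of this kind:

    # Illustrative sketch of a manually 'marked-up' image: each object's
    # outline is traced by hand and stored with the image. Hypothetical names.
    from dataclasses import dataclass, field

    @dataclass
    class Region:
        label: str                 # e.g. "cereal box"
        polygon: list              # hand-drawn outline, [(x, y), ...]
        target_node: str           # link destination for this object

    @dataclass
    class MarkedUpImage:
        image_file: str
        regions: list = field(default_factory=list)

    def point_in_polygon(x, y, poly):
        """Ray-casting test: is the click inside the hand-drawn outline?"""
        inside = False
        n = len(poly)
        for i in range(n):
            x1, y1 = poly[i]
            x2, y2 = poly[(i + 1) % n]
            if (y1 > y) != (y2 > y):
                if x < (x2 - x1) * (y - y1) / (y2 - y1) + x1:
                    inside = not inside
        return inside

    def resolve_click(img, x, y):
        for r in img.regions:
            if point_in_polygon(x, y, r.polygon):
                return r.target_node
        return None

    img = MarkedUpImage("shelf.img")
    img.regions.append(Region("cereal box",
                              [(10, 10), (60, 10), (60, 90), (10, 90)],
                              target_node="cereal"))
    print(resolve_click(img, 30, 40))   # -> "cereal"

Each new image requires the author to trace every object again; this per-image effort is precisely what an object recognition scheme avoids.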
Ideally, the ultimate aim would be to be able to treat the image data in a purely analogous fashion to the way in which text is currently treated. This involves abstracting the concept of modular data, in which an object's components are treated identically from an external point of view irrespective of the form of the data: text, audio, image, video, or animation. The individual atoms of each data set can be identified, along with their context, allowing them to be used as the source of automatically generated links.

The project of which this paper forms a small part is aimed at developing, investigating, and evaluating tools which will provide suitable assistance for the author of visual information. These tools are essential if we wish to add value to existing visual information. In multimedia, they will enable authors to more closely integrate the visual information with the other forms of information to be used, resulting in significantly higher quality multimedia applications (greater useability, higher cohesion, and greater flexibility). We are no longer in a position where it is satisfactory (or even in many cases possible) to hand-craft applications.
3. DEMONSTRATION APPLICATION

When structuring text-based systems, procedural guidance is provided by automating the process of extracting words, sentences, paragraphs, etc. Intelligent assistance is provided by identifying suitable words to act as keywords, anchor points, and possible destination nodes for links. If we are to achieve a similar level of support for visual information then we need to automate the process of extracting objects from the visual information, and then identifying suitable objects to be used in the structuring process. The most obvious method of providing this assistance is to use object recognition schemes. In order to investigate the applicability of this, a demonstration application was developed which integrated a simple object recognition scheme into a multimedia application.
3.1. Application Description

The multimedia application which was developed, although very simplistic in comparison to existing multimedia applications, demonstrates one possible technique for enabling the use of active image data rather than passive image data, as well as providing a platform for investigating the general performance and change in functionality. No attempt was made to create a commercial-quality multimedia application, as the primary aim was to investigate the implications of the imaging principles involved, rather than the various multimedia principles. As a result the overall operation and performance of the system from a multimedia point of view is quite simplistic. Only sufficient functionality to investigate the image algorithms has been incorporated.

The application that was chosen was a database of grocery items. A large selection of groceries was included in the database, and the information on each of these was included in textual, audio and image formats. The information that was available varied over a broad range, from broad category descriptions to specific items. Within the broader categories it was possible to move to more specific items using the appropriate links, based on selecting individual words or selecting objects within the images. The database consisted of approximately 30 objects and 50 images, many of which contained multiple objects. The objects and images were arranged into a hierarchical structure. The development and usage of this application are detailed elsewhere [9].
3.2. Recognition Process

The recognition process which was used formed the central core of the multimedia application. As such, it will be described in some detail (full details are given elsewhere [7]). In previous work related to image representations and image coding [10], a representation was developed which decomposed an image into a number of hierarchical information layers. These layers contained image features such as edge information (and other primary image discontinuities), texture, shading, and colour. It was recognised that these layers could be independently used in the recognition process.

Image data typically contains a broad variety of information types, and significant research has been focused on the use of individual information forms. Very little research has, however, been performed on the use of multiple forms of image information in the recognition process, the relative importance of the various information forms under various conditions, and the way in which this data should be fused. Aloimonos [2] has performed work on the general problem of fusing data from various sources:

“Most of the basic problems in computer vision, as formulated, admit infinitely many solutions. But vision is full of redundancy and there are several sources of information that if combined can provide unique solutions for a problem. Furthermore, even if a problem as formulated has a unique solution, in most cases it is unstable. Combining information sources leads to robust solutions. ... Another approach to the solution of ill-posed problems would be to look for information sources in order to augment the number of physical constraints and achieve uniqueness of the parameters to be computed.”

Aloimonos' work has focussed on shape computation from various sources, such as combining several of the following: shading, motion, stereo, texture, contour. Wu [13] has also performed similar work. Using these ideas as a basis, a recognition scheme which used all the information layers available was developed. The system, although relatively primitive, aimed to identify known polygonal objects within natural scenes. Since the primary aim was to investigate the use of object recognition in multimedia, the algorithms were not optimised to a great degree, or extended to more generalised cases.

One of the most general methods adopted for object recognition is the matching of image features to object features, and then the analysis of these matches to determine object and scene parameters. It is worth noting that this is often an inherently iterative approach, since the matchings, especially at a low level, are often ambiguous with a large number of possible interpretations. It is typically not possible to select the most probable interpretation until higher level analysis has been performed. For example, a probable set of low-level matches may be obtained, and then from this the object parameters can be determined. It may then turn out that the object parameters are contradictory and the original matches have to be abandoned. This naturally leads to a scheme where, based on the image information, possible matchings are generated and subsequently hypotheses relating to possible image interpretations are generated. Each hypothesis should have a particular probability associated with its likelihood. These interpretations are analysed and subsequently either abandoned or accepted. Each level of the processing should be able to provide feedback to the earlier stages.

Additionally, the various forms of information contained within the image can be of varying use for generating particular hypotheses. For example, texture is only of peripheral use in the early stages of most identification problems, but very important in a few. Since each layer of information can be used for identification (and it is impossible to say which forms of information will be important without knowledge of the image content), each layer of the image information hierarchy is used to attempt to perform identification simultaneously. Additionally, each layer is capable of providing feedback to the other layers based on its interpretation of the scene. The basic structure of the algorithm implementation is shown in Figure 3. The first stage of processing involves the pre-processing of the image to extract the information hierarchy. Once this has been achieved, each layer of the hierarchy is used to independently find matches between image features and object features. Each matching will have a probability associated with it. These matches are then combined to generate hypotheses as to the image's content (i.e. hypothesising the existence of specific known objects). Again, each hypothesis will have an associated probability. These hypotheses are then assessed, and the resultant accepted hypotheses interpreted to identify the relevant parameters.
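This overall control structure can be sketched as follows (illustrative Python only; the function names are hypothetical placeholders, and the actual implementation is described elsewhere [7]):

    # Illustrative sketch of the hypothesise-and-test control structure.
    def recognise(pixel_image, object_models, extract_hierarchy,
                  match_layer, generate_hypotheses, assess_hypotheses):
        # 1. Pre-process: decompose the image into its information
        #    hierarchy (edge, texture, shading, and colour layers).
        layers = extract_hierarchy(pixel_image)

        # 2. Each layer independently matches image features to object
        #    features; every match carries an associated probability.
        matches = {name: match_layer(name, features, object_models)
                   for name, features in layers.items()}

        # 3. Combine the per-layer matches into hypotheses about which
        #    known objects are present, each with its own probability.
        hypotheses = generate_hypotheses(matches)

        # 4. Accept or abandon hypotheses; in the full scheme, rejected
        #    interpretations can feed back to the earlier stages.
        return assess_hypotheses(hypotheses, layers)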
Figure 3: Recognition algorithm structure: The system performs recognition by using the parallel extraction and matching of feature sets and the subsequent generation of object hypotheses. [Figure: a pipeline from Original Scene, through Capture Image, Canonical Pixel Image, and Extract Information (edge, texture, shading, colour), to the Hierarchical Representation; Match Features produces Feature Matches, Generate Hypotheses produces Object Hypotheses, and Assess Hypotheses produces the Scene Description.]

The initial phase of the algorithm is the matching of image and object features. This matching process needs to occur for all layers of the hierarchy. The actual matching process, and the form of the features to be matched, will be dependent on the layer of the hierarchy being processed. The most important of the feature matches is based on the edge information. A number of methods for performing this were investigated, and a technique based on extracting high-significance structures made from sets of straight lines was used [8]. In this technique, lines are extracted from the edge information, and then pairs of lines are investigated for their level of significance (based on factors such as co-termination, parallel lines, etc.). Sets of lines of high significance are matched to corresponding sets of lines extracted from the object models. Texture, shading, and colour information is matched simply by identifying image regions with similar characteristics to surfaces of the object models.
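As an illustration of the line-pair significance idea, the following is a crude hypothetical scoring function (Python; it is not the measure of [8]), favouring pairs of lines that are nearly parallel or that nearly co-terminate:

    # Illustrative sketch only: score a pair of straight line segments for
    # perceptual significance. The scoring is a hypothetical stand-in.
    import math

    def line_angle(line):
        (x1, y1), (x2, y2) = line
        return math.atan2(y2 - y1, x2 - x1)

    def endpoint_gap(line_a, line_b):
        """Smallest distance between any endpoint of a and any endpoint of b."""
        return min(math.dist(p, q) for p in line_a for q in line_b)

    def pair_significance(line_a, line_b, gap_scale=5.0):
        """Higher scores for nearly parallel or nearly co-terminating pairs."""
        angle = abs(line_angle(line_a) - line_angle(line_b)) % math.pi
        parallelism = 1.0 - min(angle, math.pi - angle) / (math.pi / 2)
        cotermination = math.exp(-endpoint_gap(line_a, line_b) / gap_scale)
        return max(parallelism, cotermination)

    # Two nearly parallel line segments score highly:
    a = ((0, 0), (10, 0))
    b = ((0, 3), (10, 3.2))
    print(pair_significance(a, b))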
Once feature matches, or mismatches, have been generated they need to be combined to obtain hypotheses relating to the proposed existence of objects within the image. These hypotheses do not at this stage need to be consistent with each other. Feature matches or correspondences can indicate many possible interpretations of the same image features. All interpretations should be formalised as hypotheses; the hypothesis assessment will choose the most likely of them. In this work only single images are being considered, and we can therefore only identify the hypothesised object's location (in 3D world co-ordinates) and orientation, and a significance level associated with the hypothesis generation. The object size was assumed to be fixed, as was all other information regarding the object's physical structure and appearance. The hypotheses are generated based on the edge matches (as these are the only ones which can provide sufficient information to determine the object position and orientation). Once a hypothesis has been generated its significance level is then adjusted based on the matches from the shading, texture and colour information.

Once the hypotheses have been generated they need to be assessed to select those that are most likely. This also involves taking into account contradictions between the various hypotheses. It is highly likely that various hypotheses will contain overlapping features, and therefore the acceptance of one hypothesis will preclude the possibility of others. The assessment involves mapping the hypotheses back into the image and analysing the degree of conformance. This includes considering both the degree of edge conformance and the degree of surface uniformity between the hypothesised object and the image. The hypothesis with the best degree of conformance is accepted and then all contradicting hypotheses are disregarded. This process continues until no further acceptable hypotheses remain.
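The assessment stage thus reduces to a greedy selection loop, sketched below (illustrative Python with hypothetical names; the conformance and contradiction tests stand in for the edge-conformance and surface-uniformity measures described above, and the threshold is an assumed parameter):

    # Illustrative sketch: accept the best-conforming hypothesis, discard
    # all hypotheses that contradict it (e.g. by claiming the same image
    # features), and repeat until no acceptable hypotheses remain.
    def assess(hypotheses, conformance, contradicts, threshold=0.5):
        """conformance(h) -> degree of conformance in [0, 1];
           contradicts(h1, h2) -> True if the hypotheses overlap."""
        accepted = []
        remaining = list(hypotheses)
        while remaining:
            best = max(remaining, key=conformance)
            if conformance(best) < threshold:   # nothing acceptable remains
                break
            accepted.append(best)
            remaining = [h for h in remaining
                         if h is not best and not contradicts(h, best)]
        return accepted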
3.3. Application Implementation

The multimedia application was developed on a Sun SPARCstation running OpenWindows version 3 with an X-Windows interface. The application implementation was performed in such a way as to maximise the flexibility in terms of developing additional databases.

The application used several different specification files (a hypothetical sketch of these files is given at the end of this section). The first of these was a dictionary file. This contained a list of elements, and for each element three data files were specified: an audio file, a text file, and an image file. Thus each element in the dictionary has a complete set of information available. The dictionary files do not specify the database, but rather act as a reference for the particular elements. The second specification file was the database file itself. This file specifies a particular database application. It contains a list of dictionary files, model files (to specify the models used to recognise objects), and static element links. Using this approach allows the user to develop a library of element data (specified in the dictionary files) and then to build databases simply by combining the relevant dictionaries. This separates the data gathering and conversion process from the authoring process; a significant advantage over most existing multimedia tools.

Finally, the application itself was coded in such a way as to make the user interface as simple and logical as possible. The images are processed offline to extract the information hierarchy. They are then loaded in when required for a particular database element. The display routines decode this data and regenerate the images progressively, so that the user has an indication of the image content as early as possible. If the user selects an object within the image (by clicking on it with the mouse) then the recognition scheme will attempt to recognise the object using the information representation, rather than having to regenerate the canonical image. If the recognition is successful the application will jump to an appropriate database element (i.e. image, text and audio that is relevant to the selected object). Figure 4 is a snapshot of the screen containing the multimedia application. The user has just moved the mouse over an object in the image and pressed the left mouse button. The object was identified and the appropriate link will subsequently be traversed.

This application did not attempt to address any of the broader issues of information handling, database management, etc. The sole purpose was to investigate one method for automating the process of generating active image information in a multimedia application, and to then consider the implications of this in terms of the effective change in the application functionality.
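For illustration, the two specification files might look as follows (the file syntax and names here are entirely hypothetical; the actual formats are not given in this paper):

    # Illustrative sketch only: hypothetical dictionary and database files,
    # and a trivial parser for the dictionary format.
    DICTIONARY_FILE = """\
    cereal   cereal.au   cereal.txt   cereal.img
    soup     soup.au     soup.txt     soup.img
    """

    DATABASE_FILE = """\
    dictionary groceries.dict
    model      cereal.mdl
    model      soup.mdl
    link       cereal soup
    """

    def parse_dictionary(text):
        entries = {}
        for line in text.splitlines():
            if line.strip():
                element, audio, txt, image = line.split()
                entries[element] = {"audio": audio, "text": txt, "image": image}
        return entries

    print(parse_dictionary(DICTIONARY_FILE)["cereal"]["image"])   # cereal.img

The design point is the separation of concerns: the dictionary files form a reusable library of element data, while the database file composes dictionaries, recognition models, and static links into a particular application.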
Figure 4: Snapshot of screen containing the multimedia application: A link which has an object from within an image as an anchor has just been triggered and is about to be traversed.
4. RESULTS AND PERFORMANCE

After the application was developed and a typical database installed, the system was evaluated. This evaluation included both a consideration of authoring issues and useability issues. The useability assessment encompassed both a subjective user's point of view and quantitative measurements of the performance.
4.1. Authoring Evaluation

The process of authoring the database required the author to manually generate the models used for the object recognition. Once this step was completed the author did not need to consider the visual information again; the application which was developed handled the creation of automatic links. The most difficult part of the authoring process was therefore the generation of the object models. This required the specification of a wireframe model of the object and detailing of the surface shading, colour and texture. It was found that for this approach to be of practical use, this process would need to be automated. The model specification required a comparable effort to the manual identification of the objects (i.e. manual authoring). The primary difference between a manual authoring approach and this automated approach relates to the ongoing effort. The effort required is related to the number of objects to be identified, rather than the number of images in the database. Once the models have been created, additional images can be added to the database without requiring any additional authoring effort (apart from any data capture and conversion which is required). For example, once the approximately 30 object models in the demonstration database had been specified, each of its 50 images could be added without any per-image mark-up.
4.2. Useability Assessment

Once the application was completed and the demonstration database installed, a number of people were asked to use the system and comment on its useability. Before these comments are outlined, a number of relevant issues should be raised regarding the individuals who tested the application and factors that may have affected their reactions. The total number of people who assessed the system was fourteen. Each of these people completed an extensive survey relating to their experience with computers, software, and multimedia prior to using the demonstration application. The initial level of familiarity with this style of software was quite variable. Six of the people had used a multimedia application previously, ten had used some form of context-sensitive help system, and all but two had used computers to at least a moderate degree. A number of additional points are worth noting.

• The functionality of the application is relatively low when compared to commercial systems. This has several implications. The first of these is that this immediately made the users of the system less satisfied with the system overall. It is also likely to have had the effect of making the additions to the functionality (in terms of the extra imaging abilities) seem better in comparison.

• The quantity of material in the database is quite limited, as is its scope. As a result the users can very quickly explore the entire database, and subsequently become bored with it.

• The use of a multimedia database that allowed active image data was somewhat of a novelty. As a result the general impressions were probably more favourable. A more extended period of use would be required to remove this type of effect.

The users were each given half an hour to investigate the application. For the first half of this time the images were set to be passive, and for the following fifteen minutes the images were allowed to be active. The users were then asked to complete a questionnaire regarding their impressions of the system and to answer various questions rating the performance of the system. Table 1 lists the responses to the questions that were asked of the users of the multimedia application.
                                                                  % of users responding
Question                                                          1    2    3    4    5    X
The use of active images significantly improved the
useability of the multimedia application                          0    7   29   43   21    0
The inability to recognise all objects was a significant
problem                                                           0   14   21   36   29    0
The layered loading of images was distracting                     7    0   43   29   14    7
The time taken to load images was acceptable                     14   21   43   14    0    7
The system was sufficiently robust in terms of not
recognising objects incorrectly                                   0    0    7   14   64   14
The layered loading of images was useful in obtaining an
early indication of the image content                             0   14   50   14    0   21
When a link was available from both the text and the image,
in general the image was used                                     0   21   21   14   43    0
More intelligence needs to be added to the destination of
the links from the images                                         7   21   29   14    0   36
Table 1: User ratings of the multimedia application: This lists the range of values for various ratings for the listed questions (5=strongly agree, 4=agree, 3=neutral, 2=disagree, 1=strongly disagree, X=don't know).

From the comments and ratings provided by the users it is possible to draw a number of conclusions about the performance of the system and to identify possible problems.

• In general the users seemed to be quite impressed with the general idea of active image data. All major criticisms were aimed at implementation and usage issues rather than the concept. This is borne out by both their comments and the degree to which the image data was used.

• The major criticism that users had of the system related to the data links. Often these links led to unexpected places, or led to information which was too specific. This problem is, however, a multimedia issue rather than an imaging issue.

• The users were almost unanimous in their acceptance of the robustness of the system. There was no criticism at all relating to the incorrect identification of objects, though slight problems were encountered with selected objects not being identified.

• The major advantage that was identified by the users of the system was the improved ease of use that was obtained by making the image data active. In general the users found it much easier to locate and select an object within an image than a word within a document.

• Several users criticised the amount of time that it took to load the image data (these users were, in all cases, those who were least experienced with computer systems). This, however, is a problem which is not directly related to the particular algorithms. Typically 70 to 90% of the image retrieval time is taken up by network activity, X-Window communications, and other processing tasks which are independent of the specific image representation.

Only minimal tests were performed analysing the quantitative performance of the application. Most of the necessary testing has been performed individually on the coding and vision algorithms [9]. These involved assessing the compression ratios that were obtained for a variety of images, the accuracy and robustness of the recognition algorithms, and the processing times required. The only quantitative testing that was performed directly on the multimedia application was to ensure that the integration of the algorithms did not lead to a degradation of the performance. The only variation that occurred was in the processing time that was required. This is as expected, due to the change in environment, platform and background processing. The timing requirements were reasonable in all cases apart from the recognition of objects in complex images. As was mentioned previously, this was solved by performing this phase off-line and storing the results in a file which could be accessed during execution. Thus each image, after it is captured, is processed, identifying all known objects within the image. It is then compressed and stored.
5. ANALYSIS

Based on the development and use of the prototype HyperImage application, we can draw a number of significant conclusions related to future directions of authoring visual information in multimedia applications.
5.1. The role of active visual information

The application which was developed strongly illustrated the important role which visual information can play in multimedia applications. One of the most telling statistics was the degree to which links from visual information were used in preference to corresponding links from textual information. It was quite obvious (and to be expected) that the users found it simpler to interact with the visual information than the textual information. From this we can conclude that if multimedia is to reach its potential then we need to ensure that the visual information is fully integrated into multimedia applications. This includes the ability for the visual information to provide interactivity. We have recognised that for this to occur manual authoring is impractical - we need to consider ways to avoid the need to hand-craft these applications.
5.2. Using object recognition

Although the application which was developed had a high level of success, this was rather artificial. It certainly illustrated the appropriateness of using active visual information. However, it should be recognised that the visual information within the application was limited to relatively simple objects. Every object was rectilinear, relatively simple in shape, and had simple shading, texture, and colour. It was only because the objects were so simple that the object recognition scheme had such success in correctly identifying them. For an object recognition scheme to be effective in implementing active visual information in multimedia it needs to satisfy at least two criteria. Firstly, it must be robust, reliable, and consistent for a very wide range of applications and objects. Secondly, it must be very straightforward to expand the object database which it uses to identify objects. The object recognition scheme used for HyperImage satisfies neither of these criteria. Much research is occurring in the field of computer vision, and great success has been achieved in restricted application domains and for restricted image sets. Nevertheless, a general object recognition scheme which could handle the broad range of visual data present in multimedia applications is likely to be a considerable distance off. In the long term, object recognition will become increasingly important in multimedia authoring and multimedia applications (as has been foreshadowed by experiments such as HyperImage). In the shorter term, however, object recognition would appear to be insufficiently mature, except in perhaps very isolated cases.
5.3. Other approaches to authoring

Having accepted that object recognition is likely to be impractical for general multimedia applications, we need to consider possible alternatives. Previously, our view of the approach to automating the authoring process was outlined: support is provided through both procedural guidance (assisting in the process of converting the information) and intelligent assistance (providing context-dependent choices for the author). Considering this view, and recognising that we cannot completely automate the authoring process, we can semi-automate the process. In order to provide an effective analysis of the visual information in multimedia, we can either restrict the visual information - which is impractical for this application - or restrict the analysis which we are performing. Authoring assistance can be provided by combining appropriate analysis tools with the interaction of the author to guide the analysis. For example, procedural guidance can be performed by using analysis tools (such as segmentation) to assist the multimedia author in identifying possible objects. The author will interact with the analysis tools, guiding them where necessary, and providing the necessary control. The analysis tools are used essentially to provide assistance to the author, rather than performing the entire authoring process. The authors' current research is following this path - investigating methods of using image analysis tools to semi-automate the authoring process which has previously been performed by hand.
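Such an interaction might be structured as in the following sketch (illustrative Python only; the segmentation and dialogue functions are hypothetical placeholders):

    # Illustrative sketch: a semi-automated authoring loop in which a
    # segmentation tool proposes candidate object regions and the author
    # confirms, rejects, or adjusts each suggestion. The analysis tool
    # assists the author rather than replacing them.
    def author_image(image, segment, ask_author):
        """segment(image) -> list of candidate regions;
           ask_author(region) -> 'accept', 'reject', or an adjusted region."""
        anchors = []
        for region in segment(image):          # procedural guidance
            decision = ask_author(region)      # the author retains control
            if decision == 'reject':
                continue
            anchors.append(region if decision == 'accept' else decision)
        return anchors                         # regions to be used as link anchors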
6. CONCLUSIONS

This paper has discussed the role which visual information can play in multimedia applications. For multimedia to achieve its immediate potential, the visual information needs to become truly integrated. This in turn requires that it become an active media. In order for a given media to be used as active data, the individual components of that media need to be identified during the authoring process. Traditionally, image data is manually marked-up during authoring. For large image databases - which will become increasingly common - this is excessively expensive and time-consuming. We need to consider ways in which we can automate the authoring process.
The paper described the results obtained from a prototype multimedia application, called HyperImage, which integrated an object recognition scheme. This illustrated how important visual information can be in improving the useability of multimedia systems. It also showed the role that object recognition can play in providing this useability. Unfortunately, computer vision is still too immature to provide a sufficiently high level of robustness and accuracy for multimedia authoring, except for very specialised applications. We can nevertheless develop tools which will semi-automate the authoring process. A typical example would be an integrated tool, based on image segmentation, which makes suggestions to the user regarding possible objects within images. The user would have control over these analysis tools, which act to assist rather than replace the multimedia author.

The implementation of the active image media is the first phase of what the authors see as the development of the ability to treat all media generically. Thus the application becomes independent of the particular media type. Each media type allows static and dynamic links from its various semantic components, contextual analysis, etc. - the range of options now available with textual media. This research has tried to begin the work of treating image data in the same fashion as textual data.
7. REFERENCES

1. S. Al-Hawamdeh, B. C. Ooi, R. Price, T. H. Tng, Y. H. Ang, and L. Hui, "Nearest Neighbour Searching in a Picture Archive System," International Conference on Multimedia Information Systems, ACM and ISS, 1991, pp 17-33.
2. J. Aloimonos, "Visual shape computation," Proceedings of the IEEE, vol. 76, no. 8, pp 899-916, Aug 1988.
3. P. Constantopoulos, J. Drakopoulos, and Y. Yeorgaroudakis, "Retrieval of multimedia documents by pictorial content: A prototype system," International Conference on Multimedia Information Systems, ACM and ISS, McGraw Hill, 1991, pp 35-48.
4. T. H. Edgar, C. V. Steffen, and D. A. Newman, "Digital storage of image and video sequences for interactive media integration applications: A technical review," International Interactive Multimedia Symposium, Perth, W.A., January 27-31 1992, pp 279-284.
5. A. Ginige and C. Fuller, "Magazine of the Future: A Vision and a Challenge," IEEE Multimedia, Summer 1994.
6. K. Liew, "HyperDoc: WinWord to WinHelp document conversion macros," shareware software, September 1993.
7. D. B. Lowe, "Image Representation via Information Decomposition," PhD thesis, School of Electrical Engineering, University of Technology, Sydney, December 1992.
8. D. Lowe, Perceptual Organisation and Visual Recognition, Kluwer Academic Publishers, 1985.
9. D. B. Lowe and A. Ginige, "The Use of Object Recognition in Multimedia," Image and Vision Computing NZ '93, Auckland, New Zealand, August 16-18, 1993.
10. D. Lowe and A. Ginige, "A Hierarchical Structure for Spatial Domain Coding of Video Images," The Australian Video Communications Workshop, Melbourne, Australia, July 1990, pp 195-203.
11. J. Robertson, E. Merkus, and A. Ginige, "The Hypermedia Research Toolkit (HART)," European Conference on Hypermedia Technologies '94, United Kingdom, 1994.
12. C. Y. Wong, "Research Directions in Hypermedia," International Interactive Multimedia Symposium, Curtin University of Technology, Perth, Australia, January 27-31 1992, pp 299-310.
13. L. J. Wu, "Image coding using visual modelling and composite sources," International Conference on Digital Signal Processing, Florence, September 1984, pp 492-497.