The Use of Object Recognition in Multimedia

David B. Lowe, Athula Ginige
School of Electrical Engineering, University of Technology, Sydney
Ph: +61 2 330 2526, Fax: +61 2 330 2435
E-mail: fdbl, [email protected]

Abstract

Multimedia is a technology which has enjoyed considerable attention over the last few years. It involves the use of multiple forms of communication media in an interactive and integrated manner. At present, text is the medium most commonly used to provide the interactivity (due to the ease with which discrete elements are identified). It is common practice to follow links from words or phrases within text to associated information elsewhere in the database. Achieving a similar degree of functionality with image data typically requires that each image be processed by hand, indicating the objects and their locations within the image. This paper describes a simple object recognition system which allows the specification of 3-dimensional models that can then be used to recognise objects within any image, in an analogous fashion to words within text. This enables image data to become a truly active medium within a multimedia database. It provides a significantly enhanced level of functionality with minimal additional effort required during data entry. The basic algorithm is described and then an example application is outlined, along with feedback from users of the system.

1. Introduction

A survey of the fields which make use of image data (coding, vision, synthesis, graphics etc.) reveals that the area is very fragmented, with little underlying structure and only minimal interaction between the various fields. This has not been a major limitation to date, as most applications have been specific to a particular form of image usage. This does not hold true for multimedia, where a number of different image algorithms (coding, synthesis, recognition etc.) need to be combined. This can best be understood by considering the usage of textual data within multimedia, an area that has received considerable research attention. Text can be stored, analysed, manipulated, and generated synthetically. Essentially, text can be treated as consisting of discrete entities (words, sentences, paragraphs etc.) which obey a series of syntactic rules describing their inter-relationships. The treatment of image data is still at a much lower level, and image data is predominantly treated as a passive medium. Conceptually, image data can be treated in the same way as text: as consisting of discrete entities which obey certain syntactic rules. The main problem, however, lies in identifying these entities and the associated rules, and then interpreting them. When this is achieved, image data will become an active medium as powerful as textual data (and in many cases more so). It is this premise which has driven this work.

In earlier work, an image representation was developed which was appropriate to both image coding and object recognition. The representation involved a spatial decomposition of various layers of information within an image. These layers included feature information (edges and edge sharpness), texture, shading and colour. This representation was then applied to both image coding and object recognition. For image coding, each layer can be coded individually, allowing the coding algorithm to be optimised for the particular layer. This resulted in a progressive coding scheme that gave exceptionally high compression ratios. For the recognition scheme, each layer of the representation was used to generate object hypotheses based on feature matching (with the features used depending upon the layer). The hypotheses from each layer were then correlated and assessed to obtain the final image interpretation. A very robust image recognition system resulted. This paper describes the application of this representation in the implementation of a simple multimedia database. By integrating the recognition scheme into the application, it allowed the image data to become truly active, albeit in a restricted fashion. Section 2 describes the representation's relevance to a multimedia application. Sections 3 and 4 outline the implementation and testing, and Section 5 summarises the findings.

2. Multimedia Systems

2.1 What is Multimedia?

Multimedia is a technology which has enjoyed significant attention over the last few years. It involves the interactive use of multiple forms of communication media (such as text, audio, video, images etc.). At present the most common multimedia applications are surrogate travel, telemarketing, telemedicine and interactive learning. These applications can all make extensive use of image and video data. The ability to interact with the images is an important concept. To date, most of the research on the interactivity of multimedia has focussed on text, and on improving the ability to link textual data to the other media. Little research has been performed on extending image, video and audio data along similar lines.

2.2 Multimedia Requirements

A multimedia system places certain requirements on the format of the data that is to be used. In order to obtain the necessary degree of interactivity, the image data should be represented or coded in a way that makes it possible to view the images interactively, to create links from objects within an image, to automatically index images in terms of their content, and to browse quickly through video information. The image data should not be just a passive medium but an active part of the multimedia system in the same way as text (e.g. it is possible to manipulate a piece of text by moving words; in the same way it should be possible to manipulate images by selecting objects, moving them etc.). The ultimate aim would be to treat image data in a manner purely analogous to that in which text is currently treated. This involves abstracting the concept of modular data, in which an object's components are treated identically from an external point of view irrespective of the form of the data: text, audio, image, video or animation. The individual atoms of each data set can be identified, along with their context, allowing them to be used as the source of automatically generated links.

2.3 Image Specifications

Allowing image data to be used in a multimedia environment requires that the data types used to represent these media be appropriately designed. The previous section described the major requirements that a multimedia system places on the data representation. From these we can derive certain specifications for a representation and coding scheme for image data that is generalised for multimedia applications. Currently, the DCT is the most common basis for coding image data, being the basis of the JPEG, MPEG and H.261 coding schemes. The primary focus of these schemes has been data compression, and this issue remains important: an image or video sequence represented in its canonical form involves a very large quantity of data, and any coding scheme will have to consider how to reduce it. However, when considering multimedia there are a number of significant issues in addition to the problem of compression. The ability to handle full interactivity has several implications. The first of these is the need for a large degree of flexibility. We need to be able to handle the images in low level forms for storage, transmission and other low level functions, and also to be able to make use of higher level information which deals with the content of the image rather than its representation. Another implication is that we must be able to recognise, utilise and create links to individual elements of the image. This implies that the data should be coded in a form which allows spatial knowledge of the image to be utilised, i.e. the coding scheme should be spatially oriented. A scheme which adopts the more common frequency coding techniques (such as the DCT) inhibits access to individual elements of the image. This is a major restriction which limits the interactivity that can be obtained.
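The restriction imposed by frequency coding can be illustrated with a small sketch. The code below is not part of the system described in this paper; it implements a plain, unnormalised 1-D DCT-II, and the 8-sample block and its grey levels are assumptions chosen purely for illustration. It shows that changing a single pixel alters every coefficient of the block, whereas in the spatial domain only one value changes.

```python
import math


def dct_ii(samples):
    """Unnormalised 1-D DCT-II: X[k] = sum_n x[n] * cos(pi/N * (n + 0.5) * k)."""
    count = len(samples)
    return [
        sum(x * math.cos(math.pi / count * (n + 0.5) * k)
            for n, x in enumerate(samples))
        for k in range(count)
    ]


# Hypothetical 8-pixel row of grey levels (illustrative values only).
row = [52, 55, 61, 66, 70, 61, 64, 73]
modified = list(row)
modified[3] += 10  # alter exactly one pixel

before = dct_ii(row)
after = dct_ii(modified)

# Indices that differ in the spatial domain and in the frequency domain.
spatial_changed = [i for i in range(len(row)) if row[i] != modified[i]]
frequency_changed = [k for k in range(len(row))
                     if abs(before[k] - after[k]) > 1e-9]

print("spatial indices changed:  ", spatial_changed)
print("frequency indices changed:", frequency_changed)
```

Because each coefficient is a weighted sum over the whole block, an individual image element cannot be addressed through any single coefficient, which is the motivation for a spatially oriented scheme.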
In addition to being able to access spatially discrete image elements, we should be able to generate and utilise hooks into these elements, thus providing links between the various media types. An example of this will be the synchronisation of video and audio tracks, the ability of video or image events to trigger other actions etc. This can be achieved in a number of ways. For example, the ability to recognise objects within a scene is one method of generating hooks into an image.

2.4 Existing Representations

The volume of research that has been performed on the role and representation of image data within multimedia systems is relatively small. The primary research focuses have been information management, data modelling, database issues and standardisation issues [1]. In general, image and video data, when used, has been stored in the form that was most convenient given the hardware and software tools provided: tools that were typically not designed with multimedia in mind. Alternatively, hardware that was developed for other imaging applications has been adapted for use. An example of this is the use of the JPEG coding scheme (with the associated hardware CODEC) to code and decode the image data. Almost all of these schemes suffer from the same limitation: the image representation is not particularly appropriate for a multimedia application. Despite this, there is minimal research being performed on these problems and on methods of solving them. Edgar, Steffen and Newman [2] look at current technologies in this area and the effect that these have had. This includes advances in image handling capabilities, the development of image compression standards, new storage techniques and network applications. The more fundamental issue of the image representation itself is, however, not raised. In order to adapt image data to a multimedia environment, the usual technique is to add additional information to the representation without modifying the underlying structure. Examples of this approach include manually specifying the location and orientation of objects within the image and storing this data with the image, or giving a textual description of the image and then using this description for cross-referencing [3]. Some research has attempted to make image data active, rather than passive. This means that the image data can be used for accessing other data in the application. In almost all cases [4] this work has focussed on unrealistically simplistic images, usually computer generated, two-dimensional images or sketches. Nevertheless, this work does indicate the validity and possible applications of this research area.

3. Demonstration Application

3.1 Application Description

In order to illustrate the application of the image representation to a system which combines coding and recognition techniques, a demonstration multimedia database was developed. This database, although very simplistic in comparison to existing multimedia applications, demonstrates the range of possibilities opened up by using the image representation to enable active rather than passive image data, and provides a platform for investigating the general performance. No attempt was made to create a commercial quality multimedia application, as the primary aim was to investigate the implications of the imaging principles involved, rather than the various multimedia principles. As a result the overall operation and performance of the system from a multimedia point of view is quite simplistic. Only sufficient functionality to investigate the image algorithms has been incorporated. The application that was chosen was a database based on grocery items. A large selection of groceries was included in the database, and the information on each of these was included in textual, audio and image formats. The information that was available varied over a broad range, from broad category descriptions to specific items. Within the broader categories it was possible to move to more specific items using either static links or dynamic links based on selecting individual words or selecting objects within the images. The database consisted of approximately 30 objects and 50 images, many of which contained multiple objects. The objects and images were arranged into a hierarchical structure. The development and usage of this application is detailed elsewhere [5].

3.2 Implementation and Structure

The multimedia application was developed on a Sun SPARCstation running OpenWindows version 3 with an X-Windows interface. The application was implemented in such a way as to maximise the flexibility in terms of developing additional databases. The application used several different specification files. The first of these was a dictionary file. This contained a list of elements, and for each element three data files were specified: an audio file, a text file, and an image file. Thus each element in the dictionary has a complete set of information available. The dictionary files do not specify the database, but rather act as a reference for the particular elements. The second specification file was the database file itself. This file specifies a particular database application. It contains a list of dictionary files, model files (to specify the models used to recognise objects), and static element links. This approach allows the user to develop a library of element data (specified in the dictionary files) and then to build databases simply by combining the relevant dictionaries.

Figure 1: Snapshot of screen containing the multimedia application

This separates the data gathering and conversion process from the authoring process, a significant advantage over most existing multimedia tools. Finally, the application itself was coded in such a way as to make the user interface as simple and logical as possible. The images are coded offline and then stored in their coded form. They are then loaded when required for a particular database element. The display routines decode this data and regenerate the images progressively so that the user has an indication of the image content as early as possible. If the user then selects an object within the image (by clicking on it with the mouse), the recognition scheme will attempt to recognise the object using the information representation rather than having to regenerate the canonical image. If the recognition is successful, the application will jump to an appropriate database element (i.e. image, text and audio that is relevant to the selected object). Figure 1 is a snapshot of the screen containing the multimedia application. The user has just moved the mouse over an object in the image and pressed the left mouse button. The object was identified and the appropriate link will subsequently be traced.
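The structure described in this section can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the paper does not give the file formats, so the class names (`Element`, `Region`, `Database`), the file names, the use of bounding boxes for recognised objects, and the rule of preferring the smallest overlapping region are all assumptions made for this sketch.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Element:
    """One dictionary entry: an element and its three data files."""
    name: str
    audio_file: str
    text_file: str
    image_file: str


@dataclass(frozen=True)
class Region:
    """A recognised object and its bounding box within an image (assumed form)."""
    label: str
    left: int
    top: int
    right: int
    bottom: int

    def contains(self, x: int, y: int) -> bool:
        return self.left <= x <= self.right and self.top <= y <= self.bottom


@dataclass
class Database:
    """A database file: combined dictionaries, model files and static links."""
    dictionaries: list                                  # name -> Element mappings
    model_files: list                                   # models used for recognition
    static_links: dict = field(default_factory=dict)   # name -> linked names

    def lookup(self, name: str) -> Optional[Element]:
        """Return the first element called `name` in the combined dictionaries."""
        for dictionary in self.dictionaries:
            if name in dictionary:
                return dictionary[name]
        return None


def follow_click(db, regions, x, y):
    """Return the element linked to the object under (x, y), or None."""
    hits = [r for r in regions if r.contains(x, y)]
    if not hits:
        return None  # no recognised object at this point
    # Where regions overlap, prefer the smallest (an assumption of this sketch).
    smallest = min(hits, key=lambda r: (r.right - r.left) * (r.bottom - r.top))
    return db.lookup(smallest.label)


# Hypothetical element library, built once and reused across databases.
fruit = {"apple": Element("apple", "apple.au", "apple.txt", "apple.img")}
containers = {"basket": Element("basket", "basket.au", "basket.txt", "basket.img")}
groceries = Database([fruit, containers], ["apple.model", "basket.model"],
                     {"basket": ["apple"]})

# Hypothetical recognition results for one image: an apple inside a basket.
regions = [Region("basket", 0, 0, 200, 150), Region("apple", 20, 30, 60, 70)]
target = follow_click(groceries, regions, 40, 50)
print(target.name if target else "no link")
```

Keeping the dictionaries separate from the database definition mirrors the separation of data gathering from authoring: a new database is assembled by listing existing dictionaries rather than by re-entering element data.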

4. Results and Performance

The application was developed and a typical database installed. The program was then assessed both subjectively, from a user's point of view, and by making quantitative measurements of the performance.

4.1 Subjective Assessment

Once the application was completed and the demonstration database installed, a number of people were asked to use the system and comment on its usability. Before these comments are outlined, a number of relevant issues should be raised regarding the individuals who tested the application and factors that may have affected their reactions.

                                                            % of users responding
Question                                                     1   2   3   4   5   X
The use of active images significantly improved the          0   7  29  43  21   0
  usability of the multimedia application
The inability to recognise all objects was a significant     0  14  21  36  29   0
  problem
The layered loading of images was distracting                7   0  43  29  14   7
The time taken to load images was acceptable                14  21  43  14   0   7
The system was sufficiently robust in terms of not           0   0   7  14  64  14
  recognising objects incorrectly
The layered loading of images was useful in obtaining        0  14  50  14   0  21
  an early indication of the image content
When a link was available from both the text and             0  21  21  14  43   0
  the image, in general the image was used
More intelligence needs to be added to the                   7  21  29  14   0  36
  destination of the links from the image

Table 1: User ratings of the multimedia application. This lists the range of values for various ratings for the listed questions (5 = strongly agree, 4 = agree, 3 = neutral, 2 = disagree, 1 = strongly disagree, X = don't know).

The total number of people who assessed the system was fourteen. The initial level of familiarity with this style of software was quite variable. Six of the people had used a multimedia application previously, ten had used some form of context-sensitive help system, and all but two had used computers to at least a moderate degree. (Each person assessing the system completed an extensive survey relating to their experience with computers, software, and multimedia prior to using the demonstration application.) A number of additional points are worth noting.

1. The functionality of the application is relatively low when compared to commercial systems. This has several implications. The first is that it immediately made the users less satisfied with the system overall. It is also likely to have made the additions to the functionality (in terms of the extra imaging abilities) seem better in comparison.
2. The quantity of material in the database is quite limited, as is its scope. As a result the users can very quickly explore the entire database, and then become bored with it.
3. The use of a multimedia database that allowed active image data was something of a novelty. As a result the general impressions were probably more favourable. A more extended period of use would be required to remove this type of effect.

The users were each given half an hour to investigate the application. For the first half of this time the images were set to be passive, and for the following fifteen minutes the images were allowed to be active. The users were then asked to complete a questionnaire regarding their impressions of the system and to answer various questions rating its performance. Table 1 lists the responses to the questions that were asked of the users of the multimedia application. From the comments and ratings provided by the users it is possible to draw a number of conclusions about the performance of the system and to identify possible problems.

1. In general the users seemed to be quite impressed with the general idea of active image data. All major criticisms were aimed at implementation and usage issues rather than the concept. This is borne out by both their comments and the degree to which the image data was used.
2. The major criticism that users had of the system related to the data links. Often these links led to unexpected places, or led to information which was too specific. This problem is, however, a multimedia issue rather than an imaging issue.
3. The users were almost unanimous in their acceptance of the robustness of the system. There was no criticism at all relating to the incorrect identification of objects, though slight problems were encountered with selected objects not being identified.
4. The major advantage identified by the users was the improved ease of use obtained by making the image data active. In general the users found it much easier to locate and select an object within an image than a word within a document.
5. Several users criticised the amount of time that it took to load the image data (these users were, in all cases, those who were least experienced with computer systems). This, however, is a problem which is not directly related to the particular algorithms. Typically 70 to 90% of the image retrieval time is taken up by network activity, X-Window communications, and other processing tasks which are independent of the specific image representation.

4.2 Quantitative Analysis

Only minimal tests were performed to analyse the quantitative performance of the application. Most of the necessary testing has been performed individually on the coding and vision algorithms and is described in [5]. These tests involved assessing the compression ratios that were obtained for a variety of images, the accuracy and robustness of the recognition algorithms, and the processing times required. The only quantitative testing that was performed directly on the multimedia application was to ensure that the integration of the algorithms did not lead to a degradation of the performance. The only variation that occurred was in the processing time that was required. This is as expected, due to the change in environment, platform and background processing. The timing requirements were reasonable in all cases apart from the recognition of objects in complex images. This was solved by performing the recognition phase off-line and storing the results in a file which could be accessed during execution. Thus each image, after it is captured, is processed to identify all known objects within it. It is then compressed and stored.
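The off-line phase described above can be sketched as below. This is only an illustration under stated assumptions: `zlib` stands in for the layered coding scheme, the recogniser is a placeholder function supplied by the caller, and the file names and JSON result format are hypothetical, since the paper does not specify how the results file is laid out.

```python
import json
import tempfile
import zlib
from pathlib import Path


def preprocess_image(name, pixels, recognise, out_dir):
    """Off-line step: recognise objects once, then store coded image and results."""
    out_dir = Path(out_dir)
    objects = recognise(pixels)  # the slow step, performed before execution
    # zlib is a stand-in coder here, not the layered scheme used by the system.
    (out_dir / f"{name}.coded").write_bytes(zlib.compress(pixels))
    (out_dir / f"{name}.objects.json").write_text(json.dumps(objects))


def load_objects(name, out_dir):
    """Run-time step: read the stored recognition results instead of recomputing."""
    return json.loads((Path(out_dir) / f"{name}.objects.json").read_text())


def placeholder_recogniser(pixels):
    """Hypothetical recogniser returning one labelled bounding box."""
    return [{"label": "apple", "box": [10, 10, 40, 40]}]


with tempfile.TemporaryDirectory() as tmp:
    image = bytes(64)  # dummy image data for the example
    preprocess_image("shelf", image, placeholder_recogniser, tmp)
    stored = load_objects("shelf", tmp)
    restored = zlib.decompress((Path(tmp) / "shelf.coded").read_bytes())

print(stored)
```

With the results stored in this way, selecting an object at run time reduces to a file read and a lookup, which is consistent with the near-instant response reported by users.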

5. Summary

This paper has outlined the integration of coding and object recognition algorithms into a multimedia application. Although this particular application was, in most respects, quite primitive, it still demonstrated how the information representation of the image could be used to make the image data an active rather than a passive medium and hence significantly improve the functionality. The structure of the representation led naturally to a very simple integration of the coding and vision techniques that had been previously developed. They were both based on the same representation and could thus be combined with a minimum of difficulty. In this particular application much of the processing (such as the original information representation extraction, coding and object recognition) could be performed off-line. This gave a system which, from a user perspective, led to almost instant results. Although this may not always be the case, it should be true in most applications that require vision and coding to be combined. In general the user assessment of this particular application was very favourable. Probably the most telling result was the proportion of time the users spent navigating with the links from image objects rather than the textual links (though the 'novelty' factor must be considered). The comments almost universally indicated that the use of active image data significantly improved the system's functionality. Although most existing multimedia authoring tools allow image data to be used for linking, this is typically achieved by manually specifying the regions of the image that are to contain links. This becomes impractical for large databases. The only reasonable solution is to automate the process of identifying objects within the images to be used, that is, to create truly active images. The implementation of active image media is the first phase of what the authors see as the development of the ability to treat all media generically. The application thus becomes independent of the particular media type. Each media type allows static and dynamic links from its various semantic components, contextual analysis etc.: the range of options now available with textual media. This research has tried to begin the work on treating image data in the same fashion as textual data.

6. Acknowledgements

The authors wish to express their gratitude to both OTC Ltd., for the Telecommunications Student Awards, and the Australian Research Council, for the Post-Graduate Research Awards, both of which assisted in funding this research.

References

[1] C. Y. Wong. Research directions in hypermedia. In International Interactive Multimedia Symposium; Perth, W.A., pages 299-310. Curtin University of Technology, Promaco Conventions P/L, January 27-31 1992.

[2] T. H. Edgar, C. V. Steffen, and D. A. Newman. Digital storage of image and video sequences for interactive media integration applications: A technical review. In International Interactive Multimedia Symposium; Perth, W.A., pages 279-284. Curtin University of Technology, January 27-31 1992.

[3] S. Al-Hawamdeh, B. C. Ooi, R. Price, T. H. Tng, Y. H. Ang, and L. Hui. Nearest neighbour searching in a picture archive system. In International Conference on Multimedia Information Systems, pages 17-33. ACM and ISS, McGraw Hill, 1991.

[4] P. Constantopoulos, J. Drakopoulos, and Y. Yeorgaroudakis. Retrieval of multimedia documents by pictorial content: A prototype system. In International Conference on Multimedia Information Systems, pages 35-48. ACM and ISS, McGraw Hill, 1991.

[5] D. B. Lowe. Image Representation via Information Decomposition. PhD thesis, School of Electrical Engineering, University of Technology, Sydney, December 1992.
