Proposal Id: P480
Proposal for MPEG-7 Image Description Scheme
Name: (Columbia University) Seungyup Paek, Ana B. Benitez, Shih-Fu Chang
(IBM) Chung-Sheng Li, John R. Smith, Lawrence D. Bergman
(AT&T) Atul Puri, Cassandra Swain, Joern Ostermann
Company/Institution: Columbia University, IBM, AT&T
1) Which item is proposed? b) Description Scheme
2) Do you have other proposal(s) related to this one? Please list the proposal Ids. P471 to P485
Form for Description Schemes
1. Which media has your DS been developed for and/or applied to? (e) Still Images
2. What is the main functionality of your DS?
• Object-oriented multilevel abstraction of image content
• Hierarchical structure of objects
• Entity relation graphs for general relationships among objects
• Feature categorization (media, semantic, visual) and clustering for efficient indexing
• Code downloading for linking external features
• Modality transcoding for linking objects to other modalities
• Flexible description framework for adding new features
3. Is your DS generic to different content domains or is it content domain-specific (e.g. biomedical, etc.)? If your description scheme is content-domain-specific please answer the following:
It is a generic DS. It has been used to derive other DS proposals for video, multimedia, synthetic, and specific domains, such as remote sensing, home media, etc.
3.1. Which domain and tasks does your solution support?
3.2. Is there evidence of support from professional organizations for the technology proposed and/or from applicable domain current practices?
3.3. Is this applicable to other domains? Do you have evidence of this?
4. State up to three MPEG-7 applications (for examples you may refer to the MPEG-7 Applications Document) to which your proposal applies.
• Storage/retrieval of image databases at multiple levels
• Agent-based visual content filtering
• Personalized multimedia presentation
5. Will you provide a presentation to explain your DS proposal and answer questions about it?
Yes.
6. Will you provide a demonstration to show how your DS meets the evaluation criteria? Are you using the MPEG-7 test sets or are you providing your own test sets? Yes. We will use part of the MPEG-7 test set and some of our own test sets.
INTERNATIONAL ORGANIZATION FOR STANDARDIZATION ORGANISATION INTERNATIONALE DE NORMALISATION ISO/IEC JTC1/SC29/WG11 CODING OF MOVING PICTURES AND ASSOCIATED AUDIO ISO/IEC JTC1/SC29/WG11 MPEG99/P480 February 1999, Lancaster, UK
Title: Proposal for MPEG-7 Image Description Scheme
Status: Proposal
Source: Columbia University, IBM, AT&T
Authors: (Columbia University) Seungyup Paek, Ana B. Benitez, Shih-Fu Chang
         (IBM) Chung-Sheng Li, John R. Smith, Lawrence D. Bergman
         (AT&T) Atul Puri, Cassandra Swain, Joern Ostermann
1 INTRODUCTION

We propose a description scheme (DS) for image content that meets the requirements outlined in the MPEG-7 requirements document. While an image DS can be generated by any description definition language (DDL), we use the eXtensible Markup Language (XML) to represent the proposed image DS. The use of XML is motivated by its recognized advantages: self-describing capability, intuitive readable format, and extensibility. In section 2.2, we discuss the reasons for using XML in this proposal for the image DS.

Our image DS consists of several basic components: objects, object hierarchies, entity relation graphs, and feature structures. Each image description includes a set of objects. The objects can be organized in one or more object hierarchies. Relationships among objects can also be expressed in one or more entity relation graphs. Each object has one or more associated features, and each object can accommodate any number of features in a modular and extensible way. The features of an object are grouped together according to the following categories: visual, semantic, and media. Multiple abstraction levels of features can be defined. Each object may be associated with objects in other modalities through transcoding. The feature classes are explained in section 3.

In the image DS, each feature of an object has one or more associated descriptors. Each feature can accommodate any number of descriptors in a modular and extensible way. We include examples of descriptors for different features, to show how descriptors can be associated with
each feature. Using the XML framework, we also include a specific Document Type Definition (DTD) for each example descriptor. For a given descriptor, the image DS also provides a link to external descriptor extraction code and descriptor similarity code. We will demonstrate the utility of having these links with respect to the systems we are developing.

We will demonstrate how our image DS is used in two search and filtering systems. These systems have capabilities that critically depend on interoperable image descriptions. The first system is a search engine that allows users to query based on semantic, visual, and media features. The diverse set of features is generated automatically, semi-automatically, and manually by a wide variety of systems, both in our laboratory and by other groups with which we collaborate. The second system is a new version of MetaSEEk, a metasearch system for mediation among multiple search engines for audio-visual information.

Our proposal is organized as follows. In section 2, we briefly discuss how our proposal uniquely addresses the requirements of MPEG-7 and discuss the specific role of XML in our proposal. Section 3 highlights the general concepts and examples of the proposed DS, while a special case is illustrated by an example in Annex B. Section 4 discusses the principles we use in the creation and organization of DTDs, while the detailed DTDs themselves are presented in Annex A. Section 5 describes the ongoing work on demonstrations (via search/filtering systems) to illustrate the strength of our proposed DS. Finally, we summarize the key points of our proposal in section 6.
2 DESIGN PRINCIPLES AND GOALS

The proposed DS is designed to address the MPEG-7 requirements for visual content description. We first explain how the design satisfies most of the MPEG-7 requirements, and then focus on the specific requirements of flexibility and extensibility that are addressed by the use of the eXtensible Markup Language.
2.1 Satisfying General MPEG-7 Requirements

The proposed DS satisfies the proposal criteria stated in the MPEG-7 evaluation document as follows.

Object-oriented multi-level abstraction
The proposed DS uses "object" as the fundamental entity in describing image content at various levels, which can be defined along different dimensions. For instance, objects can be used to describe image regions, groups of regions, and the entire image frame based on its spatial composition structure. High-level objects can be used to describe groups of primitive objects based on some semantic or feature criteria. Different types of features can be used for different levels of objects. As an example, visual features are adequate for objects corresponding to physical components in the image, while semantic features can be applied to objects at any level.

Effectiveness
Our object-oriented DS provides an efficient framework for accommodating various types of images and their features in different domains. In section 5, we will discuss our
current efforts in using the proposed DS to support several large-scale image search and filtering systems.

Applications Domain
The proposed image DS is a generic one for general applications. It has been used as a base DS to derive domain-specific DS's (Proposal #475 Petroleum DS, Proposal #476 Remote Sensing/Medical DS, Proposal #479 Home Media DS), a video DS (Proposal #481), integration DS's for multimedia (Proposals #473 and #477), a synthetic DS (Proposal #472), and a DS for content archives (Proposal #482).

Comprehensiveness
We use an emerging standard from W3C, XML, to represent our proposed DS in this document, although any DDL selected by MPEG-7 can serve the same purpose. We will demonstrate that new features can be easily accommodated using the hierarchical structures and entity relation structures.

Flexibility
The flexibility of the proposed DS is achieved by (1) allowing parts of the DS to be instantiated; (2) using efficient categorization of features and clustering of objects (using the indexing hierarchy described later); and (3) supporting efficient linking, embedding, or downloading of external feature descriptors and execution code.

Extensibility
Elements defined in the DS can be used to derive new elements for different domains. As mentioned earlier, the DS has been used as a base in other DS proposals for specific domains, video, and integrated content.

Scalability
One unique aspect of the DS is the capability to define multiple abstraction levels based on any arbitrary set of criteria. The criteria can be specified in terms of visual features (e.g., size), semantic relevance (e.g., relevance to a user interest profile), or service quality (e.g., media features).

Simplicity
The DS specifies a minimal set of components: objects and a set of required features. Additional objects and features can be easily added in a modular and flexible way.
2.2 Focus on Flexibility and Extensibility: eXtensible Markup Language (XML)

Among the various MPEG-7 requirements, the flexibility and extensibility requirements are critical for widespread acceptance of MPEG-7 in many application domains. We propose the use of XML to satisfy these requirements; a brief historical overview of XML, leading to the justification of its use in our proposal, is presented next.

SGML (Standard Generalized Markup Language, ISO 8879) is a standard language for defining and using document formats. SGML allows documents to be self-describing, i.e. they describe their own grammar by specifying the tag set used in the document and the structural relationships that those tags represent. However, full SGML contains many optional features that are not
needed for Web applications and has proven too complex for current vendors of Web browsers. The World Wide Web Consortium (W3C) created an SGML Working Group to build a set of specifications to make it easy and straightforward to use the beneficial features of SGML on the Web [10]. This subset, called XML (eXtensible Markup Language), retains the key SGML advantages in a language that is designed to be vastly easier to learn, use, and implement than full SGML. A major advantage of using XML is that it allows descriptions to be self-describing, in the sense that they combine the description and the structure of the description in the same format and document. XML also provides the capability to import external DTDs (e.g. DTDs for feature descriptors) into the image description scheme DTD in a highly modular and extensible way. We will expand on this and see an example of this in the next section.
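As a brief sketch of this import mechanism, an external descriptor DTD can be pulled into the image DS DTD with a parameter entity; the file name shown here is illustrative:

```xml
<!-- Pull an externally defined descriptor DTD into the image DS DTD. -->
<!-- "color_descriptor.dtd" is an illustrative file name. -->
<!ENTITY % color_descriptor SYSTEM "color_descriptor.dtd">
%color_descriptor;
```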
3 IMAGE DESCRIPTION SCHEME

In this section, we present the image DS for interoperable image descriptions. To clarify the explanation, we use the example shown in Figure 1. Using this example, we will walk through the image DS expressed in XML. Along the way, we will clearly explain the use of each XML element that is defined for the proposed image DS. The complete set of XML rules for the tags in the image DS is defined in our XML document type definitions (DTDs) and is presented in section 4. For the complete description of our DTDs, please see Annex A.

Figure 1: a) Objects in an image. b) Objects in an image.
Our image description scheme consists of several basic components: image objects, object hierarchies, entity relation graphs, features, and code downloading for feature extraction and similarity matching. We explain these components by giving examples in this section. Specific DTD structures of these components will be discussed in section 4 and Annex A. Figure 2 and Table 1 summarize the proposed image DS in UML notation.

The proposed description scheme applies not only to photographic images but also to images from specialized scenarios with very precise needs. As examples, we consider satellite images (or remote sensing images) and service images in sections 3.6 and 3.7, respectively. Satellite images are usually extremely large images from which many objects may be extracted; therefore, a method to index the objects within the descriptions based on feature and/or semantic criteria is needed for efficient retrieval (section 3.6). Service images may need to be transcoded into different media modalities to reach all the different types of users (section 3.7).
Figure 2: UML representation of image DS.

Table 1: Elements in the Image Description Scheme (DS).
Element                   | Contains                                                                                        | May be Contained in
image                     | image_object_set (1), object_hierarchy (0..*), entity_relation_graph (0..*)                     | (root element)
image_object_set          | image_object (1..*)                                                                             | image
image_object              | img_obj_media_features (0..1), img_obj_semantic_features (0..1), img_obj_visual_features (0..1) | image_object_set
img_obj_media_features    | location (0..1), file_format (0..1), file_size (0..1), resolution (0..1), modality_transcoding (0..1), ... | image_object
img_obj_semantic_features | text_annotation (0..1), who, what_object, what_action, why, where, when (0..1), ...             | image_object
img_obj_visual_features   | image_scl (0..1), color (0..1), texture (0..1), shape (0..1), size (0..1), position (0..1), ... | image_object
object_hierarchy          | object_node (1)                                                                                 | image
object_node               | object_node (0..*)                                                                              | object_hierarchy, object_node
entity_relation_graph     | entity_relation (1..*)                                                                          | image
entity_relation           | relation (1), entity_node (0..*), entity_relation (0..*)                                        | entity_relation_graph, entity_relation
3.1 Image object and object set

An object refers to one or more arbitrary regions of an image. To indicate whether the object has associated semantic information, the object has an optional "semantic" attribute that can be true or false. To indicate whether the object has associated physical information (shape, color, position, etc.), the object has an optional "physical" attribute that can be true or false. To indicate whether the regions of an object are always spatially adjacent to one another (continuous in space), the object has an optional "spaceContinuous" attribute that can be true or false.

Examples of objects from a news photograph are shown in Figure 1. In Figure 1a), objects o1 (Boris Yeltsin), o2 (Jiang Zemin), o4 (Yeltsin's face), o5 (Jiang's face) and g0 (entire photo) are objects that only have one associated region. Objects o0 (people) and o3 (faces) are examples of objects that each have multiple regions that are separated in space. In Figure 1b), object o0 (building) is an example of an object that has multiple regions that are both separated and connected in space.

It is important to note that while each object is used to describe one or more arbitrary regions of an image, it is not necessary, although possible, to explicitly provide information concerning the shape and position of each region. For example, for object o1 in Figure 1a), we may or may not provide the shape and position of Boris Yeltsin. This is discussed further when we explain the feature elements that are included within the object element.

A special type of image object element is the global object element, which contains the features that are common to an entire image. The set of all objects that are to be described in an image is included within the object set element (<image_object_set>). Each object element (<image_object>) has a unique identifier within an image description. The identifier is expressed as an attribute of the object element (id).
The objects that we have chosen to describe for the image in Figure 1a) are listed below. The type of the object is defined as an attribute of the image object element (type). For now, we will leave each object element empty. Later in this section, we will describe how features are included within each object element in an extensible and modular way.
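For instance, the object set for Figure 1a) could be written as follows; the id values follow the figure, and the attribute values shown are illustrative:

```xml
<!-- Object set for Figure 1a); attribute values are illustrative. -->
<image_object_set>
    <image_object id="g0" type="GLOBAL"/>                        <!-- entire photo -->
    <image_object id="o0" type="LOCAL" spaceContinuous="false"/> <!-- people -->
    <image_object id="o1" type="LOCAL" semantic="true"/>         <!-- Boris Yeltsin -->
    <image_object id="o2" type="LOCAL" semantic="true"/>         <!-- Jiang Zemin -->
    <image_object id="o3" type="LOCAL" spaceContinuous="false"/> <!-- faces -->
    <image_object id="o4" type="LOCAL"/>                         <!-- Yeltsin's face -->
    <image_object id="o5" type="LOCAL"/>                         <!-- Jiang's face -->
</image_object_set>
```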
3.2 Object hierarchy

The image DS is comprised of object elements that can be combined hierarchically in one or more object hierarchy elements (<object_hierarchy>). The hierarchy is a way to organize the object elements in the object set element. Each object hierarchy consists of a tree of object node elements (<object_node>). Each object node points to an object. Below, we show two possible hierarchies for the image example of Figure 1a). The object node element has an associated unique identifier in the form of an attribute (id). The object node element references an object element by using the latter's unique identifier. The reference to the object element is included as
an attribute (object_ref). An object element can also include links back to object nodes in the object hierarchy by using an attribute (object_node_ref).
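One such hierarchy for Figure 1a), grouping the people of the photograph, could be sketched as follows (node ids are illustrative):

```xml
<!-- One possible object hierarchy for Figure 1a); node ids are illustrative. -->
<object_hierarchy id="oh0">
    <object_node id="on0" object_ref="o0">      <!-- people -->
        <object_node id="on1" object_ref="o1"/> <!-- Boris Yeltsin -->
        <object_node id="on2" object_ref="o2"/> <!-- Jiang Zemin -->
    </object_node>
</object_hierarchy>
```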
3.3 Entity relation graph

The image DS also allows for the specification of relations among objects in one or more entity relation graphs (<entity_relation_graph>). An entity relation graph is a set of explicit entity relations (<entity_relation>) among two or more entity nodes (<entity_node>). There are several types of relations; Table 2 contains examples of relation types and relations.

Table 2: Examples of relation types and relations.

Relation Type          | Relations
Temporal - Directional | Before Of, After Of, Immediately Before Of, Immediately After Of
Temporal - Topological | Co-Begin, Co-End, Parallel, Sequential, Overlap, Adjacent To, Neighboring To, Near By, Within, Contained In
Spatial - Directional  | Top Of, Bottom Of, Right Of, Left Of, Upper Left Of, Upper Right Of, Lower Left Of, Lower Right Of
Spatial - Topological  | Adjacent To, Neighboring To, Near By, Within, Contained In
Semantic               | Relative Of, Belongs To, Part Of, Related To, Same Object As
For the example in Figure 1, a relation graph with two relations is shown below. Each entity relation can have a unique identifier in the form of an attribute (id). The type of entity relation is included in the entity relation element as an attribute (type). An entity node element in the entity relation graph references an object element in the same way an object node references an object element. Each entity node element is also assigned a unique attribute (id). An object element can also include links back to entity nodes in the relation graph using an attribute (entity_node_ref).

<entity_relation_graph>
    <entity_relation id="er0" type="SPATIAL">
        <relation>Left Of</relation>
        <entity_node id="en0" object_ref="o1"/>
        <entity_node id="en1" object_ref="o2"/>
    </entity_relation>
    <entity_relation id="er1" type="SEMANTIC">
        <relation>Face Of</relation>
        <entity_node id="en2" object_ref="o4"/>
        <entity_node id="en3" object_ref="o1"/>
    </entity_relation>
</entity_relation_graph>
3.4 Features

An object set element, one or more object hierarchy elements, and one or more entity relation graphs form the image element (<image>). The image element comprises the description of the image. In our image DS, each object element contains the feature elements. The features associated with an object are grouped according to the following categories: media (<img_obj_media_features>), semantic (<img_obj_semantic_features>), and visual (<img_obj_visual_features>) [5]. Table 3 shows examples of features that belong to each feature class.

Table 3: Feature classes and features.

Feature Class | Features
Semantic      | Text Annotation, Who, What Object, What Action, Why, When, Where
Visual        | Color, Texture, Position, Size, Shape, Orientation
Media         | File Format, File Size, Color Representation, Resolution, Data File Location, Author, Creation, Scalable Layer (MPEG-2), Modality Transcoding
Note that we propose six specific semantic features (who, what object, what action, why, when, and where) plus a text annotation feature to allow free annotations. We define the 6 Ws together with the text annotation element for convenience. It is important to emphasize that our image DS can include any number of features for each object in an extensible and modular way. In the following, we show how a variety of features can be incorporated in each object element.

<image_object id="o1" type="LOCAL">
    <img_obj_semantic_features>
        <who>Boris Yeltsin, Jiang Zemin</who>
        <what_action>Toast</what_action>
        <where>Moscow, Russia, Kremlin</where>
        <when>24 April 1997</when>
        <why>Signing of declaration</why>
        <text_annotation>Russian President Boris Yeltsin</text_annotation>
    </img_obj_semantic_features>
    <img_obj_media_features>
        <file_format>GIF</file_format>
        <file_size>49195 bytes</file_size>
        <resolution>240 175</resolution>
    </img_obj_media_features>
    <img_obj_visual_features>
        <texture>
            <tamura/>
        </texture>
    </img_obj_visual_features>
</image_object>
The feature elements will eventually include the descriptors that are selected by MPEG-7 for each feature. As an example, we have included the Tamura texture descriptor for the texture feature element in the above example. Descriptors that may be associated with some visual features are shown in Table 4.

Table 4: Examples of visual features and associated descriptors.

Feature | Descriptors
Color   | Color Histogram, Dominant Color, Color Coherence Vector, Visual Sprite Color
Texture | Tamura, MSAR, Edge Direction Histogram, DCT Coefficient Energies, Visual Sprite Texture
Shape   | Bounding Box, Binary Mask, Chroma Key, Polygon Shape, Fourier Shape, Boundary, Size, Symmetry, Orientation
3.5 Import new descriptors by code downloading

For descriptors associated with the visual features, XML enables the image DS to include links to external extraction and similarity matching code, as shown in the following example. The example also shows the way in which XML enables any number of externally defined DTDs for descriptors to be imported and combined into the image DS. This means that content from different sources using proprietary descriptors can still be searched and filtered efficiently.

<texture>
    <tamura extraction_code="..." similarity_code="...">
        <coarseness>40</coarseness>
        <contrast>40</contrast>
    </tamura>
</texture>
In summary, the object hierarchy, the entity relation graph, object set elements, image object elements and feature elements are contained in the image element (). The objects are combined hierarchically in one or more object hierarchy elements. Relations between objects can be expressed in entity relation graphs. For efficient traversal of the image description, links are provided to traverse from objects in the object set to corresponding object nodes in the object hierarchy and vice versa. The same is also true for the entity nodes in the entity relation graphs. Each object includes features that can link to external extraction and similarity matching code.
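The overall structure summarized above can be sketched as a complete, minimal image description (ids and relation values are illustrative):

```xml
<!-- Minimal complete image description; ids and values are illustrative. -->
<image>
    <image_object_set>
        <image_object id="o1" type="LOCAL" object_node_ref="on1"/>
        <image_object id="o2" type="LOCAL" object_node_ref="on2"/>
    </image_object_set>
    <object_hierarchy>
        <object_node id="on1" object_ref="o1">
            <object_node id="on2" object_ref="o2"/>
        </object_node>
    </object_hierarchy>
    <entity_relation_graph>
        <entity_relation type="SPATIAL">
            <relation>Left Of</relation>
            <entity_node id="en1" object_ref="o1"/>
            <entity_node id="en2" object_ref="o2"/>
        </entity_relation>
    </entity_relation_graph>
</image>
```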
3.6 Indexing hierarchies and multi-abstraction levels

In describing satellite images, the main problem that arises is how to describe their large number of objects in a way that facilitates retrieval based on specific feature or semantic criteria. If we choose to describe satellite images using the proposed image DS, all the objects will be pooled into the image object set. We will still need to index the objects in the object set for efficient traversal using specific criteria. In this scenario, we propose to use an object hierarchy to cluster the objects of the image based on some of the object features. The procedure is similar to the one used to cluster images in visual information retrieval engines. The intermediate nodes of the object hierarchy will be represented as image objects and will include the criterion (a condition on one or more features) used for grouping the objects at that level. The image description will include as many indexing hierarchies as relevant criteria are considered. Figure 3 shows a very simple example where image objects have been grouped based on the value of their areas (number of pixels). The XML for this example is provided in Annex B. Instead of only one feature, multiple features could have been used for the same purpose. The image description could include multiple indexing hierarchies for different indexing criteria.
The same approach could also be used to represent multiple levels of abstraction based on specific semantic or visual criteria. Continuing the example in Figure 3, we have actually defined three different levels of abstraction based on the size of the objects. Any other importance criterion could be used. The resolution level will indicate the number of levels descended in the hierarchy. All the terminal objects included in those levels will be assigned to that specific resolution layer (see Table 5).
Figure 3: a) Example of objects in satellite image. b) Indexing hierarchy based on size of objects.

Table 5: Objects in each resolution layer.

Resolution Layer | Objects
1                | o1, o2
2                | o1, o2, o3, o4
3                | o1, o2, o3, o4, o5
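The indexing hierarchy of Figure 3b) could be sketched as follows, with intermediate objects o6, o7, and o8 carrying the area criteria; the grouping shown is consistent with Table 5, the node ids are illustrative, and the full XML is given in Annex B:

```xml
<!-- Indexing hierarchy of Figure 3b); node ids are illustrative. -->
<object_hierarchy id="ih0">
    <object_node id="n6" object_ref="o6">           <!-- criterion: area -->
        <object_node id="n1" object_ref="o1"/>
        <object_node id="n2" object_ref="o2"/>
        <object_node id="n7" object_ref="o7">       <!-- criterion: area < 50 -->
            <object_node id="n3" object_ref="o3"/>
            <object_node id="n4" object_ref="o4"/>
            <object_node id="n8" object_ref="o8">   <!-- criterion: area < 10 -->
                <object_node id="n5" object_ref="o5"/>
            </object_node>
        </object_node>
    </object_node>
</object_hierarchy>
```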
3.7 Modality transcoding

In a service scenario, content broadcasters may have to transmit image content to their users. Due to the differences in terminals and bandwidth among users, the content broadcaster will need to transcode the image content into different media modalities and resolutions as needed for each specific user. A clear example is an image being transcoded into a text description to be received by a PalmPilot. An important media feature that we include in our image DS is modality transcoding. This media feature contains the modality, the resolution, and the location of transcoded versions of the same image object in different modalities and resolutions. Local objects as well as global objects can include their transcoded versions in their descriptions. Let's consider a very simple example: an image object for which text and audio transcoded versions are also available. The important pieces of XML in this example follow.
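A minimal sketch of such a description, assuming an illustrative transcoded_version element inside modality_transcoding (the exact child elements are defined in Annex A):

```xml
<!-- Image object with text and audio transcoded versions. -->
<!-- The transcoded_version element and its attributes are illustrative. -->
<image_object id="o1" type="LOCAL">
    <img_obj_media_features>
        <modality_transcoding>
            <transcoded_version modality="TEXT" location="o1_caption.txt"/>
            <transcoded_version modality="AUDIO" location="o1_caption.wav"/>
        </modality_transcoding>
    </img_obj_media_features>
</image_object>
```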
4 DOCUMENT TYPE DEFINITION OF IMAGE DESCRIPTION SCHEME

This section discusses how we used XML to design and create the document type definition (DTD) of the proposed MPEG-7 image DS. Table 1 summarizes the DTD of the proposed image DS. Annex A includes the full listing of the DTD of the image DS.

A Document Type Definition (DTD) provides a list of the elements, tags, attributes, and entities contained in the document, and their relationships to each other. DTDs specify a set of rules for the structure of a document. DTDs may be included in the file that contains the document they describe, or they may be linked from an external URL. Such external DTDs can be shared by different documents and Web sites. A DTD is included in a document's prolog, after the XML declaration and before the actual document data begins.

The first thing needed in a DTD is the root tag, which in our image DS is the <image> tag. Every tag used in a valid XML document must be declared exactly once in the DTD with an element type declaration. An element type declaration specifies the name of a tag, the allowed children of the tag, and whether the tag is empty. The first tag we have to define is the root tag, as follows:

<!ELEMENT image (image_object_set, object_hierarchy*, entity_relation_graph*)>

All element type declarations start with <!ELEMENT. They include the name of the tag being declared (image) and the allowed contents (image_object_set, object_hierarchy*, entity_relation_graph*). This declaration says that an image element must contain one image object set (<image_object_set>), zero or more object hierarchies (<object_hierarchy>), and zero or more entity relation graphs (<entity_relation_graph>)1. Then, we define the image object set element:

<!ELEMENT image_object_set (image_object+)>
<!ELEMENT image_object (img_obj_media_features?, img_obj_semantic_features?, img_obj_visual_features?)>
<!ATTLIST image_object
    type (LOCAL | GLOBAL) #REQUIRED
    id ID #IMPLIED
    object_ref IDREFS #IMPLIED
    object_node_ref IDREFS #IMPLIED>
<!ELEMENT img_obj_media_features (location?, file_format?, file_size?, resolution?, modality_transcoding?)>
<!ELEMENT img_obj_semantic_features (text_annotation?, who?, what_object?, what_action?, why?, where?, when?)>
<!ELEMENT img_obj_visual_features (image_scl?, color?, texture?, shape?, size?, position?)>
1 The asterisk mark (*) indicates zero or more occurrences. The plus sign (+) indicates one or more occurrences. The question mark (?) indicates zero or one occurrence.

The first declaration says that an image object set element (<image_object_set>) must contain one or more image object elements (<image_object>). The second declaration says that an image object element must contain optional image object media feature (<img_obj_media_features>), semantic feature (<img_obj_semantic_features>), and visual feature (<img_obj_visual_features>) elements. In addition, we have defined that the image object tag has one required attribute, type, that can only have two possible values (LOCAL, GLOBAL); and three optional attributes, id, object_ref, and object_node_ref, of type ID, IDREFS, and IDREFS, respectively. Some XML tags include attributes. Attributes are intended for extra information associated with an element (like an ID).

The last three declarations correspond to the image object media feature, semantic feature, and visual feature elements. These elements group feature elements depending on the information they provide. The media features element (<img_obj_media_features>) contains optional location, file_format, file_size, resolution, and modality_transcoding elements. The semantic features element (<img_obj_semantic_features>) contains an optional text_annotation element and the 6-W elements. The visual features element (<img_obj_visual_features>) contains optional image_scl, color, texture, shape, size, and position elements.

In the DTDs in Annex A, we have chosen to declare the image_scl, location, file_format, file_size, resolution, text_annotation, modality_transcoding, color, texture, shape, size, and position elements in external DTDs using entities, for the sake of clarity and flexibility. We will now describe how to reference a separate external DTD for each one of these elements.

In the simplest case, DTDs include all the tags used in a document. This technique becomes unwieldy with longer documents. Furthermore, we may want to use different parts of a DTD in many different places. External DTDs enable you to build large DTDs from smaller ones. That is, one DTD may link to another and in so doing pull in the elements and entities declared in the first. Smaller DTDs are also easier to analyze. DTDs are connected with external parameter entity references, as follows: