Technical Activities Forum
Content-Based Retrieval in Digital Libraries
Nabil R. Adam, Rutgers University
Aryya Gangopadhyay, University of Maryland, Baltimore County
With the recent developments in multimedia and telecommunication technologies, content-based information is becoming increasingly important for areas such as digital libraries, interactive video, and multimedia publishing (see P. Aigrain, H. Zhang, and D. Petkovic, “Content-Based Representation and Retrieval of Visual Media: A State-of-the-Art Review,” Multimedia Tools and Applications, Vol. 3). Multimedia data refers to simple structured data (such as numbers and short strings), large unstructured data (such as text documents, images, audio, and video data), and complex structured data (such as maps, graphs, charts, and tables). Here we briefly address content-based retrieval and the issues of representation, storage, and retrieval of multimedia objects in digital libraries. We then briefly identify some open areas of research. An expanded version of this column, with a more extensive list of references, is available from us (“Content-Based Multimedia Information Retrieval in Digital Libraries,” tech. report, Center for Information Management, Integration and Connectivity, Rutgers University, 1997). For more on content-based retrieval, see http://ciir.cs.umass.edu/info/ciirbiblo.html and http://www.mitre.org/resources/centers/advanced_info/g04f/bnn/mmhomeext.html.

Technical Activities Forum coordinator: Deborah Scherrer, Stanford University, HEPL4085, Stanford, CA 94305-4085; fax (650) 725-2333; [email protected]

OBJECT RETRIEVAL
Digital libraries must store and retrieve multimedia data on the basis of feature similarity. A feature is a set of characteristics. Features include text strings for text documents; color, texture, and objects for images; and objects, frame sequences, and camera operations for videos. Content-based retrieval uses content-representative metadata both to store data and to retrieve it in response to user queries. Metadata is data about the media objects stored. Manually collecting metadata is not only inefficient but also infeasible for large document spaces, so we need automatic metadata generation. Once collected, these content descriptors are linked to the physical location of data. Data-storage strategies are key to efficient retrieval. To facilitate retrieval, we classify metadata as
• Content-dependent. Metadata based on characteristics specific to the content of the media objects: for example, text strings in text documents; the color, texture, and position of objects in an image; and individual frame characteristics, such as color histograms, for video objects.
• Content-descriptive. Metadata that describes the characteristics of the media objects but cannot be generated automatically: for example, image characteristics such as the mood reflected by a facial expression and camera shot distance.
• Content-independent. Metadata that is not based on the content: for example, names of authors and years of publication.
Once retrieved, data should be presented to users in decreasing order of retrieval-status value, which is a function of the relevance and uniqueness of the features used for indexing and retrieval.

TEXT RETRIEVAL
Metadata for text documents includes content description, storage information, and historical status information. Several methods have been suggested for deriving metadata from SGML (Standard Generalized Markup Language) and digitized text documents. These include TextTiling algorithms for topic identification, multiresolution morphology for text line identification for keywords, and hidden Markov model algorithms for keyword spotting.

Several methods are available for identifying relevant text documents in response to user queries. These include searching for key index words by full-text scans, using index files, and using document clusters. Full-text scanning methods use finite-state transition diagrams to locate index words in full-text documents. Inverted index files link keywords with the documents in which they occur. The index file itself can be structured as a B-tree that links each feature with the physical storage locations of the documents. Rich Holowczak has described how information extraction was applied to the problem of document retrieval (“Extractors for Digital Library Objects,” PhD thesis, Rutgers University, 1997).

January 1998
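The inverted index file mentioned above can be sketched in a few lines of Python. This is a minimal illustration, not the structure used by any particular digital library: the function names `build_inverted_index` and `search`, the sample documents, and the use of an in-memory dictionary in place of a B-tree on disk are all assumptions made for the example.

```python
from collections import defaultdict

def build_inverted_index(documents):
    """Map each keyword to the sorted list of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            # Strip surrounding punctuation so "objects." matches "objects".
            index[word.strip(".,;:!?\"'()")].add(doc_id)
    index.pop("", None)  # drop tokens that were pure punctuation
    return {word: sorted(ids) for word, ids in index.items()}

def search(index, *keywords):
    """Return ids of documents that contain every keyword (boolean AND)."""
    postings = [set(index.get(k.lower(), ())) for k in keywords]
    return sorted(set.intersection(*postings)) if postings else []

# Hypothetical sample collection, keyed by document id.
docs = {
    1: "Digital libraries store multimedia objects.",
    2: "Inverted index files link keywords with documents.",
    3: "Digital video objects need an index.",
}
index = build_inverted_index(docs)
print(search(index, "digital", "index"))
```

In a real system the document ids would be replaced by, or mapped to, physical storage locations, and the keyword dictionary would live in a disk-resident structure such as the B-tree described in the text.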
IMAGE RETRIEVAL
Image metadata includes raster data, data about the data types and data sets that represent the image data, and image processing history. Metadata extraction is intended to use the image features in storing and retrieving the images efficiently. This involves identifying the image features, locating objects in images, and classifying the images based on the features extracted. Image-segmentation algorithms are used for locating an object within an image. Two approaches for object location are boundary detection, which isolates objects by detecting their boundaries, and region approaches, which identify the region that falls within the object.

Retrieving image data is accomplished by comparing the spatial layout and relationships among objects in the image and image features such as color and texture similarity. Retrieval methods that rely on the structural layout of the image use object boundaries and the spatial relationships among the objects contained in the boundaries. Boundary-based methods include those using minimum bounding rectangles (MBR) and the plane sweep method. Plane sweep algorithms work by sweeping an image with a horizontal and a vertical line and identifying the coordinates of the points of intersection between the lines and the objects in an image. These points, called event points, identify the spatial extents of the objects in an image. Methods for identifying spatial relationships among the objects contained in an image include 2D strings and 2D-C strings.

Image data can also be retrieved on the basis of features such as color, coarseness, contrast, and directionality of textures. Feature-based retrieval methods require a feature mapping function that measures the distance between a query image and the stored image data. For efficient retrieval, image data is clustered and indexed on the basis of feature similarity. Methods for color similarity are based on color histograms and correlations.
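As one possible illustration of color-histogram similarity, the sketch below quantizes RGB pixels into a small joint histogram and ranks stored images by their distance from a query image. The distance used here (one minus histogram intersection), the bin count, and the names `color_histogram`, `histogram_distance`, and `rank_by_color` are assumptions chosen for the example; the column does not prescribe a specific measure.

```python
def color_histogram(pixels, bins_per_channel=4):
    """Normalized joint RGB histogram of (r, g, b) tuples with values in 0..255."""
    step = 256 // bins_per_channel
    top = bins_per_channel - 1
    hist = [0.0] * (bins_per_channel ** 3)
    count = 0
    for r, g, b in pixels:
        # Clamp so that bin indices stay valid when 256 is not divisible by the bin count.
        ri, gi, bi = min(r // step, top), min(g // step, top), min(b // step, top)
        hist[(ri * bins_per_channel + gi) * bins_per_channel + bi] += 1.0
        count += 1
    return [h / count for h in hist] if count else hist

def histogram_distance(h1, h2):
    """One minus histogram intersection: 0 for identical, 1 for disjoint histograms."""
    return 1.0 - sum(min(a, b) for a, b in zip(h1, h2))

def rank_by_color(query_pixels, library):
    """Return image names ordered from most to least similar to the query."""
    q = color_histogram(query_pixels)
    scored = [(histogram_distance(q, color_histogram(px)), name)
              for name, px in library.items()]
    return [name for _, name in sorted(scored)]

# Hypothetical four-pixel "images" used only to exercise the functions.
red = [(255, 0, 0)] * 4
mostly_red = [(250, 10, 5)] * 3 + [(0, 0, 255)]
blue = [(0, 0, 255)] * 4
library = {"sunset": mostly_red, "ocean": blue, "rose": red}
print(rank_by_color(red, library))
```

Because the histogram discards pixel positions, this measure captures color content only; layout-based methods such as 2D strings address the spatial relationships that it ignores.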
VIDEO RETRIEVAL
Video object metadata can be divided into that which describes a sequence of frames and that which describes individual frames. For frame sequences, the metadata includes information corresponding to the whole video, such as camera shot heights, distances, and motions. For individual frames, the metadata includes color histograms, textures, and objects covered. Metadata about video objects is gathered by a video parser that identifies the boundaries between consecutive shots. To detect shot boundaries, a quantitative metric gauges the information content of each frame. Whenever the difference between the metric values of two consecutive frames exceeds a predetermined threshold, a boundary is detected. Examples of metrics include pixel or block comparisons and color or intensity histograms. Camera operations and object motions are captured using motion vectors and motion analysis of each block in a frame.
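The threshold test described above can be sketched as follows, using an intensity histogram as the per-frame metric. The frame representation (a flat list of gray levels), the bin count, the threshold value, and the function names are illustrative assumptions, not parameters taken from any particular video parser.

```python
def intensity_histogram(frame, bins=8):
    """Normalized histogram of a frame given as a flat list of gray levels in 0..255."""
    hist = [0.0] * bins
    for level in frame:
        hist[min(level * bins // 256, bins - 1)] += 1.0
    total = len(frame)
    return [h / total for h in hist] if total else hist

def detect_shot_boundaries(frames, threshold=0.5):
    """Return each index i where a boundary lies between frame i-1 and frame i."""
    boundaries = []
    previous = None
    for i, frame in enumerate(frames):
        current = intensity_histogram(frame)
        if previous is not None:
            # Sum of absolute bin differences between consecutive frames.
            difference = sum(abs(a - b) for a, b in zip(previous, current))
            if difference > threshold:
                boundaries.append(i)
        previous = current
    return boundaries

# Hypothetical 16-pixel frames: three dark frames followed by two bright ones.
dark, dark_variant, bright = [10] * 16, [12] * 16, [240] * 16
frames = [dark, dark_variant, dark, bright, bright]
print(detect_shot_boundaries(frames))
```

A fixed threshold of this kind responds to abrupt cuts; gradual transitions such as fades spread the change over many frames and generally need a different test.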
Metadata about video objects typically spans several frames. Such information, which is stored as intervals, can be used to index and retrieve video data in response to user queries. Data structures such as the segment index tree have been suggested for indexing and retrieval of frame-based video information.
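To make the interval representation concrete, the sketch below stores each piece of video metadata with the frame range it covers and answers a query by testing for overlap. It uses a plain linear scan rather than a segment index tree, which would serve the same queries more efficiently on large collections; the `Annotation` type, its field names, and the sample labels are hypothetical.

```python
from typing import NamedTuple

class Annotation(NamedTuple):
    start: int   # first frame covered (inclusive)
    end: int     # last frame covered (inclusive)
    label: str   # metadata attached to that frame interval

def overlapping(annotations, first, last):
    """Labels of annotations whose frame interval intersects [first, last]."""
    return [a.label for a in annotations if a.start <= last and first <= a.end]

# Hypothetical metadata for one video.
annotations = [
    Annotation(0, 99, "zoom in"),
    Annotation(50, 149, "car"),
    Annotation(200, 260, "pan left"),
]
print(overlapping(annotations, 75, 75))
```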
Feature selection
Feature selection is the process of identifying individual characteristics of stored objects that can be used in their efficient retrieval. It facilitates both indexing and retrieval. The indexing function is used to create description vectors for stored objects. This is accomplished by assigning weights to some predefined features (such as keywords for text documents). When a query is processed, a similar method is used to generate a query description vector. The retrieval function creates a retrieval status value (RSV) for each object-query pair. The retrieved objects are then presented to the user in decreasing order of RSV. At this point, users may give feedback about the relevance of the objects retrieved. This feedback can be applied using relevance feedback techniques, and the query can be further refined and reprocessed.

The weights in the object and query vectors are calculated using feature frequencies and inverted object frequencies. The feature frequency of an object denotes the number of times an indexing feature occurs in it. Inverted object frequency is an inverse function of the number of objects in which the feature occurs.
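The weighting and ranking steps can be sketched as below. The text says only that the inverted object frequency is an inverse function of the number of objects containing a feature and that an RSV is computed per object-query pair; the logarithmic form log(N/n) and the cosine similarity used as the RSV here are common choices assumed for illustration, and all function names and sample data are hypothetical.

```python
import math
from collections import Counter

def index_objects(objects):
    """Build a weighted description vector for each object from its feature list."""
    n = len(objects)
    # Number of objects in which each feature occurs at least once.
    object_freq = Counter(f for feats in objects.values() for f in set(feats))
    iof = {f: math.log(n / count) for f, count in object_freq.items()}
    vectors = {}
    for name, feats in objects.items():
        ff = Counter(feats)  # feature frequency within this object
        vectors[name] = {f: ff[f] * iof[f] for f in ff}
    return vectors, iof

def query_vector(features, iof):
    """Weight query features the same way; features unseen at indexing are ignored."""
    ff = Counter(f for f in features if f in iof)
    return {f: ff[f] * iof[f] for f in ff}

def rsv(object_vec, query_vec):
    """Retrieval status value, computed here as cosine similarity."""
    dot = sum(w * query_vec.get(f, 0.0) for f, w in object_vec.items())
    norm = (math.sqrt(sum(w * w for w in object_vec.values()))
            * math.sqrt(sum(w * w for w in query_vec.values())))
    return dot / norm if norm else 0.0

def retrieve(objects, query_features):
    """Return (RSV, object name) pairs in decreasing order of RSV."""
    vectors, iof = index_objects(objects)
    q = query_vector(query_features, iof)
    scored = [(rsv(v, q), name) for name, v in vectors.items()]
    return sorted(scored, key=lambda s: (-s[0], s[1]))

# Hypothetical objects described by keyword features.
objects = {
    "a": ["video", "index", "video"],
    "b": ["image", "color"],
    "c": ["video", "color", "texture"],
}
ranked = retrieve(objects, ["video", "index"])
print([name for _, name in ranked])
```

Relevance feedback would adjust the query vector after the user marks retrieved objects as relevant or not, and `retrieve` would then be run again with the refined query.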
About the Task Force on Digital Libraries
The IEEE Computer Society Task Force on Digital Libraries promotes activities and furthers the growth of the theory and practice of all aspects of digital libraries. The task force sponsors a semiannual newsletter and the International Journal on Digital Libraries. Issues of interest include acquiring and storing information, finding and filtering information, securing information and auditing access, providing universal access, cost management and financial instruments, and socioeconomic impact (see N. Adam and Y. Yesha, “Digital Libraries: Introduction,” Int’l J. Digital Libraries, Apr. 1996). To contribute to the newsletter, contact Sue Feldman at [email protected] or see http://cimic.rutgers.edu/ieeedln/. To join, follow the links off the CS Web page, http://computer.org.

Data storage
The storage strategy of a digital library is dictated by the type of objects stored and the real-time retrieval requirements of the stored objects. The different storage strategies can be divided into single disk, multiple disks, and multiple disks with striping. Single-disk storage involves storing all media objects on a single disk. Retrieval speed in such storage strategies is determined by the disk bandwidth. In the single-disk storage method, data blocks can be arranged contiguously, scattered randomly across disk blocks, or distributed over the disk blocks in a constrained or log-structured manner.

Multiple-disk storage systems distribute the objects across multiple disks through disk striping, which facilitates concurrent data access. In the staggered striping techniques, a disk is treated as an individual storage unit. The number of disks over which a sub-object is striped depends on the bandwidth requirement. This method facilitates retrieval of media objects with different data retrieval rates. In the network striping method, multiple servers are used to manage a group of clusters that are connected to a network. Both simple and staggered techniques are used to stripe the data. Although this method improves data transfer rates for multimedia objects, its performance is dependent on network bandwidth.

Although there have been many developments in multimedia technology, many problems remain in the field of content-based retrieval. They include the automatic extraction of multimedia object features and indexing, querying, and searching on the basis of multiple similarity features. The automatic extraction of features requires content analysis. To develop a content analyzer, we must determine the level of understanding of the content that is required to perform such analysis. New methods of similarity-based indexing, such as statistical and multidimensional analysis, should be investigated. As a first step toward developing generalized solutions, theoretical research in this area should look into specific application-oriented problems. ❖

Computer
Nabil Adam is a professor at Rutgers University and chair of the Task Force on Digital Libraries. Contact him at [email protected]. Aryya Gangopadhyay is an assistant professor at the University of Maryland, Baltimore County. Contact him at [email protected]
CS Update
Send CS news to Computer, 10662 Los Vaqueros Cir., PO Box 3014, Los Alamitos, CA 90720-1314; [email protected]

New Award to Honor Seymour Cray
The IEEE Computer Society and Silicon Graphics/Cray Research have announced that they will honor Seymour Cray’s legacy of innovation and genius with an annual award for innovation in high-performance computing. The award, announced at Supercomputing 97, has been endowed by a $280,000 gift from Silicon Graphics/Cray Research.

“Seymour Cray,” said 1997 CS President Barry W. Johnson, “was an outstanding engineer and a true pioneer in the computer industry. His many innovations helped create the supercomputing field and contributed substantially to the improvement of society in general. This award will recognize individuals who, like Cray, work creatively and diligently to provide innovative solutions to problems in computing engineering.”

The Computer Society is the ideal host for this award, said Irene Qualters, president of Cray Research and senior vice president of Silicon Graphics. “The society’s passion and dedication toward profound advances in engineering are exemplified in Seymour Cray’s career. This award is a celebration of that legacy.”

The IEEE Computer Society Seymour Cray Computer Engineering Award will be given to individuals “for innovative contributions to high-performance computing systems that best exemplify the creative spirit demonstrated by Seymour Cray.” The award will be given annually, with the first winner selected in 1998. Recipients, who will receive a cash prize and a memento, will be selected through an extensive and thorough review process. The award nomination procedures will be publicized in Computer, in other publications, and on the society’s Web site, http://computer.org.

Widely considered to be the founder of supercomputing, Cray was known for his passion for technological creativity and his constant search for new ideas. He founded Cray Research in 1972, with a proclaimed mission of designing and building the world’s most powerful and usable computers. When it was introduced in 1976, the Cray-1 supercomputer set a new standard in super-