2009 IEEE International Conference on Semantic Computing

Semantic Web for Content Based Video Retrieval

Sancho C Sebastine, Bhavani Thuraisingham, Balakrishnan Prabhakaran
Dept. of Computer Science, University of Texas at Dallas, Richardson, Texas, US
E-mail: [email protected], [email protected], [email protected]

Abstract— This paper presents a semantic web based video search engine. Currently, we do not have scalable integration platforms for representing extracted video features so that they can be indexed and searched. Indexing extracted video features is a difficult challenge, due to the diverse nature of the features and the temporal dimension of video. We present a semantic web based framework for automatic feature extraction, storage, indexing and retrieval of videos, in which videos are represented as interconnected sets of semantic resources. We also suggest a new ranking algorithm for finding related resources which can be used in a semantic web based search engine.

Keywords: semantic web, content based video retrieval, video annotation.

I. INTRODUCTION

The amount of video content being uploaded to the internet is increasing day by day. Currently, the search engines that index multimedia content, such as YouTube and Metacafe, index videos mainly by manually assigned text tags. With the huge number of videos available on the net, tagging all of them manually would be prohibitively time consuming. This increases the importance of content based search, which relies on automatic feature extraction techniques. One reason automatic feature extraction has not been used extensively is its low accuracy. Though this problem is real, we observe that even at low accuracy rates the extracted features combine to give a significant amount of information, which improves the user's search experience.

In this paper we introduce a semantic web based framework for content based video retrieval. Semantic web technology is currently used for indexing textual data [13]. We define a semantic web based layer that converts video content features into semantic web standards such as RDF, and we also define a ranking algorithm for sorting video search results in such a framework.

Before describing the framework, we explain why we chose a semantic web based framework rather than a relational database. The main reason is the dynamic nature of the underlying data to be modeled. It is hard to predict the concepts a system will learn from the features extracted from a video; as semantic concepts are learned on the fly while features are extracted, the semantic web provides a more flexible data storage technique than a relational database. The second reason is that the framework can then be extended when new feature extraction techniques are introduced, without modifying the underlying data model and structures. The third reason is scalability. As the amount of multimedia content to be indexed is huge, indexing techniques need to be scalable. Compared to a relational database, the semantic web is easier to scale, as it was built for handling very large amounts of data. Using the semantic web, we could easily scale the system by using a distributed file system and adding computers as needed. In short, content based search is an ideal candidate for a semantic web based implementation.

978-0-7695-3800-6/09 $26.00 © 2009 IEEE. DOI 10.1109/ICSC.2009.49

II. RELATED WORK

Significant work has been done on content based retrieval systems. A query-by-image-content system is proposed in [8] and [9], a video content based retrieval system in [10], and a database approach for modeling and querying video content in [11] and [12]. We propose the novel idea of a framework for multimedia content extraction and retrieval using the semantic web. The semantic web has previously been used for indexing textual data; Swoogle [13] is a semantic web based search engine for text documents. In [18], Jane Hunter presents an MPEG-7 ontology to represent extracted video features. We found this approach limited, as it is specific to MPEG-7 descriptors, while the feature extraction techniques available today use multiple standards besides MPEG-7. In this paper we present a generic ontology which concentrates on the representation of the extracted features of a video.


III. DESIGN

The semantic web community has extensively used RDF for indexing textual data. Fig 1 shows our approach for indexing video content features. We define a Video to RDF Mapper (Fig 2) which maps a video file into a corresponding RDF file; the RDF file is a representation of the extracted feature contents of the video. To do this conversion, the Mapper uses a Semantic TBOX, which we define in this paper. (TBOX is the term used in the semantic web community for the structural description, or ontology.)

Fig 1: Our approach for indexing videos using the semantic web.

A. Video to RDF Mapping

To convert a video into an annotated RDF file, the first step is segmentation: a video is segmented into scenes, and scenes are further segmented into frames. Feature extraction is then performed for each frame. In this paper we extract six kinds of features: text, face, color, shape, texture and audio. Once the features are extracted, the next stage is to accumulate them and convert them into a semantic format such as RDF; we develop a Semantic TBOX (Fig 3) for representing the extracted video feature contents. Once the RDF resources are generated, we refine the generated tags. In a video, some features are specific to a frame, while others persist over a longer span such as a scene or even the whole video. We therefore collect the tags that are common to all children of a node and move those descriptions up to the parent node, as sketched below. This processing maps a video file to its corresponding RDF file.
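The tag-refinement step can be pictured as a bottom-up pass over the resource tree. The following is a minimal sketch under our own simplifications (the paper does not publish code; Node and its string descriptors are hypothetical stand-ins for the RDF resources):

```java
import java.util.*;

// Minimal sketch of the tag-refinement step: descriptors common to
// every child of a node are moved up to the node itself.
// "Node" is a hypothetical stand-in for a Video/Scene/Shot resource.
class Node {
    final Set<String> descriptors = new HashSet<>();
    final List<Node> children = new ArrayList<>();
}

class TagRefiner {
    static void promoteCommonDescriptors(Node node) {
        if (node.children.isEmpty()) return;
        for (Node child : node.children) promoteCommonDescriptors(child); // bottom-up
        // Intersect the descriptor sets of all children.
        Set<String> common = new HashSet<>(node.children.get(0).descriptors);
        for (Node child : node.children) common.retainAll(child.descriptors);
        // Move the shared descriptors to the parent.
        node.descriptors.addAll(common);
        for (Node child : node.children) child.descriptors.removeAll(common);
    }
}
```

Running this on the root Video resource would, for instance, promote a speaker's name detected in every frame of a scene up to the scene itself.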

Fig 2: Video to RDF Mapper.

B. Semantic TBOX

In [18], video features are converted to RDF using MPEG-7 standard descriptors. On examining current video feature extraction techniques, however, we found that MPEG-7 descriptors are not commonly used. Hence, we define an ontology that accommodates multiple standards. We refer to this ontology as the TBOX, following semantic web terminology; Fig 3 gives an overview. In the TBOX we define two types of resources, Video and Descriptor. A Video is represented as a collection of Scenes, a Scene is a collection of Shots, and a Shot is a collection of Frames. Any of the resources Video, Scene, Shot and Frame can be related to any number of Descriptors, which describe the extracted features of a video: a Text descriptor describes textual content, while a MediaDescriptor describes visual and audio content. Several standards exist today for representing media features, and all media descriptors are subclasses of the MediaDescriptor element. Fig 3 shows how the MPEG-7 [15] and CEDD [16] standards are represented in the ontology; new standards can be incorporated by extending the MediaDescriptor class. Fig 4 gives the RDF representation of a video using the TBox, and Fig 5 gives a sample representation of a frame with its extracted features in RDF.
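To illustrate what such a TBox amounts to in code, here is a hedged sketch that declares the class hierarchy of Fig 3 with the Jena framework [19]; the namespace URI and the property names are our assumptions, since the paper publishes only the figures:

```java
import org.apache.jena.rdf.model.*;
import org.apache.jena.vocabulary.RDF;
import org.apache.jena.vocabulary.RDFS;

public class TBoxSketch {
    // Hypothetical namespace; the paper does not publish one.
    static final String NS = "http://example.org/video-tbox#";

    public static void main(String[] args) {
        Model m = ModelFactory.createDefaultModel();
        // Core resource classes from Fig 3.
        Resource video = m.createResource(NS + "Video").addProperty(RDF.type, RDFS.Class);
        Resource scene = m.createResource(NS + "Scene").addProperty(RDF.type, RDFS.Class);
        Resource shot  = m.createResource(NS + "Shot").addProperty(RDF.type, RDFS.Class);
        Resource frame = m.createResource(NS + "Frame").addProperty(RDF.type, RDFS.Class);
        // Descriptor hierarchy: text and media descriptors, with
        // standard-specific media subclasses (MPEG-7, CEDD).
        Resource descriptor = m.createResource(NS + "Descriptor").addProperty(RDF.type, RDFS.Class);
        m.createResource(NS + "TextDescriptor").addProperty(RDFS.subClassOf, descriptor);
        Resource mediaDesc = m.createResource(NS + "MediaDescriptor").addProperty(RDFS.subClassOf, descriptor);
        m.createResource(NS + "MPEG7Descriptor").addProperty(RDFS.subClassOf, mediaDesc);
        m.createResource(NS + "CEDDDescriptor").addProperty(RDFS.subClassOf, mediaDesc);
        // Structural properties: a video is a collection of scenes, and so on.
        Property hasScene = m.createProperty(NS, "hasScene");
        hasScene.addProperty(RDFS.domain, video).addProperty(RDFS.range, scene);
        Property hasShot = m.createProperty(NS, "hasShot");
        hasShot.addProperty(RDFS.domain, scene).addProperty(RDFS.range, shot);
        Property hasFrame = m.createProperty(NS, "hasFrame");
        hasFrame.addProperty(RDFS.domain, shot).addProperty(RDFS.range, frame);
        Property hasDescriptor = m.createProperty(NS, "hasDescriptor");
        hasDescriptor.addProperty(RDFS.range, descriptor);
        m.write(System.out, "RDF/XML"); // serialize the TBox
    }
}
```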

C. Indexing Multiple Video Files

To index multiple video files, each video file is first mapped onto its corresponding RDF file. This scheme provides scalability, as each video can be processed separately and converted to its RDF file independently. We use the Google File System to store the RDF files. In the next stage, an indexer indexes the resources present in all the RDF files. Separate indices are created for each type of feature, in order to facilitate feature based searching: the capability of searching on one or more features, such as a face based search or a text based search. We create separate indexes for text, audio, face, color and shape. An overview of the indexing scheme is given in Fig 6.
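The per-feature indices map naturally onto one Lucene [7] index per feature. A minimal sketch, assuming modern Lucene APIs and our own field names (the paper used an earlier Lucene version and does not give its schema):

```java
import java.io.IOException;
import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

// Sketch: one Lucene index per feature type, so queries can be
// restricted to a single feature (e.g. text-only or face-only search).
public class FeatureIndexer {
    private final IndexWriter writer;

    public FeatureIndexer(String feature) throws IOException {
        // A separate index directory per feature: index/text, index/audio, ...
        FSDirectory dir = FSDirectory.open(Paths.get("index", feature));
        writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()));
    }

    // Index one RDF resource: its URI plus the textual content of its descriptor.
    public void add(String resourceUri, String descriptorText) throws IOException {
        Document doc = new Document();
        doc.add(new StringField("uri", resourceUri, Field.Store.YES));
        doc.add(new TextField("content", descriptorText, Field.Store.YES));
        writer.addDocument(doc);
    }

    public void close() throws IOException { writer.close(); }
}
```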



D. Ranking Algorithm for Resources

A query for videos returns a list of videos related to the query. As the number of videos available on the net is huge, the results cannot be expected to fit in one page, so a ranking algorithm is needed to rank the videos and present the user with the top results. A naive approach is to rank the videos by the number of times the query word appears in them; as this is not effective and can lead to manipulation of search results, we define a new ranking algorithm based on the Google Normalized Distance [5].

The ranking algorithm sorts the videos according to their relative importance. After its conversion to RDF, a video can be considered a collection of child resources, each with a set of properties called descriptors. The child resources can be Shots, Scenes, Frames or Descriptors. The distance between two resources is defined recursively. As the base case, we define a distance measure between two descriptors D1 and D2 that are leaf nodes of the RDF tree and have no children; this measure is given in (1). The distance is calculated only for textual descriptors; visual descriptors are ignored in the rank calculation. We define GND(D1, D2) as the Google normalized distance [5] between the textual content of descriptors D1 and D2.

$$\mathrm{Distance}(D_1, D_2) = \mathrm{GND}(D_1, D_2) \qquad (1)$$

Now we give the recursive definition of the distance measure for resources that have one or more children. Let D1 and D2 be two resources with N1 and N2 children respectively, and let Child(D, i) denote the i-th child of D. The distance between D1 and D2 is the sum, over the children of D1, of each child's distance to the closest child of D2:

$$\mathrm{Distance}(D_1, D_2) = \sum_{i=1}^{N_1} \; \min_{1 \le j \le N_2} \mathrm{Distance}(\mathrm{Child}(D_1, i), \mathrm{Child}(D_2, j)) \qquad (2)$$
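A direct transcription of (1) and (2) follows; this is a sketch only, where Resource is a hypothetical view of an RDF node and gnd() stands in for a Google Normalized Distance lookup, which requires live search-engine hit counts:

```java
import java.util.List;

// Sketch of the recursive distance of equations (1) and (2).
// gnd() stands in for the Google Normalized Distance of [5];
// Resource is a hypothetical view of an RDF node.
interface Resource {
    boolean isTextDescriptor();      // leaf with textual content
    String text();
    List<Resource> children();
}

class ResourceDistance {
    static double distance(Resource d1, Resource d2) {
        // Base case (1): leaf textual descriptors are compared by GND.
        if (d1.children().isEmpty() && d2.children().isEmpty()) {
            if (d1.isTextDescriptor() && d2.isTextDescriptor())
                return gnd(d1.text(), d2.text());
            return 0.0; // visual descriptors are ignored in the rank calculation
        }
        // Recursive case (2): sum, over the children of d1, of the
        // minimum distance from that child to any child of d2.
        double sum = 0.0;
        for (Resource c1 : d1.children()) {
            double best = Double.POSITIVE_INFINITY;
            for (Resource c2 : d2.children())
                best = Math.min(best, distance(c1, c2));
            sum += best;
        }
        return sum;
    }

    static double gnd(String t1, String t2) {
        throw new UnsupportedOperationException("GND lookup against search-engine hit counts");
    }
}
```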

Equations (1) and (2) give a recursive definition of the distance between any two resources in the framework. For ranking, we calculate the distance measure between videos: if there are N videos in the system, we compute an N×N matrix A where A(i, j) gives the distance between the i-th and j-th videos. The videos can be viewed as a graph whose vertices are the videos and whose edge weights are the distances given by (2). We perform a random walk on this graph to calculate the ranking of the videos using (3), where the constant d, the probability of jumping from one vertex to another, is empirically set to 0.85. The ranking is given by a linear array VR of size N calculated by (3).
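The random walk can be implemented as a standard damped (PageRank-style) power iteration. In the sketch below, the conversion of the distances in A into normalized transition weights, via 1/(1 + distance), is our assumption; the paper does not spell out the normalization:

```java
// Sketch of the random-walk ranking (3): a PageRank-style power
// iteration over the video graph. Deriving transition weights from
// the distance matrix A via 1/(1+distance) is our assumption.
public class VideoRank {
    static double[] rank(double[][] A, double d, int iterations) {
        int n = A.length;
        // Edge weights: closer videos get larger weight.
        double[][] w = new double[n][n];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                if (i != j) w[i][j] = 1.0 / (1.0 + A[i][j]);
        // Column-normalize so each video distributes its score.
        for (int j = 0; j < n; j++) {
            double col = 0;
            for (int i = 0; i < n; i++) col += w[i][j];
            if (col > 0) for (int i = 0; i < n; i++) w[i][j] /= col;
        }
        double[] vr = new double[n];
        java.util.Arrays.fill(vr, 1.0 / n);
        for (int it = 0; it < iterations; it++) {
            double[] next = new double[n];
            for (int i = 0; i < n; i++) {
                double in = 0;
                for (int j = 0; j < n; j++) in += w[i][j] * vr[j];
                next[i] = (1 - d) / n + d * in; // d = 0.85 in the paper
            }
            vr = next;
        }
        return vr;
    }
}
```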

Fig 3: RDFS graph of the TBox

Fig 4: RDF representation of a video resource.

$$VR = (1 - d)\,\mathbf{1} + d\,W\,VR \qquad (3)$$

where W is a transition matrix derived from the distance matrix A. The array VR gives the ranking of each video; when we obtain a query result, we sort it in decreasing order of rank. The distance matrix A can also be used to find related videos for a given video. Related videos are those that a user watching a particular video might also be interested in watching; when a user watches a video in a search engine such as YouTube, it presents the user with a set of related videos.

Fig 5: RDF representation of a sample Scene resource

To find the related videos for a video, we read off its distances to all other videos in the database from matrix A, sort them in increasing order, and display the top results, as sketched below.
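A small sketch of that lookup, assuming the whole N×N distance matrix is precomputed as in the paper:

```java
import java.util.Comparator;
import java.util.stream.IntStream;

// Sketch: related videos for video v are the nearest neighbours in
// the distance matrix A, i.e. the smallest entries of row v.
public class RelatedVideos {
    static int[] related(double[][] A, int v, int k) {
        return IntStream.range(0, A.length)
                .filter(j -> j != v)                        // skip the video itself
                .boxed()
                .sorted(Comparator.comparingDouble(j -> A[v][j]))
                .limit(k)
                .mapToInt(Integer::intValue)
                .toArray();
    }
}
```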

Fig 6: Overview of the indexing scheme for multiple videos.

IV. IMPLEMENTATION

A. Overview

We have implemented the framework to test its performance, addressing four main challenges: segmentation, feature extraction, RDF generation and indexing.

1) Segmentation -- We use a clustering based approach [14] for segmenting a video into frames. A video is clustered into scenes using K-Means clustering. Deciding the value of K is itself a challenge; we initially perform a hierarchical clustering and stop it prematurely to find an approximate K, as sketched after this list.

2) Feature Extraction -- We extract six kinds of features from a video: Text, Face, Audio, Color, Shape and Texture. These were chosen for their importance to a user searching for a video. Most available techniques perform poorly on real-world video data, mainly because of the poor quality of the videos and the noise associated with them. We devise preprocessing techniques that enable the existing feature extraction techniques to be used on such data; a detailed description is given in Section IV.B.

3) RDF Generation -- After feature extraction, the next task is to convert the features into an RDF model. We use the Jena framework [19], a Java based framework commonly used in the semantic web world to create RDF models.

4) Indexing -- We use Lucene [7] based indexing for the RDF resources, with one Lucene index per feature. Querying through Lucene is fast enough to be practical in real time, as shown by our experiments in Section V. We use query caching, a popular technique used by search engines, to reduce the load on the server for repetitive queries. The index and the RDF files are stored in a GFS based network file system, not in a database.
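For the K estimation in step 1), a minimal sketch follows; the merge criterion and the stopping threshold are our assumptions, as the paper only states that hierarchical clustering is stopped prematurely:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the K estimation step: run agglomerative clustering on
// frame feature vectors and stop prematurely once the cheapest merge
// becomes too expensive; the surviving cluster count approximates K
// for K-Means. The threshold heuristic is our assumption.
public class KEstimator {
    static int estimateK(List<double[]> frames, double mergeThreshold) {
        List<List<double[]>> clusters = new ArrayList<>();
        for (double[] f : frames) {
            List<double[]> c = new ArrayList<>();
            c.add(f);
            clusters.add(c);                     // start with singletons
        }
        while (clusters.size() > 1) {
            int bi = -1, bj = -1;
            double best = Double.POSITIVE_INFINITY;
            for (int i = 0; i < clusters.size(); i++)
                for (int j = i + 1; j < clusters.size(); j++) {
                    double d = centroidDistance(clusters.get(i), clusters.get(j));
                    if (d < best) { best = d; bi = i; bj = j; }
                }
            if (best > mergeThreshold) break;    // stop prematurely: clusters are far apart
            clusters.get(bi).addAll(clusters.remove(bj));
        }
        return clusters.size();                  // approximate K
    }

    static double centroidDistance(List<double[]> a, List<double[]> b) {
        double[] ca = centroid(a), cb = centroid(b);
        double s = 0;
        for (int k = 0; k < ca.length; k++) s += (ca[k] - cb[k]) * (ca[k] - cb[k]);
        return Math.sqrt(s);
    }

    static double[] centroid(List<double[]> c) {
        double[] m = new double[c.get(0).length];
        for (double[] p : c) for (int k = 0; k < m.length; k++) m[k] += p[k];
        for (int k = 0; k < m.length; k++) m[k] /= c.size();
        return m;
    }
}
```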

B. Feature Extraction Techniques

Text, Face, Audio, Color, Shape and Texture are the six features extracted in our implementation. This section describes the techniques used to extract each of them.

1) Text: Text represents the textual content in a video or an image. The main technique for extracting it is an OCR engine; we use the open source software Tesseract [1]. Though Tesseract does a good job on scanned documents, it performs very poorly when used directly on video frames. We devised a set of preprocessing techniques that improve the accuracy significantly. A video is first split into a series of key frames, and each frame is stored as an image file. The preprocessing then proceeds in stages, sketched in code below. In the first stage the image is converted to gray scale. In the second stage we apply a Gaussian difference edge detector; highlighting the character edges in this way significantly increases OCR accuracy. In the third stage we apply sharpening and histogram based thresholding, with the threshold set from the histogram of the image. In the final stage we use Google Suggest and WordNet to correct and filter out words with small spelling mistakes. Table 1 gives the experimental results on a test set of ten videos with a total duration of five hours and ten minutes, containing 965 words in all; accuracy rises to 59.067% on this test set.
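A sketch of stages 1-3 on a decoded frame follows. The kernel size, the box-blur approximation of the Gaussian difference, and the mean-based threshold are our assumptions; the paper reports the stages but not their parameters:

```java
import java.awt.image.BufferedImage;

// Sketch of the OCR preprocessing stages applied before Tesseract.
public class OcrPreprocessor {
    // Stage 1: reduce the frame to gray scale.
    static int[][] toGray(BufferedImage img) {
        int w = img.getWidth(), h = img.getHeight();
        int[][] g = new int[h][w];
        for (int y = 0; y < h; y++)
            for (int x = 0; x < w; x++) {
                int rgb = img.getRGB(x, y);
                int r = (rgb >> 16) & 0xff, gr = (rgb >> 8) & 0xff, b = rgb & 0xff;
                g[y][x] = (int) (0.299 * r + 0.587 * gr + 0.114 * b);
            }
        return g;
    }

    // Stage 2: difference-of-Gaussians edge detector, approximated here
    // by subtracting a 3x3 box blur (highlights character edges).
    static int[][] dogEdges(int[][] g) {
        int h = g.length, w = g[0].length;
        int[][] e = new int[h][w];
        for (int y = 1; y < h - 1; y++)
            for (int x = 1; x < w - 1; x++) {
                int blur = 0;
                for (int dy = -1; dy <= 1; dy++)
                    for (int dx = -1; dx <= 1; dx++) blur += g[y + dy][x + dx];
                e[y][x] = Math.abs(g[y][x] - blur / 9);
            }
        return e;
    }

    // Stage 3: histogram-based thresholding to a black-and-white image.
    static int[][] threshold(int[][] g) {
        long sum = 0; int n = 0;
        for (int[] row : g) for (int v : row) { sum += v; n++; }
        int t = (int) (sum / n);             // simple mean-of-histogram threshold
        int h = g.length, w = g[0].length;
        int[][] bw = new int[h][w];
        for (int y = 0; y < h; y++)
            for (int x = 0; x < w; x++) bw[y][x] = g[y][x] > t ? 255 : 0;
        return bw;
    }
    // Stage 4 (not shown): spell-correct the OCR output against a dictionary.
}
```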

2) Audio: Audio is extracted with the open source speech recognition engine Sphinx 3.0 [2]. The average accuracy of speech recognition is poor; nevertheless, speech recognition is a very important component of video feature extraction.

3) Face: Face recognition is a field with many accepted published techniques. Our framework tries to be as generic as possible so as to accommodate different techniques and standards: a new face detection technique is accommodated by extending FaceDescriptor. For example, an HMM based face detector is modeled as HMMFaceDescriptor, a subclass of FaceDescriptor, and other techniques can be modeled similarly. We browsed the web through the Google search engine API to collect information about the individuals appearing in a video, and modeled each individual as a semantic resource by identifying the person with a unique URI and attaching images and other related information such as name, occupation and related keywords.

4) Color, Shape and Texture: Standards for representing the color, shape and texture of an image include MPEG-7 and CEDD. To add a new standard, one simply extends MediaDescriptor and writes one's own descriptor, as sketched below. We currently support the MPEG-7 and CEDD descriptors, including the color, shape and texture descriptors these standards define.
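A hedged sketch of that extension point, reusing the hypothetical namespace of the earlier TBox sketch (class names follow the examples in the text; MyNewColorDescriptor is invented for illustration):

```java
import org.apache.jena.rdf.model.*;
import org.apache.jena.vocabulary.RDFS;

// Sketch: plugging a new feature-extraction standard into the TBox
// by subclassing the appropriate descriptor class.
public class DescriptorExtension {
    static final String NS = "http://example.org/video-tbox#";

    public static void main(String[] args) {
        Model m = ModelFactory.createDefaultModel();
        // A new HMM-based face detector is modeled as a subclass of FaceDescriptor...
        Resource faceDesc = m.createResource(NS + "FaceDescriptor");
        m.createResource(NS + "HMMFaceDescriptor").addProperty(RDFS.subClassOf, faceDesc);
        // ...and a new visual standard as a subclass of MediaDescriptor.
        Resource mediaDesc = m.createResource(NS + "MediaDescriptor");
        m.createResource(NS + "MyNewColorDescriptor").addProperty(RDFS.subClassOf, mediaDesc);
        m.write(System.out, "TURTLE");
    }
}
```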

                          Words correctly identified    % of words correctly identified
  Without preprocessing              20                             2.072
  Stage 1                            70                             7.254
  Stage 2                           410                            42.487
  Stage 3                           523                            54.197
  Stage 4                           570                            59.067

Table 1: Improvement of text extraction accuracy at different levels of preprocessing (965 words in the test set).

V. EXPERIMENTS AND RESULTS

A. Content Based Search

The first experiment demonstrates the effectiveness of the framework. For this purpose, we indexed most of the technical presentations on YouTube and Metacafe: 1345 videos with a total size of 320 GB, approximately 1240 hours of raw video data. Content based search is expected to give results of much higher quality than text based search, because each video is indexed by additional content such as text, audio and faces. Here we use this collection of technical presentations to show that content based search gives higher quality results than normal video search. We measure precision and recall for three sample queries; before each query, we manually determined how many videos in the database are relevant to it. The three queries are:
1. Learning in Single layer neural network
2. Active RDF examples
3. Properties of Amino Acids
As the content for these queries appears only inside the slides of the videos, YouTube returned zero results for the exact queries. Fig 7 plots the precision recall graph for the three queries, generated by gradually increasing the size of the result set from one to ten. The framework achieves high precision, and recall increases as the number of videos in the result set grows.
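For reference, the points of such a curve are computed as below; this is a sketch, and the result-set and relevance-set representations are our choice:

```java
import java.util.List;
import java.util.Set;

// Sketch of how the precision-recall points of Fig 7 are computed:
// for each result-set size k = 1..maxK, precision = relevant results
// returned / k, and recall = relevant results returned / all relevant.
public class PrecisionRecall {
    static double[][] curve(List<String> ranked, Set<String> relevant, int maxK) {
        double[][] pr = new double[maxK][2];
        int hits = 0;
        for (int k = 1; k <= maxK; k++) {
            if (k <= ranked.size() && relevant.contains(ranked.get(k - 1))) hits++;
            pr[k - 1][0] = (double) hits / relevant.size(); // recall
            pr[k - 1][1] = (double) hits / k;               // precision
        }
        return pr;
    }
}
```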


Fig 7: Precision-Recall graph for the three sample queries (X-axis: recall; Y-axis: precision).

B. Performance Testing

For performance testing we use the same test set as in the first experiment (Section V.A) and measure the time taken for both indexing and retrieval. We use two computers with a combined disk space of 1 TB: 3.0 GHz Pentium 4 PCs with 2.0 GB RAM running Linux, with the GFS [5] distributed file system and HBase [6] handling the distribution of tasks across the network. All time measurements are in milliseconds. Table 2 shows the time taken for indexing on one machine and on a cluster of two machines; in the two machine case, the indexing task is split between the machines using MapReduce [4].

                               Single machine (ms/MB)    Two machines (ms/MB)
  Downloading                          2000                      1480
  Segmentation                           10                         6
  Feature Extraction                     35                        25
  Indexing                                5                         4
  Total (without downloading)            50                        35

Table 2: Performance comparison of a single machine and a cluster of two machines (processing time per MB of video data).

The main bottleneck is the download speed of the internet connection; but since videos, unlike web pages, rarely change, the initial time taken to download a video can be overlooked. Among the remaining stages, feature extraction is the biggest bottleneck. This experiment shows that although feature extraction and video processing are time consuming tasks, our framework is scalable and processes videos in real time at around 50 ms per MB of data. Because a video seldom changes, indexing rarely needs to be repeated once it is done; this is a major advantage of video indexing over normal web page indexing. Another important parameter of a search engine is the average query time: over 100 sample queries, the average search time was 190 ms. We also tested the scalability of the system by splitting the videos into two subsets and giving each machine one subset to process, as sketched below; by increasing the number of computers in the framework, we can scale the system to index a large number of videos. As the number of videos is gradually increased from 100 to 1200 (Fig 8), the overall indexing time is significantly reduced with a cluster of two machines. For faster feature extraction, more computers can be added to the cluster.
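The split itself is simple data parallelism. The following stand-in sketch partitions the video list across workers using plain Java threads, where the paper used MapReduce [4] across machines:

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Simplified stand-in for the MapReduce split of Table 2: the video
// set is partitioned and each worker (one per machine in the paper,
// a thread here) maps its videos to RDF and indexes them independently.
public class ParallelIndexer {
    static void indexAll(List<String> videoFiles, int workers) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        int chunk = (videoFiles.size() + workers - 1) / workers;
        for (int w = 0; w < workers; w++) {
            List<String> part = videoFiles.subList(
                    Math.min(w * chunk, videoFiles.size()),
                    Math.min((w + 1) * chunk, videoFiles.size()));
            pool.submit(() -> part.forEach(ParallelIndexer::mapAndIndex));
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.DAYS);
    }

    static void mapAndIndex(String videoFile) {
        // segment -> extract features -> RDF -> Lucene index (see Sections III and IV)
    }
}
```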

Fig 8: Comparison of the time taken for the video test set by a single machine and a cluster of two machines.

VI. CONCLUSION & FUTURE WORK

The semantic web provides a very flexible framework for content based multimedia retrieval, and the real time performance of our framework is reasonable for practical search engines. The improvement in search result quality from content based search is shown through experimental results. The semantic web can serve as a good integration platform for content based retrieval: though state-of-the-art feature extraction techniques have poor individual accuracy, a combination of features gives information that can perform better than manual tagging. Owing to limited resources, these experiments were conducted on only two machines and 320 GB of video data; more machines should be clustered and tested with a larger number of videos. The framework is available on the project home page [3].

VII. REFERENCES

[1] Tesseract OCR software. URL: http://code.google.com/p/tesseract-ocr/
[2] The CMU Sphinx Group Open Source Speech Recognition Engines. URL: http://cmusphinx.sourceforge.net/html/cmusphinx.php
[3] Project home page and source code: http://sites.google.com/site/sanchohomesite/projects/semantic-video-indexing
[4] J. Dean, S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters", Proceedings of the 6th Symposium on Operating Systems Design & Implementation, Vol. 6, pp. 10-10, 2004.
[5] S. Ghemawat, H. Gobioff, S.-T. Leung, "The Google File System", 19th ACM Symposium on Operating Systems Principles, Lake George, NY, October 2003.
[6] HBase. URL: http://hadoop.apache.org/hbase/
[7] Lucene. URL: http://lucene.apache.org/java/docs/
[8] M. Flickner, H. Sawhney, W. Niblack et al., "Query by Image and Video Content: The QBIC System", IEEE Computer, 28, Sept. 1995, pp. 23-32.
[9] J. R. Smith, S.-F. Chang, "VisualSEEk: A Fully Automated Content-Based Image Query System", Proceedings of the ACM International Conference on Multimedia, pp. 87-93, New York, US, 1996.
[10] S.-F. Chang, W. Chen, H. Meng, H. Sundaram, D. Zhong, "A Fully Automated Content Based Video Search Engine Supporting Spatio-Temporal Queries", IEEE Transactions on Circuits and Systems for Video Technology, Vol. 8, No. 5, Sept. 1998.
[11] C. Decleir, M.-S. Hacid, J. Kouloumdjian, "A Database Approach for Modelling and Querying Video Data", LTCS-Report 99-03, 1999.
[12] J. Z. Li, M. T. Ozsu, D. Szafron, "Modeling of Video Spatial Relationships in an Object Database Management System", Proc. of Int. Workshop on Multimedia Database Management Systems, 1996, pp. 124-132.
[13] Swoogle. URL: http://swoogle.umbc.edu/
[14] B. Günsel, A. M. Ferman, A. M. Tekalp, "Temporal Video Segmentation Using Unsupervised Clustering and Semantic Object Tracking", Journal of Electronic Imaging, Vol. 7, 1998.
[15] F. Pereira, ed., ISO/MPEG N4320, MPEG-7 Requirements Document, v. 15, MPEG Requirements Group, Sydney, July 20.
[16] S. Chatzichristofis, Y. Boutalis, "Color and Edge Directivity Descriptor: A Compact Descriptor for Image Indexing and Retrieval", Computer Vision Systems, pp. 312-322, 2008.
[17] M. Lux, S. A. Chatzichristofis, "LIRe: Lucene Image Retrieval - An Extensible Java CBIR Library", Proceedings of the 16th ACM International Conference on Multimedia, Vancouver, British Columbia, Canada, pp. 1085-1088, 2008.
[18] J. Hunter, "Adding Multimedia to the Semantic Web - Building an MPEG-7 Ontology", International Semantic Web Working Symposium (SWWS), 2001.
[19] Jena. URL: http://jena.sourceforge.net/

