TV Broadcast Macro-Segmentation: Metadata-Based vs. Content-Based Approaches

Sid-Ahmed Berrani, Patrick Lechat, Gaël Manson
Orange Labs – France Telecom Division R&D – Technologies
4, rue du Clos Courtel, 35510 Cesson-Sévigné, France

ABSTRACT
In this paper, we study different approaches for TV broadcast macro-segmentation, which is needed for many novel services such as TV-on-Demand. TV broadcast macro-segmentation can be performed either using metadata associated with the stream or by directly analyzing the audio-visual stream. This paper presents both approaches and analyzes their advantages and limitations with respect to applications. It then presents an experimental study conducted on real data covering more than 5 months of broadcast. Two types of metadata have been considered and a video identification technique has been developed. The obtained results show in particular the effectiveness of the content-based solution and highlight the imprecision and limitations of metadata.

Categories and Subject Descriptors
H.3.m [Information Storage and Retrieval]: Miscellaneous; I.2.10 [Artificial Intelligence]: Vision and Scene Understanding—Video Analysis

General Terms
Algorithms, Experimentation

Keywords
TV Broadcast, macro-segmentation, indexing

1. INTRODUCTION

The significant increase in the number of TV channels and the amount of digital video content, together with the diversification of broadcast possibilities and storage devices, has recently given rise to many new services and TV program consumption schemes. These new services basically aim at making audio-visual content available to users without any constraints on location and/or time. For example, with the development of Digital Video Recorders (DVR), users can now stop watching a program and carry on watching it at a later, more convenient time (a time-shifting service). Users can also store or upload their favorite movies onto their Portable Multimedia Players (PMP) and watch them anytime, anywhere (video podcast). Another highly interesting service is TV-on-Demand (TVoD), which allows users to watch TV in a non-linear manner: the principle is to give them the possibility to access past TV programs from a potentially high number of channels and to compose their own program guide. In addition to these services, the explosion of audio-video data has also created the need to index and store the content. Audio-visual documents are segmented and annotated; in particular, key events in the video are detected and described, which allows the user to browse and retrieve them at a later date. In general, the objective is twofold: to re-use the audio-visual content or simply to archive it. Basically, all these services and needs require:

1. Performing a macro-segmentation of audio-video streams, i.e. precisely extracting programs,

2. Classifying them into a set of predefined categories,

3. Describing them with an appropriate amount of metadata that allows users to navigate within the set of available content.

Each of these three steps represents a research field in itself. In this paper, we focus on the first step in the context of TV broadcasts, i.e. how to precisely extract programs from TV streams. A TV broadcast is composed of an audio-visual stream and a metadata stream. The metadata stream is added to provide specific textual information describing the audio-visual stream. Among these metadata, we can find Closed Caption, the Electronic Program Guide (EPG), the Event Information Table (EIT) and possibly teletext. This information is not always broadcast; its availability depends on the standard and the transmission mode (Hertzian/Satellite/DSL, Analog/Digital...).

Macro-segmentation can therefore be performed based on audio-visual content [8, 9, 11], on metadata [16], or on both [14]. The objective of the paper is to analyze content-based and metadata-based approaches, to compare them, and to provide an experimental study giving evidence of their advantages and limitations. This experimental study uses TV data collected over more than 5 months. The rest of the paper is organized as follows. Section 2 discusses audio-visual stream macro-segmentation in general. Sections 3 and 4 focus respectively on the two main approaches, metadata-based and content-based solutions. Section 5 presents the experimental study we conducted to show the advantages and limitations of each approach. Section 6 concludes the paper and discusses future extensions.

2. TV STREAM MACRO-SEGMENTATION: AN OVERVIEW

The basic objective of a TV stream macro-segmentation system is to precisely segment the stream, to isolate inter-programs (IP), such as jingles, trailers or advertisements, and to precisely delimit useful programs. TV stream macro-segmentation has many applications, both for personal and professional usage. In the personal context, extracting useful programs allows a user to automatically record his/her favorite movie and watch it at a time other than the one in the channel's schedule, which is not necessarily the most convenient for him/her. It also allows the user to skip advertisements and to automatically control the access to some programs, for children for instance. In a professional context, macro-segmentation can be used to facilitate the archiving of TV programs, to build a TV-on-Demand service, or to customize programs and/or advertisements depending on the region of broadcast and even at the user level. In general, to be useful in real-world applications, macro-segmentation methods have to fulfill the following main requirements:

• Effectiveness: The stream should be accurately segmented. Extracting an incomplete program is in general completely useless,

• Efficiency: Processing time should ideally be real-time, or at least bounded by a constant shift, that is, segmentation results are returned with a constant delay after each event (start or end of each program). Otherwise, results would be delivered with an ever-increasing delay.

• Automaticity: Segmentation should not require any manual annotation or user input. Annotating TV streams is very time consuming and any required manual processing would make the system impractical, in particular when handling a large number of TV broadcasts.

To perform macro-segmentation, there are mainly two approaches. The first one consists in using the metadata associated with the TV broadcast, when it is available. This metadata can be used as it is, or within an expert system able to correct it and increase its reliability. The second approach is content-based: it relies on analyzing the audio-visual signal to find program boundaries. The general TV broadcast macro-segmentation scheme is depicted in Figure 1. In addition to the metadata possibly available in the TV broadcast, other metadata like the Electronic Program Guide can be retrieved from the web. Manual user annotation can also be solicited in the macro-segmentation process, as will be shown later in the paper. Roughly speaking, the metadata-based approach suffers from the imprecision and the non-exhaustiveness of metadata; moreover, these two drawbacks have never been experimentally studied. On the other hand, content-based approaches are generally not fully automatic and require user support to provide ground truth. In the following two sections, these two approaches are studied in depth.

3. METADATA-BASED APPROACHES

As explained before, there are mainly two classes of metadata: broadcast metadata associated with the audio-visual stream, and metadata that can be retrieved from specialized websites which gather electronic program guides.

The metadata associated with the stream varies depending on standards and transmission modes (Analog/Digital). In the case of analog TV, metadata were1 available within teletext (or Closed Caption in the US). Teletext encloses many data such as news or weather forecasts, but also static information on the program guide. In addition, European standard teletext could also include Program Delivery Control (PDC) [1]. PDC2 is a system that controls suitably equipped video recorders by using hidden codes in the teletext service. These codes allow the user to precisely control the record start and end times of a specific program. In digital mode, the broadcast metadata are called Event Information Tables (EIT) and are of two types:

1. EIT schedule: Contains the TV program schedule over many days.

2. EIT present and follow: Contains the details (start and end time, title, and possibly a summary) of the program currently being broadcast and of the following one.

While EIT "present and follow" is generally available, the EIT "schedule" is rarely available.

1 Analog TV being in the process of becoming defunct.
2 An equivalent service named VPS (Video Programming System) exists in some EU countries (e.g. Czech Republic).
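To make the role of EIT "present and follow" concrete, the sketch below shows one possible way such entries could be represented and turned into a naive, metadata-only segmentation. The field and function names are hypothetical and do not come from any DVB library.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Tuple

@dataclass
class EitEvent:
    # One hypothetical EIT "present and follow" entry; field names are illustrative.
    title: str
    start: datetime
    duration: timedelta
    summary: str = ""

def naive_metadata_segmentation(events: List[EitEvent]) -> List[Tuple[datetime, datetime, str]]:
    """Turn announced EIT events into (start, end, title) segments.

    The announced times are trusted as-is, which is precisely the weakness
    measured in Section 5: they are static and often off by several minutes.
    """
    segments = []
    for ev in sorted(events, key=lambda e: e.start):
        segments.append((ev.start, ev.start + ev.duration, ev.title))
    return segments
```

As the experiments of Section 5 show, trusting these announced times directly yields program boundaries that are frequently off by several minutes.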

[Figure 1 diagram: the TV broadcast (metadata + audio-visual stream), together with additional metadata, feeds the macro-segmentation module, whose output is the sequence ... IP | Program 1 | IP | Program 2 | IP ...]
Figure 1: TV broadcast macro-segmentation scheme.

Apart from PDC, both for digital and analog TV, broadcast metadata are static, that is, they are not updated or modified to take into account any delay or change that may occur in the broadcast with respect to the program planning.

Unfortunately, PDC cannot be used for macro-segmentation, for several reasons. The main problem is that PDC is very rarely provided, as it allows users to skip advertisements, which currently represent the main source of income for TV channels. Another limitation of PDC is that, although it is well defined and standardized for analog TV, it is still not clear how it can be used for digital TV.

Metadata available on the web are typically program guides provided by channels or made available on their websites. Many companies (such as mobivillage or emap) also provide a service in which they gather EPGs from a large number of channels and make them available on a single server that can be directly queried through the web. In general, both kinds of metadata are widely used despite their imprecision and non-exhaustiveness. Apart from traditional techniques for metadata aggregation and fusion [7], which could be used to enrich them and increase their accuracy, only very few studies (among which [16]) have proposed novel ways to make use of these metadata. In [16], Poli et al. propose a statistical predictive approach that corrects an EPG using a model learned from a ground truth built over one year of broadcast. The approach is based on a simple observation: channels have to respect a certain regularity in their program planning in order to preserve and increase their audience. The main drawback of this approach is the ground truth required for training. It is needed for each channel, as program planning differs from one channel to another, and it is very difficult and very expensive to collect. Poli's study was feasible because it was conducted at INA, the French National Audiovisual Institute3. On the other hand, the model does not take into account the program planning of special events that may occur without any regularity from one year to the next (e.g. political events, sports competitions...).

3 INA is in charge of indexing and archiving French channels (www.ina.fr).
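As a rough illustration of the regularity idea behind such predictive correction (this is a toy sketch, not the statistical model of [16]), one could predict a program's delay from its historical behavior on the same weekday and use it to adjust the announced EPG time:

```python
from datetime import datetime, timedelta
from statistics import median
from typing import Dict, List, Tuple

def corrected_start(title: str, announced: datetime,
                    history: Dict[Tuple[str, int], List[float]]) -> datetime:
    """Adjust an announced start time using past behavior of the same program.

    `history` maps (title, weekday) to the offsets, in seconds, between the
    announced and the effective start times observed in the past. The median
    offset is used as the predicted delay. Purely illustrative.
    """
    offsets = history.get((title, announced.weekday()), [])
    if not offsets:
        return announced
    return announced + timedelta(seconds=median(offsets))
```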

4. CONTENT-BASED APPROACHES

Content-based approaches directly use the audio-visual content. In general, the first level of segmentation is only signal-based, that is, it only detects basic transitions in the signal. A more elaborate analysis is then performed on top of this to properly delimit and extract programs. Existing approaches can be classified into the two following classes.

Using Inter-Programs
The first class relies on the detection of inter-programs (advertisements, trailers...) that naturally structure the TV broadcast. Generally, a useful program is preceded and followed by advertisements, trailers, etc. Therefore, it can be accurately extracted if the adjacent inter-programs are detected. The macro-segmentation problem has been considered in this manner because inter-programs are relatively easy to characterize and to detect (compared to programs, which are very heterogeneous). Inter-programs also have the very important property of being redundant in the TV broadcast. Therefore, they can also be detected in an unsupervised manner, as redundant sequences. Lienhart et al. [13] propose a set of techniques to detect advertisements. The proposed criteria and features include, among others (a small sketch of the first criterion is given after this list):

• Monochrome frames: These frames are used by many TV channels to separate two consecutive advertisements. [13] proposes to study the intensity standard deviation of each frame and to compute static decision thresholds,

• Scene breaks: The frequency and the style of cuts (hard cuts or fades) are analyzed,

• Action: The edge change ratio and the motion vector length are analyzed.
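A minimal version of the monochrome-frame criterion could look like the following sketch; the threshold is an arbitrary placeholder, not the calibrated value of [13]:

```python
import numpy as np

def is_monochrome(frame: np.ndarray, std_threshold: float = 5.0) -> bool:
    """Flag a frame as (near-)monochrome when its intensity spread is tiny.

    `frame` is a grayscale image as a 2-D array; the threshold is an
    illustrative static value, not the one used in [13].
    """
    return float(frame.std()) < std_threshold
```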

Hauptmann et al. [10] propose an approach similar to [13] within a general system for story segmentation in broadcast news. Albiol et al. [2] present a system that labels shots either as advertisement shots or as program shots. The system relies on two observations, logo presence and shot duration, which are modeled using HMMs. Duan et al. [6] have recently proposed a multimodal approach for advertisement detection and categorization: detection is performed using visual and audio features to find advertisement shots, while categorization and identification are done using text extracted from the scenes with OCR. All these approaches are limited to advertisements and are therefore not sufficient to perform macro-segmentation. To detect all kinds of inter-programs, another solution consists in exploiting their common property, redundancy. Indeed, inter-programs are generally broadcast several times a day. A few recent papers have used this property to segment a TV stream. In particular, C. Herley [11] detects inter-programs as repeating objects (RO) through a correlation study of audio features: at time t, the current object (an audio segment of predefined length) is compared with a stored buffer of past content of fixed size in order to detect any possible correlation. The method imposes a few constraints on the length of the ROs and on the depth of the search (i.e. the length of the search buffer). Covell et al. [5] propose a similar approach: redundant objects are detected from audio features using a hashing-based method, and detections are then verified using visual information. Again, these techniques are not sufficient on their own to perform macro-segmentation, because of their technical constraints and required internal parameters (in particular the sequence and buffer lengths).
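The following sketch illustrates the repetition-based principle only (it is not Herley's actual ARGOS implementation): the descriptor of the segment just observed is correlated against a bounded buffer of past descriptors.

```python
import numpy as np
from collections import deque

def find_repetition(current: np.ndarray, past: deque, corr_threshold: float = 0.95):
    """Return the index in `past` of a segment correlating with `current`, or None.

    `current` is the descriptor vector of the segment just observed; `past` is
    a bounded buffer (a deque with maxlen) of previously seen descriptor
    vectors of the same length. The threshold is illustrative.
    """
    for idx, old in enumerate(past):
        # Pearson correlation between the two descriptor vectors.
        c = np.corrcoef(current, old)[0, 1]
        if c >= corr_threshold:
            return idx
    return None

# Usage sketch: keep only the last N segments, since the buffer depth is bounded.
buffer = deque(maxlen=10_000)
```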

Using a Reference Database
The other class of content-based approaches relies on a reference database (DB) in which a set of reference audio-visual sequences is stored. These sequences include inter-programs, jingles and opening/closing credits (possibly with the corresponding program details). Macro-segmentation then becomes a content-based, real-time sequence identification task in an audio-visual stream (Figure 2). Identifying inter-programs from the reference DB allows isolating programs; jingles and credits allow labeling the extracted programs (a sketch of how detected inter-programs translate into program boundaries is given below). Methods following this principle use audio or video fingerprinting [9, 12] to detect in the stream the reference sequences stored in the DB. Perceptual hashing can also be used [15, 3, 4]. Naturel et al. [14] propose a hybrid technique: in addition to the content-based identification of stored and labeled shots, a time warping procedure is used to correct the EPG.

These approaches have two main drawbacks, both related to the reference database. First, this database has to be created manually for each channel. Second, the database has to be periodically updated as new programs and inter-programs are introduced.
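Independently of how inter-programs are detected, the following sketch (hypothetical helper, not the pipeline of [14]) shows how detected inter-program intervals translate into candidate program segments:

```python
from typing import List, Tuple

def programs_from_interprograms(ip_intervals: List[Tuple[float, float]],
                                stream_end: float) -> List[Tuple[float, float]]:
    """Derive candidate program segments from detected inter-program intervals.

    `ip_intervals` are (start, end) times, in seconds, of detected
    inter-programs (ads, trailers, jingles); everything between two
    consecutive inter-programs is taken as a candidate useful program.
    This is a schematic view of the principle only.
    """
    programs = []
    cursor = 0.0
    for ip_start, ip_end in sorted(ip_intervals):
        if ip_start > cursor:
            programs.append((cursor, ip_start))
        cursor = max(cursor, ip_end)
    if cursor < stream_end:
        programs.append((cursor, stream_end))
    return programs
```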

5. EXPERIMENTAL STUDY

The two previous sections presented metadata-based and content-based approaches for TV broadcast macro-segmentation. They highlighted advantages and limitations of both approaches, which can be summarized in the two following points:

1. Metadata allow performing macro-segmentation in a fully automatic manner, but the results are imprecise and incomplete,

2. Content-based approaches are not fully automatic but can be very accurate, at least for very specific applications.

In this section, experiments supporting the analysis of the previous sections are presented. This is the second main contribution of the paper. The objective is to show the limitations of metadata on real data collected from a French TV channel over more than 5 months. A content-based solution is also tested and compared to metadata. The performed experiments are summarized in the following two subsections.

5.1 Exp. 1: Accuracy of EIT and EPG

The objective of this experiment is to study the accuracy of the EIT and the EPG, and to compare them. A 24-hour TV broadcast has been recorded, segmented and labeled manually. Inter-programs (advertisements, trailers...) have been isolated and 49 useful programs have been identified. These programs have been split into two subsets: a subset of short programs, whose length is less than 10 minutes, and the rest of the programs. We established this distinction in order to study metadata accuracy and reliability w.r.t. program length. As will be shown later, statistics differ depending on program length.

                          GT    EPG    EIT
Number of short prog.     24      5     22
Number of other prog.     25     26     25
Total                     49     31     47

Table 1: Statistics on the number of programs manually detected (GT), present in the EPG and in the EIT.

We can notice in Table 1 that neither the EPG nor the EIT contains all the programs. The EPG is less complete than the EIT; in particular, it does not mention most of the short programs. We also noticed that 3 of the EPG entries and 4 of the EIT entries do not correspond to any program in the broadcast.

[Figure 2 diagram: the TV broadcast stream feeds an online content-based recognition module backed by a reference DB, which outputs, for each program n, its start time, end time and duration.]

Figure 2: Macro-segmentation scheme using a content-based solution with a reference database.

To measure the accuracy of the EPG and the EIT, the absolute value of the difference between the announced start (resp. end) time and the ground truth (GT) has been studied. Over the considered programs, the mean (µ) and the standard deviation (σ) have been computed. In this evaluation, we have considered only programs present both in the EPG (resp. EIT) and in the ground truth (GT). We have also studied the difference between the EIT and the EPG information, using the same criteria (mean and standard deviation of the differences of start and end times), and again considering only programs present in both the EPG and the EIT.
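For reference, the evaluation criterion amounts to something like the following sketch (the container names are illustrative):

```python
from datetime import datetime
from statistics import mean, pstdev
from typing import Dict, Tuple

def start_time_error_stats(announced: Dict[str, datetime],
                           ground_truth: Dict[str, datetime]) -> Tuple[float, float]:
    """Mean and standard deviation of |announced start - true start|, in seconds.

    `announced` and `ground_truth` map a program title to its start time;
    only programs present in both are considered, as in the evaluation
    protocol of Section 5.1.
    """
    diffs = [abs((announced[p] - ground_truth[p]).total_seconds())
             for p in announced if p in ground_truth]
    return mean(diffs), pstdev(diffs)
```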


The obtained results are summarized in Table 2. Statistics are first presented for all the programs, then per kind of program.


All the programs
                # of prog.   Start time µ   Start time σ   End time µ   End time σ
EPG vs. GT          28          6m 23s         8m 04s        7m 57s       6m 36s
EIT vs. GT          43          4m 14s         2m 51s        4m 13s       5m 30s
EIT vs. EPG         29          4m 49s         4m 54s        7m 35s       5m 18s

Short programs
                # of prog.   Start time µ   Start time σ   End time µ   End time σ
EPG vs. GT           4          1m 42s         1m 07s        4m 25s       3m 59s
EIT vs. GT          20          3m 33s         2m 09s        2m 15s       2m 05s
EIT vs. EPG          4          2m 18s         0m 05s        4m 48s       2m 54s

Other programs
                # of prog.   Start time µ   Start time σ   End time µ   End time σ
EPG vs. GT          24          7m 29s         8m 32s        8m 55s       6m 44s
EIT vs. GT          23          4m 51s         3m 17s        6m 01s       6m 57s
EIT vs. EPG         25          5m 14s         5m 11s        8m 02s       5m 31s

Table 2: Accuracy of EPG and EIT, and difference between EIT and EPG.

Table 2 shows that the difference between the effective start time and the one provided in the EPG or the EIT is very large, and can in many cases be greater than the duration of the program itself. The table also shows that the standard deviation is very large, which means that the EPG and the EIT are not only inaccurate but also irregular. We can also notice that the difference between the EPG and the EIT is as large as the difference between either of them and the GT. This suggests that fusing the EPG and the EIT would not be of any benefit. To get an idea of the absolute difference between the effective start (resp. end) time and the information provided in the metadata, we have also plotted the histogram of the start (resp. end) time differences w.r.t. GT, shown in Figure 3.

[Figure 3 diagram: histogram of the number of programs per start/end time shift bin (< 10s, 10-30s, 30s-1min, 1-2min, 2-5min, > 5min) w.r.t. GT.]

Figure 3: Histogram of shift times in the EIT.

Figure 3 shows, among other things, that more than 40% of the programs start more than 5 minutes earlier or later than announced in the EIT.

5.2 Exp. 2: Content-based vs. EIT/EPG

To compare metadata with a content-based solution, we have developed a video identification technique. We have then focused on a set of programs and studied the accuracy of their detection w.r.t. the EPG and the EIT over more than 5 months of TV broadcast. In the following, we first describe the video identification technique, then we present and analyze the obtained results.

Video Identification (VideoID)

We have chosen to develop a video identification technique that uses a reference database. This kind of technique can be easily evaluated: a set of program credits has to be added to the reference DB and, when a detection occurs, it just has to be verified. The technique relies on two description levels. In the first level, the visual stream is sub-sampled and each selected image is described using a coarse DCT-based descriptor. In the second level, the visual stream is segmented into shots; keyframes are then extracted from each shot and described using a finer DCT-based descriptor. Other color- and texture-based descriptors could also be used.

Detecting a sequence in the visual stream is performed as depicted in Figure 4. The coarse descriptors of a target sequence are used to compute a correlation with the visual stream. When the cumulated correlation is greater than a threshold, an alarm is raised and the second description level is used to corroborate the alarm or invalidate it.
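The sketch below illustrates this two-level scheme schematically; the descriptor arrays, the threshold and the validation callback are placeholders rather than the actual implementation:

```python
import numpy as np

def detect_sequence(stream_coarse: np.ndarray, target_coarse: np.ndarray,
                    validate, alarm_threshold: float = 0.9):
    """Slide the target's coarse descriptors over the stream's descriptors.

    `stream_coarse` has shape (T, d) and `target_coarse` shape (L, d);
    `validate(t)` re-checks a candidate position using the finer, shot-level
    descriptors. Threshold and normalization are illustrative.
    """
    T, L = len(stream_coarse), len(target_coarse)
    tgt = target_coarse / (np.linalg.norm(target_coarse) + 1e-9)
    detections = []
    for t in range(T - L + 1):
        win = stream_coarse[t:t + L]
        win = win / (np.linalg.norm(win) + 1e-9)
        score = float((win * tgt).sum())        # cumulated correlation
        if score > alarm_threshold and validate(t):
            detections.append(t)                # alarm confirmed by the second level
    return detections
```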

[Figure 4 diagram: the descriptors of an occurrence of the sequence to detect are correlated, over a sliding correlation window, with the descriptors of the video stream; when the correlation exceeds the threshold, a first alarm is raised and checked during a validation period before the result is output.]

Figure 4: Sequence identification within the visual stream.

Results
Start credits of 24 programs have been described and stored in the reference DB in order to be detected in the stream using the method described previously. The study has been conducted on a 5-month broadcast of a French channel (September 2006 – January 2007). During this period, the EIT and the EPG have been saved in order to be compared with the content-based detections. Apart from a very few cases where, due to capture or broadcast technical problems, we were not able to evaluate the detection results of referenced programs, precision is almost 100%. Recall is more difficult to evaluate, as we would have to go through the full 5 months of broadcast. Nevertheless, for periodical programs such as the news, for which it can easily be evaluated, recall is also almost 100%. The accuracy of the EPG and the EIT w.r.t. the content-based detection results has been evaluated in the same manner as in Section 5.1. As only start credits have been stored in the reference DB, the comparisons concern only start times. We also noticed that two programs never appear in the EPG; they have therefore not been taken into account in the evaluation. Table 3 summarizes the obtained results. This table shows again that the EPG and the EIT are very imprecise, even for regular programs such as the news and TV shows. Figure 5 shows the histograms of the differences between the VideoID detections and the EPG and EIT; these confirm the previous results. Note that, in the histogram computed over all the programs, the EPG and the EIT cannot be directly compared, as the number of programs taken into account for the EIT is greater than the one for the EPG.

                      News                A show              All prog.
                    µ        σ          µ        σ          µ         σ
EPG vs. VideoID   2m 16s    36s       2m 37s   1m 47s     3m 38s   10m 18s
EIT vs. VideoID   1m 18s   4m 40s     2m 17s   2m 00s     2m 35s    4m 29s

Table 3: VideoID vs. EPG and EIT.

[Figure 5 diagram: histograms of the number of detections per shift-time bin (< 10s, 10-30s, 30s-1min, 1-2min, 2-5min, > 5min) for the EIT and the EPG w.r.t. VideoID, with one panel for all programs and one for the news.]

Figure 5: Histogram of shift times in EIT and EPG w.r.t. VideoID.

6. CONCLUSIONS AND PERSPECTIVES

In this paper, we have presented different approaches to perform macro-segmentation of a TV broadcast. These approaches have been classified into two categories: the first one is based on metadata, associated with the stream or available on the web; the second one is content-based, that is, it directly analyzes the audio-visual stream in order to segment it. Each approach has been analyzed and its advantages and limitations have been highlighted. This analysis has then been corroborated by an experimental study using real data collected over more than 5 months, which makes the obtained results very reliable. The paper shows that the metadata-based solution is not sufficient for real-world applications that need TV stream macro-segmentation: metadata are very imprecise and incomplete. Content-based solutions, on the other hand, allow precisely segmenting the stream. However, they generally require off-line manual annotations in order to create the reference database used for macro-segmentation, which limits their potential applications. They nevertheless remain interesting for extracting a target set of programs or for monitoring a stream. This study also gives a basis to guide future work in the domain. In particular, hybrid methods using both metadata and content-based solutions may be a very promising research direction: this would remove the need to manually create a reference DB by making use of metadata. In this case, metadata would be used only to annotate automatically segmented streams.

7. REFERENCES

[1] Television systems; Specification of the domestic video Programme Delivery Control system. European Standard (Telecommunications series), ETSI EN 300 231, April 2003.
[2] A. Albiol, M. Ch, F. Albiol, and L. Torres. Detection of TV commercials. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Montreal, Quebec, Canada, volume 3, pages 541–544, May 2004.
[3] J. Barr, B. Bradley, and B. Hannigan. Using digital watermarks with image signatures to mitigate the threat of the copy attack. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, Hong Kong, volume 3, pages 69–72, April 2003.
[4] B. Coskun and B. Sankur. Robust video hash extraction. In Proceedings of the IEEE 12th Signal Processing and Communications Applications Conference, Vienna, Austria, pages 292–295, September 2004.
[5] M. Covell, S. Baluja, and M. Fink. Advertisement detection and replacement using acoustic and visual repetition. In Proceedings of the IEEE International Workshop on Multimedia Signal Processing, Victoria, BC, Canada, October 2006.
[6] L.-Y. Duan, J. Wang, Y. Zheng, J. S. Jin, H. Lu, and C. Xu. Segmentation, categorization, and identification of commercial clips from TV streams using multimodal analysis. In Proceedings of ACM Multimedia, Santa Barbara, CA, USA, October 2006.
[7] I. Foster and R. L. Grossman. Data integration in a bandwidth-rich world. Communications of the ACM, 46(11):50–57, 2003.
[8] J. M. Gauch and A. Shivadas. Finding and identifying unknown commercials using repeated video sequence detection. Computer Vision and Image Understanding, 103(1):80–88, 2006.
[9] J. Haitsma and T. Kalker. A highly robust audio fingerprinting system. In Proceedings of the International Symposium on Music Information Retrieval, Bloomington, Indiana, USA, October 2002.
[10] A. Hauptmann and M. Witbrock. Story segmentation and detection of commercials in broadcast news. In Proceedings of the Advances in Digital Libraries Conference, Santa Barbara, CA, USA, pages 168–179, April 1998.
[11] C. Herley. ARGOS: Automatically extracting repeating objects from multimedia streams. IEEE Transactions on Multimedia, 8(1):115–129, 2006.
[12] A. Joly, C. Frelicot, and O. Buisson. Content-based video copy detection in large databases: A local fingerprints statistical similarity search approach. In Proceedings of the IEEE International Conference on Image Processing, Genova, Italy, volume 1, pages 505–508, September 2005.
[13] R. Lienhart, C. Kuhmunch, and W. Effelsberg. On the detection and recognition of television commercials. In Proceedings of the IEEE International Conference on Multimedia Computing and Systems, Ottawa, Ontario, Canada, pages 509–516, June 1997.
[14] X. Naturel, G. Gravier, and P. Gros. Fast structuring of large television streams using program guides. In Proceedings of the 4th International Workshop on Adaptive Multimedia Retrieval, Geneva, Switzerland, July 2006.
[15] J. Oostveen, T. Kalker, and J. Haitsma. Feature extraction and a database strategy for video fingerprinting. In Proceedings of the 5th International Conference on Recent Advances in Visual Information Systems, Hsin Chu, Taiwan, pages 117–128, March 2002.
[16] J.-P. Poli. Predicting program guides for video structuring. In Proceedings of the 17th IEEE International Conference on Tools with Artificial Intelligence, Hong-Kong, China, pages 407–411, November 2005.
