Mining Video Editing Rules in Video Streams

Yuya Matsuo, Miki Amano, Kuniaki Uehara
Department of Computer and System Engineering, Kobe University
Nada, Kobe 657-8501, Japan
[email protected], [email protected], [email protected]
ABSTRACT
Data mining is a technique for discovering useful patterns, or patterns of special interest, as explicit knowledge from a vast quantity of data. In video editing there are many editing patterns, and depending on the editor's preferences, different editing patterns achieve a variety of effects. Discovering these editing patterns is worthwhile because it reveals each editor's skills and allows them to be reused when editing new video material. In this paper, we propose methods for extracting editing rules from video streams by introducing data mining techniques. We can edit video material by applying the extracted rules; the edited video may attain the same quality as the video from which the patterns were extracted.
Keywords
Data Mining, Video Grammar, Video Material, Video Editing, Shot Size.

1. INTRODUCTION
A huge amount of multimedia information, including video, is becoming prevalent as a result of advances in multimedia computing technologies and high-speed networks. Due to its high information and entertainment capability, video is rapidly becoming one of the most popular media. A video editor connects fragments of video material to convey a certain meaning. As the video material itself has no meaning, editing is necessary to make the video meaningful and attractive. However, there are many ways of connecting the fragments. Moreover, when video is edited to precisely convey the editor's intention to a viewer, it must obey some universal rules. We call these rules "video grammar". Professional video editors, such as broadcasting station staff, use such rules. In documentary films, variety shows, and other TV programs, the number of editing rules is limited. However, depending on the editor's preferences, the edited video will produce a different effect even if the same video material is edited. We may be able to find patterns particular to each video type. For example, in a scene where two speakers A and B are talking, a shot containing both A and B often appears every three shots to show who is talking to whom; such periodicity can be discovered. Apart from periodic patterns, there may be continuous patterns spanning multiple cuts. We try to extract such editing rules particular to each type of video content. To extract these rules, metadata such as shot size and camerawork has to be extracted and indexed to the video. Until now, such extraction of editing rules has been done manually. In this paper, we propose methods to automatically extract editing rules specific to each video by using data mining.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
Multimedia'02, December 1-6, 2002, Juan-les-Pins, France.
Copyright 2002 ACM 1-58113-620-X/02/0012…$5.00.

2. VIDEO GRAMMAR
The video grammar is a set of rules that define how shots are connected. The rules are described in the same manner as a conventional sentence grammar. The basic element to which the video grammar applies is a group of shots. We first define a shot. A cut is a physically continuous section in which the camera starts recording at the beginning and stops at the end. A shot, on the other hand, is a logically continuous section within a cut in which the shot size and camerawork are uniquely defined, as shown in Fig. 1. Therefore, one or more shots are included in one cut.
Figure 1: An example of a cut including three shots.
The shot size is selected according to the distance from the camera to the objects. Shot sizes are classified into loose shot (LS), medium shot (MS) and tight shot (TS), as shown in Fig. 2. Compared with MS, TS and LS are shots taken closer to and farther from the object, respectively. A full shot is a shot in which all the objects are included; it is used as a master shot in the editing process. The following video grammar concerns these shot sizes:
Rule (1): Two shots whose sizes are extremely different, such as TS and LS, cannot be connected to each other.
Rule (2): The first shot of a scene must be a master shot.
Shot duration plays an important role in conveying the meaning of a shot. For example, if shots with slow motion continue for a long time, the audience becomes bored. On the other hand, if shots with quick motion are played in a short time, the audience cannot understand the director's intention. To avoid these situations, the following grammar applies to shot duration.
Rule (3): The durations of LS, MS and TS must be about 6, 4 and 2.5 seconds, respectively.
Camerawork means camera movement, such as pan, zoom and follow, where follow means that the camera tracks a moving object. It is difficult to identify an important object in a pan, zoom or follow shot. To avoid this difficulty, the following grammar applies.
Rule (4): Pan and zoom shots should be surrounded by fixed shots, each of which continues for more than 1 second.
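Rules such as these can be checked mechanically on an annotated shot sequence. The following is a minimal sketch; the Shot record, the tolerance value and the function names are our own illustration, not part of any existing system:

```python
# A sketch of checking video-grammar rules on an annotated shot
# sequence. Shot, IDEAL_DURATION and the tolerance are illustrative
# assumptions, not an existing system's API.
from dataclasses import dataclass

@dataclass
class Shot:
    size: str        # "LS", "MS", or "TS"
    duration: float  # seconds

# Rule (3): recommended durations per shot size (seconds).
IDEAL_DURATION = {"LS": 6.0, "MS": 4.0, "TS": 2.5}

def violates_rule1(a: Shot, b: Shot) -> bool:
    """Rule (1): TS and LS must not be connected directly."""
    return {a.size, b.size} == {"LS", "TS"}

def violates_rule3(shot: Shot, tolerance: float = 1.5) -> bool:
    """Rule (3): a shot's duration should be near its ideal value."""
    return abs(shot.duration - IDEAL_DURATION[shot.size]) > tolerance

def check(shots):
    """Return a list of human-readable rule violations."""
    problems = []
    for i, (a, b) in enumerate(zip(shots, shots[1:])):
        if violates_rule1(a, b):
            problems.append(f"Rule 1 violated at cut {i}: {a.size} -> {b.size}")
    for i, s in enumerate(shots):
        if violates_rule3(s):
            problems.append(f"Rule 3 violated at shot {i}: {s.size} lasts {s.duration}s")
    return problems

shots = [Shot("LS", 6.0), Shot("TS", 2.5), Shot("MS", 4.0)]
print(check(shots))  # the LS -> TS cut violates Rule (1)
```

A real checker would also need Rules (2) and (4), which require scene boundaries and camerawork annotations.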
Figure 2: An example of different shot sizes.

3. DATA MINING METHOD
Data mining techniques are required to discover new and useful patterns for video editing. Fig. 3 shows an example of a video stream. As shown in Fig. 3, it is a matrix-like form with a row for each attribute of the video grammar and a column for the composition of each cut. A row consists of s1 (Shotsize), s2 (Camerawork) and s3 (Duration). In Fig. 3, the first column (cut) indicates that "shotsize is LS, camerawork is fix, and duration is 1.5 seconds", while the second column (cut) indicates that "shotsize is MS, camerawork is fix, and duration is 3 seconds".

Figure 3: Video stream indexed by the metadata.
We use this matrix for pattern extraction. Two contrasting approaches can be taken on this matrix: one analyzes its rows [1] [3], the other analyzes its columns [4]. This section defines two approaches that detect editing patterns in a video stream:
1. Extraction of patterns from the multi-symbol stream shown in Fig. 3. This is the approach based on column analysis.
2. Extraction of periodic patterns in the time-series data formed by the row of shot sizes. This is the approach based on row analysis.
By using these approaches, we extract patterns that contain elements of the video grammar and are useful for editing video material.

3.1 Method 1: Pattern Detection in Multi-stream
3.1.1 The definition of a pattern
This method detects frequent patterns in a multi-stream consisting of attributes such as LS, MS, fix and pan. Here, the attributes are the shot size, the camerawork and the duration of a cut: shotsize can be LS, MS or TS; camerawork can be fix, pan or zoom; duration is the length of the cut. First, we define the notation of a pattern. For example, the patterns that occur frequently in Fig. 3 are shown in Fig. 4. This pattern removes elements, such as the camerawork attribute in the second column, that take multiple values and so cannot be set uniquely. We consider the relation between s1 (shotsize) and s3 (duration), and express each pattern as a set of 2-tuples of the form τ = (s, t). Each 2-tuple specifies a token value s for the stream of shot sizes and a range value t for the stream of durations. A pattern over multiple cuts can be expressed by arranging such tuples in time-series order. In the case of Fig. 4, it can be expressed as pi = {(LS, t < 3), (MS, 2 ≤ t ≤ 4), (TS, -)}. We use the symbol "-" to denote an element that cannot be identified. The duration of a cut is classified according to the distribution of cut lengths; for example, t = 3 is classified into 2 ≤ t ≤ 4.
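This 2-tuple notation can be sketched directly in code. In the following illustration (the names and the lambda-based range encoding are our own assumptions), each pattern element pairs a shot-size token with a duration predicate, and "-" marks the unidentifiable element:

```python
# A sketch of the 2-tuple pattern notation tau = (s, t): a shot-size
# token plus a duration range, with "-" for an element that cannot
# be identified. Names and encoding are our own illustration.
WILDCARD = "-"

# The pattern from Fig. 4: {(LS, t < 3), (MS, 2 <= t <= 4), (TS, -)}
pattern = [("LS", lambda t: t < 3),
           ("MS", lambda t: 2 <= t <= 4),
           ("TS", WILDCARD)]

def matches_at(cuts, pattern, i):
    """Does the pattern match the cut stream starting at position i?"""
    if i + len(pattern) > len(cuts):
        return False
    for (size, rng), (cut_size, duration) in zip(pattern, cuts[i:]):
        if cut_size != size:
            return False
        if rng is not WILDCARD and not rng(duration):
            return False
    return True

# Cut stream: (shotsize, duration in seconds), as in Fig. 3.
cuts = [("LS", 1.5), ("MS", 3.0), ("TS", 2.0), ("MS", 3.5)]
print([i for i in range(len(cuts)) if matches_at(cuts, pattern, i)])  # [0]
```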
Figure 4: Examples of extraction of the patterns.
In pattern extraction, pi is an identifier of each pattern, and (xi, yi) indicates that the pattern yi is preceded by the pattern xi. We define two parameters, "support" and "confidence", as measures of the importance of a pattern, shown in equations (1) and (2):

    sup(pi) = cnt({xi, yi})                    (1)

    conf(pi) = cnt({xi, yi}) / cnt(xi)         (2)
sup(pi) is the number of times that yi is preceded by xi, that is, the number of times {xi, yi} appears in the video stream; we call sup(pi) "support". cnt(xi) is the number of times the pattern xi appears in the video stream. conf(pi) is the conditional probability of yi appearing when xi appears; we call conf(pi) "confidence". We consider a pattern useful if it satisfies thresholds on support and confidence, which we call "min_sup" and "min_conf", respectively.
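The two measures in equations (1) and (2) can be sketched as follows. This is a simplified illustration over a single shot-size stream (the paper's actual patterns also carry duration ranges); the function names are our own:

```python
# A sketch of support and confidence for a pattern p = {x, y},
# where y directly follows x in the stream.
# cnt(x) counts occurrences of x; sup(p) counts occurrences of x
# immediately followed by y; conf(p) = sup(p) / cnt(x).
def count_pattern(stream, x):
    """cnt(x): number of occurrences of the subsequence x."""
    n = len(x)
    return sum(1 for i in range(len(stream) - n + 1)
               if stream[i:i + n] == x)

def support_confidence(stream, x, y):
    """Return (sup, conf) for the pattern 'x followed by y'."""
    sup = count_pattern(stream, x + y)
    cnt_x = count_pattern(stream, x)
    conf = sup / cnt_x if cnt_x else 0.0
    return sup, conf

stream = ["LS", "MS", "TS", "LS", "MS", "MS", "LS", "TS"]
sup, conf = support_confidence(stream, ["LS"], ["MS"])
print(sup, round(conf, 2))  # 2 0.67
```

A pattern would then be kept only if `sup >= min_sup` and `conf >= min_conf`.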
3.1.2 The searching method
This method accepts a set of multi-stream time-series data as input. First, we define a window w, shown in Fig. 5, and denote its width by ws. The width is initially set to 0 (ws = 0) and is extended as the search progresses. Whenever we extend the window, we check whether the pattern satisfies the two threshold conditions; elements that fail at least one of them are removed and denoted by "-". We continue this search, extending the width of the window, until all elements of the tuples have been replaced by "-".
Figure 5: The process of extending a pattern.
For example, in Fig. 5, suppose min_sup = 30 and min_conf = 0.3. The width of the window is extended by one column at a time from ws = 0. The camerawork element in the second column and the duration element in the third column are removed because they do not satisfy the threshold conditions (min_sup and min_conf). Now w is extended from ws = 3 to ws = 4. At ws = 3, the pattern xi is {(LS, t < 3), (MS, 2 ≤ t ≤ 4), (TS, -)}. None of the patterns considered at ws = 4 satisfies both threshold conditions, so no further search is necessary. Finally, the extracted pattern is pi = {(LS, t < 3), (MS, 2 ≤ t ≤ 4), (TS, -)}.

3.2 Method 2: Mining Periodic Patterns in Video Stream
In movies there exist not only rules for a sequence of shots (e.g., Rule (1) in Section 2) but also rules for the periodicity of shots. For example, in a scene where two speakers A and B are talking, an MS containing both A and B often appears every three shots to show who is talking to whom. If we extract only sequence patterns as video editing rules, it is difficult to connect a large number of shots. For example, in Fig. 5 the pattern {(LS, t < 3), (MS, 2 ≤ t ≤ 4), (TS, -)} is extracted by Method 1, but we do not know what comes after TS. However, when we focus on the periodicity of the given sequence, we can assume that the shot three positions after TS is always MS. In this section, we propose a method for mining periodic patterns in video stream data.

3.2.1 Distance-based pruning of candidate patterns
Since there is a huge number of potential patterns, a pruning method is needed to reduce the number of candidates. The pruning is based on the observation that a valid pattern (d, ..., d') of length l must occur at least the required minimum number of times (min_rep) in a given sequence of symbols D; patterns that do not satisfy the min_rep threshold are removed. In this method it is not sufficient to track only the distance of a symbol to its last occurrence. To perform distance-based pruning while scanning the sequence, we maintain a moving window over the last L symbols already scanned. For each pair of symbols d (e.g., LS, TS) and d' and each period l, we have to count the occurrences where the distance between d and d' is exactly l. Let Cd,d',l be the number of such occurrences. For the next symbol d', we check each symbol in the moving window: if there is a symbol d at the jth position, the period counter for the pair (denoted Cd,d',L-j+1) is incremented by 1. For example, in Fig. 6, the third MS in the sixth position contributes to both CLS,MS,6 and CLS,MS,2. Thus, for each occurrence of a symbol d', we track its distance to all previous occurrences of d within the moving window. Given symbols d, d' and a period l, if Cd,d',l is greater than or equal to min_rep, then d might participate in some valid pattern of period l.

Figure 6: Example of finding periodic patterns.

3.2.2 Complex Pattern Verification
After a set of symbols d, d' and a period l have been extracted, complex pattern verification is used to specialize the candidate patterns. Scanning the sequence to determine the most prominent periodic pattern, the specialization process executes the following algorithm:
1. m = 1.
2. For each occurrence of (d, ..., d') of length l in the given video stream, check the symbol X at the mth position after d. If X appears more often than a given threshold, the pattern is specialized with X at that position. Otherwise, that position is denoted by *; we use * for the "don't care" position in a pattern.
3. m = m + 1.
4. If m = l - 1, the process terminates; otherwise go back to step 2.

For example, in Fig. 6, the candidate pattern (TS,*,*,MS) of length 4 is extracted from the given sequence by distance-based pruning. First, we check the symbols next to TS (the first position of the candidate pattern) in the given sequence (i.e., positions 4, 8 and 9). The pattern is not specialized at this position, because no symbol there exceeds the given threshold. Next, we check the symbols at the 2nd position from TS (i.e., positions 5, 9 and 10). In this case LS appears more often than the threshold, so the pattern is specialized to (TS,*,LS,MS).
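The distance-based counting and pruning just described can be sketched as follows. This is a simplified version that counts symbol pairs at exact distances; the function names and the example sequence are our own assumptions:

```python
# A sketch of the distance-based counting step: for each pair of
# symbols (d, d') and period l, C[d, d', l] counts how often d'
# occurs exactly l positions after d within a window of the last
# L symbols. min_rep pruning then keeps only frequent candidates.
from collections import defaultdict

def count_periods(sequence, L):
    """Return C[(d, d_prime, l)] for all periods l in 1..L."""
    C = defaultdict(int)
    for i, d_prime in enumerate(sequence):
        # Look back at every symbol d within the last L positions.
        for l in range(1, L + 1):
            j = i - l
            if j >= 0:
                C[(sequence[j], d_prime, l)] += 1
    return C

def prune(C, min_rep):
    """Distance-based pruning: keep pairs occurring >= min_rep times."""
    return {key: n for key, n in C.items() if n >= min_rep}

seq = ["TS", "MS", "LS", "MS", "TS", "MS", "LS", "MS"]
C = count_periods(seq, L=4)
print(C[("TS", "MS", 1)])  # TS immediately followed by MS: 2
print(prune(C, min_rep=2))
```

The surviving (d, d', l) triples would then be specialized into full patterns such as (TS,*,LS,MS) by the verification algorithm above.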
4. EXPERIMENTS
We implemented the two methods above to extract patterns. The video used for this purpose is the movie "Star Wars Episode I"; the number of cuts used in the experiment is 1,058. The results of Method 1 are presented in Table 1. In this experiment, the duration of each cut is classified according to its length as follows: LS is divided into "t < 3", "3 ≤ t ≤ 5" and "t > 5"; MS and TS are divided into "t < 2", "2 ≤ t ≤ 4" and "t > 4". These conditions indicate that the durations of LS and MS are about 4 and 3 seconds respectively, which means that this movie is edited at a somewhat faster pace. From the results in Table 1, some characteristic patterns, such as continuous connections of MS and TS, can be extracted. This pattern represents continuous close-ups of the speakers (TS), which is characteristic of conversation scenes. The results obtained with Method 2 are shown in Table 2. In Table 2, a relation between LS and TS can be found by allowing an interval between them, although it is difficult to find a direct relation: "Two shots cannot be connected to each other if their shot sizes are extremely different, such as LS and TS." Some other periodic patterns are also found. For example, the pattern (TS,*,*,MS) is extracted in almost all scenes, and other characteristic patterns can be found in individual scenes.

Table 1: Result of extracting patterns using method 1. (min_sup = 30, min_conf = 0.3)

pattern p                                            | sup(p) | conf(p)
{(LS, t < 3), (LS,-)}                                | 44     | 0.386
{(LS,-), (LS, t < 3)}                                | 41     | 0.363
{(LS, 3 ≤ t ≤ 5), (LS,-)}                            | 51     | 0.447
{(LS,-), (LS, 3 ≤ t ≤ 5)}                            | 53     | 0.469
{(LS, t < 3), (MS,-)}                                | 49     | 0.374
{(LS, 3 ≤ t ≤ 5), (MS,-)}                            | 50     | 0.382
{(LS,-), (MS, 2 ≤ t ≤ 4)}                            | 70     | 0.455
{(MS, 2 ≤ t ≤ 4), (MS, 2 ≤ t ≤ 4)}                   | 75     | 0.63
{(MS, 2 ≤ t ≤ 4), (MS,-)}                            | 119    | 0.569
{(MS,-), (MS, 2 ≤ t ≤ 4)}                            | 122    | 0.678
{(TS, 2 ≤ t ≤ 4), (MS, 2 ≤ t ≤ 4)}                   | 44     | 0.587
{(TS, 2 ≤ t ≤ 4), (MS,-)}                            | 75     | 0.56
{(TS,-), (MS, 2 ≤ t ≤ 4)}                            | 82     | 0.732
{(TS, 2 ≤ t ≤ 4), (TS, 2 ≤ t ≤ 4)}                   | 56     | 0.737
{(TS, 2 ≤ t ≤ 4), (TS,-)}                            | 76     | 0.628
{(TS,-), (TS, 2 ≤ t ≤ 4)}                            | 79     | 0.868
{(MS, 2 ≤ t ≤ 4), (MS, 2 ≤ t ≤ 4), (MS,-)}           | 31     | 0.413
{(MS, 2 ≤ t ≤ 4), (MS,-), (LS,-)}                    | 37     | 0.311
{(MS, 2 ≤ t ≤ 4), (MS,-), (MS, 2 ≤ t ≤ 4)}           | 32     | 0.604
{(MS, 2 ≤ t ≤ 4), (MS,-), (MS,-)}                    | 53     | 0.445
{(MS,-), (MS, 2 ≤ t ≤ 4), (MS, 2 ≤ t ≤ 4)}           | 33     | 0.569
{(MS,-), (MS, 2 ≤ t ≤ 4), (MS,-)}                    | 58     | 0.475
{(TS, 2 ≤ t ≤ 4), (MS,-), (TS,-)}                    | 33     | 0.44
{(TS,-), (MS, 2 ≤ t ≤ 4), (TS,-)}                    | 35     | 0.427
{(TS, 2 ≤ t ≤ 4), (TS,-), (TS,-)}                    | 40     | 0.526
{(LS,-), (MS,-), (MS, 2 ≤ t ≤ 4)}                    | 42     | 0.656
{(MS,-), (MS,-), (MS, t < 2)}                        | 32     | 0.314
{(MS,-), (MS,-), (MS, 2 ≤ t ≤ 4)}                    | 55     | 0.539
{(TS,-), (MS,-), (TS, 2 ≤ t ≤ 4)}                    | 33     | 0.589
{(TS,-), (TS,-), (TS, 2 ≤ t ≤ 4)}                    | 41     | 0.707
Apart from this, we ran the same experiment on a movie by Akira Kurosawa. In contrast to the previous movie (Star Wars Episode I), Kurosawa's movie is edited at a slow pace: since the duration of each cut is long, the total number of TS shots in the movie is small, while the total numbers of LS and MS shots are large. Because the durations of LS, MS and TS must be about 6, 4 and 2.5 seconds respectively, the number of TS, which has the shortest duration, is necessarily small. Most of the extracted patterns therefore contain only LS and MS.
5. CONCLUSION
In this paper, we proposed methods for extracting editing rules from video streams by means of data mining techniques. As shown in Table 1 and Table 2, it is possible to discover useful patterns such as continuous patterns over multiple cuts, which indicates that the methods are effective. Since there are many combinations of "min_sup" and "min_conf" in Method 1, and many possible values of "min_rep" in Method 2, the problem of determining the optimal thresholds remains; obtaining them requires solving a kind of dynamic programming problem. Since patterns are currently extracted from a limited number of attributes (shot size, duration), it is necessary to extract more varied patterns by adding elements such as utterance, color and voice information. It is also desirable to derive patterns from video material that uses editing techniques such as parallel editing. Finally, we can edit new video by using the extracted rules. If we use a video editing support system [2], the entire editing process can be automated, because such a system can edit video automatically based on given rules. By supplying the extracted rules to this system, we can automatically edit video according to each characteristic rule. It is therefore useful to extract editing rules and apply them to such a system.
Table 2: Result of extracting patterns using method 2. (min_rep = (the number of cuts) / 3)

scene No. | number of cuts | pattern
1         | 138            | (LS,MS), (LS,*,MS), (LS,MS,*,MS), (LS,*,*,*,MS); (MS,LS), (MS,*,*,LS); (TS,*,LS), (TS,*,*,*,LS); (TS,MS), (TS,*,*,MS)
2         | 13             | (LS,MS), (LS,*,MS), (LS,MS,*,MS), (LS,MS,*,*,MS)
3         | 37             | (LS,MS), (LS,*,MS), (LS,*,*,MS), (LS,*,*,*,MS); (MS,LS), (MS,*,LS), (MS,*,*,LS); (TS,*,*,MS)
4         | 126            | (LS,MS), (LS,*,MS), (LS,*,*,MS), (LS,*,*,*,MS); (MS,TS), (MS,*,*,TS); (TS,MS), (TS,*,*,MS)
5         | 195            | (LS,MS), (LS,*,MS), (LS,MS,*,MS), (LS,*,*,*,MS); (TS,MS), (TS,*,MS), (TS,*,*,MS), (TS,*,*,*,MS)
6         | 81             | (LS,*,*,MS); (LS,*,*,TS), (LS,*,*,*,TS); (MS,*,*,LS), (MS,*,*,*,LS); (MS,*,*,TS); (TS,*,MS), (TS,*,*,MS), (TS,*,*,*,MS)

6. REFERENCES
[1] J. P. Caraca-Valente and I. Lopez-Chavarrias. Discovering Similar Patterns in Time Series. In Proc. of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 497-505, 2000.
[2] M. Kumano, Y. Ariki, M. Amano, K. Uehara, K. Shunto and K. Tsukada. Video Editing Support System Based on Video Grammar and Content Analysis. In Proc. of the 16th International Conference on Pattern Recognition (Volume 2), pages 1031-1036, 2002.
[3] T. Oates and P. R. Cohen. Searching for Structure in Multiple Streams of Data. In Proc. of the Thirteenth International Conference on Machine Learning, pages 346-354, 1996.
[4] J. Yang, W. Wang and P. S. Yu. Mining Asynchronous Periodic Patterns in Time Series Data. In Proc. of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 275-279, 2001.