
Visual Tracking and Localisation of Billboards in Streamed Soccer Matches

Frank Aldershoff and Theo Gevers
University of Amsterdam, Kruislaan 403, Amsterdam, The Netherlands

ABSTRACT

In this paper we present a system for the localisation and tracking of billboards in streamed soccer matches. The application area for this research is the delivery of customised content to end users. When international soccer matches are broadcast, the audience is very diverse, and advertisers would like to be able to adapt the billboards to the different audiences. This can be achieved by replacing the billboards in the video stream. In order to build a more robust system, photometric invariant colour features are used; these features are less susceptible to changes in illumination. Sensor noise is dealt with through variable kernel density estimation.

1. INTRODUCTION

We are moving toward a post-Gutenberg era, a time of pervasive access to rich (multimedia) data. The Internet already offers access to a near-limitless collection of multimedia data, and the deployment of 2.5G and 3G mobile phone networks will provide ubiquitous access to this collection 1. Because of this development the Internet will become increasingly important as a delivery method for video programming. The digital format and the pervasiveness of the Internet offer many advantages over the current analogue delivery formats (broadcast, cable and video cassette). One of the main advantages is that the Internet is suitable for the delivery of customised content. For example, when international soccer matches are broadcast, the audience is so diverse (different countries, different continents) that advertisers would like to adapt the billboards to the different audiences. When the matches are streamed, the billboards can be tailored to each individual viewer. For the delivery of customised soccer matches we envision three scenarios:

1a) Streaming customised video to a PC. The video is customised at the server and then streamed to the end user.
1b) Streaming un-customised video to a PC, customising at the PC. The video is streamed to the end user and customised by the client PC.
2) Streaming customised video to a mobile phone. The video is customised and down-scaled at the server and then streamed to the mobile phone.
3a) Broadcasting customised video directly to a TV.
3b) Broadcasting un-customised video to a TV, customising by a set-top box.

For scenarios 2, 1b and 3b, MPEG-4 (in combination with MPEG-7 and possibly MPEG-21) is the most suitable format, since it accommodates very high compression ratios and can merge different sources of multimedia content. For 1a and 3a, MPEG-2 is also suitable.
In this paper we describe how a digital video stream of a soccer match can be customised by replacing the advertisements on the billboards next to the field with customised advertisements. This is of course beneficial to advertisers, who can then choose not to advertise in regions where their products are not available, or target advertisements at specific regions. In the remainder of this paper we first discuss related research. In section 3 the photometric invariant object representation is presented. Section 4 details how a billboard is first detected and then tracked in a video stream. Finally, the results are presented in section 5.
Further author information: (Send correspondence to Frank Aldershoff)
Frank Aldershoff: E-mail: [email protected], Telephone: +31 20 525 7540
Theo Gevers: E-mail: [email protected], Telephone: +31 20 525 7540

2. RELATED WORK

Several companies offer products for the insertion of so-called virtual billboards 2,3. Virtual billboards are computer-generated graphics that are added to the video signal being broadcast. These billboards are put in places where no billboard was present in the original signal. Other applications of this technique are the superimposing of logos on parts of the playing field and even the insertion of complete artificial structures into the video. Our research, however, is not about adding billboards where there were none, but about replacing billboards already present in the video. Recently a system has been proposed in which logos are recognised on billboards in still frames from videos of Formula One races 4. The recognition is done with a method derived from string matching on essentially binary representations of the video frames, so no colour information is used. The purpose of this system is to gather statistics on the presence of the billboards in the video, which can be used in advertisement and sponsoring. Both the approach and the application area differ greatly from our project. Medioni et al. present a system for real-time billboard substitution in a video stream 5. The application area of this research is close to ours; only the mobile component is missing. Billboards are recognised with the Point Matcher, a module that extracts interesting points (e.g. corners) and tries to find corresponding points in the target model. This can quickly become a computationally demanding task. Colour is used to differentiate between points that could be part of the target billboard and points that simply have the wrong colour. The system description is quite detailed and a patent has been issued. In our system billboard detection is done with robust, colour-based object recognition. The billboards are represented by the distribution of their colours.
Robustness is provided by using an invariant colour representation, the photometric invariant colour space (rgb). The use of photometric invariants results in robust object recognition 6. These colour representations are invariant under changes of surface orientation, illumination direction and intensity. To deal with the problem of (sensor) noise in the video, the measurement error of a pixel value is propagated through the colour space transformation to the colour distributions, using variable kernel density estimation 7. This method is preferable to traditional methods of suppressing the error by static thresholding. The calculation of colour invariants could also be implemented in hardware, e.g. in video cameras. Most of the research on object tracking in sports videos focuses on player and ball tracking 8,9. Kernel-based object tracking in combination with the Bhattacharyya coefficient as a distance metric is used to track objects in sports videos (and other types of video data). Comaniciu et al. use the Mean Shift algorithm to locate a target model 10, and in another approach particle filtering is employed 11. Both these approaches require a smooth feature space, which is achieved by weighting the features with a kernel. In kernel-based object tracking, the target model is represented by a probability density function (pdf) of colour features. This pdf is usually discretized in a histogram. We are not interested in the tracking of players or the ball; we want to track the billboards. When a billboard has been localised it can be tracked in the following frames. Since its current position is known, the tracking algorithm can predict the area in which to look for the billboard in the next frame. Within this restricted area the billboard can be localised by a brute-force search. Kernel-based object tracking can also be used for the tracking of billboards in sports videos.
Both the Mean Shift and the Particle Filter approach are suitable to efficiently search for an object.

3. COLOUR BASED OBJECT REPRESENTATION

3.1. Colour invariant features
For this application the billboards are represented in a colour feature space. In order to obtain a representation that is robust under varying imaging conditions, an invariant colour feature space is used: rgb (normalised RGB), defined as

    r = R / (R + G + B),    (1)
    g = G / (R + G + B),    (2)
    b = B / (R + G + B).    (3)

This representation of colour is invariant under changes in illumination intensity and object geometry 7.
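As a minimal illustration (not part of the original system), the transform of equations (1)-(3) can be sketched in Python with NumPy; the function name and the guard at the black point are our own choices:

```python
import numpy as np

def rgb_normalise(image):
    """Convert an RGB image (H x W x 3) to normalised rgb chromaticity.

    Each channel is divided by the sum R + G + B (eqs. 1-3), which
    cancels multiplicative changes in illumination intensity.
    """
    image = np.asarray(image, dtype=np.float64)
    total = image.sum(axis=-1, keepdims=True)
    # Guard against the black point R = G = B = 0, where rgb is undefined.
    total[total == 0] = 1.0
    return image / total

# A pixel and the same pixel under half the illumination intensity
# map to the same chromaticity.
bright = np.array([[[120.0, 80.0, 40.0]]])
dim = 0.5 * bright
print(rgb_normalise(bright)[0, 0])  # ≈ [0.5, 0.333, 0.167]
assert np.allclose(rgb_normalise(bright), rgb_normalise(dim))
```

The invariance check at the end mirrors the claim above: scaling all three channels by a common factor leaves the chromaticity unchanged.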

As with all sensor data, a video stream also contains noise. Noise can be introduced by the sensor (the video camera), the data transport (including storage methods) and the encoding of the data (MPEG compression). We assume that this sensor noise is normally distributed, since additive Gaussian noise is the limiting behaviour of both photon counting noise and film grain noise.

3.2. Sensor noise propagation
When the data is converted to another format, in our case from RGB to rgb, the conversion is usually a non-linear operation. This means that small variations due to noise can lead to large changes in the new format. For an indirect measurement, such as r or g, the true value of a measurand u is related to its N arguments, denoted by u_j, as follows:

    u = q(u_1, u_2, ..., u_N).    (4)

The measurand u can be estimated by a measurement û by substituting û_j for u_j. Then û_1, û_2, ..., û_N are measurements and we get

    û = q(û_1, û_2, ..., û_N),    (5)

with corresponding standard deviations σ_û1, σ_û2, ..., σ_ûN. With a Taylor series an approximation of a function can be written. The Taylor series with respect to noise, with N = 2 to simplify the calculation, is given by

    q(û_1, û_2) = q(u_1, u_2) + (E_1 ∂/∂u_1 + E_2 ∂/∂u_2) q(u_1, u_2) + ... + (1/m!) (E_1 ∂/∂u_1 + E_2 ∂/∂u_2)^m q(u_1, u_2) + R_{m+1},    (6)

where û_1 = u_1 + E_1 and û_2 = u_2 + E_2 (E_1 and E_2 are the errors of û_1 and û_2), and R_{m+1} is the remainder term. Further, ∂q/∂û_j is the partial derivative of q with respect to û_j. The general form of the error of an indirect measurement is

    E = û − u = q(û_1, û_2) − q(u_1, u_2);    (7)

in terms of the Taylor series we obtain

    E = (E_1 ∂/∂u_1 + E_2 ∂/∂u_2) q(u_1, u_2) + ... + (1/m!) (E_1 ∂/∂u_1 + E_2 ∂/∂u_2)^m q(u_1, u_2) + R_{m+1}.    (8)

Using only the first, linear term, the error can be computed as

    E = (∂q/∂u_1) E_1 + (∂q/∂u_2) E_2.    (9)

Then, for N arguments, it follows that if the uncertainties in û_1, ..., û_N are independent, random, and relatively small, the predicted uncertainty in q is given by

    σ_q = sqrt( Σ_{j=1}^{N} ( (∂q/∂û_j) σ_ûj )² ).    (10)

Although (10) is deduced for random errors, it is used as a universal formula for various kinds of errors. Substituting (1) and (2) in (10) gives the uncertainty for the normalised coordinates:

    σ_r = sqrt( (R²(σ_B² + σ_G²) + (G + B)² σ_R²) / (R + G + B)⁴ ),    (11)
    σ_g = sqrt( (G²(σ_B² + σ_R²) + (R + B)² σ_G²) / (R + G + B)⁴ ).

Assuming normally distributed random quantities, the standard way to calculate the standard deviations σ_R, σ_G, and σ_B is to compute mean and variance estimates from homogeneously coloured surface patches in an image taken under controlled imaging conditions. From an analytical study of (11), it can be derived that normalised colour becomes unstable around the black point R = G = B = 0.
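To make the black-point instability concrete, equation (11) can be evaluated directly. This is a sketch with made-up pixel values, assuming equal channel noise σ_R = σ_G = σ_B = 3:

```python
import numpy as np

def sigma_rg(R, G, B, sR, sG, sB):
    """First-order uncertainty of r and g from eq. (11)."""
    denom = (R + G + B) ** 4
    sigma_r = np.sqrt((R**2 * (sB**2 + sG**2) + (G + B)**2 * sR**2) / denom)
    sigma_g = np.sqrt((G**2 * (sB**2 + sR**2) + (R + B)**2 * sG**2) / denom)
    return sigma_r, sigma_g

# Same sensor noise, but a much darker pixel with the same chromaticity:
# the chromaticity uncertainty grows roughly 25-fold near the black point.
print(sigma_rg(200, 150, 100, 3, 3, 3))  # small uncertainties
print(sigma_rg(8, 6, 4, 3, 3, 3))        # much larger uncertainties
```

This is exactly the effect that motivates the variable kernel of section 3.3: dark pixels deserve wider kernels.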

3.3. Variable kernel
The uncertainty of rgb colours can be taken into account when constructing a histogram. Instead of always assigning a measurement to only one bin of the histogram, the energy of the measurement can be distributed over several bins. The amount of uncertainty determines how many surrounding bins are filled. The kernel density estimator is therefore insensitive to the placement of the bin edges:

    f̂(x) = (1/(nh)) Σ_{i=1}^{n} K((x − X_i)/h).    (12)

Here, the kernel K is a function satisfying ∫ K(x) dx = 1. Further, n is the number of samples X_i in the image, h is the bin width and x the range of the data. In the variable kernel density estimator, the single h is replaced by n values α(X_i), i = 1, ..., n. This estimator is of the form

    f̂(x) = (1/n) Σ_{i=1}^{n} (1/α(X_i)) K((x − X_i)/α(X_i)).    (13)

The kernel centred on X_i has associated with it its own scale parameter α(X_i), thus allowing different degrees of smoothing. To use variable kernel density estimators for colour images, we let the scale parameter be a function of the RGB values and the colour space transform. We are then left with the problem of determining the scale and shape of the kernel. Assuming normally distributed noise, the kernel is well approximated by the Gauss distribution 12:

    K(x) = (1/√(2π)) exp(−x²/2).    (14)

Then the variable kernel method for the bivariate normalised rg kernel is given by

    f̂(r, g) = (1/n) Σ_{i=1}^{n} (1/σ_ri) K((r − r_i)/σ_ri) (1/σ_gi) K((g − g_i)/σ_gi),    (15)

where σ_ri and σ_gi are derived according to (11). In conclusion, to reduce the effect of sensor noise during density estimation, we use variable kernels where the normal distribution defines the shape of the kernel, and kernel sizes are steered by the amount of uncertainty of the colour invariant values.
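A rough sketch of the estimator in (15), with illustrative names of our own; a production implementation would vectorise over all pixels rather than loop:

```python
import numpy as np

def variable_kernel_hist(r, g, sigma_r, sigma_g, bins=64):
    """Variable-kernel estimate of the rg pdf, following eq. (15):
    every sample spreads its mass over the 2-D histogram with a
    separable Gaussian whose widths are that pixel's uncertainties."""
    centres = (np.arange(bins) + 0.5) / bins  # bin centres in [0, 1]
    hist = np.zeros((bins, bins))
    for ri, gi, sr, sg in zip(r, g, sigma_r, sigma_g):
        kr = np.exp(-0.5 * ((centres - ri) / sr) ** 2) / sr
        kg = np.exp(-0.5 * ((centres - gi) / sg) ** 2) / sg
        hist += np.outer(kr, kg)  # separable bivariate kernel
    return hist / hist.sum()  # normalise to a pdf

# Two samples at the same chromaticity: the uncertain one (e.g. a dark
# pixel) smears its mass over many more bins than the certain one.
hist = variable_kernel_hist(
    r=[0.40, 0.40], g=[0.30, 0.30],
    sigma_r=[0.01, 0.05], sigma_g=[0.01, 0.05])
print(hist.shape)  # (64, 64)
```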


Figure 1. Histogram representation of the pdf of the billboard in the r × g chromaticity space.

3.4. Billboard representation
The billboards are represented by a probability density function (pdf) in the r × g space, i.e. by the coordinates of their pixels in the chromaticity space. This pdf is estimated from a set of manually annotated frames from soccer videos. Figure 1 shows the representation (pdf) of a billboard in the r × g chromaticity space. The estimated pdf is represented by a histogram.

Figure 2. Classification of areas with grass colour, based on ratio histogram backprojection.

4. OBJECT DETECTION AND TRACKING

Before object tracking can be performed, the object has to be detected to obtain an initial position. Another use of object detection is the labelling of large areas in a frame during a preprocessing step; these areas can then be excluded from the billboard detection and tracking steps. Examples are the labelling of large areas of grass and of the audience. After removal of these two areas, only a small strip is left in which to search for billboards. The labelling of the objects is done with ratio histogram back-projection.

4.1. Histogram back-projection
Histogram back-projection is a way to relate histogram data, which contains no spatial information, back to the spatial domain 13. Histograms are used to assign to every pixel in the image the probability that it is part of the object being searched for. This is done by first computing a histogram over all the pixels that are part of the object in question; this can be averaged over several different frames to achieve a more robust model. A histogram for all the pixels that are not part of the target object is created similarly. The ratio histogram is then calculated as

    h_ratio = h_object / (h_object + h_not),    (16)

where h_object and h_not are the histograms of the pixels that are and are not part of the object, respectively. This ratio histogram gives for each colour the probability that a pixel of that colour is part of the target object. By assigning this probability to the pixels, we obtain a gray-valued representation, and with some thresholding a binary classification into object and non-object pixels is possible. Figure 2 shows how grass is classified. The final detection of a billboard in the unlabelled area is done using the same technique as for object tracking: kernel-based object tracking. By initialising the object tracker with a carefully selected set of points in the search area (the unlabelled area), the tracker can start.
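The ratio histogram and back-projection of (16) amount to a table lookup per pixel. A toy sketch with a 4-bin colour quantisation (the bin counts and pixel indices are invented example data):

```python
import numpy as np

# Toy example: colours quantised to 4 bins.
h_object = np.array([10.0, 0.0, 5.0, 0.0])   # bin counts inside the object
h_not    = np.array([ 0.0, 20.0, 5.0, 0.0])  # bin counts elsewhere

total = h_object + h_not
# Ratio histogram (eq. 16); colours never observed get probability 0.
h_ratio = np.divide(h_object, total,
                    out=np.zeros_like(total), where=total > 0)

# Back-projection: replace each pixel's colour-bin index by the
# object probability of that colour.
pixel_bins = np.array([[0, 1],
                       [2, 3]])
prob_image = h_ratio[pixel_bins]
print(prob_image)          # pixel 0 certain object, pixel 2 ambiguous
mask = prob_image > 0.4    # thresholding gives a binary object mask
```

In the paper the same idea runs over 2-D (r, g) bins instead of this 1-D toy quantisation, but the lookup structure is identical.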

4.2. Target model A target, in our case a billboard, is represented by a pdf in the feature space (r × g). A histogram is used to approximate the pdf and to construct the target model. A model is represented by a 64 × 64 histogram of a region. During the construction of the histogram, the contribution of a pixel is weighted by a weighting function

Figure 3. The masking region (see equation 17) for different scales. Top-left is the lowest scale (least amount of detail) and at the bottom-right we have the highest amount of detail.

k. By assigning smaller weights to pixels further from the centre of the region, the histogram is less sensitive to boundary pixels that may belong to the background or become occluded. By creating a mask of these weights, regions of any shape with any weight distribution can be constructed. We use an ellipse and the weighting function

    k(r) = 1 − r²   if r ≤ 1,
    k(r) = 0        otherwise,    (17)

where r is the normalised radius within the ellipse: for an ellipse with half axes H_x and H_y, r = sqrt(x²/H_x² + y²/H_y²). Figure 3 shows a series of masks for different scales. Another possibility is to use an Epanechnikov kernel 10. When these masks are computed off-line, efficient weighted histogram construction is possible. Many other mask shapes, with different weights, can also be constructed. The density estimate of an object is discretized in a histogram of 64 × 64 bins for the r × g feature space. The function h(x_i) assigns the correct bin to the colour at position x_i. Then, for a target centred at y, the construction of the histogram p_y = {p_y^(u)}, u = 1, ..., m, is 14,11:

    p_y^(u) = f Σ_{i=1}^{I} k(‖y − x_i‖ / a) δ[h(x_i) − u],    (18)

where I is the number of pixels in the region, δ is the Kronecker delta function, a = sqrt(H_x² + H_y²) is used to adapt to the size of the region, and f is a normalisation factor. This factor is defined such that Σ_{u=1}^{m} p_y^(u) = 1:

    f = 1 / Σ_{i=1}^{I} k(‖y − x_i‖ / a).    (19)
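The mask of (17) and the weighted histogram of (18)-(19) can be sketched as follows. The function names and the pre-quantised bin-index arrays r_idx, g_idx are our own assumptions, not the paper's:

```python
import numpy as np

def elliptical_mask(Hx, Hy):
    """Weight mask from eq. (17): k(r) = 1 - r^2 inside the ellipse
    with half axes Hx, Hy, and 0 outside."""
    y, x = np.mgrid[-Hy:Hy + 1, -Hx:Hx + 1]
    r2 = (x / Hx) ** 2 + (y / Hy) ** 2
    return np.where(r2 <= 1.0, 1.0 - r2, 0.0)

def target_histogram(r_idx, g_idx, weights, bins=64):
    """Weighted 64 x 64 target histogram (eqs. 18-19): every pixel in
    the region votes for its (r, g) bin with the mask weight at its
    position; dividing by the total weight makes the bins sum to one."""
    hist = np.zeros((bins, bins))
    np.add.at(hist, (r_idx.ravel(), g_idx.ravel()), weights.ravel())
    return hist / weights.sum()

mask = elliptical_mask(8, 5)
print(mask.shape)   # (11, 17)
print(mask[5, 8])   # 1.0 at the centre, falling off towards the edge
```

Computing the mask once off-line, as the text suggests, means each tracking step is just the indexed accumulation in target_histogram.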

4.3. Object tracking
Tracking differs from object detection and localisation because it is assumed that the target object is to be found in a small, known area. Within this area a search is performed for the region that best matches the target object. This means that the histogram of the region should be similar to the target histogram, based on some similarity measure. A popular measure between two distributions is the Bhattacharyya coefficient

    ρ[p, q] = ∫ sqrt(p(u) q(u)) du.    (20)

In the discrete case of our histograms this becomes

    ρ[p, q] = Σ_{u=1}^{m} sqrt(p^(u) q^(u)).    (21)

A larger ρ indicates more similar histograms, so we use the distance measure

    d = sqrt(1 − ρ[p, q]),    (22)

which is called the Bhattacharyya distance.
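Equations (21)-(22) reduce to a few lines; in this sketch the clamp to zero is our own addition, guarding against floating-point round-off when the histograms are nearly identical:

```python
import numpy as np

def bhattacharyya_distance(p, q):
    """Distance d = sqrt(1 - rho) of eq. (22), with rho the discrete
    Bhattacharyya coefficient of eq. (21)."""
    rho = np.sum(np.sqrt(p * q))
    return np.sqrt(max(0.0, 1.0 - rho))  # clamp guards fp round-off

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.3, 0.5])
print(bhattacharyya_distance(p, p))  # ~0 for identical histograms
print(bhattacharyya_distance(p, q))  # larger for different ones
```

Minimising this distance over candidate positions y is exactly the search described in section 4.4.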

4.4. Object localisation
Object localisation is now a search for the point y where the distance between the histogram p_y of the candidate object and the histogram q of the target object is minimal. Several methods have been proposed to perform this search efficiently. There is of course the brute-force approach in which all points in the area are evaluated, but with kernel-based tracking more efficient solutions are possible. Because the region from which the histogram is constructed is masked with a smooth kernel, the landscape of histogram distances is also smooth. The mean-shift and the particle filter approaches exploit this for more efficient tracking.

4.5. Motion estimation
In order to track the billboards more efficiently and reliably, information about the displacement (motion) of the billboard can be used to restrict the area in which to look for the billboard in the next frame. It can be assumed that the displacement of an object between consecutive frames in a continuous shot is limited. By using information about the direction and velocity of the objects, relative to the frame, the object position in the next frame can be predicted. This information can be used to set the shape and size of the search area for the next frame. Limiting the area searched for the target object leads to a more efficient tracking algorithm (fewer pixels are visited) but also a more reliable one: the chances of accidentally matching another, similar object somewhere else in the frame are smaller when the search area is restricted. This increase in reliability is even more important when tracking objects like players, of which there are usually many similar occurrences in a single frame. The motion of objects in a frame can be estimated in several ways. Using a motion model and the known previous positions of the object, the direction and velocity can be estimated. The motion model can be simple if not much is known about the motion of the objects. Equation (23) is a simple model, in which the position of an object is estimated from its two previous positions:

    x̂_t = A x_{t−1} + B x_{t−2}.    (23)
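For example, choosing A = 2 and B = −1 in (23) gives a constant-velocity predictor; the coefficient choice and the function name below are illustrative, not from the paper:

```python
import numpy as np

# Constant-velocity instance of eq. (23): A = 2, B = -1 gives
# x_t = x_{t-1} + (x_{t-1} - x_{t-2}), i.e. the last position plus
# the last observed displacement.
A, B = 2.0, -1.0

def predict_position(x_prev, x_prev2):
    """Predict the next (x, y) position from the two previous ones."""
    return A * np.asarray(x_prev, float) + B * np.asarray(x_prev2, float)

# A billboard centre that moved 20 px right and 5 px down last frame
# is predicted to do so again.
print(predict_position([120, 45], [100, 40]))  # [140.  50.]
```

The predicted point then becomes the centre of the restricted search area for the tracker.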

4.5.1. MPEG-2 motion vectors
Another way to estimate motion vectors is to use the motion information encoded in the MPEG-2 stream. The MPEG-2 standard 15 uses so-called motion vectors to encode the displacement of 16 × 16 macroblocks. But the name motion vector can be deceptive, since these vectors are mainly a technique to achieve more efficient compression. An MPEG encoder is in no way obliged to encode the real motion of objects; any information that might improve compression is allowed. But, depending on the encoder, these motion vectors can provide interesting information, because the encoder had access to frames that lie ahead in time compared to the current frame. This means that the encoder had access to more information than an auto-regressive motion model using only past positions.

Figure 4. Some examples of the tracking results. An ellipse is drawn over the billboard, the size of the ellipse indicates the size of the tracked object.

Figure 5. In this frame the original billboard has been replaced with a new, fake, one. Players in front of the billboard are not covered by the replacement billboard.

5. RESULTS

Using the techniques outlined in the previous sections, a short video clip has been processed. A billboard was tracked for about ten seconds, during which it was replaced by another. A kernel-based tracker was used for the tracking, and the billboard was replaced using histogram back-projection, template matching and some techniques to improve the visual quality of the result.

5.1. Object tracking
For this experiment a billboard was tracked in a short segment of a soccer match. The board is tracked for about ten seconds, in which it moves from right to left and back and also changes in size due to the panning and zooming of the camera. The tracker never loses track of the target billboard and the tracking region is adapted in scale in reaction to the zooming. Figure 4 shows the tracking results for several frames. The tracker was given the initial position and scale of the target billboard. During the period that the billboard was tracked, players occasionally occluded the billboard, but this didn't affect the tracking process.

5.2. Billboard replacement When the tracker has located the billboard, the position and scale are used to initialise a template matcher. This will give the orientation of the billboard and the exact size. Histogram backprojection (see section 4.1) is used

to generate a mask of the billboard. In this mask only pixels that are part of the billboard are on; any objects in front of the billboard won't be in the mask. The edges of the mask are blurred, and the mask is then used as an alpha channel for the insertion of the new billboard. White pixels are completely replaced by the new billboard, black pixels are left untouched, and gray-valued pixels are a mix of the original and new billboard. This ensures a smooth and natural looking image. To further enhance the visual appearance of the inserted billboard, fake shading can be added. This could be done by creating a gray-level image of the target and comparing it with the gray-valued image of the tracked billboard. Any differences are taken to be lighting effects, and these effects are added to the replacement billboard. Figure 5 shows a replaced billboard; no fake shading has been applied, and it can be seen that the resulting image looks quite artificial.

References
[1] UMTS Forum, "The future mobile market: global trends and developments with a focus on western Europe," Tech. Rep. 8, March 1999.
[2] PVI Virtual Media Services, http://www.pvi-inc.com/.
[3] Orad website, http://www.orad.co.il/.
[4] R. J. den Hollander and A. Hanjalic, "Logo recognition in video stills by string matching," in International Conference on Image Processing, IEEE, (Barcelona, Spain), September 2003.
[5] G. Medioni, G. Guy, H. Rom, and A. Francois, "Real-time billboard substitution in a video stream," in Proceedings of the 10th Tyrrhenian International Workshop on Digital Communications, (Ischia, Italy), 1998.
[6] T. Gevers, "Robust histogram construction from color invariants," in ICCV, pp. 126-132, IEEE, 2001.
[7] T. Gevers and H. Stokman, "Robust histogram construction from color invariants for object recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence 24, pp. 113-118, January 2004.
[8] A. Ekin and A. M. Tekalp, "Robust dominant color region detection and color-based applications for sports video," in International Conference on Image Processing, (Barcelona, Spain), September 2003.
[9] G. Jaffré and A. Crouzil, "Non-rigid object localization from color model using mean shift," in International Conference on Image Processing, IEEE, (Barcelona, Spain), September 2003.
[10] D. Comaniciu, V. Ramesh, and P. Meer, "Kernel-based object tracking," IEEE Transactions on Pattern Analysis and Machine Intelligence 25, May 2003.
[11] K. Nummiaro, E. Koller-Meier, and L. Van Gool, "An adaptive color-based particle filter," Image and Vision Computing 21, pp. 99-110, January 2003.
[12] J. R. Taylor, An Introduction to Error Analysis, University Science Books, Mill Valley, CA, 1982.
[13] M. J. Swain and D. H. Ballard, "Color indexing," Int. J. of Comp. Vision 7(1), pp. 11-32, 1991.
[14] D. Comaniciu and P. Meer, "Mean shift: A robust approach toward feature space analysis," IEEE Transactions on Pattern Analysis and Machine Intelligence 24, May 2002.
[15] MPEG-2 specifications, http://www.chiariglione.org/mpeg/standards/mpeg-2/mpeg-2.htm.