Player Identification in Soccer Videos Marco Bertini, Alberto Del Bimbo, and Walter Nunziati Dipartimento di Sistemi e Informatica, University of Florence Via S. Marta, 3 - 50139 Florence, Italy
[email protected],
[email protected],
[email protected]
ABSTRACT
A method for the identification of players in soccer videos is presented. The proposed approach exploits the inherent multiple-media structure of soccer videos to perform people identification without relying on face recognition. Instead, faces are detected in close-up shots, and the filmed player is then identified by recognizing the number depicted on the front of his jersey, or by detecting and interpreting superimposed text captions. Players not identified by this process are then assigned to one of the labeled faces by means of a face similarity measure. We present results obtained from soccer videos of the last European Championship for national teams, held in Portugal in June 2004.
Categories and Subject Descriptors I.2.10 [Vision and Scene Understanding]: Video analysis; H.2.4 [Systems]: Multimedia databases
General Terms Algorithms, Experimentation
Keywords Sport video analysis, automatic annotation, person recognition.
1. INTRODUCTION
Video annotation refers to the process of analyzing a video stream and producing metadata that describe its semantic content, in order to allow efficient browsing, searching and retrieval of video documents. In the context of soccer videos, there is a strong need for tools that enable identification of the most relevant video segments, for instance highlights of the game and close-up shots of the players involved in these highlights. A typical scenario is the production of reports on the best performance of a particular player (say, for instance, the famous English player David Beckham). For this task, the producer may want to select video segments of the most important actions performed by the player, as well as close-up shots that show his emotions and expressions during important parts of the game.
Figure 1: (a) Close-up with superimposed text caption - (b) Team graphic screen and close-up with jersey number.

To this end, the digital library must be searched with queries like "give me all the close-up shots of David Beckham in the last World Cup", to select the most expressive video segments without explicitly browsing all the video material.
Annotation of sports videos has received wide attention in the last few years. Many works have been devoted to parsing the video stream at the semantic level, in order to detect various events typical of the sport under consideration (like goals or shots on goal in soccer videos). Usually, several types of knowledge of the sport domain and of the typical broadcasting production rules are exploited to achieve the detection of a number of domain-specific events. Recent works are [1], [6] for the case of soccer videos, and [15], [22] for basketball videos. More generic approaches addressing the problem of segmenting team sports videos into play/break shots have been presented in [13] and [20]. Recent comprehensive reviews of the most significant works in this area can be found in [21] and [19].
People identification from video sequences is a widely studied topic as well. Two important recent works are [7] and [18]. Both of them studied the problem of detecting faces in broadcast video sequences, and of finding instances of the same person among all the detected faces. For the vast literature on face detection and recognition, the reader is also referred to the surveys presented in [5], [10] and [11]. Person recognition by means of association of interpreted textual content to faces has been investigated in the context of news video in [16], and more recently in [2]. These works use several information sources typically available for this type of video, such as transcripts and video captions, as well as faces automatically detected using color analysis to recognize skin tones. Another recent work on text detection and recognition in video sequences is presented in [4]. It employs the same method to detect both superimposed text and text filmed by the camera. However, it requires that text be fixed onto rigid surfaces (such as the text on a billboard), while our goal is to detect text on non-rigid surfaces, like the jersey of a player.
Our goal is to provide automatic annotation of the player's identity in close-up shots. The method performs its analysis in two steps: in the first, players are identified by means of detection and recognition
of the number depicted on the player's jersey, or by means of detection and recognition of the player's name appearing in a superimposed text caption. In the second step, players that are detected, but not identified by the above process, are assigned to one of the labeled faces using a face similarity measure. We make contributions in two different areas. First, we exploit the presence of a very peculiar medium of soccer videos, the "stream" of jersey numbers, to perform recognition. Second, we provide a face representation scheme based on (sequences of) local image patches that is robust to expression and, to some extent, pose variation, and that is also robust to background variation. The algorithms involved are described in Sections 2, 3, 4, and 5, while experimental results are presented in Sect. 6. Results have been obtained from videos of soccer games of the last European Championship for national teams. About 80% accuracy has been achieved for the face and number detectors, while the superimposed text caption detector achieved almost 95%. As can be expected, less accurate, but still promising, results have been obtained on the face matching task.
2. BASIC IDEA
Soccer videos are among the most difficult videos to analyze for face recognition tasks, for a number of reasons. First, players are not acting in front of a camera, like actors in movies: they usually run, making the head region jitter up and down over the frame sequence, and directors cannot avoid confused situations where a lot of people are in the field of view of the camera, possibly occluding each other. Second, cameras are not still, as in many surveillance applications, where the camera setup can also be carefully chosen so that the quality of the images is maximized with respect to the face detection task. Third, and by far the main problem, players exhibit large variations in pose and expression during the game, making them sometimes hard to recognize even for a human observer.
On the positive side, close-up shots in soccer videos have a strong visual appearance. This is mainly due to the fact that players wear colored jerseys, usually decorated with some stripes or logos, and depicting the player's number. Moreover, during the game superimposed text captions are shown to point out some interesting details about the currently filmed player, such as the number of goals he scored, or the fact that he has been sanctioned with a "yellow card". These considerations lead to the fact that, for the purpose of player identification, face recognition is not the only possible approach. Figure 1 shows situations where the player's identity can be derived from a superimposed caption (left) and from the combination of the jersey number and the team graphic screen (right). The jersey number is unique during an international tournament, and can be used to recognize a player's identity either by analyzing a graphic screen like the one shown in Fig. 1 (right), or by checking an existing database, such as the one available on the Euro 2004 website.
To provide the annotation described in Sect. 1, the video stream is analyzed (offline) in two steps. In the first one, three detectors are run on each shot, devoted to detecting (frontal) faces, jersey numbers, and superimposed text captions, respectively. After this step, a number of close-up shots are annotated with the identity of the player. In the second step, every non-identified face is analyzed to understand whether it is similar to any of the faces annotated in the first step. The similarity measure is based on the content of face patches centered on the eyes and on the eyes' midpoint. To increase robustness, and in contrast with several previous approaches, matching is carried out between face tracks, rather than between single instances of faces.

3. DETECTING JERSEY NUMBERS AND FACES IN CLOSE-UP SHOTS
Detection of faces and of numbers depicted on players' jerseys is achieved through an implementation of the algorithm proposed by Viola and Jones [14]. We briefly outline the algorithm here, referring the reader to the original paper by Viola and Jones, and their subsequent work. Basically, the algorithm relies on a number of simple classifiers which are devoted to signaling the presence of a particular feature of the object to be detected. In the case of faces, this could be the alignment of the two eyes, or the region between the nose and the mouth. A feature is a weighted combination of pixel sums of two rectangles, and can be computed for each pixel in constant time using auxiliary images like the Summed Area Table (SAT), which is defined as follows (I is the original image):

SAT(x, y) = \sum_{i \le y,\, j \le x} I(i, j)

The current algorithm uses the templates shown in Figure 2 to compute features.
Figure 2: Templates used to compute features by the face and number detection algorithm.

These prototypes are scaled in the vertical and horizontal directions in order to detect objects within a certain range of sizes, and 45-degree rotated versions are used as well. In our case, we tune the template size to recognize faces whose bounding box side is about 100 pixels, and numbers whose bounding box side is about 30 pixels. Computation of a single feature f at a given position (x, y) requires subtracting the sum of the intensity values of all the pixels lying under the white rectangle of a template (p_w) from the sum of the intensity values of all the pixels lying under the black rectangle (p_b) of the same template:
f(x, y) = \sum_i p_b(i) - \sum_i p_w(i)
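To make the feature computation concrete, the following sketch (Python/NumPy; not part of the original system, with purely illustrative rectangle coordinates) builds the summed area table with two cumulative sums and evaluates a two-rectangle feature using a handful of table lookups per rectangle:

```python
import numpy as np

def summed_area_table(img):
    """SAT(x, y): cumulative sum of intensities over all pixels above and to the left of (x, y)."""
    return img.cumsum(axis=0).cumsum(axis=1)

def rect_sum(sat, top, left, bottom, right):
    """Sum of intensities inside the inclusive rectangle [top:bottom, left:right],
    obtained in constant time from at most four SAT lookups."""
    total = sat[bottom, right]
    if top > 0:
        total -= sat[top - 1, right]
    if left > 0:
        total -= sat[bottom, left - 1]
    if top > 0 and left > 0:
        total += sat[top - 1, left - 1]
    return total

def two_rect_feature(sat, black_rect, white_rect):
    """f = (sum under the black rectangle) - (sum under the white rectangle)."""
    return rect_sum(sat, *black_rect) - rect_sum(sat, *white_rect)

# Illustrative use on a synthetic 30x30 detection window (e.g., a number candidate).
window = np.random.randint(0, 256, (30, 30)).astype(np.int64)
sat = summed_area_table(window)
f = two_rect_feature(sat, black_rect=(0, 0, 14, 29), white_rect=(15, 0, 29, 29))
```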
A large number of these simple features is initially selected. Then, a complex classifier is constructed by selecting a small number of important features using AdaBoost [8]. These classifiers are combined into a cascade structure, which acts as a (degenerate) decision tree; the most discriminative classifiers are placed at the beginning of the cascade. The classification process works as follows: the sub-window to be evaluated is passed to the first classifier of the chain. If it is classified as a positive example, it is passed to the second classifier, otherwise it is immediately rejected. The process is then repeated for the subsequent classifiers, until the example reaches the end of the cascade, where it is eventually accepted as a positive detection.
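A minimal sketch of this cascade decision logic (illustrative only: the stage features, weights and thresholds are placeholders, not the classifiers trained for this work):

```python
def evaluate_cascade(window, stages):
    """stages: list of (weighted_features, stage_threshold), ordered from the most
    to the least discriminative stage. weighted_features is a list of
    (weight, feature_fn) pairs produced by AdaBoost. A window is rejected as soon
    as one stage fails, so most negative windows cost only a few feature evaluations."""
    for weighted_features, stage_threshold in stages:
        score = sum(weight * feature_fn(window) for weight, feature_fn in weighted_features)
        if score < stage_threshold:
            return False  # rejected early by a discriminative stage
    return True  # survived every stage: positive detection
```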
3.1 Recognition of Numbers
The task to be accomplished is to detect and recognize the numbers depicted on the front of the shirts. Official rules of the most important soccer organizations (like UEFA and FIFA) state that jerseys must be decorated with such numbers on their front, and that the size of the numbers must be within a certain range. Moreover, numbers are in the range from 1 to 22, and remain assigned to each player for the entire tournament. We train a different detector for each number from 1 to 22. We found that this approach is far more reliable than having classifiers for the digits 0-9, because the two digits of two-digit numbers are not always well separated, and so they tend to cause missed detections. Moreover, detecting each digit separately would force us to impose constraints on the spatial arrangement of detected digits, which are not easy to verify in cases where numbers are not perfectly horizontal. Each detector acts as a dichotomizer, allowing us to directly recognize which particular number has been detected. Each classifier has been trained with 50 positive and 100 negative examples, the latter being randomly selected from images, while the former have been manually cropped. Other positive examples have been generated with graphics programs or obtained by small rotations of some selected images. Figure 3 shows examples from the training set used to build the detector for number 10.
Figure 3: Positive examples from the training set used to build the detector for number 10.
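As an illustration of how the 22 per-number detectors can be applied at run time, the sketch below assumes OpenCV cascade files trained as described (the file names and parameter values are hypothetical) and keeps a number only if the same detector fires over a minimum number of frames, the stability requirement mentioned in Sect. 6:

```python
import cv2
from collections import Counter

# Hypothetical paths: one boosted cascade per jersey number (1-22), trained as described above.
detectors = {n: cv2.CascadeClassifier(f"cascades/number_{n}.xml") for n in range(1, 23)}

def detect_number(gray_torso):
    """Run every per-number detector on the (grayscale) torso region below a detected
    face and return the first number whose detector fires, or None."""
    for number, cascade in detectors.items():
        hits = cascade.detectMultiScale(gray_torso, scaleFactor=1.1,
                                        minNeighbors=3, minSize=(20, 20))
        if len(hits) > 0:
            return number
    return None

def stable_number(per_frame_numbers, min_frames=5):
    """Accept a jersey number only if it was detected in at least min_frames frames
    of the shot; isolated detections are treated as noise."""
    counts = Counter(n for n in per_frame_numbers if n is not None)
    if not counts:
        return None
    number, count = counts.most_common(1)[0]
    return number if count >= min_frames else None
```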
3.2 Detecting Frontal Faces
For the face detection task, the algorithm has been trained with a few hundred positive examples taken from a standard dataset, and another 100 examples manually cropped from real soccer video sequences. The algorithm has been trained to detect frontal and quasi-frontal faces only. Despite this large training set, a high number of false detections was initially observed, due to the fact that soccer videos are characterized by many regular patterns that tend to confuse the algorithm by being somewhat "face-like", such as the midfield in a soccer game. To deal with this problem, the basic object detection algorithm of Sect. 3 was modified to take advantage of the several hypotheses that can be made for the particular case of soccer videos. First, negative examples have been mainly selected from actual soccer games, in particular using the background of close-up shots. Since this type of shot has a distinct appearance, it is possible to capture its variation with a rather limited amount of examples. Also, we defined a face verification procedure which is run within the bounding box of each hypothesized face. Two cues are used for this verification step.
The first one again exploits the particular appearance of soccer video close-up shots, which is somewhat dominated by the color of the players' jerseys. For each detected face, we produce a color histogram of the region immediately below the face bounding box. The color model is based on a histogram in the Hue-Saturation-Value color space, in order to decouple chromatic information from other effects, such as shading. This is particularly important in the case of sports videos, since the same scene is taken under several different illumination conditions by the various cameras. Color information, however, is reliable only when both the saturation and the value components are not too small; otherwise it is meaningless (e.g., a pixel could appear to be white, and yet have a red Hue). For this reason, bins corresponding to white, black and 3 other shades of gray are first populated using only the S and V components. Then, the remaining pixels are classified according to their Hue. This is done to avoid, for example, that white pixels, which could be mapped by the RGB to HSV transformation onto pixels with low saturation values, are counted as colored pixels. This histogram is clearly dominated by the principal color of the team jersey, as can be seen in Figure 4, which shows histograms for players belonging to different teams in the same game. For each detected face, this context color histogram c is compared to a reference histogram r, using the χ² statistic:

\chi^2(c, r) = \sum_{i=1}^{n} \frac{(c_i - r_i)^2}{c_i + r_i}

If the above is not below a predefined threshold, the detected face is discarded. The reference histogram r has been manually trained using a very small set of close-up shots (usually one to three) from the game to be annotated.
The last verification step is based on eye detection performed directly on the intensity values. Pixels of the region of interest are first transformed into the YCrCb color space. Then, two eye maps are computed: the first one uses the chrominance information, while the second one uses the luminance component of the pixels. The eye map from the chrominance is based on the observation that high Cb and low Cr values are found around the eyes, and it is built from the information contained in Cb, the negative of Cr, and the ratio Cb/Cr; more precisely, it is defined by the following equation:

E_c = \frac{1}{3} \left( C_b^2 + (255 - C_r)^2 + \frac{C_b}{C_r} \right)
The eye map from the luminance information is based on the observation that eyes usually contain both dark and bright pixels in the luminance component. These regions are identified through morphological operations that use a disk-shaped structuring element. After that, a single eye map is obtained by combining the luminance and chrominance maps. Once the regions present in the final map have been identified, a roundness metric is computed, and if the value is greater than a threshold the shape is considered a possible eye. After that, the position of the eye is evaluated, to rule out eyes in the bottom part of the face. A segmented object that satisfies these checks is recognized as an eye, and a region that has two eyes is finally recognized as a face. Being computationally demanding, the eye detection step described above has been put at the end of the verification procedure; in this way, it is usually performed only on small candidate regions of the image. It should be noted, however, that the envisioned application is intended for off-line annotation of video streams, hence we are not particularly concerned with speed-related issues.

Figure 4: Typical color histograms produced from regions beneath face bounding boxes. The last 5 bins are for 5 shades of gray (from black to white).
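The color-based part of the verification step can be sketched as follows (Python/NumPy; the bin layout, the low-saturation and low-value cutoffs, and the rejection threshold are assumptions rather than the exact values used in the system):

```python
import numpy as np

def jersey_histogram(hsv_region, n_hue_bins=16, s_min=0.2, v_min=0.2):
    """Normalized histogram of the region below a face box: pixels with low saturation
    or low value fall into 5 achromatic bins (black to white), the rest are binned by Hue.
    H, S and V are assumed to be scaled to [0, 1]."""
    h, s, v = hsv_region[..., 0], hsv_region[..., 1], hsv_region[..., 2]
    hist = np.zeros(n_hue_bins + 5)
    achromatic = (s < s_min) | (v < v_min)
    gray_bins = np.minimum((v[achromatic] * 5).astype(int), 4)  # 5 shades of gray
    np.add.at(hist, n_hue_bins + gray_bins, 1)
    hue_bins = np.minimum((h[~achromatic] * n_hue_bins).astype(int), n_hue_bins - 1)
    np.add.at(hist, hue_bins, 1)
    return hist / max(hist.sum(), 1.0)

def chi_square(c, r, eps=1e-9):
    """Chi-square statistic between the candidate histogram c and the reference r."""
    return np.sum((c - r) ** 2 / (c + r + eps))

# A hypothesized face is kept only if chi_square(c, r) stays below a predefined threshold.
```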
4. SUPERIMPOSED TEXT DETECTION AND RECOGNITION
To extract information from superimposed text captions, we must detect the captions and interpret them. Sports video captions may appear anywhere within the frame, even if most of the time they are placed in the lower third or quarter of the image. The vertical and horizontal ratio of the caption zones also varies: e.g., the roster of a team occupies a vertical box, while the name of a single athlete, as well as a match score, usually occupies a horizontal box. Character fonts may vary in size and typeface, and may be superimposed on an opaque background as well as directly on the video. Captions often appear and disappear gradually, through dissolve or fade effects. These properties call for automatic caption localization algorithms with the least amount of heuristics and possibly no training. In [3] we presented an algorithm for superimposed text detection, based on spatio-temporal analysis of image corners. This approach stems from the fact that, to enhance the readability of characters, producers typically exploit luminance contrast, since luminance is not spatially sub-sampled within the TV standards. An image location is defined as a corner if the intensity gradient in a patch around it is not isotropic, i.e. it is distributed along two preferred directions. Corners are image points with large and distinct eigenvalues of the gradient auto-correlation matrix, where subscripts denote partial differentiation with respect to the coordinate axes, and angle brackets indicate Gaussian smoothing:
A = \begin{pmatrix} \langle I_x^2 \rangle & \langle I_x I_y \rangle \\ \langle I_x I_y \rangle & \langle I_y^2 \rangle \end{pmatrix}
c(x, y) = \det A - k \,\mathrm{tr}^2 A, \quad \text{with } k = 0.04

A corner is detected if c(x, y) is above a predefined threshold. The first term of the equation has a significant value only if both eigenvalues are different from zero, while the second term inhibits false corners along the borders of the image. The corner extraction greatly reduces the number of spatial data to be processed by the text detection and localization system. Following the text localization, several steps for text recognition are performed, such as binarization, temporal integration and image enhancement. An example of the results of this algorithm is shown in Figure 5.
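A minimal sketch of this corner measure (using standard gradient and Gaussian smoothing operators from SciPy; apart from k = 0.04, the parameter values are illustrative):

```python
import numpy as np
from scipy.ndimage import gaussian_filter, sobel

def corner_response(gray, sigma=1.5, k=0.04):
    """c = det(A) - k * tr(A)^2, where A is the Gaussian-smoothed gradient
    auto-correlation matrix evaluated at every pixel."""
    ix = sobel(gray, axis=1)                # horizontal derivative I_x
    iy = sobel(gray, axis=0)                # vertical derivative I_y
    ixx = gaussian_filter(ix * ix, sigma)   # <I_x^2>
    iyy = gaussian_filter(iy * iy, sigma)   # <I_y^2>
    ixy = gaussian_filter(ix * iy, sigma)   # <I_x I_y>
    det_a = ixx * iyy - ixy ** 2
    tr_a = ixx + iyy
    return det_a - k * tr_a ** 2

# Corners are the pixels whose response exceeds a predefined threshold; these points
# feed the spatio-temporal caption localization of [3].
gray = np.random.rand(288, 360)             # stand-in for a deinterlaced luminance frame
corners = corner_response(gray) > 0.01      # illustrative threshold
```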
Once the text is extracted, it must be correctly interpreted. In our experiments, we have used a freely available OCR software package ([9]). The system provides a good separation between different words, while making some mistakes in character recognition. Since our objective is to recognize players' names, we deal with this problem using an approximate string matching algorithm to query a database of players taken from the Euro 2004 championship web site, from which most of our test videos come.

Figure 5: (top left) Original frame with detected corners - (top right) Detected caption area - (bottom left) Recognized text area - (bottom right) Recognized text.

5. FACE MATCHING
Not every close-up shot of a soccer video has a superimposed text caption showing the filmed player's name, nor is it always the case that the player's number on his jersey is recognizable. Hence, some interesting shots could remain without a name, even if the face detector provides a correctly detected face. In this case, we have a classification problem, where each class is given by a different player, of which we already have (hopefully!) some labeled examples, resulting from the textual identification. The objective is then to assign every non-identified face to one of the player classes, or to a null class that comprises all non-identified faces that have been detected. However, we have noticed that unlabeled examples vary significantly from annotated examples. We deal with this problem by exploiting the fixed and somewhat limited population of players present in typical soccer videos. The proposed solution is to consider each annotated example as an individual. We avoid merging clusters relative to the same player, because often the resulting cluster would be characterized by high variation in pose and expression, and then every cluster would look very similar to all of the unknown examples. An example of this situation is given in Fig. 6, where the unlabeled face on the right must be assigned to one of the labeled faces on the left, which are (a subset of) faces annotated using the jersey number or text caption. The faces on the first, fourth, and sixth rows represent the same player. However, we consider each row as a distinct individual, and for each sequence we build a compact representation based on local facial features. This has the effect of increasing inter-class distances in our classification task, but at the cost of having an increased number of classes. Hence, for the unlabeled example on the right of Fig. 6, there are 3 possible correct pairings. In practice, to label an unknown face, we require finding a face of the same player with a similar pose and expression. This is reasonable due to the limited number of players, and to the several close-up shots for each player usually occurring during the game.
Figure 6: Left - examples labeled by means of text or number. Right - an unlabeled example to be assigned to one of the labels. Arrows represent possible correct pairings.
The matching process begins by obtaining a single face track for each of the (frontal) faces found in a shot (as described in Sect. 3.2). A face track is a set of (consecutive) faces of the same player in the same shot. The detected face is used as a starting point to initialize (open) the track. First, a skin-tone model is built for the face. This is done by collecting a histogram in the CrCb space of the bounding box, and then using the dominant color as the skin tone. We found this representation particularly useful since usually a single, sharp peak is present in both the Cr and the Cb spaces, as shown in Fig. 7.

Figure 7: (a) Histogram of the Cb component of pixels within a face region - (b) Histogram of the Cr component of pixels within a face region. These are used to define the skin model for a particular face.

This skin color model is used to obtain a rough segmentation of the face with respect to the background, which otherwise would be a major source of noise when matching faces. Also, this helps to break tracks when two or more people occlude each other, as will be described later. Then, eyes are tracked throughout the shot, using a simple correlation-based tracker that uses eye centroids as measure. To avoid false tracks, the search for an eye is performed within a limited region (10 pixels) centered on the last observation, and completely comprised in the region delimited by the skin map of that particular face. When the tracker loses the eye tracks (either due to occlusion, or to drastic appearance changes), the face track is closed and a compact representation of the whole track is produced.
Our representation is a part-based face representation that captures the characteristics of the face throughout the track. It has been shown before that part-based approaches are better suited for face recognition tasks under wide variation of pose and expression. In a very recent work, Sivic et al. [18] showed how features based on the SIFT descriptor [12] can be adopted to efficiently represent faces in feature-length movies. They use 5 SIFT-like descriptors centered on facial features to represent a face. We experimented with several part-based representation schemes, and we obtained the most satisfying results using three SIFT descriptors centered on the two eyes (20 × 20 pixels, with the face size normalized to be 80 pixels wide), and on the midpoint of the eyes (15 × 30 pixels). This choice is motivated by the facts that a) these are the most robust facial features to detect and track throughout the shot, and b) the lower part of the face is often characterized by appearance changes due to variation in expression that exceed those due to identity changes. Our results are corroborated by early studies on cue saliency, which show that the upper part of the face is usually more relevant for the face recognition task than the lower part [17]. We also introduce a significant modification to the basic SIFT descriptor, to exploit the peculiarity of the problem at hand. It is well known that a major cause of noise when matching image patches occurs when the patch overlaps the object boundaries. In our case, this would happen when a patch centered on an eye, for instance, partially overlaps with the background. To avoid this problem, we rely on the skin map to adaptively compute the weights of the components of the SIFT descriptor. In particular, for each pixel of the patch, its weight in the descriptor is cut to zero if the pixel falls outside the region defined by the skin map. Every SIFT descriptor is a vector of length 128. Hence each face in the face track is represented by a 3 × 128 = 384-length vector. Similarity between face tracks is computed using the minimum distance between the two sets. If U is a face track corresponding to a non-labeled player, and L is a labeled face track, their distance is defined as follows:

d(U, L) = \min_{i,j} \lVert U_i - L_j \rVert,
where U_i and L_j are two 384-length vectors, and their distance is measured using the l1 norm. If the distance falls below a predefined threshold, we assign U the label of L. It should be noticed here that, in principle, a single track may be labeled with more than a single label, of which some might be correct and some not. We believe this is acceptable, as long as the number of labels remains low and the threshold has been set such that no more than two or three labels are assigned to the same player. A small human effort could be used to correct non-perfect annotations, as well as to correctly identify faces that do not look similar to any of the labeled examples.
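The track-to-track assignment just described can be summarized by the following sketch (the extraction of the 384-dimensional vectors of Sect. 5 is assumed, the threshold is a placeholder, and, unlike the scheme above, this sketch keeps only the single closest labeled track rather than every track below the threshold):

```python
import numpy as np

def track_distance(track_u, track_l):
    """Minimum pairwise L1 distance between two face tracks, each stored as an
    array of shape (n_faces, 384) of concatenated eye/eye-midpoint descriptors."""
    diffs = np.abs(track_u[:, None, :] - track_l[None, :, :]).sum(axis=2)
    return diffs.min()

def label_track(unknown_track, labeled_tracks, threshold=1.0):
    """Assign the unknown track the label of the closest labeled track if the distance
    falls below the threshold; otherwise it stays in the null class (returns None)."""
    best_label, best_dist = None, np.inf
    for label, track in labeled_tracks.items():
        d = track_distance(unknown_track, track)
        if d < best_dist:
            best_label, best_dist = label, d
    return best_label if best_dist < threshold else None
```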
6. EXPERIMENTAL RESULTS
The system has been tested on soccer games of the last European Championship for national teams (Euro 2004). All the games have been acquired with a standard DVD recorder, hence we are provided with 720 × 576 (PAL) video streams at 25 frames per second. Videos have been reduced to 360 × 288 and deinterlaced.
In the first experiment, the system was run on two games from Euro 2004. Since we wanted to test the robustness of the system (in particular of the face detector) in a wide variety of situations, the video stream was not segmented into shots prior to the analysis; hence, detection is performed on the entire video. For each close-up detected, the corresponding shot is extracted from the video stream. We did not put much effort into optimizing this shot segmentation task; however, satisfying (although not perfect) results were obtained using the algorithm originally presented in [3]. On average, the system selected about 6000 frames for each game, providing 4 minutes of close-up shots with name/face association. The average number of players identified is 12 per game, without repetition. It is common that famous players are identified several times during a game, while other, less known players are filmed perhaps only once during the entire game. Once a frame is selected, the whole shot is marked with the name of the recognized player. Figure 8 shows key frames taken from shots where either a face and a number, or a face and a text caption, have been found. Examples are taken from the game between Germany and the Czech Republic.

Figure 8: Examples of key frames selected by the system from a Euro 2004 game.

Table 1 reports the performance of the number, face and text detectors. The reported ground truth (column "Present") refers to close-up shots where a single player was present, those that are of interest for the desired annotation. Number and text detectors have been run only when a face is detected, hence the "Present" column in this case refers to the 90 shots obtained from the face detector.

Table 1: Face, number and text detector performances.

Detector   Present   Detected   Correct   False   Missed
Face       112       98         90        8       22
Numbers    36        24         20        16      4
Text       12        11         11        1       0

All the AdaBoost-based detectors were trained with a C++ program, running on a Pentium 4 with 1.5 GB of RAM. For each detector, a cascade of 14 stages has been used. The training procedure took about 16 hours for the face detector, and about 6 hours for each number detector. At run time, detection took about 1/3 of a second for faces, including the face verification step. Number detectors took about 1/6 of a second each.
Errors in the recognition of numbers arise either if a number detector signals that a number is present when there is none, or when the recognized number is wrong. The first situation is very unlikely, since we require that detection is stable for a minimum number of frames, while we experienced the second type of error for two-digit numbers in particular: it happened, for instance, that the number 10 was detected instead of the number 18. These errors are likely to occur either when the player is moving, or when the number is highly skewed with respect to the camera.
Table 2 reports results of player identification on a sample game. Not surprisingly, detection of "face and caption" shots is more reliable than detection of "face and number" shots. This is mainly due to misdetections by the face and number detectors, while the caption detector correctly detected nearly all the shots where a caption was present. Moreover, the number of identified close-up shots was fairly low compared with the total number of close-up shots, since identification is not performed when neither a jersey number nor a text caption is present. However, it must be noticed that players' close-ups occurring during the most interesting moments of the game (after a goal, for instance) are usually detected by the system: in the examples shown in Figure 8, the players who scored the two goals in the first half of the game between Germany and the Czech Republic have been correctly detected (first row, second column, and second row, third column, respectively). This is surely a valuable annotation, because shots of players taken after they scored a goal are very likely to be used for event summarization in soccer TV programmes.

Table 2: Summarized results of the annotation of a sample game for the player identification task.

                           Present   Detected   Correct
Face and jersey's number   36        24         20
Face and caption           12        11         11

Figure 9: Other examples of correct results (top row) and missed detections (bottom row).

6.1 Face Matching Results
To test the face matching scheme of Sect. 5, we picked 10 correctly identified faces from the various games present in our testbed, and 30 non-labeled face tracks, for which ground truth was manually obtained. Of these, 25 tracks had a matching face in the annotated set, while the other 5 were completely new to the system. The complete results are shown in Fig. 10. Face tracks and their SIFT-based representation were obtained using Matlab, with about 3 seconds of processing for every second of video. The face matching scheme, also implemented in Matlab, required a small additional time, in the order of a few seconds for all the tracks. To test the accuracy of the face matching scheme, in this experiment we deliberately avoided using context information, such as the color of the players' jerseys. This could, however, be used to rule out obvious false matches (e.g., assigning a player to the wrong team, based on the color of his jersey). We learned two main lessons from these results: first, the error rate grows rapidly as the number of different individuals grows. Second, as can be expected, it is very difficult to assign a face to the null class. Our current work is mainly concerned with dealing with these two issues.
Figure 10: Face matching experiment. Left: ground truth data - detected faces annotated with their name, plus a "null" class comprised of unlabeled faces. Right: results based on face matching - some examples were not labeled, since they were not found similar to any labeled example. Crosses indicate incorrect assignments.

7. CONCLUSIONS
We have presented a system to annotate the identity of a player from close-up shots of sports videos. Peculiar media present in sports videos, the player's jersey and superimposed text, have been exploited to annotate an initial number of shots. After this initial step, "unknown" faces are assigned to one of the already annotated examples by means of a face matching scheme. As can be expected for a very difficult task, results are certainly not perfect, in particular for the face matching part of our system. However, we were able to provide a valuable annotation of entire soccer games, identifying shots of scorers and other notable shots. The proposed face representation has proved to be quite robust to variations due to changes in pose and expression, although the caveat is that it represents only the upper part of the face. We are currently investigating the possibility of describing the same face with several different patches related to the same part observed in different situations (e.g., the mouth). This should increase accuracy for those parts that are characterized by wide, non-rigid variations. We are also trying to use context information, such as the color of the jersey, to ease the matching step and rule out obvious false matches.
Acknowledgments This work has been partially funded by the European VI FP, Network of Excellence DELOS (2004-06).
8. REFERENCES
[1] J. Assfalg, M. Bertini, C. Colombo, A. Del Bimbo, and W. Nunziati. "Semantic annotation of soccer videos: automatic highlights identification." Computer Vision and Image Understanding, Volume 92, November-December 2003.
[2] T. L. Berg, A. C. Berg, J. Edwards, M. Maire, R. White, Y.-W. Teh, E. Learned-Miller, and D. A. Forsyth. "Names and Faces in the News." In Proc. of CVPR, 2004.
[3] M. Bertini, A. Del Bimbo, and P. Pala. "Content-based indexing and retrieval of TV news." Pattern Recognition Letters, 22(5):503-516, 2001.
[4] D. Chen, J.M. Odobez, and H. Bourlard. "Text detection and recognition in images and video frames." Pattern Recognition, Volume 37, Issue 3, March 2004.
[5] R. Chellappa, C.L. Wilson, and S. Sirohey. "Human and machine recognition of faces: a survey." In Proc. of the IEEE, Volume 83, Issue 5, May 1995.
[6] A. Ekin, A.M. Tekalp, and R. Mehrotra. "Automatic soccer video analysis and summarization." IEEE Transactions on Image Processing, Volume 12, Issue 7, July 2003.
[7] M. Everingham and A. Zisserman. "Automated Person Identification in Video." In Proc. of the 3rd International Conference on Image and Video Retrieval (CIVR 2004), 2004.
[8] Y. Freund and R. E. Schapire. "A decision-theoretic generalization of on-line learning and an application to boosting." In Computational Learning Theory: Eurocolt 95, Springer-Verlag, 1995.
[9] GOCR: Open Source Character Recognition. http://jocr.sourceforge.net/screenshots.html
[10] E. Hjelmas and B.K. Low. "Face detection: A survey." Computer Vision and Image Understanding, Vol. 83(3), 2001.
[11] M.H. Yang, D.J. Kriegman, and N. Ahuja. "Detecting faces in images: a survey." IEEE Transactions on Pattern Analysis and Machine Intelligence, Volume 24, Issue 1, January 2002.
[12] D. Lowe. "Distinctive image features from scale-invariant keypoints." International Journal of Computer Vision, 60(2), 2004.
[13] M. Mottaleb and G. Ravitz. "Detection of Plays and Breaks in Football Games Using Audiovisual Features and HMM." In Proc. of Ninth Int'l Conf. on Distributed Multimedia Systems, pp. 154-160, September 2003.
[14] P. Viola and M. Jones. "Rapid object detection using a boosted cascade of simple features." In Proc. of CVPR, pages 511-518, 2001.
[15] S. Nepal, U. Srinivasan, and G. Reynolds. "Automatic detection of 'Goal' segments in basketball videos." In Proc. of the ninth ACM international conference on Multimedia, ACM Press, 2001.
[16] S. Satoh, Y. Nakamura, and T. Kanade. "Name-It: Naming and Detecting Faces in News Videos." IEEE MultiMedia, Vol. 6, No. 1, January-March (Spring), 1999.
[17] J.W. Shepherd, G.M. Davies, and H.D. Ellis. "Studies of cue saliency." In Perceiving and Remembering Faces, Academic Press, London, 1981.
[18] J. Sivic, M. Everingham, and A. Zisserman. "Person spotting: video shot retrieval for face sets." In Proc. of CIVR, July 2005.
[19] C.G.M. Snoek and M. Worring. "Multimodal video indexing: a review of the state-of-the-art." Multimedia Tools and Applications, Volume 25, January 2005.
[20] L. Xie, P. Xu, S.-F. Chang, A. Divakaran, and H. Sun. "Structure Analysis of Soccer Video with Domain Knowledge and Hidden Markov Models." In Proc. of Int'l Conference on Acoustics, Speech, and Signal Processing (ICASSP'02), pp. 4096-4099, May 2002.
[21] X. Yu and D. Farin. "Current and Emerging Topics in Sports Video Processing." In Proc. of International Conference on Multimedia and Expo (ICME), 2005.
[22] W. Zhou, A. Vellaikal, and C.C.J. Kuo. "Rule-based video classification system for basketball video indexing." In Proc. of ACM Multimedia 2000 workshop, pp. 213-216, 2000.