2016 13th Conference on Computer and Robot Vision

Multi-Player Detection with Articulated Mixtures-of-Parts Representation Constrained by Global Appearance

Yang Fang, Shi Long Wei, Jo Geun Sik

School of Computer and Information Engineering, Inha University, Incheon, South Korea

Abstract—We describe a new representation of multiple articulated sport players for court player detection. Existing approaches model an object as deformable part templates or as mixtures of small parts, which capture only the local appearance of parts and the spatial relations between them. Our model is instead trained by a structured SVM on local part information with a global constraint, and it captures not only local appearance and spatial relations but also the semantic relations between body parts, which are critical for precise object detection and pose estimation. Our approach has several novel properties: (1) we adopt a typical articulated part-based model with a global appearance constraint to control the trade-off between recall and precision in detection; (2) we incorporate semantic knowledge about the various articulations of court players into our mixture-of-parts model; such semantic knowledge is widely used for pose estimation; (3) after the root (global) and part (local) bounding boxes are predicted by our system, we train a linear least-squares regression model to output the final detection results, which yields considerable improvements in performance. Experiments on the very challenging APIDIS basketball dataset and the standard INRIA person dataset indicate that our detection system achieves state-of-the-art performance compared to previous works.

Keywords-multiple player detection; deformable parts; structured SVM; semantic knowledge; APIDIS

I. INTRODUCTION

In the past few decades, with the expansion of the internet and new media technologies, millions of videos are generated and uploaded to the internet every day. Among them, sports videos of various kinds are especially attractive. Sports videos have huge potential commercial value, but realizing it depends on sophisticated, intelligent sports video analysis techniques. Vision-based sport player detection, tracking and pose estimation are the critical basis for higher-level analysis of team tactics and player activity, and for automatic and personalized sport summarization, e.g. highlighting star players. These topics have spawned much research in the past decades. Previous works have focused on player detection and tracking in soccer [1], [2], hockey [3], [4], and volleyball [5]. In this paper we work with basketball videos, since basketball is one of the most popular sports in the world. In particular, we develop a new model that can mine semantic knowledge of objects for multiple player detection and pose estimation, and we test our approach on a highly challenging basketball benchmark, the APIDIS dataset¹, which poses the following challenges [6], [7]:

• Basketball players change behavior abruptly and follow complicated motion patterns: they run, jump, crouch, etc.
• Players look very similar, because teammates wear uniforms of the same color. They also share similar height and body shape, which makes them ambiguous even to human eyes.
• In some camera views, players are heavily articulated and occlude one another.
• Reflections of the players on the floor make it hard for most detection algorithms to distinguish a reflection from a real object, which yields false positive detections in some frames.

Earlier works [8], [9] explored the pictorial structure framework for modeling a visual object: the object is decomposed into many local parts, together with geometric constraints visualized as springs. The winner [10] of the 2006 PASCAL person detection challenge trained a linear SVM classifier for person/non-person detection based on the histogram of oriented gradients (HOG) feature, which is now widely used in pattern recognition. Felzenszwalb et al. [11] treated part information as latent parameters that can be learned and updated by a latent SVM solver; this work is regarded as a state-of-the-art approach for object detection. Our paper provides a novel approach that models sport players with a flexible mixture-of-parts representation constrained by global appearance. Our work is most inspired by Y. Yang and D. Ramanan [12], who introduced a new representation of part models that captures spatial and co-occurrence relations between parts and mixtures of parts; this tree-structured model can be efficiently optimized with dynamic programming.

To reduce false alarms, we apply the global appearance constraint presented by N. Dalal [10] and Felzenszwalb et al. [11]. We demonstrate that our model can capture global and local appearance information as well as the spatial and co-occurrence relations between parts and mixtures of parts for multiple articulated players, and that it obtains state-of-the-art performance compared to previous works on a challenging basketball dataset. The rest of the paper is organized as follows. Sec. 2 describes related work. The construction of the tree-structured model with global appearance constraint and the detailed process of player detection, model learning and inference are discussed in Sec. 3 and Sec. 4, respectively. Finally, we present experimental results on the APIDIS and INRIA datasets in Sec. 5 and conclude in Sec. 6.

¹ http://sites.uclouvain.be/ispgroup/index.php/Softwares/APIDIS

978-1-5090-2491-9/16 $31.00 © 2016 IEEE DOI 10.1109/CRV.2016.57

II. RELATED WORK

Santiago [13] surveys two categories of techniques for sport player detection and tracking: (i) intrusive systems, where special tags or sensors are placed on the players, and (ii) non-intrusive systems, where only computer vision techniques are used and no extra devices are placed in the game environment. Each category poses its own specific challenges. In the surveys [7], [15], Porikli studies object detection and tracking methods with a single fixed camera. Given a fixed camera, objects can be detected by modeling the background and tracked by data association after the detection step. The authors of [6], [14] present a generic approach to detect and track basketball players with a network of fixed and omnidirectional cameras, given severely degraded foreground silhouettes. They formulate the task as a sparsity-constrained inverse problem using an adaptive dictionary constructed online. The framework places no constraint on the number of cameras or on the surface to be monitored. D. Delannay [15] proposes a method to detect and recognize players on a sports field, which is also based on a distributed set of loosely synchronized cameras. Detection assumes that players stand vertically and sums the cumulative projection of the multi-view foreground activity masks onto the ground plane; after summation, large projection values indicate the positions of players on the ground plane. In [2], player detection is achieved by running a boosted cascade of Haar-like features on global views. First, the background is filtered out using playfield segmentation to reduce false alarms; the filtered image regions are then scanned at multiple scales. Scanning usually produces multiple detections around each player, and adjacent detections are merged to yield final detections with proper scale and position.

The most efficient and widely used features for human detection, Histograms of Oriented Gradients (HOG), were introduced by N. Dalal and B. Triggs [10] in 2005. HOG is computed by dividing the image into small spatial regions, called cells. For each cell, a local 1-D histogram of gradient directions or edge orientations is accumulated over the pixels of the cell. The combined histogram entries form the representation of the object. Building on the HOG feature, Felzenszwalb et al. [11] perform object detection with discriminatively trained part-based models. In that work, objects are modeled by a global appearance and local part appearances, together with a deformation cost on the location of each part relative to the root location. However, although this approach achieves state-of-the-art results compared to other works on standard human detection datasets, e.g. the INRIA person dataset and the PASCAL VOC dataset, it still performs poorly on sport player detection. Recently, Y. Yang and D. Ramanan [12] proposed a new representation, a mixture of non-oriented pictorial structures, to model a family of affine-warped templates. The new representation jointly captures spatial relations between part locations and co-occurrence relations between part mixtures.

III. TREE-STRUCTURED MODEL WITH GLOBAL APPEARANCE

A. Tree-structured part model

At this step, we decompose a player into parts connected by fourteen articulation points (chin, scalp, two ankles, two knees, two hips, two wrists, two elbows and two shoulders). Together with these fourteen points, we use the body parts between pairs of neighboring articulation points, which gives 26 part filters in total. The model captures the local appearance of each part as well as the spatial and co-occurrence relations between pairs of parts and/or mixtures of parts of a given type; we call this the articulated representation of a player. These relations are tree-structured. We train a fixed-size filter for each part with a structured SVM solver; at detection time, the part filters are applied at all positions and over all scales of an image. The articulated representation of a single player is illustrated in Figure 1.
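Applying a fixed-size part filter at every position of a feature map amounts to a sliding dot product. The sketch below shows this for a single pyramid level, under the simplifying assumption of one feature channel per cell (real HOG cells carry a multi-dimensional vector). The function name `filter_response` and the toy arrays are hypothetical and are not taken from the paper's implementation.

```python
from typing import List

Grid = List[List[float]]


def filter_response(features: Grid, filt: Grid) -> Grid:
    """Score of placing `filt` at every valid top-left position of `features`."""
    fh, fw = len(filt), len(filt[0])
    rows = len(features) - fh + 1
    cols = len(features[0]) - fw + 1
    response = []
    for y in range(rows):
        row = []
        for x in range(cols):
            # Dot product between the filter and the window anchored at (x, y).
            score = sum(
                filt[dy][dx] * features[y + dy][x + dx]
                for dy in range(fh)
                for dx in range(fw)
            )
            row.append(score)
        response.append(row)
    return response


# Toy single-channel feature map and a 2x2 filter (hypothetical values).
features = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 2.0, 0.0],
    [0.0, 3.0, 4.0, 0.0],
]
filt = [[1.0, 0.0], [0.0, 1.0]]

response = filter_response(features, filt)
# Highest response and its (x, y) position on this level.
best = max((v, (x, y)) for y, r in enumerate(response) for x, v in enumerate(r))
```

To cover all scales, the same function would be run on every level of the feature pyramid and the responses compared across levels.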

Figure 1: (left) We locate all articulation points together with the parts midway between neighboring pairs of articulation points; each midpoint is obtained as a linear combination of a pair of neighboring articulation points. (right) The tree structure of our model. Each colored line represents one subtree, in which a lower-numbered point is the child of its neighboring higher-numbered parent node, in bottom-up order. The model can be efficiently optimized with dynamic programming.
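As a small illustration of the construction in Figure 1, the sketch below computes midpoint parts as linear combinations of neighboring articulation points. The joint names, the coordinates, the `LIMB_EDGES` skeleton and the 0.5/0.5 weights are assumptions made for this example; the paper does not list the exact pairs or weights that produce its 26 parts.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]

# Hypothetical annotated articulation points (x, y) of one arm, in pixels.
joints: Dict[str, Point] = {
    "l_shoulder": (40.0, 30.0),
    "l_elbow": (36.0, 50.0),
    "l_wrist": (30.0, 68.0),
}

# Hypothetical skeleton edges between neighboring articulation points.
LIMB_EDGES: List[Tuple[str, str]] = [
    ("l_shoulder", "l_elbow"),
    ("l_elbow", "l_wrist"),
]


def interpolate(a: Point, b: Point, alpha: float = 0.5) -> Point:
    """Linear combination alpha * a + (1 - alpha) * b of two points."""
    return (alpha * a[0] + (1 - alpha) * b[0],
            alpha * a[1] + (1 - alpha) * b[1])


def add_midpoint_parts(points: Dict[str, Point],
                       edges: List[Tuple[str, str]]) -> Dict[str, Point]:
    """Return the articulation points plus one midpoint part per edge."""
    parts = dict(points)
    for u, v in edges:
        parts[f"{u}--{v}"] = interpolate(points[u], points[v])
    return parts


parts = add_midpoint_parts(joints, LIMB_EDGES)
```

Other values of `alpha` would place an extra part closer to one of the two joints, which is one way a skeleton with fourteen points could be extended to a larger part set.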

We use $I$ to denote an image, $p_i = (x_i, y_i)$ for the location of part $i$, and $t_i$ for the “type” of part $i$, where $i \in \{1, \cdots, K\}$, $p_i \in \{1, \cdots, L\}$ and $t_i \in \{1, \cdots, T\}$. The types of a part include its orientation (e.g., a vertically oriented versus a horizontally oriented arm), out-of-plane rotations and even semantic classes (an open versus a closed hand).

We use the Histogram of Oriented Gradients (HOG) feature to represent part appearance. The basic idea is that local object appearance and shape can be characterized well by the distribution of local intensity gradients or edge directions. Specifically, we divide an image into non-overlapping cells of 8x8 pixels and accumulate a 1-D histogram of gradient orientations for each cell. At each pixel, the gradient orientation is discretized into one of nine orientation bins, and each pixel “votes” for the orientation of its gradient. We then apply contrast normalization with respect to the gradient energy in a neighborhood around each cell, for invariance to illumination changes and shadowing. To do so, we take the four 2x2 blocks of cells that contain a given cell and normalize the histogram of that cell with respect to the energy in each of these blocks. This yields a 9x4 = 36-dimensional vector representing the local gradient information of each cell. We compute HOG features at every level of a feature pyramid: features at the upper levels capture coarse appearance information over fairly large areas, while features at the bottom levels capture finer appearance information over small areas. The HOG feature representation of a basketball player is shown in Figure 2.

Figure 2: (left) The original image of a player; the fourteen articulation points of the object are connected in a spring structure (shown as red wavy lines). (right) The HOG features computed from the original image, visualized by gradient orientation, which capture part appearance information.

1) Rigidity modeling

We construct a relational graph $G = (V, E)$ for the tree-structured human model, in which the set $V$ contains the nodes and the edges $E$ specify the consistency relations that constrain related pairs of parts. Note that such a graph can also encode relations between distant parts through transitivity. For example, we can assign the same orientation to a set of parts, not only to two of them, as long as these parts form a subtree of $G = (V, E)$. We use this property to model the multiple parts on the torso, since all torso parts share the same orientation.

2) Appearance and deformation model

For each part $i$ there are different types, which correspond to different appearances. We learn a specific appearance model for each type $t_i$ of part $i$; its score is the dot product between the learned parameters, called local filters, and the concatenated HOG feature values. We use filters of fixed size to search over the HOG feature pyramid for the best score, which is taken as the predicted part. We define the score of a part configuration as follows:

$$S(I, p, t) = \sum_{i \in V} w_i^{t_i} \cdot \phi(I, p_i) + \sum_{ij \in E} w_{ij}^{t_i, t_j} \cdot \psi(p_i - p_j). \tag{1}$$

The first sum in (1) is an appearance model for all parts, where $\phi(I, p_i)$ denotes the HOG feature extracted at pixel location $p_i$ of image $I$, and $w_i^{t_i}$ is the learned filter for part $i$ with type $t_i$; the local appearance score is computed by placing the template $w_i^{t_i}$ at pixel location $p_i$. The second sum is our deformation model, which can be interpreted as a “switching” spring model controlling the relative placement of part $i$ and part $j$. We define $\psi(p_i - p_j) = [dx, dx^2, dy, dy^2]^T$ as the relative placement of part $i$ and part $j$, where $dx = x_i - x_j$ and $dy = y_i - y_j$. The parameters $w_{ij}^{t_i, t_j}$ encode not only the form of the deformation cost but also the dependence of local appearance on geometry. In this paper we simplify the second sum in (1), which is especially efficient for capturing articulation: we let the relative location of part $i$ with respect to its parent depend only on the type of the part itself, not on the parent type. For example, let $i$ be the hand part and $j$ its parent elbow part, with types that capture orientation. A sideways-oriented hand should lie next to the elbow, while a downward-oriented hand should lie below the elbow, regardless of the orientation of the upper arm; that is, $w_{ij}^{t_i, t_j} = w_{ij}^{t_i}$. If $w_{ij}^{t_i} = [0, 1, 0, 1]$, the deformation cost for part $i$ is the squared distance between its actual position and its anchor position relative to its parent part.

3) Co-occurrence model

To assign a particular type to each part $i$ and to each pair of parts $i$ and $j$, we define a compatibility function that scores a type assignment as a sum of local and pairwise terms:

$$S(t) = \sum_{i \in V} b_i^{t_i} + \sum_{ij \in E} b_{ij}^{t_i, t_j}, \tag{2}$$

where $b_i^{t_i}$ favors particular type assignments for part $i$, and the pairwise parameter $b_{ij}^{t_i, t_j}$ favors particular co-occurrences of part types. For example, if the types of part $i$ and part $j$ correspond to orientations and both parts are located on the torso, then $b_{ij}^{t_i, t_j}$ should favor consistent orientation assignments. That is, $b_{ij}^{t_i, t_j}$ should be a large positive number for consistent orientations of part $i$ and part $j$ and a large negative number for inconsistent orientations. Adding the co-occurrence term to (1), we can now write the full score of a configuration of part types and locations as follows:

$$S(I, p, t) = S(t) + \sum_{i \in V} w_i^{t_i} \cdot \phi(I, p_i) + \sum_{ij \in E} w_{ij}^{t_i, t_j} \cdot \psi(p_i - p_j). \tag{3}$$
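To make (3) concrete, the following sketch evaluates the score of one fixed configuration on a toy two-part tree. All numbers, the dictionary layout and the helper names (`psi`, `dot`, `config_score`) are assumptions for illustration. The appearance terms $w_i^{t_i} \cdot \phi(I, p_i)$ are assumed to be precomputed filter responses, and the deformation weights depend only on the child type, following the simplification $w_{ij}^{t_i, t_j} = w_{ij}^{t_i}$.

```python
def psi(p_i, p_j):
    """Deformation feature [dx, dx^2, dy, dy^2] of the offset p_i - p_j."""
    dx, dy = p_i[0] - p_j[0], p_i[1] - p_j[1]
    return [dx, dx * dx, dy, dy * dy]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def config_score(parts, edges, p, t, appearance, b_unary, b_pair, w_def):
    """S(I, p, t) = S(t) + appearance terms + deformation terms, as in (3)."""
    score = 0.0
    for i in parts:
        score += b_unary[i][t[i]]            # b_i^{t_i}
        score += appearance[i][t[i]][p[i]]   # w_i^{t_i} . phi(I, p_i)
    for i, j in edges:                       # i is the child, j its parent
        score += b_pair[(i, j)][(t[i], t[j])]               # b_ij^{t_i, t_j}
        score += dot(w_def[(i, j)][t[i]], psi(p[i], p[j]))  # w_ij^{t_i} . psi
    return score


# Toy data (all values hypothetical).
parts = ["elbow", "hand"]
edges = [("hand", "elbow")]
p = {"elbow": (5, 5), "hand": (5, 8)}
t = {"elbow": 0, "hand": 1}
appearance = {"elbow": {0: {(5, 5): 2.0}}, "hand": {1: {(5, 8): 1.5}}}
b_unary = {"elbow": {0: 0.5}, "hand": {1: 0.25}}
b_pair = {("hand", "elbow"): {(1, 0): 1.0}}
# Assumed sign convention: negative quadratic weights penalize large offsets;
# -dy^2 + 6*dy peaks at dy = 3, i.e. the hand 3 pixels below the elbow.
w_def = {("hand", "elbow"): {1: [0.0, -1.0, 6.0, -1.0]}}

total = config_score(parts, edges, p, t, appearance, b_unary, b_pair, w_def)
```

Here the hand sits exactly at its preferred offset, so the deformation term contributes its maximum value of 9 and the total score is the sum of all five kinds of terms.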

Since $G = (V, E)$ is tree-structured, we can maximize $S(I, p, t)$ by dynamic programming. For each part we consider whether it has a parent and child parts. If part $i$ is a “bottom” part (e.g. a hand or an ankle), we simply compute its maximal score and pass it to its “upper” parent part. If part $i$ is a “middle” part (e.g. an elbow or a knee), its final score has two portions: the score computed from the part itself and the scores passed from its child part(s); it then passes this final score on to its parent part. If part $i$ is the “top” part (only the scalp in our model), its final score is its own filter score plus the scores of all its children, and it represents the result of our model. Based on this final result, we can predict whether an articulated object is present. Passing the message from a child part to its parent part is formulated as follows:

$$\mathrm{score}_i(t_i, p_i) = b_i^{t_i} + w_i^{t_i} \cdot \phi(I, p_i) + \sum_{k \in \mathrm{kids}(i)} m_k(t_i, p_i), \tag{4}$$

$$m_i(t_j, p_j) = \max_{t_i} \Big[ b_{ij}^{t_i, t_j} + \max_{p_i} \big( \mathrm{score}_i(t_i, p_i) + w_{ij}^{t_i} \cdot \psi(p_i - p_j) \big) \Big]. \tag{5}$$

- $[(x_1, y_1), \cdots, (x_k, y_k), \cdots, (x_K, y_K)]$. Among them we select the top-left and bottom-right positions (

Figure 3: The first image (top left) shows the complex environment of a basketball court, with strong illumination contrast, reflections of players on the floor and severe shadows. The second image (top right) shows false positive detections in a basketball scene; in this case we apply only the tree-based part model and train a simple regression model on the part detection bounding boxes, which are shown in the third image (bottom left). The last image (bottom right) shows the final detection results of our part model with global appearance constraint.
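The message passing in (4) and (5) can be sketched in a few lines. The code below is a minimal illustration on a toy tree, assuming a tiny discrete set of locations and types so that every maximization can be done by exhaustive search; practical systems typically speed up the inner maximization over locations (for example with distance transforms), which is not shown. The part names, the toy parameters and the helpers `max_tree_score` and `brute_force` are hypothetical; the brute-force enumeration is included only to check the result.

```python
from itertools import product


def psi(p_i, p_j):
    """Deformation feature [dx, dx^2, dy, dy^2] of the offset p_i - p_j."""
    dx, dy = p_i[0] - p_j[0], p_i[1] - p_j[1]
    return [dx, dx * dx, dy, dy * dy]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def max_tree_score(root, kids, types, locs, unary, b_pair, w_def):
    """Maximum of the tree score over all types and locations, via (4)-(5)."""

    def score(i):
        # Eq. (4): local score of part i plus the messages from its children.
        table = {(ti, pi): unary[i][(ti, pi)] for ti in types for pi in locs}
        for k in kids.get(i, []):
            msg = message(k, i)
            for state in table:
                table[state] += msg[state]
        return table

    def message(i, j):
        # Eq. (5): for each parent state (t_j, p_j), the best child state.
        s = score(i)
        return {
            (tj, pj): max(
                b_pair[(i, j)][(ti, tj)] + s[(ti, pi)]
                + dot(w_def[(i, j)][ti], psi(pi, pj))
                for ti in types for pi in locs
            )
            for tj in types for pj in locs
        }

    return max(score(root).values())


def brute_force(parts, parent, types, locs, unary, b_pair, w_def):
    """Exhaustive maximum over all joint assignments, for checking only."""
    states = [(ti, pi) for ti in types for pi in locs]
    best = None
    for combo in product(states, repeat=len(parts)):
        assign = dict(zip(parts, combo))
        total = 0
        for i in parts:
            ti, pi = assign[i]
            total += unary[i][(ti, pi)]
            if i in parent:
                tj, pj = assign[parent[i]]
                total += b_pair[(i, parent[i])][(ti, tj)]
                total += dot(w_def[(i, parent[i])][ti], psi(pi, pj))
        best = total if best is None else max(best, total)
    return best


# Toy tree (hypothetical layout); the scalp is the root, as in the text.
parts = ["scalp", "chin", "l_shoulder", "r_shoulder"]
parent = {"chin": "scalp", "l_shoulder": "chin", "r_shoulder": "chin"}
kids = {"scalp": ["chin"], "chin": ["l_shoulder", "r_shoulder"]}
types = [0, 1]
locs = [(0, 0), (0, 1), (1, 0), (1, 1)]

# unary[i][(t_i, p_i)] stands for b_i^{t_i} + w_i^{t_i} . phi(I, p_i).
unary = {
    part: {(ti, pi): (7 * n + 3 * ti + 5 * pi[0] + 2 * pi[1]) % 5
           for ti in types for pi in locs}
    for n, part in enumerate(parts)
}
b_pair = {(c, q): {(ti, tj): (1 if ti == tj else -1)
                   for ti in types for tj in types}
          for c, q in parent.items()}
w_def = {(c, q): {0: [0, -1, 0, -1], 1: [1, -1, 0, -1]}
         for c, q in parent.items()}

dp_best = max_tree_score("scalp", kids, types, locs, unary, b_pair, w_def)
bf_best = brute_force(parts, parent, types, locs, unary, b_pair, w_def)
```

The dynamic program visits each edge once and compares every child state with every parent state, whereas the brute-force check enumerates all joint assignments and grows exponentially with the number of parts.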
