J Syst Sci Complex (2013) 26: 43–61

EXTRACTING BEHAVIOURAL MODELS FROM 2010 FIFA WORLD CUP∗
MENÉNDEZ Héctor · BELLO-ORGAZ Gema · CAMACHO David

DOI: 10.1007/s11424-013-2289-9
Received: 20 January 2012 / Revised: 25 June 2012
© The Editorial Office of JSSC & Springer-Verlag Berlin Heidelberg 2013

Abstract The FIFA World Cup™ is the most profitable worldwide event. FIFA publishes global statistics of this competition, which provide aggregate data about the players and teams. This work focuses on the extraction of behavioural patterns for both player and team strategies through the automated analysis of this dataset. The knowledge and models extracted could be applied to soccer leagues or even to sports betting. However, the main contribution is the study of several automatic knowledge extraction techniques, such as clustering methods, and of how these techniques can be used to obtain useful behavioural models from a global statistics dataset. The clusters produced by the different algorithms show similar properties, which have been combined to define the models, making the human interpretation of these statistics easier. Finally, the strategies of the most successful teams have been analysed and compared.

Key words Behavioural patterns, clustering, FIFA World Cup, football, soccer, web mining.

1 Introduction

1.1 Basics on FIFA World-Cup Competition

The World Cup is a football competition that has been held every four years since it began in 1930 (with the exception of 1942 and 1946, when it was suspended because of the Second World War). The competition brings together national football teams belonging to the Fédération Internationale de Football Association (FIFA). The first tournament was held in Uruguay, and it has moved between different countries up to the last competition, which was held in South Africa in 2010. In this competition, teams from the whole world play to win the Football World Cup, the most prestigious competition of this sport†. There are two phases in this competition: a qualifying phase (where teams from the whole world compete to qualify for the final phase) and a final phase (where the winner is decided). This work focuses on the analysis (based on data mining techniques) of the final phase of the last World Cup Competition. This phase is divided into group matches (there are 8 groups of four teams; each team plays three matches and the top two teams of each group advance to the next stage) and the knock-out stage (a single-elimination tournament in which teams compete on a pairwise basis, with extra time and penalty shoot-outs used to decide the winner if necessary). The knock-out stage begins with the "round of sixteen", in which the winner of each group plays against the team placed second in another group. This is followed by the quarter-finals, the semi-finals, the third-place match (contested by the teams which lose the semi-finals), and lastly the final of the championship.

MENÉNDEZ Héctor · BELLO-ORGAZ Gema · CAMACHO David
Department of Computer Science, Universidad Autónoma de Madrid, Madrid 28049, Spain. Email: [email protected]; [email protected]; [email protected].
∗ This work has been partly supported by: Spanish Ministry of Science and Education under project TIN201019872, the grant BES-2011-049875 from the same Ministry, and by Jobssy.com company under project FUAM076913.
This paper was recommended for publication by Editors FENG Dexing and HAN Jing.
† The World Cup generates revenue of around $4 billion for FIFA, so it is currently considered the most profitable sporting event[1].

1.2 Related Work

Several techniques such as statistics, data mining, and machine learning have been used to analyse the performance of teams and players in sports such as soccer, football, and basketball. These approaches, usually named human or robot behaviour modelling, have been applied in different domains like Robosoccer simulations[2−5], but in these examples all the information is fully controlled and simulated. Similar analyses applied to human team games can be found for the NBA league. Vaz de Melo, et al.[6] analysed this league over its whole history by creating a complex network model and studying its evolution. For football (soccer) analysis, Onody and Castro[7] proposed a model also based on complex networks but applied to Brazilian players, and Bitter, et al.[8] generated a statistical model, modifying classical probability distributions such as the Bernoulli and Gaussian distributions to create a score model for different leagues. For the 2010 FIFA World Cup, Cotta, et al.[9] generated a complex network of pass distributions, using their own data instead of the FIFA global statistics which have been used in this work.
These methods cannot be applied to the dataset provided by the FIFA web site because only global statistics of the competition are available. This work is based on a previous analysis of robosoccer teams[4], where the behaviour of the teams was modelled using similar clustering techniques: K-means[10] and Spectral Clustering[11]. In our previous work, only simulated data was used, generated with a robosoccer simulation platform. In this work some popular Data Mining methods[10], such as clustering algorithms, are combined to extract the behaviour of players and teams during the last soccer World Cup Competition. This work approaches the complex-system problem through the players who compose the soccer team: the analysis is built from the players, who are then used to explain the teams they belong to. The clusters which define the players' behaviour are created separately for each position (goalkeeper, defender, midfielder, and forwarder), and these models are used to define the general possible behaviours of each player position within the team. Finally, this information provides a perspective on team strategies, which are compared in the analysis.

The rest of the paper is structured as follows: Section 2 describes the methodology and Section 3 describes the dataset. Section 4 defines the general approach of the preprocessing and clustering methods used. Section 5 shows how the data has been preprocessed, selected, and used to define the clustering model. Section 6 shows the results of the analysis. Finally, the last section provides the conclusions and future work.

2 Methodology

Our analysis is divided into two phases. The first phase defines the behaviours of players and teams. It creates the models which will be used to search for patterns that help to understand the teams' strategy. The second phase focuses on the teams which achieved the best results: Spain, Netherlands, Germany, and Uruguay (first, second, third, and fourth places of the last FIFA World Cup, respectively). It also includes a deeper analysis of the Spanish team, because Spain has usually been defeated in quarter-final matches but won the last competition.

The first phase of the analysis uses clustering techniques. These techniques were designed to find hidden information or features in a dataset by grouping data with similar properties into clusters. In this analysis, these clusters define the behavioural models for both players and teams. The clustering phase needs a previous data preprocessing step. The preprocessing techniques are used to avoid outliers and to choose which variables are most relevant for the analysis, reducing the dimension of the problem and simplifying the analysis (these methods are known as feature selection[12]).

In this case, the information provided by the FIFA statistics is global, which means that each feature is correlated with the number of matches and the minutes played by each team and player. Hence, preprocessing is critical to the analysis, since there is no information about each team and player per match: players who have played more time will have better global features than players who have played only one or two matches. Some variables are therefore combined to extract rates for the players, and some variables are estimated per match or per minute, depending on the information they provide. The global nature of the statistics is the main limitation of this analysis, because the evolution of the teams during the competition cannot be predicted. Nevertheless, our goal is to define the behaviours of the teams and players, and global features are useful enough for this target.
Once the clustering and behaviour definition process has finished, the analysis focuses on the most successful teams, as mentioned above. The designed behavioural models are applied to these teams to look for relationships between their final results in the competition and how the individual players could have influenced the global behaviour of the teams.
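The per-match or per-minute estimation mentioned in this section can be illustrated with a minimal sketch. The function name, the `scale` and `min_minutes` parameters, and the sample figures are assumptions introduced here for illustration; the paper does not specify how the rates were computed.

```python
def per_minute_rates(totals, minutes, scale=90.0, min_minutes=1):
    """Convert tournament totals into rates per `scale` minutes played.

    totals      -- dict of global counts for one player (hypothetical keys)
    minutes     -- total minutes played in the tournament
    scale       -- 90.0 expresses the rate per full match (an assumption)
    min_minutes -- players below this playing time get no rate at all
    """
    if minutes < min_minutes:
        return None
    return {name: scale * value / minutes for name, value in totals.items()}


# Invented figures: a regular starter and a substitute with little playing time.
regular = per_minute_rates({"passes": 420, "clearances": 21}, minutes=630)
substitute = per_minute_rates({"passes": 30, "clearances": 2}, minutes=45)

print(regular)     # passes and clearances per 90 minutes
print(substitute)  # comparable scale despite far fewer minutes played
```

With raw totals the starter would dominate every variable; after rescaling, both players can be compared on the same footing, which is the motivation given above for this preprocessing step.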

3 The FIFA Dataset

All the information used about the FIFA World Cup can be found on the FIFA web site[13]. This information, as explained before, provides global features of players and teams taken over the whole tournament. There are 75 variables about players and teams. The dataset only summarizes the features and does not give any perspective on the evolution of players and teams during the competition. This information can therefore be used for analytical purposes, but it is useless for prediction.

The analysis of the dataset focuses on extracting the most relevant variables for grouping (through clustering) the different kinds of players (defender, forwarder, goalkeeper, and midfielder) and teams. Some of the variables can be combined in ratios (or are already combined). An example is "shots on bar" over "total shots". Variables which can compose ratios have been combined in this way to reduce the number of variables. The final dataset used by the preprocessing process is the following (grouped by related information):

General. This category is related to the players or teams. It provides the names of the teams and players, the players' positions, and the matches and minutes played.

Goals. This category is related to the goals of the matches. It gives information about the goals scored, the position of the player when he shoots the ball, the own goals, the penalty goals, and the goals conceded during the competition.

Shots. This category provides information about the shots during the competition. This information is related to the shots on goal, shots wide, shots blocked, and their position.


Attacking. This category is focused on the offensive behaviour of the teams and players. It gives information about the assists and deliveries, the different positions of the attacks (left, right, center), the lost balls, solo runs, and tackles suffered losing possession.

Defending. This category covers the defensive behaviour of the teams and players. It provides information about the recovered balls, saves, clearances, and tackles made gaining possession.

Disciplinary. This category is a measure of sportsmanship. It provides information about the fouls that a player or a team has committed or suffered during the competition. It also counts the number of yellow cards, red cards, and handballs.

Passes. This category provides information about the passes made during the competition. It divides the passes into short, medium, and long passes. It also gives information about the crosses, corners, and throw-ins.

Distance. This category is related to the distance that players have covered and their activity time. It divides the distance information depending on the possession of the ball. It also distinguishes between three levels of activity: low, medium, and high.

These categories and their related variables (see Appendix A for a further description) are used to define the player and team behaviours, and they give a perspective on how a player can be considered better than the others in his position or what the global strategy of a team in the competition is.
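The ratio construction mentioned in this section ("shots on bar" over "total shots") can be sketched as follows. The helper name, the dictionary keys, and the sample record are hypothetical; the sketch only shows how two raw counts collapse into a single variable.

```python
def replace_with_ratio(record, numerator, denominator, ratio_name):
    """Return a copy of `record` where two raw counts are replaced by their ratio.

    A zero denominator yields 0.0, which is an assumption made here to keep
    players without any shot in the dataset.
    """
    reduced = {k: v for k, v in record.items() if k not in (numerator, denominator)}
    total = record[denominator]
    reduced[ratio_name] = record[numerator] / total if total else 0.0
    return reduced


# Invented record for illustration only.
player = {"name": "Player A", "shots_on_bar": 1, "total_shots": 8, "passes": 120}
reduced = replace_with_ratio(player, "shots_on_bar", "total_shots", "bar_shot_ratio")
print(reduced)
```

The record loses one variable (two counts become one ratio), which is how ratios contribute to reducing the dimension of the problem.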

4 Data Preprocessing and Clustering Methods

This section describes the preprocessing and clustering techniques which have been used to extract the analytical models. It starts with the data preprocessing methods used to reduce the feature space (originally formed by the 75 variables included in the dataset extracted from the 2010 FIFA World Cup web site[13]). Clustering algorithms are then used in the analysis process, because a blind search for similar behaviours within the dataset is the main goal of this work. These methods are explained in detail at the end of this section.

4.1 Preprocessing and Normalizing the Features

Data Mining techniques need an intensive phase of data preprocessing. Initially the information must be analysed and stored in some kind of database system, cleaned, and separated. This preprocessing phase is used to avoid outliers, misclassifications, and missing data. Methods such as histograms and statistical correlation are used to clean the dataset and reduce the number of variables[10]. Projections are also common in dimension reduction; however, projection methods[14] such as PCA (Principal Component Analysis) or LDA (Linear Discriminant Analysis) do not offer a complete perspective of the problem we are dealing with. These methods create new variables, estimated from principal components or linear projections, which try to separate the data and reduce its dimension. They are not suitable for this kind of analysis because the original information of the features is lost and unrecoverable once it is projected. The analysis must show which features are most representative in the clustering process, and this cannot be achieved with projections because the human interpretation of the variables is lost.

The second step is normalization. It allows data features with different kinds or ranges of values to be compared. The Z-Score[15] and Min-Max[16] normalization methods are commonly used for preprocessing the data. Both normalization algorithms take the attribute records and try to find a standard range for them.
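The two normalization methods named above have standard textbook definitions, which the following minimal sketch implements with the Python standard library. The function names and the sample values are assumptions for illustration; the use of the population standard deviation is also a choice made here, not something the paper states.

```python
import statistics


def z_score(values):
    """Centre values on their mean and scale by the population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return [0.0 for _ in values]   # constant attribute: nothing to scale
    return [(v - mean) / std for v in values]


def min_max(values, low=0.0, high=1.0):
    """Rescale values linearly so that they span the interval [low, high]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [low for _ in values]   # constant attribute maps to the lower bound
    return [low + (v - lo) * (high - low) / (hi - lo) for v in values]


# Invented pass counts for three players.
passes = [120, 300, 480]
print(z_score(passes))   # symmetric around 0
print(min_max(passes))   # [0.0, 0.5, 1.0]
```

Z-score expresses each value relative to the spread of the attribute, whereas Min-Max fixes the range; the two can be chained so that all attributes end up on a common scale.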


4.2 Clustering Techniques

Once the data is preprocessed and normalized, a large number of data mining methods could be used to analyse it. Clustering methods have been used to extract the kind of behaviour deployed by a particular team, or a single player, in a tournament. With these methods it is possible to establish the current behaviour of a team or player and then automatically classify it into a set of predefined behaviour models. Simple algorithms such as K-means and Spectral Clustering have been tested to compare the available information.

K-means[17] is a popular and well-known clustering algorithm. It is a straightforward guided clustering method (the guidance usually comes from a heuristic or directly from a human) which classifies data into a predefined number of clusters. Spectral clustering methods were introduced by Ng, et al. in [11]. These methods apply the knowledge extracted from spectral graph theory to clustering techniques. The Spectral Clustering algorithm used in this work is the Normalized Spectral Clustering Algorithm introduced by Ng, et al.[11].
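Both algorithms can be sketched compactly with NumPy. The code below is a minimal illustration, not the authors' implementation: it assumes Lloyd's iteration for K-means, a Gaussian affinity with a hypothetical `sigma` parameter for the normalized spectral embedding, and repeated random initialisations from which the run with the lowest within-cluster sum of squares (WCSS) is kept. All function names and the toy data are introduced here.

```python
import numpy as np


def kmeans(X, k, n_iter=100, seed=0):
    """Lloyd's algorithm. Returns (labels, centroids, wcss)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        dist = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        # Recompute each centroid; an empty cluster keeps its old centroid.
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    dist = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = dist.argmin(axis=1)
    wcss = float(dist[np.arange(len(X)), labels].sum())
    return labels, centroids, wcss


def best_kmeans(X, k, runs=10):
    """Run K-means with several seeds and keep the run with the lowest WCSS."""
    return min((kmeans(X, k, seed=s) for s in range(runs)), key=lambda r: r[2])


def spectral_embedding(X, k, sigma=1.0):
    """Row-normalized top-k eigenvectors of D^(-1/2) A D^(-1/2)."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    A = np.exp(-sq / (2.0 * sigma ** 2))      # Gaussian affinity (assumption)
    np.fill_diagonal(A, 0.0)
    d = A.sum(axis=1)
    L = A / np.sqrt(np.outer(d, d))           # normalized affinity matrix
    _, vecs = np.linalg.eigh(L)               # eigenvalues in ascending order
    Y = vecs[:, -k:]                          # eigenvectors of the k largest
    return Y / np.linalg.norm(Y, axis=1, keepdims=True)


def spectral_clustering(X, k, sigma=1.0, runs=10):
    """Cluster the rows of the spectral embedding with K-means."""
    labels, _, _ = best_kmeans(spectral_embedding(X, k, sigma), k, runs)
    return labels


# Toy data: two well-separated groups of three points each.
X = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float)
km_labels, km_centroids, km_wcss = best_kmeans(X, 2)
Y = spectral_embedding(X, 2)
sc_labels = spectral_clustering(X, 2)
print(km_labels, km_wcss)
print(sc_labels)
```

New observations can be classified into the resulting behaviour models by assigning them to the nearest centroid, which is the same distance computation used inside `kmeans`.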

5 Extracting the Behavioural Models

This section shows how the above techniques have been applied, together with the analysis of the resulting clusters. Through the clustering analysis, several behaviour patterns for players and teams have been obtained. First, a model which defines the behaviours of players is created. Then, it is used to build a global behavioural model for the teams, which is also compared with the global team features.

5.1 Preprocessing and Data Normalization

This first phase prepares the players and teams datasets for the clustering analysis. The methods described in Section 4 have been applied to the dataset extracted from the FIFA World Cup, described in Section 3. For the players dataset, the analysis is divided according to the classical soccer player positions: defender, goalkeeper, midfielder, and forwarder. The preprocessing is divided into two principal steps.

The first step is the study of the available variables through histograms and correlation diagrams, which were used for dimension reduction. The information provided by this step shows which variables are useless because, for example, they are constant or have a high correlation with other variables (more than 0.8, considering correlation values in the range [0, 1]). If such variables are not eliminated, they may distort the clustering results with redundant information. When two variables have a high correlation, the variable considered less representative for the human interpretation of the problem has been removed. Appendix A shows a table with the variables that have been selected for the clustering phase.

The second preprocessing step consists of the normalization of the variables. First, the attributes with outliers are re-centred. Then, the same range is applied to all of them. As described in Section 4.1, we combine Z-score, to re-centre the distribution and avoid outliers, and Min-Max, to fix the range of all the values between 0 and 1. Table 10 (see Appendix) summarizes the variables which have been removed because they take a constant value and those which have been omitted because they take a single value except for outliers.
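The first preprocessing step (dropping constant variables and one variable of each highly correlated pair) can be sketched as follows. In the paper the choice of which correlated variable to keep is made by human judgement; the `priority` list below, ordered from most to least interpretable, is a hypothetical stand-in for that judgement, and the column names and values are invented.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def select_features(columns, priority, threshold=0.8):
    """Keep non-constant variables whose absolute correlation with every
    previously kept variable does not exceed `threshold`.

    columns  -- dict: variable name -> list of values (one per player)
    priority -- variable names, most interpretable first (analyst's choice)
    """
    kept = []
    for name in priority:
        values = columns[name]
        if max(values) == min(values):        # constant: carries no information
            continue
        if any(abs(pearson(values, columns[k])) > threshold for k in kept):
            continue                          # redundant with a kept variable
        kept.append(name)
    return kept


# Invented values for four players.
columns = {
    "passes": [10, 20, 30, 40],
    "passes_completed": [8, 17, 25, 33],
    "red_cards": [0, 0, 0, 0],
    "clearances": [5, 1, 4, 2],
}
priority = ["passes", "passes_completed", "red_cards", "clearances"]
selected = select_features(columns, priority)
print(selected)   # ['passes', 'clearances']
```

Here `passes_completed` is dropped because it is almost perfectly correlated with `passes`, and `red_cards` is dropped because it is constant.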


5.2 Behavioural Models

After preprocessing the data, the clustering techniques described in the previous section are applied and a deep analysis of the clusters found is performed. This section shows how the different models for players and teams have been found through this analysis. As explained in Section 4, the K-means and Spectral Clustering algorithms need a value of K that is fixed before the analysis. We have chosen 4 values for K: 2, 3, 4, and 5, for both clustering methods. A higher number of centroids does not improve the within-cluster sum of squares (WCSS) significantly. 10-fold cross-validation[10] has been used to choose the best value of K for the algorithms. The algorithms have been run 100 times and the clusters which reduced the WCSS have been selected for the analysis.

5.2.1 Types of players

From the analysis of K-means and Spectral Clustering (SC) we are able to define different behavioural models for the players (depending on their classes). These models are defined through the cluster centroids. The clustering analysis has considered the distance between cluster centroids and the information provided by the most representative variables of these centroids. The clusters of K-means and Spectral Clustering which provided the best centroids have been analysed, and these centroids are used to define the players' behaviour. The analysis is divided according to the classical soccer player positions: defender, goalkeeper, midfielder, and forwarder. The results obtained for each position are the following:

1) Defenders

The defenders analysis has shown that the clustering methods which provided the most representative information are 3-means and 3-SC. Comparing the clusters generated by these methods and giving them a human interpretation, a correspondence between the clusters found can be observed. Three kinds of defender behaviour can be extracted from these similarities, see Figure 1. These global behaviours can be described as follows (see Table 1):

i) Strong Defender: The first and fourth charts in Figure 1 show that this kind of player takes high values of clearances, fouls committed, and passes. These features, aimed at preventing goals, are usual in defenders. Besides, both clusters take low values of the shots on target and solo runs variables, which are common features of midfielders. Hence, these clusters may be considered a strong defender pattern, with pure defender features.

ii) Medium Defender: The second and fifth charts also show high values of the defender features mentioned above. Moreover, these players are characterised by the highest values of shots on target and solo runs, which are features of midfielders. This means that the members of these clusters have features mixed between defender and midfielder behaviours.

iii) Weak Defender: Finally, the third and sixth charts of Figure 1 show that the most representative defender variables (clearances, fouls committed, and passes) take the lowest values. Therefore, this kind of defender is considered the pattern with the worst defender features.

Table 1 Summary of the most relevant attributes of the defender behaviours
Category   Behavioural Pattern   Features                               Value    Algorithm
Defender   Strong                clearances, fouls committed, passes    high     Both
                                 goals, shots on target, solo runs      low
           Medium                clearances, fouls committed, passes    high     Both
                                 shots on target, solo runs             high
           Weak                  clearances, fouls committed, passes    lowest   Both

Figure 1 Average values of the main variables measured for the K-means and SC defender clusters obtained. Panels: Strong.KM, Medium.KM, Weak.KM, Strong.SC, Medium.SC, Weak.SC. Variables: Goals, Shots on Target, Solo runs, Clearances, Fouls committed, Fouls suffered, Passes, Distance Covered in possession. Vertical axis: 0.0 to 1.0.

2) Forwarders

Regarding the forwarder analysis, the clustering methods which provided the most representative information are 3-means and 4-SC. Comparing the clusters obtained by each method, as in the analysis of the defenders, there is a correspondence between the clusters obtained by both methods (Figure 2). However, the SC algorithm finds one more cluster. The study of these results shows that the cluster whose variables take medium values in the K-means results is divided into two clusters by the SC method. The patterns observed are the following (see Table 2):

i) Strong Forwarder: The first and fourth charts (Figure 2) show that the forwarders of these clusters have the highest values of shots on target. This variable is the main feature of these players because they are the players who score goals. Another important feature for a forwarder is the number of fouls suffered, which also has a high value in both methods. Hence, the players belonging to these clusters may be considered a strong forwarder pattern.

ii) Medium Forwarder: Analysing the second, fifth, and sixth charts (Figure 2), it can be observed that the values of shots on target and fouls suffered decrease to intermediate values. As mentioned above, the SC clustering method divides this kind of player into two different clusters:

(1) Midfielder-Forwarder (SC): These forwarders (see the fifth chart of Figure 2) have better pass statistics. Passing is a main feature of midfielders, so this behavioural pattern can be explained as a combination of both profiles.

(2) Medium-Forwarder (SC): The sixth chart of Figure 2 shows that the main feature of a forwarder (shots on target) decreases, but the other features are closer to the values reached by the strong forwarder pattern. So this behavioural pattern seems similar to the medium forwarder pattern found by K-means.

iii) Weak Forwarder: In the last two charts (the third and seventh of Figure 2), the opposite situation to the strong forwarder clusters is found. The minimum values of shots on target and fouls suffered are reached. The number of passes is close to that of the other clusters described above, although for this pattern the value decreases. Hence, these two clusters have been labelled as the forwarder pattern with the worst features for this type of player.

Figure 2 Average values of the main variables measured for the K-means and SC forwarder clusters found. Panels: Strong.KM, Medium.KM, Weak.KM, Strong.SC, Midfielder.SC, Medium.SC, Weak.SC. Variables: Goals, Shots on Target, Solo runs, Clearances, Fouls committed, Fouls suffered, Passes, Distance Covered in possession. Vertical axis: 0.0 to 1.0.

Table 2 Summary of the most relevant attributes of the Forwarder behaviours
Category    Behavioural Pattern    Features                                  Value    Algorithm
Forwarder   Strong                 shots on target, fouls suffered           high     Both
            Medium                 shots on target, fouls suffered           medium   K-means
            Midfielder-Forwarder   shots on target, fouls suffered, passes   medium   SC
                                   passes                                    high
            Medium-Forwarder       fouls suffered                            high     SC
                                   shots on target                           medium
            Weak                   shots on target, fouls suffered           low      Both

3) Midfielders

The midfielder analysis has shown that the clustering methods which provided the most representative information are 3-means and 4-SC. As in the analysis performed for the forwarders, there is one more cluster found by the SC clustering method (Figure 3). The behaviours found are (see Table 3):

i) Strong Midfielder: The first and fourth charts in Figure 3 show that these midfielders have the highest values of passes, distance covered, and fouls committed. These three features are usual in the players who are responsible for building the moves. Therefore, this pattern is considered a strong midfielder pattern, which contains the most relevant midfielder characteristics.

Figure 3 Average values of the main variables measured for the K-means and SC midfielder clusters found. Panels: Strong.KM, Medium.KM, Weak.KM, Strong.SC, Defensive.SC, Attacker.SC, Weak.SC. Variables: Goals, Shots on Target, Solo runs, Clearances, Fouls committed, Fouls suffered, Passes, Distance Covered in possession. Vertical axis: 0.0 to 1.0.

ii) Medium Midfielder: The study of the second, fifth, and sixth charts of Figure 3 shows that the values of the basic midfielder features (passes, distance covered, and fouls committed) decrease to intermediate values. The SC clustering method divides this kind of player into two different clusters:

(1) Defensive-Midfielder (SC): The fifth chart shows that the number of shots on target is close to 0 for this kind of player. On the other hand, the number of fouls committed remains high, which is one of the main features of defenders. Therefore, this behavioural pattern can be seen as a mixture of both types of player.

(2) Attacker-Midfielder (SC): In this second pattern (see the sixth chart of Figure 3), the shots on target are maintained while the fouls committed decrease. Therefore, this behavioural pattern seems to be a combination of the midfielder and forwarder profiles.

iii) Weak Midfielder: Finally, the third and seventh charts show that the basic midfielder features (passes, distance covered, and fouls committed) take the lowest values. Hence, this pattern can be considered the behaviour with the worst midfielder characteristics.

Table 3 Summary of the most relevant attributes of the Midfielder behaviours
Category     Behavioural Pattern    Features                                    Value     Algorithm
Midfielder   Strong                 passes, distance covered, fouls committed   highest   Both
             Medium                 passes, distance covered, fouls committed   high      K-means
             Defensive-Midfielder   passes, distance covered, fouls committed   high      SC
                                    shots on target                             lowest
             Attacker-Midfielder    passes, distance covered                    high      SC
                                    shots on target                             medium
                                    fouls committed                             medium
             Weak                   passes, distance covered, fouls committed   lowest    Both


4) Goalkeepers

The goalkeeper data analysis shows that the clustering methods which provided the most representative information are 2-means and 4-SC. This is a special category which follows different game rules. Therefore, the main variables taken into account are different (see Figure 4). A correspondence can be observed between two of the clusters obtained by the two clustering methods. However, the SC method obtained two extra clusters with intermediate values. For these players, the global behavioural patterns found are (see Table 4):

i) Strong Goalkeeper: The first and third charts of Figure 4 show that the players who belong to these clusters have the highest values of recovered balls, clearances, and passes. Besides, this kind of goalkeeper has the lowest values of lost balls and saves. Therefore, these clusters are considered the strong goalkeeper pattern.

ii) Medium Goalkeeper (SC): Only the SC method distinguishes two clusters which take intermediate values in the main goalkeeper variables:

(1) Medium-Strong-Goalkeeper (SC): The fourth chart of Figure 4 shows that the value of passes decreases and the value of lost balls increases. However, this cluster still has high values of recovered balls and clearances. So this behavioural pattern is a combination of the strong and medium goalkeeper.

(2) Medium-Weak-Goalkeeper (SC): The fifth chart shows that two of the main goalkeeper features decrease: recovered balls and clearances. The other two attributes, passes and lost balls, remain close to the values reached in the previous case. This behavioural pattern has characteristics similar to the medium and weak goalkeeper.

iii) Weak Goalkeeper: The second and sixth charts of Figure 4 show the opposite situation to the strong goalkeeper pattern. The lowest values of recovered balls, clearances, and passes are reached. Moreover, the number of lost balls is higher, so these two clusters have been labelled as weak goalkeepers.

Figure 4 Average values of the main variables measured for the K-means and SC goalkeeper clusters found. Panels: Strong.KM, Weak.KM, Strong.SC, Medium.Strong.SC, Medium.Weak.SC, Weak.SC. Variables: Clearances, Passes, Recovered balls, Lost balls, Saves. Vertical axis: 0.0 to 1.0.

Table 4 Summary of the most relevant attributes of the Goalkeeper behaviours
Category     Behavioural Pattern   Features                                   Value     Algorithm
Goalkeeper   Strong                recovered balls, clearances, passes        highest   Both
                                   lost balls, saves                          lowest
             Medium-Strong         lost balls, recovered balls, clearances    high      SC
                                   passes                                     low
             Medium-Weak           lost balls                                 high      SC
                                   recovered balls, clearances, passes        low
             Weak                  lost balls                                 highest   Both
                                   recovered balls, clearances, passes        lowest

5.2.2 Types of teams

The teams analysis shows that the two clustering configurations which provided the most representative information are 4-means and 3-SC. Figure 5 shows the values obtained for the main variables measured in each cluster. If the results of both clustering methods are compared, a correspondence between the clusters found can be observed. This confirms the existence of behavioural patterns among the teams (see Table 5):

Defensive Team: The first and fifth charts in Figure 5 show that this kind of team takes high values of clearances and fouls committed. Furthermore, these teams have the lowest values of shots and goals. These characteristics are common in defenders, so these clusters can be considered representative of teams which have strong defender features.

Figure 5 Average values of the main variables measured for the K-means and SC team clusters obtained. Panels (K-means): Defensive.KM, Defensive.Mid.KM, Attacker.Mid.KM, Attacker.KM. Panels (SC): Defensive.SC, Attacker.Mid.SC, Attacker.SC. Variables: Goals, Shots on goal from penalty area, Solo runs, Clearances, Fouls committed, Fouls suffered, Passes, Distance Covered in possession. Vertical axis: 0.0 to 1.0.

Defensive-Midfielder Team (K-means) This is a particular type of team which is only identified by the K-means algorithm. The second chart of Figure 5 shows values close to those obtained previously for the defender features (clearances and fouls committed). Moreover, the number of shots and goals increases, which are common attributes in attacker players. It means that the teams which belong to this cluster have features mixed between defender and midfielder behaviours.
Attacker-Midfielder Team The third and sixth charts show that this pattern is close to that of midfielder players. The features which take the highest values are distance covered, passes, and fouls committed. The forwarder player attributes are also highly represented. Therefore, this behavioural pattern seems to be a combination of the midfielder and forwarder player profiles.
Attacker Team Finally, the fourth and seventh charts show that the most representative values are those of the forwarder attributes (shots, goals, and fouls suffered). On the other hand, the defender and midfielder features have low values, so this kind of team has the strongest attack characteristics.

Table 5 Summary of the team behaviours

Category  Behavioural Pattern   Features                                   Value    Algorithm
Teams     Defensive             clearances, fouls committed                highest  Both
                                shots, goals                               lowest
          Defensive-Midfielder  clearances, fouls committed, shots, goals  high     K-means
          Attacker-Midfielder   distance covered, passes, fouls committed  highest  Both
                                shots, goals                               high
          Attacker              shots, goals, fouls suffered               high     Both
                                clearances, fouls committed, passes        low
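The team patterns above are read from the cluster averages produced by K-means and spectral clustering. As an illustration of the first of these methods only, the following minimal sketch runs plain K-means (Lloyd's algorithm) on a handful of normalised team vectors. The data, the restriction to two variables (clearances and shots), the value of k, and the seeding with the first k points are all assumptions made for the example; they do not reproduce the configuration used in this study.

```python
import math


def kmeans(points, k, iterations=100):
    """Plain Lloyd's algorithm. The first k points seed the centroids (an assumption)."""
    centroids = [list(p) for p in points[:k]]
    assignment = None
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid (Euclidean distance).
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids


# Hypothetical teams described by (clearances, shots), both scaled to [0, 1].
TEAMS = [(0.82, 0.06), (0.10, 0.65), (0.78, 0.10), (0.15, 0.60), (0.80, 0.08), (0.12, 0.70)]
ASSIGNMENT, CENTROIDS = kmeans(TEAMS, k=2)
```

In this toy data the cluster with high clearances and few shots would be read as a Defensive pattern and the other as an Attacker pattern, in the same way as the charts of Figure 5 are interpreted.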

6 Finding the Most Successful Strategies
In this section, the behavioural analysis models are focused on the four semi-finalist teams of the competition: Spain, Netherlands, Germany, and Uruguay. First, the models defined in the previous section, based on the player and team behaviours, are used to analyse the strategy of the goalkeepers, defenders, forwarders, and midfielders of these teams. Next, the influence of their most relevant players on the global strategy of the teams is analysed. Finally, the analysis focuses on a comparison of the strategies, searching for the most profitable one.

6.1 The Player Strategies
The analysis of the teams is based on the clustering results shown in Tables 7 (K-means) and 8 (Spectral Clustering). The global team behaviour is given in Table 6, and the remaining columns give the distribution of the players by position: defender, forwarder, midfielder, and goalkeeper. The first row of each team shows the percentage of players who have played in that position during the whole competition. The second row shows, for each position, the percentage of players which belong to each cluster.

Table 6 Results from the K-means and SC Algorithm applied to the Global Team Statistics

Team Global Results
Team         K-means  SC
Spain        Att-Mid  Att-Mid
Netherlands  Att-Mid  Att-Mid
Germany      Att-Mid  Att-Mid
Uruguay      Att-Mid  Attacker

Table 7 Results from the K-means Algorithm

K-means Results
Team         Defender               Forwarder              Midfielder             Goalkeeper
             Strong  Medium  Weak   Strong  Medium  Weak   Strong  Medium  Weak   Strong  Weak
Spain        30%                    20%                    45%                    5%
             33.3%   33.3%   33.3%  50%     25%     25%    44.4%   0%      55.6%  100%    0%
Netherlands  36.8%                  26.3%                  31.6%                  5.3%
             14.3%   28.6%   57.1%  20%     40%     40%    33.3%   0%      66.7%  100%    0%
Germany      36.4%                  22.7%                  31.8%                  9.1%
             12.5%   25%     62.5%  20%     0%      80%    28.6%   14.3%   57.1%  50%     50%
Uruguay      28.6%                  23.8%                  42.9%                  4.7%
             16.7%   33.3%   50%    40%     20%     40%    0%      0%      100%   0%      100%

Table 8 Results from the Spectral Clustering Algorithm

Spectral Clustering Results
Team      Defender               Forwarder
          Strong  Medium  Weak   Strong  Mid-For  Medium  Weak
Spain     30%                    20%
          0%      33.3%   66.7%  25%     25%      25%     25%
Nether.   36.8%                  26.3%
          14.3%   0%      85.7%  20%     20%      60%     0%
Germany   36.4%                  22.7%
          0%      12.5%   87.5%  20%     0%       20%     60%
Uruguay   28.6%                  23.8%
          16.7%   33.3%   50%    20%     0%       40%     40%

Spectral Clustering Results
Team      Midfielder                           Goalkeeper
          Defensive  Strong  Weak   Attacker   Strong  Med-Str  Med-Weak  Weak
Spain     45%                                  5%
          22.2%      55.6%   11.1%  11.1%      100%    0%       0%        0%
Nether.   31.6%                                5.3%
          16.7%      33.3%   33.3%  16.7%      0%      100%     0%        0%
Germany   31.8%                                9.1%
          0%         42.8%   28.6%  28.6%      50%     50%      0%        0%
Uruguay   42.9%                                4.7%
          22.2%      33.3%   22.2%  22.2%      0%      0%       0%        100%
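The two rows given for each team in Tables 7 and 8 are simple shares: the share of the squad that played in each position, and, within a position, the share of players assigned to each cluster. The following minimal sketch shows one way to compute them. The per-player list is an assumption: the paper reports only percentages, and the counts below (20 players) are merely one set of counts chosen to be consistent with the K-means percentages reported for Spain. The function names are hypothetical.

```python
from collections import Counter

# Assumed (position, K-means label) pairs; counts are illustrative, not published data.
SQUAD = (
    [("Defender", "Strong")] * 2 + [("Defender", "Medium")] * 2 + [("Defender", "Weak")] * 2
    + [("Forwarder", "Strong")] * 2 + [("Forwarder", "Medium")] + [("Forwarder", "Weak")]
    + [("Midfielder", "Strong")] * 4 + [("Midfielder", "Weak")] * 5
    + [("Goalkeeper", "Strong")]
)


def position_shares(players):
    """Percentage of the squad that played in each position (first row of a team)."""
    counts = Counter(position for position, _ in players)
    return {pos: round(100 * n / len(players), 1) for pos, n in counts.items()}


def cluster_shares(players, position):
    """Percentage of the players of one position in each cluster (second row)."""
    labels = [label for pos, label in players if pos == position]
    if not labels:
        return {}  # no player in that position
    counts = Counter(labels)
    return {label: round(100 * n / len(labels), 1) for label, n in counts.items()}


POSITIONS = position_shares(SQUAD)
MIDFIELD = cluster_shares(SQUAD, "Midfielder")
```

Because each cluster share is computed within a position, a position with very few players (such as goalkeepers) can only take a few distinct percentages, which is worth keeping in mind when comparing columns.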

The global information shows that the four teams have an “Attacker-Midfielder” behaviour according to K-means (Table 6, second column) and that Uruguay is the only team which has an “Attacker” behaviour according to the spectral clustering results (Table 6, third column). This information shows that these teams are cooperative teams (high number of passes, crosses, assists, and deliveries, see Section 5). The K-means algorithm is sensitive to classes which are not well balanced. In this case, as Tables 7 and 8 show, the percentage of midfielders and forwarders is higher than the percentage of defenders and goalkeepers. This may be the reason why the global information of the teams is biased towards the attacker-midfielder behaviour.


Spectral clustering is less sensitive to these problems. This global information is not enough to define the team strategies, so the analysis continues with the different player positions, using the information provided by the model defined in Section 5.2.
The information about the defenders shows that, according to the K-means model, Spain has a balanced set of defenders (see the Defender columns of Table 7). It is remarkable that Piqué and Puyol have the same behaviour (“Medium”) in both K-means and Spectral Clustering (they account for the 33.3% of “Medium” defenders in both). These two players play for Barcelona Football Club, which is famous for its cooperative strategy, and this may explain why they play similarly. The “Weak” behaviour is dominant among the defenders of Netherlands (57.1% in K-means and 85.7% in SC), Germany (62.5% in K-means and 87.5% in SC) and Uruguay (50% in both). There are fewer defenders with “Strong” behaviour (the highest share is that of the Spanish defenders, 33.3%, as can be seen in Table 7). Spectral clustering (Table 8) also shows that the defenders are concentrated in the “Weak” behaviour rather than in “Strong”.
The Midfielder columns of Table 7 show that few teams have midfielders with “Medium” behaviour (only Germany, with 14.3%). The “Weak” behaviour contains all the midfielders of Uruguay (100%) and the highest percentage of the midfielders of the other teams. The spectral clustering results (Table 8) change the Uruguay distribution completely: in this case the midfielders are almost evenly distributed between the classes. The cause is that the most relevant variables in the clusters extracted from spectral clustering are the shots and goals statistics, while K-means is focused on tackles, recovered balls and lost balls. Spain and Germany have midfielders with “Strong” behaviour (55.6% and 42.8%, respectively). In the SC results, the players of the original K-means “Weak” behaviour are distributed between “Defensive” and “Attacker”.
This set of players is difficult to classify through this model.
The Forwarder columns of Table 7 show that all the teams have their forwarders distributed through the three classes except Germany, whose forwarders mostly have a “Weak” behaviour (80%). Spectral clustering (Table 8) shows that Spain has its forwarders balanced, while Netherlands and Germany concentrate most of their forwarders in a single class: “Medium” (60%) and “Weak” (60%), respectively.
The goalkeeper information extracted from Table 7 shows that Spain and Netherlands have a similar goalkeeper behaviour (“Strong”), Germany has two types of goalkeepers and Uruguay has a “Weak” goalkeeper. Table 8 shows that these teams have no goalkeepers with “Medium-Weak” behaviour. The Spanish goalkeeper, Iker Casillas, has “Strong” behaviour, and the Uruguayan goalkeeper has “Weak” behaviour. Finally, the two German goalkeepers belong to the “Strong” and “Medium-Strong” classes, and the Netherlands goalkeeper is “Medium-Strong”. This analysis of the global team behaviour and of the behaviour in the different player positions gives a perspective of the team strategies, which are analysed in the following section.

6.2 The Influence of the Players in their Team Strategy
The global information of the teams provided by the model (shown in Table 6) is not clear enough to discriminate the strategies. Tables 7 and 8 also give the percentage of the total players (first row of each team) who have played in the competition for each of the most successful teams, divided by position. It shows that Spain and Uruguay have a high percentage of midfielders and Germany a high percentage of defenders. Netherlands is balanced. With the information analysed before, the strategies of these teams can be defined:
Spain The strategy is concentrated in the behaviour of defenders and midfielders (75% of the players, as can be seen in the first row of Spain in Tables 7 and 8).
A high percentage of the midfielders are “Strong” (44.4% in K-means and 55.6% in SC) or “Defensive” (22.2%), while the defenders are more distributed, although the main defenders are “Medium” defenders (33.3% in both). The forwarders, which are balanced, include 25% with Midfielder-Forwarder behaviour. The goalkeeper behaviour is “Strong”. Therefore, following our extracted models, the Spanish team has a Medium/Strong Defensive-Midfielder strategy.
Netherlands Its strategy is balanced. Netherlands has a high percentage of “Weak” defenders (57.1% in K-means and 85.7% in SC) and midfielders (66.7% in K-means and 33.3% in SC). However, it has a high percentage of “Medium” forwarders (40% in Table 7 and 60% in Table 8) and a “Medium-Strong” goalkeeper. Its strategy is a Weak-Defensive Medium-Attacker strategy.
Germany Germany is also balanced. It has a high percentage of “Weak” defenders (62.5% in K-means and 87.5% in SC) and forwarders (80% in K-means and 60% in SC), but also a high percentage of “Strong”/“Medium” and “Attacker” midfielders (42.9% in K-means and 71.4% in SC). It has a “Strong” and a “Medium-Strong” goalkeeper. Its strategy is Weak-Defensive Medium/Strong-Midfielder and Weak-Attacker.
Uruguay Its strategy is concentrated in the behaviour of defenders and midfielders (71.5% of the players). It has a high percentage of “Weak” defenders (50% in both), midfielders (100% in K-means, but 22.2% in SC) and goalkeeper. Its strategy is Weak/Medium Defensive-Midfielder.

6.3 Strategies Comparison
Using the previous conclusions, this section compares the strategies, looking for the most profitable one. To compare the different strategies it is necessary to analyse the results of the matches played by these teams in the last phase (Table 9).

Table 9

Table with the results of the most successful teams in the last four matches of the competition. The table also compares the strategies followed by these teams

Phase        Team 1   Team 2   Goals 1  Goals 2  Strategy 1                                                 Strategy 2
Semifinal    Spain    Germany  1        0        Medium/Strong Defensive-Midfielder                         Weak-Defensive Medium/Strong-Midfielder and Weak-Attacker
Semifinal    Nether.  Uruguay  3        2        Weak-Defensive Medium-Attacker                             Weak/Medium Defensive-Midfielder
Third place  Germany  Uruguay  3        2        Weak-Defensive Medium/Strong-Midfielder and Weak-Attacker  Weak/Medium Defensive-Midfielder
Final        Spain    Nether.  1        0        Medium/Strong Defensive-Midfielder                         Weak-Defensive Medium-Attacker

Spain-Germany Spain has a Medium/Strong Defensive-Midfielder strategy and Germany a Weak-Attacker strategy (which explains why Spain conceded no goals). Germany also has a Weak-Defensive strategy, which gives an advantage to the Spanish Medium/Strong midfielders.
Netherlands-Uruguay Both teams have a Weak-Defensive strategy (which explains the number of goals scored), but Netherlands has a Medium-Attacker strategy while Uruguay has its strategy focused on Weak/Medium midfielders. This gives an advantage to Netherlands. Also, the Netherlands goalkeeper is “Medium-Strong” while the Uruguayan goalkeeper is “Weak”.
Germany-Uruguay Both teams have a Weak-Defensive strategy (which is probably the reason for the number of goals scored in the match). Germany has a Medium/Strong-Midfielder strategy while Uruguay has a Weak/Medium Midfielder strategy, which gives an advantage to Germany. The main disadvantage of Uruguay is that it has a “Weak” goalkeeper while Germany has a “Strong” goalkeeper.


Spain-Netherlands Spain has a Medium/Strong Defensive-Midfielder strategy and Netherlands a Medium-Attacker strategy (which explains the low number of goals scored during this match). The Weak-Defensive strategy of Netherlands gave an advantage to Spain, which used a Medium/Strong Midfielder strategy. Also, the Spanish goalkeeper is “Strong” while the Netherlands goalkeeper is “Medium-Strong”.
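The goalkeeper comparisons made in the match descriptions above can be written down explicitly. The following minimal sketch encodes the three matches for which a goalkeeper comparison is given, with the scores of Table 9, and checks whether the winner is the team with the higher goalkeeper level. The numeric ordinal scale and the function names are assumptions introduced for the example; with only three matches this is a bookkeeping aid, not statistical evidence.

```python
# Assumed ordinal scale for the goalkeeper patterns of Table 4.
LEVEL = {"Weak": 0, "Medium-Weak": 1, "Medium-Strong": 2, "Strong": 3}

MATCHES = [
    # (team 1, team 2, goals 1, goals 2, goalkeeper 1, goalkeeper 2)
    ("Netherlands", "Uruguay", 3, 2, "Medium-Strong", "Weak"),
    ("Germany", "Uruguay", 3, 2, "Strong", "Weak"),
    ("Spain", "Netherlands", 1, 0, "Strong", "Medium-Strong"),
]


def winner(match):
    """Return the winning team name, or None for a draw."""
    team1, team2, goals1, goals2, _, _ = match
    if goals1 == goals2:
        return None
    return team1 if goals1 > goals2 else team2


def stronger_goalkeeper_won(match):
    """True when the winner is also the team with the higher goalkeeper level."""
    team1, team2, _, _, keeper1, keeper2 = match
    if LEVEL[keeper1] == LEVEL[keeper2]:
        return False  # no goalkeeper advantage to speak of
    favoured = team1 if LEVEL[keeper1] > LEVEL[keeper2] else team2
    return winner(match) == favoured


RESULTS = [stronger_goalkeeper_won(m) for m in MATCHES]
```

The same structure could hold the defensive, midfield and attack levels of each strategy, so that every argument of this section is checked against the match results in one place.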

7 Conclusions and Future Work
This work has analysed the behaviour of the teams and players in the 2010 FIFA World Cup. It has focused on the player and team statistics, which have been used to extract the strategy followed by the teams in relation to the player behaviours. To achieve these goals, clustering techniques, including K-means and Spectral Clustering, have been applied to the FIFA global statistics dataset. The clustering methods used have been able to provide the behaviour of the teams and players individually, and have given a perspective from which both analyses can be combined to understand the global team strategy. With this information, the most successful team strategies have been analysed. This comparison has been applied to explain the results of the last four matches of the competition, played by the four semi-finalist teams. This study has found that the behavioural patterns followed by these teams are simple and coherent (see Section 6.3). This information gives a straightforward human interpretation of the most successful strategies. Finally, this knowledge is especially useful in several cases:
1) A coach can orient the global strategy of a team if he has an intuition of the strategy followed by the rival.
2) It can be used to define strategies of teams from soccer or robosoccer leagues.
3) In sport betting, it can approximate the team which follows the most profitable strategy.
Future work will be focused on the complex network defined by the player lineups in the matches and its evolution. The interest of this approach is to take advantage of the network evolution and to look for the best lineup to defeat several opponent strategies.

References
[1] Dobson S and Goddard J A, The Economics of Football, Cambridge University Press, Cambridge, 2011.
[2] Aler R, Valls J M, Camacho D, and Lopez A, Programming robosoccer agents by modeling human behavior, Expert Systems with Applications, 2009, 36: 1850–1859.
[3] Grollman D H and Jenkins O C, Learning robot soccer skills from demonstration, International Conference on Development and Learning, 2007, 276–281.
[4] Jiménez-Díaz G, Menéndez H D, Camacho D, and González-Calero P A, Predicting performance in team games, INSTICC (Institute for Systems and Technologies of Information, Control and Communication), editor, ICAART 2011: Proceedings of the 3rd International Conference on Agents and Artificial Intelligence, 2011.
[5] Leng J S, Fyfe C, and Jain L, Reinforcement learning of competitive skills with soccer agents, Knowledge-Based Intelligent Information and Engineering Systems, Springer, 2010, LNCS 4692: 572–579.
[6] Vaz de Melo P O S, Almeida V A F, and Loureiro A A F, Can complex network metrics predict the behavior of NBA teams? Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, USA, ACM, KDD ’08, 2008.
[7] Onody R N and De Castro P A, Complex network study of Brazilian soccer players, Phys. Rev. E, 2004, 70: 037103.


[8] Bittner E, Nußbaumer A, Janke W, and Weigel M, Self-affirmation model for football goal distributions, EPL (Europhysics Letters), 2007, 78(5): 58002.
[9] Cotta C, Mora A M, Merelo-Molina C, and Guervós J J M, FIFA World Cup 2010: A network analysis of the champion team play, CoRR, abs/1108.0261, 2011.
[10] Larose D T, Discovering Knowledge in Data, John Wiley & Sons, New Jersey, 2005.
[11] Ng A, Jordan M, and Weiss Y, On Spectral Clustering: Analysis and an algorithm (ed. by Dietterich T, Becker S, and Ghahramani Z), Advances in Neural Information Processing Systems, MIT Press, 2001, 849–856.
[12] Kohavi R and John G H, Wrappers for feature subset selection, Artif. Intell., 1997, 97: 273–324.
[13] FIFA web site, 2011. http://www.fifa.com/worldcup/archive/southafrica2010/statistics/index.html.
[14] Delac K, Grgic M, and Grgic S, Independent comparative study of PCA, ICA, and LDA on the FERET data set, International Journal of Imaging Systems and Technology, 2005, 15(5): 252–260.
[15] Carroll S R and Carroll D J, Statistics Made Simple for School Leaders, Rowman & Littlefield, 2002.
[16] Han J W and Kamber M, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2006.
[17] MacKay D, Information Theory, Inference and Learning Algorithms, Cambridge University Press, Cambridge, 2003.

Appendix: Variables Reduction
The following table shows the variables which have been used for the analysis carried out in this work. The variables marked with ‘x’ are those which have been used to normalize the data. The variables marked with ‘o’ have been used to feed the clustering algorithms. The variables marked with ‘ ’ are reduced. Finally, the variables marked with ‘-’ are not used. The columns are divided into:
Categories: general, attack, defence, · · ·, explained in Section 3.
Type of variable: [pl] players, [tm] teams, and [bt] when a variable applies to both players and teams.
Name of the variable (“rates” are variables which combine other attributes to reduce the dimension; for example, “clearances completion rate” is defined as “clearances complete”/“total clearances”).
Kind of players and teams: [Df] Defender, [Fw] Forwarder, [Mf] Midfielder, [Gk] Goalkeeper, [Tm] Team.
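The rate variables and the normalization described above can be sketched as follows. This is a minimal illustration under stated assumptions: the appendix defines a completion rate as a ratio of two counts, but the exact normalization formula is not given here, so dividing by the number of matches played is an assumption, as are the function names and the handling of a zero denominator.

```python
def completion_rate(completed, total):
    """Rate variable, e.g. clearances complete / total clearances.

    Returning 0.0 when there were no attempts is an assumption of this sketch.
    """
    return completed / total if total else 0.0


def per_match(value, matches):
    """Normalise a raw count by the number of matches played (assumed scheme)."""
    if matches <= 0:
        raise ValueError("matches must be positive")
    return value / matches


# Hypothetical player: 12 of 16 clearances completed, 21 recovered balls in 7 matches.
CLEARANCE_RATE = completion_rate(12, 16)
RECOVERED_PER_MATCH = per_match(21, 7)
```

Replacing a pair of counts by a single rate is what reduces the number of variables, and normalising by playing time keeps players with many matches from dominating the clusters through sheer volume.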

Table 10 Variables selected from each kind of player reduced from 75 variables to 62 (when the rates are considered). These subsets of variables will later feed the clustering algorithms

Categories    type  Variable                                          Marks for Df, Fw, Mf, Gk, Tm
General       pl    players
              bt    team
              pl    minutes played                                    x x x x x
              bt    matches                                           x x x x x
Attack        bt    deliveries in penalty area                        o o o o
              bt    solo runs                                         o o
              bt    tackles suffered losing possession rate           o o o o
              pl    lost balls                                        o o o o
              bt    offsides                                          o o o o
              bt    assists                                           o o o o
              tm    attacks from left rate                            o
              tm    attacks from center rate                          o
              tm    attacks from right rate                           o
Defence       bt    tackles made gaining possession rate              o o o o o
              bt    saves                                             o o
              bt    clearances completion rate                        o o o o
              bt    recovered balls                                   o o o o o
Disciplinary  pl    yellow                                            o o o
              bt    second yellow card and red card                   o
              bt    red cards                                         o
              bt    fouls committed                                   o o o o
              bt    fouls suffered                                    o o o o o
              bt    handballs                                         o o o o
Shots         bt    shots on goal from penalty area                   o o o
              bt    shots on goal from outside penalty area           o o o
              bt    shots wide from penalty area                      o
              bt    shots wide from outside penalty area              o o o o
              bt    shots wide                                        o o o
              bt    blocked shots from inside penalty area            o o o o
              bt    blocked shots from outside penalty area
              bt    shots blocked                                     o o o o
              bt    shots in penalty area                             o
              bt    shots on target                                   o
              bt    free kicks shots                                  o o
              bt    free kicks shots direct                           o o
              bt    free kicks shots indirect                         o
              bt    total shots                                       o o
              bt    goal shot rate                                    o o
              bt    shots on bar                                      o o
              bt    shots on post                                     o
Goals         bt    goals scored in penalty area rate                 o o
              bt    goals scored from outside penalty area rate       o o
              bt    own goals                                         o
              bt    penalty goals                                     o o
              tm    own goals for                                     o
              tm    open play goals rate                              o
              tm    goals conceded in penalty area rate               o
              tm    goals conceded from outside penalty area rate
              tm    set piece goals rate                              o
Passes        bt    short passes completion rate                      o o o o
              bt    medium passes completion rate                     o o o
              bt    long passes completion rate                       o o o o
              bt    passes completion rate                            o o o o
              bt    crosses completion rate                           o o o o
              bt    corners completion rate                           o o o
              bt    throw ins                                         o o o
Distance      bt    distance covered in possession rate               o o o o
              bt    distance covered not in possession rate           o o o o
              bt    top speed km h                                    o o o o o
              pl    low activity time min                             o
              pl    medium activity time min                          o o
              pl    high activity time min                            o -