Introduction

For this assignment, I chose to work on a dataset about football statistics. The data give informations about the twelve teams of the french women first division from 2021 until today.

head(stats)
##                    JC  Age Buts PD PenM PenT CJ CR Titu Min/Match Remp Mn.Remp
## AS Saint-etienne22 24 25.1   14 12    0    0 39  2             83   73      20
## Bordeaux22         27 26.4   35 24    2    3 25  0             81   63      29
## Dijon22            22 25.8   10  9    0    0 20  1             85   53      19
## Fleury22           22 26.6   23 14    2    3 37  1             84   74      17
## Guingamp22         23 23.2   18  9    3    5 31  0             85   61      18
## Issy22             20 25.7   16  8    1    1 19  0             84   64      21
##                    RempNE Clt BE Pts
## AS Saint-etienne22     59  12 46   7
## Bordeaux22             70   5 20  32
## Dijon22                69   8 37  14
## Fleury22               57   4 24  34
## Guingamp22             62   9 52  14
## Issy22                 62  10 43  13
summary(stats)
##        JC             Age             Buts             PD       
##  Min.   :19.00   Min.   :23.00   Min.   :10.00   Min.   : 8.00  
##  1st Qu.:22.00   1st Qu.:25.07   1st Qu.:15.75   1st Qu.:10.00  
##  Median :23.50   Median :25.65   Median :26.00   Median :17.00  
##  Mean   :23.92   Mean   :25.48   Mean   :32.00   Mean   :21.79  
##  3rd Qu.:25.25   3rd Qu.:26.32   3rd Qu.:38.75   3rd Qu.:25.00  
##  Max.   :30.00   Max.   :27.10   Max.   :82.00   Max.   :63.00  
##       PenM            PenT              CJ              CR       
##  Min.   :0.000   Min.   : 0.000   Min.   :15.00   Min.   :0.000  
##  1st Qu.:1.000   1st Qu.: 1.000   1st Qu.:23.75   1st Qu.:0.000  
##  Median :2.000   Median : 3.000   Median :27.00   Median :1.000  
##  Mean   :2.167   Mean   : 3.167   Mean   :29.33   Mean   :1.083  
##  3rd Qu.:3.000   3rd Qu.: 4.250   3rd Qu.:32.50   3rd Qu.:2.000  
##  Max.   :6.000   Max.   :10.000   Max.   :59.00   Max.   :3.000  
##  Titu Min/Match       Remp          Mn.Remp          RempNE     
##  Min.   :78.00   Min.   :53.00   Min.   :16.00   Min.   :45.00  
##  1st Qu.:83.00   1st Qu.:62.00   1st Qu.:18.75   1st Qu.:61.25  
##  Median :84.00   Median :63.50   Median :20.50   Median :75.50  
##  Mean   :83.62   Mean   :65.79   Mean   :21.08   Mean   :73.58  
##  3rd Qu.:85.00   3rd Qu.:69.25   3rd Qu.:22.00   3rd Qu.:88.25  
##  Max.   :86.00   Max.   :80.00   Max.   :31.00   Max.   :95.00  
##       Clt              BE             Pts       
##  Min.   : 1.00   Min.   : 4.00   Min.   : 7.00  
##  1st Qu.: 3.75   1st Qu.:21.50   1st Qu.:14.00  
##  Median : 6.50   Median :35.50   Median :29.50  
##  Mean   : 6.50   Mean   :32.88   Mean   :29.54  
##  3rd Qu.: 9.25   3rd Qu.:43.75   3rd Qu.:38.75  
##  Max.   :12.00   Max.   :77.00   Max.   :62.00

There are 15 variables which are mainly about the team’s performance in matches (goals, assists, yellow and red cards), in the championship (points, classements) and about the coach’s rotation (substitutions, etc).

K means algorithm

I will apply a k-means algorithm on my dataset. I chose to use only 3 clusters to try differentiate top teams to bottom teams while the others should be in the third cluster.

k_means = kmeans(stats, centers = 3)
#k_means
table(k_means$cluster)
## 
## 1 2 3 
## 9 6 9
#k_means$centers

color <- as.numeric(k_means$cluster)

We can see the teams are divided into three groups of 9, 9 and 6 teams in each.

Now, we can plot some results to see how the teams behave according to different variables.

plot(stats$JC~stats$Pts, bg = color, pch = 21, xlab = "Points", ylab = "Number of players over the season",  col = "black")
points(k_means$centers[,1] ~ k_means$centers[,15], col = 1:3, pch = 8, cex = 1)
text(stats$JC~stats$Pts,labels=rownames(stats), col=color, cex = 0.4)

plot(stats$Clt~stats$`Titu Min/Match`, bg = color, pch = 21, xlab = "Min titu/match", ylab = "Classement",  col = "black")
points(k_means$centers[,'Clt'] ~ k_means$centers[,'Titu Min/Match'], col = 1:3, pch = 8, cex = 1)
text(stats$Clt~stats$`Titu Min/Match`,labels=rownames(stats), col=color, cex = 0.5)

plot(stats$Age~stats$JC, bg = color, pch = 21, xlab = "Numbers of players over the season", ylab = "Age",  col = "black")
points(k_means$centers[,'Age'] ~ k_means$centers[,'JC'], col = 1:3, pch = 8, cex = 1)
text(stats$Age~stats$JC,labels=rownames(stats), col=color, cex = 0.5)

First of all, giving the points, we can clearly distinguish the top teams (in green) to the two others groups. Now that we know which are the best teams, we can move on to the next plots. Actually, the first plot is too influenced by the ‘Points’ variable to determine differences in the team’s performance by the number of player who played over the season.

On the second graph, we could say that the less the starters play (variable ‘Min titu/match’), the the best a team is ranked. However, it is strongly influenced by the position of the team Lyon in 2022.

Finally, when we look at the last graph, we can clearly see that the top teams have more experienced players and allow themselves to let more players have a chance on the field.

Data centered

It could be intesting to center and scale our data to see if there are any changs in the results.

stats_center = scale(stats, center = TRUE, scale = TRUE)

k_means2 = kmeans(stats_center, centers = 3)
table(k_means2$cluster)
## 
##  1  2  3 
##  4  8 12
#
variables = stats_center[,c('Age',"JC")]

plot(variables,pch=19,col=4,cex=0.2)
text(variables,labels=rownames(stats_center), col=k_means2$cluster+1, cex=0.7)
means<-k_means2$centers
points(means,col=c(2,3,4),cex=2,pch=4)

Here, the clustering produces different results. Top teams are in blue but it is harder to distinguish the others. In fact, red teams are middle ones so it would need further investigation to understand how they are ranked.

Hierarchical aglomerative clustering algorithm

Now let’s produce a hierarchical clustering. We will use the same number of groups (3) than in K-Means.

D <- dist(stats) ### for the euclidian distance by default

HC1 <- hclust(D, method="complete")
plot(HC1, xlab = "Observations", ylab = "Proximity measure", cex=0.5)
groups <- cutree(HC1, k=3)
rect.hclust(HC1, k=3, border="red")

After few tries, it appears that the “complete” method corresponds best to the data. This clustering gives a good view of the ranking over the two last seasons. We can see the best teams on the left (PSG and Lyon), then bottom teams (the second group) and finally the middle teams.

Finally we can put some colors to our plots :

# Color labels by specifying the number of cluster (k)
dend <- as.dendrogram(HC1)
dend %>% set("labels_col", value = c("green", "red",'blue'), k=3) %>% 
  plot(main = "Color labels \nper cluster")
abline(h = 2, lty = 2)