For this assignment, I chose to work on a dataset about football statistics. The data give informations about the twelve teams of the french women first division from 2021 until today.
head(stats)
## JC Age Buts PD PenM PenT CJ CR Titu Min/Match Remp Mn.Remp
## AS Saint-etienne22 24 25.1 14 12 0 0 39 2 83 73 20
## Bordeaux22 27 26.4 35 24 2 3 25 0 81 63 29
## Dijon22 22 25.8 10 9 0 0 20 1 85 53 19
## Fleury22 22 26.6 23 14 2 3 37 1 84 74 17
## Guingamp22 23 23.2 18 9 3 5 31 0 85 61 18
## Issy22 20 25.7 16 8 1 1 19 0 84 64 21
## RempNE Clt BE Pts
## AS Saint-etienne22 59 12 46 7
## Bordeaux22 70 5 20 32
## Dijon22 69 8 37 14
## Fleury22 57 4 24 34
## Guingamp22 62 9 52 14
## Issy22 62 10 43 13
summary(stats)
## JC Age Buts PD
## Min. :19.00 Min. :23.00 Min. :10.00 Min. : 8.00
## 1st Qu.:22.00 1st Qu.:25.07 1st Qu.:15.75 1st Qu.:10.00
## Median :23.50 Median :25.65 Median :26.00 Median :17.00
## Mean :23.92 Mean :25.48 Mean :32.00 Mean :21.79
## 3rd Qu.:25.25 3rd Qu.:26.32 3rd Qu.:38.75 3rd Qu.:25.00
## Max. :30.00 Max. :27.10 Max. :82.00 Max. :63.00
## PenM PenT CJ CR
## Min. :0.000 Min. : 0.000 Min. :15.00 Min. :0.000
## 1st Qu.:1.000 1st Qu.: 1.000 1st Qu.:23.75 1st Qu.:0.000
## Median :2.000 Median : 3.000 Median :27.00 Median :1.000
## Mean :2.167 Mean : 3.167 Mean :29.33 Mean :1.083
## 3rd Qu.:3.000 3rd Qu.: 4.250 3rd Qu.:32.50 3rd Qu.:2.000
## Max. :6.000 Max. :10.000 Max. :59.00 Max. :3.000
## Titu Min/Match Remp Mn.Remp RempNE
## Min. :78.00 Min. :53.00 Min. :16.00 Min. :45.00
## 1st Qu.:83.00 1st Qu.:62.00 1st Qu.:18.75 1st Qu.:61.25
## Median :84.00 Median :63.50 Median :20.50 Median :75.50
## Mean :83.62 Mean :65.79 Mean :21.08 Mean :73.58
## 3rd Qu.:85.00 3rd Qu.:69.25 3rd Qu.:22.00 3rd Qu.:88.25
## Max. :86.00 Max. :80.00 Max. :31.00 Max. :95.00
## Clt BE Pts
## Min. : 1.00 Min. : 4.00 Min. : 7.00
## 1st Qu.: 3.75 1st Qu.:21.50 1st Qu.:14.00
## Median : 6.50 Median :35.50 Median :29.50
## Mean : 6.50 Mean :32.88 Mean :29.54
## 3rd Qu.: 9.25 3rd Qu.:43.75 3rd Qu.:38.75
## Max. :12.00 Max. :77.00 Max. :62.00
There are 15 variables which are mainly about the team’s performance in matches (goals, assists, yellow and red cards), in the championship (points, classements) and about the coach’s rotation (substitutions, etc).
I will apply a k-means algorithm on my dataset. I chose to use only 3 clusters to try differentiate top teams to bottom teams while the others should be in the third cluster.
k_means = kmeans(stats, centers = 3)
#k_means
table(k_means$cluster)
##
## 1 2 3
## 9 6 9
#k_means$centers
color <- as.numeric(k_means$cluster)
We can see the teams are divided into three groups of 9, 9 and 6 teams in each.
Now, we can plot some results to see how the teams behave according to different variables.
plot(stats$JC~stats$Pts, bg = color, pch = 21, xlab = "Points", ylab = "Number of players over the season", col = "black")
points(k_means$centers[,1] ~ k_means$centers[,15], col = 1:3, pch = 8, cex = 1)
text(stats$JC~stats$Pts,labels=rownames(stats), col=color, cex = 0.4)
plot(stats$Clt~stats$`Titu Min/Match`, bg = color, pch = 21, xlab = "Min titu/match", ylab = "Classement", col = "black")
points(k_means$centers[,'Clt'] ~ k_means$centers[,'Titu Min/Match'], col = 1:3, pch = 8, cex = 1)
text(stats$Clt~stats$`Titu Min/Match`,labels=rownames(stats), col=color, cex = 0.5)
plot(stats$Age~stats$JC, bg = color, pch = 21, xlab = "Numbers of players over the season", ylab = "Age", col = "black")
points(k_means$centers[,'Age'] ~ k_means$centers[,'JC'], col = 1:3, pch = 8, cex = 1)
text(stats$Age~stats$JC,labels=rownames(stats), col=color, cex = 0.5)
First of all, giving the points, we can clearly distinguish the top teams (in green) to the two others groups. Now that we know which are the best teams, we can move on to the next plots. Actually, the first plot is too influenced by the ‘Points’ variable to determine differences in the team’s performance by the number of player who played over the season.
On the second graph, we could say that the less the starters play (variable ‘Min titu/match’), the the best a team is ranked. However, it is strongly influenced by the position of the team Lyon in 2022.
Finally, when we look at the last graph, we can clearly see that the top teams have more experienced players and allow themselves to let more players have a chance on the field.
It could be intesting to center and scale our data to see if there are any changs in the results.
stats_center = scale(stats, center = TRUE, scale = TRUE)
k_means2 = kmeans(stats_center, centers = 3)
table(k_means2$cluster)
##
## 1 2 3
## 4 8 12
#
variables = stats_center[,c('Age',"JC")]
plot(variables,pch=19,col=4,cex=0.2)
text(variables,labels=rownames(stats_center), col=k_means2$cluster+1, cex=0.7)
means<-k_means2$centers
points(means,col=c(2,3,4),cex=2,pch=4)
Here, the clustering produces different results. Top teams are in blue but it is harder to distinguish the others. In fact, red teams are middle ones so it would need further investigation to understand how they are ranked.
Now let’s produce a hierarchical clustering. We will use the same number of groups (3) than in K-Means.
D <- dist(stats) ### for the euclidian distance by default
HC1 <- hclust(D, method="complete")
plot(HC1, xlab = "Observations", ylab = "Proximity measure", cex=0.5)
groups <- cutree(HC1, k=3)
rect.hclust(HC1, k=3, border="red")
After few tries, it appears that the “complete” method corresponds best to the data. This clustering gives a good view of the ranking over the two last seasons. We can see the best teams on the left (PSG and Lyon), then bottom teams (the second group) and finally the middle teams.
Finally we can put some colors to our plots :
# Color labels by specifying the number of cluster (k)
dend <- as.dendrogram(HC1)
dend %>% set("labels_col", value = c("green", "red",'blue'), k=3) %>%
plot(main = "Color labels \nper cluster")
abline(h = 2, lty = 2)