First, let’s explore the data. I chose to work with the dataset “uscomp”, which provides information about the sales, the number of employees, the sector, etc. of many US companies. There are seven variables: six of them are numerical and the last one is categorical.
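rm(list=ls())
library(corrplot)   # corrplot(), used for the correlation matrix
library(heatmaply)  # heatmaply(), used for the interactive heatmap
library(ggplot2)    # theme() and element_blank(), used in the heatmap call
data(uscomp)
#?uscomp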
Now that we understand the data, we make some slight modifications to prepare the dataset for the multivariate graphical analysis. For example, the variable “Sales” is not numerical in the original dataset. However, it seems logical to transform it into a numerical variable.
head(uscomp)
##                              Assets Sales Market Value Profits Cash Flow
## Bell_Atlantic                 19788  9084        10636  1092.9    2576.8
## Continental_Telecom            5074  2557         1892   239.9     578.3
## American_Electric_Power       13621  4848         4572   485.0     898.9
## Brooklyn_Union_Gas             1117  1038          478    59.7      91.7
## Central_Illinois_Publ._Serv.   1633   701          679    74.3     135.9
## Cleveland_Electric_Illum.      5651  1254         2002   310.7     407.9
##                              Employees        Sector
## Bell_Atlantic                     79.4 Communication
## Continental_Telecom               21.9 Communication
## American_Electric_Power           23.4        Energy
## Brooklyn_Union_Gas                 3.8        Energy
## Central_Illinois_Publ._Serv.       2.8        Energy
## Cleveland_Electric_Illum.          6.2        Energy
uscomp$Sales = as.numeric(uscomp$Sales)
summary(uscomp)
##      Assets          Sales       Market Value        Profits
##  Min.   :  223   Min.   : 1.0   Min.   :   53.0   Min.   :-771.5
##  1st Qu.: 1122   1st Qu.:20.5   1st Qu.:  512.5   1st Qu.:  39.0
##  Median : 2788   Median :40.0   Median :  944.0   Median :  70.5
##  Mean   : 5941   Mean   :40.0   Mean   : 3269.1   Mean   : 209.8
##  3rd Qu.: 5802   3rd Qu.:59.5   3rd Qu.: 1961.5   3rd Qu.: 188.1
##  Max.   :52634   Max.   :79.0   Max.   :95697.0   Max.   :6555.0
##
##    Cash Flow         Employees                Sector
##  Min.   :-651.90   Min.   :  0.60   Finance      :17
##  1st Qu.:  75.15   1st Qu.:  3.95   Energy       :15
##  Median : 133.30   Median : 15.40   Manufacturing:10
##  Mean   : 400.93   Mean   : 37.60   Retail       :10
##  3rd Qu.: 328.85   3rd Qu.: 48.50   HiTech       : 8
##  Max.   :9874.00   Max.   :400.20   Other        : 7
##                                     (Other)      :12
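One detail in this summary deserves a second look: Sales ranges from 1 to 79 with a mean and a median of 40, whereas head() shows values such as 9084. This is the pattern as.numeric() produces when it is applied to a factor, because it returns the internal level codes rather than the printed values. Assuming Sales is indeed stored as a factor (which is not checked here), a minimal sketch of the pitfall and the usual workaround on a made-up vector:

```r
# Toy factor standing in for a column such as Sales (values are made up)
sales_f <- factor(c("9084", "2557", "4848", "1038"))

# Direct conversion returns the level codes, i.e. positions in levels(sales_f)
codes <- as.numeric(sales_f)

# Going through as.character() first recovers the printed values
values <- as.numeric(as.character(sales_f))

print(codes)
print(values)
```

If the same holds for uscomp, using as.numeric(as.character(uscomp$Sales)) for the conversion would keep the original sales figures, and every result involving Sales would need to be recomputed.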
We know there are six numerical variables. A good start for the analysis could be to visualize the correlation between these variables.
Here is the correlation matrix.
# Keep the six numerical columns as a matrix
mat_num = as.matrix(uscomp[, 1:6])
corrplot(cor(mat_num),
         method = "shade",
         type = "upper",
         bg = "blue",
         title = "Correlation matrix between numerical variables",
         is.corr = TRUE,
         cl.cex = 0.8,
         tl.cex = 0.9,
         tl.col = "black",
         tl.srt = 15
)
First of all, every variable is positively correlated with the others. That makes sense: the bigger the company is, the more employees it has, the more products it sells, etc. However, we can distinguish the pairs of variables that are highly correlated from those that are not. For example, Profits and Cash Flow are highly positively correlated, as are Market Value and Cash Flow. On the other hand, the number of employees and Sales are only weakly correlated with the other variables.
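To read such a matrix numerically rather than by color, the pairs can be listed in order of correlation strength. The following is only a sketch on made-up data; the data frame `toy` and its columns are hypothetical stand-ins for the six numerical variables:

```r
# Toy data: a and b share a common component, c is independent (values are made up)
set.seed(1)
size <- rnorm(50)
toy <- data.frame(
  a = size + rnorm(50, sd = 0.2),
  b = size + rnorm(50, sd = 0.2),
  c = rnorm(50)
)

cm <- cor(toy)

# Keep each pair once by looking only at the upper triangle
idx <- which(upper.tri(cm), arr.ind = TRUE)
pairs <- data.frame(
  var1 = rownames(cm)[idx[, "row"]],
  var2 = colnames(cm)[idx[, "col"]],
  r    = cm[idx]
)

# Strongest correlations (in absolute value) first
pairs <- pairs[order(-abs(pairs$r)), ]
print(pairs)
```

On the real data, cor(mat_num) would take the place of cor(toy).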
We can plot these two variables against each other to see how they behave jointly.
plot(uscomp$Employees,
     uscomp$Sales,
     type = "p",
     main = "Companies' sales with number of employees",
     xlab = "Number of employees",
     ylab = "Sales")
As the correlation matrix suggested, a company’s sales do not appear to depend on its number of employees: the companies with the highest sales are not the ones with the most employees.
Another way to explore the data and its multivariate aspect is the heatmap. Here, we can use the last variable, “Sector”, to select some companies from each sector in proportion to its share of the total.
plot(uscomp$Sector, col='blue', main = "Occurrences of each sector", xlab = "Sectors", ylab = "Number of occurrences")
As we can see, there are many companies from sectors such as Finance or Retail but few from other sectors such as Communication. Because the number of companies (79) is too high to show them all, we are going to present a heatmap of 40% of the companies.
The heatmap makes it possible to spot companies that don’t “behave” like the others. Unfortunately, because the selection is random, the subset changes from one run to the next, so we cannot write a single conclusion about this heatmap. However, readers are invited to explore the data on their own thanks to the heatmap’s interactivity.
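The proportional selection described above can be sketched on toy data as follows. The data frame `toy` is made up, and the call to set.seed() is an addition of this sketch, not part of the analysis: it fixes the random draw so that the same subset comes back on every run.

```r
# Toy data: 20 companies in three sectors (made up)
toy <- data.frame(
  value  = 1:20,
  sector = rep(c("A", "B", "C"), times = c(10, 5, 5))
)

set.seed(42)   # assumption: fix the random draw so the subset is reproducible
frac <- 2/5    # keep 40% of each sector

# Split the row indices by sector, then sample round(frac * n) rows in each group
groups <- split(seq_len(nrow(toy)), toy$sector)
keep <- unlist(lapply(groups, function(rows) {
  rows[sample.int(length(rows), round(frac * length(rows)))]
}))

smaller_toy <- toy[keep, ]
print(table(smaller_toy$sector))
```

Indexing with sample.int() rather than calling sample() on the rows directly avoids a surprise when a group contains a single row, since sample(x) with a single number x draws from 1:x.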
# Number of companies per sector, and the 40% share to draw from each
all_companies = table(uscomp$Sector)
sectors = round(all_companies * 2/5, 0)
smaller_data = data.frame(matrix(data = NA, nrow = 0, ncol = dim(uscomp)[2]))
for (i in 1:length(sectors)) {
  # Randomly pick the required number of rows within sector i
  in_sector = subset(uscomp, uscomp$Sector == names(sectors[i]))
  sector = in_sector[sample(1:all_companies[[i]], sectors[[i]]), ]
  smaller_data = rbind(smaller_data, sector)
}
# Keep the six numerical columns of the sampled companies
mat = smaller_data[1:6]
heatmaply(mat,
          dendrogram = "none",
          xlab = "Numerical variables", ylab = "Companies",
          main = "HeatMap",
          scale = "column",
          margins = c(60, 100, 40, 20),
          grid_color = "white",
          grid_width = 0.00001,
          titleX = TRUE,
          hide_colorbar = TRUE,
          branches_lwd = 0.1,
          label_names = c("Company", "Feature", "Value"),
          fontsize_row = 5, fontsize_col = 5,
          labCol = colnames(mat),
          labRow = rownames(mat),
          heatmap_layers = theme(axis.line = element_blank())
)
The heatmap is very useful because it lets readers see immediately which individuals differ from the others. In this example, the colors separate the companies with similar figures from those that stand out.
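The same idea can be checked numerically. As far as I understand, scale = "column" centers and scales each column before coloring, which base R's scale() also does. The sketch below uses made-up figures, and the 1.5 threshold is an arbitrary assumption chosen for this tiny example:

```r
# Toy data standing in for the numerical columns (values are made up)
toy <- data.frame(
  assets    = c(100, 120, 110, 900, 105),
  employees = c(10, 12, 11, 13, 9),
  row.names = c("c1", "c2", "c3", "c4", "c5")
)

# Center and scale each column (mean 0, standard deviation 1)
z <- scale(toy)

# Hypothetical threshold: flag entries more than 1.5 standard deviations
# away from their column mean
flagged <- which(abs(z) > 1.5, arr.ind = TRUE)
outliers <- rownames(z)[unique(flagged[, "row"])]
print(outliers)
```

Companies flagged this way are the ones that would appear with the most extreme colors in their column of the heatmap.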