Understanding the data

First, let’s explore the data. I chose to work with the dataset “uscomp”, which describes 79 US companies through their sales, number of employees, sector, and so on. There are seven variables: six describe numerical quantities and the last one is categorical.

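rm(list=ls())
# Packages used below; the package providing uscomp must be loaded as well.
library(corrplot)   # corrplot()
library(heatmaply)  # heatmaply()
library(ggplot2)    # theme(), element_blank()

data(uscomp)
#?uscomp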

Preparing the dataset

Now that we have a first idea of the data, we make a few small modifications to prepare the dataset for the multivariate graphical analysis. For example, the variable “Sales” is not stored as a numerical variable in the original dataset, so we convert it to one.

head(uscomp)

##                              Assets Sales Market Value Profits Cash Flow
## Bell_Atlantic                 19788  9084        10636  1092.9    2576.8
## Continental_Telecom            5074  2557         1892   239.9     578.3
## American_Electric_Power       13621  4848         4572   485.0     898.9
## Brooklyn_Union_Gas             1117  1038          478    59.7      91.7
## Central_Illinois_Publ._Serv.   1633   701          679    74.3     135.9
## Cleveland_Electric_Illum.      5651  1254         2002   310.7     407.9
##                              Employees        Sector
## Bell_Atlantic                     79.4 Communication
## Continental_Telecom               21.9 Communication
## American_Electric_Power           23.4        Energy
## Brooklyn_Union_Gas                 3.8        Energy
## Central_Illinois_Publ._Serv.       2.8        Energy
## Cleveland_Electric_Illum.          6.2        Energy

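# Sales is a factor: as.numeric() alone returns its level codes (the 1 to 79
# seen in the summary below); going through as.character() keeps the figures.
uscomp$Sales = as.numeric(as.character(uscomp$Sales))
summary(uscomp)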

##      Assets          Sales       Market Value        Profits      
##  Min.   :  223   Min.   : 1.0   Min.   :   53.0   Min.   :-771.5  
##  1st Qu.: 1122   1st Qu.:20.5   1st Qu.:  512.5   1st Qu.:  39.0  
##  Median : 2788   Median :40.0   Median :  944.0   Median :  70.5  
##  Mean   : 5941   Mean   :40.0   Mean   : 3269.1   Mean   : 209.8  
##  3rd Qu.: 5802   3rd Qu.:59.5   3rd Qu.: 1961.5   3rd Qu.: 188.1  
##  Max.   :52634   Max.   :79.0   Max.   :95697.0   Max.   :6555.0  
##                                                                   
##    Cash Flow         Employees                Sector  
##  Min.   :-651.90   Min.   :  0.60   Finance      :17  
##  1st Qu.:  75.15   1st Qu.:  3.95   Energy       :15  
##  Median : 133.30   Median : 15.40   Manufacturing:10  
##  Mean   : 400.93   Mean   : 37.60   Retail       :10  
##  3rd Qu.: 328.85   3rd Qu.: 48.50   HiTech       : 8  
##  Max.   :9874.00   Max.   :400.20   Other        : 7  
##                                     (Other)      :12
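The Sales column in the summary above runs from 1 to 79 with a mean of 40, which is what one gets when a factor is converted directly with as.numeric(). The small sketch below illustrates the difference on a hypothetical factor built from three of the sales figures shown in head(uscomp); the object names are mine and are not part of the dataset.

```r
# Hypothetical factor holding three of the sales figures from head(uscomp).
sales_factor <- factor(c("9084", "2557", "4848"))

# Direct conversion returns the position of each value among the sorted levels.
codes <- as.numeric(sales_factor)

# Converting to character first recovers the figures themselves.
values <- as.numeric(as.character(sales_factor))

print(codes)   # 3 1 2
print(values)  # 9084 2557 4848
```

Going through as.character() is the usual way to turn a factor of digits back into numbers.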

Correlation

With six numerical variables, a good starting point for the analysis is to visualize the correlations between them.

Here is the correlation matrix.

# Keep the six numerical columns (1 to 6) as a matrix
mat_num = as.matrix(uscomp[, 1:6])
corrplot(cor(mat_num), 
         method = "shade", 
         type = "upper", 
         bg = "blue",
         title = "Correlation matrix between numerical variables",
         is.corr = TRUE,
         cl.cex = 0.8,
         tl.cex = 0.9,
         tl.col='black',
         tl.srt = 15
         )

First of all, every variable is positively correlated with the others. That makes sense: the bigger a company is, the more employees it has, the more products it sells, etc. However, we can distinguish the pairs of variables that are highly correlated from those that are not. For example, Profits and Cash Flow are highly positively correlated, as are Market Value and Cash Flow. On the other hand, Employees and Sales are only weakly correlated with the other variables. For Sales this should be read with care: the summary above shows Sales running from 1 to 79 with a mean of 40, the pattern of factor level codes rather than of sales figures, so its weak correlations may come from the conversion step.
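To read the correlations as numbers rather than as colors, one option is to list each pair of variables once, sorted by strength. The sketch below does this on the six rows printed by head(uscomp) earlier, so it is only an illustration of the technique: six companies say nothing reliable about the full dataset. The object names and the shortened column names (MarketValue, CashFlow) are my own.

```r
# First six rows of uscomp, copied from the head() output shown earlier.
six <- data.frame(
  Assets      = c(19788, 5074, 13621, 1117, 1633, 5651),
  Sales       = c(9084, 2557, 4848, 1038, 701, 1254),
  MarketValue = c(10636, 1892, 4572, 478, 679, 2002),
  Profits     = c(1092.9, 239.9, 485.0, 59.7, 74.3, 310.7),
  CashFlow    = c(2576.8, 578.3, 898.9, 91.7, 135.9, 407.9),
  Employees   = c(79.4, 21.9, 23.4, 3.8, 2.8, 6.2)
)

cors <- cor(six)  # Pearson correlation by default

# Keep each pair once (upper triangle), then sort from strongest to weakest.
idx <- which(upper.tri(cors), arr.ind = TRUE)
pair_table <- data.frame(
  var1 = rownames(cors)[idx[, "row"]],
  var2 = colnames(cors)[idx[, "col"]],
  r    = round(cors[idx], 2)
)
pair_table <- pair_table[order(-abs(pair_table$r)), ]
print(pair_table)
```

On the full data, the same lines would apply to mat_num in place of six.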

We can plot these two variables against each other to see how they behave jointly.

plot(uscomp$Employees, 
     uscomp$Sales, 
     type='p', 
     main = "Companies' sales with number of employees",
     xlab = "Number of employees", 
     ylab = "Sales")

As the correlation matrix suggested, a company’s sales do not appear to depend on its number of employees in this plot: the companies with the highest sales are not the ones with the most employees.

Heatmap

Another way to explore the data and its multivariate aspect is the heatmap. Here, we can use the last variable, “Sector”, to select companies from each sector in proportion to the size of that sector.

The sectors

plot(uscomp$Sector, col='blue', main = "Occurrences of each sector", xlab = "Sectors", ylab = "Number of occurrences")

As we can see, there are many companies from sectors such as Finance or Retail but few from other sectors such as Communication. Because there are too many companies (79) to show them all, we present a heatmap of 40% of the companies, drawn at random within each sector.

The heatmap makes it possible to spot the companies that don’t “behave” like the others. Unfortunately, because the selection is random, the heatmap changes from one run to the next, so we cannot draw a single conclusion from it. However, readers are invited to explore the data themselves thanks to the heatmap’s interactivity.
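One way to make such a selection repeatable is to fix the random seed before sampling. The sketch below is not the code used in this report: it draws 40% of the rows within each sector of a small made-up data frame (toy, with a hypothetical id column), and the seed value 42 and the helper pick_rows are arbitrary choices of mine.

```r
# Toy stand-in for uscomp: an id and a Sector column only (made-up data).
toy <- data.frame(
  id     = 1:20,
  Sector = factor(rep(c("Energy", "Finance", "Retail"), times = c(5, 10, 5)))
)

set.seed(42)  # arbitrary seed, fixed so that every run draws the same rows

# Draw a given fraction of the rows of one sector.
pick_rows <- function(rows, frac = 0.4) {
  n_keep <- round(nrow(rows) * frac)
  rows[sample.int(nrow(rows), n_keep), , drop = FALSE]
}

# Split by sector, sample inside each group, and stack the results.
smaller_toy <- do.call(rbind, lapply(split(toy, toy$Sector), pick_rows))

print(table(smaller_toy$Sector))
```

With a fixed seed, the heatmap built from the sample would show the same companies on every run, at the cost of describing only that one draw.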

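# Number of companies per sector, and the 40% of each sector that we keep
all_companies = table(uscomp$Sector)
sectors = round(all_companies * 2/5, 0)

# Empty data frame that will collect the sampled companies
smaller_data = data.frame(matrix(data = NA, nrow = 0, ncol = ncol(uscomp)))

for (i in seq_along(sectors)){
  # Companies of sector i, then a random draw of the required number of rows
  in_sector = subset(uscomp, Sector == names(sectors)[i])
  sector = in_sector[sample(seq_len(nrow(in_sector)), sectors[[i]]), ]
  smaller_data = rbind(smaller_data, sector)
}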

mat = smaller_data[1:6]

heatmaply(mat, 
          dendrogram = "none",
          xlab = "Numerical variables", ylab = "Companies", 
          main = "HeatMap",
          scale = "column",
          margins = c(60,100,40,20),
          grid_color = "white",
          grid_width = 0.00001,
          titleX = TRUE,
          hide_colorbar = TRUE,
          branches_lwd = 0.1,
          label_names = c("Company", "Feature:", "Value"),
          fontsize_row = 5, fontsize_col = 5,
          labCol = colnames(mat),
          labRow = rownames(mat),
          heatmap_layers = theme(axis.line=element_blank())
)

## Warning: `gather_()` was deprecated in tidyr 1.2.0.
## Please use `gather()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.

The heatmap is very useful because it lets readers see immediately which individuals differ from the others. In this example, the colors separate the companies that have similar figures from those that stand out.
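The heatmap call above uses scale = "column". Assuming this standardizes each column in the same way as base R’s scale() (my reading of the option, not something checked against the heatmaply source), the sketch below shows the effect on three columns of the six rows printed by head(uscomp): variables measured on very different scales become comparable, which is what lets one color scale serve all columns.

```r
# Three columns of the first six rows of uscomp, copied from head(uscomp).
six <- data.frame(
  Assets    = c(19788, 5074, 13621, 1117, 1633, 5651),
  Profits   = c(1092.9, 239.9, 485.0, 59.7, 74.3, 310.7),
  Employees = c(79.4, 21.9, 23.4, 3.8, 2.8, 6.2)
)

# For each column: subtract its mean, then divide by its standard deviation.
scaled <- scale(six)

print(round(scaled, 2))
print(round(colMeans(scaled), 10))  # every column is now centered on 0
```

Without this step, the column with the largest raw values would dominate the color scale.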

First assignment Cadiou

Multivariate Analysis

17/02/2022
