First of all, we import some useful libraries.
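The import lines themselves are not shown in this text. Judging from the functions used later (“PCA” and “plot.PCA” from FactoMineR, “corrplot” from the corrplot package), a minimal loading sketch might look like the following. The package that provides the “uscomp” data is not named here, so it is deliberately left out.

```r
# Packages inferred from the function calls in this walkthrough
# (an assumption, since the original import lines are not shown).
pkgs <- c("FactoMineR", "corrplot")

# Check which packages are installed without attaching them yet.
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!installed]

if (length(missing_pkgs) > 0) {
  message("Install first: ", paste(missing_pkgs, collapse = ", "))
} else {
  library(FactoMineR)  # PCA(), plot.PCA()
  library(corrplot)    # corrplot()
}
```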
We will work on the “uscomp” data. Thanks to the “str” and “summary” functions, we can see what the data looks like. Sales is stored as a factor, so we convert it to an integer vector.
data(uscomp)
str(uscomp)
## 'data.frame': 79 obs. of 7 variables:
## $ Assets : int 19788 5074 13621 1117 1633 5651 5835 3494 1654 1679 ...
## $ Sales : Factor w/ 79 levels "1037","1038",..: 78 32 55 2 71 6 48 18 52 10 ...
## $ Market Value: int 10636 1892 4572 478 679 2002 1601 1442 779 687 ...
## $ Profits : num 1092.9 239.9 485 59.7 74.3 ...
## $ Cash Flow : num 2576.8 578.3 898.9 91.7 135.9 ...
## $ Employees : num 79.4 21.9 23.4 3.8 2.8 6.2 10.8 6.4 1.6 4.6 ...
## $ Sector : Factor w/ 9 levels "Communication",..: 1 1 2 2 2 2 2 2 2 2 ...
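# Caution: as.integer() on a factor returns its level codes (1 to 79 here),
# not the sales figures; as.integer(as.character(uscomp$Sales)) would recover them.
uscomp$Sales <- as.integer(uscomp$Sales)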
summary(uscomp)
## Assets Sales Market Value Profits
## Min. : 223 Min. : 1.0 Min. : 53.0 Min. :-771.5
## 1st Qu.: 1122 1st Qu.:20.5 1st Qu.: 512.5 1st Qu.: 39.0
## Median : 2788 Median :40.0 Median : 944.0 Median : 70.5
## Mean : 5941 Mean :40.0 Mean : 3269.1 Mean : 209.8
## 3rd Qu.: 5802 3rd Qu.:59.5 3rd Qu.: 1961.5 3rd Qu.: 188.1
## Max. :52634 Max. :79.0 Max. :95697.0 Max. :6555.0
##
## Cash Flow Employees Sector
## Min. :-651.90 Min. : 0.60 Finance :17
## 1st Qu.: 75.15 1st Qu.: 3.95 Energy :15
## Median : 133.30 Median : 15.40 Manufacturing:10
## Mean : 400.93 Mean : 37.60 Retail :10
## 3rd Qu.: 328.85 3rd Qu.: 48.50 HiTech : 8
## Max. :9874.00 Max. :400.20 Other : 7
## (Other) :12
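In the summary above, Sales runs from 1 to 79 with a mean of 40, which is what one would expect if “as.integer” returned the factor's internal level codes instead of the sales figures. The toy factor below (made-up values, not taken from “uscomp”) sketches the difference and the usual way to recover the numbers.

```r
# Toy factor with made-up labels; levels are sorted as strings:
# "1037" < "250" < "98".
sales_factor <- factor(c("1037", "250", "98"))

# Direct conversion returns the level codes, not the labels.
codes <- as.integer(sales_factor)

# Converting through character first recovers the numeric values.
values <- as.integer(as.character(sales_factor))

print(codes)
print(values)
```

If the original figures are wanted in the analysis, the second form is the one to apply to uscomp$Sales. The outputs shown in this walkthrough appear to have been produced with the first form.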
Because the relationships between variables matter, we plot each variable against the others.
pairs(uscomp[,1:6])
Furthermore, a correlation plot helps us see how the covariates relate to one another through their pairwise correlation coefficients.
corrplot(cor(uscomp[,1:6]))
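As a numeric complement to the plot, the correlation matrix can also be inspected directly. The sketch below uses a small synthetic data frame (hypothetical columns a, b and c, not “uscomp”) to show the kind of matrix that “cor” hands to “corrplot”, and how to rank the pairs by strength.

```r
set.seed(1)
# Synthetic data: b is built from a, c is independent noise.
a <- rnorm(50)
toy <- data.frame(a = a, b = 2 * a + rnorm(50, sd = 0.5), c = rnorm(50))

# Pearson correlation matrix: symmetric, with ones on the diagonal.
corr_mat <- cor(toy)
print(round(corr_mat, 2))

# Pairs ranked by absolute correlation (upper triangle only).
idx <- which(upper.tri(corr_mat), arr.ind = TRUE)
pairs_tbl <- data.frame(
  var1 = rownames(corr_mat)[idx[, "row"]],
  var2 = colnames(corr_mat)[idx[, "col"]],
  r    = corr_mat[idx]
)
pairs_tbl <- pairs_tbl[order(-abs(pairs_tbl$r)), ]
print(pairs_tbl)
```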
Now, we conduct our PCA using the “PCA” function from the “FactoMineR” package. We declare the 7th variable, the sector of each company, as a supplementary qualitative variable. This means it does not enter the calculations, but it will be interesting to see where it falls when we plot the results.
pca = PCA(uscomp, scale.unit = TRUE, quali.sup = c(7), graph = FALSE)
summary(pca)
##
## Call:
## PCA(X = uscomp, scale.unit = TRUE, quali.sup = c(7), graph = FALSE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6
## Variance 4.181 0.991 0.502 0.288 0.029 0.009
## % of var. 69.689 16.520 8.358 4.800 0.485 0.147
## Cumulative % of var. 69.689 86.210 94.568 99.368 99.853 100.000
##
## Individuals (the 10 first)
## Dist Dim.1 ctr cos2 Dim.2 ctr
## Bell_Atlantic | 3.239 | 2.681 2.175 0.685 | 1.322 2.231
## Continental_Telecom | 0.480 | -0.140 0.006 0.085 | -0.358 0.164
## American_Electric_Power | 1.225 | 0.688 0.143 0.315 | 0.480 0.294
## Brooklyn_Union_Gas | 1.871 | -0.893 0.242 0.228 | -1.584 3.205
## Central_Illinois_Publ._Serv. | 1.581 | -0.586 0.104 0.137 | 1.385 2.448
## Cleveland_Electric_Illum. | 1.579 | -0.340 0.035 0.046 | -1.508 2.905
## Columbia_Gas_System | 0.709 | -0.494 0.074 0.484 | 0.354 0.160
## Florida_Progress | 1.129 | -0.528 0.084 0.219 | -0.943 1.135
## Idaho_Power | 0.969 | -0.656 0.130 0.458 | 0.561 0.402
## Kansas_Power_Light | 1.527 | -0.781 0.185 0.262 | -1.251 1.999
## cos2 Dim.3 ctr cos2
## Bell_Atlantic 0.167 | 0.624 0.984 0.037 |
## Continental_Telecom 0.556 | -0.132 0.044 0.075 |
## American_Electric_Power 0.154 | 0.607 0.931 0.246 |
## Brooklyn_Union_Gas 0.717 | -0.434 0.475 0.054 |
## Central_Illinois_Publ._Serv. 0.767 | -0.164 0.067 0.011 |
## Cleveland_Electric_Illum. 0.912 | -0.147 0.055 0.009 |
## Columbia_Gas_System 0.250 | 0.218 0.120 0.094 |
## Florida_Progress 0.697 | -0.240 0.145 0.045 |
## Idaho_Power 0.335 | -0.230 0.133 0.056 |
## Kansas_Power_Light 0.671 | -0.379 0.363 0.062 |
##
## Variables
## Dim.1 ctr cos2 Dim.2 ctr cos2
## Assets | 0.753 13.562 0.567 | -0.121 1.478 0.015 |
## Sales | 0.175 0.729 0.030 | 0.981 97.040 0.962 |
## Market Value | 0.981 23.003 0.962 | -0.041 0.171 0.002 |
## Profits | 0.957 21.921 0.917 | -0.067 0.453 0.004 |
## Cash Flow | 0.971 22.552 0.943 | -0.046 0.218 0.002 |
## Employees | 0.873 18.233 0.762 | 0.080 0.640 0.006 |
## Dim.3 ctr cos2
## Assets 0.643 82.382 0.413 |
## Sales 0.055 0.603 0.003 |
## Market Value -0.100 2.014 0.010 |
## Profits -0.217 9.385 0.047 |
## Cash Flow -0.165 5.422 0.027 |
## Employees -0.031 0.194 0.001 |
##
## Supplementary categories
## Dist Dim.1 cos2 v.test Dim.2 cos2
## Communication | 1.533 | 1.270 0.686 0.884 | 0.482 0.099
## Energy | 0.589 | -0.438 0.552 -0.915 | -0.195 0.109
## Finance | 0.808 | -0.390 0.233 -0.883 | -0.119 0.022
## HiTech | 2.807 | 2.781 0.981 4.032 | 0.144 0.003
## Manufacturing | 0.396 | -0.246 0.384 -0.404 | 0.020 0.003
## Medical | 0.680 | -0.497 0.535 -0.496 | 0.358 0.277
## Other | 0.597 | -0.435 0.530 -0.586 | -0.324 0.295
## Retail | 0.793 | -0.084 0.011 -0.139 | 0.430 0.294
## Transportation | 0.600 | -0.542 0.814 -0.671 | -0.137 0.052
## v.test Dim.3 cos2 v.test
## Communication 0.689 | 0.246 0.026 0.495 |
## Energy -0.837 | -0.121 0.042 -0.732 |
## Finance -0.554 | 0.631 0.610 4.120 |
## HiTech 0.428 | -0.331 0.014 -1.384 |
## Manufacturing 0.069 | -0.048 0.015 -0.230 |
## Medical 0.733 | -0.261 0.147 -0.752 |
## Other -0.897 | -0.182 0.093 -0.708 |
## Retail 1.451 | -0.272 0.118 -1.290 |
## Transportation -0.349 | -0.206 0.118 -0.736 |
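With scale.unit = TRUE the variables are standardized, and the eigenvalues in the table sum to the number of active variables (here 6), as they would for an eigendecomposition of the correlation matrix. The base R sketch below illustrates that relationship on the built-in USArrests data, chosen only for illustration. It uses “eigen” and “prcomp”, not FactoMineR, so it is a cross-check of the idea and not a reproduction of the output above.

```r
# Built-in data set with four numeric columns, used purely for illustration.
X <- USArrests

# Eigenvalues of the correlation matrix.
eig_vals <- eigen(cor(X), symmetric = TRUE)$values

# The same quantities from a standardized PCA in base R.
pc <- prcomp(X, center = TRUE, scale. = TRUE)
pc_vars <- pc$sdev^2

# Share of variance per component, in percent.
pct <- 100 * eig_vals / sum(eig_vals)

print(round(cbind(eigenvalue = eig_vals,
                  prcomp_var = pc_vars,
                  pct        = pct,
                  cum_pct    = cumsum(pct)), 3))
```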
Here, we use the PCA results to decide how many dimensions to keep.
pca$var$contrib
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5
## Assets 13.5620450 1.4777714 82.3816697 1.713535 0.572731324
## Sales 0.7290855 97.0404116 0.6032843 1.600443 0.002999277
## Market Value 23.0029865 0.1714245 2.0136301 1.916726 71.145041094
## Profits 21.9208422 0.4527000 9.3848443 8.776482 6.532067465
## Cash Flow 22.5520097 0.2175532 5.4224227 6.121941 20.892259609
## Employees 18.2330310 0.6401393 0.1941488 79.870871 0.854901231
# Bar plot of the eigenvalues, one bar per dimension.
barplot(pca$eig[, 1], main = "Eigenvalues", names.arg = paste0("dim", seq_len(nrow(pca$eig))))
Keeping only the first two dimensions appears to retain most of the information in the data: together they account for about 86% of the variance.
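One common way to make this choice explicit is to look at the cumulative share of variance and at which eigenvalues exceed 1 (the Kaiser rule, mentioned here as a rule of thumb and not as the criterion used in this analysis). The sketch below re-enters the rounded eigenvalues from the summary table by hand; the 85% threshold is an arbitrary illustrative choice.

```r
# Eigenvalues copied (rounded) from the summary(pca) table.
eig <- c(Dim.1 = 4.181, Dim.2 = 0.991, Dim.3 = 0.502,
         Dim.4 = 0.288, Dim.5 = 0.029, Dim.6 = 0.009)

pct     <- 100 * eig / sum(eig)
cum_pct <- cumsum(pct)

# Smallest number of dimensions reaching the (assumed) 85% threshold.
threshold <- 85
n_keep <- unname(which(cum_pct >= threshold)[1])

# Kaiser rule of thumb: keep components whose eigenvalue exceeds 1.
n_kaiser <- sum(eig > 1)

print(round(cum_pct, 1))
print(c(n_keep = n_keep, n_kaiser = n_kaiser))
```

The two rules need not agree: the second eigenvalue (0.991) sits just below 1, so the Kaiser rule alone would keep a single dimension, while the cumulative-variance view supports keeping two.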
plot.PCA(pca, choix ="varcor", axes=c(1,2), cex=0.9)
plot.PCA(pca, choix="ind", axes=c(1,2), invisible = c("ind","var"),cex=0.8)
plot.PCA(pca,choix ="ind", axes=c(1,2), invisible = "quali", cex=0.5)
The first two axes explain more than 85% of the variance, which is a substantial share. The second axis is driven mainly by the Sales variable, while the first axis is driven by the other covariates. The second plot shows where the sectors lie relative to the variables. Finally, the last graph shows where each company lies on the first two axes, which helps us understand how the variables characterize a company. For example, IBM is far from all the other companies but close to 0 on the second axis. This means that sales contribute little to its position, whereas other variables such as profits and assets matter a great deal in describing the company.
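The reading of which variables drive each axis can be checked against the numbers in the Variables table: in this output, a variable's contribution to an axis matches its squared coordinate (its cos2) divided by the eigenvalue of that axis, times 100. The sketch below recomputes this from the rounded coordinates printed in the summary, so small differences from the ctr column are expected.

```r
# Coordinates copied (rounded) from the "Variables" block of summary(pca).
coord <- matrix(
  c(0.753, -0.121,
    0.175,  0.981,
    0.981, -0.041,
    0.957, -0.067,
    0.971, -0.046,
    0.873,  0.080),
  ncol = 2, byrow = TRUE,
  dimnames = list(
    c("Assets", "Sales", "Market Value", "Profits", "Cash Flow", "Employees"),
    c("Dim.1", "Dim.2")
  )
)
eig <- c(Dim.1 = 4.181, Dim.2 = 0.991)

# Contribution (%) = 100 * squared coordinate / eigenvalue of the axis.
contrib <- 100 * sweep(coord^2, 2, eig, "/")
print(round(contrib, 2))

# Variable contributing most to each axis.
top_var <- rownames(contrib)[apply(contrib, 2, which.max)]
names(top_var) <- colnames(contrib)
print(top_var)
```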