First of all, we import some useful libraries.
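
The text does not list the libraries. Judging from the functions called later (PCA, plot.PCA and corrplot), the FactoMineR and corrplot packages are needed. The following is a minimal sketch under that assumption; the package that provides the “uscomp” data is not named in the text and is therefore left out.

```r
# Packages inferred from the functions used below (an assumption, not stated in the text)
pkgs <- c("FactoMineR", "corrplot")

# Check which ones are installed, without attaching them yet
available <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
if (!all(available)) {
  message("Install first: ", paste(pkgs[!available], collapse = ", "))
}

# Attach the packages that are installed
for (p in pkgs[available]) library(p, character.only = TRUE)
```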

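We will work on the “uscomp” data. Thanks to the str() and summary() functions, we can see what the data look like. Sales is stored as a factor, so we convert it to an integer vector. Note that as.integer() applied to a factor returns the internal level codes (1 to 79 here, as the summary shows) rather than the original figures; as.integer(as.character(uscomp$Sales)) would recover the figures themselves, provided every level is a valid number.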

data(uscomp)
str(uscomp)
## 'data.frame':    79 obs. of  7 variables:
##  $ Assets      : int  19788 5074 13621 1117 1633 5651 5835 3494 1654 1679 ...
##  $ Sales       : Factor w/ 79 levels "1037","1038",..: 78 32 55 2 71 6 48 18 52 10 ...
##  $ Market Value: int  10636 1892 4572 478 679 2002 1601 1442 779 687 ...
##  $ Profits     : num  1092.9 239.9 485 59.7 74.3 ...
##  $ Cash Flow   : num  2576.8 578.3 898.9 91.7 135.9 ...
##  $ Employees   : num  79.4 21.9 23.4 3.8 2.8 6.2 10.8 6.4 1.6 4.6 ...
##  $ Sector      : Factor w/ 9 levels "Communication",..: 1 1 2 2 2 2 2 2 2 2 ...
uscomp$Sales <- as.integer(uscomp$Sales)
summary(uscomp)
##      Assets          Sales       Market Value        Profits      
##  Min.   :  223   Min.   : 1.0   Min.   :   53.0   Min.   :-771.5  
##  1st Qu.: 1122   1st Qu.:20.5   1st Qu.:  512.5   1st Qu.:  39.0  
##  Median : 2788   Median :40.0   Median :  944.0   Median :  70.5  
##  Mean   : 5941   Mean   :40.0   Mean   : 3269.1   Mean   : 209.8  
##  3rd Qu.: 5802   3rd Qu.:59.5   3rd Qu.: 1961.5   3rd Qu.: 188.1  
##  Max.   :52634   Max.   :79.0   Max.   :95697.0   Max.   :6555.0  
##                                                                   
##    Cash Flow         Employees                Sector  
##  Min.   :-651.90   Min.   :  0.60   Finance      :17  
##  1st Qu.:  75.15   1st Qu.:  3.95   Energy       :15  
##  Median : 133.30   Median : 15.40   Manufacturing:10  
##  Mean   : 400.93   Mean   : 37.60   Retail       :10  
##  3rd Qu.: 328.85   3rd Qu.: 48.50   HiTech       : 8  
##  Max.   :9874.00   Max.   :400.20   Other        : 7  
##                                     (Other)      :12
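
The Sales column in the summary ranges from 1 to 79 because converting a factor with as.integer() yields level codes. The sketch below illustrates the difference on a toy factor; the three values are hypothetical and are not taken from “uscomp”.

```r
# Toy factor whose labels look like numbers (hypothetical values, not from uscomp)
f <- factor(c("250", "1037", "99"))

# Levels are sorted as text, not as numbers
levels(f)

# Direct conversion returns the position of each value among the levels
codes <- as.integer(f)

# Going through character first returns the numbers themselves
values <- as.integer(as.character(f))

codes
values
```

If the labels contain anything that is not a plain number (thousands separators, for instance), the second conversion produces NA with a warning, so the labels would need cleaning first.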

Because a PCA relies on the relationships between variables, we first plot each quantitative variable against the others.

pairs(uscomp[,1:6])

The next plot summarises the same relationships through the pairwise correlation coefficients of the covariates.

corrplot(cor(uscomp[,1:6]))
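
To read the correlation plot it can help to look at the matrix behind it. The sketch below uses six numeric columns of the built-in mtcars data purely as a stand-in for “uscomp”, and picks out the most strongly correlated pair of variables.

```r
# Stand-in data: six numeric columns of the built-in mtcars set (not uscomp)
X <- mtcars[, 1:6]

# Pearson correlation matrix (the default method of cor)
R <- cor(X)
round(R, 2)

# Ignore the diagonal (always 1) and locate the strongest off-diagonal correlation
R_off <- R
diag(R_off) <- 0
strongest <- max(abs(R_off))
pair <- which(abs(R_off) == strongest, arr.ind = TRUE)[1, ]

# Names of the two variables involved
c(rownames(R)[pair["row"]], colnames(R)[pair["col"]])
```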

PCA

Now we run the PCA using the “PCA” function from the “FactoMineR” package. The 7th variable, the sector of each company, is declared as a supplementary qualitative variable: it does not take part in the computation of the axes, but it can be projected onto the results to see how the sectors behave.

pca <- PCA(uscomp, scale.unit = TRUE, quali.sup = c(7), graph = FALSE)
summary(pca)
## 
## Call:
## PCA(X = uscomp, scale.unit = TRUE, quali.sup = c(7), graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6
## Variance               4.181   0.991   0.502   0.288   0.029   0.009
## % of var.             69.689  16.520   8.358   4.800   0.485   0.147
## Cumulative % of var.  69.689  86.210  94.568  99.368  99.853 100.000
## 
## Individuals (the 10 first)
##                                  Dist    Dim.1    ctr   cos2    Dim.2    ctr
## Bell_Atlantic                |  3.239 |  2.681  2.175  0.685 |  1.322  2.231
## Continental_Telecom          |  0.480 | -0.140  0.006  0.085 | -0.358  0.164
## American_Electric_Power      |  1.225 |  0.688  0.143  0.315 |  0.480  0.294
## Brooklyn_Union_Gas           |  1.871 | -0.893  0.242  0.228 | -1.584  3.205
## Central_Illinois_Publ._Serv. |  1.581 | -0.586  0.104  0.137 |  1.385  2.448
## Cleveland_Electric_Illum.    |  1.579 | -0.340  0.035  0.046 | -1.508  2.905
## Columbia_Gas_System          |  0.709 | -0.494  0.074  0.484 |  0.354  0.160
## Florida_Progress             |  1.129 | -0.528  0.084  0.219 | -0.943  1.135
## Idaho_Power                  |  0.969 | -0.656  0.130  0.458 |  0.561  0.402
## Kansas_Power_Light           |  1.527 | -0.781  0.185  0.262 | -1.251  1.999
##                                cos2    Dim.3    ctr   cos2  
## Bell_Atlantic                 0.167 |  0.624  0.984  0.037 |
## Continental_Telecom           0.556 | -0.132  0.044  0.075 |
## American_Electric_Power       0.154 |  0.607  0.931  0.246 |
## Brooklyn_Union_Gas            0.717 | -0.434  0.475  0.054 |
## Central_Illinois_Publ._Serv.  0.767 | -0.164  0.067  0.011 |
## Cleveland_Electric_Illum.     0.912 | -0.147  0.055  0.009 |
## Columbia_Gas_System           0.250 |  0.218  0.120  0.094 |
## Florida_Progress              0.697 | -0.240  0.145  0.045 |
## Idaho_Power                   0.335 | -0.230  0.133  0.056 |
## Kansas_Power_Light            0.671 | -0.379  0.363  0.062 |
## 
## Variables
##                                 Dim.1    ctr   cos2    Dim.2    ctr   cos2  
## Assets                       |  0.753 13.562  0.567 | -0.121  1.478  0.015 |
## Sales                        |  0.175  0.729  0.030 |  0.981 97.040  0.962 |
## Market Value                 |  0.981 23.003  0.962 | -0.041  0.171  0.002 |
## Profits                      |  0.957 21.921  0.917 | -0.067  0.453  0.004 |
## Cash Flow                    |  0.971 22.552  0.943 | -0.046  0.218  0.002 |
## Employees                    |  0.873 18.233  0.762 |  0.080  0.640  0.006 |
##                               Dim.3    ctr   cos2  
## Assets                        0.643 82.382  0.413 |
## Sales                         0.055  0.603  0.003 |
## Market Value                 -0.100  2.014  0.010 |
## Profits                      -0.217  9.385  0.047 |
## Cash Flow                    -0.165  5.422  0.027 |
## Employees                    -0.031  0.194  0.001 |
## 
## Supplementary categories
##                                  Dist    Dim.1   cos2 v.test    Dim.2   cos2
## Communication                |  1.533 |  1.270  0.686  0.884 |  0.482  0.099
## Energy                       |  0.589 | -0.438  0.552 -0.915 | -0.195  0.109
## Finance                      |  0.808 | -0.390  0.233 -0.883 | -0.119  0.022
## HiTech                       |  2.807 |  2.781  0.981  4.032 |  0.144  0.003
## Manufacturing                |  0.396 | -0.246  0.384 -0.404 |  0.020  0.003
## Medical                      |  0.680 | -0.497  0.535 -0.496 |  0.358  0.277
## Other                        |  0.597 | -0.435  0.530 -0.586 | -0.324  0.295
## Retail                       |  0.793 | -0.084  0.011 -0.139 |  0.430  0.294
## Transportation               |  0.600 | -0.542  0.814 -0.671 | -0.137  0.052
##                              v.test    Dim.3   cos2 v.test  
## Communication                 0.689 |  0.246  0.026  0.495 |
## Energy                       -0.837 | -0.121  0.042 -0.732 |
## Finance                      -0.554 |  0.631  0.610  4.120 |
## HiTech                        0.428 | -0.331  0.014 -1.384 |
## Manufacturing                 0.069 | -0.048  0.015 -0.230 |
## Medical                       0.733 | -0.261  0.147 -0.752 |
## Other                        -0.897 | -0.182  0.093 -0.708 |
## Retail                        1.451 | -0.272  0.118 -1.290 |
## Transportation               -0.349 | -0.206  0.118 -0.736 |
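
With scale.unit = TRUE the variables are standardised, so the eigenvalues in the summary are those of the correlation matrix of the six active variables, and they sum to 6. The sketch below reproduces that structure in base R; it uses six columns of the built-in mtcars data as a stand-in, so the numbers differ from the ones above.

```r
# Stand-in data: six numeric columns of mtcars (not uscomp)
X <- mtcars[, 1:6]

# Eigenvalues of the correlation matrix, returned in decreasing order
ev <- eigen(cor(X))$values

# Share of variance carried by each axis, and the running total
pct <- 100 * ev / sum(ev)
cum <- cumsum(pct)

round(cbind(eigenvalue = ev, pct = pct, cum = cum), 3)
```

The three columns correspond to the “Variance”, “% of var.” and “Cumulative % of var.” rows of the summary.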

Here, we use the PCA results to decide how many dimensions to keep.

pca$var$contrib
##                   Dim.1      Dim.2      Dim.3     Dim.4        Dim.5
## Assets       13.5620450  1.4777714 82.3816697  1.713535  0.572731324
## Sales         0.7290855 97.0404116  0.6032843  1.600443  0.002999277
## Market Value 23.0029865  0.1714245  2.0136301  1.916726 71.145041094
## Profits      21.9208422  0.4527000  9.3848443  8.776482  6.532067465
## Cash Flow    22.5520097  0.2175532  5.4224227  6.121941 20.892259609
## Employees    18.2330310  0.6401393  0.1941488 79.870871  0.854901231
barplot(pca$eig[, 1], main = "Eigenvalues", names.arg = paste0("dim", 1:nrow(pca$eig)))

Keeping only the first two dimensions appears to retain most of the information in the data.
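
One way to make this choice explicit is to fix a target share of variance and keep the smallest number of axes that reaches it. The sketch below uses the eigenvalues printed in the summary; the 85% threshold is an assumption chosen for illustration, not a rule from the text.

```r
# Eigenvalues copied from the PCA summary (Dim.1 to Dim.6)
eig <- c(4.181, 0.991, 0.502, 0.288, 0.029, 0.009)

# Cumulative percentage of variance
cum_pct <- 100 * cumsum(eig) / sum(eig)
round(cum_pct, 1)

# Hypothetical target: keep enough axes to reach 85% of the variance
threshold <- 85
k <- which(cum_pct >= threshold)[1]
k
```

With the fitted object, the same cumulative percentages should be available in the third column of pca$eig.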

plot.PCA(pca, choix = "varcor", axes = c(1, 2), cex = 0.9)

plot.PCA(pca, choix = "ind", axes = c(1, 2), invisible = c("ind", "var"), cex = 0.8)

plot.PCA(pca, choix = "ind", axes = c(1, 2), invisible = "quali", cex = 0.5)

We can see that the first two axes explain more than 85% of the total variance (86.2% cumulatively), which is substantial. The second axis is driven almost entirely by Sales (about 97% of the contributions), while the first axis is built from the other covariates. The second plot shows where the sectors lie relative to the variables. Finally, the last graph shows where each company lies on the first two axes, which helps us understand how the variables characterise a company. For example, IBM is far from all the other companies but close to 0 on the second axis. This means that Sales plays little role in its position, whereas other variables such as profits, assets, etc. matter a great deal in describing the company.
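
The statement that an axis is “explained” by a variable can be checked numerically. In a standardised PCA the coordinate of a variable on an axis is its correlation with that axis, its cos2 is the squared coordinate, and its contribution is 100 times the cos2 divided by the eigenvalue of the axis. The sketch below applies this to the Dim.1 values printed in the summary; because those values are rounded to three decimals, the results agree with pca$var$contrib only approximately.

```r
# Dim.1 coordinates of the variables, copied from the PCA summary
coord_dim1 <- c(Assets = 0.753, Sales = 0.175, `Market Value` = 0.981,
                Profits = 0.957, `Cash Flow` = 0.971, Employees = 0.873)

# Eigenvalue of Dim.1, also from the summary
eig_dim1 <- 4.181

# Quality of representation and contribution (in %) of each variable
cos2_dim1 <- coord_dim1^2
contrib_dim1 <- 100 * cos2_dim1 / eig_dim1

round(contrib_dim1, 2)
```

Sales contributes less than 1% to the first axis, which is consistent with it being carried by the second axis instead.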