The data

Like in the first assignment, I will work on the data ‘ushealth’. To proceed at some Hottelling’s tests, I will divide this data using the ‘reg’ variable, which has four levels : ‘Northwest’, ‘Midwest’, ‘South’ and ‘West’.

data("ushealth")
#?ushealth

apply(ushealth[,1:11], 2, mean) ### mean vector estimate

## land area popu 1985       acc      card      canc       pul      pneu      diab 
## 58415.020  4742.460    44.312   398.530   178.410    26.486    21.038    14.836 
##       liv       doc      shop 
##    10.548  8711.860   137.420

head(cov(ushealth[,1:11]))         ### variance-covariance estimate

##               land area    popu 1985           acc         card          canc
## land area 2286225819.78 85977648.684 144976.683429 -869734.7700 -493538.45735
## popu 1985   85977648.68 25865985.600 -17616.450531   88508.9471   41075.27898
## acc           144976.68   -17616.451    120.627608    -468.1912    -209.85196
## card         -869734.77    88508.947   -468.191184    7096.9972    2528.40990
## canc         -493538.46    41075.279   -209.851959    2528.4099    1092.39235
## pul            27652.15    -3124.528     -6.115951     208.0849      81.09137
##                    pul         pneu         diab          liv           doc
## land area 27652.145184 16839.235959 -61133.14155 -5545.003020 134669762.554
## popu 1985 -3124.528122    78.453592    121.48310  6715.081551  54720207.637
## acc          -6.115951    -8.893118    -20.83758    -4.682220    -43654.415
## card        208.084918   187.737612    191.35869    31.446082    197676.284
## canc         81.091367    48.961245     85.95657    32.921551     97287.128
## pul          31.386127     9.288910      4.66378     4.050482     -6378.194
##                    shop
## land area 2984510.33837
## popu 1985  566003.35388
## acc          -291.83576
## card         2196.99939
## canc          762.14265
## pul           -64.23278

num_data = ushealth[,1:12]
split.data = split(num_data[,-12],num_data$reg)
NorthWest <- split.data[[1]]
MidWest <- split.data[[2]]
South <- split.data[[3]]
West <- split.data[[4]]

The distributions

First of all, we want to compare if the mean vectors of each region can be equal or not. For this assignment, we are going to work only on the comparison between the NorthWest and the MidWest regions. However, the procedures remain exactly the same for the others regions.

Let’s plot some boxplots to compare the means :

par(mfrow=c(1,2))
boxplot(NorthWest$`popu 1985`)
title(main = 'Boxplot for the population variable', xlab = 'Northwest', ylab='Population')
boxplot(MidWest$`popu 1985`)
title(main = 'Boxplot for the population variable', xlab = 'Midwest', ylab='Population')

par(mfrow=c(1,2))
boxplot(NorthWest$diab)
title(main = 'Boxplot for the diabetis variable', xlab = 'Northwest', ylab='Population')
boxplot(MidWest$diab)
title(main = 'Boxplot for the diabetis variable', xlab = 'Midwest', ylab='Population')

The tests

Now, we can use the Hotelling’s T2 test to see if the mean vectors are the same or not. In this exercise, we use two different ways to perform this test. However, we obtain the same p-value (0.0009986) which allows us to say that they have different mean vectors.

HotellingsT2Test(NorthWest,MidWest)

## 
##  Hotelling's two sample T2-test
## 
## data:  NorthWest and MidWest
## T.2 = 9.7218, df1 = 11, df2 = 9, p-value = 0.0009986
## alternative hypothesis: true location difference is not equal to c(0,0,0,0,0,0,0,0,0,0,0)

###Xy
hotellingTest_xy <- hotelling.test(NorthWest,MidWest, data = num_data)
hotellingTest_xy

## Test stat:  225.76 
## Numerator df:  11 
## Denominator df:  9 
## P-value:  0.0009986

The confidence intervals

Finally, we can calculate some confidence intervals and display them :

xy = rbind(NorthWest,MidWest)
Fquant <- (11/(21 - 11)) * qf(1 - 0.05, 11, 10)
meanVec <- apply(xy[,1:11], 2, mean)
LB <- diag(11) %*% meanVec - sqrt(Fquant * diag(diag(11) %*% cov(xy[,1:11])  %*% diag(11)))
UB <- diag(11) %*% meanVec + sqrt(Fquant * diag(diag(11) %*% cov(xy[,1:11])  %*% diag(11)))
confInt <- data.frame(LB, UB, meanVec,  row.names = names(xy)[1:11])
names(confInt) <- c("Lower Boundary", "Upper Boundary", "Mean Estimate")
confInt

##           Lower Boundary Upper Boundary Mean Estimate
## land area   -6151.634055    95222.11025   44535.23810
## popu 1985   -3245.944347    13632.13482    5193.09524
## acc            26.612290       47.84485      37.22857
## card          393.213092      513.09167     453.15238
## canc          167.026183      228.12620     197.57619
## pul            19.452060       34.58603      27.01905
## pneu           15.794270       30.86287      23.32857
## diab           11.389270       21.62025      16.50476
## liv             5.621947       16.21615      10.91905
## doc        -10282.881075    31107.92869   10412.52381
## shop          -22.198321      314.76975     146.28571

4th assignment

Bastien Cadiou

15/03/2022

The data

The distributions

The tests

The confidence intervals