Simple Linear Regression

The Davis data set in the carData package contains the measured and self-reported heights and weights of 200 men and women engaged in regular exercise. A few of the data values are missing, and consequently there are only 183 complete cases for the variables that are used in the analysis reported below.

First, take a quick look at the data:

library(carData)
summary(Davis)
##  sex         weight          height          repwt            repht      
##  F:112   Min.   : 39.0   Min.   : 57.0   Min.   : 41.00   Min.   :148.0  
##  M: 88   1st Qu.: 55.0   1st Qu.:164.0   1st Qu.: 55.00   1st Qu.:160.5  
##          Median : 63.0   Median :169.5   Median : 63.00   Median :168.0  
##          Mean   : 65.8   Mean   :170.0   Mean   : 65.62   Mean   :168.5  
##          3rd Qu.: 74.0   3rd Qu.:177.2   3rd Qu.: 73.50   3rd Qu.:175.0  
##          Max.   :166.0   Max.   :197.0   Max.   :124.00   Max.   :200.0  
##                                          NA's   :17       NA's   :17

Q1. How many variables are in the Davis data, and what does each one represent?

We focus here on the regression of weight on repwt. This problem has response \(y = \text{weight}\) and one predictor, repwt, from which we obtain the regressor \(x_1 = \text{repwt}\). We again construct the design matrix and response vector first.

X <- as.matrix(cbind(1, Davis[, "repwt"]))
Y <- as.matrix(Davis[, "weight"])

# Flag the rows of X and Y that contain no missing values
X.complete.rows <- complete.cases(X)  # see ?complete.cases for details
Y.complete.rows <- complete.cases(Y)

# Keep only the rows that are complete in both X and Y
keep <- X.complete.rows & Y.complete.rows
X <- X[keep, ]
Y <- Y[keep, , drop = FALSE]  # drop = FALSE keeps Y as a one-column matrix

Q2. Calculate the least squares estimate using \(\hat{\beta}_{ls} = (X^TX)^{-1}X^TY\).
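
As a minimal sketch of how the formula in Q2 translates to R, the following applies it to a small made-up data set rather than to Davis. The names x_toy, y_toy, X_toy, and beta_hat are illustrative assumptions. The sketch solves the normal equations \((X^TX)\beta = X^TY\) with solve(A, b) instead of forming the inverse explicitly; both routes give the same estimate.

```r
# Toy data (hypothetical, not from Davis): y = 1 + 2x exactly, no noise
x_toy <- c(1, 2, 3, 4, 5)
y_toy <- 1 + 2 * x_toy

# Design matrix with a column of ones for the intercept
X_toy <- cbind(1, x_toy)

# crossprod(A, B) computes t(A) %*% B, so this solves (X'X) beta = X'y
beta_hat <- solve(crossprod(X_toy), crossprod(X_toy, y_toy))
beta_hat
```

Because the toy response is an exact linear function of x_toy, the estimate should recover the intercept 1 and the slope 2 up to rounding error.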

Q3. Calculate the total sum of squares (SST), the residual sum of squares (SSR), and the regression sum of squares (SSReg). Compute \(SST - SSR - SSReg\) and explain what you get.
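
For reference, one common set of definitions for the three quantities in Q3 is sketched below. The notation is an assumption not stated in the text: \(\hat{y}_i\) denotes the fitted value for case \(i\), \(\bar{y}\) the mean of the response, and \(n\) the number of complete cases.

\[
SST = \sum_{i=1}^{n} (y_i - \bar{y})^2, \qquad
SSR = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2, \qquad
SSReg = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2
\]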

Q4. Calculate the mean squared error (MSE).
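
The text does not define the MSE, so the following is an assumed convention: the residual sum of squares \(SSR\) divided by its degrees of freedom, where \(n\) is the number of complete cases and \(p\) is the number of columns of \(X\) (here \(p = 2\), for the intercept and repwt).

\[
MSE = \frac{SSR}{n - p}
\]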

Q5. Calculate the test statistic for testing the null hypothesis \(H_0: \beta_0 = 0\) against \(H_a: \beta_0 \ne 0\), and report the p-value.
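
One way to organize the calculation in Q5 is sketched below on a small made-up data set, not on Davis. It assumes the usual t test for a single coefficient, with standard error \(\sqrt{MSE \cdot [(X^TX)^{-1}]_{11}}\) for the intercept and \(n - p\) degrees of freedom. All object names (x_toy, y_toy, XtX_inv, se_b0, and so on) are illustrative.

```r
# Toy data (hypothetical, not from Davis or Mandel)
x_toy <- c(1, 2, 3, 4, 5, 6)
y_toy <- c(2.1, 3.9, 6.2, 7.8, 10.1, 11.9)

X_toy <- cbind(1, x_toy)   # design matrix with an intercept column
n <- nrow(X_toy)           # number of cases
p <- ncol(X_toy)           # number of regression coefficients

XtX_inv  <- solve(crossprod(X_toy))               # (X'X)^{-1}
beta_hat <- XtX_inv %*% crossprod(X_toy, y_toy)   # least squares estimate
res      <- y_toy - X_toy %*% beta_hat            # residuals
mse      <- sum(res^2) / (n - p)                  # residual SS / residual df

# Standard error, t statistic, and two-sided p-value for the intercept
se_b0 <- sqrt(mse * XtX_inv[1, 1])
t_b0  <- beta_hat[1] / se_b0
p_b0  <- 2 * pt(abs(t_b0), df = n - p, lower.tail = FALSE)
c(t = t_b0, p.value = p_b0)
```

The same pattern applies to the slope by using the second element of beta_hat and the second diagonal entry of XtX_inv.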

Q6. Fit the same model using the lm() function and check your calculations above.
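
The kind of cross-check Q6 asks for might look like the sketch below, again on a small made-up data set rather than on Davis. The data frame toy and the objects fit, X_toy, and beta_hat are illustrative names; the functions used (lm, coef, anova, summary, model.matrix) are standard R.

```r
# Toy data (hypothetical, not from Davis or Mandel)
toy <- data.frame(x = c(1, 2, 3, 4, 5, 6),
                  y = c(2.1, 3.9, 6.2, 7.8, 10.1, 11.9))

fit <- lm(y ~ x, data = toy)

coef(fit)                  # coefficient estimates
anova(fit)                 # regression and residual sums of squares
summary(fit)$coefficients  # estimates, standard errors, t values, p-values
summary(fit)$sigma^2       # residual SS divided by residual df

# Direct check against the normal-equations solution
X_toy    <- model.matrix(fit)   # design matrix that lm() used
beta_hat <- solve(crossprod(X_toy), crossprod(X_toy, toy$y))
all.equal(unname(coef(fit)), as.numeric(beta_hat))
```

By default lm() drops rows with missing values in the model variables, so on data with NAs it should use the same complete cases as a hand-built X and Y that were subset with complete.cases().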

Q7. Repeat Q2 to Q6 for the Mandel data set, regressing y on x1.