extractAIC
to obtain the AIC score for the
above models (the second output). Which model has the lowest AIC
score?reference R4DS
R uses formulas for most modelling functions.
A simple example: y ~ x
translates to: \(y = \beta_0 + x\beta_1\).
Below is a summary for some popular uses
Use | Example | Translation |
---|---|---|
Simple example | y ~ x |
\(y = \beta_0 + x\beta_1\) |
Exclude intercept | y ~ x - 1 or y ~ x + 0 |
\(y = x\beta\) |
Add a continuous variable | y ~ x1 + x2 |
\(y = \beta_0 + x_1 \beta_1 + x_2 \beta_2\) |
Add a categorical variable | y ~ x1 + sex |
\(y = \beta_0 + x_1 \beta_1 + \mbox{sex_male} \beta_2\) |
Interactions (continuous and categorical) | y ~ x1 * sex |
\(y = \beta_0 + x_1 \beta1 + \mbox{sex_male} \beta_2 + x_1 \mbox{sex_male} \beta_{12}\) |
Interactions (continuous and continuous) | y ~ x1 * x_2 |
\(y = \beta_0 + x_1 \beta1 + x_2 \beta_2 + x_1 x_2 \beta_{12}\) |
You can specify transformations inside the model formula too (besides creating new transformed variables in your data frame).
For example, log(y) ~ sqrt(x1) + x2
is transformed to
\(\log(y) = \beta_0 + \sqrt{x_1} \beta_1 + x_2
\beta_2\).
One thing to note is that if your transformation involves
+
, *
, ^
or -
, you
will need to wrap it in I()
so R doesn’t treat it like part
of the model specification.
For example, y ~ x + I(x ^ 2)
is translated to \(y = \beta_0 + x \beta_1 + x^2
\beta_2\).
And remember, R automatically drops redundant variables so
x + x
become x
.
y ~ x ^ 2 + x
y ~ x + x ^ 2 + x ^ 3
y ~ poly(x, 3)
y ~ (x1 + x2 + x3) ^2 + (x2 + x3) * (x4 + x5)
Let’s put more thinking into modeling the fish data and try out several formulas. We know that from physics, the Weight is the proportional to the density multiply volume. As in the mid-term exam, let’s include two “basic” models first:
Model 1:
Weight ~ Length1 + Length2 + Length3 + Height + Width
Model 2:
Weight ~ Species + Length1 + Length2 + Length3 + Height + Width
Inspired from what we learn in physics, we may expect
Species
to be a good variable to model the differences in
density (and probably other unintended effects) among species of fish.
How do we model volume?
Model 3:
Weight ~ Species * (Length1 + Length2 + Length3 + Height + Width)
.
If the size of cross-sections among different fish species is quite
close, then it might be reasonable to approximate the volume by
one-dimensional lengths.
Model 4:
Weight ~ Species * (Length1 + Length2 + Length3 + Height + Width)^2
.
Well, how about considering the higher order terms between length
measurements with their interaction terms, therefore, possibly including
sizes for some cross-sections.
Model 5:
Weight ~ Species * (Length1 + Length2 + Length3 + Height + Width)^3
.
Well, how about considering the higher order terms between length
measurements with their interaction terms, therefore, possibly including
sizes for some cross-sections.
Model 6:
Weight ~ Species * Length1 * Height * Width
. How about
something really like density * volume?
Model 7:
log(Weight) ~ Species + Length1 + Length2 + Length3 + Height + Width
.
How about transforming the response, then the addition on the right-hand
side of the formula will be multiplicative in the original Weight
dimension.
Model 8:
log(Weight) ~ Species * (Length1 + Length2 + Length3 + Height + Width)
.
What about adding back the interactions?
fish.data <- read.csv("Fish.csv")
# before fitting your model, check the weight parameter values to see if all of them make sense.
# you might want to disgard the row with weight = 0.
fish.data <- fish.data[fish.data[, "Weight"] > 0, ]
Which model obtains the highest adjusted-\(R^2\) value? Do all coefficient estimates
make sense? Do you get any estimates as NA
? Why? (Hint: you
may want to check with rank
. Also, compare the number of
parameters in the model with the number of parameters.)
finish the code below
lmod.model.1 <- lm(Weight ~ Length1 + Length2 + Length3 + Height + Width, data = fish.data)
(summary(lmod.model.1))
##
## Call:
## lm(formula = Weight ~ Length1 + Length2 + Length3 + Height +
## Width, data = fish.data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -243.96 -63.57 -25.82 57.90 448.69
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -496.802 29.616 -16.775 < 2e-16 ***
## Length1 63.969 40.169 1.592 0.11335
## Length2 -9.109 41.749 -0.218 0.82759
## Length3 -28.119 17.343 -1.621 0.10701
## Height 27.926 8.721 3.202 0.00166 **
## Width 23.412 20.355 1.150 0.25188
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 123 on 152 degrees of freedom
## Multiple R-squared: 0.8855, Adjusted R-squared: 0.8817
## F-statistic: 235.1 on 5 and 152 DF, p-value: < 2.2e-16
lmod.model.2 <- lm(Weight ~ Species + Length1 + Length2 + Length3 + Height + Width, data = fish.data)
(summary(lmod.model.2))
##
## Call:
## lm(formula = Weight ~ Species + Length1 + Length2 + Length3 +
## Height + Width, data = fish.data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -212.56 -52.52 -11.70 36.55 419.97
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -912.7110 127.4597 -7.161 3.64e-11 ***
## SpeciesParkki 160.9212 75.9591 2.119 0.035824 *
## SpeciesPerch 133.5542 120.6083 1.107 0.269969
## SpeciesPike -209.0262 135.4911 -1.543 0.125061
## SpeciesRoach 104.9243 91.4636 1.147 0.253188
## SpeciesSmelt 442.2125 119.6944 3.695 0.000311 ***
## SpeciesWhitefish 91.5688 96.8338 0.946 0.345901
## Length1 -79.8443 36.3322 -2.198 0.029552 *
## Length2 81.7091 45.8395 1.783 0.076746 .
## Length3 30.2726 29.4837 1.027 0.306233
## Height 5.8069 13.0931 0.444 0.658057
## Width -0.7819 23.9477 -0.033 0.974000
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 93.95 on 146 degrees of freedom
## Multiple R-squared: 0.9358, Adjusted R-squared: 0.931
## F-statistic: 193.6 on 11 and 146 DF, p-value: < 2.2e-16
Yes, model 4, 5, 6 have estimates being NA
(Not
available). By checking the rank of the models (for example,
lmod.model.4$rank
), we see that these models suffer from
rank deficiency (in other words, linear dependencies between explanatory
variables). At the same time, with the 158 observations in the data,
these model consist too many parameters that we might simply do not have
enough data to estimate them. We probably should not consider model 4,
5, 6 at the first place given the size of our data.
extractAIC
to obtain the AIC score
for the above models (the second output). Which model has the lowest AIC
score?finish the code below
(lmod.model.1.AIC <- extractAIC())
Model 8 has the lowest AIC value.
Now let’s play with sequential methods.
Just recall…
Forward selection:
Determine a shreshold significance level for entering predictors into the model
Begin with a model without any predictors, i.e., \(y = \beta_0 + \epsilon\)
For each predictor not in the model, fit a model that adds that predictor to the working model. Record the t-statistic and p-value for the added predictor.
Do any of the predictors considered in step 3 have a p-value less than the shreshold specified in step 1?
Yes: Add the predictor with the smallest p-value to the working model and return to step 3
No: Stop. The working model is the final model.
Backwards elimination
Determine a threshold significance level for removing predictors from the model.
Begin with a model that includes all predictors.
Examine the t-statistic and p-value of the partial regression coefficients associated with each predictor.
Do any of the predictors have a p-value greater than the threshold specified in step 1?
Yes: Remove the predictor with the largest p-value from the working model and return to step 3.
No: Stop. The working model is the final model.
Stepwise selection
Combines forward selection and backwards elimination steps.
Now perform Model selection on the Fish dataset.
Checkout the documentation of function step
in R.
Weight ~ Species * (Length1 + Length2 + Length3 + Height + Width)
insert your code below
log(Weight) ~ Species * (Length1 + Length2 + Length3 + Height + Width)
insert your code below