The goal of this lab is to become familiar with the concepts and interpretations used in studying correlation and linear regression. We will also continue practicing hypothesis testing and confidence intervals. By the end of this lab you should be able to estimate and interpret the slope and intercept of a regression line summarizing the relationship between a single explanatory variable and a single outcome/response. You should also be able to conduct appropriate hypothesis tests to evaluate the presence of a linear relationship, and use confidence intervals to estimate the strength of the relationship. We will use the infer package for hypothesis tests and confidence intervals involving correlations, and the broom package to “tidy up” the outputs from regression models.
From the end of Lecture 17, here are the basic steps to a hypothesis test:
This lab uses the infer, broom, tidyverse, and NHANES packages. You may need to install the broom package.
You are interested in determining whether a linear relationship exists between Age and BMI in the NHANES population.
A. Which variable do you think should be the explanatory variable, and which should be the response? Think about which of these might explain the other.
## Age is the explanatory variable (X)
## BMI is the response (Y)
B. Take a sample of 50 adults and plot their Age vs BMI. Describe the plot and state whether or not you think a relationship exists.
sample1 <- NHANES %>%
  select(Age, BMI) %>%
  sample_n(size = 50)
sample1 %>%
  ggplot(aes(x = Age, y = BMI)) +
  geom_point() +
  labs(x = "Age", y = "BMI")
C. Calculate and interpret the correlation between Age and BMI in your sample. Don’t forget: when calculating the correlation with the cor() function you need to specify what to do about missing values (NAs). I usually specify use = "pairwise.complete" (R matches this to the full option name, "pairwise.complete.obs"), which means R will use all of the pairs of Age and BMI where both values are present and will exclude any pair missing either Age or BMI.
corr1 <- sample1 %>%
  summarise(corr = cor(Age, BMI, use = "pairwise.complete")) %>%
  round(2)
corr1
## # A tibble: 1 x 1
## corr
## <dbl>
## 1 0.36
The correlation of 0.36 indicates a moderate positive linear relationship between Age and BMI.
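To see what the use argument from part C does, here is a minimal sketch in base R. The two short vectors are made-up values for illustration, not NHANES data, and the variable names are hypothetical.

```r
# Toy vectors with missing values (made-up numbers, not NHANES data)
age <- c(20, 35, 50, NA, 65, 72)
bmi <- c(22, 25, NA, 30, 31, 29)

# With the default handling, any missing value makes the result NA
r_default <- cor(age, bmi)

# Keep only the pairs where both values are present
r_pairwise <- cor(age, bmi, use = "pairwise.complete.obs")

# The same value, computed by dropping incomplete pairs by hand
keep <- complete.cases(age, bmi)
r_manual <- cor(age[keep], bmi[keep])

c(default = r_default, pairwise = r_pairwise, manual = r_manual)
```

With only two variables, pairwise deletion and dropping incomplete rows give the same answer; the two approaches can differ once a correlation matrix of three or more variables is involved.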
D. Use a formal hypothesis test to decide at the 5% level whether a linear relationship exists between Age and BMI. Complete all steps of the hypothesis test (i.e., state your null and alternative hypotheses and give a conclusion in complete sentences).
The hypotheses we are interested in testing are \[H_0: \rho = 0 \\ H_A: \rho \neq 0\]. Our stated \(\alpha\) level is 5%.
null_dist <- sample1 %>%
  specify(formula = BMI ~ Age) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 500, type = "permute") %>%
  calculate(stat = "correlation")
p_value <- get_p_value(null_dist,
                       obs_stat = corr1,
                       direction = "two_sided")
p_value
## # A tibble: 1 x 1
## p_value
## <dbl>
## 1 0.012
With a sample size of 50, our observed correlation between Age and BMI of 0.36 corresponds to a p-value of 0.012. Because the p-value is less than our stated \(\alpha\) level of 5%, we reject the null hypothesis and conclude that there is evidence of a linear relationship between BMI and Age.
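The permutation idea behind the test in part D can be sketched by hand in base R. The data below are simulated stand-ins (not NHANES), the seed is arbitrary, and the p-value uses one common two-sided definition; the exact computation inside infer may differ slightly.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

# Simulated stand-in for a sample of 50 people (not NHANES data)
n <- 50
age <- runif(n, min = 18, max = 80)
bmi <- 22 + 0.1 * age + rnorm(n, sd = 6)

# Observed sample correlation
obs_r <- cor(age, bmi)

# Shuffling BMI breaks any link with Age, which mimics the null hypothesis
perm_r <- replicate(500, cor(age, sample(bmi)))

# Two-sided p-value: share of shuffled correlations at least as extreme
p_value <- mean(abs(perm_r) >= abs(obs_r))
p_value
```

With only 500 shuffles, the smallest nonzero p-value this sketch can report is 1/500, so more repetitions are needed when finer resolution matters.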
E. Create and interpret a 95% confidence interval for the true population-level correlation between Age and BMI.
boot_dist <- sample1 %>%
  specify(formula = BMI ~ Age) %>%
  generate(reps = 500, type = "bootstrap") %>%
  calculate(stat = "correlation")
rho_ci <- get_ci(boot_dist, level = 0.95,
                 type = "se", point_estimate = corr1)
rho_ci
## # A tibble: 1 x 2
## lower upper
## <dbl> <dbl>
## 1 0.135 0.585
We are 95% confident that the true correlation, \(\rho\), between Age and BMI is contained in the interval from 0.135 to 0.585.
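As a rough illustration of a standard-error bootstrap interval like the one in part E, here is a base R sketch on simulated data (not NHANES). It assumes the interval is formed as the point estimate plus or minus a normal multiplier times the bootstrap standard error, which is my understanding of type = "se"; the infer documentation is the authority on the details.

```r
set.seed(2)  # arbitrary seed so the sketch is reproducible

# Simulated stand-in for a sample of 50 people (not NHANES data)
n <- 50
age <- runif(n, min = 18, max = 80)
bmi <- 22 + 0.1 * age + rnorm(n, sd = 6)
obs_r <- cor(age, bmi)

# Resample (Age, BMI) pairs with replacement and recompute the correlation
boot_r <- replicate(500, {
  idx <- sample(n, replace = TRUE)
  cor(age[idx], bmi[idx])
})

# Standard-error interval: estimate +/- z * bootstrap standard error
se_r <- sd(boot_r)
z <- qnorm(0.975)
ci <- c(lower = obs_r - z * se_r, upper = obs_r + z * se_r)
ci
```

Resampling whole pairs, rather than each variable separately, is what preserves the relationship between Age and BMI in each bootstrap sample.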
F. Calculate the correlation for the entire NHANES population (not just your sample). Usually we can’t see this population value. Did you make the correct conclusion from your hypothesis test? Did your confidence interval cover the true correlation?
rho <- NHANES %>%
  summarise(rho = cor(Age, BMI, use = "pairwise.complete"))
rho
## # A tibble: 1 x 1
## rho
## <dbl>
## 1 0.408
The population correlation is \(\rho\) = 0.408. This value is contained in our 95% confidence interval and agrees with the conclusion of our hypothesis test: we rejected \(H_0: \rho = 0\), and the population correlation is indeed not zero.
Using the lm(Y ~ X, data = Dataset) function, fit a model that regresses your response variable on your explanatory variable. Use the sample of 50 people you took above as the dataset. Store the fitted model in a variable called model1. Use summary(model1) to examine the output of model1.
model1 <- lm(BMI ~ Age, data = sample1)
summary(model1)
##
## Call:
## lm(formula = BMI ~ Age, data = sample1)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -10.028  -5.006  -1.319   3.801  15.999
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 22.20679    1.94629  11.410 2.84e-15 ***
## Age          0.11901    0.04386   2.714  0.00922 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.356 on 48 degrees of freedom
## Multiple R-squared: 0.133, Adjusted R-squared: 0.1149
## F-statistic: 7.363 on 1 and 48 DF, p-value: 0.009217
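The estimate and standard error columns in a summary like the one above are also the ingredients of a confidence interval for the slope. The sketch below uses simulated data (not NHANES) and a hypothetical model object named fit; it compares base R's confint() with the by-hand formula, estimate plus or minus a t critical value times the standard error.

```r
set.seed(3)  # arbitrary seed so the sketch is reproducible

# Simulated stand-in for a sample of 50 people (not NHANES data)
n <- 50
sim <- data.frame(Age = runif(n, min = 18, max = 80))
sim$BMI <- 22 + 0.1 * sim$Age + rnorm(n, sd = 6)

fit <- lm(BMI ~ Age, data = sim)

# 95% confidence interval for the slope from confint()
ci_slope <- confint(fit, "Age", level = 0.95)

# The same interval by hand: estimate +/- t critical value * standard error
coefs <- coef(summary(fit))
est <- coefs["Age", "Estimate"]
se <- coefs["Age", "Std. Error"]
t_crit <- qt(0.975, df = df.residual(fit))
ci_manual <- c(est - t_crit * se, est + t_crit * se)

ci_slope
ci_manual
```

The degrees of freedom are n - 2 here because two coefficients (intercept and slope) are estimated.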
A. Using tidy() from the broom package, examine the output of tidy(model1) and use this output to write out the equation of the estimated regression line.
mod <- tidy(model1) %>% mutate_if(is.numeric, round, 2)
mod
## # A tibble: 2 x 5
##   term        estimate std.error statistic p.value
##   <chr>          <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept)    22.2       1.95     11.4     0
## 2 Age             0.12      0.04      2.71    0.01
The equation of the line is: BMI = 22.21 + 0.12\(\cdot\)Age.
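With a single explanatory variable, the fitted line and the sample correlation are closely linked: the least-squares slope equals r times the ratio of standard deviations (s_Y / s_X), and the intercept makes the line pass through the point of means. This is a standard identity rather than something specific to this lab; the sketch below checks it on simulated data (not NHANES) with hypothetical variable names.

```r
set.seed(4)  # arbitrary seed so the sketch is reproducible

# Simulated stand-in for a sample of 50 people (not NHANES data)
n <- 50
age <- runif(n, min = 18, max = 80)
bmi <- 22 + 0.1 * age + rnorm(n, sd = 6)

fit <- lm(bmi ~ age)

# Slope from the correlation and the two standard deviations
slope_from_r <- cor(age, bmi) * sd(bmi) / sd(age)

# Intercept chosen so the line passes through (mean(age), mean(bmi))
intercept_from_means <- mean(bmi) - slope_from_r * mean(age)

# Compare with the coefficients estimated by lm()
rbind(lm = coef(fit),
      by_hand = c(intercept_from_means, slope_from_r))
```

One consequence is that the slope and the correlation always share the same sign.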
B. What is your estimate and interpretation of the SLOPE of your line?
Our estimated slope is 0.12. This means that a one-year increase in Age is associated with an average increase of 0.12 \(kg/m^2\) in BMI.
C. What is your estimate and interpretation of the INTERCEPT of your line?
Our estimated intercept is 22.21. This means that when Age is 0, the estimated BMI is 22.21 \(kg/m^2\). An age of 0 is not meaningful in this context, so we will not interpret the intercept.
D. Previously you used ggplot() + geom_point() to create a scatterplot of Y vs X. In the same call, we can add our regression line to the plot. Use geom_smooth(aes(x = ___, y = ___), method = "lm", se = FALSE) to add the line. The aes() statement indicates which variables the line is fit to (it can be omitted when these are already given in ggplot()). method = "lm" indicates the type of line you want to fit: we want a ‘linear model.’ se = FALSE plots just the line (se = TRUE adds a confidence band around the line that we will talk about later).
sample1 %>%
  ggplot(aes(x = Age, y = BMI)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula 'y ~ x'
E. Using the equation of the line, what is the predicted BMI for a person who is 25 years old?
## Intercept + Slope*25
pred_25 <- mod[1,2] + mod[2,2]*25
pred_25
## estimate
## 1 25.21
F. Using the equation of the line, what is the predicted BMI for a person who is 65 years old?
## Intercept + Slope*65
pred_65 <- mod[1,2] + mod[2,2]*65
pred_65
## estimate
## 1 30.01
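The two predictions above plug ages into the fitted equation by hand. Base R's predict() does the same calculation from the model object. The sketch below uses simulated data (not NHANES) and a hypothetical model named fit, and checks predict() against the intercept-plus-slope formula.

```r
set.seed(5)  # arbitrary seed so the sketch is reproducible

# Simulated stand-in for a sample of 50 people (not NHANES data)
n <- 50
sim <- data.frame(Age = runif(n, min = 18, max = 80))
sim$BMI <- 22 + 0.1 * sim$Age + rnorm(n, sd = 6)

fit <- lm(BMI ~ Age, data = sim)

# Predicted BMI at ages 25 and 65 using predict()
new_people <- data.frame(Age = c(25, 65))
pred <- predict(fit, newdata = new_people)

# The same predictions by hand: intercept + slope * Age
b <- coef(fit)
pred_manual <- b[["(Intercept)"]] + b[["Age"]] * new_people$Age

cbind(new_people, pred, pred_manual)
```

predict() works with the unrounded coefficients, so it can differ slightly from a hand calculation based on rounded values. With the unrounded estimates from summary(model1), for example, 22.20679 + 0.11901 × 25 ≈ 25.18 and 22.20679 + 0.11901 × 65 ≈ 29.94, compared with 25.21 and 30.01 from the rounded coefficients.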
G. Formally test, at the 5% level, whether or not Age is associated with BMI. State your hypotheses, use the tidy(model1) output, and explain your conclusion.
Our line is \[BMI = \beta_0 + \beta_1 \cdot Age,\] so we will test whether the slope associated with Age is equal to zero. \[H_0: \beta_1 = 0 \\ H_A: \beta_1 \neq 0\] Based on the output of our model, the estimated slope is 0.12 and the corresponding p-value is 0.01. Because our p-value is less than our pre-specified \(\alpha\) level of 5%, we reject the null hypothesis and conclude that there is evidence of a linear association between Age and BMI.
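The statistic and p-value reported for the slope can be reproduced from the estimate and its standard error. The sketch below copies the values printed by summary(model1) earlier in this lab, so small differences from the printed output come from rounding. It also shows that, in simple regression, the same t statistic can be recovered from R-squared, which is why the slope test and a t-test of the correlation lead to the same conclusion.

```r
# Values copied from the summary(model1) output earlier in this lab
estimate <- 0.11901
std_error <- 0.04386
df_resid <- 48  # n - 2 = 50 - 2

# t statistic for H0: slope = 0, and its two-sided p-value
t_stat <- estimate / std_error
p_val <- 2 * pt(-abs(t_stat), df = df_resid)

# In simple regression the same t follows from R-squared
# (taking the positive root because the estimated slope is positive)
r2 <- 0.133  # Multiple R-squared from the same output
t_from_r <- sqrt(r2) * sqrt(df_resid) / sqrt(1 - r2)

round(c(t_stat = t_stat, p_val = p_val, t_from_r = t_from_r), 4)
```

The unrounded p-value is just under 0.01, which tidy(model1) shows as 0.01 after rounding to two decimals; either way it falls below the 5% level.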