The goal of this lab is to become familiar with the concepts and interpretations used in studying correlation and linear regression. We will also continue practicing hypothesis testing and confidence intervals. By the end of this lab you should be able to estimate and interpret the slope and intercept of a regression line summarizing the relationship between a single explanatory variable and a single outcome (response) variable. You should also be able to conduct appropriate hypothesis tests to evaluate the presence of a linear relationship, and use confidence intervals to estimate the strength of the relationship. We will use the infer package for hypothesis tests and confidence intervals involving correlations, and the broom package to “tidy up” the output from regression models.

From the end of Lecture 17, here are the basic steps to a hypothesis test:

  1. State Null and Alternative hypotheses.
  2. Pick or state your significance level (\(\alpha\)).
  3. Collect your sample data.
  4. Calculate the observed statistic.
  5. Generate the NULL Distribution.
  6. Calculate the p-value.
  7. Compare your p-value to your \(\alpha\) level.
  8. Make your conclusion:
    • p < \(\alpha\) → REJECT THE NULL
    • p \(\geq \alpha\) → FAIL TO REJECT THE NULL
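The eight steps above can be sketched in base R as a permutation test for a correlation. This is only an illustration: the data are simulated, and the names x, y, null_stats, and decision are hypothetical rather than part of the lab.

```r
# Step 3: hypothetical simulated data standing in for a real sample.
set.seed(1)
n <- 50
x <- rnorm(n)
y <- 0.5 * x + rnorm(n)

# Step 2: significance level.
alpha <- 0.05

# Step 4: observed statistic.
obs_stat <- cor(x, y)

# Step 5: null distribution. Shuffling y breaks any link between x and y,
# which is exactly what H0 (no linear relationship) claims.
null_stats <- replicate(1000, cor(x, sample(y)))

# Step 6: two-sided p-value, the share of shuffled correlations that are
# at least as extreme as the observed one.
p_value <- mean(abs(null_stats) >= abs(obs_stat))

# Steps 7 and 8: compare the p-value with alpha and conclude.
decision <- if (p_value < alpha) "reject H0" else "fail to reject H0"
decision
```

The infer pipeline used later in this lab (specify, hypothesize, generate, calculate) carries out the same shuffle-and-recompute idea.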

0.1 Set up
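The setup chunk itself is not shown here. A minimal sketch of what it might contain, assuming the four packages used in the rest of the lab are installed, is given below. The seed value is hypothetical; the numbers printed in this lab come from a different random sample, so your own results will differ.

```r
# Assumed setup chunk; the lab's own chunk is not shown.
library(tidyverse)  # dplyr verbs, the %>% pipe, and ggplot2
library(NHANES)     # the NHANES data frame
library(infer)      # specify(), hypothesize(), generate(), calculate()
library(broom)      # tidy()

# Hypothetical seed so that sample_n() and the resampling are reproducible.
set.seed(2024)
```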

1. The relationship between Age and BMI

You are interested in determining whether a linear relationship exists between Age and BMI in the NHANES population.

A. Which variable do you think should be the explanatory variable, and which should be the response? Think about which of these might explain the other.

## Age is the explanatory variable (X)
## BMI is the response (Y)

B. Take a sample of 50 adults and plot their Age vs BMI. Describe the plot and state whether or not you think a relationship exists.

sample1 <- NHANES %>% 
          select(Age,BMI) %>% 
          sample_n(size = 50)
sample1 %>% 
  ggplot(aes(x= Age, y = BMI))+
  geom_point() + 
  labs(x = "Age", y = "BMI")

C. Calculate and interpret the correlation between Age and BMI in your sample. Don’t forget: when calculating the correlation with the cor() function, you need to specify what to do about missing values (NAs). I usually specify use = "pairwise.complete", which means R will use all of the pairs of Age and BMI where both values are present and will exclude any pair missing either Age or BMI.

corr1 <- sample1 %>% 
  summarise(corr= cor(Age, BMI, use = "pairwise.complete")) %>% 
  round(2)
corr1
## # A tibble: 1 x 1
##    corr
##   <dbl>
## 1  0.36

The correlation of 0.36 indicates a moderate positive linear relationship between Age and BMI.
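To see why the use argument matters, here is a small sketch with made-up values (the vectors age and bmi are hypothetical, not NHANES data). By default cor() returns NA as soon as any value is missing; with pairwise deletion it uses only the complete pairs.

```r
# Toy vectors (hypothetical values) with one missing entry each.
age <- c(20, 30, 40, 50, NA)
bmi <- c(22, 24, 26, NA, 30)

# Default behaviour (use = "everything"): any NA makes the result NA.
r_default <- cor(age, bmi)

# Pairwise deletion: only the first three pairs are complete, and they
# lie exactly on a line, so the correlation is 1.
r_pairwise <- cor(age, bmi, use = "pairwise.complete.obs")

# "pairwise.complete" is accepted as an abbreviation of the full string.
r_short <- cor(age, bmi, use = "pairwise.complete")
```

With only two variables, "complete.obs" and "pairwise.complete.obs" give the same answer; they differ only when cor() is given a data frame with more than two columns.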

D. Use a formal hypothesis test to decide at the 5% level if a linear relationship exists between Age and BMI. Complete all steps of the hypothesis test (i.e., state your null and alternative hypotheses and give a conclusion in complete sentences).

The hypotheses we are interested in testing are \[\begin{aligned} H_0&: \rho = 0 \\ H_A&: \rho \neq 0 \end{aligned}\] where \(\rho\) is the population correlation between Age and BMI. Our stated \(\alpha\) level is 5%.

null_dist <- sample1 %>% 
  specify(formula = BMI ~ Age) %>% 
  hypothesize(null = "independence") %>% 
  generate(reps = 500, type = "permute") %>% 
  calculate(stat = "correlation")

p_value <- get_p_value(null_dist,
                       obs_stat = corr1, 
                       direction = "two_sided")

p_value
## # A tibble: 1 x 1
##   p_value
##     <dbl>
## 1   0.012

Our observed correlation between Age and BMI of 0.36, from a sample of 50, corresponds to a p-value of 0.012. Because our p-value is less than our stated \(\alpha\) level of 5%, we reject the null hypothesis: our sample provides evidence of a linear relationship between Age and BMI.

E. Create and interpret a 95% confidence interval for the true population-level parameter describing the correlation between Age and BMI.
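As a cross-check on a permutation test like this one, base R's cor.test() runs the theory-based version of the same test, using a t distribution with n - 2 degrees of freedom. The sketch below uses simulated data (the data frame toy and its coefficients are hypothetical), so its numbers will not match the lab's.

```r
# Hypothetical simulated data with the same column names as the lab.
set.seed(42)
toy <- data.frame(Age = runif(50, 18, 80))
toy$BMI <- 22 + 0.1 * toy$Age + rnorm(50, sd = 6)

# Theory-based two-sided test of H0: rho = 0.
theory_test <- cor.test(toy$Age, toy$BMI, alternative = "two.sided")

theory_test$estimate   # sample correlation r
theory_test$parameter  # degrees of freedom, n - 2
theory_test$p.value    # two-sided p-value
```

When the sample is reasonably large and the scatterplot shows no strong outliers, the permutation and theory-based p-values are usually close.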

boot_dist <- sample1 %>% 
  specify(formula = BMI ~ Age) %>% 
  generate(reps = 500, type = "bootstrap") %>% 
  calculate(stat = "correlation")

rho_ci <- get_ci(boot_dist, level = 0.95,
                 type = "se", point_estimate = corr1)
rho_ci
## # A tibble: 1 x 2
##   lower upper
##   <dbl> <dbl>
## 1 0.135 0.585

We are 95% confident that the true correlation, \(\rho\), between Age and BMI in the NHANES population lies between 0.135 and 0.585.

F. Calculate the correlation for the entire NHANES population (not just your sample). Usually we can’t see this population value. Did you make the correct conclusion from your hypothesis test? Did your confidence interval cover the true correlation?
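A standard-error interval of this kind can be sketched by hand in base R. The sketch assumes the interval is built as point estimate ± z × (standard deviation of the bootstrap statistics), which is my understanding of what type = "se" does; the simulated data frame toy is hypothetical.

```r
# Hypothetical simulated data with the same column names as the lab.
set.seed(7)
toy <- data.frame(Age = runif(50, 18, 80))
toy$BMI <- 22 + 0.1 * toy$Age + rnorm(50, sd = 6)

# Point estimate from the original sample.
r_hat <- cor(toy$Age, toy$BMI)

# Bootstrap: resample rows with replacement and recompute the correlation.
boot_r <- replicate(1000, {
  idx <- sample(nrow(toy), replace = TRUE)
  cor(toy$Age[idx], toy$BMI[idx])
})

# Standard-error interval: estimate +/- z * bootstrap standard error.
se_boot <- sd(boot_r)
z <- qnorm(0.975)  # about 1.96 for a 95% interval
ci <- c(lower = r_hat - z * se_boot, upper = r_hat + z * se_boot)
ci
```

An interval built this way is always symmetric around the point estimate. The lab's interval has that property: both endpoints are about 0.225 away from 0.36.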

rho <- NHANES %>% 
  summarise(rho = cor(Age, BMI, use = "pairwise.complete"))
rho
## # A tibble: 1 x 1
##     rho
##   <dbl>
## 1 0.408

The population correlation is \(\rho\) = 0.408, which is contained in our 95% confidence interval. Because \(\rho\) is not zero, rejecting the null hypothesis was the correct conclusion.

2. Quantifying the relationship

Using the lm(Y ~ X, data = Dataset) function, fit a model that regresses your response variable on your explanatory variable. Use the sample of 50 people you took above as the dataset. Store the fitted model in a variable called model1. Use summary(model1) to examine the output of model1.

model1 <- lm(BMI ~ Age, data =  sample1)
summary(model1)
## 
## Call:
## lm(formula = BMI ~ Age, data = sample1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -10.028  -5.006  -1.319   3.801  15.999 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 22.20679    1.94629  11.410 2.84e-15 ***
## Age          0.11901    0.04386   2.714  0.00922 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.356 on 48 degrees of freedom
## Multiple R-squared:  0.133,  Adjusted R-squared:  0.1149 
## F-statistic: 7.363 on 1 and 48 DF,  p-value: 0.009217

A. Using tidy() from the broom package, examine the output of tidy(model1) and use this output to write out the equation of the estimated regression line.

mod <- tidy(model1) %>% mutate_if(is.numeric, round, 2)
mod
## # A tibble: 2 x 5
##   term        estimate std.error statistic p.value
##   <chr>          <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept)    22.2       1.95     11.4     0   
## 2 Age             0.12      0.04      2.71    0.01

The equation of the estimated line is: BMI = 22.21 + 0.12\(\cdot\)Age.
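The slope and intercept reported by lm() are tied directly to the correlation from Part 1: in simple linear regression the slope is r times sd(Y)/sd(X), the line passes through the point of means, and R-squared equals r squared. A short sketch on simulated data (the data frame toy is hypothetical) checks these identities.

```r
# Hypothetical simulated data with the same column names as the lab.
set.seed(3)
toy <- data.frame(Age = runif(50, 18, 80))
toy$BMI <- 22 + 0.1 * toy$Age + rnorm(50, sd = 6)

# Slope and intercept computed from summary statistics.
r <- cor(toy$Age, toy$BMI)
slope <- r * sd(toy$BMI) / sd(toy$Age)
intercept <- mean(toy$BMI) - slope * mean(toy$Age)

# The same quantities from lm().
fit <- lm(BMI ~ Age, data = toy)
coef(fit)
c(intercept, slope)

# In simple linear regression, R-squared is the squared correlation.
summary(fit)$r.squared
r^2
```

The lab's own output is consistent with this: the reported Multiple R-squared of 0.133 has a square root of about 0.36, the sample correlation found in Part 1.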

B. What is your estimate and interpretation of the SLOPE of your line?

Our estimated slope is 0.12. This means that a one-year increase in Age is associated with an average increase of 0.12 \(kg/m^2\) in BMI.

C. What is your estimate and interpretation of the INTERCEPT of your line?

Our estimated intercept is 22.21. This means that when Age is 0, the estimated BMI is 22.21 \(kg/m^2\). This is an extrapolation to newborns, for whom this line is not a sensible description of BMI, so we will not interpret the intercept.

D. Previously you used ggplot() + geom_point() to create a scatterplot of Y vs X. In the same call, we can add our regression line to the plot. Use geom_smooth(aes(x=___, y=___), method = "lm", se = FALSE) to add the line. The aes() statement indicates which variables define the line; it can be omitted when the aesthetics are already set in ggplot(). method = "lm" indicates the type of line you want to fit: we want a ‘linear model.’ The se = FALSE will plot just the line (se = TRUE adds a confidence band around the line that we will talk about later).

sample1 %>% 
  ggplot(aes(x=Age, y = BMI))+
  geom_point()+
  geom_smooth(method="lm", se=FALSE)
## `geom_smooth()` using formula 'y ~ x'

E. Using the equation of the line, what is the predicted BMI for a person who is 25 years old?

## Intercept + Slope*25
pred_25 <- mod[1,2] + mod[2,2]*25
pred_25
##   estimate
## 1    25.21

F. Using the equation of the line, what is the predicted BMI for a person who is 65 years old?

## Intercept + Slope*65
pred_65 <- mod[1,2] + mod[2,2]*65
pred_65
##   estimate
## 1    30.01
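The same kind of prediction can be obtained with predict(), which uses the unrounded coefficients stored in the fitted model. The sketch below uses simulated data (the objects toy, fit, and new_people are hypothetical) and confirms that predict() agrees with the intercept-plus-slope calculation.

```r
# Hypothetical simulated data and model, mirroring the lab's model1.
set.seed(5)
toy <- data.frame(Age = runif(50, 18, 80))
toy$BMI <- 22 + 0.1 * toy$Age + rnorm(50, sd = 6)
fit <- lm(BMI ~ Age, data = toy)

# New observations must use the same column name as the explanatory variable.
new_people <- data.frame(Age = c(25, 65))
pred <- predict(fit, newdata = new_people)

# The same predictions by hand: intercept + slope * Age.
by_hand <- coef(fit)[["(Intercept)"]] + coef(fit)[["Age"]] * new_people$Age
```

In this lab the equivalent call would be predict(model1, newdata = data.frame(Age = c(25, 65))). With the unrounded coefficients from summary(model1), the predictions are about 25.18 and 29.94, slightly different from the 25.21 and 30.01 obtained with rounded coefficients.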

G. Formally test, at the 5% level, whether or not Age is associated with BMI. State your hypotheses, use the tidy(model1) output, and explain your conclusion.

Our line is \[BMI = \beta_0 + \beta_1 \cdot Age,\] so we will test whether the slope associated with Age is equal to zero. \[\begin{aligned} H_0&: \beta_1 = 0 \\ H_A&: \beta_1 \neq 0 \end{aligned}\] Based on the output of our model, the estimated slope is 0.12 and the corresponding p-value is 0.01 (0.0092 before rounding). Because our p-value is less than our pre-specified \(\alpha\) level of 5%, we reject the null hypothesis and conclude that Age is linearly associated with BMI.
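The p-value in the regression output comes from a t-test: the slope estimate divided by its standard error is compared with a t distribution on n - 2 degrees of freedom. The sketch below recomputes it from the values printed by summary(model1); the variable names are hypothetical, and the confidence interval at the end is an extra step not asked for in the question.

```r
# Values copied from the summary(model1) output in this lab.
estimate <- 0.11901
std_error <- 0.04386
df_resid <- 48  # n - 2 = 50 - 2

# Test statistic and two-sided p-value for H0: beta_1 = 0.
t_stat <- estimate / std_error
p_value <- 2 * pt(-abs(t_stat), df = df_resid)

# 95% confidence interval for the slope (an addition, not part of the question).
ci_slope <- estimate + c(-1, 1) * qt(0.975, df = df_resid) * std_error

t_stat
p_value
ci_slope
```

This reproduces the t value of about 2.71 and the p-value of about 0.009 shown in the output, up to rounding of the copied inputs. The interval for the slope excludes zero, which agrees with rejecting the null hypothesis at the 5% level.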