0. Setting your working directory

Before you begin your R session, it’s always a good idea to set your working directory. This makes it easier to get data into and out of R, and you will know where your results are being saved. You can use the following function to find out your current working directory:

getwd()

You can set your working directory to whichever folder you would like. I recommend a dedicated folder for this course, perhaps saved on your Desktop or in your Documents. You will need to create the folder, then either navigate to it using the “Set Working Directory” option in the “Session” drop-down menu, or use the setwd() function and type in the directory manually, in quotes:

setwd("~/Desktop/PHTH2210")
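
As a minimal sketch, the folder can also be created and selected entirely from the R console. The location below is a stand-in (a temporary directory, so the example is safe to run anywhere), and the names course_dir and old_dir are hypothetical; for the course you would swap in your own path, such as "~/Desktop/PHTH2210".

```r
# Hypothetical location: a temporary folder so this sketch runs anywhere.
# For the course, replace it with your own path, e.g. "~/Desktop/PHTH2210".
course_dir <- file.path(tempdir(), "PHTH2210")

# Create the folder if it does not already exist
if (!dir.exists(course_dir)) {
  dir.create(course_dir, recursive = TRUE)
}

# setwd() invisibly returns the previous working directory
old_dir <- setwd(course_dir)
getwd()   # should now point at the course folder

# Return to where we started (optional)
setwd(old_dir)
```

Saving the previous directory in old_dir is only a convenience for this demonstration; in a real session you would normally stay in the course folder.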

For this lab we will again be using the ggplot2 and gapminder packages. Use require(ggplot2) and require(gapminder) to load these packages. If you receive a warning or an error, you may need to reinstall the packages from the bottom right pane of the RStudio window (Packages -> Install -> “ggplot2” and/or “gapminder”).
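
If you prefer to stay in the console, the following sketch installs each package only when it is missing and then loads it. The CRAN mirror URL is an assumption; any mirror should work, and the install step needs an internet connection.

```r
# Install each package only if it is missing, then load it.
# The repos URL is an assumption; any CRAN mirror should work.
for (pkg in c("ggplot2", "gapminder")) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg, repos = "https://cloud.r-project.org")
  }
  library(pkg, character.only = TRUE)
}
```

Unlike require(), library() stops with an error when a package cannot be loaded, which makes problems harder to miss.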

1. Line plots

Line plots are especially useful for looking at trends over time. Let’s examine how life expectancy changed over time for two different countries, China and the United States.

Here’s some clunky code to subset out the US and China. We will soon learn much more intuitive ways to do this:

gapminder_ChinaUS <- gapminder[gapminder$country == "United States" | gapminder$country == "China", ]
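
One slightly tidier base R alternative, offered only as a sketch and not necessarily the approach taught later in the course, uses the %in% operator. The object name gapminder_ChinaUS_alt is hypothetical, chosen so that it does not overwrite the dataset created above.

```r
library(gapminder)

# Keep rows whose country is one of the two listed
gapminder_ChinaUS_alt <- gapminder[gapminder$country %in% c("United States", "China"), ]

# Quick checks: which countries remain, and how many rows?
unique(as.character(gapminder_ChinaUS_alt$country))
nrow(gapminder_ChinaUS_alt)   # 12 survey years x 2 countries = 24
```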

Next, create the line plot using this NEW dataset, gapminder_ChinaUS. You’ll need aesthetics for x (year), y (lifeExp), and color (country):

ggplot(data = gapminder_ChinaUS) + 
  geom_line(mapping = aes(x = year, y = lifeExp, color = country)) +
  labs(title = "Life expectancy over time", x = "Year", y = "Life Expectancy")

You should always title your plot and label your x and y axes using + labs( title = "New Title", x = "X label", y = "Y label").

2. Histograms

We have seen histograms in class. As with scatterplots, we can add facets to display many histograms at once. Let’s create a histogram of life expectancies by year:

ggplot(data = gapminder, mapping = aes(x = lifeExp)) +
  geom_histogram(binwidth = 5, color = "white") + facet_grid(.~year)

This is probably hard to read, as facet_grid(.~year) attempts to put all of the graphs in a single row. We can “wrap” these graphs onto new rows by using facet_wrap() instead. Use the + facet_wrap(.~) command to recreate the following plot.

ggplot(data = gapminder, mapping = aes(x = lifeExp)) +
  geom_histogram(binwidth = 5, color = "white") +
  facet_wrap(.~year) + 
  labs(title = "Distribution of life expectancies by year", x = "Life Expectancy", y = "Count")
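
If the default layout is not what you want, facet_wrap() also accepts ncol (or nrow) to control the shape of the grid. The sketch below assumes 4 columns purely for illustration, and the object name p_wrap is hypothetical.

```r
library(ggplot2)
library(gapminder)

p_wrap <- ggplot(data = gapminder, mapping = aes(x = lifeExp)) +
  geom_histogram(binwidth = 5, color = "white") +
  facet_wrap(. ~ year, ncol = 4) +   # 12 years -> 3 rows of 4 panels
  labs(title = "Distribution of life expectancies by year",
       x = "Life Expectancy", y = "Count")

p_wrap
```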

We can also facet on multiple variables, by splitting up into rows and columns. With the command facet_grid(rows ~ columns), we replace rows with whichever variable we want across the rows and columns with whichever variable we want on the columns.

Try faceting on both year and continent, and determine which should be in the rows and which should be in the columns to tell a better story.

ggplot(data = gapminder, mapping = aes(x = lifeExp)) +
  geom_histogram(binwidth = 5, color = "white") + facet_grid(year ~ continent)

3. Boxplots

Boxplots are created using the geom_boxplot() command. Think about what aesthetics you might need. For a single boxplot, there is just one aesthetic, generally given as the y coordinate. Starting with ggplot(data = gapminder, mapping = aes(y = lifeExp)), create a boxplot with an appropriately labeled y-axis.
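
One possible answer to the exercise above, as a sketch: the title, the axis label text, and the object name p_box are illustrative choices, not required wording.

```r
library(ggplot2)
library(gapminder)

# A single boxplot needs only the y aesthetic
p_box <- ggplot(data = gapminder, mapping = aes(y = lifeExp)) +
  geom_boxplot() +
  labs(title = "Distribution of life expectancy", y = "Life expectancy (years)")

p_box
```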

Just as with histograms, we can create boxplots for a single variable AT different levels of a second variable (like faceting). With a boxplot, we just need to specify that second variable as another aesthetic, generally the x-axis.

The only important thing to consider is that the second variable must be a grouping variable, not a continuous or numerical one. So, let’s create a boxplot for lifeExp at each year. To tell R that year should be treated as a “group” and not a “number”, wrap it in factor().

ggplot(data = gapminder, mapping = aes(x = factor(year), y = lifeExp)) +
  geom_boxplot() + 
  labs(title = "Life Expectancies over time", x = "Year", y = "Life expectancy")
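
To see what factor() changes, you can compare the class of year before and after wrapping it. This is a small sketch, and the name year_group is hypothetical.

```r
library(gapminder)

class(gapminder$year)            # "integer": R sees year as a number
year_group <- factor(gapminder$year)
class(year_group)                # "factor": now a grouping variable
nlevels(year_group)              # 12 distinct years, so 12 boxplots
levels(year_group)[1:3]          # the first few groups
```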

4. Bar plots

Bar plots are created using geom_bar(). Let’s recreate the bar plot from class to see how many countries there are per continent in 1952.

First we need to reduce our dataset to just the year 1952.

gapminder_1952 <- gapminder[gapminder$year == 1952,]

Use View(gapminder_1952) to ensure that you’ve created the dataset correctly.
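
Alongside View(), a few quick console checks can confirm the subset. This is only a sketch of one way to check; it relies on the gapminder data containing 142 countries.

```r
library(gapminder)

gapminder_1952 <- gapminder[gapminder$year == 1952, ]

unique(gapminder_1952$year)       # should show only 1952
nrow(gapminder_1952)              # one row per country: 142
table(gapminder_1952$continent)   # the counts the bar plot will display
```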

Next, create the bar plot:

ggplot(gapminder_1952) + geom_bar(mapping = aes(x = continent, fill = continent), stat = "count")

Next, let’s fill the bars according to the life expectancies of each country. Let’s create 2 categories: Low and High, designated below or above 50 years.

## Create a new variable called "LifeExp_Cat"
## We will initialize it with 0s and make it the
##        same number of rows as gapminder_1952
LifeExp_Cat <- rep(0, times = nrow(gapminder_1952))
## Then overwrite it with TRUE/FALSE: is each country's
##        life expectancy less than or equal to 50?
LifeExp_Cat <- gapminder_1952$lifeExp <= 50

Look at LifeExp_Cat. This is a vector the same length as the number of countries in gapminder_1952, but each entry is TRUE or FALSE in answer to the question “Is the corresponding country’s life expectancy less than or equal to 50?”
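
A short sketch of how you might look at the vector in the console. It rebuilds the objects from scratch so that it can be run on its own.

```r
library(gapminder)

gapminder_1952 <- gapminder[gapminder$year == 1952, ]
LifeExp_Cat <- gapminder_1952$lifeExp <= 50

length(LifeExp_Cat) == nrow(gapminder_1952)   # should be TRUE
head(LifeExp_Cat)                             # first few TRUE/FALSE entries
table(LifeExp_Cat)   # counts of FALSE (above 50) and TRUE (50 or below)
```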

Now, add this newly created variable to the gapminder_1952 dataset by using the cbind() function:

gapminder_1952 <- cbind(gapminder_1952, LifeExp_Cat)

Use the View() function to make sure your new dataset looks as expected.

Next, recreate the bar plot from above, but assign colors to the bars such that we get a sense of how many countries in each continent have low/high life expectancies. Your final plot will resemble this one:

ggplot(gapminder_1952) +
  geom_bar(mapping = aes(x = continent, fill = LifeExp_Cat), stat = "count") +
  scale_fill_discrete(name = "Life Expectancy", labels = c("High", "Low")) +
  labs(title = "Life expectancies within continents", x = "Continent", y = "Number of countries")

You can change the wording/labeling of the legend using + scale_fill_discrete(name = "Title", labels = c("Top","Bottom")). “Scales” control how data are mapped to aesthetics. They are named “scale_” followed by the name of the aesthetic they represent (here “fill”), followed by "_discrete" or "_continuous" depending on the type of variable they apply to. So here, it is scale_fill_discrete(). Just make sure you list the “labels” in the correct order: they are matched to the values of the variable in sorted order, so for a TRUE/FALSE variable the first label goes with FALSE and the second with TRUE.

ggplot(gapminder_1952) +
  geom_bar(mapping = aes(x = continent, fill = LifeExp_Cat), stat = "count") +
  scale_fill_discrete(name = "Life Expectancy", labels = c("High", "Low"))
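
Because the label order is easy to get backwards, one alternative is to attach the labels to the data itself, so the legend cannot be mismatched. This is a sketch of a different technique from the cbind() approach above: it stores LifeExp_Cat as a labeled factor column. The object name p_bar is hypothetical.

```r
library(ggplot2)
library(gapminder)

gapminder_1952 <- gapminder[gapminder$year == 1952, ]

# Attach each label to its value explicitly: FALSE -> "High", TRUE -> "Low"
gapminder_1952$LifeExp_Cat <- factor(gapminder_1952$lifeExp <= 50,
                                     levels = c(FALSE, TRUE),
                                     labels = c("High", "Low"))

# No labels argument is needed: the legend uses the factor's own labels
p_bar <- ggplot(gapminder_1952) +
  geom_bar(mapping = aes(x = continent, fill = LifeExp_Cat), stat = "count") +
  scale_fill_discrete(name = "Life Expectancy") +
  labs(title = "Life expectancies within continents",
       x = "Continent", y = "Number of countries")

p_bar
```

With this version the legend text comes from the factor levels, so only the legend title still has to be set in scale_fill_discrete().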