Lab 1 - ST551 Fall 2017

General Info

Labs consist of a set of exercises that introduce R skills and give you practice with them. Each lab is a document to work through: you are expected to run the code, then edit it to answer the questions (generally in bold). I suggest you create a new script for each lab and keep the code you run there, along with any notes you add.

You can also download the lab as an RMarkdown document, which you can open in R, and run code directly.

Use your neighbors! Either they know the answer and can help, or they don’t and that’s a sign to call the TA over and ask a question.

Goals for today’s lab

Some initial exposure to data import, exploration and visualization.

Installing and loading packages

For this lab you’ll need some of the tidyverse packages and the Sleuth3 package. To use a package you need to install it (once per computer) and then load it (in every session where you use it).

Install the tidyverse and Sleuth3 packages:

install.packages(c("tidyverse", "Sleuth3"))

Then load them:

library(tidyverse)
library(Sleuth3)
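
If you are not sure whether a package is already installed, a small check avoids reinstalling it. The following is a minimal sketch rather than part of the lab: missing_packages() is a hypothetical helper name, and it relies on base R’s requireNamespace(), which returns FALSE when a package cannot be found.

```r
# Hypothetical helper (not from the lab): return the packages in `pkgs`
# that are not installed on this computer.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(c("tidyverse", "Sleuth3"))
to_install  # character(0) means both packages are already installed

# Uncomment to install only what is missing:
# if (length(to_install) > 0) install.packages(to_install)
```

The install line is left commented out so that running the check never triggers a download by accident.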

Data in R

Data Import

There are three common ways you’ll obtain data in this class:

importing from a file
loading from a package
defining yourself

You’ll see examples of the first two today, and the third at a later date.

Loading from a file

I recommend the readr package for importing data from flat files (e.g. CSV or TSV). readr is part of the tidyverse, and is loaded when tidyverse is loaded.

To start you’ll download the class_data.csv file from the class website. You could navigate to the file in the web browser, “Save As”, then locate it on your hard drive, or you can get R to do all that for you:

download.file(url = "http://st551.cwick.co.nz/data/class_data.csv",
  destfile = "class_data.csv")

The destfile argument gives the path where the downloaded file is saved; in this case it is relative to your working directory. Take a look in the Files pane in RStudio: you should see the file there.
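
If the file does not show up where you expect, it can help to check what “relative to your working directory” resolves to. This is a minimal sketch using only base R functions; the variable names dest and full_path are just for illustration.

```r
dest <- "class_data.csv"

getwd()                                # the folder that relative paths start from
full_path <- file.path(getwd(), dest)  # where a relative `dest` points
full_path
file.exists(dest)                      # TRUE once the file has been downloaded
```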

Then to load the data into R, call read_csv() from the readr package:

class_data <- read_csv("class_data.csv")

You’ll see a message about how the columns in the data were interpreted, and you can take a look at the data by simply typing its name:

class_data
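
The message appears because read_csv() guesses a type for each column. As a hedged illustration of what that guessing replaces, the sketch below writes a tiny made-up CSV to a temporary file and reads it back with the column types stated explicitly through the col_types argument. The file contents and the column names name and score are hypothetical and have nothing to do with class_data.csv.

```r
library(readr)

# Made-up file for illustration only (not the class data).
tmp <- tempfile(fileext = ".csv")
writeLines(c("name,score", "a,1.5", "b,2.5"), tmp)

# State the column types yourself instead of letting readr guess.
toy <- read_csv(tmp,
  col_types = cols(
    name = col_character(),
    score = col_double()
  ))
toy
```

For this lab the guessed types are fine; spelling them out mostly matters when a guess turns out to be wrong.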

How many observations are in class_data, how many variables?
What kind of variable is commute_type? What kind of variable is commute_time?

Loading data from a package

Packages often provide data (or only provide data!). Accessing the data is usually just a matter of loading the package, then knowing the name of the data object.

You can find the data provided with a package using the data() function. For example to list all data provided with the Sleuth3 package, try:

data(package = "Sleuth3")

which should pop up another tab listing the data sets, with the name on the left and a short description on the right. To see a data set just type its name, e.g.

case0102

When data is included in a package you’ll also be able to get more info about the data by looking at its help page:

?case0102

What do the observations in case0102 represent?
Why did case0102 print in a different format to class_data?

Basic Summaries

There are some functions you’ll use a lot to inspect R objects to see what they are. str(), short for structure, prints information about the structure of the object,

str(case0102)

head() will print out the first few elements,

head(case0102)

and names() will print out the named elements of the object.

names(case0102)

You can access named elements using the $ (being careful to match case), e.g.

case0102$Sex

returns the Sex column of the data.
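
To see why matching case matters, here is a small sketch on a made-up data frame (the values and the id column are hypothetical; only the column name Sex mirrors case0102). For a plain data frame, asking for a name with the wrong case quietly returns NULL rather than raising an error, so the mistake is easy to miss.

```r
# Made-up data for illustration only.
toy <- data.frame(
  Sex = c("Female", "Male", "Female"),
  id = 1:3
)

toy$Sex  # the three values in the Sex column
toy$sex  # NULL: there is no column called "sex"
```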

If you want numerical summaries of a variable, one option is to extract the column, then apply a summary function. For example, we could find the mean() of the Salary column in case0102 with:

salaries <- case0102$Salary
mean(salaries)

# or in one go
mean(case0102$Salary)

I prefer using dplyr’s summarize() because it more easily extends to summaries by group.

summarize() takes a data frame and any number of expressions separated by commas, each computing a different summary (the only restriction being that each must return a single value). The same mean as above can be obtained with:

summarize(.data = case0102, mean(Salary))

If we preface the summary with name =, the output has named columns, e.g.

summarize(.data = case0102, avg_salary =  mean(Salary))

We can add further expressions to calculate more summaries at once:

summarize(.data = case0102, 
  avg_salary =  mean(Salary),
  sd_salary =  sd(Salary),
  n = n())

Can you add another summary to include the median() Salary?

The real advantage of summarize() comes from combining it with a grouping. Groupings are created with group_by(), to which we pass the data and the name of the column that defines the groups:

case0102_grouped <- group_by(.data = case0102, Sex)
case0102_grouped

The data hasn’t changed, but if we use this new grouped data with our previous summarize() statement, we get the summary for each group:

summarize(.data = case0102_grouped, 
  avg_salary =  mean(Salary),
  sd_salary =  sd(Salary),
  n = n())

Which group has the higher average salary?
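
If it is not obvious what the grouping changes, it can help to repeat the same two steps on data small enough to check by hand. The sketch below uses a made-up data frame; the column names group and value and all the numbers are hypothetical, not taken from case0102.

```r
library(dplyr)

# Made-up data for illustration only.
toy <- data.frame(
  group = c("a", "a", "b", "b"),
  value = c(10, 20, 30, 50)
)

toy_grouped <- group_by(.data = toy, group)
toy_summary <- summarize(.data = toy_grouped,
  avg_value = mean(value),
  n = n())
toy_summary  # one row per group: a (mean 15, n 2) and b (mean 40, n 2)
```

Without the group_by() step, the same summarize() call would return a single row covering all four observations.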

Fill in the blanks to find the median commute_time by commute_type in the class_data.

class_data_by_type <- group_by(.data = class_data, ___)

summarize(.data = class_data_by_type,
  med_time = ___(___))

Add another line to the summarize() statement to include the number of observations in each commute_type.

Basic Plots

Recall your reading for this week’s homework, which introduces the plotting template:

ggplot(data = <DATA>) + 
  <GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))

For now, two very useful plots for examining distributions will be a histogram for continuous variables and a bar plot for discrete variables, corresponding to the <GEOM_FUNCTION>s geom_histogram() and geom_bar() respectively. In both cases the <MAPPINGS> part of the template will simply map x to the variable of interest.

So, for example, a histogram of the Salary column in case0102 would be created with:

ggplot(data = case0102) + 
  geom_histogram(aes(x = Salary))

Notice the message about choosing a better binwidth. You can do so by specifying binwidth in the geom_histogram() call, e.g.

ggplot(data = case0102) + 
  geom_histogram(aes(x = Salary), binwidth = 100)

sets the width of each bin of the histogram on the x-axis to 100 ($100 in this case).
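
To get a feel for how binwidth changes a histogram, you can compare the bins that ggplot2 computes for two different widths. This is a minimal sketch on made-up values (not the case0102 salaries); it uses ggplot2’s layer_data() helper to pull out the computed bins, one row per bar.

```r
library(ggplot2)

# Made-up values for illustration only.
toy <- data.frame(x = c(4800, 5000, 5100, 5400, 6000, 6300))

narrow <- ggplot(data = toy) +
  geom_histogram(aes(x = x), binwidth = 100)
wide <- ggplot(data = toy) +
  geom_histogram(aes(x = x), binwidth = 500)

# Each row of the layer data is one bar of the histogram.
narrow_bins <- layer_data(narrow)
wide_bins <- layer_data(wide)

nrow(narrow_bins)  # many narrow bars
nrow(wide_bins)    # fewer, wider bars
```

Both versions count the same six observations; only the way they are split into bars differs.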

To look at Sex we’d want a bar plot, so we need to change the <GEOM_FUNCTION> and the <MAPPINGS>:

ggplot(data = case0102) + 
  geom_bar(aes(x = Sex))
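
The height of each bar is simply the number of rows in that category. If you want those numbers as a table rather than a picture, dplyr’s count() gives the same tallies. The sketch below uses a made-up data frame with a hypothetical category column.

```r
library(dplyr)

# Made-up data for illustration only.
toy <- data.frame(category = c("a", "b", "a", "c", "a"))

counts <- count(toy, category)
counts  # a: 3, b: 1, c: 1, the heights geom_bar() would draw
```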

Look at a histogram of the commute times in class_data. Describe any interesting features.
Look at a bar plot of the commute types in class_data. Describe any interesting features.

Lab 1 RMarkdown