General Info
Labs consist of a set of exercises that introduce R skills and give you practice with them. Each lab is a document for you to work through: you are expected to run the code, then edit it to answer the questions (generally in bold). I suggest you create a new script for each lab and keep the code you run there, along with any notes you add.
You can also download the lab as an RMarkdown document, which you can open in R to run the code directly.
Use your neighbors! Either they know the answer and can help, or they don’t and that’s a sign to call the TA over and ask a question.
Goals for today’s lab
- Some initial exposure to data import, exploration and visualization.
Installing and loading packages
For this lab you’ll need some of the tidyverse packages and the Sleuth3 package. To use a package you need to install it (once on any computer) and then load it (every session you use it).
Install the tidyverse and Sleuth3 packages:
install.packages(c("tidyverse", "Sleuth3"))
Then load them:
library(tidyverse)
library(Sleuth3)
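If you aren’t sure whether a package is already installed, you can check before installing. The sketch below is one way to do it in base R; missing_packages() is a helper name made up for this example, not something provided by R or the tidyverse.

```r
# Hypothetical helper (not part of any package): returns the names of
# the packages in `pkgs` that are not installed on this computer.
missing_packages <- function(pkgs) {
  # requireNamespace() returns TRUE if the package can be loaded and
  # FALSE otherwise; quietly = TRUE suppresses the startup messages.
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Which of the packages for this lab still need install.packages()?
missing_packages(c("tidyverse", "Sleuth3"))
```

An empty result, character(0), means both packages are ready to load with library().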
Data in R
Data Import
There are three common ways you’ll obtain data in this class:
- importing from a file
- loading from a package
- defining yourself
You’ll see examples of the first two today, and the third at a later date.
Loading from a file
I recommend the readr package for importing data from flat files (e.g. CSV, TSV). readr is part of the tidyverse, and is loaded when tidyverse is loaded.
To start, you’ll download the class_data.csv file from the class website. You could navigate to the file in your web browser, choose “Save As”, then locate it on your hard drive, or you can get R to do all that for you:
download.file(url = "http://st551.cwick.co.nz/data/class_data.csv",
              destfile = "class_data.csv")
The destfile argument gives the path where the downloaded file is saved; in this case it is relative to your working directory. Take a look in the Files pane in RStudio: you should see the file there.
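If the file doesn’t show up, a quick check from the console can help. This is a minimal sketch using base R functions; it assumes you used the same destfile as above.

```r
# The working directory is where R resolves relative paths
# such as "class_data.csv".
getwd()

# TRUE if the download put the file in the working directory.
downloaded <- file.exists("class_data.csv")
downloaded

# The full path R will use for the relative file name.
file.path(getwd(), "class_data.csv")
```

If file.exists() returns FALSE, the usual culprits are a failed download or a working directory that isn’t where you expected.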
Then, to load the data into R, call read_csv() from the readr package:
class_data <- read_csv("class_data.csv")
You’ll see a message about how the columns in the data were interpreted, and you can take a look at the data by simply typing its name:
class_data
- How many observations are in class_data? How many variables?
- What kind of variable is commute_type? What kind of variable is commute_time?
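To practice reading the printed output, it can help to try read_csv() on data small enough to check by eye. The sketch below reads three made-up rows; the values are invented for illustration and are not from class_data.csv, and wrapping the text in I() to read it as literal data assumes readr 2.0 or later.

```r
library(readr)

# Three invented rows (NOT the real class data), written as CSV text.
toy_csv <- "commute_type,commute_time
bike,10
car,25
walk,5
"

# I() tells read_csv() to treat the string as data, not as a file path.
toy <- read_csv(I(toy_csv))
toy

nrow(toy)  # number of observations (rows)
ncol(toy)  # number of variables (columns)
spec(toy)  # the column types read_csv() guessed
```

Compare the column types reported by spec() with the values you typed in: text columns and number columns are stored differently.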
Loading data from a package
Packages often provide data (or only provide data!). Accessing the data is usually just a matter of loading the package, then knowing the name of the data object.
You can find the data provided with a package using the data() function. For example, to list all the data provided with the Sleuth3 package, try:
data(package = "Sleuth3")
which should pop up another tab with a listing of the data, with the name on the left and a short description on the right. To see a data set, just type its name, e.g.
case0102
When data is included in a package you’ll also be able to get more info about the data by looking at its help page:
?case0102
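If you’d rather see the listing in the console than in a separate tab, you can work with the object that data() returns. This sketch uses the datasets package, which ships with R, as a stand-in so that it runs even before Sleuth3 is installed; the same pattern should work with package = "Sleuth3".

```r
# data() returns a list; its `results` element is a character matrix
# with one row per data set.
listing <- data(package = "datasets")$results

# Keep the data set name and its short description.
head(listing[, c("Item", "Title")])
```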
- What do the observations in case0102 represent?
- Why did case0102 print in a different format to class_data?
Basic Summaries
There are some functions you’ll use a lot to inspect R objects and see what they are. str(), short for structure, prints information about the structure of the object,
str(case0102)
head() will print out the first few elements,
head(case0102)
and names() will print out the names of the object’s elements.
names(case0102)
You can access named elements using the $ operator (being careful to match case), e.g.
case0102$Sex
returns the Sex column of the data.
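Matching case matters because $ returns NULL, rather than an error, when a data frame has no column with that name. A small sketch using mtcars, a data set that ships with R and stands in here for case0102:

```r
# mtcars has a column called mpg (all lower case).
head(mtcars$mpg)

# The wrong case does not match any column, so the result is NULL.
mtcars$MPG

# [[ ]] with a quoted name is an equivalent way to extract a column.
identical(mtcars$mpg, mtcars[["mpg"]])
```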
If you want numerical summaries of a variable, one option is to extract the column, then apply a summary function. For example, we could find the mean() of the Salary column in case0102 with:
salaries <- case0102$Salary
mean(salaries)
# or in one go
mean(case0102$Salary)
I prefer using dplyr’s summarize() because it more easily extends to summaries by group.
summarize() takes a data frame and any number of expressions, separated by commas, each for a different summary (the only restriction being that each must return a single value). The same mean as above can be obtained with:
summarize(.data = case0102, mean(Salary))
If we preface the summary with name =, the output has named columns, e.g.
summarize(.data = case0102, avg_salary = mean(Salary))
We can add further expressions to calculate more summaries at once:
summarize(.data = case0102,
          avg_salary = mean(Salary),
          sd_salary = sd(Salary),
          n = n())
Can you add another summary to include the median() of Salary?
The real advantage of summarize() comes when you combine it with a grouping. Groupings are created with group_by(), where we pass the data and the name of the column that forms the groups:
case0102_grouped <- group_by(.data = case0102, Sex)
case0102_grouped
The data hasn’t changed, but if we use this new grouped data with our previous summarize() statement, we get the summary for each group:
summarize(.data = case0102_grouped,
          avg_salary = mean(Salary),
          sd_salary = sd(Salary),
          n = n())
- Which group has the higher average salary?
- Fill in the blanks to find the median commute_time by commute_type in class_data.
  class_data_by_type <- group_by(.data = class_data, ___)
  summarize(.data = class_data_by_type, med_time = ___(___))
- Add another line to the summarize() statement to include the number of observations in each commute_type.
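One way to convince yourself what a grouped summary computes is to reproduce one group by hand. The sketch below does this with mtcars, a data set that ships with R (used here so as not to give away the answers for case0102 or class_data); cyl and mpg are columns of mtcars, and filter() is a dplyr function, not covered in this lab, that keeps only the rows matching a condition.

```r
library(dplyr)

# Mean mpg and number of cars for each value of cyl.
mtcars_grouped <- group_by(.data = mtcars, cyl)
by_cyl <- summarize(.data = mtcars_grouped,
                    avg_mpg = mean(mpg),
                    n = n())
by_cyl

# The same mean for one group, computed by hand:
# keep only the 4-cylinder cars, then average their mpg.
four_cyl <- filter(.data = mtcars, cyl == 4)
mean(four_cyl$mpg)
```

The value from the last line should match the avg_mpg entry in the row of by_cyl where cyl is 4.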
Basic Plots
Recall your reading for this week’s homework, which introduces the plotting template:
ggplot(data = <DATA>) +
  <GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))
For now, two very useful plots for examining distributions will be a histogram for continuous variables and a bar plot for discrete variables, corresponding to the <GEOM_FUNCTION>s geom_histogram() and geom_bar() respectively. In both cases the <MAPPINGS> part of the template will simply map x to the variable of interest.
So, for example, a histogram of the Salary column in case0102 would be created with:
ggplot(data = case0102) +
  geom_histogram(aes(x = Salary))
Notice the message about choosing a better binwidth. You can do so by specifying a binwidth in the geom_histogram() call, e.g.
ggplot(data = case0102) +
  geom_histogram(aes(x = Salary), binwidth = 100)
sets the width of each bin of the histogram on the x-axis to 100 ($100 in this case).
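A rough way to judge a binwidth is to divide the range of the data by it, which gives approximately the number of bars you should expect. The sketch below uses the mpg column of mtcars (a data set that ships with R) as a stand-in; the binwidth of 2.5 is an arbitrary choice for illustration.

```r
library(ggplot2)

binwidth <- 2.5

# Range of the data divided by the bin width: roughly how many
# bins the histogram will need to cover the data.
approx_bins <- diff(range(mtcars$mpg)) / binwidth
approx_bins

# The corresponding histogram.
ggplot(data = mtcars) +
  geom_histogram(mapping = aes(x = mpg), binwidth = binwidth)
```

A handful of bins tends to hide the shape of the distribution, while hundreds of bins mostly show noise, so it is worth trying a few values.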
To look at Sex we’d want a bar plot, so we need to change the <GEOM_FUNCTION> and the <MAPPINGS>:
ggplot(data = case0102) +
  geom_bar(aes(x = Sex))
- Look at a histogram of the commute times in class_data. Describe any interesting features.
- Look at a bar plot of the commute types in class_data. Describe any interesting features.