---
title: Lab 1
author: Charlotte Wickham
date: '2017-09-26'
slug: lab-1
draft: false
---
```{r setup, include = FALSE}
knitr::opts_chunk$set(echo = TRUE, results = "hide",
  message = FALSE, fig.keep = "none")
```
## General Info
Labs consist of a set of exercises that you'll complete to introduce and practice R skills. They are set up as a document for you to work through: you are expected to run the code, then edit it to answer the questions (generally in bold). My suggestion is that you create a new script for each lab, and keep the code you run there, along with any notes you add.
You can also download the lab as an RMarkdown document, which you can open in R, and run code directly.
Use your neighbors! Either they know the answer and can help, or they don't and that's a sign to call the TA over and ask a question.
## Goals for today's lab
* Some initial exposure to data import, exploration and visualization.
## Installing and loading packages
For this lab you'll need some of the `tidyverse` packages and the `Sleuth3` package. To use a package you need to install it (**once** on any computer) and then load it (**every** session you use it).
Install the `tidyverse` and `Sleuth3` packages:
```{r, eval = FALSE}
# Package names must be passed together as one character vector
install.packages(c("tidyverse", "Sleuth3"))
```
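If you re-run your script often, a small guard can skip packages that are already installed. This is only a sketch: the variable names `needed` and `to_install` are made up for illustration.

```{r, eval = FALSE}
# Packages this lab relies on
needed <- c("tidyverse", "Sleuth3")

# Keep only those not yet present in your library
to_install <- setdiff(needed, rownames(installed.packages()))

# Install whatever is missing (skipped when nothing is missing)
if (length(to_install) > 0) {
  install.packages(to_install)
}
```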
Then load them:
```{r}
library(tidyverse)
library(Sleuth3)
```
## Data in R
### Data Import
There are three common ways you'll obtain data in this class:
* importing from a file
* loading from a package
* defining yourself
You'll see examples of the first two today, and the third at a later date.
#### Loading from a file
I recommend the `readr` package for importing data from flat files (e.g. CSV, TSV etc.). `readr` is a part of the tidyverse, and is loaded when `tidyverse` is loaded.
To start you'll download the [`class_data.csv`](http://st551.cwick.co.nz/data/class_data.csv) file from the class website. You could navigate to the file in the web browser, "Save As", then locate it on your hard drive, or you can get R to do all that for you:
```{r}
download.file(url = "http://st551.cwick.co.nz/data/class_data.csv",
  destfile = "class_data.csv")
```
The argument `destfile` gives the path where the downloaded file is saved, and in this case it is relative to your working directory. Take a look in the **Files** pane in RStudio; you should see the file there.
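If the file doesn't show up, it can help to check where R thinks it is working. A minimal sketch, assuming the download was run with the relative path `"class_data.csv"`:

```{r}
# Where is R currently working? Relative paths start from here.
getwd()

# TRUE if a file called class_data.csv exists in that directory
file.exists("class_data.csv")
```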
Then to load the data into R, call [`read_csv()`](https://www.rdocumentation.org/packages/readr/topics/read_csv) from the `readr` package:
```{r}
class_data <- read_csv("class_data.csv")
```
You'll see a message about how the columns in the data were interpreted, and you can take a look at the data by simply typing its name:
```{r}
class_data
```
* **How many observations are in `class_data`, how many variables?**
* **What kind of variable is `commute_type`? What kind of variable is `commute_time`?**
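To see how `read_csv()` decides on column types without touching the class data, here is a small sketch using a made-up CSV. The file contents and the names `toy_path`, `toy`, `name` and `height_cm` are invented for illustration.

```{r}
library(readr)

# A tiny made-up CSV written to a temporary file (not the class data)
toy_path <- tempfile(fileext = ".csv")
writeLines(c("name,height_cm", "Ana,162.5", "Bo,180"), toy_path)

# read_csv() guesses each column's type from its contents
toy <- read_csv(toy_path)

# spec() reports the column specification that was guessed
spec(toy)
```

The same `spec()` call should work on `class_data` if you want to revisit the message printed at import.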
#### Loading data from a package
Packages often provide data (or only provide data!). Accessing the data is usually just a matter of loading the package, then knowing the name of the data object.
You can find the data provided with a package using the [`data()`](https://www.rdocumentation.org/packages/base/topics/data) function. For example, to list all data provided with the `Sleuth3` package, try:
```{r, eval = FALSE}
data(package = "Sleuth3")
```
which should pop up another tab with a listing of the data with the name on the left and a short description on the right. To see a data set just type its name, e.g.
```{r}
case0102
```
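If you'd rather see the listing of data sets in the console than in a separate tab, one option is to pull it out of the object that `data()` returns. This is a sketch that relies on the `results` matrix described on the `data()` help page, which that page calls experimental. The name `sleuth_data` is made up.

```{r}
# data() returns an object whose `results` element is a character matrix
sleuth_data <- data(package = "Sleuth3")

# Show the name (Item) and description (Title) of the first few data sets
head(sleuth_data$results[, c("Item", "Title")])
```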
When data is included in a package you'll also be able to get more info about the data by looking at its help page:
```{r, eval = FALSE}
?case0102
```
* **What do the observations in `case0102` represent?**
* **Why did `case0102` print in a different format to `class_data`?**
### Basic Summaries
There are some functions you'll use a lot to inspect R objects to see what they are. [`str()`](https://www.rdocumentation.org/packages/base/topics/str), short for **str**ucture, prints information about the structure of the object,
```{r}
str(case0102)
```
[`head()`](https://www.rdocumentation.org/packages/base/topics/head) will print out the first few elements,
```{r}
head(case0102)
```
and [`names()`](https://www.rdocumentation.org/packages/base/topics/names) will print out the named elements of the object.
```{r}
names(case0102)
```
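A few close relatives of these functions are also handy for a quick look at the size of a data frame. A brief sketch, not an exhaustive list:

```{r}
library(Sleuth3)

# Number of rows and columns, as a length-2 vector
dim(case0102)

# The same information one piece at a time
nrow(case0102)
ncol(case0102)

# head() takes an `n` argument to control how many rows are shown
head(case0102, n = 3)
```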
You can access named elements using the `$` (being careful to match case), e.g.
```{r}
case0102$Sex
```
returns the `Sex` column of the data.
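As a small illustration of the case matching, and of `[[ ]]` as an alternative way to extract a column by name (the object names `sex_dollar` and `sex_bracket` are just for this sketch):

```{r}
library(Sleuth3)

# `[[` with a quoted name extracts the same column as `$`
sex_dollar  <- case0102$Sex
sex_bracket <- case0102[["Sex"]]
identical(sex_dollar, sex_bracket)

# Names are case sensitive: asking for `sex` finds no column and gives NULL
is.null(case0102$sex)
```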
If you want numerical summaries of a variable, one option is to extract the column, then apply a summary function. For example, we could find the [`mean()`](https://www.rdocumentation.org/packages/base/topics/mean) of the `Salary` column in `case0102` with:
```{r}
salaries <- case0102$Salary
mean(salaries)
# or in one go
mean(case0102$Salary)
```
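One thing to watch for with summary functions like `mean()`: if a column contains missing values (`NA`), the result is `NA` unless you ask for them to be dropped. A sketch with a made-up vector called `times`:

```{r}
# A made-up vector with one missing value
times <- c(10, NA, 30)

mean(times)               # NA, because one value is missing
mean(times, na.rm = TRUE) # 20, the mean of the two non-missing values
```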
I prefer using `dplyr`'s [`summarize()`](https://www.rdocumentation.org/packages/dplyr/topics/summarize) because it more easily extends to summaries by group.
`summarize()` takes a data frame, and any number of expressions separated by `,` each for a different summary (the only restriction being they must return a single number). The same mean as above can be obtained with:
```{r}
summarize(.data = case0102, mean(Salary))
```
If we preface the summary with `name = `, the output has named columns, e.g.
```{r}
summarize(.data = case0102, avg_salary = mean(Salary))
```
We can add further expressions to calculate more summaries at once:
```{r}
summarize(.data = case0102,
  avg_salary = mean(Salary),
  sd_salary = sd(Salary),
  n = n())
```
**Can you add another summary to include the `median()` `Salary`?**
The real advantage of using `summarize()` is to combine it with a grouping. Groupings are created with [`group_by()`](https://www.rdocumentation.org/packages/dplyr/topics/group_by) where we pass the data and a column name that forms the groups:
```{r}
case0102_grouped <- group_by(.data = case0102, Sex)
case0102_grouped
```
The data hasn't changed, but if we use this new grouped data with our previous `summarize()` statement, we get the summary for each group:
```{r}
summarize(.data = case0102_grouped,
  avg_salary = mean(Salary),
  sd_salary = sd(Salary),
  n = n())
```
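You may also see the same two steps chained together with the pipe, `%>%`, which is available once `dplyr` (or `tidyverse`) is loaded. This is just an alternative way to write the grouped summary, not a different calculation. The name `salary_by_sex` is made up for this sketch.

```{r}
library(dplyr)
library(Sleuth3)

# Same grouped summary written as a pipeline:
# take case0102, group it by Sex, then summarize each group
salary_by_sex <- case0102 %>%
  group_by(Sex) %>%
  summarize(avg_salary = mean(Salary),
            sd_salary = sd(Salary),
            n = n())
salary_by_sex
```

The pipe passes the result on its left as the first argument of the function on its right, so there is no need to name the intermediate grouped data.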
* **Which group has the higher average salary?**
* **Fill in the blanks to find the median `commute_time` by `commute_type` in the `class_data`.**
```{r, eval = FALSE}
class_data_by_type <- group_by(.data = class_data, ___)
summarize(.data = class_data_by_type,
  med_time = ___(___))
```
* **Add another line to the `summarize()` statement to include the number of observations in each `commute_type`.**
### Basic Plots
Recall your [reading for the homework](http://r4ds.had.co.nz/data-visualisation.html) this week, which introduces the plotting template:
```{r, eval=FALSE}
ggplot(data = <DATA>) +
  <GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))
```
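One way to fill in the template, following the example in the linked reading: use the `mpg` data that ships with `ggplot2`, a point geom, and map engine size (`displ`) and highway mileage (`hwy`) to `x` and `y`. The object name `p` is only for this sketch.

```{r}
library(ggplot2)

# data: mpg, geom function: geom_point, mappings: x = displ, y = hwy
p <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy))
p
```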
For now, two very useful plots for examining distributions will be a histogram for continuous variables and a bar plot for discrete variables, corresponding to the `<GEOM_FUNCTION>`s [`geom_histogram()`](https://www.rdocumentation.org/packages/ggplot2/topics/geom_histogram) and [`geom_bar()`](https://www.rdocumentation.org/packages/ggplot2/topics/geom_bar) respectively. In both cases the `<MAPPINGS>` part of the template will simply map `x` to the variable of interest.
So, for example, a histogram of the `Salary` column in `case0102` would be created with:
```{r}
ggplot(data = case0102) +
  geom_histogram(aes(x = Salary))
```
Notice the message about choosing a better binwidth. You can do so by specifying a `binwidth` in the `geom_histogram()` call, e.g.
```{r}
ggplot(data = case0102) +
  geom_histogram(aes(x = Salary), binwidth = 100)
```
sets the width of each bin of the histogram on the x-axis to 100 ($100 in this case).
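If you'd rather choose how many bars to draw than how wide each one is, `geom_histogram()` also accepts a `bins` argument. A sketch of that alternative follows. The value 15 is arbitrary, and `p_bins` is a made-up name.

```{r}
library(ggplot2)
library(Sleuth3)

# `bins` fixes the number of bins; ggplot2 works out the width
p_bins <- ggplot(data = case0102) +
  geom_histogram(aes(x = Salary), bins = 15)
p_bins
```

Use one of `binwidth` or `bins`, whichever is easier to reason about for the variable at hand.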
To look at `Sex` we'd want a bar plot, so we need to change the `<GEOM_FUNCTION>` and the `<MAPPINGS>`:
```{r}
ggplot(data = case0102) +
  geom_bar(aes(x = Sex))
```
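The heights of the bars are just counts of rows in each category. If you want the numbers behind the plot, one option is `dplyr`'s `count()`, sketched here with a made-up object name, `sex_counts`:

```{r}
library(dplyr)
library(Sleuth3)

# One row per level of Sex, with the number of observations in column `n`
sex_counts <- count(case0102, Sex)
sex_counts
```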
* **Look at a histogram of the commute times in `class_data`. Describe any interesting features.**
* **Look at a bar plot of the commute types in `class_data`. Describe any interesting features.**