---
title: Lab 7
author: Charlotte Wickham
date: '2017-11-07'
slug: lab-7
draft: false
output:
blogdown::html_page:
toc: true
---
```{r setup, include = FALSE}
knitr::opts_chunk$set(echo = TRUE, results = "hide",
message = FALSE, fig.keep = "none")
set.seed(1918)
library(pander)
```
```{r}
library(tidyverse)
```
## `t.test()` in R
You've already seen how to use `t.test()` to perform a one sample t-test. `t.test()` can also perform two sample t-tests. To illustrate let's use the `case0102` data from the `Sleuth3` package:
```{r}
library(Sleuth3)
?case0102
case0102
```
The data examine starting salaries for entry-level clerical jobs at a bank, and might be used to explore whether there is a difference between the starting salaries for male and female employees, i.e. to look for evidence of discrimination based on sex.
There are a couple of variations to the syntax. The most useful, when data is already in a data frame, and generally in the form of $(Y_i, G_i)$ where $Y_i$ is the variable of interest, and $G_i$ a variable that indicates the grouping is the formula syntax. For example with `case0102` to conduct a test for the difference in population mean between males and females we can do:
```{r}
t.test(Salary ~ Sex, data = case0102)
```
`Salary ~ Sex` is the formula, and references the column for the response and grouping variable, the response always being on the LHS side of the `~`. The `data` argument specifies the data frame these variables can be found in.
**Which kind of t-test has been performed?**
**Which group (`Male` or `Female`) had the higher salaries on average? In our notation from class which sample is $Y$ and which is $X$?**
**Write a statistical summary based on the output. Can we conclude there is discrimination based on sex?**
**Read the *Arguments* section in `?t.test()`. How could we do the equal variance test instead?**
Another way use `t.test()` is to pass in the two samples as the `x` and `y` arguments. This requires a bit more work for `case0102` since we need to extract the male and female salaries first, but it may be easier in some simulation settings:
```{r}
male_salaries <- filter(case0102, Sex == "Male") %>%
pull(Salary)
female_salaries <- filter(case0102, Sex == "Female") %>%
pull(Salary)
t.test(x = male_salaries, y = female_salaries)
```
(If you get the error: `Error in filter(case0102, Sex == "Male") : object 'Sex' not found`, you most likely have forgotten to load the tidyverse).
**What happens if you switch the arguments so `x = female_salaries` and `y = male_salaries`?**
To see how this might be easier with simulated data, consider investigating the performance of the equal variance test when the population variances are unequal. We might simulate a sample from $Y \sim N(0, 1)$ with $n = 10$, and a sample from $X \sim N(0, 10)$ with $m = 50$:
```{r}
n <- 10
m <- 50
y <- rnorm(n, mean = 0, sd = sqrt(1))
x <- rnorm(m, mean = 0, sd = sqrt(10))
```
Then the `t.test()` is performed with:
```{r}
t.test(x = x, y = y, var.equal = TRUE)
```
To do this many times, we can roughly follow our original procedure, starting with generating many samples:
```{r}
n_sim <- 1000
samples <- rerun(n_sim,
y = rnorm(n, mean = 0, sd = sqrt(1)),
x = rnorm(m, mean = 0, sd = sqrt(10))
)
```
Notice, now we provide multiple arguments to `rerun()` and name them. `samples` is still a list but now has `x` and `y` components:
```{r}
samples[[1]]
```
For each element of `samples` we can then conduct the t-test, and pull out the p-value:
```{r}
p_vals <- map_dbl(samples,
~ t.test(x = .x$x, y = .x$y, var.equal = TRUE)$p.value)
```
Being careful to pull out the right elements of `.x` for the `x` and `y` arguments.
Finally we could estimate the Type I error rate for level $\alpha = 0.05$:
```{r}
mean(p_vals < 0.05)
```
**Is this what you expect given the samples sizes and population variances?**
Aside: I prefer the formula syntax because it generalizes to the regression setting where we use `lm()`, and it encourages keeping your data together in a data frame. The downside is that the formula syntax for the paired test is confusing and inconsistent with how formulas are generally used in R. I prefer thinking about paired t-tests as one sample tests of differences and computing the differences explicitly myself.
# Facetting plots
Facetting is a useful technique for making the same plot for groups in a data set. Imagine we want separate histograms for the salaries for men and women in `case0102`. We could obtain a histogram for salaries for everyone with
```{r}
ggplot(case0102, aes(x = Salary)) +
geom_histogram()
```
To get the same plot for each sex, we simply add a `facet_wrap()` call:
```{r}
ggplot(case0102, aes(x = Salary)) +
geom_histogram() +
facet_wrap(~ Sex)
```
# Simulating outside of an Rmarkdown document
HW #6 will require some simulation, and be submitted as Rmarkdown. Recall from Lab 5:
> Every time you compile, all code is run in a fresh session. Code that takes a long time to run (i.e. a big simulation) will slow this down. Consider moving the simulation code to a separate file, and saving the output there (with e.g. `write_rds()` from the `readr` package), and reading the output into your Rmarkdown file (with e.g. `read_rds()` from the `readr` package).
You may want to heed this advice, but make sure you also submit the separate R code file.