---
title: Lab 7
author: Charlotte Wickham
date: '2017-11-07'
slug: lab-7
draft: false
output:
  blogdown::html_page:
    toc: true
---

<!-- This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. -->

```{r setup, include = FALSE}
knitr::opts_chunk$set(echo = TRUE, results = "hide",
  message = FALSE, fig.keep = "none")
set.seed(1918)
library(pander)
```

```{r}
library(tidyverse)
```

## `t.test()` in R

You've already seen how to use `t.test()` to perform a one sample t-test.  `t.test()` can also perform two sample t-tests.  To illustrate let's use the `case0102` data from the `Sleuth3` package:

```{r}
library(Sleuth3)
?case0102
case0102
```

The data examine starting salaries for entry-level clerical jobs at a bank, and might be used to explore whether there is a difference between the starting salaries for male and female employees, i.e. to look for evidence of discrimination based on sex.

There are a couple of variations to the syntax.  The most useful, when data is already in a data frame, and generally in the form of $(Y_i, G_i)$ where $Y_i$  is the variable of interest, and $G_i$ a variable that indicates the grouping is the formula syntax.  For example with `case0102` to conduct a test for the difference in population mean between males and females we can do:

```{r}
t.test(Salary ~ Sex, data = case0102)
```

`Salary ~ Sex` is the formula, and references the column for the response and grouping variable, the response always being on the LHS side of the `~`.  The `data` argument specifies the data frame these variables can be found in.

**Which kind of t-test has been performed?**

**Which group (`Male` or `Female`) had the higher salaries on average? In our notation from class which sample is $Y$ and which is $X$?**

**Write a statistical summary based on the output.  Can we conclude there is discrimination based on sex?**

**Read the *Arguments* section in `?t.test()`.  How could we do the equal variance test instead?**

Another way use `t.test()` is to pass in the two samples as the `x` and `y` arguments.  This requires a bit more work for `case0102` since we need to extract the male and female salaries first, but it may be easier in some simulation settings:

```{r}
male_salaries <- filter(case0102, Sex == "Male") %>%
  pull(Salary)
female_salaries <-  filter(case0102, Sex == "Female") %>%
  pull(Salary)
t.test(x = male_salaries, y = female_salaries)
```
(If you get the error: `Error in filter(case0102, Sex == "Male") : object 'Sex' not found`, you most likely have forgotten to load the tidyverse).

**What happens if you switch the arguments so `x = female_salaries` and `y = male_salaries`?**

To see how this might be easier with simulated data, consider investigating the performance of the equal variance test when the population variances are unequal.  We might simulate a sample from $Y \sim N(0, 1)$ with $n = 10$, and a sample from $X \sim N(0, 10)$ with $m = 50$:
```{r}
n <- 10
m <- 50
y <- rnorm(n, mean = 0, sd = sqrt(1))
x <- rnorm(m, mean = 0, sd = sqrt(10))
```

Then the `t.test()` is performed with:
```{r}
t.test(x = x, y = y, var.equal = TRUE)
```

To do this many times, we can roughly follow our original procedure, starting with generating many samples:
```{r}
n_sim <- 1000
samples <- rerun(n_sim, 
  y = rnorm(n, mean = 0, sd = sqrt(1)),
  x = rnorm(m, mean = 0, sd = sqrt(10))
)
```

Notice, now we provide multiple arguments to `rerun()` and name them.  `samples` is still a list but now has `x` and `y` components:
```{r}
samples[[1]]
```

For each element of `samples` we can then conduct the t-test, and pull out the p-value:
```{r}
p_vals <- map_dbl(samples, 
  ~ t.test(x = .x$x, y = .x$y, var.equal = TRUE)$p.value)
``` 

Being careful to pull out the right elements of `.x` for the `x` and `y` arguments.

Finally we could estimate the Type I error rate for level $\alpha = 0.05$:
```{r}
mean(p_vals < 0.05)
```

**Is this what you expect given the samples sizes and population variances?**

Aside: I prefer the formula syntax because it generalizes to the regression setting where we use `lm()`, and it encourages keeping your data together in a data frame.  The downside is that the formula syntax for the paired test is confusing and inconsistent with how formulas are generally used in R.  I prefer thinking about paired t-tests as one sample tests of differences and computing the differences explicitly myself.

# Facetting plots

Facetting is a useful technique for making the same plot for groups in a data set.  Imagine we want separate histograms for the salaries for men and women in `case0102`.  We could obtain a histogram for salaries for everyone with

```{r}
ggplot(case0102, aes(x = Salary)) +
  geom_histogram()
```
To get the same plot for each sex, we simply add a `facet_wrap()` call:
```{r}
ggplot(case0102, aes(x = Salary)) +
  geom_histogram() +
  facet_wrap(~ Sex)
```


# Simulating outside of an Rmarkdown document

HW #6 will require some simulation, and be submitted as Rmarkdown.  Recall from Lab 5:

> Every time you compile, all code is run in a fresh session.  Code that takes a long time to run (i.e. a big simulation) will slow this down. Consider moving the simulation code to a separate file, and saving the output there (with e.g. `write_rds()` from the `readr` package), and reading the output into your Rmarkdown file (with e.g. `read_rds()` from the `readr` package).

You may want to heed this advice, but make sure you also submit the separate R code file.