`library(tidyverse)`

`t.test()`

in R

You’ve already seen how to use `t.test()`

to perform a one sample t-test. `t.test()`

can also perform two sample t-tests. To illustrate let’s use the `case0102`

data from the `Sleuth3`

package:

```
library(Sleuth3)
?case0102
case0102
```

The data examine starting salaries for entry-level clerical jobs at a bank, and might be used to explore whether there is a difference between the starting salaries for male and female employees, i.e. to look for evidence of discrimination based on sex.

There are a couple of variations to the syntax. The most useful, when data is already in a data frame, and generally in the form of \((Y_i, G_i)\) where \(Y_i\) is the variable of interest, and \(G_i\) a variable that indicates the grouping is the formula syntax. For example with `case0102`

to conduct a test for the difference in population mean between males and females we can do:

`t.test(Salary ~ Sex, data = case0102)`

`Salary ~ Sex`

is the formula, and references the column for the response and grouping variable, the response always being on the LHS side of the `~`

. The `data`

argument specifies the data frame these variables can be found in.

**Which kind of t-test has been performed?**

**Which group ( Male or Female) had the higher salaries on average? In our notation from class which sample is \(Y\) and which is \(X\)?**

**Write a statistical summary based on the output. Can we conclude there is discrimination based on sex?**

**Read the Arguments section in ?t.test(). How could we do the equal variance test instead?**

Another way use `t.test()`

is to pass in the two samples as the `x`

and `y`

arguments. This requires a bit more work for `case0102`

since we need to extract the male and female salaries first, but it may be easier in some simulation settings:

```
male_salaries <- filter(case0102, Sex == "Male") %>%
pull(Salary)
female_salaries <- filter(case0102, Sex == "Female") %>%
pull(Salary)
t.test(x = male_salaries, y = female_salaries)
```

(If you get the error: `Error in filter(case0102, Sex == "Male") : object 'Sex' not found`

, you most likely have forgotten to load the tidyverse).

**What happens if you switch the arguments so x = female_salaries and y = male_salaries?**

To see how this might be easier with simulated data, consider investigating the performance of the equal variance test when the population variances are unequal. We might simulate a sample from \(Y \sim N(0, 1)\) with \(n = 10\), and a sample from \(X \sim N(0, 10)\) with \(m = 50\):

```
n <- 10
m <- 50
y <- rnorm(n, mean = 0, sd = sqrt(1))
x <- rnorm(m, mean = 0, sd = sqrt(10))
```

Then the `t.test()`

is performed with:

`t.test(x = x, y = y, var.equal = TRUE)`

To do this many times, we can roughly follow our original procedure, starting with generating many samples:

```
n_sim <- 1000
samples <- rerun(n_sim,
y = rnorm(n, mean = 0, sd = sqrt(1)),
x = rnorm(m, mean = 0, sd = sqrt(10))
)
```

Notice, now we provide multiple arguments to `rerun()`

and name them. `samples`

is still a list but now has `x`

and `y`

components:

`samples[[1]]`

For each element of `samples`

we can then conduct the t-test, and pull out the p-value:

```
p_vals <- map_dbl(samples,
~ t.test(x = .x$x, y = .x$y, var.equal = TRUE)$p.value)
```

Being careful to pull out the right elements of `.x`

for the `x`

and `y`

arguments.

Finally we could estimate the Type I error rate for level \(\alpha = 0.05\):

`mean(p_vals < 0.05)`

**Is this what you expect given the samples sizes and population variances?**

Aside: I prefer the formula syntax because it generalizes to the regression setting where we use `lm()`

, and it encourages keeping your data together in a data frame. The downside is that the formula syntax for the paired test is confusing and inconsistent with how formulas are generally used in R. I prefer thinking about paired t-tests as one sample tests of differences and computing the differences explicitly myself.

# Facetting plots

Facetting is a useful technique for making the same plot for groups in a data set. Imagine we want separate histograms for the salaries for men and women in `case0102`

. We could obtain a histogram for salaries for everyone with

```
ggplot(case0102, aes(x = Salary)) +
geom_histogram()
```

To get the same plot for each sex, we simply add a `facet_wrap()`

call:

```
ggplot(case0102, aes(x = Salary)) +
geom_histogram() +
facet_wrap(~ Sex)
```

# Simulating outside of an Rmarkdown document

HW #6 will require some simulation, and be submitted as Rmarkdown. Recall from Lab 5:

Every time you compile, all code is run in a fresh session. Code that takes a long time to run (i.e. a big simulation) will slow this down. Consider moving the simulation code to a separate file, and saving the output there (with e.g.

`write_rds()`

from the`readr`

package), and reading the output into your Rmarkdown file (with e.g.`read_rds()`

from the`readr`

package).

You may want to heed this advice, but make sure you also submit the separate R code file.