library(tidyverse)
t.test()
in R
You’ve already seen how to use t.test()
to perform a one sample t-test. t.test()
can also perform two sample t-tests. To illustrate let’s use the case0102
data from the Sleuth3
package:
library(Sleuth3)
?case0102
case0102
The data examine starting salaries for entry-level clerical jobs at a bank, and might be used to explore whether there is a difference between the starting salaries for male and female employees, i.e. to look for evidence of discrimination based on sex.
There are a couple of variations to the syntax. The most useful, when data is already in a data frame, and generally in the form of \((Y_i, G_i)\) where \(Y_i\) is the variable of interest, and \(G_i\) a variable that indicates the grouping is the formula syntax. For example with case0102
to conduct a test for the difference in population mean between males and females we can do:
t.test(Salary ~ Sex, data = case0102)
Salary ~ Sex
is the formula, and references the column for the response and grouping variable, the response always being on the LHS side of the ~
. The data
argument specifies the data frame these variables can be found in.
Which kind of t-test has been performed?
Which group (Male
or Female
) had the higher salaries on average? In our notation from class which sample is \(Y\) and which is \(X\)?
Write a statistical summary based on the output. Can we conclude there is discrimination based on sex?
Read the Arguments section in ?t.test()
. How could we do the equal variance test instead?
Another way use t.test()
is to pass in the two samples as the x
and y
arguments. This requires a bit more work for case0102
since we need to extract the male and female salaries first, but it may be easier in some simulation settings:
male_salaries <- filter(case0102, Sex == "Male") %>%
pull(Salary)
female_salaries <- filter(case0102, Sex == "Female") %>%
pull(Salary)
t.test(x = male_salaries, y = female_salaries)
(If you get the error: Error in filter(case0102, Sex == "Male") : object 'Sex' not found
, you most likely have forgotten to load the tidyverse).
What happens if you switch the arguments so x = female_salaries
and y = male_salaries
?
To see how this might be easier with simulated data, consider investigating the performance of the equal variance test when the population variances are unequal. We might simulate a sample from \(Y \sim N(0, 1)\) with \(n = 10\), and a sample from \(X \sim N(0, 10)\) with \(m = 50\):
n <- 10
m <- 50
y <- rnorm(n, mean = 0, sd = sqrt(1))
x <- rnorm(m, mean = 0, sd = sqrt(10))
Then the t.test()
is performed with:
t.test(x = x, y = y, var.equal = TRUE)
To do this many times, we can roughly follow our original procedure, starting with generating many samples:
n_sim <- 1000
samples <- rerun(n_sim,
y = rnorm(n, mean = 0, sd = sqrt(1)),
x = rnorm(m, mean = 0, sd = sqrt(10))
)
Notice, now we provide multiple arguments to rerun()
and name them. samples
is still a list but now has x
and y
components:
samples[[1]]
For each element of samples
we can then conduct the t-test, and pull out the p-value:
p_vals <- map_dbl(samples,
~ t.test(x = .x$x, y = .x$y, var.equal = TRUE)$p.value)
Being careful to pull out the right elements of .x
for the x
and y
arguments.
Finally we could estimate the Type I error rate for level \(\alpha = 0.05\):
mean(p_vals < 0.05)
Is this what you expect given the samples sizes and population variances?
Aside: I prefer the formula syntax because it generalizes to the regression setting where we use lm()
, and it encourages keeping your data together in a data frame. The downside is that the formula syntax for the paired test is confusing and inconsistent with how formulas are generally used in R. I prefer thinking about paired t-tests as one sample tests of differences and computing the differences explicitly myself.
Facetting plots
Facetting is a useful technique for making the same plot for groups in a data set. Imagine we want separate histograms for the salaries for men and women in case0102
. We could obtain a histogram for salaries for everyone with
ggplot(case0102, aes(x = Salary)) +
geom_histogram()
To get the same plot for each sex, we simply add a facet_wrap()
call:
ggplot(case0102, aes(x = Salary)) +
geom_histogram() +
facet_wrap(~ Sex)
Simulating outside of an Rmarkdown document
HW #6 will require some simulation, and be submitted as Rmarkdown. Recall from Lab 5:
Every time you compile, all code is run in a fresh session. Code that takes a long time to run (i.e. a big simulation) will slow this down. Consider moving the simulation code to a separate file, and saving the output there (with e.g.
write_rds()
from thereadr
package), and reading the output into your Rmarkdown file (with e.g.read_rds()
from thereadr
package).
You may want to heed this advice, but make sure you also submit the separate R code file.