t.test() in R
You’ve already seen how to use t.test() to perform a one sample t-test. t.test() can also perform two sample t-tests. To illustrate, let’s use the case0102 data from the Sleuth3 package:
library(Sleuth3)
?case0102
case0102
The data examine starting salaries for entry-level clerical jobs at a bank, and might be used to explore whether there is a difference between the starting salaries for male and female employees, i.e. to look for evidence of discrimination based on sex.
There are a couple of variations to the syntax. The most useful is the formula syntax, which applies when the data are already in a data frame in the form \((Y_i, G_i)\), where \(Y_i\) is the variable of interest and \(G_i\) is a variable that indicates the group. For example, with case0102, to conduct a test for the difference in population means between males and females we can do:
t.test(Salary ~ Sex, data = case0102)
Salary ~ Sex is the formula. It references the columns for the response and the grouping variable, with the response always on the left-hand side of the ~. The data argument specifies the data frame in which these variables can be found.
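The object returned by t.test() can also be stored and inspected piece by piece. The sketch below is only an illustration: it uses a small made-up data frame (toy, with hypothetical columns response and group) in place of case0102 so that it runs without the Sleuth3 package.

```r
set.seed(1)  # arbitrary seed so the made-up data are reproducible

# Hypothetical data in (Y_i, G_i) form: one response column, one grouping column
toy <- data.frame(
  response = c(rnorm(12, mean = 5), rnorm(15, mean = 6)),
  group = rep(c("A", "B"), times = c(12, 15))
)

# Same formula syntax as above: response on the left of ~, grouping on the right
result <- t.test(response ~ group, data = toy)

result$estimate  # sample mean in each group
result$conf.int  # confidence interval for the difference in means
result$p.value   # p-value for the test
```

Storing the result this way is what makes it possible to pull out a single component, such as the p-value, in a simulation.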
Which kind of t-test has been performed?
Which group (Male or Female) had the higher salaries on average? In our notation from class, which sample is \(Y\) and which is \(X\)?
Write a statistical summary based on the output. Can we conclude there is discrimination based on sex?
Read the Arguments section in ?t.test. How could we do the equal variance test instead?
Another way to use t.test() is to pass in the two samples as the x and y arguments. This requires a bit more work for case0102, since we need to extract the male and female salaries first, but it may be easier in some simulation settings:
library(tidyverse)  # provides filter(), pull() and %>%
male_salaries <- filter(case0102, Sex == "Male") %>% pull(Salary)
female_salaries <- filter(case0102, Sex == "Female") %>% pull(Salary)
t.test(x = male_salaries, y = female_salaries)
(If you get the error: Error in filter(case0102, Sex == "Male") : object 'Sex' not found, you have most likely forgotten to load the tidyverse.)
What happens if you switch the arguments so that x = female_salaries and y = male_salaries?
To see how this might be easier with simulated data, consider investigating the performance of the equal variance test when the population variances are unequal. We might simulate a sample from \(Y \sim N(0, 1)\) with \(n = 10\), and a sample from \(X \sim N(0, 10)\) with \(m = 50\):
n <- 10
m <- 50
y <- rnorm(n, mean = 0, sd = sqrt(1))
x <- rnorm(m, mean = 0, sd = sqrt(10))
The equal variance test is then performed with:
t.test(x = x, y = y, var.equal = TRUE)
To do this many times, we can roughly follow our original procedure, starting with generating many samples:
n_sim <- 1000
samples <- rerun(n_sim,
  y = rnorm(n, mean = 0, sd = sqrt(1)),
  x = rnorm(m, mean = 0, sd = sqrt(10))
)
Notice that we now provide multiple arguments to rerun() and name them. samples is still a list, but now each element is itself a list with named elements y and x. For each element of samples we can then conduct the t-test and pull out the p-value:
p_vals <- map_dbl(samples, ~ t.test(x = .x$x, y = .x$y, var.equal = TRUE)$p.value)
Here we need to be careful to pull out the right elements of .x for the x and y arguments of t.test().
Finally we could estimate the Type I error rate for level \(\alpha = 0.05\):
mean(p_vals < 0.05)
Is this what you expect given the sample sizes and population variances?
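For reference, here is a self-contained sketch of the same simulation in base R, which may be handy because rerun() is deprecated in recent versions of purrr (1.0.0 and later). The helper name one_p_value and the seed are made up for this illustration. The last lines add a rough Monte Carlo standard error for the estimated rate, using the usual binomial formula \(\sqrt{\hat{p}(1 - \hat{p}) / n_{sim}}\).

```r
set.seed(2024)  # arbitrary seed, only so the run is reproducible

n <- 10
m <- 50
n_sim <- 1000

# Hypothetical helper: simulate one pair of samples and return the
# p-value from the equal variance t-test
one_p_value <- function() {
  y <- rnorm(n, mean = 0, sd = sqrt(1))
  x <- rnorm(m, mean = 0, sd = sqrt(10))
  t.test(x = x, y = y, var.equal = TRUE)$p.value
}

# replicate() calls the helper n_sim times and collects the p-values
p_vals <- replicate(n_sim, one_p_value())

# Estimated rejection rate at level 0.05, with its Monte Carlo standard error
reject_rate <- mean(p_vals < 0.05)
mc_se <- sqrt(reject_rate * (1 - reject_rate) / n_sim)
c(estimate = reject_rate, std_error = mc_se)
```

The standard error gives a sense of how far the estimate could plausibly be from the true rejection rate with only 1000 simulations.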
Aside: I prefer the formula syntax because it generalizes to the regression setting where we use lm(), and it encourages keeping your data together in a data frame. The downside is that the formula syntax for the paired test is confusing and inconsistent with how formulas are generally used in R. I prefer thinking about paired t-tests as one sample tests of differences and computing the differences explicitly myself.
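As a small illustration of treating a paired test as a one sample test of differences, the sketch below uses made-up paired measurements (the names before and after are hypothetical) and checks that both approaches give the same test statistic and p-value.

```r
set.seed(42)  # arbitrary seed so the made-up data are reproducible

# Hypothetical paired measurements on the same 20 units
before <- rnorm(20, mean = 50, sd = 5)
after <- before + rnorm(20, mean = 1, sd = 2)

# Compute the differences explicitly, then run a one sample t-test
differences <- after - before
one_sample <- t.test(differences)

# The paired t-test on the two original samples
paired <- t.test(x = after, y = before, paired = TRUE)

# Both comparisons should return TRUE
all.equal(unname(one_sample$statistic), unname(paired$statistic))
all.equal(one_sample$p.value, paired$p.value)
```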
Facetting is a useful technique for making the same plot for groups in a data set. Imagine we want separate histograms for the salaries for men and women in case0102. We could obtain a histogram for salaries for everyone with
ggplot(case0102, aes(x = Salary)) + geom_histogram()
To get the same plot for each sex, we simply add a call to facet_wrap():
ggplot(case0102, aes(x = Salary)) + geom_histogram() + facet_wrap(~ Sex)
Simulating outside of an Rmarkdown document
HW #6 will require some simulation, and be submitted as Rmarkdown. Recall from Lab 5:
Every time you compile, all code is run in a fresh session. Code that takes a long time to run (e.g. a big simulation) will slow this down. Consider moving the simulation code to a separate file, saving the output there (with e.g. the readr package), and reading the output into your Rmarkdown file.
You may want to heed this advice, but make sure you also submit the separate R code file.
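One possible shape for this workflow is sketched below. It uses base R's saveRDS() and readRDS() so that it runs without extra packages; readr's write_rds() and read_rds() could be used in the same way. The file name p_vals.rds and the use of tempdir() are assumptions made only for this illustration. In practice you would save to a path inside your homework folder.

```r
# --- In the separate simulation script ---
set.seed(1)  # arbitrary seed so the simulation is reproducible

# A small stand-in for a long-running simulation
p_vals <- replicate(200, t.test(x = rnorm(50), y = rnorm(10))$p.value)

# Hypothetical output location, chosen here so the example cleans up after itself
out_file <- file.path(tempdir(), "p_vals.rds")
saveRDS(p_vals, out_file)

# --- In the Rmarkdown document ---
# Read the saved results instead of re-running the simulation
p_vals_loaded <- readRDS(out_file)
mean(p_vals_loaded < 0.05)
```

With this split, compiling the Rmarkdown document only reads a file, so it stays fast no matter how long the simulation itself takes.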