Lab 7 RMarkdown

Tue Nov 7 or Wed Nov 8


t.test() in R

You’ve already seen how to use t.test() to perform a one sample t-test. t.test() can also perform two sample t-tests. To illustrate let’s use the case0102 data from the Sleuth3 package:


The data examine starting salaries for entry-level clerical jobs at a bank, and might be used to explore whether there is a difference between the starting salaries for male and female employees, i.e. to look for evidence of discrimination based on sex.

There are a couple of variations to the syntax. The most useful, when data is already in a data frame, and generally in the form of \((Y_i, G_i)\) where \(Y_i\) is the variable of interest, and \(G_i\) a variable that indicates the grouping is the formula syntax. For example with case0102 to conduct a test for the difference in population mean between males and females we can do:

t.test(Salary ~ Sex, data = case0102)

Salary ~ Sex is the formula, and references the column for the response and grouping variable, the response always being on the LHS side of the ~. The data argument specifies the data frame these variables can be found in.

Which kind of t-test has been performed?

Which group (Male or Female) had the higher salaries on average? In our notation from class which sample is \(Y\) and which is \(X\)?

Write a statistical summary based on the output. Can we conclude there is discrimination based on sex?

Read the Arguments section in ?t.test(). How could we do the equal variance test instead?

Another way use t.test() is to pass in the two samples as the x and y arguments. This requires a bit more work for case0102 since we need to extract the male and female salaries first, but it may be easier in some simulation settings:

male_salaries <- filter(case0102, Sex == "Male") %>%
female_salaries <-  filter(case0102, Sex == "Female") %>%
t.test(x = male_salaries, y = female_salaries)

(If you get the error: Error in filter(case0102, Sex == "Male") : object 'Sex' not found, you most likely have forgotten to load the tidyverse).

What happens if you switch the arguments so x = female_salaries and y = male_salaries?

To see how this might be easier with simulated data, consider investigating the performance of the equal variance test when the population variances are unequal. We might simulate a sample from \(Y \sim N(0, 1)\) with \(n = 10\), and a sample from \(X \sim N(0, 10)\) with \(m = 50\):

n <- 10
m <- 50
y <- rnorm(n, mean = 0, sd = sqrt(1))
x <- rnorm(m, mean = 0, sd = sqrt(10))

Then the t.test() is performed with:

t.test(x = x, y = y, var.equal = TRUE)

To do this many times, we can roughly follow our original procedure, starting with generating many samples:

n_sim <- 1000
samples <- rerun(n_sim, 
  y = rnorm(n, mean = 0, sd = sqrt(1)),
  x = rnorm(m, mean = 0, sd = sqrt(10))

Notice, now we provide multiple arguments to rerun() and name them. samples is still a list but now has x and y components:


For each element of samples we can then conduct the t-test, and pull out the p-value:

p_vals <- map_dbl(samples, 
  ~ t.test(x = .x$x, y = .x$y, var.equal = TRUE)$p.value)

Being careful to pull out the right elements of .x for the x and y arguments.

Finally we could estimate the Type I error rate for level \(\alpha = 0.05\):

mean(p_vals < 0.05)

Is this what you expect given the samples sizes and population variances?

Aside: I prefer the formula syntax because it generalizes to the regression setting where we use lm(), and it encourages keeping your data together in a data frame. The downside is that the formula syntax for the paired test is confusing and inconsistent with how formulas are generally used in R. I prefer thinking about paired t-tests as one sample tests of differences and computing the differences explicitly myself.

Facetting plots

Facetting is a useful technique for making the same plot for groups in a data set. Imagine we want separate histograms for the salaries for men and women in case0102. We could obtain a histogram for salaries for everyone with

ggplot(case0102, aes(x = Salary)) +

To get the same plot for each sex, we simply add a facet_wrap() call:

ggplot(case0102, aes(x = Salary)) +
  geom_histogram() +
  facet_wrap(~ Sex)

Simulating outside of an Rmarkdown document

HW #6 will require some simulation, and be submitted as Rmarkdown. Recall from Lab 5:

Every time you compile, all code is run in a fresh session. Code that takes a long time to run (i.e. a big simulation) will slow this down. Consider moving the simulation code to a separate file, and saving the output there (with e.g. write_rds() from the readr package), and reading the output into your Rmarkdown file (with e.g. read_rds() from the readr package).

You may want to heed this advice, but make sure you also submit the separate R code file.

Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.