--- title: Lab 7 author: Charlotte Wickham date: '2017-11-07' slug: lab-7 draft: false output: blogdown::html_page: toc: true --- ```{r setup, include = FALSE} knitr::opts_chunk$set(echo = TRUE, results = "hide", message = FALSE, fig.keep = "none") set.seed(1918) library(pander) ``` ```{r} library(tidyverse) ``` ## `t.test()` in R You've already seen how to use `t.test()` to perform a one sample t-test. `t.test()` can also perform two sample t-tests. To illustrate let's use the `case0102` data from the `Sleuth3` package: ```{r} library(Sleuth3) ?case0102 case0102 ``` The data examine starting salaries for entry-level clerical jobs at a bank, and might be used to explore whether there is a difference between the starting salaries for male and female employees, i.e. to look for evidence of discrimination based on sex. There are a couple of variations to the syntax. The most useful, when data is already in a data frame, and generally in the form of $(Y_i, G_i)$ where $Y_i$ is the variable of interest, and $G_i$ a variable that indicates the grouping is the formula syntax. For example with `case0102` to conduct a test for the difference in population mean between males and females we can do: ```{r} t.test(Salary ~ Sex, data = case0102) ``` `Salary ~ Sex` is the formula, and references the column for the response and grouping variable, the response always being on the LHS side of the `~`. The `data` argument specifies the data frame these variables can be found in. **Which kind of t-test has been performed?** **Which group (`Male` or `Female`) had the higher salaries on average? In our notation from class which sample is $Y$ and which is $X$?** **Write a statistical summary based on the output. Can we conclude there is discrimination based on sex?** **Read the *Arguments* section in `?t.test()`. How could we do the equal variance test instead?** Another way use `t.test()` is to pass in the two samples as the `x` and `y` arguments. This requires a bit more work for `case0102` since we need to extract the male and female salaries first, but it may be easier in some simulation settings: ```{r} male_salaries <- filter(case0102, Sex == "Male") %>% pull(Salary) female_salaries <- filter(case0102, Sex == "Female") %>% pull(Salary) t.test(x = male_salaries, y = female_salaries) ``` (If you get the error: `Error in filter(case0102, Sex == "Male") : object 'Sex' not found`, you most likely have forgotten to load the tidyverse). **What happens if you switch the arguments so `x = female_salaries` and `y = male_salaries`?** To see how this might be easier with simulated data, consider investigating the performance of the equal variance test when the population variances are unequal. We might simulate a sample from $Y \sim N(0, 1)$ with $n = 10$, and a sample from $X \sim N(0, 10)$ with $m = 50$: ```{r} n <- 10 m <- 50 y <- rnorm(n, mean = 0, sd = sqrt(1)) x <- rnorm(m, mean = 0, sd = sqrt(10)) ``` Then the `t.test()` is performed with: ```{r} t.test(x = x, y = y, var.equal = TRUE) ``` To do this many times, we can roughly follow our original procedure, starting with generating many samples: ```{r} n_sim <- 1000 samples <- rerun(n_sim, y = rnorm(n, mean = 0, sd = sqrt(1)), x = rnorm(m, mean = 0, sd = sqrt(10)) ) ``` Notice, now we provide multiple arguments to `rerun()` and name them. `samples` is still a list but now has `x` and `y` components: ```{r} samples[[1]] ``` For each element of `samples` we can then conduct the t-test, and pull out the p-value: ```{r} p_vals <- map_dbl(samples, ~ t.test(x = .x$x, y = .x$y, var.equal = TRUE)$p.value) ``` Being careful to pull out the right elements of `.x` for the `x` and `y` arguments. Finally we could estimate the Type I error rate for level $\alpha = 0.05$: ```{r} mean(p_vals < 0.05) ``` **Is this what you expect given the samples sizes and population variances?** Aside: I prefer the formula syntax because it generalizes to the regression setting where we use `lm()`, and it encourages keeping your data together in a data frame. The downside is that the formula syntax for the paired test is confusing and inconsistent with how formulas are generally used in R. I prefer thinking about paired t-tests as one sample tests of differences and computing the differences explicitly myself. # Facetting plots Facetting is a useful technique for making the same plot for groups in a data set. Imagine we want separate histograms for the salaries for men and women in `case0102`. We could obtain a histogram for salaries for everyone with ```{r} ggplot(case0102, aes(x = Salary)) + geom_histogram() ``` To get the same plot for each sex, we simply add a `facet_wrap()` call: ```{r} ggplot(case0102, aes(x = Salary)) + geom_histogram() + facet_wrap(~ Sex) ``` # Simulating outside of an Rmarkdown document HW #6 will require some simulation, and be submitted as Rmarkdown. Recall from Lab 5: > Every time you compile, all code is run in a fresh session. Code that takes a long time to run (i.e. a big simulation) will slow this down. Consider moving the simulation code to a separate file, and saving the output there (with e.g. `write_rds()` from the `readr` package), and reading the output into your Rmarkdown file (with e.g. `read_rds()` from the `readr` package). You may want to heed this advice, but make sure you also submit the separate R code file.