Homework 7

Due 2017/11/24

Submit your homework as a compiled Rmarkdown document. Submit both the .pdf (either generated directly from the Rmarkdown, or saved from the Word version generated from the Rmarkdown) and the Rmarkdown file itself (the .Rmd). If you do your simulations in a separate R file, please also submit that, but the summary of the simulations must be in the .pdf, or you will receive no credit for the problem.

Submit your answers on Canvas.

1. Fisher’s exact test & Chi-square test

Compare the performance and p-values for Fisher’s exact test and the Chi-square test by simulation.

The following compares the p-values from the two tests in the scenario:

  • \(n_Y = 10, n_X = 15; p_Y = 0.5, p_X = 0.5\) (Null true)

library(purrr)    # rerun(), map_dbl()
library(ggplot2)  # ggplot(), geom_histogram(), geom_point()

n_sim <- 1000

n_y <- 10
n_x <- 15
p_y <- 0.5
p_x <- 0.5

# Generate sample data of (Y, G) form.
samples <- rerun(n_sim,
  y = c(rbinom(n = n_y, size = 1, prob = p_y), 
        rbinom(n = n_x, size = 1, prob = p_x)),
  g = rep(c(0, 1), c(n_y, n_x))
)

# p-values for Chi-square test
chisq_p <- map_dbl(samples, 
  ~ chisq.test(table(.x$y, .x$g), correct = FALSE)$p.value)

# p-values for Fisher Exact test
fisher_p <- map_dbl(samples, 
  ~ fisher.test(table(.x$y, .x$g))$p.value)

# Rejection Rates at alpha = 0.05
c(chisq_rr = mean(chisq_p < 0.05), fisher_rr = mean(fisher_p < 0.05))

# p-value distributions
ggplot() +
 geom_histogram(aes(x = chisq_p))
ggplot() +
 geom_histogram(aes(x = fisher_p))

# Comparing p-values
ggplot() +
 geom_point(aes(x = chisq_p, y = fisher_p))

You should compare the tests in the following additional settings:

  1. \(n_Y =\) 100, \(n_X =\) 150; \(p_Y =\) 0.5, \(p_X =\) 0.5 (Null true)

  2. \(n_Y =\) 100, \(n_X =\) 150; \(p_Y =\) 0.25, \(p_X =\) 0.25 (Null true)

  3. \(n_Y =\) 10, \(n_X =\) 15; \(p_Y =\) 0.5, \(p_X =\) 0.25 (Null false)

  4. \(n_Y =\) 100, \(n_X =\) 150; \(p_Y =\) 0.5, \(p_X =\) 0.4 (Null false)

  5. \(n_Y =\) 100, \(n_X =\) 150; \(p_Y =\) 0.4, \(p_X =\) 0.5 (Null false)

Comment on the calibration and power of the two tests, and the extent to which they make the same decisions.
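To run the same comparison in each setting without copying the code, the simulation can be wrapped in a function. The sketch below is one possible way to do this, not a required approach. `simulate_pvalues` is a hypothetical helper name, and it uses base R's `replicate()` in place of `rerun()` and `map_dbl()`. It also fixes the factor levels so that every simulated table is 2 x 2, which is an added assumption about how to handle samples where all outcomes are equal.

```r
# Hypothetical helper (not part of the assignment code): simulate p-values
# from both tests for one setting.
simulate_pvalues <- function(n_y, n_x, p_y, p_x, n_sim = 1000) {
  g <- rep(c(0, 1), c(n_y, n_x))
  one_sim <- function() {
    y <- c(rbinom(n = n_y, size = 1, prob = p_y),
           rbinom(n = n_x, size = 1, prob = p_x))
    # Fixed levels keep the table 2 x 2 even if every y is 0 (or every y is 1).
    tab <- table(factor(y, levels = 0:1), factor(g, levels = 0:1))
    c(chisq  = suppressWarnings(chisq.test(tab, correct = FALSE)$p.value),
      fisher = fisher.test(tab)$p.value)
  }
  # replicate() returns a 2 x n_sim matrix; transpose to one row per simulation.
  as.data.frame(t(replicate(n_sim, one_sim())))
}

set.seed(1)
res <- simulate_pvalues(n_y = 10, n_x = 15, p_y = 0.5, p_x = 0.25, n_sim = 200)

# Rejection rates at alpha = 0.05 (the Chi-square p-value is NaN when a row
# of the table is all zeros, so NAs are dropped).
colMeans(res < 0.05, na.rm = TRUE)

# Proportion of simulations in which the two tests make the same decision.
mean((res$chisq < 0.05) == (res$fisher < 0.05), na.rm = TRUE)
```

Calling the function once per setting gives one data frame of p-values per setting, which can then be summarised and plotted in the same way as the example above.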

2. Log Odds Ratio test

  1. In class we saw that the odds ratio for \(Y_i = 1\) given a grouping variable \(G_i\), is the same as the odds ratio for \(G_i = 1\) given the outcome \(Y_i\), that is

    \[ \frac{P(Y_i = 1 \mid G_i = 1)/P(Y_i = 0 \mid G_i = 1)}{P(Y_i = 1 \mid G_i = 0) / P(Y_i = 0 \mid G_i = 0)} = \frac{P(G_i = 1 \mid Y_i = 1)/P(G_i = 0 \mid Y_i = 1)}{P(G_i = 1 \mid Y_i = 0) / P(G_i = 0 \mid Y_i = 0)} \]

    Use the properties of conditional probability to derive this fact.
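Before attempting the derivation, it can help to check the identity numerically. The sketch below uses a made-up joint distribution for \((Y_i, G_i)\); the probabilities are purely illustrative and are not taken from class or from any data set. A numerical check is not a substitute for the derivation.

```r
# Made-up joint probabilities P(Y = y, G = g); rows index y, columns index g.
# matrix() fills column by column, so P(Y=0,G=0) = 0.10, P(Y=1,G=0) = 0.20,
# P(Y=0,G=1) = 0.30 and P(Y=1,G=1) = 0.40.
joint <- matrix(c(0.10, 0.20, 0.30, 0.40), nrow = 2,
                dimnames = list(y = c("0", "1"), g = c("0", "1")))

p_y_given_g <- prop.table(joint, margin = 2)  # each column sums to 1: P(Y | G)
p_g_given_y <- prop.table(joint, margin = 1)  # each row sums to 1: P(G | Y)

# Left-hand side: odds ratio for Y = 1, comparing G = 1 to G = 0.
or_y_given_g <- (p_y_given_g["1", "1"] / p_y_given_g["0", "1"]) /
                (p_y_given_g["1", "0"] / p_y_given_g["0", "0"])

# Right-hand side: odds ratio for G = 1, comparing Y = 1 to Y = 0.
or_g_given_y <- (p_g_given_y["1", "1"] / p_g_given_y["1", "0"]) /
                (p_g_given_y["0", "1"] / p_g_given_y["0", "0"])

all.equal(or_y_given_g, or_g_given_y)
```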

Consider the following table from class:

       no  yes
cats    6    9
dogs    6   14
where \[ G_i = \begin{cases} 0, & \text{subject i prefers cats} \\ 1, & \text{subject i prefers dogs} \end{cases} \quad Y_i = \begin{cases} 0, & \text{subject i didn't eat breakfast} \\ 1, & \text{subject i ate breakfast} \end{cases} \]
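If you want to work with the table in R rather than by hand, one possible way to enter it is as a matrix with named dimensions. The object name `breakfast` and the dimension names `pet` and `ate_breakfast` are choices made here for illustration only.

```r
# Counts from the table above; matrix() fills column by column,
# so the first two values are the "no" column and the last two the "yes" column.
breakfast <- matrix(c(6, 6, 9, 14), nrow = 2,
                    dimnames = list(pet = c("cats", "dogs"),
                                    ate_breakfast = c("no", "yes")))
breakfast

# Row and column totals, useful for checking the entry against the table.
addmargins(breakfast)
```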

  1. Find the sample odds ratio for \(Y_i = 1\) given the grouping variable \(G_i\). Interpret your answer in the context of the data.

  2. Show that if we consider not eating breakfast as a success (i.e. \(Y_i = 1\) if the subject didn’t eat breakfast), the sample odds ratio is the reciprocal of the sample odds ratio from the previous part.

  3. Find the log of the sample odds ratio you computed with eating breakfast as the success, and the variance of that log sample odds ratio.

  4. Use the log odds ratio and its variance from the previous part to construct a 95% confidence interval for the population log odds ratio, and for the population odds ratio.

  5. Based on your confidence interval for the population odds ratio, would you reject or fail to reject the null hypothesis \(H_0: \omega = 1\) at the 5% level?
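The calculations in the parts above can be checked with a few lines of R. The sketch below assumes the usual large-sample (Wald) interval, in which the variance of the log sample odds ratio is estimated by the sum of the reciprocals of the four cell counts; if class used a different formula, follow class. `log_or_ci` is a hypothetical helper, and the counts in the example are made up rather than taken from the cats and dogs table.

```r
# Hypothetical helper: Wald confidence interval for a log odds ratio.
# tab is a 2 x 2 table of counts with groups in rows and outcome in columns;
# column 2 is treated as the success and row 2 is compared to row 1.
log_or_ci <- function(tab, level = 0.95) {
  stopifnot(all(dim(tab) == c(2, 2)), all(tab > 0))
  log_or <- log((tab[2, 2] * tab[1, 1]) / (tab[2, 1] * tab[1, 2]))
  se <- sqrt(sum(1 / tab))              # large-sample standard error
  z <- qnorm(1 - (1 - level) / 2)       # e.g. about 1.96 for level = 0.95
  ci_log <- log_or + c(-1, 1) * z * se
  list(log_or = log_or, se = se, ci_log_or = ci_log, ci_or = exp(ci_log))
}

# Made-up counts for illustration only.
toy <- matrix(c(20, 10, 15, 30), nrow = 2,
              dimnames = list(group = c("g0", "g1"), outcome = c("no", "yes")))
res <- log_or_ci(toy)
res$ci_log_or  # interval for the log odds ratio
res$ci_or      # interval for the odds ratio (exponentiated endpoints)
```

Checking whether the interval for the odds ratio contains 1 is equivalent to checking whether the interval for the log odds ratio contains 0.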

3. Data Analysis

Consider the brfss data from previous homeworks. The variables exercising and dieting are the answers to the questions:

  • exercising: Are you using physical activity or exercise to lose weight or keep from gaining weight?

  • dieting: Are you eating either fewer calories or less fat to lose weight or keep from gaining weight?

  1. Create the \(2\times 2\) contingency table with the response to exercising in the columns and response to dieting in the rows.

  2. Which margins in the table are fixed by the study design?

  3. Use the data to estimate (make sure you also include sentences that present each estimate in context):

    • the probability a US resident is exercising to lose weight
    • the probability a US resident is exercising to lose weight, given they are dieting to lose weight
    • the difference in the probability a US resident is exercising to lose weight between those that are also dieting and those that are not
    • the odds of a US resident exercising to lose weight, given they are dieting to lose weight
    • the odds ratio of exercising to lose weight, between dieting and not dieting
  4. Find a 95% confidence interval for the difference in the probability a US resident is exercising to lose weight between those that are also dieting and those that are not.

  5. Is there an association between exercising to lose weight and dieting to lose weight? Conduct a Chi-square test and Fisher’s Exact test. Would you expect the two tests to reach the same conclusion in this setting? Write a statistical summary.
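The R functions involved in this problem can be tried out on a small example first. The sketch below uses a made-up data frame `toy`, not the brfss data, with columns named `dieting` and `exercising` to mirror the variables above; the "Yes"/"No" coding is an assumption and may differ from the coding in brfss. Using `prop.test()` for the confidence interval is one option, not necessarily the method from class.

```r
# Made-up responses for illustration only (NOT the brfss data).
toy <- data.frame(
  dieting    = rep(c("No", "Yes"), times = c(40, 60)),
  exercising = c(rep(c("No", "Yes"), times = c(25, 15)),   # not dieting
                 rep(c("No", "Yes"), times = c(20, 40)))   # dieting
)

# Contingency table: dieting in the rows, exercising in the columns.
tab <- table(dieting = toy$dieting, exercising = toy$exercising)
tab

# Conditional proportions of exercising within each dieting group (rows sum to 1).
prop.table(tab, margin = 1)

# Difference in P(exercising = Yes) between dieters and non-dieters,
# with a 95% confidence interval and no continuity correction.
pt <- prop.test(x = tab[c("Yes", "No"), "Yes"],
                n = rowSums(tab)[c("Yes", "No")],
                correct = FALSE)
pt$estimate
pt$conf.int

# Tests of association.
chisq.test(tab, correct = FALSE)$p.value
fisher.test(tab)$p.value
```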