Two sample: Binary Response
Setting: two independent samples
\(Y_1, \ldots, Y_n\) i.i.d from Bernoulli\((p_Y)\)
\(X_1, \ldots, X_m\) i.i.d from Bernoulli\((p_X)\)
Parameter: Difference in population proportions \(p_Y - p_X\)
\(p_Y = E(Y_i) =P(Y_i = 1)\)
\(p_X = E(X_i) =P(X_i = 1)\)
As a contingency table
Represent resulting data in a 2 x 2 contingency table:
| | 0 | 1 | Total |
|---|---|---|---|
| \(Y_i\) | a | b | n = a+b |
| \(X_i\) | c | d | m = c+d |
| Total | a+c | b+d | m + n |
Two sample: Binary Response - Alternate view
Setting: two independent samples
\[ \begin{aligned} (Y_1, G_1), (Y_2, G_2), \ldots,(Y_n, G_n), (Y_{n+1}, G_{n+1}), \ldots, (Y_{n+m}, G_{n+m}) \end{aligned} \]
where \(G\) is a binary grouping variable which indicates which population the observation came from: \[ G_i = \begin{cases} 0, & \text{observation from } Y \\ 1, & \text{observation from } X \end{cases} \]
As a contingency table - Alternate view
Represent resulting data in a 2 x 2 contingency table:
| | \(Y_i = 0\) | \(Y_i = 1\) | Total |
|---|---|---|---|
| \(G_i = 0\) | \(n_{11} = a\) | \(n_{12} = b\) | n = a+b = \(R_1\) |
| \(G_i = 1\) | \(n_{21} = c\) | \(n_{22} = d\) | m = c+d = \(R_2\) |
| Total | a+c = \(C_1\) | b+d = \(C_2\) | a + b + c + d = N |
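To make the alternate view concrete, here is a minimal Python sketch that cross-tabulates paired observations \((Y_i, G_i)\) into the 2 x 2 layout above. The function name `contingency_table` and the ten data points are made up for illustration; they are not from the class data.

```python
from collections import Counter


def contingency_table(y, g):
    """Cross-tabulate binary responses y against binary group labels g.

    Returns [[n11, n12], [n21, n22]]: rows are G = 0, 1 and
    columns are Y = 0, 1, matching the table layout above.
    """
    if len(y) != len(g):
        raise ValueError("y and g must have the same length")
    counts = Counter(zip(g, y))
    return [[counts[(gi, yi)] for yi in (0, 1)] for gi in (0, 1)]


# Made-up data: five observations from each group.
y = [0, 0, 1, 1, 1, 0, 1, 1, 1, 1]
g = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

table = contingency_table(y, g)
row_totals = [sum(row) for row in table]        # R_1, R_2
col_totals = [sum(col) for col in zip(*table)]  # C_1, C_2
n_total = sum(row_totals)                       # N
print(table, row_totals, col_totals, n_total)
# [[2, 3], [1, 4]] [5, 5] [3, 7] 10
```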
Two views are equivalent
The two views are equivalent if we are interested in the response variable given the group.
First view: I sample 40 OSU graduate students and 20 OSU undergraduate students:
- \(Y_i\) = did graduate student \(i\) vote in 2016?, \(i = 1, \ldots, 40\)
- \(X_i\) = did undergraduate student \(i\) vote in 2016?, \(i = 1, \ldots, 20\)
Second view: I sample 60 OSU students and record:
- \(Y_i\) = did student \(i\) vote in 2016?, \(i = 1, \ldots, 60\)
- \(G_i\) = student’s level (0 = graduate, 1 = undergraduate), \(i = 1, \ldots, 60\)
Inference focuses on:
Comparing \(P(Y_i = 1)\) and \(P(X_i = 1)\) - first view
Comparing \(P(Y_i = 1 | G_i = 0)\) and \(P(Y_i = 1 | G_i = 1)\) - second view
Ways to compare two proportions
\(Y_1, \ldots, Y_n\) i.i.d from Bernoulli\((p_Y)\)
\(X_1, \ldots, X_m\) i.i.d from Bernoulli\((p_X)\)
Typical null hypothesis: \(H_0: p_Y = p_X\)
Difference in population proportions: \(p_Y - p_X\)
- \(H_0: p_Y - p_X = 0\)
Relative risk: \(p_Y/p_X\)
- \(H_0: p_Y/p_X = 1\)
Odds ratio: \(\frac{p_Y}{1-p_Y}/\frac{p_X}{1-p_X}\)
- \(H_0: \frac{p_Y}{1-p_Y}/\frac{p_X}{1-p_X} = 1\)
Example
(From `class_data`, assuming you are like a random sample from a larger population)
\(G_i\): Do you prefer cats or dogs?
\(Y_i\): Did you eat breakfast this morning?
## ate_breakfast
## cat_dog no yes
## cats 6 9
## dogs 6 14
Your turn: Fill in the table margins
Estimates
Probability of eating breakfast, given you prefer cats: \[ p_Y = P(Y_i = 1 | G_i = 0) = \frac{P(Y_i = 1 \, \& \, G_i = 0)}{P(G_i = 0)} \] Estimate \[ \hat{p}_{Y} = \frac{b/N}{R_1/N} = \frac{9}{15} = 0.6 \]
Probability of eating breakfast, given you prefer dogs: \[ p_X = P(Y_i = 1 | G_i = 1) = \frac{P(Y_i = 1 \, \& \, G_i = 1)}{P(G_i = 1)} \]
Estimate \[ \hat{p}_{X} = \frac{d/N}{R_2/N} = \frac{14}{20} = 0.7 \]
Estimates
Difference in proportions \[ \hat{p}_Y - \hat{p}_X = 0.6 - 0.7 = -0.1 \]
Relative Risk \[ \frac{\hat{p}_Y}{\hat{p}_X} = \frac{0.6}{0.7} = 0.86 \]
Odds Ratio \[ \frac{\hat{p}_Y}{1-\hat{p}_Y}/\frac{\hat{p}_X}{1-\hat{p}_X} = \frac{0.6}{1-0.6}/\frac{0.7}{1-0.7} = 0.64 = \frac{bc}{ad} \]
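The three estimates can be checked with a short Python sketch. The function name `effect_measures` is hypothetical; the cell labels a, b, c, d follow the contingency table notation, and the sketch assumes no cell makes a denominator zero.

```python
def effect_measures(a, b, c, d):
    """Return (difference, relative risk, odds ratio) for a 2 x 2 table.

    a, b: counts of Y = 0 and Y = 1 in group 0 (first row)
    c, d: counts of Y = 0 and Y = 1 in group 1 (second row)
    """
    p_y = b / (a + b)  # estimate of P(Y = 1 | G = 0)
    p_x = d / (c + d)  # estimate of P(Y = 1 | G = 1)
    difference = p_y - p_x
    relative_risk = p_y / p_x
    odds_ratio = (p_y / (1 - p_y)) / (p_x / (1 - p_x))
    return difference, relative_risk, odds_ratio


# Breakfast table: cats (6 no, 9 yes), dogs (6 no, 14 yes).
difference, relative_risk, odds_ratio = effect_measures(6, 9, 6, 14)
print(round(difference, 2), round(relative_risk, 2), round(odds_ratio, 2))
# -0.1 0.86 0.64
```

The odds ratio agrees with the cross-product form \(bc/(ad) = (9 \times 6)/(6 \times 14)\).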
Two sample Z-test of proportions
(Comes from considering proportion as mean and looking at two sample Z-test)
Null hypothesis: \(H_0: p_Y = p_X\)
\[ Z = \frac{\hat{p}_Y - \hat{p}_X}{\sqrt{\hat{p}_c(1 - \hat{p}_c) \left(\frac{1}{n} + \frac{1}{m}\right)}} \]
where \(\hat{p}_c = \frac{n\hat{p}_Y + m\hat{p}_X}{n + m} = \frac{b + d}{N}\) is the pooled sample proportion.
When null is true \(Z\) has a N(0, 1) distribution.
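A minimal Python sketch of this test statistic, using only the standard library. The function name `two_proportion_z_test` and its argument names are assumptions for illustration; the two-sided p-value uses the identity \(2(1 - \Phi(|z|)) = \operatorname{erfc}(|z|/\sqrt{2})\).

```python
import math


def two_proportion_z_test(successes_y, n, successes_x, m):
    """Pooled two-sample Z-test of H0: p_Y = p_X.

    Returns (z, two-sided p-value).
    """
    p_y = successes_y / n
    p_x = successes_x / m
    # Pooled proportion: all successes over all observations.
    p_c = (successes_y + successes_x) / (n + m)
    se = math.sqrt(p_c * (1 - p_c) * (1 / n + 1 / m))
    z = (p_y - p_x) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Breakfast data: 9 of 15 cat people and 14 of 20 dog people ate breakfast.
z, p_value = two_proportion_z_test(9, 15, 14, 20)
print(round(z, 2), round(p_value, 2))
# -0.62 0.54
```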
Confidence interval for difference in proportions
\((1- \alpha)100\%\) CI: \[ \hat{p}_Y - \hat{p}_X \pm z_{1-\alpha/2} \sqrt{\frac{\hat{p}_Y(1 - \hat{p}_Y)}{n} + \frac{\hat{p}_X(1 - \hat{p}_X)}{m}} \]
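As in the one-sample case, the Z-test and the CI may not agree, because they use different estimates of the variance of the difference in sample proportions: the test pools the two samples under \(H_0\), while the CI estimates each proportion separately.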
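The interval formula can be sketched in Python as follows. The function name `diff_proportion_ci` and the `level` parameter are assumptions for illustration; the critical value comes from `statistics.NormalDist` in the standard library (Python 3.8 or later).

```python
import math
from statistics import NormalDist


def diff_proportion_ci(successes_y, n, successes_x, m, level=0.95):
    """Unpooled (Wald) confidence interval for p_Y - p_X."""
    p_y = successes_y / n
    p_x = successes_x / m
    # Unpooled standard error: each sample contributes its own variance.
    se = math.sqrt(p_y * (1 - p_y) / n + p_x * (1 - p_x) / m)
    z_crit = NormalDist().inv_cdf(1 - (1 - level) / 2)  # z_{1 - alpha/2}
    diff = p_y - p_x
    return diff - z_crit * se, diff + z_crit * se


# 9 successes out of 15 in one sample, 14 out of 20 in the other.
lower, upper = diff_proportion_ci(9, 15, 14, 20)
print(lower, upper)
```

Note that the standard error here does not use the pooled proportion, which is why the interval and the test can disagree near the rejection boundary.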
Your Turn
\(\hat{p}_c = \frac{n\hat{p}_Y + m\hat{p}_X}{n + m} = \frac{b + d}{N}\)
What is \(\hat{p}_c\) for our table?
## ate_breakfast
## cat_dog no yes Sum
## cats 6 9 15
## dogs 6 14 20
## Sum 12 23 35
Example: Z-stat
\[ Z = \frac{\hat{p}_Y - \hat{p}_X}{\sqrt{\hat{p}_c(1 - \hat{p}_c) \left(\frac{1}{n} + \frac{1}{m}\right)}} = \frac{-0.1}{\sqrt{0.66(1 - 0.66)(\frac{1}{15} + \frac{1}{20})}} = -0.62 \]
Compare to \(z_{1-\alpha/2} = 1.96\): since \(|Z| = 0.62 < 1.96\), we fail to reject \(H_0\) at \(\alpha = 0.05\).
p-value (for two-sided alternative) = 0.54
95% confidence interval: \[ \begin{aligned} \hat{p}_Y - \hat{p}_X \pm z_{1-\alpha/2} \sqrt{\frac{\hat{p}_Y(1 - \hat{p}_Y)}{n} + \frac{\hat{p}_X(1 - \hat{p}_X)}{m}} \\ = -0.1 \pm 1.96 \sqrt{\frac{0.6(1 - 0.6)}{15} + \frac{0.7(1 - 0.7)}{20}} \\ = (-0.42, 0.22) \end{aligned} \]
Pearson’s Chi-squared Test
\(H_0: p_Y - p_X = 0\)
\[ X = \sum_{j,k = 1, 2} \frac{(O_{jk} - E_{jk})^2}{E_{jk}} \]
\(O_{jk} = n_{jk}\)
\(E_{jk} = \frac{R_j C_k}{N}\)
If null is true, \(X\) has \(\chi^2_{1}\) distribution
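A minimal Python sketch of the statistic for a 2 x 2 table, without a continuity correction. The function name `pearson_chi_squared` is hypothetical, and the p-value line is valid only for one degree of freedom, where \(P(\chi^2_1 > x) = \operatorname{erfc}(\sqrt{x/2})\).

```python
import math


def pearson_chi_squared(table):
    """Pearson chi-squared statistic for a 2 x 2 table of counts.

    Returns (statistic, expected counts, p-value on 1 degree of freedom).
    """
    row_totals = [sum(row) for row in table]        # R_j
    col_totals = [sum(col) for col in zip(*table)]  # C_k
    n_total = sum(row_totals)                       # N
    # E_jk = R_j * C_k / N
    expected = [[r * c / n_total for c in col_totals] for r in row_totals]
    statistic = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, expected, p_value


# Breakfast table: rows are cats, dogs; columns are no, yes.
statistic, expected, p_value = pearson_chi_squared([[6, 9], [6, 14]])
print(round(statistic, 2), round(p_value, 2))
# 0.38 0.54
```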
Example: Chi-squared test
## ate_breakfast
## cat_dog no yes Sum
## cats 6 9 15
## dogs 6 14 20
## Sum 12 23 35
## no yes
## cats 5.14 9.86
## dogs 6.86 13.14
E.g., \(E_{11} = \frac{R_1 C_1}{N} = \frac{15\times 12}{35} = 5.14\)
Example: Chi-squared test
\[ \begin{aligned} X &= \frac{(6 - 5.14)^2}{5.14} + \frac{(9 - 9.86)^2}{9.86} + \frac{(6 - 6.86)^2}{6.86} + \frac{(14 - 13.14)^2}{13.14} \\ &= 0.38 \end{aligned} \]
Compare to \(\chi^2_1(1 - \alpha)= 3.84\): since \(X = 0.38 < 3.84\), we fail to reject \(H_0\) at \(\alpha = 0.05\).
p-value = 0.54
Summary
Pearson’s Chi-squared test for homogeneity of proportions across groups is equivalent (i.e. results in the same p-value) to the Z-test for proportions (when there are two groups).
\(X = Z^2\)
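The identity can be checked numerically on the breakfast table. This is a sketch using only the Python standard library; the variable names are chosen for illustration.

```python
import math

# Breakfast table cells in the a, b, c, d notation.
a, b, c, d = 6, 9, 6, 14
n, m = a + b, c + d
N = n + m

# Pooled Z statistic.
p_y, p_x = b / n, d / m
p_c = (b + d) / N
z = (p_y - p_x) / math.sqrt(p_c * (1 - p_c) * (1 / n + 1 / m))

# Pearson chi-squared statistic (no continuity correction).
observed = [[a, b], [c, d]]
rows, cols = [n, m], [a + c, b + d]
x = sum(
    (observed[j][k] - rows[j] * cols[k] / N) ** 2 / (rows[j] * cols[k] / N)
    for j in range(2)
    for k in range(2)
)

# Two-sided normal p-value and chi-squared (1 df) p-value.
p_from_z = math.erfc(abs(z) / math.sqrt(2))
p_from_x = math.erfc(math.sqrt(x / 2))

print(round(z ** 2, 4), round(x, 4))
# 0.3804 0.3804
```

Because \(X = Z^2\), the two p-values coincide as well, up to floating-point error.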