# Two sample: Binary Response ST551 Lecture 21

## Two sample: Binary Response

Setting: two independent samples

$$Y_1, \ldots, Y_n$$ i.i.d from Bernoulli$$(p_Y)$$
$$X_1, \ldots, X_m$$ i.i.d from Bernoulli$$(p_X)$$

Parameter: Difference in population proportions $$p_Y - p_X$$

$$p_Y = E(Y_i) =P(Y_i = 1)$$
$$p_X = E(X_i) =P(X_i = 1)$$

## As a contingency table

Represent resulting data in a 2 x 2 contingency table:

0 1 Total
$$Y_i$$ a b n = a+b
$$X_i$$ c d m = c+d
Total a+c b+d m + n

## Two sample: Binary Response - Alternate view

Setting: two independent samples

\begin{aligned} (Y_1, G_1), (Y_2, G_2), \ldots,(Y_n, G_n), (Y_{n+1}, G_{n+1}), \ldots, (Y_{n+m}, G_{n+m}) \end{aligned}

where $$G$$ is a binary grouping variable which indicates which population the observation came from: $G_i = \begin{cases} 0, & \text{observation from } Y \\ 1, & \text{observation from } X \end{cases}$

## As a contingency table - Alternate view

Represent resulting data in a 2 x 2 contingency table:

$$Y_i = 0$$ $$Y_i = 1$$ Total
$$G_i = 0$$ $$n_{11} = a$$ $$n_{12} = b$$ n = a+b = $$R_1$$
$$G_i = 1$$ $$n_{21} = c$$ $$n_{22} = d$$ m = c+d = $$R_2$$
Total a+c = $$C_1$$ b+d=$$C_2$$ a + b + c + d = N

## Two views are equivalent

If we are interested in the response variable given the group.

• I sample 40 OSU graduate students and 20 OSU undergraduate students:

• $$Y_i$$ = graduate student, did you vote in 2016? $$i = 1, \ldots, 40$$
• $$X_i$$ = undergraduate student did you vote in 2016? $$i = 1, \ldots, 20$$
• I sample 60 OSU students and record:

• $$Y_i$$ = did you vote in 2016?, $$i = 1, \ldots, 60$$
• $$G_i$$ = student’s level (0 = graduate, 1 = undergraduate), $$i = 1, \ldots, 60$$

Inference focuses on:

Comparing $$P(Y_i = 1)$$ and $$P(X_i = 1)$$ - first view
Comparing $$P(Y_i = 1 | G_i = 0)$$ and $$P(Y_i = 1 | G_i = 1)$$ - second view

## Ways to compare two proportions

$$Y_1, \ldots, Y_n$$ i.i.d from Bernoulli$$(p_Y)$$
$$X_1, \ldots, X_m$$ i.i.d from Bernoulli$$(p_X)$$

Typical null hypothesis: $$H_0: p_Y = p_X$$)

Difference in population proportions: $$p_Y - p_X$$

• $$H_0: p_Y - p_X = 0$$

Relative risk: $$p_Y/p_X$$

• $$H_0: p_Y/p_X = 1$$

Odds ratio: $$\frac{p_Y}{1-p_Y}/\frac{p_X}{1-p_X}$$

• $$H_0: \frac{p_Y}{1-p_Y}/\frac{p_X}{1-p_X} = 1$$

## Example

(From class_data, assuming you are like some random sample from a larger population)

$$G_i$$: Do you prefer cats or dogs? $$Y_i$$: Did you east breakfast this morning?

##        ate_breakfast
## cat_dog no yes
##    cats  6   9
##    dogs  6  14

Your turn: Fill in the table margins

## Estimates

Probability of eating breakfast, given you prefer cats: $p_Y = P(Y_i = 1 | G_i = 0) = \frac{P(Y_i = 1 \, \& \, G_i = 0)}{P(G_i = 0)}$ Estimate $\hat{p}_{Y} = \frac{b/N}{R_1/N} = \frac{9}{15} = 0.6$

Probability of eating breakfast, given you prefer dogs: $p_X = P(Y_i = 1 | G_i = 1) = \frac{P(Y_i = 1 \, \& \, G_i = 1)}{P(G_i = 1)}$

Estimate $\hat{p}_{X} = \frac{d/N}{R_2/N} = \frac{14}{20} = 0.7$

## Estimates

Difference in proportions $\hat{p}_Y - \hat{p}_X = 0.6 - 0.7 = -0.1$

Relative Risk $\frac{\hat{p}_Y}{\hat{p}_X} = \frac{0.6}{0.7} = 0.86$

Odds Ratio $\frac{\hat{p}_Y}{1-\hat{p}_Y}/\frac{\hat{p}_X}{1-\hat{p}_X} = \frac{0.6}{1-0.6}/\frac{0.7}{1-0.7} = 0.64 = \frac{bc}{ad}$

## Two sample Z-test of proportions

(Comes from considering proportion as mean and looking at two sample Z-test)

Null hypothesis: $$H_0: p_Y = p_X$$

$Z = \frac{\hat{p}_Y - \hat{p}_X}{\sqrt{\hat{p}_c(1 - \hat{p}_c) \left(\frac{1}{n} + \frac{1}{m}\right)}}$

where $$p_c = \frac{( np_Y + mp_X)}{n + m} = \frac{b + d}{N}$$

When null is true $$Z$$ has a N(0, 1) distribution.

## Confidence interval for difference in proportions

$$(1- \alpha)100\%$$ CI: $\hat{p}_Y - \hat{p}_X \pm z_{1-\alpha/2} \sqrt{\frac{\hat{p}_Y(1 - \hat{p}_Y)}{n} + \frac{\hat{p}_X(1 - \hat{p}_X)}{m}}$

Like in one sample case, binomial test and CI may not agree because they use different estimates of the variance of the difference in sample proportions.

$$p_c = \frac{( np_Y + mp_X)}{n + m} = \frac{b + d}{N}$$

What is $$p_c$$ for our table?

##        ate_breakfast
## cat_dog no yes Sum
##    cats  6   9  15
##    dogs  6  14  20
##    Sum  12  23  35

## Example: Z-stat

$Z = \frac{\hat{p}_Y - \hat{p}_X}{\sqrt{\hat{p}_c(1 - \hat{p}_c) \left(\frac{1}{n} + \frac{1}{m}\right)}} = \frac{-0.1}{\sqrt{0.66(1 - 0.66)(\frac{1}{15} + \frac{1}{20})}} = -0.62$

Compare to $$z_{1-\alpha/2} = 1.96$$

p-value (for two sided alternative) =0.54

95% confidence interval: \begin{aligned} \hat{p}_Y - \hat{p}_X \pm z_{1-\alpha/2} \sqrt{\frac{\hat{p}_Y(1 - \hat{p}_Y)}{n} + \frac{\hat{p}_X(1 - \hat{p}_X)}{m}} \\ = -0.1 \pm \sqrt{\frac{0.6(1 - 0.6)}{15} + \frac{0.7(1 - 0.7)}{20}} \\ = (-0.26, 0.06) \end{aligned}

## Pearson’s Chi-squared Test

$$H_0: p_Y - p_X = 0$$

$X = \sum_{j,k = 1, 2} \frac{(O_{jk} - E_{jk})^2}{E_{jk}}$

$$O_{jk} = n_{jk}$$

$$E_{jk} = \frac{R_j C_k}{N}$$

If null is true, $$X$$ has $$\chi^2_{1}$$ distribution

## Example: Chi-squared test

##        ate_breakfast
## cat_dog no yes Sum
##    cats  6   9  15
##    dogs  6  14  20
##    Sum  12  23  35
##        no   yes
## cats 5.14  9.86
## dogs 6.86 13.14

E.g $$\frac{15\times 12}{35} = 5.14$$

## Example: Chi-squared test

\begin{aligned} X &= \frac{(6 - 5.14)^2}{5.14} + \frac{(9 - 9.86)^2}{9.86} + \frac{(6 - 6.86)^2}{6.86} + \frac{(14 - 13.14)^2}{13.14} \\ &= 0.38 \end{aligned}

Compare to $$\chi^2_1(1 - \alpha)= 3.84$$

p-value: =0.54

## Summary

Pearson’s Chi-squared test for homogeneity of proportions across groups is equivalent (i.e. results in the same p-value) to the Z-test for proportions (when there are two groups).

$$X = Z^2$$