# Two sample inference

## Two sample setting

Setting: two independent samples

$$Y_1, \ldots, Y_n$$ i.i.d. from a population with c.d.f. $$F_Y$$, and
$$X_1, \ldots, X_m$$ i.i.d. from a population with c.d.f. $$F_X$$

Parameter: now focus on some comparison between the two populations $$F_Y$$ and $$F_X$$

## Alternative view

Setting: two independent samples

$(Y_1, G_1), (Y_2, G_2), \ldots, (Y_n, G_n), (Y_{n+1}, G_{n+1}), \ldots, (Y_{n+m}, G_{n+m})$

where $$G$$ is a binary grouping variable which indicates which population the observation came from: $G_i = \begin{cases} 0, & \text{observation from } Y \\ 1, & \text{observation from } X \end{cases}$
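As a minimal sketch of how the two views relate, the snippet below stacks two samples into (response, group) pairs and then splits them back apart. The data values and the names `y`, `x`, and `pairs` are made up for illustration and are not from the course material.

```python
# Hypothetical measurements; the values are invented for illustration only.
y = [8.1, 7.4, 9.0]  # sample from population Y (n = 3)
x = [7.9, 8.6]       # sample from population X (m = 2)

# Alternative view: one stacked sample of (Y_i, G_i) pairs,
# with G = 0 for observations from Y and G = 1 for observations from X.
pairs = [(v, 0) for v in y] + [(v, 1) for v in x]

# Splitting on the grouping variable recovers the original two samples.
y_back = [v for v, g in pairs if g == 0]
x_back = [v for v, g in pairs if g == 1]
```

Nothing is lost in either direction: the stacked pairs carry the same information as the two separate samples.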

## Two views are equivalent

Depending on the sampling scheme, one view may seem more natural:

• I sample 40 OSU graduate students and 20 OSU undergraduate students:
  • $$Y_i$$ = graduate student time to complete 1 mile run, $$i = 1, \ldots, 40$$
  • $$X_i$$ = undergraduate student time to complete 1 mile run, $$i = 1, \ldots, 20$$

• I sample 60 OSU students and record:
  • $$Y_i$$ = time to complete 1 mile run, $$i = 1, \ldots, 60$$
  • $$G_i$$ = student’s level (0 = graduate, 1 = undergraduate), $$i = 1, \ldots, 60$$

In the second view, if we condition on the counts in each group, inference is the same as in the first view.

## Two sample inference for difference in population means

To compare population means: $$\mu_Y = E(Y_i)$$, $$\mu_X = E(X_i)$$, we might look at their difference:

$\delta = \mu_Y - \mu_X$

(In alternative view: equivalent to $$\delta = E(Y_i \,| \, G_i = 0) - E(Y_i \, | \, G_i = 1)$$)

• Estimate for $$\delta$$
• Test for $$H_0: \delta = \delta_0$$
• Confidence interval for $$\delta$$

## Difference in sample means

It seems reasonable to use:

$\hat{\delta} = \overline{Y} - \overline{X}$ as a good starting point for inference on $$\delta = \mu_Y - \mu_X$$.

Complete worksheet (Charlotte will provide)
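A short derivation sketches why this estimator is a sensible starting point. It assumes only the setting above (independent samples, i.i.d. within each sample) plus finite population variances, written here as $$\sigma_Y^2 = Var(Y_i)$$ and $$\sigma_X^2 = Var(X_i)$$. By linearity of expectation:

$E(\hat{\delta}) = E(\overline{Y}) - E(\overline{X}) = \mu_Y - \mu_X = \delta$

and, because the two samples are independent:

$Var(\hat{\delta}) = Var(\overline{Y}) + Var(\overline{X}) = \frac{\sigma_Y^2}{n} + \frac{\sigma_X^2}{m}$

So $$\hat{\delta}$$ is unbiased for $$\delta$$, and its variance shrinks only when both $$n$$ and $$m$$ grow.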

## Leads to two sample Z-test and intervals

Assume known population variances: $$Var(Y_i) = \sigma_Y^2$$ and $$Var(X_i) = \sigma_X^2$$.

$Z(\delta_0) = \frac{(\overline{Y} - \overline{X}) - \delta_0}{\sqrt{\sigma_Y^2/n + \sigma^2_X/m}}$

Reference distribution: If the null hypothesis $$H_0:\delta = \delta_0$$ is true, then $Z(\delta_0) \, \dot \sim \, N(0, 1)$

Rejection Regions:

• $$H_A: \delta > \delta_0$$, reject $$H_0$$ for $$Z(\delta_0) > z_{1-\alpha}$$
• $$H_A: \delta < \delta_0$$, reject $$H_0$$ for $$Z(\delta_0) < z_{\alpha}$$
• $$H_A: \delta \ne \delta_0$$, reject $$H_0$$ for $$|Z(\delta_0)| > z_{1 - \alpha/2}$$
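The test statistic and the three rejection regions above can be sketched in Python using only the standard library. The function names `two_sample_z` and `reject_h0`, and the `alternative` labels, are hypothetical choices for this illustration; the population variances are treated as known inputs, as assumed in this section.

```python
from math import sqrt
from statistics import NormalDist, mean


def two_sample_z(y, x, var_y, var_x, delta0=0.0):
    """Z(delta0) for two independent samples with known population variances."""
    n, m = len(y), len(x)
    standard_error = sqrt(var_y / n + var_x / m)
    return (mean(y) - mean(x) - delta0) / standard_error


def reject_h0(z, alpha=0.05, alternative="two-sided"):
    """Apply the rejection region that matches the alternative hypothesis."""
    std_normal = NormalDist()  # N(0, 1) reference distribution
    if alternative == "greater":    # H_A: delta > delta0
        return z > std_normal.inv_cdf(1 - alpha)
    if alternative == "less":       # H_A: delta < delta0
        return z < std_normal.inv_cdf(alpha)
    if alternative == "two-sided":  # H_A: delta != delta0
        return abs(z) > std_normal.inv_cdf(1 - alpha / 2)
    raise ValueError("alternative must be 'greater', 'less', or 'two-sided'")
```

For example, `reject_h0(two_sample_z(y, x, var_y, var_x), alpha=0.05)` carries out the two-sided test of $$H_0: \delta = 0$$ for samples `y` and `x`.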

## Leads to two sample Z-test and intervals

$$(1-\alpha)100$$% confidence interval for $$\delta = \mu_Y - \mu_X$$:

$(\overline{Y} - \overline{X}) \pm z_{1 - \alpha/2}\sqrt{\frac{\sigma_Y^2}{n} + \frac{\sigma_X^2}{m}}$
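A minimal sketch of the interval calculation, again assuming known population variances. The function name `two_sample_z_interval` is a hypothetical choice for illustration.

```python
from math import sqrt
from statistics import NormalDist, mean


def two_sample_z_interval(y, x, var_y, var_x, alpha=0.05):
    """(1 - alpha)100% interval for delta = mu_Y - mu_X, variances known."""
    n, m = len(y), len(x)
    estimate = mean(y) - mean(x)               # delta-hat
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # z_{1 - alpha/2}
    margin = z_crit * sqrt(var_y / n + var_x / m)
    return estimate - margin, estimate + margin
```

The interval is centered at $$\overline{Y} - \overline{X}$$, and it contains exactly those values of $$\delta_0$$ that the two-sided Z-test at level $$\alpha$$ would not reject.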

## Next time…

What if population variances aren’t known?