Finish last time’s slides
Two sample inference
Two sample setting
Setting: two independent samples
\(Y_1, \ldots, Y_n\) i.i.d. from a population with c.d.f. \(F_Y\), and
\(X_1, \ldots, X_m\) i.i.d. from a population with c.d.f. \(F_X\)
Parameter: now focus on some comparison between the two populations \(F_Y\) and \(F_X\)
Alternative view
Setting: two independent samples
\[ \begin{aligned} (Y_1, G_1), (Y_2, G_2), \ldots,(Y_n, G_n), (Y_{n+1}, G_{n+1}), \ldots, (Y_{n+m}, G_{n+m}) \end{aligned} \]
where \(G\) is a binary grouping variable that indicates which population the observation came from: \[ G_i = \begin{cases} 0, & \text{observation from population } F_Y \\ 1, & \text{observation from population } F_X \end{cases} \]
Two views are equivalent
Depending on sampling scheme one view may seem more natural:
I sample 40 OSU graduate students and 20 OSU undergraduate students:
- \(Y_i\) = graduate student time to complete 1 mile run, \(i = 1, \ldots, 40\)
- \(X_i\) = undergraduate student time to complete 1 mile run, \(i = 1, \ldots, 20\)
I sample 60 OSU students and record:
- \(Y_i\) = time to complete 1 mile run, \(i = 1, \ldots, 60\)
- \(G_i\) = student’s level (0 = graduate, 1 = undergraduate), \(i = 1, \ldots, 60\)
In the second view, if we condition on the counts in each group, inference is the same as in the first view.
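To make the equivalence of the two views concrete, here is a minimal Python sketch that converts between them. The function names `to_grouped_view` and `to_two_sample_view` and the run times are hypothetical, made up only for illustration.

```python
def to_grouped_view(y, x):
    """Stack two samples into (value, G) pairs: G = 0 for the Y sample, G = 1 for the X sample."""
    return [(value, 0) for value in y] + [(value, 1) for value in x]


def to_two_sample_view(pairs):
    """Split (value, G) pairs back into the two separate samples."""
    y = [value for value, g in pairs if g == 0]
    x = [value for value, g in pairs if g == 1]
    return y, x


# Hypothetical 1 mile run times in minutes (not real data).
grad_times = [9.8, 10.4, 11.1]
undergrad_times = [8.9, 9.2]

pairs = to_grouped_view(grad_times, undergrad_times)
y, x = to_two_sample_view(pairs)
print(pairs)
print(len(y), len(x))  # the group counts n and m that we condition on
```

Going from one view to the other and back recovers the original samples, so any statistic computed from \((Y, X)\) can equally be computed from the \((Y_i, G_i)\) pairs.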
Two sample inference for difference in population means
To compare population means: \(\mu_Y = E(Y_i)\), \(\mu_X = E(X_i)\), we might look at their difference:
\[ \delta = \mu_Y - \mu_X \]
(In alternative view: equivalent to \(\delta = E(Y_i \,| \, G_i = 0) - E(Y_i \, | \, G_i = 1)\))
- Estimate for \(\delta\)
- Test for \(H_0: \delta = \delta_0\)
- Confidence interval for \(\delta\)
Difference in sample means
It seems reasonable to use:
\[ \hat{\delta} = \overline{Y} - \overline{X} \] as a good starting point for inference on \(\delta = \mu_Y - \mu_X\).
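One way to see why \(\hat{\delta}\) is a reasonable starting point is a small simulation. The sketch below assumes normal populations with made-up parameter values (all of them hypothetical) and checks that the estimates centre on \(\delta\), with variance close to \(\sigma_Y^2/n + \sigma_X^2/m\), which is what independence of the two samples implies.

```python
import random
from statistics import mean, variance


def diff_in_means(y, x):
    """Point estimate of delta = mu_Y - mu_X."""
    return mean(y) - mean(x)


def simulate_estimates(mu_y, sigma_y, n, mu_x, sigma_x, m, reps=5000, seed=0):
    """Draw `reps` pairs of independent normal samples and return the estimates."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(reps):
        y = [rng.gauss(mu_y, sigma_y) for _ in range(n)]
        x = [rng.gauss(mu_x, sigma_x) for _ in range(m)]
        estimates.append(diff_in_means(y, x))
    return estimates


# Hypothetical population values, chosen only for illustration.
estimates = simulate_estimates(mu_y=9.0, sigma_y=2.0, n=40,
                               mu_x=8.0, sigma_x=1.5, m=20)

print("mean of estimates:    ", mean(estimates))      # should be near delta = 1.0
print("variance of estimates:", variance(estimates))  # should be near the value below
print("sigma_Y^2/n + sigma_X^2/m:", 2.0**2 / 40 + 1.5**2 / 20)
```

The simulation only illustrates the behaviour of \(\hat{\delta}\); it does not replace working out its expectation and variance by hand.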
Complete worksheet (Charlotte will provide)
Leads to two sample Z-test and intervals
Assume known population variances: \(Var(Y_i) = \sigma_Y^2\) and \(Var(X_i) = \sigma_X^2\).
\[ Z(\delta_0) = \frac{(\overline{Y} - \overline{X}) - \delta_0}{\sqrt{\sigma_Y^2/n + \sigma^2_X/m}} \]
Reference Distribution: If the null hypothesis \(H_0:\delta = \delta_0\) is true, then \[ Z(\delta_0) \, \dot \sim \, N(0, 1) \]
Rejection Regions:
- \(H_A: \delta > \delta_0\), reject \(H_0\) for \(Z(\delta_0) > z_{1-\alpha}\)
- \(H_A: \delta < \delta_0\), reject \(H_0\) for \(Z(\delta_0) < z_{\alpha}\)
- \(H_A: \delta \ne \delta_0\), reject \(H_0\) for \(|Z(\delta_0)| > z_{1 - \alpha/2}\)
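The test statistic and the three rejection regions above can be sketched in Python using only the standard library (`statistics.NormalDist` needs Python 3.8 or later). The function name `two_sample_z_test`, its argument names, and the data are hypothetical; the standard deviations are treated as known, as the Z-test assumes.

```python
from math import sqrt
from statistics import NormalDist, mean


def two_sample_z_test(y, x, sigma_y, sigma_x, delta0=0.0, alpha=0.05,
                      alternative="two-sided"):
    """Return (Z(delta0), reject) for H0: delta = delta0 with known variances."""
    n, m = len(y), len(x)
    se = sqrt(sigma_y**2 / n + sigma_x**2 / m)  # standard deviation of Ybar - Xbar
    z = (mean(y) - mean(x) - delta0) / se
    std_normal = NormalDist()  # N(0, 1) reference distribution
    if alternative == "greater":        # H_A: delta > delta0
        reject = z > std_normal.inv_cdf(1 - alpha)
    elif alternative == "less":         # H_A: delta < delta0
        reject = z < std_normal.inv_cdf(alpha)
    elif alternative == "two-sided":    # H_A: delta != delta0
        reject = abs(z) > std_normal.inv_cdf(1 - alpha / 2)
    else:
        raise ValueError("alternative must be 'greater', 'less', or 'two-sided'")
    return z, reject


# Hypothetical run times (minutes) and assumed-known standard deviations.
grad = [9.8, 10.4, 11.1, 9.5]
undergrad = [8.9, 9.2, 10.0]
z, reject = two_sample_z_test(grad, undergrad, sigma_y=1.5, sigma_x=1.2)
print("Z =", z, "reject H0:", reject)
```

Here `inv_cdf(p)` plays the role of the quantile \(z_p\), so the three branches correspond one-to-one to the three bullets.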
Leads to two sample Z-test and intervals
\((1-\alpha)100\)% Confidence interval for \(\delta = \mu_Y - \mu_X\)
\[ (\overline{Y} - \overline{X}) \pm z_{1 - \alpha/2}\sqrt{\frac{\sigma_Y^2}{n} + \frac{\sigma_X^2}{m}} \]
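A minimal sketch of this interval in Python, again with known standard deviations and only the standard library. The function name `two_sample_z_interval` and the data are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean


def two_sample_z_interval(y, x, sigma_y, sigma_x, alpha=0.05):
    """Return the (1 - alpha)100% confidence interval for delta = mu_Y - mu_X."""
    n, m = len(y), len(x)
    estimate = mean(y) - mean(x)
    se = sqrt(sigma_y**2 / n + sigma_x**2 / m)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # z_{1 - alpha/2}
    return estimate - z_crit * se, estimate + z_crit * se


# Hypothetical run times (minutes) and assumed-known standard deviations.
grad = [9.8, 10.4, 11.1, 9.5]
undergrad = [8.9, 9.2, 10.0]
lower, upper = two_sample_z_interval(grad, undergrad, sigma_y=1.5, sigma_x=1.2)
print("95% CI for delta:", (lower, upper))
```

The interval contains exactly those values \(\delta_0\) that the two-sided level \(\alpha\) Z-test would not reject.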
Next time…
What if population variances aren’t known?