Finish last time’s slides

Two sample inference

Two sample setting

Setting: two independent samples

\(Y_1, \ldots, Y_n\) i.i.d. from a population with c.d.f. \(F_Y\), and
\(X_1, \ldots, X_m\) i.i.d. from a population with c.d.f. \(F_X\)

Parameter: now focus on some comparison between the two populations \(F_Y\) and \(F_X\)

Alternative view

Setting: two independent samples

\[ \begin{aligned} (Y_1, G_1), (Y_2, G_2), \ldots,(Y_n, G_n), (Y_{n+1}, G_{n+1}), \ldots, (Y_{n+m}, G_{n+m}) \end{aligned} \]

where \(G\) is a binary grouping variable which indicates which population the observation came from: \[ G_i = \begin{cases} 0, & \text{observation from } Y \\ 1, & \text{observation from } X \end{cases} \]

Two views are equivalent

Depending on the sampling scheme, one view may seem more natural:

I sample 40 OSU graduate students and 20 OSU undergraduate students:
- \(Y_i\) = graduate student time to complete 1 mile run, \(i = 1, \ldots, 40\)
- \(X_i\) = undergraduate student time to complete 1 mile run, \(i = 1, \ldots, 20\)
I sample 60 OSU students and record:
- \(Y_i\) = time to complete 1 mile run, \(i = 1, \ldots, 60\)
- \(G_i\) = student’s level (0 = graduate, 1 = undergraduate), \(i = 1, \ldots, 60\)

In the second view, if we condition on the counts in each group, inference is the same as in the first view.
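To make the link between the two views concrete, here is a minimal sketch that converts data recorded as \((Y_i, G_i)\) pairs into two separate samples. The data values and the helper name `split_by_group` are made up for illustration; they are not from the lecture.

```python
# Hypothetical mile-run times (minutes) recorded in the (Y_i, G_i) view.
times = [7.5, 8.1, 6.9, 9.0, 7.2, 8.4]
group = [0, 0, 1, 0, 1, 1]  # 0 = graduate, 1 = undergraduate


def split_by_group(values, labels):
    """Return (values with label 0, values with label 1), preserving order."""
    g0 = [v for v, g in zip(values, labels) if g == 0]
    g1 = [v for v, g in zip(values, labels) if g == 1]
    return g0, g1


# y plays the role of Y_1..Y_n, x the role of X_1..X_m in the first view.
y, x = split_by_group(times, group)
n, m = len(y), len(x)
```

Once the group counts `n` and `m` are treated as fixed, `y` and `x` can be analysed exactly as in the first view.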

Two sample inference for difference in population means

To compare population means: \(\mu_Y = E(Y_i)\), \(\mu_X = E(X_i)\), we might look at their difference:

\[ \delta = \mu_Y - \mu_X \]

(In alternative view: equivalent to \(\delta = E(Y_i \,| \, G_i = 0) - E(Y_i \, | \, G_i = 1)\))

Estimate for \(\delta\)
Test for \(H_0: \delta = \delta_0\)
Confidence interval for \(\delta\)

Difference in sample means

It seems reasonable to use:

\[ \hat{\delta} = \overline{Y} - \overline{X} \] as a good starting point for inference on \(\delta = \mu_Y - \mu_X\).
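As a small illustration, the point estimate can be computed directly from the two samples. This is only a sketch: the function name `diff_in_means` and the example numbers are assumptions, not part of the lecture.

```python
from statistics import mean


def diff_in_means(y, x):
    """Point estimate of delta: sample mean of y minus sample mean of x."""
    return mean(y) - mean(x)


# Hypothetical samples of unequal size (n = 4, m = 3).
y_sample = [10.0, 12.0, 14.0, 16.0]
x_sample = [9.0, 11.0, 10.0]
delta_hat = diff_in_means(y_sample, x_sample)
```

Note that the order of subtraction matters: swapping the arguments flips the sign of the estimate.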

Complete worksheet (Charlotte will provide)

Leads to two sample Z-test and intervals

Assume known population variances: \(Var(Y_i) = \sigma_Y^2\) and \(Var(X_i) = \sigma_X^2\).

\[ Z(\delta_0) = \frac{(\overline{Y} - \overline{X}) - \delta_0}{\sqrt{\sigma_Y^2/n + \sigma^2_X/m}} \]
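One way to see where the denominator comes from, sketched using only the independence of the two samples and the i.i.d. assumption within each sample:

\[ \begin{aligned} Var(\overline{Y} - \overline{X}) &= Var(\overline{Y}) + Var(\overline{X}) \\ &= \frac{\sigma_Y^2}{n} + \frac{\sigma_X^2}{m} \end{aligned} \]

So \(Z(\delta_0)\) is the estimate minus its hypothesized value, divided by the standard deviation of the estimate.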

Reference Distribution: If null hypothesis \(H_0:\delta = \delta_0\) is true, then \[ Z(\delta_0) \, \dot \sim \, N(0, 1) \]

Rejection Regions:

\(H_A: \delta > \delta_0\), reject \(H_0\) for \(Z(\delta_0) > z_{1-\alpha}\)
\(H_A: \delta < \delta_0\), reject \(H_0\) for \(Z(\delta_0) < z_{\alpha}\)
\(H_A: \delta \ne \delta_0\), reject \(H_0\) for \(|Z(\delta_0)| > z_{1 - \alpha/2}\)
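The statistic and the three rejection rules above could be sketched as follows, using only the Python standard library. The function name `two_sample_z_test`, its argument names, and the example data are assumptions made for illustration; the variances are passed in as known quantities, matching the setting on this slide.

```python
from math import sqrt
from statistics import NormalDist, mean


def two_sample_z_test(y, x, var_y, var_x, delta0=0.0,
                      alternative="two-sided", alpha=0.05):
    """Two sample Z-test with known population variances.

    Returns (z, reject), where reject is True if H0: delta = delta0
    is rejected at level alpha against the given alternative.
    """
    n, m = len(y), len(x)
    z = (mean(y) - mean(x) - delta0) / sqrt(var_y / n + var_x / m)
    std_normal = NormalDist()  # N(0, 1) reference distribution
    if alternative == "greater":      # H_A: delta > delta0
        reject = z > std_normal.inv_cdf(1 - alpha)
    elif alternative == "less":       # H_A: delta < delta0
        reject = z < std_normal.inv_cdf(alpha)
    elif alternative == "two-sided":  # H_A: delta != delta0
        reject = abs(z) > std_normal.inv_cdf(1 - alpha / 2)
    else:
        raise ValueError("alternative must be 'greater', 'less', or 'two-sided'")
    return z, reject


# Hypothetical data with assumed known variances of 4 in each population.
y_obs = [10.0, 12.0, 14.0, 16.0]
x_obs = [9.0, 11.0, 10.0, 10.0]
z_stat, reject_h0 = two_sample_z_test(y_obs, x_obs, var_y=4.0, var_x=4.0)
```

Because the reference distribution is only approximately standard normal in general, the decision is best read as approximate unless the populations are themselves normal.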

Leads to two sample Z-test and intervals

\((1-\alpha)100\)% Confidence interval for \(\delta = \mu_Y - \mu_X\)

\[ (\overline{Y} - \overline{X}) \pm z_{1 - \alpha/2}\sqrt{\frac{\sigma_Y^2}{n} + \frac{\sigma_X^2}{m}} \]
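A minimal sketch of the interval computation, again assuming the population variances are known. The function name `two_sample_z_interval` and the example data are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean


def two_sample_z_interval(y, x, var_y, var_x, alpha=0.05):
    """(1 - alpha)100% confidence interval for delta = mu_Y - mu_X,
    assuming known population variances var_y and var_x."""
    n, m = len(y), len(x)
    center = mean(y) - mean(x)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # z_{1 - alpha/2}
    half_width = z_crit * sqrt(var_y / n + var_x / m)
    return center - half_width, center + half_width


# Hypothetical data with assumed known variances of 4 in each population.
y_obs = [10.0, 12.0, 14.0, 16.0]
x_obs = [9.0, 11.0, 10.0, 10.0]
lower, upper = two_sample_z_interval(y_obs, x_obs, var_y=4.0, var_x=4.0)
```

The interval contains exactly those values of \(\delta_0\) that the two-sided level \(\alpha\) Z-test would not reject.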

Next time…

What if population variances aren’t known?

Two sample inference ST551 Lecture 18
