Paired Data ST551 Lecture 20

Review of last week’s t-tests

Setting

Setting: two independent samples

\(Y_1, \ldots, Y_n\) i.i.d. from a population with c.d.f. \(F_Y\), and
\(X_1, \ldots, X_m\) i.i.d. from a population with c.d.f. \(F_X\)

Parameter: Difference in population means \(\mu_Y - \mu_X\)

Equal variance two sample t-test

Assume \(\sigma_X^2 = \sigma_Y^2\).

\[ t(\delta_0) = \frac{(\overline{Y} - \overline{X}) - \delta_0}{\sqrt{s_p^2 \left(\frac{1}{n} + \frac{1}{m}\right)}} \]

Compare to \(t_{(n+m-2)}\).
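As a quick sanity check, here is a minimal Python sketch (the sample sizes, means, and variances are made up for illustration) that computes the pooled statistic by hand and compares it to scipy.stats.ttest_ind with equal_var=True:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
y = rng.normal(loc=10.0, scale=2.0, size=25)   # sample of n = 25
x = rng.normal(loc=9.0, scale=2.0, size=30)    # sample of m = 30
n, m = len(y), len(x)

# Pooled variance estimate s_p^2
sp2 = ((n - 1) * y.var(ddof=1) + (m - 1) * x.var(ddof=1)) / (n + m - 2)

delta0 = 0.0
t_stat = (y.mean() - x.mean() - delta0) / np.sqrt(sp2 * (1/n + 1/m))

# scipy computes the same statistic when delta0 = 0
t_scipy, p_scipy = stats.ttest_ind(y, x, equal_var=True)
print(t_stat, t_scipy)
```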

Welch’s t-test

\(\sigma_X^2\) not necessarily equal to \(\sigma_Y^2\).

\[ t(\delta_0) = \frac{(\overline{Y} - \overline{X}) - \delta_0}{\sqrt{\frac{s_Y^2}{n} + \frac{s^2_X}{m}}} \]

Compare to \(t_{v}\), where

\[ v = \frac{(s_Y^2/n + s_X^2/m)^2}{\frac{s_Y^4}{n^2(n-1)} + \frac{s_X^4}{m^2(m-1)} } \]
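The degrees of freedom formula is easy to get wrong by hand, so a short sketch may help; the data are again simulated with hypothetical parameters, and scipy's equal_var=False option applies the same Welch correction:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
y = rng.normal(10.0, 3.0, size=25)      # deliberately unequal spreads
x = rng.normal(9.0, 1.0, size=40)
n, m = len(y), len(x)
vy, vx = y.var(ddof=1), x.var(ddof=1)   # s_Y^2 and s_X^2

t_stat = (y.mean() - x.mean()) / np.sqrt(vy/n + vx/m)

# Welch-Satterthwaite degrees of freedom v
v = (vy/n + vx/m)**2 / (vy**2 / (n**2 * (n - 1)) + vx**2 / (m**2 * (m - 1)))

p_value = 2 * stats.t.sf(abs(t_stat), df=v)
print(p_value, stats.ttest_ind(y, x, equal_var=False).pvalue)   # should agree
```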

In both cases

Let \(\textit{df}\) be the appropriate degrees of freedom for the test (\(n + m - 2\) for the equal variance test, \(v\) for Welch's).

Rejection regions:

  • \(H_A: \mu_Y - \mu_X > \delta_0\): Reject \(H_0\) for \(t(\delta_0) > t_{(df), 1 - \alpha}\)
  • \(H_A: \mu_Y - \mu_X < \delta_0\): Reject \(H_0\) for \(t(\delta_0) < t_{(df), \alpha}\)
  • \(H_A: \mu_Y - \mu_X \ne \delta_0\): Reject \(H_0\) for \(|t(\delta_0)| > t_{(df), 1 - \alpha/2}\)

Confidence intervals:

\[ \overline{Y} - \overline{X} \pm t_{(df), 1-\alpha/2} \text{SE}_{\overline{Y} - \overline{X}} \]
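To make the rejection rule and interval concrete, here is a short Python sketch using the Welch version (simulated data with hypothetical parameters, and \(\delta_0 = 0\)):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
y = rng.normal(10.0, 3.0, size=25)
x = rng.normal(9.0, 1.0, size=40)
n, m = len(y), len(x)
vy, vx = y.var(ddof=1), x.var(ddof=1)

alpha = 0.05
se = np.sqrt(vy/n + vx/m)                        # Welch standard error
df = (vy/n + vx/m)**2 / (vy**2/(n**2*(n-1)) + vx**2/(m**2*(m-1)))
t_crit = stats.t.ppf(1 - alpha/2, df=df)

diff = y.mean() - x.mean()
reject = abs(diff / se) > t_crit                 # two-sided rejection rule
ci = (diff - t_crit * se, diff + t_crit * se)    # 95% confidence interval
print(reject, ci)
```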

Paired Data

Setting

Two dependent samples

\(Y_1, \ldots, Y_n\) i.i.d. from a population with c.d.f. \(F_Y\), and
\(X_1, \ldots, X_n\) i.i.d. from a population with c.d.f. \(F_X\)

Observations come in pairs: \[ (Y_1, X_1), (Y_2, X_2), \ldots, (Y_n, X_n) \]

with joint distribution \(F_{YX}\). The observations within each pair are matched in some way (e.g., two measurements on the same person or family).

\(Cov(Y_i, X_i) = \sigma_{YX}\) and \(Cov(Y_i, X_j) = 0\) for all \(i \ne j\).

Parameter: Difference in population means \(\mu_Y - \mu_X\)
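To have concrete paired data to work with below, here is a sketch that simulates matched pairs from a bivariate normal population; all the parameter values (\(\mu_Y\), \(\mu_X\), \(\sigma_Y^2\), \(\sigma_X^2\), \(\sigma_{YX}\)) are hypothetical:

```python
import numpy as np

mu = [100.0, 98.0]                      # (mu_Y, mu_X)
cov = [[225.0, 120.0],                  # [[sigma_Y^2, sigma_YX],
       [120.0, 225.0]]                  #  [sigma_YX, sigma_X^2]]
rng = np.random.default_rng(2)
pairs = rng.multivariate_normal(mu, cov, size=50)   # n = 50 matched pairs
y, x = pairs[:, 0], pairs[:, 1]

# Y_i and X_i are correlated within a pair; different pairs are independent
print(np.corrcoef(y, x)[0, 1])          # roughly 120 / 225 = 0.53
```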

Examples

We’ve already seen examples like this:

  • From the midterm: Mother's IQ (\(Y_i\)) and Father's IQ (\(X_i\))
  • From the homework: Current weight (\(Y_i\)) and desired weight (\(X_i\))

We did one sample t-tests on the differences \(Y_i - X_i\). This works, but why?

Sampling distribution of the difference in sample means

Consider \[ D_i = Y_i - X_i \]

CLT says \[ \frac{\overline{D} - E(D_i)}{\sqrt{Var(D_i)/n}} \, \dot \sim \, N(0, 1) \]

What are \(\overline{D}\), \(E(D_i)\), and \(Var(D_i)\)?

\[ \overline{D} = \overline{Y} - \overline{X}, \qquad E(D_i) = \mu_Y - \mu_X, \qquad Var(D_i) = \sigma^2_Y + \sigma^2_X - 2\sigma_{YX} \]
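A quick Monte Carlo check of the variance formula (same hypothetical bivariate normal population as the earlier sketch, so \(Var(D_i)\) should be \(225 + 225 - 2 \cdot 120 = 210\)):

```python
import numpy as np

mu = [100.0, 98.0]
cov = [[225.0, 120.0], [120.0, 225.0]]   # hypothetical parameters
rng = np.random.default_rng(3)

pairs = rng.multivariate_normal(mu, cov, size=200_000)
d = pairs[:, 0] - pairs[:, 1]            # D_i = Y_i - X_i
print(d.var(ddof=1))                     # close to 210
print(d.mean())                          # close to mu_Y - mu_X = 2
```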

A Z-test for paired data

Suppose \(\sigma^2_Y\), \(\sigma^2_X\), and \(\sigma_{YX}\) are known.

Hypothesis Test for \(H_0: \delta = \delta_0\), where \(\delta = \mu_Y - \mu_X\)

Test Statistic: \[ Z(\delta_0) = \frac{(\overline{Y} - \overline{X}) - \delta_0}{\sqrt{\frac{\sigma^2_Y}{n} + \frac{\sigma^2_X}{n} - 2 \frac{\sigma_{YX}}{n}}} \]

Reference Distribution: Under \(H_0\), \(Z(\delta_0) \dot \sim N(0, 1)\)

Rejection Region:

  • \(H_A: \delta > \delta_0\): Reject \(H_0\) for \(z(\delta_0) > z_{1-\alpha}\)
  • \(H_A: \delta < \delta_0\): Reject \(H_0\) for \(z(\delta_0) < z_{\alpha}\)
  • \(H_A: \delta \ne \delta_0\): Reject \(H_0\) for \(|z(\delta_0)| > z_{1-\alpha/2}\)
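The test is simple enough to code directly; a minimal sketch (the function name paired_z_test and the parameter values are mine, not from the lecture):

```python
import numpy as np
from scipy import stats

def paired_z_test(y, x, var_y, var_x, cov_yx, delta0=0.0):
    """Paired Z statistic with known variances and covariance."""
    n = len(y)
    se = np.sqrt(var_y/n + var_x/n - 2*cov_yx/n)
    z = (y.mean() - x.mean() - delta0) / se
    return z, 2 * stats.norm.sf(abs(z))    # statistic and two-sided p-value

# hypothetical population from the earlier sketch
rng = np.random.default_rng(4)
pairs = rng.multivariate_normal([100.0, 98.0],
                                [[225.0, 120.0], [120.0, 225.0]], size=50)
z, p = paired_z_test(pairs[:, 0], pairs[:, 1],
                     var_y=225.0, var_x=225.0, cov_yx=120.0)
print(z, p)
```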

A Z-test for paired data

\[ Z(\delta_0) = \frac{(\overline{Y} - \overline{X}) - \delta_0}{\sqrt{\frac{\sigma^2_Y}{n} + \frac{\sigma^2_X}{n} - 2 \frac{\sigma_{YX}}{n}}} \]

Notice the test statistic is just like a two sample Z-test, but with a correction to \(Var(\overline{Y} - \overline{X})\) for the correlation between \(Y_i\) and \(X_i\).

What if population variances and covariances aren’t known?

Plug in estimates for \(\sigma_Y^2\), \(\sigma_X^2\), and \(\sigma_{YX}\).

Sample covariance: \[ \hat{\sigma}_{YX} = s_{YX} = \frac{1}{n-1} \sum_{i=1}^{n} (Y_i - \overline{Y}) (X_i - \overline{X}) \] is an unbiased estimate of \(\sigma_{YX}\).

Plugging in the estimates gives the estimated variance of \(\overline{Y} - \overline{X}\): \[ \widehat{Var}\left(\overline{D}\right) = \frac{s^2_Y}{n} + \frac{s^2_X}{n} - 2 \frac{s_{YX}}{n} \]

Compare this to the estimated \(Var(D_i)\) computed directly from the differences:

\[ s_D^2 = \frac{1}{n-1} \sum_{i=1}^{n} (D_i - \overline{D})^2 = s^2_Y + s^2_X - 2 s_{YX} \]

so \(\widehat{Var}(\overline{D}) = s_D^2 / n\) either way.
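The identity \(s_D^2 = s^2_Y + s^2_X - 2 s_{YX}\) is exact, not approximate, which a short sketch confirms (simulated data with hypothetical parameters):

```python
import numpy as np

rng = np.random.default_rng(5)
pairs = rng.multivariate_normal([100.0, 98.0],
                                [[225.0, 120.0], [120.0, 225.0]], size=50)
y, x = pairs[:, 0], pairs[:, 1]

s_yx = np.cov(y, x, ddof=1)[0, 1]        # sample covariance s_YX
s_d2 = (y - x).var(ddof=1)               # s_D^2 computed from the differences
print(s_d2, y.var(ddof=1) + x.var(ddof=1) - 2 * s_yx)   # equal up to rounding
```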

Paired data t-test

\[ t(\delta_0) = \frac{(\overline{Y} - \overline{X}) - \delta_0}{\sqrt{\frac{s^2_Y}{n} + \frac{s^2_X}{n} - 2 \frac{s_{YX}}{n}}} = \frac{\overline{D}- \delta_0}{\sqrt{\frac{s_D^2}{n}}} \]

If differences are Normal, \(t(\delta_0)\) has exactly a t-distribution with \(n-1\) degrees of freedom when the null hypothesis is true.
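In practice this means the paired t-test can be run as a one-sample test on the differences; in Python, scipy.stats.ttest_rel on the pairs and scipy.stats.ttest_1samp on the differences give identical results (simulated data with hypothetical parameters):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
pairs = rng.multivariate_normal([100.0, 98.0],
                                [[225.0, 120.0], [120.0, 225.0]], size=50)
y, x = pairs[:, 0], pairs[:, 1]
d = y - x

t1, p1 = stats.ttest_1samp(d, popmean=0.0)   # one-sample t on differences
t2, p2 = stats.ttest_rel(y, x)               # scipy's paired t-test
print(t1, t2)   # identical; reference distribution is t with n-1 df
```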

Summary

For paired samples:

  1. Take differences \(D_i = Y_i - X_i\)
  2. Perform a one-sample hypothesis test for the population mean difference \(\mu_D = \mu_Y - \mu_X\)

That is, do a one-sample t-test on the differences.

This is equivalent to estimating the population covariance and appropriately adjusting the denominator of the two-sample t-test to take this covariance into account.