Inference for difference in sample means ST551 Lecture 19

From last time

Setting: two independent samples

$Y_1, \ldots, Y_n$ i.i.d. from a population with c.d.f. $F_Y$, and
$X_1, \ldots, X_m$ i.i.d. from a population with c.d.f. $F_X$

Parameter: Difference in population means, $\mu_Y - \mu_X$

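Properties of the sampling distribution of $\overline{Y} - \overline{X}$ lead to a Z-test and associated intervals:

$$Z(\delta_0) = \frac{(\overline{Y} - \overline{X}) - \delta_0}{\sqrt{\sigma_Y^2/n + \sigma_X^2/m}}$$

With known population variances $\sigma_Y^2$, $\sigma_X^2$.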
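As a rough illustration of the Z statistic with known variances, here is a minimal Python sketch. The data values, the "known" variances, and the function name `z_statistic` are all made up for illustration and do not come from the lecture.

```python
import math
from statistics import NormalDist, fmean

def z_statistic(y, x, delta0, var_y, var_x):
    """Two-sample Z statistic, assuming the population variances are known."""
    n, m = len(y), len(x)
    se = math.sqrt(var_y / n + var_x / m)  # sd of Ybar - Xbar
    return ((fmean(y) - fmean(x)) - delta0) / se

# Hypothetical data and hypothetical known variances.
y = [5.1, 4.8, 6.2, 5.5, 5.9, 4.7]
x = [4.2, 5.0, 3.9, 4.4]
z = z_statistic(y, x, delta0=0.0, var_y=0.5, var_x=0.4)

# Two-sided p-value from the standard Normal reference distribution.
p_value = 2 * (1 - NormalDist().cdf(abs(z)))
print(f"z = {z:.3f}, two-sided p-value = {p_value:.4f}")
```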

When variances aren’t known

As in the one-sample Z-test, we proceed by substituting good estimates for the variances, then altering the reference distributions accordingly.

Two scenarios:

  • Population variances are unknown but assumed equal, $\sigma^2 = \sigma_Y^2 = \sigma_X^2$. Both samples give information about $\sigma^2$.

  • Population variances are unknown and not assumed equal.

Equal variances

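Need to use both samples to estimate $\sigma^2 = \sigma_Y^2 = \sigma_X^2$

$$s_p^2 = \hat{\sigma}^2 = \frac{\sum_{i=1}^{n} (Y_i - \overline{Y})^2 + \sum_{i=1}^{m} (X_i - \overline{X})^2}{(n-1) + (m-1)} = \frac{(n-1) s_Y^2 + (m-1) s_X^2}{n + m - 2}$$

where $s_Y^2$ and $s_X^2$ are the sample variances of the $Y_i$ and $X_i$ respectively.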

Intuition: a weighted average of the sample variances, weighted so that the larger sample contributes more to the average.
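The pooled estimate can be sketched in a few lines of Python. The data and the function name `pooled_variance` are hypothetical. The sketch also checks that the sum-of-squares form and the weighted-average form agree.

```python
from statistics import fmean, variance

def pooled_variance(y, x):
    """Pooled variance: sample variances weighted by their degrees of freedom."""
    n, m = len(y), len(x)
    return ((n - 1) * variance(y) + (m - 1) * variance(x)) / (n + m - 2)

# Hypothetical data.
y = [5.1, 4.8, 6.2, 5.5, 5.9, 4.7]
x = [4.2, 5.0, 3.9, 4.4]

# Same quantity written as pooled sums of squared deviations.
ss_y = sum((yi - fmean(y)) ** 2 for yi in y)
ss_x = sum((xi - fmean(x)) ** 2 for xi in x)
sp2_from_sums = (ss_y + ss_x) / ((len(y) - 1) + (len(x) - 1))

sp2 = pooled_variance(y, x)
print(f"s_Y^2 = {variance(y):.4f}, s_X^2 = {variance(x):.4f}, s_p^2 = {sp2:.4f}")
```

Because it is a weighted average, the pooled value always lies between the two sample variances.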

Plugging in to Z-stat

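Hypothesis: $H_0: \mu_Y - \mu_X = \delta_0$

Assumption: $\sigma_Y^2 = \sigma_X^2$

Leads to test statistic:

$$t(\delta_0) = \frac{(\overline{Y} - \overline{X}) - \delta_0}{\sqrt{s_p^2/n + s_p^2/m}} = \frac{(\overline{Y} - \overline{X}) - \delta_0}{\sqrt{s_p^2 \left(\frac{1}{n} + \frac{1}{m}\right)}} = \frac{(\overline{Y} - \overline{X}) - \delta_0}{s_p \sqrt{\frac{1}{n} + \frac{1}{m}}}$$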

Leads to equal variance t-test

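Compare $t(\delta_0)$ to a t-distribution with $n + m - 2$ degrees of freedom.

Also leads to CI of form:

$$(\overline{Y} - \overline{X}) \pm t_{(n+m-2),\, 1-\alpha/2} \sqrt{s_p^2 \left(\frac{1}{n} + \frac{1}{m}\right)}$$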

This distribution is exact if the populations are Normal.

Asymptotically exact otherwise.

For large sample sizes, the choice between the t and the standard Normal reference distribution doesn’t make much difference: $t_{m+n-2} \to z$ as $n + m - 2 \to \infty$.
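The equal variance t statistic and its interval can be sketched as follows. The data and function names are hypothetical. Python's standard library has no t quantile function, so the critical value is passed in by the caller. Here it is taken from a t table.

```python
import math
from statistics import fmean, variance

def pooled_t_test(y, x, delta0=0.0):
    """Return (t statistic, degrees of freedom, standard error), assuming equal variances."""
    n, m = len(y), len(x)
    sp2 = ((n - 1) * variance(y) + (m - 1) * variance(x)) / (n + m - 2)
    se = math.sqrt(sp2 * (1 / n + 1 / m))
    t_stat = ((fmean(y) - fmean(x)) - delta0) / se
    return t_stat, n + m - 2, se

def pooled_ci(y, x, t_crit):
    """Interval (Ybar - Xbar) +/- t_crit * se; t_crit is supplied by the caller."""
    _, _, se = pooled_t_test(y, x)
    diff = fmean(y) - fmean(x)
    return diff - t_crit * se, diff + t_crit * se

# Hypothetical data with n = 6 and m = 4, so n + m - 2 = 8.
y = [5.1, 4.8, 6.2, 5.5, 5.9, 4.7]
x = [4.2, 5.0, 3.9, 4.4]
t_stat, df, se = pooled_t_test(y, x, delta0=0.0)

# 2.306 is the tabulated 0.975 quantile of a t-distribution with 8 degrees of freedom.
lower, upper = pooled_ci(y, x, t_crit=2.306)
print(f"t = {t_stat:.3f} on {df} df, 95% CI = ({lower:.3f}, {upper:.3f})")
```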

Equal variance assumption: What can go wrong?

Compare $E(s_p^2/n + s_p^2/m)$ to $\mathrm{Var}(\overline{Y} - \overline{X})$

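Actual = $\mathrm{Var}(\overline{Y} - \overline{X}) = \frac{\sigma_Y^2}{n} + \frac{\sigma_X^2}{m}$

Estimated = $E(\widehat{\mathrm{Var}}(\overline{Y} - \overline{X})) \approx \frac{\sigma_Y^2}{m} + \frac{\sigma_X^2}{n}$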
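One way to see where the swapped denominators come from is the following short derivation. It is a sketch that uses only the unbiasedness of the sample variances and replaces $n-1$, $m-1$, $n+m-2$ by $n$, $m$, $n+m$, which is reasonable when the samples are not tiny.

```latex
\begin{aligned}
E(s_p^2) &= \frac{(n-1)\sigma_Y^2 + (m-1)\sigma_X^2}{n+m-2}
          \approx \frac{n\sigma_Y^2 + m\sigma_X^2}{n+m}, \\
E\left[s_p^2\left(\frac{1}{n} + \frac{1}{m}\right)\right]
         &\approx \frac{n\sigma_Y^2 + m\sigma_X^2}{n+m} \cdot \frac{n+m}{nm}
          = \frac{\sigma_Y^2}{m} + \frac{\sigma_X^2}{n}.
\end{aligned}
```

Each population variance ends up divided by the other sample's size, which is what makes the estimate wrong when both the variances and the sample sizes differ.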

$m$    $\sigma_X^2$    $n$    $\sigma_Y^2$    Actual    Estimated
10     1               50     4               0.18      0.42
10     9               50     1               0.92      0.28
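The two rows of the table can be reproduced with a short Python sketch. The function names are hypothetical. The sketch also computes the exact expectation of the pooled estimate, which is close to, but not identical to, the approximate "Estimated" column.

```python
def actual_variance(n, var_y, m, var_x):
    """Var(Ybar - Xbar) for independent samples."""
    return var_y / n + var_x / m

def approx_expected_estimate(n, var_y, m, var_x):
    """Large-sample approximation to E[s_p^2 (1/n + 1/m)]."""
    return var_y / m + var_x / n

def exact_expected_estimate(n, var_y, m, var_x):
    """Exact E[s_p^2 (1/n + 1/m)], using E[s_Y^2] = var_y and E[s_X^2] = var_x."""
    e_sp2 = ((n - 1) * var_y + (m - 1) * var_x) / (n + m - 2)
    return e_sp2 * (1 / n + 1 / m)

# (m, var_x, n, var_y) for the two rows of the table.
rows = [(10, 1, 50, 4), (10, 9, 50, 1)]
for m, var_x, n, var_y in rows:
    print(
        f"m={m}, var_x={var_x}, n={n}, var_y={var_y}: "
        f"actual={actual_variance(n, var_y, m, var_x):.2f}, "
        f"approx={approx_expected_estimate(n, var_y, m, var_x):.2f}, "
        f"exact={exact_expected_estimate(n, var_y, m, var_x):.3f}"
    )
```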

Equal variance assumption: Consequences

The expected value of the estimated variance is:

  • Larger than it should be when the smaller sample comes from the population with the smaller variance.

    • Test statistic will be closer to zero than it should be, and rejection rates will be smaller.
  • Smaller than it should be when the smaller sample comes from the population with the larger variance.

    • Test statistic will have a larger absolute value than it should, and rejection rates will be larger.
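The effect on rejection rates can be checked with a small simulation. This is a rough sketch under stated assumptions: both populations are Normal with mean zero (so the null hypothesis with $\delta_0 = 0$ is true), the sample sizes and variances are those of the table above, and the Normal critical value stands in for the $t_{58}$ quantile, which it slightly understates. The function names and the seed are arbitrary.

```python
import math
import random
from statistics import NormalDist, fmean, variance

def pooled_t(y, x, delta0=0.0):
    """Equal variance t statistic."""
    n, m = len(y), len(x)
    sp2 = ((n - 1) * variance(y) + (m - 1) * variance(x)) / (n + m - 2)
    return ((fmean(y) - fmean(x)) - delta0) / math.sqrt(sp2 * (1 / n + 1 / m))

def rejection_rate(n, var_y, m, var_x, reps=2000, alpha=0.05, seed=551):
    """Share of simulated samples in which a true null hypothesis is rejected."""
    rng = random.Random(seed)
    crit = NormalDist().inv_cdf(1 - alpha / 2)  # Normal stand-in for the t quantile
    sd_y, sd_x = math.sqrt(var_y), math.sqrt(var_x)
    rejections = 0
    for _ in range(reps):
        y = [rng.gauss(0.0, sd_y) for _ in range(n)]
        x = [rng.gauss(0.0, sd_x) for _ in range(m)]
        if abs(pooled_t(y, x)) > crit:
            rejections += 1
    return rejections / reps

# Smaller sample (m = 10) from the smaller-variance population.
rate_conservative = rejection_rate(n=50, var_y=4, m=10, var_x=1)
# Smaller sample (m = 10) from the larger-variance population.
rate_liberal = rejection_rate(n=50, var_y=1, m=10, var_x=9)
print(f"nominal 0.05: {rate_conservative:.3f} vs {rate_liberal:.3f}")
```

The first rate should fall well below the nominal 0.05 and the second well above it, matching the two bullets above.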

If we don’t assume equal variance?

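What’s the best estimate of $\frac{\sigma_Y^2}{n} + \frac{\sigma_X^2}{m}$?

$$\frac{s_Y^2}{n} + \frac{s_X^2}{m}$$

Plugging into Z-stat:

$$t(\delta_0) = \frac{(\overline{Y} - \overline{X}) - \delta_0}{\sqrt{s_Y^2/n + s_X^2/m}}$$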

Reference distribution? Even when populations are Normal, this test statistic doesn’t have exactly a t-distribution.

Welch-Satterthwaite

Slightly better than just using a Normal approximation.

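Compare to a t-distribution with $v$ degrees of freedom, where

$$v = \frac{\left(s_Y^2/n + s_X^2/m\right)^2}{\frac{s_Y^4}{n^2 (n-1)} + \frac{s_X^4}{m^2 (m-1)}}$$

The value of $v$ falls somewhere between $\min(m-1, n-1)$ and $m + n - 2$.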
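A minimal Python sketch of the unequal variance statistic and the Welch-Satterthwaite degrees of freedom follows. The data and the function name `welch_t` are hypothetical.

```python
import math
from statistics import fmean, variance

def welch_t(y, x, delta0=0.0):
    """Return (t statistic, Welch-Satterthwaite df) without assuming equal variances."""
    n, m = len(y), len(x)
    vy = variance(y) / n  # s_Y^2 / n
    vx = variance(x) / m  # s_X^2 / m
    t_stat = ((fmean(y) - fmean(x)) - delta0) / math.sqrt(vy + vx)
    # vy**2 / (n - 1) equals s_Y^4 / (n^2 (n - 1)), and likewise for x.
    df = (vy + vx) ** 2 / (vy ** 2 / (n - 1) + vx ** 2 / (m - 1))
    return t_stat, df

# Hypothetical data.
y = [5.1, 4.8, 6.2, 5.5, 5.9, 4.7]
x = [4.2, 5.0, 3.9, 4.4]
t_stat, df = welch_t(y, x)
n, m = len(y), len(x)
print(f"t = {t_stat:.3f}, v = {df:.2f} (between {min(n - 1, m - 1)} and {n + m - 2})")
```

The degrees of freedom are generally not an integer, so the reference quantile has to come from software rather than a printed t table. If SciPy is available, `scipy.stats.ttest_ind` with `equal_var=False` should give the same statistic.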