Chi-square Goodness of Fit ST551 Lecture 17

Finish last time’s slides

What about discrete distributions?

The K-S test is only appropriate for continuous distributions (that is, when the hypothesized distribution is continuous).

But what if our hypothesis is for a discrete distribution, e.g.:

  • Discrete Uniform
  • Bernoulli
  • Poisson

Setting

Population: Y has some discrete population distribution with p.m.f. p(y) = P(Y = y)

Sample: Y1, …, Yn, an i.i.d. sample of size n from the population

Parameter: the whole p.m.f. p(y)

Hypotheses: H0: P(Y = y) = p0(y) for all y, versus HA: P(Y = y) ≠ p0(y) for some y

Sample estimate of the p.m.f

The natural sample-based estimate of the probability mass function:

p̂(y) = (1/n) Σ_{i=1}^{n} 1{Y_i = y}
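As a quick sketch of this estimator (Python used purely for illustration; the function and toy sample below are not from the lecture):

```python
# Sketch: the empirical p.m.f., p-hat(y) = (1/n) * #{i : Y_i = y}.
from collections import Counter

def pmf_hat(sample):
    """Return the empirical p.m.f. as a dict {y: proportion of sample equal to y}."""
    n = len(sample)
    return {y: count / n for y, count in sorted(Counter(sample).items())}

toy = [1, 1, 2, 2, 3, 6, 6, 6]   # a made-up sample, n = 8
print(pmf_hat(toy))              # {1: 0.25, 2: 0.25, 3: 0.125, 6: 0.375}
```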

The Chi-square Goodness of Fit test

The Chi-square goodness-of-fit test compares the estimated p.m.f. to the hypothesized one.

Pearson’s Chi-square statistic: X(p0) = Σ_y n (p̂(y) - p0(y))² / p0(y)

Under the null hypothesis, X(p0) converges in distribution (as n goes to infinity) to a χ² distribution with k - 1 degrees of freedom

k = number of possible values for Y.

An alternative presentation

  • Let j = 1, …, k index the possible values/categories for Y
  • Oj = the observed number of values in category j
  • Ej = n·p0(j) = the expected number of values in category j, based on the hypothesized distribution.

Pearson’s Chi-square statistic: X(p0) = Σ_{j=1}^{k} (Oj - Ej)² / Ej
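A minimal sketch of this computation (Python for illustration; the coin-flip counts below are made up, not from the slides):

```python
# Pearson's chi-square statistic from observed counts O_j and hypothesized p.m.f. p0.
def pearson_statistic(observed, p0):
    """X(p0) = sum_j (O_j - E_j)^2 / E_j, with E_j = n * p0(j)."""
    n = sum(observed)
    expected = [n * p for p in p0]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Toy example: 100 coin flips, 55 heads / 45 tails, testing fairness.
X = pearson_statistic([55, 45], [0.5, 0.5])
print(X)   # (55-50)^2/50 + (45-50)^2/50 = 1.0
```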

Example: Dice rolling

I rolled a die 60 times and recorded how many times I got each side: 1, 2, 3, 4, 5, 6

Question: Is the die fair? That is, is p0(j)=1/6 for j=1,2,3,4,5,6?

 j    1   2   3   4   5   6
Oj   20  11   6   7   6  10

Rejection region

Reject H0 for X(p0) > χ²_{k-1}(1 - α), the 1 - α quantile of the χ² distribution with k - 1 degrees of freedom

The rejection region is always to the right, and p-values are always the area to the right. Why? We are usually only interested in evidence of poor fit (not evidence of unusually good fit).

Example: Dice rolling cont.

Compare to a χ² distribution with 5 degrees of freedom:

O <- c(20, 11, 6, 7, 6, 10)
(chi_sq <- sum((O - 10)^2 / 10))  # E_j = 60 * (1/6) = 10 for each side
## [1] 14.2
qchisq(0.95, df = 5)
## [1] 11.0705
1 - pchisq(chi_sq, df = 5)
## [1] 0.01438768

In R:

rolls
##  [1] 6 5 2 2 6 2 1 1 5 1 2 2 1 1 3 4 5 1 2 4 1 6 6
## [24] 2 4 1 6 6 1 6 1 1 6 1 3 1 3 4 1 1 5 3 4 2 5 1
## [47] 6 3 1 4 3 2 5 1 6 2 1 4 1 2
chisq.test(table(rolls), p = rep(1/6, 6))
## 
##  Chi-squared test for given probabilities
## 
## data:  table(rolls)
## X-squared = 14.2, df = 5, p-value = 0.01439
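As a cross-check on the R output, the same numbers can be reproduced by hand (Python here, purely for illustration):

```python
# Recomputing the dice example: 60 rolls, fair-die null p0(j) = 1/6.
O = [20, 11, 6, 7, 6, 10]
n = sum(O)                         # 60 rolls
E = n / 6                          # E_j = 10 under the null
X = sum((o - E) ** 2 / E for o in O)
print(round(X, 1))                 # 14.2, matching chisq.test's X-squared
# Reject at alpha = 0.05, since X exceeds the critical value qchisq(0.95, 5) = 11.0705
print(X > 11.0705)                 # True
```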

Estimation of parameters

Suppose the null hypothesis doesn’t completely specify the distribution p0, but instead specifies a family of distributions p0(·; θ1, θ2, …, θd), where θ1, …, θd are unknown parameters.

You can still use the Chi-square test, with some modification:

  1. Estimate the parameters θ1, θ2, …, θd.
  2. Find Ej based on the estimated parameters and p0.
  3. Compute Pearson’s χ² statistic as usual.
  4. Compare the statistic to a χ² distribution with k - d - 1 degrees of freedom, where d is the number of parameters that were estimated.

Example: Poisson

I counted the number of passengers in n=40 vehicles passing through an intersection.

Question: Is the number of passengers per vehicle distributed according to a Poisson distribution?

 j    0   1   2   3   4   5   6
Oj    6  11  11   8   3   0   1
mean(passengers)
## [1] 1.875

Example: Poisson

 j    0   1   2   3   4   5   6
Oj    6  11  11   8   3   0   1
p_0 <- dpois(0:6, lambda = mean(passengers))
(E <- p_0 * n)  # n = 40 vehicles
## [1]  6.1341987 11.5016225 10.7827711  6.7392319
## [5]  3.1590150  1.1846306  0.3701971
# For 7+ category
(E_7 <- (1 - sum(p_0)) * n)
## [1] 0.1283331

Example: Poisson

Test statistic:

(6 - 6.13)²/6.13 + (11 - 11.5)²/11.5 + ⋯ + (0 - 0.13)²/0.13 = 2.66

Compare to χ² with k - d - 1 = 8 - 1 - 1 = 6 degrees of freedom (k = 8 categories, counting 7+; d = 1 estimated parameter):

1 - pchisq(X, df = 6)  # X is the statistic computed above
## [1] 0.8504434
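The whole Poisson example can be reproduced step by step (a Python sketch for illustration; the slides use R):

```python
# Poisson goodness-of-fit with an estimated parameter, using the slides' counts.
from math import exp, factorial

O = [6, 11, 11, 8, 3, 0, 1]                      # observed counts for 0..6 passengers
n = sum(O)                                       # n = 40 vehicles
lam = sum(j * o for j, o in enumerate(O)) / n    # lambda-hat = sample mean = 1.875

pois = lambda j: exp(-lam) * lam ** j / factorial(j)
E = [n * pois(j) for j in range(7)]
E.append(n * (1 - sum(pois(j) for j in range(7))))   # lumped "7+" category
O = O + [0]                                          # no vehicles had 7+ passengers

X = sum((o - e) ** 2 / e for o, e in zip(O, E))
df = len(O) - 1 - 1                              # k - d - 1 = 8 - 1 - 1 = 6
print(round(X, 2), df)                           # 2.66 6
```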

Other points

  • For binary data, the X(p0) statistic is equal to the square of the Z-statistic for testing a hypothesis regarding a binary proportion.

    Therefore, for the two-sided hypothesis test H0: p = p0 vs HA: p ≠ p0, the χ² test and the Z-test give exactly the same result.

  • The χ2 statistic has an asymptotic χ2 distribution.

    Therefore this test is approximate: the test is asymptotically exact.

    The approximation is generally considered appropriate when Ej>5 for all j.
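The first bullet above can be checked numerically; a sketch with made-up numbers (Python for illustration):

```python
# Check that the chi-square statistic equals Z^2 for a binary proportion test.
from math import sqrt

n, successes, p0 = 100, 60, 0.5             # illustrative data: 60 successes in 100 trials

# Z-statistic for H0: p = p0 (null standard error)
p_hat = successes / n
Z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)

# Pearson chi-square with two categories (success, failure)
O = [successes, n - successes]
E = [n * p0, n * (1 - p0)]
X = sum((o - e) ** 2 / e for o, e in zip(O, E))

print(round(Z ** 2, 9), round(X, 9))        # both equal 4.0
```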

Next time

Starting two sample inference…