Chi-square Goodness of Fit ST551 Lecture 17

Finish last time’s slides

What about discrete distributions?

The K-S test is only appropriate for continuous distributions (that is, when the hypothesized distribution is continuous).

But what if our hypothesis is for a discrete distribution, e.g.:

  • Discrete Uniform
  • Bernoulli
  • Poisson

Setting

Population: Y has some discrete population distribution with p.m.f. p(y) = P(Y = y)

Sample: Y1, …, Yn, an i.i.d. sample of size n from the population

Parameter: the whole p.m.f. p(y)

Hypotheses: H0: P(Y = y) = p0(y) for all y, versus HA: P(Y = y) ≠ p0(y) for some y

Sample estimate of the p.m.f

The natural sample-based estimate of the probability mass function:

p̂(y) = (1/n) Σ_{i=1}^{n} 1{Y_i = y}
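As a quick sketch of this estimator (Python used purely for illustration; the function and toy sample below are not from the lecture):

```python
# Sketch: the empirical p.m.f., p-hat(y) = (1/n) * #{i : Y_i = y}.
from collections import Counter

def pmf_hat(sample):
    """Return the empirical p.m.f. as a dict {y: proportion of sample equal to y}."""
    n = len(sample)
    return {y: count / n for y, count in sorted(Counter(sample).items())}

toy = [1, 1, 2, 2, 3, 6, 6, 6]   # a made-up sample, n = 8
print(pmf_hat(toy))              # {1: 0.25, 2: 0.25, 3: 0.125, 6: 0.375}
```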

The Chi-square Goodness of Fit test

The Chi-square goodness-of-fit test compares the estimated p.m.f. to the hypothesized one.

Pearson’s Chi-square statistic: X(p0) = Σ_y n (p̂(y) - p0(y))² / p0(y)

Under the null hypothesis, X(p0) converges in distribution (as n goes to infinity) to a χ² distribution with k - 1 degrees of freedom

k = number of possible values for Y.

An alternative presentation

  • Let j = 1, …, k index the possible values/categories for Y
  • Oj = the observed number of values in category j
  • Ej = n·p0(j) = the expected number of values in category j, based on the hypothesized distribution.

Pearson’s Chi-square statistic: X(p0) = Σ_{j=1}^{k} (Oj - Ej)² / Ej
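A minimal sketch of this computation (Python for illustration; the coin-flip counts below are made up, not from the slides):

```python
# Pearson's chi-square statistic from observed counts O_j and hypothesized p.m.f. p0.
def pearson_statistic(observed, p0):
    """X(p0) = sum_j (O_j - E_j)^2 / E_j, with E_j = n * p0(j)."""
    n = sum(observed)
    expected = [n * p for p in p0]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Toy example: 100 coin flips, 55 heads / 45 tails, testing fairness.
X = pearson_statistic([55, 45], [0.5, 0.5])
print(X)   # (55-50)^2/50 + (45-50)^2/50 = 1.0
```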

Example: Dice rolling

I rolled a die 60 times and recorded how many times I got each side: 1, 2, 3, 4, 5, 6

Question: Is the die fair? That is, is p0(j)=1/6 for j=1,2,3,4,5,6?

 j    1   2   3   4   5   6
Oj   20  11   6   7   6  10

Rejection region

Reject H0 for X(p0) > χ²_{k-1}(1 - α), the 1 - α quantile of the χ² distribution with k - 1 degrees of freedom

The rejection region is always to the right, and p-values are always the area to the right. Why? We are usually only interested in evidence of poor fit (not evidence of unusually good fit).

Example: Dice rolling cont.

Compare to a χ² distribution with 5 degrees of freedom:

O <- c(20, 11, 6, 7, 6, 10)
(chi_sq <- sum((O - 10)^2 / 10))  # E_j = 60 * (1/6) = 10 for each side
## [1] 14.2
qchisq(0.95, df = 5)
## [1] 11.0705
1 - pchisq(chi_sq, df = 5)
## [1] 0.01438768

In R:

rolls
##  [1] 6 5 2 2 6 2 1 1 5 1 2 2 1 1 3 4 5 1 2 4 1 6 6
## [24] 2 4 1 6 6 1 6 1 1 6 1 3 1 3 4 1 1 5 3 4 2 5 1
## [47] 6 3 1 4 3 2 5 1 6 2 1 4 1 2
chisq.test(table(rolls), p = rep(1/6, 6))
## 
##  Chi-squared test for given probabilities
## 
## data:  table(rolls)
## X-squared = 14.2, df = 5, p-value = 0.01439
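As a cross-check on the R output, the same numbers can be reproduced by hand (Python here, purely for illustration):

```python
# Recomputing the dice example: 60 rolls, fair-die null p0(j) = 1/6.
O = [20, 11, 6, 7, 6, 10]
n = sum(O)                         # 60 rolls
E = n / 6                          # E_j = 10 under the null
X = sum((o - E) ** 2 / E for o in O)
print(round(X, 1))                 # 14.2, matching chisq.test's X-squared
# Reject at alpha = 0.05, since X exceeds the critical value qchisq(0.95, 5) = 11.0705
print(X > 11.0705)                 # True
```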

Estimation of parameters

Suppose the null hypothesis doesn’t completely specify the distribution p0, but instead specifies a family of distributions p0(·; θ1, θ2, …, θd), where θ1, …, θd are unknown parameters.

You can still use the Chi-square test, with some modification:

  1. Estimate the parameters θ1, θ2, …, θd.
  2. Find Ej based on the estimated parameters and p0.
  3. Compute Pearson’s χ² statistic as usual.
  4. Compare the statistic to a χ² distribution with k - d - 1 degrees of freedom, where d is the number of parameters that were estimated.

Example: Poisson

I counted the number of passengers in n=40 vehicles passing through an intersection.

Question: Is the number of passengers per vehicle distributed according to a Poisson distribution?

 j    0   1   2   3   4   5   6
Oj    6  11  11   8   3   0   1
mean(passengers)
## [1] 1.875

Example: Poisson

 j    0   1   2   3   4   5   6
Oj    6  11  11   8   3   0   1
p_0 <- dpois(0:6, lambda = mean(passengers))
(E <- p_0 * n)  # n = 40 vehicles
## [1]  6.1341987 11.5016225 10.7827711  6.7392319
## [5]  3.1590150  1.1846306  0.3701971
# For 7+ category
(E_7 <- (1 - sum(p_0)) * n)
## [1] 0.1283331

Example: Poisson

Test statistic:

(6 - 6.13)²/6.13 + (11 - 11.5)²/11.5 + ⋯ + (0 - 0.13)²/0.13 = 2.66

Compare to χ² with k - d - 1 = 8 - 1 - 1 = 6 degrees of freedom (k = 8 categories, counting 7+; d = 1 estimated parameter):

1 - pchisq(X, df = 6)  # X is the statistic computed above
## [1] 0.8504434
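The whole Poisson example can be reproduced step by step (a Python sketch for illustration; the slides use R):

```python
# Poisson goodness-of-fit with an estimated parameter, using the slides' counts.
from math import exp, factorial

O = [6, 11, 11, 8, 3, 0, 1]                      # observed counts for 0..6 passengers
n = sum(O)                                       # n = 40 vehicles
lam = sum(j * o for j, o in enumerate(O)) / n    # lambda-hat = sample mean = 1.875

pois = lambda j: exp(-lam) * lam ** j / factorial(j)
E = [n * pois(j) for j in range(7)]
E.append(n * (1 - sum(pois(j) for j in range(7))))   # lumped "7+" category
O = O + [0]                                          # no vehicles had 7+ passengers

X = sum((o - e) ** 2 / e for o, e in zip(O, E))
df = len(O) - 1 - 1                              # k - d - 1 = 8 - 1 - 1 = 6
print(round(X, 2), df)                           # 2.66 6
```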

Other points

  • For binary data, the X(p0) statistic is equal to the square of the Z-statistic for testing a hypothesis regarding a binary proportion.

    Therefore, for the two-sided hypothesis test H0: p = p0 vs HA: p ≠ p0, the χ² test and the Z-test give exactly the same result.

  • The χ2 statistic has an asymptotic χ2 distribution.

    Therefore this test is approximate: the test is asymptotically exact.

    The approximation is generally considered appropriate when Ej>5 for all j.
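The first bullet above can be checked numerically; a sketch with made-up numbers (Python for illustration):

```python
# Check that the chi-square statistic equals Z^2 for a binary proportion test.
from math import sqrt

n, successes, p0 = 100, 60, 0.5             # illustrative data: 60 successes in 100 trials

# Z-statistic for H0: p = p0 (null standard error)
p_hat = successes / n
Z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)

# Pearson chi-square with two categories (success, failure)
O = [successes, n - successes]
E = [n * p0, n * (1 - p0)]
X = sum((o - e) ** 2 / e for o, e in zip(O, E))

print(round(Z ** 2, 9), round(X, 9))        # both equal 4.0
```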

Next time

Starting two sample inference…