Finish last time’s slides
What about discrete distributions?
The K-S test is only appropriate for continuous distributions (the hypothesized distribution must be continuous).
But what if our hypothesis is for a discrete distribution, e.g.:
- Discrete Uniform
- Bernoulli
- Poisson
Setting
Population: some discrete population distribution with p.m.f. $p(x)$
Sample: $X_1, \dots, X_n$ i.i.d. from the population
Parameter: the whole p.m.f. $p$
Hypotheses: $H_0: p = p_0$ versus $H_1: p \neq p_0$
Sample estimate of the p.m.f.
The discrete sample-based estimate of the probability mass function:
$$\hat{p}(x) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}(X_i = x),$$
the proportion of the sample equal to $x$.
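As a quick sketch in R (the vector `x` below is made-up illustration data), the p.m.f. estimate is just the table of sample proportions:

```r
# Hypothetical sample from a discrete distribution
x <- c(1, 2, 2, 3, 3, 3, 4, 4, 4, 4)
n <- length(x)

# Sample estimate of the p.m.f.: proportion of observations at each value
p_hat <- table(x) / n
p_hat
```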
The Chi-square Goodness of Fit test
The Chi-square goodness of fit test compares the estimated p.m.f. $\hat{p}$ to the hypothesized one, $p_0$.
Pearson’s Chi-square statistic:
$$X^2 = n \sum_{x} \frac{(\hat{p}(x) - p_0(x))^2}{p_0(x)}$$
Under the null hypothesis, $X^2$ converges in distribution (as $n \to \infty$) to $\chi^2_{k-1}$, a Chi-square distribution with $k - 1$ degrees of freedom, where $k$ is the number of possible values for $X$.
An alternative presentation
Let
- $j = 1, \dots, k$ index the possible values/categories for $X$
- $O_j$ be the observed number of values in category $j$
- $E_j = n \, p_0(j)$ be the expected number of values in category $j$, based on the hypothesized distribution.
Pearson’s Chi-square statistic:
$$X^2 = \sum_{j=1}^{k} \frac{(O_j - E_j)^2}{E_j}$$
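In this form the statistic is easy to compute directly from observed and expected counts. A sketch in R, using made-up counts for $k = 4$ categories and a discrete uniform null:

```r
# Illustrative observed counts for k = 4 categories
O <- c(18, 22, 30, 30)
n <- sum(O)

# Hypothesized p.m.f.: discrete uniform over the 4 categories
p0 <- rep(1/4, 4)
E <- n * p0          # expected counts under H0

# Pearson's Chi-square statistic
X2 <- sum((O - E)^2 / E)
X2
```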
Example: Dice rolling
I rolled a die 60 times and recorded how many times I got each side: 1, 2, 3, 4, 5, 6
Question: Is the die fair? That is, is $p(j) = 1/6$ for $j = 1, \dots, 6$?
| $j$ | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| $O_j$ | 20 | 11 | 6 | 7 | 6 | 10 |
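The statistic can be computed by hand from these counts; under $H_0$ every expected count is $60 \times \tfrac{1}{6} = 10$:

```r
O <- c(20, 11, 6, 7, 6, 10)    # observed counts for sides 1-6
E <- rep(60 * 1/6, 6)          # expected counts under a fair die
chi_sq <- sum((O - E)^2 / E)
chi_sq
## [1] 14.2
```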
Rejection region
Reject $H_0$ for $X^2 > \chi^2_{k-1, 1-\alpha}$, the $1 - \alpha$ quantile of the $\chi^2_{k-1}$ distribution.
The rejection region is always to the right, and p-values are always the area to the right. Why? We are usually only interested in evidence of poor fit (not evidence of unusually good fit).
Example: Dice rolling cont.
Test statistic: $X^2 = 14.2$. Compare to the 0.95 quantile of the $\chi^2_5$ distribution:
qchisq(0.95, df = 5)
## [1] 11.0705
1 - pchisq(chi_sq, df = 5)
## [1] 0.01438768
In R:
rolls
## [1] 6 5 2 2 6 2 1 1 5 1 2 2 1 1 3 4 5 1 2 4 1 6 6
## [24] 2 4 1 6 6 1 6 1 1 6 1 3 1 3 4 1 1 5 3 4 2 5 1
## [47] 6 3 1 4 3 2 5 1 6 2 1 4 1 2
chisq.test(table(rolls), p = rep(1/6, 6))
##
## Chi-squared test for given probabilities
##
## data: table(rolls)
## X-squared = 14.2, df = 5, p-value = 0.01439
Estimation of parameters
Suppose the null hypothesis doesn’t completely specify the distribution $p_0$, but instead specifies a family of distributions $p_0(x; \theta_1, \dots, \theta_m)$, where the $\theta_i$ are unknown parameters.
You can still use the Chi-square test with some modification
- Estimate the parameters $\hat{\theta}_1, \dots, \hat{\theta}_m$ (e.g., by maximum likelihood).
- Find the $E_j$ based on the estimated parameters and $n$: $E_j = n \, p_0(j; \hat{\theta}_1, \dots, \hat{\theta}_m)$.
- Compute Pearson’s $X^2$ statistic as usual.
- Compare the statistic to a $\chi^2$ with $k - 1 - m$ degrees of freedom, where $m$ is the number of parameters that were estimated.
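The steps above can be sketched in R for a hypothetical Poisson null; the data vector `x` here is made up for illustration (far too small for the asymptotics to be trustworthy, but it shows the mechanics):

```r
x <- c(0, 0, 1, 1, 1, 2, 2, 4)   # hypothetical count data
n <- length(x)

# 1. Estimate the parameter (the MLE of lambda is the sample mean)
lambda_hat <- mean(x)

# 2. Expected counts from the fitted distribution,
#    using categories 0, 1, 2, and "3 or more"
p0 <- dpois(0:2, lambda = lambda_hat)
E  <- c(p0, 1 - sum(p0)) * n

# 3. Pearson's statistic as usual
O  <- c(sum(x == 0), sum(x == 1), sum(x == 2), sum(x >= 3))
X2 <- sum((O - E)^2 / E)

# 4. Compare to chi-square with k - 1 - m = 4 - 1 - 1 = 2 df
p_value <- 1 - pchisq(X2, df = 2)
```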
Example: Poisson
I counted the number of passengers in vehicles passing through an intersection.
Question: Is the number of passengers per vehicle distributed according to a Poisson distribution?
| Passengers | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|---|
| Count | 6 | 11 | 11 | 8 | 3 | 0 | 1 |
mean(passengers)
## [1] 1.875
Example: Poisson
| Passengers | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|---|
| Count | 6 | 11 | 11 | 8 | 3 | 0 | 1 |
n <- length(passengers)  # 40 vehicles
p_0 <- dpois(0:6, lambda = mean(passengers))
(E <- p_0 * n)
## [1] 6.1341987 11.5016225 10.7827711 6.7392319
## [5] 3.1590150 1.1846306 0.3701971
# For 7+ category
(E_7 <- (1 - sum(p_0)) * n)
## [1] 0.1283331
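Putting the pieces together, the statistic `X` used on the next slide can be assembled from these expected counts and the observed table (a self-contained sketch: the observed counts are taken from the table above rather than from the `passengers` vector):

```r
# Observed counts for 0, 1, ..., 6 passengers, plus the 7+ category
O <- c(6, 11, 11, 8, 3, 0, 1, 0)
n <- sum(O)                       # n = 40 vehicles

# Expected counts under the fitted Poisson (lambda = mean = 1.875)
p_0 <- dpois(0:6, lambda = 1.875)
E   <- c(p_0, 1 - sum(p_0)) * n

# Pearson's statistic, compared to chi-square with 8 - 1 - 1 = 6 df
X <- sum((O - E)^2 / E)
1 - pchisq(X, df = 6)
## [1] 0.8504434
```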
Example: Poisson
Test statistic (using categories $0, 1, \dots, 6$ and $7+$):
$$X^2 = \sum_{j} \frac{(O_j - E_j)^2}{E_j} \approx 2.66$$
Compare to a $\chi^2$ distribution with $8 - 1 - 1 = 6$ degrees of freedom:
1 - pchisq(X, df)
## [1] 0.8504434
Other points
For binary data ($k = 2$), the $X^2$ statistic is equal to the square of the Z-statistic for testing a hypothesis regarding a binary proportion.
Therefore, for the two-sided hypothesis test of $H_0: p = p_0$ versus $H_1: p \neq p_0$, the $\chi^2$ test and the z-test give the exact same result.
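A quick numeric check of this equivalence, using made-up binary data (12 successes in $n = 30$ trials, testing $p_0 = 0.5$):

```r
# Made-up binary data: 12 successes out of n = 30, H0: p = 0.5
x <- 12; n <- 30; p0 <- 0.5

# Z-statistic for a binary proportion
z <- (x/n - p0) / sqrt(p0 * (1 - p0) / n)

# Chi-square statistic on the 2-category table
O  <- c(x, n - x)
E  <- n * c(p0, 1 - p0)
X2 <- sum((O - E)^2 / E)

c(z^2, X2)   # identical: both equal 1.2
```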
The $X^2$ statistic has an asymptotic $\chi^2$ distribution.
Therefore this test is approximate: the test is only asymptotically exact.
The approximation is generally considered appropriate when $E_j \geq 5$ for all $j$.
Next time
Starting two sample inference…