---
output: pdf_document
title: "Homework 5"
author: "Charlotte Wickham"
date: "`r format(Sys.Date(), '%d %B, %Y')`"
---

<!-- Start by changing your name in the header above -->

```{r, include = FALSE}
# These are reasonable options that give nice size figures
# and suppress messages in the output
knitr::opts_chunk$set(message = FALSE, 
  fig.width = 6, fig.height = 3)
```

# 1. Chi-square goodness of fit with estimated parameters

Cornhole is a popular lawn game in the US, where players throw a bean bag at a wooden platform with a hole in it.  A bag in the hole scores 3 points, while one on the platform scores 1 point. 

An avid cornhole analyst has observed n = 100 experienced players and recorded, for each player, the number of misses before they got a bag in the hole.  These 100 observations can be read into R:

```{r, results="hide"}
Y <- c(0L, 1L, 0L, 9L, 0L, 0L, 4L, 0L, 0L, 1L, 0L, 2L, 2L, 1L, 10L, 
3L, 0L, 13L, 0L, 0L, 0L, 43L, 3L, 0L, 6L, 1L, 11L, 0L, 0L, 0L, 
3L, 0L, 3L, 1L, 0L, 0L, 2L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 0L, 0L, 
0L, 0L, 0L, 16L, 9L, 2L, 15L, 1L, 4L, 3L, 0L, 18L, 1L, 3L, 0L, 
0L, 4L, 3L, 9L, 0L, 1L, 19L, 1L, 2L, 0L, 2L, 6L, 0L, 0L, 0L, 
2L, 3L, 0L, 8L, 41L, 2L, 1L, 2L, 22L, 0L, 6L, 17L, 17L, 0L, 0L, 
6L, 7L, 0L, 0L, 9L, 1L, 0L, 15L, 1L)
Y
```

(FYI, the `L` suffix just forces these numbers to be stored as integers; it isn't essential for this problem.)

The analyst is curious if these values are consistent with being drawn from a Geometric distribution.  


a.  The geometric distribution has probability mass function:
    $$
    P(Y = y) = (1-p)^y p, \quad y = 0, 1, 2, \ldots,
    $$
    where $p$ is an unknown parameter of the distribution.  If $Y_1, \ldots, Y_n$ is an i.i.d. sample from a Geometric($p$) distribution, then, because this distribution has mean $(1-p)/p$, a good estimate of $p$ is 
    $$
    \hat{p} = \frac{1}{1 + \overline{Y}}
    $$
    Use the data to estimate $p$.
    
    ```{r}
    # Here's where you might include code for the first answer
    ```
    
    And here is where you might describe/explain your answer.  Notice that this and the code chunk are indented by 4 spaces - this ensures they stay lined up under the question.

b. Tabulate the observed numbers of misses into the categories: 0, 1, 2, 3, 4, 5, 6+  

    ```{r}
    # And then you'd write more code here...
    ```


c. Find probabilities for each category above using the Geometric distribution with the estimated parameter (you can use `dgeom()`).  

d. Find the expected counts for each category using the probabilities found above.

e. Check the condition for the Chi-square approximation to be appropriate.

f. Calculate the value of the Chi-square statistic.

g. Check your calculation of the test statistic by running `chisq.test(x = O, p = probs)`, where `O` holds the observed counts from part (b) and `probs` the probabilities from part (c).

h. What distribution should this statistic be compared to? Find the p-value for the test of whether the numbers of misses are consistent with a Geometric distribution.  What would you conclude?

i.  Why does the `chisq.test()` run in (g) return the wrong p-value?
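
As a sketch of the mechanics in parts (b) to (g), the chunk below runs the same steps on simulated data rather than the cornhole data. The sample size, the value 0.3, the bins 0, 1, 2, 3+, and all object names are illustrative assumptions, not part of the assignment. The estimate uses $1/(1 + \overline{y})$ because a distribution with $P(Y = y) = (1-p)^y p$ has mean $(1-p)/p$.

```{r}
# Illustrative sketch on simulated data (NOT the cornhole data).
set.seed(1)
sim <- rgeom(200, prob = 0.3)   # simulated counts of failures before a success

# Bin into 0, 1, 2 and a catch-all "3+" category (hypothetical bins)
binned <- cut(sim, breaks = c(-Inf, 0, 1, 2, Inf),
              labels = c("0", "1", "2", "3+"))
observed <- table(binned)

# Estimate p from the sample mean
p_hat <- 1 / (1 + mean(sim))

# Category probabilities: dgeom() for single values,
# the upper tail P(Y > 2) for the "3+" category
probs <- c(dgeom(0:2, prob = p_hat),
           pgeom(2, prob = p_hat, lower.tail = FALSE))

# Expected counts and the Chi-square statistic computed by hand
expected <- length(sim) * probs
X2 <- sum((observed - expected)^2 / expected)
X2

# chisq.test() should report the same statistic
chisq.test(x = observed, p = probs)$statistic
```

The last category uses an upper-tail probability so that the category probabilities sum to one, which `chisq.test()` requires of its `p` argument.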

# 2. Data Analysis

This question uses the same `brfss` data as HW #3.

I found the following as the start of an example in a textbook: "The heights of male adults between the ages 20 and 62 in the US are nearly normal with mean 70.0 inches and standard deviation 3.0 inches." 

a) Create a histogram of the heights (in inches) of the male respondents.  Describe the distribution. Is there anything unusual about it?

b) Conduct a test of variance, with the null hypothesis $H_0: \sigma^2 = 3^2$, where $\sigma^2$ is the population variance of heights in inches for male respondents to the BRFSS survey.  
    
    Write your conclusion in the form of a statistical summary including a point estimate and confidence interval.

c)  Test the hypothesis that the heights of the male respondents in the BRFSS survey come from a Normal(70, $3^2$) distribution, that is, one with mean 70 inches and standard deviation 3 inches.
    
    Write your conclusion in the form of a statistical summary (here there is no need for a point estimate or confidence interval).
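
One common form of a test of variance is the Chi-square test, which assumes the data are Normal. The chunk below is a minimal sketch of that version on simulated heights, not on the `brfss` data. The sample size, seed, and object names are assumptions, and the course may intend a different (for example, large-sample) version of the test.

```{r}
# Illustrative sketch on simulated data (NOT the brfss data).
set.seed(2)
x <- rnorm(50, mean = 70, sd = 3)   # simulated heights in inches
n <- length(x)
sigma0_sq <- 3^2                    # null value of the variance

# Point estimate and test statistic: (n - 1) s^2 / sigma0^2
s2 <- var(x)
stat <- (n - 1) * s2 / sigma0_sq

# Two-sided p-value from the Chi-square(n - 1) distribution
p_value <- 2 * min(pchisq(stat, df = n - 1),
                   pchisq(stat, df = n - 1, lower.tail = FALSE))

# 95% confidence interval for the population variance
ci <- (n - 1) * s2 / qchisq(c(0.975, 0.025), df = n - 1)

c(estimate = s2, lower = ci[1], upper = ci[2], p.value = p_value)
```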


# 3. One-sided K-S tests

In lecture we saw an example where two one-sided K-S tests gave conflicting results.

The setup of that example was:

**True population:**  Y ~ N(0, 1)  
**Null Hypothesis:** Y ~ N(0, 100)

a) Draw a sample of size 20 from the true population.

b) Conduct tests of the null hypothesis with one-sided lesser, one-sided greater, and two-sided alternatives.

c) Plot the ECDF of the sample, along with the CDF of the hypothesized distribution.  Indicate on your plot where the test statistic for each test comes from.

d) What properties of the true and hypothesized distributions lead to the contradiction?

e) In your own words, describe why this suggests one-sided K-S tests are hard to interpret.
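
For reference, the mechanics of running all three K-S tests and plotting the ECDF might look like the sketch below. It deliberately uses a different setup from the one above: the Uniform(-1, 1) sample, the N(0, 1) null, the sample size, and the object names are all illustrative assumptions.

```{r}
# Illustrative sketch with a different setup from the question.
set.seed(3)
samp <- runif(30, min = -1, max = 1)   # sample from a hypothetical true population

# K-S tests against a hypothesized N(0, 1) with each alternative
ks_two     <- ks.test(samp, "pnorm", mean = 0, sd = 1, alternative = "two.sided")
ks_greater <- ks.test(samp, "pnorm", mean = 0, sd = 1, alternative = "greater")
ks_less    <- ks.test(samp, "pnorm", mean = 0, sd = 1, alternative = "less")

# The two-sided statistic is the larger of the two one-sided statistics
c(two.sided = unname(ks_two$statistic),
  greater   = unname(ks_greater$statistic),
  less      = unname(ks_less$statistic))

# ECDF of the sample with the hypothesized CDF overlaid (dashed)
plot(ecdf(samp), main = "ECDF vs. hypothesized CDF")
curve(pnorm(x, mean = 0, sd = 1), add = TRUE, lty = 2)
```

In `ks.test()`, `alternative = "greater"` measures how far the ECDF rises above the hypothesized CDF, and `alternative = "less"` measures how far it falls below.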