---
title: Lab 2
author: Charlotte Wickham
date: '2017-10-03'
slug: lab-2
draft: false
output:
  blogdown::html_page:
    toc: true
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(echo = TRUE, results = "hide", message = FALSE, fig.keep = "none")
```

## Goals

* Review the probability distribution functions in R
* Explore the Central Limit Theorem

```{r}
library(tidyverse)
```

## Distributions in R

### The Normal distribution

You've now seen (in lecture and homework) four functions that relate to the Normal distribution: `rnorm()`, `pnorm()`, `qnorm()`, `dnorm()`. They respectively generate a sample, find the area under the curve, find quantiles, and find the density. You can see the help for all four on one page:

```{r, eval = FALSE}
?rnorm
```

You'll notice all four take the arguments `mean` and `sd`, which allow you to specify the desired mean ($\mu$) and standard deviation ($\sigma$).

**Review your understanding by finding:**

* The area to the left of 1 under a Normal$(\mu = 3, \sigma^2 = 4)$ curve.
* The value, $z$, such that the area under a Normal$(\mu = 100, \sigma^2 = 100)$ curve to the left of $z$ is 0.05.
* The value, $z$, such that the area under a Normal$(\mu = 100, \sigma^2 = 100)$ curve to the **right** of $z$ is 0.05.
* A sample of size 10 from a Normal$(\mu = 1, \sigma = 0.5)$.

### Other distributions

R has a whole host of other distribution functions. You can see a list of those built in at:

```{r, eval = FALSE}
?Distributions
```

They all tend to take a similar form: a function that starts with `r` generates samples, `p` gives probabilities, `q` gives quantiles, and `d` gives densities.

**What function generates a sample from a Binomial distribution?**

Take a look at:

```{r, eval = FALSE}
?dexp
```

* **What distribution is this?**
* **What parameter can you specify? What is the default value of this parameter?**

### Adding a density to a histogram

When we are investigating properties of distributions by simulation, it's often useful to be able to add a density from a known distribution to a histogram of values. To illustrate, let's generate a sample of 1000 from a Normal$(\mu = 1, \sigma^2 = 2)$:

```{r}
norm_sample <- rnorm(1000, mean = 1, sd = sqrt(2))
```

We've seen you can create a histogram:

```{r}
ggplot() +
  geom_histogram(aes(x = norm_sample))
```

But, say we want to overlay the density these samples were drawn from. In practice, you'll usually be overlaying a density that only approximates the one the samples are drawn from; here it is exactly the density the samples are drawn from.

One complication is that the area under a density is 1, but the area under a histogram depends on the sample size and the binwidth. To place both a histogram and a density function on the same scale, it's useful to plot the histogram using density on the y-axis, rather than count. You can do that by specifying `y = ..density..` in the mapping:

```{r}
ggplot() +
  geom_histogram(aes(x = norm_sample, y = ..density..))
```

`..density..` is a variable that gets calculated by `geom_histogram()`. The double dots (`..`) are just there to avoid a calculated variable conflicting with the name of an existing variable. You can see the computed variables available on the help page for a geom, e.g. `?geom_histogram`.

If you look at the histogram, you'll notice the scale of the y-axis has changed. It can now be used like a probability density function, in that areas under the curve can be used to estimate probabilities.
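
For instance (just a quick sketch, not part of the lab exercises; the cutoff of 2 below is an arbitrary choice), you could compare the proportion of the sample below a cutoff, which is roughly the area under the density-scaled histogram to the left of that cutoff, with the exact area under the Normal$(\mu = 1, \sigma^2 = 2)$ curve:

```{r}
# Proportion of the sample below 2 -- roughly the area under the
# density-scaled histogram to the left of 2
mean(norm_sample < 2)

# Exact area to the left of 2 under the Normal(mu = 1, sigma^2 = 2) density
pnorm(2, mean = 1, sd = sqrt(2))
```

The two numbers should be close, since the sample was drawn from exactly that Normal distribution.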

There are two ways to overlay a density: use a function that adds a line corresponding to a function to a plot, or generate a grid of points and connect them with a line. The first is generally easier, but more restrictive, so occasionally you'll need the second.

### `stat_function()`

`stat_function()` is a `ggplot2` function that adds a line corresponding to its `fun` argument to an existing plot. Which function do we want to add? The Normal density, `dnorm()`. We simply add it (using `+`) to our existing histogram, making sure the `x` aesthetic is the same as for our histogram:

```{r}
ggplot() +
  geom_histogram(aes(x = norm_sample, y = ..density..)) +
  stat_function(aes(x = norm_sample), fun = dnorm)
```

That doesn't look quite right! We didn't specify the mean and standard deviation for `dnorm`. In `stat_function()` that is done by adding an `args` argument with the arguments as a list:

```{r}
ggplot() +
  geom_histogram(aes(x = norm_sample, y = ..density..)) +
  stat_function(aes(x = norm_sample), fun = dnorm,
    args = list(mean = 1, sd = sqrt(2)))
```

Much better! We can save a bit of typing by moving the common mapping `aes(x = norm_sample)` up into the `ggplot()` call:

```{r}
ggplot(mapping = aes(x = norm_sample)) +
  geom_histogram(aes(y = ..density..)) +
  stat_function(fun = dnorm, args = list(mean = 1, sd = sqrt(2)))
```

**How close is the histogram to the density? If you wanted to see closer agreement, what would you change? Try it.**

**Generate a sample from an Exponential(rate = 1) distribution and add the true density to a histogram of your sample.**

### Generating points and adding a line

The other option is to create a sequence of points that covers the x-axis, evaluate `dnorm()` at those points, and add them as a line:

```{r}
x_seq <- seq(-3, 6, 0.01)
ggplot(mapping = aes(x = norm_sample)) +
  geom_histogram(aes(y = ..density..)) +
  geom_line(aes(x = x_seq, y = dnorm(x_seq, mean = 1, sd = sqrt(2))))
```

The result is the same as using `stat_function()`, but you have a little more control over the arguments to the function, which, for example, allows you to have them depend on the data.

## Central Limit Theorem exploration

Recall the CLT says that the sampling distribution of the sample mean is approximately Normal for large sample sizes. How *large* is large enough? If we have a known population, we can simulate the sampling distribution and then compare it to the one predicted by the CLT.

Remember our code for simulating a sampling distribution for the sample mean. Here it is again, but with a placeholder for the sampling process:

```
n_sim <- 2000

# Generate many samples
samples <- rerun(.n = n_sim, <>)

# Find sample mean of each sample
sample_means <- map_dbl(samples, ~ mean(.x))
```

We can plug in different populations and sample sizes to explore how quickly the approximation becomes reasonable. Try substituting `rexp(n = 10)` for `<>` to explore the sampling distribution for the sample mean with samples of size 10 from an Exponential(1) population:

```{r}
n_sim <- 2000

# Generate many samples
samples <- rerun(.n = n_sim, rexp(n = 10))

# Find sample mean of each sample
sample_means <- map_dbl(samples, ~ mean(.x))
```

We can take a look at the histogram:

```{r}
ggplot(mapping = aes(x = sample_means)) +
  geom_histogram(aes(y = ..density..))
```

**Add an appropriate Normal density to the histogram.**

Recall an Exponential distribution with rate $\lambda$ has mean $1/\lambda$ and variance $1/\lambda^2$, so an Exponential(1) population has mean 1 and variance 1. The CLT-based Normal approximation says we'd expect about 50% of sample means to be below 1.
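
If you're wondering where that 50% figure comes from, here's a small sketch: under the CLT approximation, the mean of 10 draws from an Exponential(1) is roughly Normal with mean 1 and standard deviation $1/\sqrt{10}$, and `pnorm()` gives the area to the left of 1 under that curve (the call below is just one way to compute it):

```{r}
# CLT approximation: mean of 10 Exponential(1) draws ~ Normal(1, 1/10)
# Area to the left of 1 under this approximating Normal
pnorm(1, mean = 1, sd = 1 / sqrt(10))
```

Since 1 is the center of the approximating Normal, this area is exactly 0.5.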

We could verify how close this is by estimating this probability using our simulated sampling distribution:

```{r}
mean(sample_means < 1)
```

Our estimate is subject to some variability because we only simulated $`r n_sim`$ sample means. Roughly, we'd expect our estimate to be within about $\frac{1}{\sqrt{n_{\text{sim}}}}$ of the true value, where $n_{\text{sim}}$ is the number of simulated sample means.

**Is your estimate consistent with 0.5 (i.e. the value the CLT approximation predicts), at least up to a margin of $\frac{1}{\sqrt{`r n_sim`}}$? What does this say about the approximation?**

**How about at a sample size of 50?**

**Try the same (n = 10 and n = 50) for a Poisson$(\lambda = 5)$. Does the Normal approximation seem to be better or worse?**
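
If you'd like a starting point for that last exercise, here is one possible sketch for the $n = 10$ Poisson case (the object names are just placeholders, and the $n = 50$ version and the comparison are left to you). Recall a Poisson$(\lambda)$ has mean $\lambda$ and variance $\lambda$, so the CLT-predicted Normal for the sample mean has mean 5 and standard deviation $\sqrt{5/10}$:

```{r}
# Sketch: sampling distribution of the mean of 10 draws from a Poisson(lambda = 5)
n_sim <- 2000
samples_pois <- rerun(.n = n_sim, rpois(n = 10, lambda = 5))
sample_means_pois <- map_dbl(samples_pois, ~ mean(.x))

# Histogram on the density scale with the CLT-predicted Normal overlaid
ggplot(mapping = aes(x = sample_means_pois)) +
  geom_histogram(aes(y = ..density..)) +
  stat_function(fun = dnorm, args = list(mean = 5, sd = sqrt(5 / 10)))
```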