---
title: Lab 2
author: Charlotte Wickham
date: '2017-10-03'
slug: lab-2
draft: false
output:
  blogdown::html_page:
    toc: true
---
```{r setup, include = FALSE}
knitr::opts_chunk$set(echo = TRUE, results = "hide",
                      message = FALSE, fig.keep = "none")
```
## Goals
* Review the probability distribution functions in R
* Explore the Central Limit Theorem
```{r}
library(tidyverse)
```
## Distributions in R
### The Normal distribution
You've now seen (in lecture and homework) four functions that relate to the Normal distribution: `rnorm()`, `pnorm()`, `qnorm()`, `dnorm()`. They respectively generate a sample, find the area under the curve, find quantiles, and find the density.
You can see the help for all four on one page:
```{r, eval = FALSE}
?rnorm
```
You'll notice all four take the arguments `mean` and `sd` which allow you to specify the desired mean ($\mu$) and standard deviation ($\sigma$).
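For instance, with the default standard Normal ($\mu = 0$, $\sigma = 1$), each function might be called like this (a quick sketch to illustrate the argument pattern; the values in the comments are standard facts about the Normal):

```{r}
# Area to the left of 0 under a standard Normal: 0.5
pnorm(0)

# The 0.975 quantile of a standard Normal: about 1.96
qnorm(0.975)

# Density of a standard Normal at 0: about 0.399
dnorm(0)

# A sample of 5 from a Normal(mean = 10, sd = 2)
rnorm(5, mean = 10, sd = 2)
```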
**Review your understanding by finding:**
* The area to the left of 1 under a Normal$(\mu = 3, \sigma^2 = 4)$.
* The value, $z$, such that the area under a Normal$(\mu = 100, \sigma^2 = 100)$ to the left of $z$ is 0.05.
* The value, $z$, such that the area under a Normal$(\mu = 100, \sigma^2 = 100)$ to the **right** of $z$ is 0.05.
* A sample of size 10 from a Normal$(\mu = 1, \sigma = 0.5)$
### Other distributions
R has a whole host of other distribution functions. You can see a list of those built in at:
```{r, eval = FALSE}
?Distributions
```
They all tend to take a similar form: a function name that starts with `r` to generate samples, `p` for probabilities, `q` for quantiles, and `d` for densities.
**What function generates a sample from a Binomial distribution?**
Take a look at:
```{r, eval = FALSE}
?dexp
```
* **What distribution is this?**
* **What parameter can you specify? What is the default value of this parameter?**
### Adding a density to a histogram
When we are investigating properties of distributions by simulation, it's often useful to be able to add a density from a known distribution to a histogram of values.
To illustrate let's generate a sample of 1000 from a Normal$(\mu = 1, \sigma^2 = 2)$:
```{r}
norm_sample <- rnorm(1000, mean = 1, sd = sqrt(2))
```
We've seen that you can create a histogram:
```{r}
ggplot() +
  geom_histogram(aes(x = norm_sample))
```
But say we want to overlay the density these samples were drawn from. In practice, you'll usually be overlaying a density that only approximates the density the samples are drawn from; here it is exactly the density the samples are drawn from.
One complication is that the area under a density is 1, but the area under a histogram depends on the sample size and the binwidth. To place both a histogram and a density function on the same scale, it's useful to plot the histogram using density on the y-axis rather than count.
You can do that by specifying `y = ..density..` in the mapping:
```{r}
ggplot() +
  geom_histogram(aes(x = norm_sample, y = ..density..))
```
`..density..` is a variable that gets calculated by `geom_histogram()`. The double dots (`..`) are there to avoid the calculated variable conflicting with the name of an existing variable. You can see the computed variables available on the help page for each geom, e.g. `?geom_histogram`.
If you look at the histogram, you'll notice the scale of the y-axis has changed. It can now be used like a probability density function, in that areas under the curve can be used to estimate probabilities.
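As a quick check (a sketch, reusing `norm_sample` from the chunk above), the proportion of sample values below a cutoff should be close to the corresponding area under the true density:

```{r}
# Proportion of the sample below 1: an area under the density-scaled histogram
mean(norm_sample < 1)

# The corresponding area under the true density: exactly 0.5, since 1 is the mean
pnorm(1, mean = 1, sd = sqrt(2))
```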
There are two ways to overlay a density: use a function that adds a line corresponding to a mathematical function to a plot, or generate a grid of points and connect them with a line. The first is generally easier but more restrictive, so occasionally you'll need the second.
### `stat_function()`
`stat_function()` is a `ggplot2` function that adds a line corresponding to its `fun` argument to an existing plot.
The function we want to add is the Normal density, `dnorm`. We simply add it (using `+`) to our existing histogram, making sure the `x` aesthetic is the same as for our histogram:
```{r}
ggplot() +
  geom_histogram(aes(x = norm_sample, y = ..density..)) +
  stat_function(aes(x = norm_sample), fun = dnorm)
```
That doesn't look quite right! We didn't specify the mean and standard deviation for `dnorm`. In `stat_function()` that is done by passing the arguments as a list to the `args` argument:
```{r}
ggplot() +
  geom_histogram(aes(x = norm_sample, y = ..density..)) +
  stat_function(aes(x = norm_sample), fun = dnorm,
                args = list(mean = 1, sd = sqrt(2)))
```
Much better! We can save a bit of typing by moving the common mapping `aes(x = norm_sample)` up into the `ggplot()` call:
```{r}
ggplot(mapping = aes(x = norm_sample)) +
  geom_histogram(aes(y = ..density..)) +
  stat_function(fun = dnorm,
                args = list(mean = 1, sd = sqrt(2)))
```
**How close is the histogram to the density? If you wanted to see closer agreement what would you change? Try it.**
**Generate a sample from an Exponential(rate = 1) distribution and add the true density to a histogram of your sample.**
### Generating points and adding a line
The other option is to create a sequence of points that cover the x-axis, then evaluate `dnorm()` at those points and add them as a line:
```{r}
x_seq <- seq(-3, 6, 0.01)
ggplot(mapping = aes(x = norm_sample)) +
  geom_histogram(aes(y = ..density..)) +
  geom_line(aes(x = x_seq, y = dnorm(x_seq, mean = 1, sd = sqrt(2))))
```
The result is the same as using `stat_function()`, but you have a little more control over the arguments to the function, which, for example, allows you to have them depend on the data.
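For instance (a sketch, reusing `norm_sample` from above), the grid approach lets the overlaid density use parameters estimated from the data rather than the known population values:

```{r}
x_seq <- seq(-3, 6, 0.01)
ggplot(mapping = aes(x = norm_sample)) +
  geom_histogram(aes(y = ..density..)) +
  # A Normal density with the sample mean and sample sd, not the true values
  geom_line(aes(x = x_seq,
                y = dnorm(x_seq, mean = mean(norm_sample), sd = sd(norm_sample))))
```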
## Central Limit Theorem exploration
Recall the CLT says that the sampling distribution of the sample mean is approximately Normal for large sample sizes. How *large* is large enough?
If we have a known population we can simulate the sampling distribution and then compare it to that predicted by the CLT.
Remember our code for simulating a sampling distribution for the sample mean. Here it is again but with a placeholder for the sampling process:
```
n_sim <- 2000
# Generate many samples
samples <- rerun(.n = n_sim,
                 <> )
# Find sample mean of each sample
sample_means <- map_dbl(samples, ~ mean(.x))
```
We can plug in different populations and sample sizes to explore how quickly the approximation becomes reasonable. Try substituting `rexp(n = 10)` for `<>` to explore the sampling distribution of the sample mean for samples of size 10 from an Exponential(1) population:
```{r}
n_sim <- 2000
# Generate many samples
samples <- rerun(.n = n_sim,
                 rexp(n = 10))
# Find sample mean of each sample
sample_means <- map_dbl(samples, ~ mean(.x))
```
We can take a look at the histogram:
```{r}
ggplot(mapping = aes(x = sample_means)) +
  geom_histogram(aes(y = ..density..))
```
**Add an appropriate Normal density to the histogram.** Recall an Exponential$(\lambda)$ has mean $1/\lambda$ and variance $1/\lambda^2$.
The CLT-based Normal approximation says we'd expect about 50% of sample means to fall below 1. We can check how close this is by estimating this probability using our simulated sampling distribution:
```{r}
mean(sample_means < 1)
```
Our estimate is subject to some variability because we only simulated $`r n_sim`$ sample means. Roughly, we'd expect our estimate to be within about $\frac{1}{\sqrt{`r n_sim`}}$ of the true value.
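Why $1/\sqrt{n}$? A proportion estimated from $n$ independent simulations has standard error about $\sqrt{p(1-p)/n}$, and with $p$ near 0.5 a two-standard-error margin works out to exactly $1/\sqrt{n}$ (a quick sketch):

```{r}
n_sim <- 2000
# With p near 0.5, two standard errors of the simulated proportion
2 * sqrt(0.5 * 0.5 / n_sim)
# ... which is exactly 1 / sqrt(n_sim): about 0.022 here
1 / sqrt(n_sim)
```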
**Is your estimate consistent with 0.5 (i.e. the value the CLT approximation predicts), at least up to a margin of $\frac{1}{\sqrt{`r n_sim`}}$? What does this say about the approximation?**
**How about at a sample size of 50?**
**Try the same (n = 10 and n = 50) for a Poisson$(\lambda = 5)$. Does the Normal approximation seem to be better or worse?**