## Goals

- Review the probability distribution functions in R
- Explore the Central Limit Theorem

`library(tidyverse)`

## Distributions in R

### The Normal distribution

You’ve now seen (in lecture and homework) four functions that relate to the Normal distribution: `rnorm()`

, `pnorm()`

, `qnorm()`

, `dnorm()`

. They respectively generate a sample, find the area under the curve, find quantiles, and find the density.

You can see the help for all four on one page:

`?rnorm`

You’ll notice all four take the arguments `mean`

and `sd`

which allow you to specify the desired mean (\(\mu\)) and standard deviation (\(\sigma\)).

**Review your understanding by finding:**

- The area to the left of 1 under a Normal\((\mu = 3, \sigma^2 = 4)\).
- The value, \(z\), such that the area under a Normal\((\mu = 100, \sigma^2 = 100)\) to the left of \(z\) is 0.05.
- The value, \(z\), such that the area under a Normal\((\mu = 100, \sigma^2 = 100)\) to the
**right**of \(z\) is 0.05. - A sample of size 10 from a Normal\((\mu = 1, \sigma = 0.5)\)

### Other distributions

R has a whole host of other distribution functions. You can see a list of those built in at:

`?Distributions`

And they tend to all take a similar form, a function that starts with: `r`

to generate samples, `p`

for probabilities, `q`

for quantiles and `d`

for densities.

**What function generates a sample from a Binomial distribution?**

Take a look at:

`?dexp`

**What distribution is this?****What parameter can you specify? What is the default value of this parameter**?

### Adding a density to a histogram

When we are investigating properties of distributions by simulation, it’s often useful to be able to add a density from a known distribution to a histogram of values.

To illustrate let’s generate a sample of 1000 from a Normal\((\mu = 1, \sigma^2 = 2)\):

`norm_sample <- rnorm(1000, mean = 1, sd = sqrt(2))`

We’ve seen you can create a histogram

```
ggplot() +
geom_histogram(aes(x = norm_sample))
```

But, say we want to overlay the density these samples were drawn from. In practice, you’ll be overlaying a density that may approximate the density the samples are drawn from - here it is exactly the density the samples are drawn from.

One complication, is that the area under a density is 1, but the area under a histogram depends on the sample size and the binwidth. To place both a histogram and a density function on the same scale, its useful to plot the histogram using density on the y-axis, rather than count.

You can do that by specifying `y = ..density..`

is the mapping:

```
ggplot() +
geom_histogram(aes(x = norm_sample, y = ..density..))
```

`..density..`

is a variable that gets calculated by `geom_histogram()`

. The double dots (`..`

) are just there to avoid a calculated variable conflicting with a name of an existing variable. You can see the computed variables available on the help page for geoms, i.e. `?geom_histogram`

.

If you look at the histogram, you’ll notice the scale of the y-axis has changed. It can now be used like a probability density function, in that areas under the curve can be used to estimate probabilities.

There a two ways to overlay a density: use a function that adds a line corresponding to a function to a plot, or generate a grid of points and connect them with a line. The first is generally easier, but more restrictive, so occasionally you’ll need the second.

`stat_function()`

`stat_function()`

is a `ggplot2`

function that plots a line corresponding to its `fun`

argument to an existing plot.

What function do we want to add: the Normal density, `dnorm`

. We simply add it (using `+`

) to our existing histogram making sure the `x`

aesthetic is the same as for our histogram:

```
ggplot() +
geom_histogram(aes(x = norm_sample, y = ..density..)) +
stat_function(aes(x = norm_sample), fun = dnorm)
```

That doesn’t look quite right! We didn’t specify the mean and variance for `dnorm`

. In `stat_function()`

that is done by adding an `args`

argument with the arguments as a list:

```
ggplot() +
geom_histogram(aes(x = norm_sample, y = ..density..)) +
stat_function(aes(x = norm_sample), fun = dnorm,
args = list(mean = 1, sd = sqrt(2)))
```

Much better! We can save a bit of typing by moving the common mapping `aes(x = norm_sample)`

up into the `ggplot()`

call:

```
ggplot(mapping = aes(x = norm_sample)) +
geom_histogram(aes(y = ..density..)) +
stat_function(fun = dnorm,
args = list(mean = 1, sd = sqrt(2)))
```

**How close is the histogram to the density? If you wanted to see closer agreement what would you change? Try it.**

**Generate a sample from an Exponential(rate = 1) distribution and add the true density to a histogram of your sample.**

### Generating points and adding a line

The other option is to create a sequence of points that cover the x-axis, then evaluate `dnorm()`

at those points and add them as a line:

```
x_seq <- seq(-3, 6, 0.01)
ggplot(mapping = aes(x = norm_sample)) +
geom_histogram(aes(y = ..density..)) +
geom_line(aes(x = x_seq, y = dnorm(x_seq, mean = 1, sd = sqrt(2))))
```

The result is the same as using `stat_function()`

but you have a little more control over the arguments to the function, which for example allows you to do things like have them depend on the data.

## Central Limit Theorem exploration

Recall the CLT says that the sampling distribution of the sample mean is approximately Normal for large sample sizes. How *large* is large enough?

If we have a known population we can simulate the sampling distribution and then compare it to that predicted by the CLT.

Remember our code for simulating a sampling distribution for the sample mean. Here it is again but with a placeholder for the sampling process:

```
n_sim <- 2000
# Generate many samples
samples <- rerun(.n = n_sim,
<<SAMPLE FROM POPULATION>> )
# Find sample mean of each sample
sample_means <- map_dbl(samples, ~ mean(.x))
```

We can plug in different populations and sample sizes to explore how fast the approximation holds. Try substituting `rexp(n = 10)`

for `<<SAMPLE FROM POPULATION>>`

to explore the sampling distribution for the sample mean with samples of size 10 from an Exponential(1) population:

```
n_sim <- 2000
# Generate many samples
samples <- rerun(.n = n_sim,
rexp(n = 10))
# Find sample mean of each sample
sample_means <- map_dbl(samples, ~ mean(.x))
```

We can take a look at the histogram:

```
ggplot(mapping = aes(x = sample_means)) +
geom_histogram(aes(y = ..density..))
```

**Add an appropriate Normal density to the histogram.** Recall an Exponential\((\lambda)\) has mean \(\lambda\) and variance \(\lambda^2\).

The CLT based Normal approximation says we’d expect about 50% of sample means below 1. We could verify how close this is, by estimating this probability using our simulated sampling distribution:

`mean(sample_means < 1)`

Our estimate is subject to some variability because we only simulated \(2000\) sample means. Roughly we’d expect our estimate to be within about \(\frac{1}{\sqrt{n}}\) of the true value.

**Is your estimate consistent with 0.5 (i.e. the value the CLT approximation predicts), at least up to a margin of \(\frac{1}{\sqrt{2000}}\)? What does this say about the approximation?**

**How about at a sample size of 50?**

**Try the same (n = 10 and n = 50) for a Poisson\((\lambda = 5)\). Does the Normal approximation seem to be better or worse?**