---
title: Lab 2
author: Charlotte Wickham
date: '2017-10-03'
slug: lab-2
draft: false
output:
  blogdown::html_page:
    toc: true
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(echo = TRUE, results = "hide", message = FALSE, fig.keep = "none")
```

## Goals

* Review the probability distribution functions in R
* Explore the Central Limit Theorem

```{r}
library(tidyverse)
```

## Distributions in R

### The Normal distribution

You've now seen (in lecture and homework) four functions that relate to the Normal distribution: `rnorm()`, `pnorm()`, `qnorm()`, `dnorm()`. They respectively generate a sample, find the area under the curve, find quantiles, and find the density. You can see the help for all four on one page:

```{r, eval = FALSE}
?rnorm
```

You'll notice all four take the arguments `mean` and `sd`, which allow you to specify the desired mean ($\mu$) and standard deviation ($\sigma$).

**Review your understanding by finding:**

* The area to the left of 1 under a Normal$(\mu = 3, \sigma^2 = 4)$ curve.
* The value, $z$, such that the area under a Normal$(\mu = 100, \sigma^2 = 100)$ curve to the left of $z$ is 0.05.
* The value, $z$, such that the area under a Normal$(\mu = 100, \sigma^2 = 100)$ curve to the **right** of $z$ is 0.05.
* A sample of size 10 from a Normal$(\mu = 1, \sigma = 0.5)$.

### Other distributions

R has a whole host of other distribution functions. You can see a list of those built in at:

```{r, eval = FALSE}
?Distributions
```

They all tend to take a similar form: a function that starts with `r` generates samples, `p` gives probabilities, `q` gives quantiles, and `d` gives densities.

**What function generates a sample from a Binomial distribution?**

Take a look at:

```{r, eval = FALSE}
?dexp
```

* **What distribution is this?**
* **What parameter can you specify? What is the default value of this parameter?**

### Adding a density to a histogram

When we are investigating properties of distributions by simulation, it's often useful to be able to add a density from a known distribution to a histogram of values. To illustrate, let's generate a sample of 1000 from a Normal$(\mu = 1, \sigma^2 = 2)$:

```{r}
norm_sample <- rnorm(1000, mean = 1, sd = sqrt(2))
```

We've seen you can create a histogram:

```{r}
ggplot() +
  geom_histogram(aes(x = norm_sample))
```

But, say we want to overlay the density these samples were drawn from. In practice, you'll usually be overlaying a density that only approximates the one the samples are drawn from; here it is exactly the density the samples are drawn from.

One complication is that the area under a density is 1, but the area under a histogram depends on the sample size and the binwidth. To place both a histogram and a density function on the same scale, it's useful to plot the histogram using density on the y-axis, rather than count. You can do that by specifying `y = ..density..` in the mapping:

```{r}
ggplot() +
  geom_histogram(aes(x = norm_sample, y = ..density..))
```

`..density..` is a variable that gets calculated by `geom_histogram()`. The double dots (`..`) are just there to avoid a calculated variable conflicting with the name of an existing variable. You can see the computed variables available on the help page for a geom, e.g. `?geom_histogram`.

If you look at the histogram, you'll notice the scale of the y-axis has changed. It can now be used like a probability density function, in that areas under the curve can be used to estimate probabilities.
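
For instance (just a quick sketch, not part of the lab exercises; the cutoff of 2 below is an arbitrary choice), you could compare the proportion of the sample below a cutoff, which is roughly the area under the density-scaled histogram to the left of that cutoff, with the exact area under the Normal$(\mu = 1, \sigma^2 = 2)$ curve:

```{r}
# Proportion of the sample below 2 -- roughly the area under the
# density-scaled histogram to the left of 2
mean(norm_sample < 2)

# Exact area to the left of 2 under the Normal(mu = 1, sigma^2 = 2) density
pnorm(2, mean = 1, sd = sqrt(2))
```

The two numbers should be close, since the sample was drawn from exactly that Normal distribution.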

There are two ways to overlay a density: use a function that adds a line corresponding to a function to a plot, or generate a grid of points and connect them with a line. The first is generally easier, but more restrictive, so occasionally you'll need the second.

### `stat_function()`

`stat_function()` is a `ggplot2` function that adds a line corresponding to its `fun` argument to an existing plot. Which function do we want to add? The Normal density, `dnorm()`. We simply add it (using `+`) to our existing histogram, making sure the `x` aesthetic is the same as for our histogram:

```{r}
ggplot() +
  geom_histogram(aes(x = norm_sample, y = ..density..)) +
  stat_function(aes(x = norm_sample), fun = dnorm)
```

That doesn't look quite right! We didn't specify the mean and standard deviation for `dnorm`. In `stat_function()` that is done by adding an `args` argument with the arguments as a list:

```{r}
ggplot() +
  geom_histogram(aes(x = norm_sample, y = ..density..)) +
  stat_function(aes(x = norm_sample), fun = dnorm,
    args = list(mean = 1, sd = sqrt(2)))
```

Much better! We can save a bit of typing by moving the common mapping `aes(x = norm_sample)` up into the `ggplot()` call:

```{r}
ggplot(mapping = aes(x = norm_sample)) +
  geom_histogram(aes(y = ..density..)) +
  stat_function(fun = dnorm, args = list(mean = 1, sd = sqrt(2)))
```

**How close is the histogram to the density? If you wanted to see closer agreement, what would you change? Try it.**

**Generate a sample from an Exponential(rate = 1) distribution and add the true density to a histogram of your sample.**

### Generating points and adding a line

The other option is to create a sequence of points that covers the x-axis, evaluate `dnorm()` at those points, and add them as a line:

```{r}
x_seq <- seq(-3, 6, 0.01)
ggplot(mapping = aes(x = norm_sample)) +
  geom_histogram(aes(y = ..density..)) +
  geom_line(aes(x = x_seq, y = dnorm(x_seq, mean = 1, sd = sqrt(2))))
```

The result is the same as using `stat_function()`, but you have a little more control over the arguments to the function, which, for example, allows you to have them depend on the data.

## Central Limit Theorem exploration

Recall the CLT says that the sampling distribution of the sample mean is approximately Normal for large sample sizes. How *large* is large enough? If we have a known population, we can simulate the sampling distribution and then compare it to the one predicted by the CLT.

Remember our code for simulating a sampling distribution for the sample mean. Here it is again, but with a placeholder for the sampling process:

```
n_sim <- 2000

# Generate many samples
samples <- rerun(.n = n_sim, <>)

# Find sample mean of each sample
sample_means <- map_dbl(samples, ~ mean(.x))
```

We can plug in different populations and sample sizes to explore how quickly the approximation becomes reasonable. Try substituting `rexp(n = 10)` for `<>` to explore the sampling distribution for the sample mean with samples of size 10 from an Exponential(1) population:

```{r}
n_sim <- 2000

# Generate many samples
samples <- rerun(.n = n_sim, rexp(n = 10))

# Find sample mean of each sample
sample_means <- map_dbl(samples, ~ mean(.x))
```

We can take a look at the histogram:

```{r}
ggplot(mapping = aes(x = sample_means)) +
  geom_histogram(aes(y = ..density..))
```

**Add an appropriate Normal density to the histogram.**

Recall an Exponential distribution with rate $\lambda$ has mean $1/\lambda$ and variance $1/\lambda^2$, so an Exponential(1) population has mean 1 and variance 1. The CLT-based Normal approximation says we'd expect about 50% of sample means to be below 1.
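
If you're wondering where that 50% figure comes from, here's a small sketch: under the CLT approximation, the mean of 10 draws from an Exponential(1) is roughly Normal with mean 1 and standard deviation $1/\sqrt{10}$, and `pnorm()` gives the area to the left of 1 under that curve (the call below is just one way to compute it):

```{r}
# CLT approximation: mean of 10 Exponential(1) draws ~ Normal(1, 1/10)
# Area to the left of 1 under this approximating Normal
pnorm(1, mean = 1, sd = 1 / sqrt(10))
```

Since 1 is the center of the approximating Normal, this area is exactly 0.5.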

We could verify how close this is by estimating this probability using our simulated sampling distribution:

```{r}
mean(sample_means < 1)
```

Our estimate is subject to some variability because we only simulated $`r n_sim`$ sample means. Roughly, we'd expect our estimate to be within about $\frac{1}{\sqrt{n_{\text{sim}}}}$ of the true value, where $n_{\text{sim}}$ is the number of simulated sample means.

**Is your estimate consistent with 0.5 (i.e. the value the CLT approximation predicts), at least up to a margin of $\frac{1}{\sqrt{`r n_sim`}}$? What does this say about the approximation?**

**How about at a sample size of 50?**

**Try the same (n = 10 and n = 50) for a Poisson$(\lambda = 5)$. Does the Normal approximation seem to be better or worse?**
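
If you'd like a starting point for that last exercise, here is one possible sketch for the $n = 10$ Poisson case (the object names are just placeholders, and the $n = 50$ version and the comparison are left to you). Recall a Poisson$(\lambda)$ has mean $\lambda$ and variance $\lambda$, so the CLT-predicted Normal for the sample mean has mean 5 and standard deviation $\sqrt{5/10}$:

```{r}
# Sketch: sampling distribution of the mean of 10 draws from a Poisson(lambda = 5)
n_sim <- 2000
samples_pois <- rerun(.n = n_sim, rpois(n = 10, lambda = 5))
sample_means_pois <- map_dbl(samples_pois, ~ mean(.x))

# Histogram on the density scale with the CLT-predicted Normal overlaid
ggplot(mapping = aes(x = sample_means_pois)) +
  geom_histogram(aes(y = ..density..)) +
  stat_function(fun = dnorm, args = list(mean = 5, sd = sqrt(5 / 10)))
```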