Lab 2 RMarkdown

Tue Oct 3 or Wed Oct 4

Goals

  • Review the probability distribution functions in R
  • Explore the Central Limit Theorem
library(tidyverse)

Distributions in R

The Normal distribution

You’ve now seen (in lecture and homework) four functions that relate to the Normal distribution: rnorm(), pnorm(), qnorm(), dnorm(). They respectively generate a sample, find the area under the curve, find quantiles, and find the density.

You can see the help for all four on one page:

?rnorm

You’ll notice all four take the arguments mean and sd which allow you to specify the desired mean (\(\mu\)) and standard deviation (\(\sigma\)).

Review your understanding by finding:

  • The area to the left of 1 under a Normal\((\mu = 3, \sigma^2 = 4)\).
  • The value, \(z\), such that the area under a Normal\((\mu = 100, \sigma^2 = 100)\) to the left of \(z\) is 0.05.
  • The value, \(z\), such that the area under a Normal\((\mu = 100, \sigma^2 = 100)\) to the right of \(z\) is 0.05.
  • A sample of size 10 from a Normal\((\mu = 1, \sigma = 0.5)\).
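One way to check your answers (remember these functions take the standard deviation sd, not the variance, so \(\sigma^2 = 4\) becomes sd = 2):

```r
pnorm(1, mean = 3, sd = 2)        # area to the left of 1 under Normal(3, sd = 2)
qnorm(0.05, mean = 100, sd = 10)  # z with area 0.05 to its left
qnorm(0.95, mean = 100, sd = 10)  # z with area 0.05 to its right
rnorm(10, mean = 1, sd = 0.5)     # a sample of size 10
```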

Other distributions

R has a whole host of other distribution functions. You can see a list of those built in at:

?Distributions

They all follow a similar naming scheme: a function name starting with r generates samples, p gives probabilities (areas under the density), q gives quantiles, and d gives densities.
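For example, here is the same four-function pattern for the Uniform\((0, 1)\) distribution, where the answers are easy to verify by hand:

```r
runif(3)     # three random draws from Uniform(0, 1)
punif(0.25)  # P(U <= 0.25), which is 0.25
qunif(0.9)   # the 0.9 quantile, which is 0.9
dunif(0.5)   # the density at 0.5, which is 1
```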

What function generates a sample from a Binomial distribution?

Take a look at:

?dexp
  • What distribution is this?
  • What parameter can you specify? What is the default value of this parameter?

Adding a density to a histogram

When we are investigating properties of distributions by simulation, it’s often useful to be able to add a density from a known distribution to a histogram of values.

To illustrate let’s generate a sample of 1000 from a Normal\((\mu = 1, \sigma^2 = 2)\):

norm_sample <- rnorm(1000, mean = 1, sd = sqrt(2))

We’ve seen you can create a histogram:

ggplot() + 
  geom_histogram(aes(x = norm_sample))

But, say we want to overlay the density these samples were drawn from. In practice, you’ll usually be overlaying a density that only approximates the density the samples were drawn from; here it is exactly the density the samples were drawn from.

One complication is that the area under a density is 1, but the area under a histogram depends on the sample size and the binwidth. To place both a histogram and a density function on the same scale, it’s useful to plot the histogram with density on the y-axis, rather than count.

You can do that by specifying y = ..density.. in the mapping:

ggplot() + 
  geom_histogram(aes(x = norm_sample, y = ..density..))

..density.. is a variable that gets calculated by geom_histogram(). The double dots (..) are there to avoid a calculated variable conflicting with the name of an existing variable. (In more recent versions of ggplot2 the preferred spelling is after_stat(density).) You can see the computed variables available on the help page for a geom, e.g. ?geom_histogram.

If you look at the histogram, you’ll notice the scale of the y-axis has changed. It can now be used like a probability density function, in that areas under the curve can be used to estimate probabilities.
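As a sanity check, the bar areas of a density-scaled histogram should sum to 1, just like a probability density function. A quick way to verify this, using base R’s hist() (with plot = FALSE so nothing is drawn):

```r
set.seed(1)
norm_sample <- rnorm(1000, mean = 1, sd = sqrt(2))

# hist() computes the same density scaling: counts / (n * binwidth)
h <- hist(norm_sample, plot = FALSE)

# Each bar's area is density * binwidth; the total area is 1
sum(h$density * diff(h$breaks))
```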

There are two ways to overlay a density: use a function that adds a line corresponding to a function to a plot, or generate a grid of points and connect them with a line. The first is generally easier, but more restrictive, so occasionally you’ll need the second.

stat_function()

stat_function() is a ggplot2 function that adds a line, corresponding to its fun argument, to an existing plot.

Which function do we want to add? The Normal density, dnorm. We simply add it (using +) to our existing histogram, making sure the x aesthetic is the same as for our histogram:

ggplot() + 
  geom_histogram(aes(x = norm_sample, y = ..density..)) +
  stat_function(aes(x = norm_sample), fun = dnorm)

That doesn’t look quite right! We didn’t specify the mean and standard deviation for dnorm. In stat_function() that is done by adding an args argument with the arguments as a list:

ggplot() + 
  geom_histogram(aes(x = norm_sample, y = ..density..)) +
  stat_function(aes(x = norm_sample), fun = dnorm, 
    args = list(mean = 1, sd = sqrt(2)))

Much better! We can save a bit of typing by moving the common mapping aes(x = norm_sample) up into the ggplot() call:

ggplot(mapping = aes(x = norm_sample)) + 
  geom_histogram(aes(y = ..density..)) +
  stat_function(fun = dnorm, 
    args = list(mean = 1, sd = sqrt(2)))

How close is the histogram to the density? If you wanted to see closer agreement what would you change? Try it.

Generate a sample from an Exponential(rate = 1) distribution and add the true density to a histogram of your sample.

Generating points and adding a line

The other option is to create a sequence of points that cover the x-axis, then evaluate dnorm() at those points and add them as a line:

x_seq <- seq(-3, 6, 0.01)

ggplot(mapping = aes(x = norm_sample)) + 
  geom_histogram(aes(y = ..density..)) +
  geom_line(aes(x = x_seq, y = dnorm(x_seq, mean = 1, sd = sqrt(2))))

The result is the same as using stat_function(), but you have a little more control over the arguments to the function, which, for example, allows you to have them depend on the data.
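For instance, here is a sketch of overlaying a Normal density whose parameters are estimated from the sample itself (something stat_function() with fixed args can’t do directly):

```r
library(ggplot2)

set.seed(1)
norm_sample <- rnorm(1000, mean = 1, sd = sqrt(2))
x_seq <- seq(-3, 6, 0.01)

# Overlay a Normal density using the sample's own mean and sd
ggplot(mapping = aes(x = norm_sample)) +
  geom_histogram(aes(y = ..density..)) +
  geom_line(aes(x = x_seq,
    y = dnorm(x_seq, mean = mean(norm_sample), sd = sd(norm_sample))))
```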

Central Limit Theorem exploration

Recall the CLT says that the sampling distribution of the sample mean is approximately Normal for large sample sizes. How large is large enough?

If we have a known population we can simulate the sampling distribution and then compare it to that predicted by the CLT.

Remember our code for simulating a sampling distribution for the sample mean. Here it is again but with a placeholder for the sampling process:

n_sim <- 2000

# Generate many samples
samples <- rerun(.n = n_sim, 
  <<SAMPLE FROM POPULATION>> )

# Find sample mean of each sample
sample_means <- map_dbl(samples, ~ mean(.x))

We can plug in different populations and sample sizes to explore how fast the approximation holds. Try substituting rexp(n = 10) for <<SAMPLE FROM POPULATION>> to explore the sampling distribution for the sample mean with samples of size 10 from an Exponential(1) population:

n_sim <- 2000

# Generate many samples
samples <- rerun(.n = n_sim, 
  rexp(n = 10))

# Find sample mean of each sample
sample_means <- map_dbl(samples, ~ mean(.x))

We can take a look at the histogram:

ggplot(mapping = aes(x = sample_means)) + 
  geom_histogram(aes(y = ..density..)) 

Add an appropriate Normal density to the histogram. Recall an Exponential\((\lambda)\) has mean \(1/\lambda\) and variance \(1/\lambda^2\).

The CLT-based Normal approximation says we’d expect about 50% of sample means to fall below 1. We can check how close this is by estimating the probability using our simulated sampling distribution:

mean(sample_means < 1)

Our estimate is subject to some variability because we only simulated \(2000\) sample means. Roughly, we’d expect our estimate to be within about \(\frac{1}{\sqrt{n_{\text{sim}}}}\) of the true value, where \(n_{\text{sim}}\) is the number of simulations.
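Concretely, with \(n_{\text{sim}} = 2000\) simulations the rough margin works out to:

```r
n_sim <- 2000
1 / sqrt(n_sim)  # about 0.022
```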

Is your estimate consistent with 0.5 (i.e. the value the CLT approximation predicts), at least up to a margin of \(\frac{1}{\sqrt{2000}}\)? What does this say about the approximation?

How about at a sample size of 50?

Try the same (n = 10 and n = 50) for a Poisson\((\lambda = 5)\). Does the Normal approximation seem to be better or worse?

Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.