Describing Sampling Distributions ST551 Lecture 5

Deriving Properties of the Sampling Distribution

Given a specific statistic, it's sometimes possible to derive properties of its sampling distribution without knowing the shape of the population distribution.

That is, we apply properties of expectation and variance to derive the expectation and variance of the sampling distribution.

Unknown population distribution

Imagine we don't know the population distribution, but we do know it has mean \(\mu\) and variance \(\sigma^2\).

Population: \(\sim (\mu, \sigma^2)\)
Sample: \(n\) i.i.d. observations from the population
Sample statistic: Sample mean

Expectation of sampling distribution of sample mean

\[ E\left(\frac{1}{n}(Y_1 + Y_2 + \ldots + Y_n)\right) = \frac{1}{n}\left(E(Y_1) + E(Y_2) + \ldots + E(Y_n)\right) = \frac{1}{n}\, n\mu = \mu \]

The sampling distribution of the sample mean is centered around the population mean.

Variance of sampling distribution of sample mean

\[ Var\left(\frac{1}{n}(Y_1 + Y_2 + \ldots + Y_n)\right) = \frac{1}{n^2}\left(Var(Y_1) + Var(Y_2) + \ldots + Var(Y_n)\right) = \frac{1}{n^2}\, n\sigma^2 = \frac{\sigma^2}{n} \]

(The middle step uses the independence of the \(Y_i\).)

The variance of the sampling distribution of the sample mean is smaller than the population variance (for \(n > 1\)), and it decreases as \(n\) increases.

More properties of sample means

Weak Law of Large Numbers (WLLN)

For i.i.d. samples from a population with mean \(\mu\):

  • As the sample size increases to infinity (\(n \rightarrow \infty\)), the sample mean converges in probability to the population mean, \(\mu\).

  • The probability that the sample mean is more than any fixed small distance away from \(\mu\) goes to zero as the sample size increases

We write: \[ \overline{Y} \rightarrow_p \mu \]
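
As a quick, informal illustration (the Exponential population and the specific numbers below are placeholders, not part of the formal result), running sample means computed from one long i.i.d. sample settle near \(\mu\):

# WLLN illustration: running sample means approach the population mean
set.seed(551)
mu <- 5                                 # population mean of an Exponential(rate = 1/5)
y <- rexp(10000, rate = 1/mu)           # one long i.i.d. sample
running_means <- cumsum(y) / seq_along(y)
running_means[c(10, 100, 1000, 10000)]  # drifts toward mu = 5 as n grows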

Simulating Sampling Distributions

Just knowing the mean and variance of the sampling distribution generally isn't enough.

If we know or hypothesize a population distribution, we can simulate to approximate the sampling distribution of any statistic.

Simulation Setup

Specify a known or hypothesized population distribution.

Repeat \(B\) times:

  1. Draw sample of size \(n\) from the population distribution
  2. Calculate the desired sample statistic from the sample
  3. Record the value of sample statistic

Get \(B\) sample statistics (from \(B\) samples)

For large \(B\), the distribution of the \(B\) sample statistics approximates the true sampling distribution. Why?
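
Here is a minimal sketch of this recipe in base R, assuming a hypothetical Normal population and the sample median as the statistic (the commute-time example below follows the same pattern with the class data):

# Generic sampling-distribution simulation
set.seed(551)
B <- 5000                                # number of simulated samples
n <- 10                                  # sample size
sample_stats <- replicate(B, {
  samp <- rnorm(n, mean = 20, sd = 5)    # 1. draw a sample of size n
  median(samp)                           # 2. calculate the desired statistic
})                                       # 3. replicate() records all B values
hist(sample_stats)                       # approximates the sampling distribution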

Your Turn

Let \(X\) be a random variable with an unknown distribution.

I obtain \(X_1, \ldots, X_{10}\), i.i.d. samples from this distribution, and get:

5, 3, 7, 4, 4, 3, 7, 2, 7, 3

How would you estimate \(P(X \le 5)\)?

Empirical Distribution Function

The empirical cumulative distribution function (ECDF) for a sample \(X_1, \ldots, X_n\) is:

\[ \widehat{F}(x) = \frac{1}{n}\sum_{i = 1}^n \pmb{1}\left\{ X_i \le x \right\} \]

Intuition: the ECDF at \(x\) is the sample proportion of observed values less than or equal to \(x\).
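
Applied to the Your Turn sample above, \(\widehat{F}(5)\) is just the proportion of observations at or below 5, which is also a natural estimate of \(P(X \le 5)\):

# ECDF of the Your Turn sample, evaluated at x = 5
x <- c(5, 3, 7, 4, 4, 3, 7, 2, 7, 3)
mean(x <= 5)   # proportion of X_i <= 5; here 7/10 = 0.7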

Empirical Distribution Function

\(\widehat{F}(x)\) is a sample mean of the random variables \(\pmb{1}\left\{ X_i \le x \right\}\), so the Weak Law of Large Numbers applies:

\(E\left[\pmb{1}\left\{ X_i \le x \right\}\right] = F(x)\)

\(\widehat{F}(x) \rightarrow_p F(x)\)

So, at every \(x\), the ECDF converges in probability to the true cumulative distribution function.

In practice, this means we can use our simulated sample statistics to approximate the sampling distribution itself, not just its mean and variance.
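
For instance, base R's ecdf() turns a vector of simulated statistics into an estimated CDF of the sampling distribution (here using the simulated sample_stats from the sketch above):

# Approximate the sampling distribution's CDF from simulated statistics
F_hat <- ecdf(sample_stats)   # sample_stats: simulated statistics from the sketch above
F_hat(19)                     # estimated P(statistic <= 19)
plot(F_hat)                   # estimated CDF of the sampling distribution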

Example: Commute times

Population: ST551 students present on the first day of class, Fall 2017
Variable of interest: Commute time in minutes
Parameter: Population mean

What’s the sampling distribution for the sample mean of samples of size 5?

What’s the probability the sample mean from a sample of size 5 is less than 10 minutes?

Example: Commute times

Specify a known or hypothesized population distribution.

Repeat \(B\) times:

  1. Draw sample of size \(n\) from the population distribution
  2. Calculate the desired sample statistic from the sample
  3. Record the value of sample statistic

Get \(B\) sample statistics (from \(B\) samples)

Example: Commute times

Population: all commute times from index cards

Repeat \(B\) times:

  1. Draw 5 cards at random
  2. Find mean commute time of sample
  3. Record the value

Get \(B\) sample statistics (from \(B\) samples)

Example: Commute times

Population: class_data$commute_times

Repeat n_sim times:

  1. one_sample <- sample(class_data$commute_times, size = 5)
  2. mean(one_sample)
  3. Record mean(one_sample)

Example: Commute times

library(tidyverse)
n <- 5
n_sim <- 1000

# Generate many samples
samples <- rerun(.n = n_sim, 
  sample(class_data$commute_times, size = n))

# Do something to each sample
sample_means <- map_dbl(samples, ~ mean(.x))

Example: Commute times

Examining the distribution of the simulated sample statistics

# Sampling dist. histogram
ggplot() + 
  geom_histogram(aes(x = sample_means), binwidth = 1) +
  theme_bw() +
  labs(x = "Sample mean commute time", 
    title = "Sampling Distribution",
    subtitle = "for sample mean and sample size, n = 5")

Example: Commute times

Using the simulated sample means to estimate a probability.

What’s the probability the sample mean from a sample of size 5 is less than 10 minutes?

# Estimate a specific probability
mean(sample_means < 10)
## [1] 0.129

Can’t I just write a for loop?

Yes, you could write a for loop. I almost never do anymore, because a functional style involves a lot less bookkeeping and produces code that more clearly expresses the intent rather than the implementation.
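
For comparison, here is roughly what the same simulation looks like as a for loop (using the same class_data object as the code above; illustrative, not the recommended pattern):

# For-loop version of the simulation above: works, but more bookkeeping
n <- 5
n_sim <- 1000
sample_means <- numeric(n_sim)           # pre-allocate storage
for (i in seq_len(n_sim)) {
  one_sample <- sample(class_data$commute_times, size = n)
  sample_means[i] <- mean(one_sample)    # record each statistic by index
}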

In general:

  • There are lots of ways to get anything done in R.

  • I’ll show you one way (that comes from a lot of experience and recent innovations).

You don’t have to use my way.

You should always aim for code that:

  1. Is correct
  2. Is clear (i.e., understandable to a fellow human being)

Approximate Sampling Distribution

Central Limit Theorem (CLT)

If the population distribution of a variable \(X\) has population mean \(\mu\) and (finite) population variance \(\sigma^2\), then the sampling distribution of the sample mean becomes closer and closer to a Normal distribution as the sample size \(n\) increases.

We can write: \[ \overline{X} \, \dot \sim \, N\left(\mu, \frac{\sigma^2}{n}\right) \] for large values of \(n\), where the symbol \(\dot \sim\) means approximately distributed as.
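
As a rough, self-contained check (using a hypothetical skewed Exponential population with mean 12, not the commute-time data), the simulated sampling distribution of the sample mean can be compared to the Normal approximation \(N\left(\mu, \frac{\sigma^2}{n}\right)\):

# CLT check: simulation vs. Normal approximation for a skewed population
set.seed(551)
n <- 30
xbar <- replicate(5000, mean(rexp(n, rate = 1/12)))   # Exponential: mu = sigma = 12
mean(xbar < 11)                                       # simulated P(sample mean < 11)
pnorm(11, mean = 12, sd = 12 / sqrt(n))               # CLT approximation

The two values should be in the same ballpark here, and they agree more closely as \(n\) grows.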