Random Sampling studies: General Setting
Key components of the study setting:
- Population(s) of interest
- Variable of interest
- Parameter of interest
- (Specific) Question/Hypothesis of interest
Random Sampling study - notation
Take a random sample of \(n\) (sampling) units from the population of interest.
Measure outcome variable of interest on each unit:
\(Y_i =\) measurement of outcome on \(i\)th unit sampled, \(i = 1, \ldots, n\).
Maybe also measure some other explanatory/predictor variable on units:
\(X_i =\) measurement of explanatory variable on \(i\)th unit sampled, \(i = 1, \ldots, n\)
Data Settings
One sample: One outcome variable (Y) measured on units
- What’s the average rent for OSU students?
- What proportion of ST551 students prefer cats to dogs?
- How large is the average family size of US households?
Two sample: One outcome variable measured on units plus one binary explanatory variable
- How does the average rent (\(Y_i\)) of undergraduate (\(X_i = 1\)) OSU students compare to graduate (\(X_i = 0\)) OSU students?
Data Settings (cont.)
Multi-sample: One outcome variable measured on units plus one categorical (> 2 levels) explanatory variable
- Is the average rent (\(Y_i\)) of OSU students different for different kinds of accommodation (dorm, apartment, house)?
Regression settings: (ST552)
Simple: One outcome variable and one continuous explanatory variable
- How much does the rent of OSU students (\(Y_i\)) decrease with the number of people they live with (\(X_i\))?
Multiple: One outcome variable and one or more explanatory variables
- What’s the average rent that OSU students pay for a \(Z\) square foot house with \(X\) bedrooms, \(D\) miles from campus?
For the next few weeks…
We will focus on the one sample random sampling setting.
Measure \(Y\) on \(n\) randomly sampled units from a population of interest.
Interested in some question/hypothesis about some parameter of the population.
Parameters of interest
Parameter: some summary measure of \(Y\) for all units in the population
- Population mean: average of variable of interest for all units in the population
- Population median: median of variable of interest for all units in the population
- Population variance: variance of variable of interest for all units in the population
- … any one-number summary of the variable of interest for all units of the population
Questions about parameters
Point Estimate: the single best guess of the population parameter value
Interval Estimate: a range of likely values for the population parameter
Hypothesis test: is a specific value of the population parameter plausible?
Your Turn
Do people support the idea of a single payer health system?
Discuss with a neighbor: what might be the population, variable, parameter, and question/hypothesis?
Population:
Variable:
Parameter:
Question/Hypothesis:
Probability Review
Population Distribution
The population distribution is the distribution of \(Y\) for the entire population.
It tells us how likely values are over the range of \(Y\).
In particular, it provides us a probability model for \(Y\), so we can find probabilities such as:
\[ P(Y \in (a, b]) = P(a < Y \le b) \]
In words: the probability, for a random unit drawn from the population, that the value of the variable of interest is between \(a\) and \(b\) (technically, greater than \(a\) and less than or equal to \(b\)).
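For a finite population, this probability is just the proportion of units whose value falls in \((a, b]\). A minimal Python sketch of that idea, assuming a small made-up population of monthly rents (the numbers and the function name `prob_between` are hypothetical, chosen only for illustration):

```python
def prob_between(population, a, b):
    """Proportion of units with a < y <= b, i.e. P(a < Y <= b) for one
    unit drawn at random from a finite population."""
    count = sum(1 for y in population if a < y <= b)
    return count / len(population)


# Hypothetical population of monthly rents (made-up numbers).
rents = [450, 500, 500, 620, 700, 750, 800, 950, 1100, 1200]

p = prob_between(rents, 500, 800)
print(p)  # proportion of rents greater than 500 and at most 800
```

Note that the two units with rent exactly 500 are excluded and the unit with rent exactly 800 is included, matching the half-open interval \((a, b]\).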
Common distributions
It’s sometimes convenient to assume mathematical forms for population distributions.
Continuous distributions: the range of possible values is an interval of the real line (possibly the whole line)
Normal, Exponential, t, F, Uniform, Gamma
Discrete distributions: the possible values are distinct, separate values
Bernoulli, Binomial, Poisson, Multinomial, Discrete Uniform
The Normal Distribution
The classic “Bell-shaped” distribution (but not every “bell-shape” is Normal).
The standard Normal has mean 0 and variance 1.
The Normal Distribution
Probability is found as areas under the curve of the probability density function.
E.g. \(P(0 < Y \le 1)\) is the area under the density curve between 0 and 1.
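This area can be computed as a difference of cumulative probabilities, \(P(0 < Y \le 1) = P(Y \le 1) - P(Y \le 0)\). A short sketch for the standard Normal, using `NormalDist` from Python's standard `statistics` module (the choice of Python is an assumption; the slides do not name any software):

```python
from statistics import NormalDist

# Standard Normal. Note: sigma is the standard deviation, not the variance.
std_normal = NormalDist(mu=0, sigma=1)

# Area under the density between 0 and 1, as a difference of CDF values.
area = std_normal.cdf(1) - std_normal.cdf(0)
print(round(area, 4))  # 0.3413
```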
The Normal Distribution
There is really a whole family of Normal distributions identified by their mean and variance.
We write \(N(\mu, \sigma^2)\) to refer to the specific Normal with mean \(\mu\) and variance \(\sigma^2\).
Properties of Normally Distributed variables
If \(X \sim N(0, 1)\) then \(\sigma X + \mu \sim N(\mu, \sigma^2)\)
Also if \(Y \sim N(\mu, \sigma^2)\) then \(\frac{Y - \mu}{\sigma} \sim N(0, 1)\)
More generally, if \(X \sim N(\mu, \sigma^2)\) then
\[ aX + b \sim N(a\mu + b, a^2\sigma^2) \]
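These properties can be checked informally by simulation. The sketch below is only a sanity check under assumed values (\(\mu = 1\), \(\sigma = 2\), \(a = 3\), \(b = -2\), and the seed are arbitrary choices): it compares the mean and variance of simulated draws with the stated ones, and does not by itself verify that the shape is Normal.

```python
import random
from statistics import fmean, pvariance

rng = random.Random(551)   # fixed seed so the run is reproducible
mu, sigma = 1.0, 2.0       # X ~ N(1, 4); sigma is the standard deviation
a, b = 3.0, -2.0           # arbitrary constants for the transformation
n_draws = 100_000

x = [rng.gauss(mu, sigma) for _ in range(n_draws)]

# Linear transformation: aX + b should have mean a*mu + b and variance a^2 * sigma^2.
w = [a * xi + b for xi in x]
print(fmean(w), pvariance(w))  # compare with 1 and 36

# Standardizing: (X - mu) / sigma should have mean 0 and variance 1.
z = [(xi - mu) / sigma for xi in x]
print(fmean(z), pvariance(z))  # compare with 0 and 1
```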
Properties of Normally Distributed variables
If \(X \sim N(\mu_X, \sigma_X^2)\) and \(Y \sim N(\mu_Y, \sigma_Y^2)\) are independent, then
\[ Z = X + Y \sim N(\mu_X + \mu_Y, \sigma_X^2 + \sigma_Y^2) \]
Independent: knowing the value of one variable doesn’t help to guess the value of the other.
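A small worked example with made-up numbers: if \(X \sim N(10, 4)\) and \(Y \sim N(5, 9)\) are independent, then
\[ Z = X + Y \sim N(10 + 5, 4 + 9) = N(15, 13) \]
Note that the variances add, not the standard deviations: the standard deviation of \(Z\) is \(\sqrt{13} \approx 3.61\), not \(2 + 3 = 5\).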
Why is the Normal so important?
- Some things seem naturally Normally distributed (actually it’s pretty hard to tell)
- It’s easy to work with mathematically (this isn’t generally a good reason in practice)
- The Central Limit Theorem!
Back to our setting
Statistic
A statistic is a one-number summary of our sample.
Usually, we use a statistic to summarize what we know from our data at hand (our sample).
- Sample mean: average calculated using the sample, \(\overline{Y} = \frac{1}{n}\sum_{i = 1}^n Y_i\)
- Sample median: middle value of the sample
- Sample standard deviation
- pretty much anything…
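As an illustration, the statistics listed above can be computed with Python's standard `statistics` module. The sample values below are hypothetical and are not taken from the class data:

```python
from statistics import mean, median, stdev

sample = [10, 15, 5, 20, 10]  # hypothetical sample of n = 5 measurements

y_bar = mean(sample)     # sample mean: (1/n) * sum of the Y_i
y_med = median(sample)   # sample median: middle value of the sorted sample
s = stdev(sample)        # sample standard deviation (divides by n - 1)

# Mean 12, median 10, standard deviation about 5.70.
print(y_bar, y_med, round(s, 2))
```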
Example: Commute time
I want to know the average commute time of students in the class on the first day.
Population: ST551 students present on first day of class Fall 2017
Variable of interest: Commute time in minutes
Parameter: Population mean
I randomly sample 5 index cards from those you filled out on first day.
___ ___ ___ ___ ___
Your turn
How would you use the sample to estimate the population mean?
Would your estimate have the same value regardless of the sample we obtained?
Sampling distribution
We use a sample statistic to estimate a population parameter.
The value of the sample statistic depends on the sample we obtain.
The sample is random \(\implies\) the sample statistic is random
That means, the sample statistic has a probability distribution: the sampling distribution of the statistic
Example: Commute time (cont.)
6, 10, 10, 15, 15, 30, 5, 25, 20, 10, 10, 20, 12, 8, 10, 15, 10, 15, 8, 8, 10, 5, 15, 18, 20, 15, 2, 15, 15, 2, 30, 7, 7, 28, 30, 10 and 10
One sample:
Next sample:
Example: Commute time (cont.)
If we take a very large number of samples, we would get a good idea of the sampling distribution of the sample mean for samples of size 5 from this population.
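This repeated sampling can be sketched in Python. The sketch assumes the 37 commute times listed above make up the whole population, and it draws each sample without replacement (as when pulling index cards); the seed and the number of repetitions are arbitrary choices.

```python
import random
from statistics import fmean, pstdev

# The 37 commute times (minutes) listed above, treated here as the population.
population = [6, 10, 10, 15, 15, 30, 5, 25, 20, 10, 10, 20, 12, 8, 10, 15, 10,
              15, 8, 8, 10, 5, 15, 18, 20, 15, 2, 15, 15, 2, 30, 7, 7, 28, 30,
              10, 10]

rng = random.Random(2017)  # fixed seed so the run is reproducible
n = 5                      # sample size
n_samples = 10_000         # number of repeated samples (arbitrary)

# Each repetition: draw n units without replacement and record the sample mean.
sample_means = [fmean(rng.sample(population, n)) for _ in range(n_samples)]

print("population mean:     ", round(fmean(population), 2))
print("mean of sample means:", round(fmean(sample_means), 2))
print("sd of sample means:  ", round(pstdev(sample_means), 2))
```

A histogram of `sample_means` would approximate the sampling distribution of the sample mean for samples of size 5 from this population.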
Sampling distributions
Of course we don’t take many samples! So how do we know what the sampling distribution of a statistic looks like?
We’ll see that inference in this setting depends on knowing the sampling distribution, which in turn depends on the statistic being used, the sample size, and the population.
Options for finding the sampling distribution:
- Derive it mathematically
- Can’t derive the distribution?
  - Derive properties of the distribution
  - Simulate
  - Approximate