library(tidyverse)
Wilcoxon Rank Sum
Let’s use the same data from lecture:
fish <- c(-2.2, -0.8, 3.7, 4.9, 5.0, 5.2, 5.3, 6.0, 8.0, 8.0, 10.4, 14.0)
regular <- c(-6.4, -6.4, -5.9, -5.8, -5.3, -4.9, -4.4, 0.2, 2.1, 2.5, 2.5, 6.1, 8.9)
bp_experiment <- data.frame(bp_reduction = c(fish, regular),
treatment = rep(c("Fish Oil", "Regular Oil"), c(length(fish), length(regular))))
Just examining the sample data, it looks like there is some evidence the blood pressure reductions are higher in the Fish Oil group:
ggplot(bp_experiment) +
geom_histogram(aes(x = bp_reduction)) +
facet_wrap(~ treatment, ncol = 1)
To find the Wilcoxon Rank Sum test statistic, you:
- Combine the samples
- Rank the observations in the combined sample from smallest (1) to largest (\(n+m\)). If there are ties, assign the average rank to the tied observations.
- Test statistic: Sum of the ranks in the sample with the smaller sample size
The data in the current form it’s pretty easy to add ranks based on the combined sample.
bp_experiment <- bp_experiment %>%
mutate(rank = rank(bp_reduction))
The rank()
function in R returns ranks (and evens averages tied ranks). You could check that the returned ranks are reasonable by plotting them against the observed blood pressure reductions:
ggplot(bp_experiment) +
geom_point(aes(x = bp_reduction, y = rank, color = treatment, shape = treatment))
Now, to we just need to add up those ranks in the Fish Oil group (since it has the smaller sample size).
(R <- sum(bp_experiment$rank[bp_experiment$treatment == "Fish Oil"]))
Or in a more tidyverse
style:
bp_experiment %>%
filter(treatment == "Fish Oil") %>%
summarise(R = sum(rank),
n = n())
Our test statistic is 208.
Assymptopic Reference Distribution for the Wilcoxon Rank Sum
The Central Limit Theorem can be used to derive an asymptotic reference distribution.
Without loss of generality, assume the \(X\) sample is smaller, i.e. \(n_X < n_Y\). If the two population distributions are identical, then the Wilcoxon Rank Sum test statistic, R, should be asymptotically Normal: \[
R \dot \sim N\left(\frac{n_X (n_X + n_Y + 1)}{2}, \frac{n_Xn_Y(n_X + n_Y + 1)}{12}\right)
\]
First, let’s find the mean and variance parameters based on our particular example. Set the right sample sizes:
n_X <- 12
n_Y <- 13
e_R <- (n_X * (n_X + n_Y + 1)) / 2
var_R <- (n_X * n_Y * (n_X + n_Y + 1)) / 12
The we can construct a statistic that should have a N(0, 1) distribution by subtracting the mean, and dividing by the standard deviation:
(Z <- (R - e_R)/sqrt(var_R))
Which we can compare to the critical value:
abs(Z) > qnorm(0.975)
rejecting the null hypothesis, or find the two-sided p-value.
2*(1 - pnorm(abs(Z)))
How do we interpret this value? Would you be comfortable with the location-shift or additive effect assumption? With this size sample, it’s hard to get say anything about the shapes of the populations. Unless we had some good prior or expert knowledge, I’d be uncomfortable making the location-shift assumption. As such, we can conclude we reject the null hypothesis of the form \(P(Y > X) = 0.5\), or there is convincing evidence a single random blood pressure reduction from the fish oil group is larger than a single random blood pressure reduction from the regular oil group with some probability not equal to 0.5. (In this case it appears the true probability is something greater than 0.5)
The wilcox.test()
function in R calculates a different but equivalent test statistic.
wilcox.test(bp_reduction ~ treatment, data = bp_experiment, correct = FALSE, exact = FALSE)
The reported test statistic, \(W\), subtracts off the minimum possible test statistic from \(R\), i.e. \(W = R - n_X(n_X + 1)/2\). The p-value however should be close to that from the asymptotic calculation above.
Exploring performance of Wilcoxon Test
We can use p-value reported from wilcox.test()
to explore the performance of the test in different scenarios. Consider a situation where the distribution of \(Y\) is Normal(log(2), 1) and the distribution of \(X\) is Exponential(1). These two populations happen to have the same median, log(2). If the Wilcoxon Rank Sum was an asymptotically exact test of medians we would expect a rejection rate of around 0.05 (for a \(\alpha = 0.05\) test), let’s see if it does:
n_sim <- 1000
n_x <- 100
n_y <- 100
# Generate sample data of (Y, G) form.
samples <- rerun(n_sim,
y = c(rnorm(n = n_y, mean = log(2)),
rexp(n = n_x, rate = 1)),
g = rep(c(0, 1), c(n_y, n_x))
)
# p-values for Wilcoxon Rank SUm
wilcox_p <- map_dbl(samples,
~ wilcox.test(y ~ g, data = .x, correct = FALSE, exact = FALSE)$p.value)
mean(wilcox_p < 0.05)
Looks a bit high to me!
Has the asymptotic approximation for the reference distribution failed to kick in? Try increasing the sample sizes for the two samples.
So this is one example that illustrates why without a further assumption the Wilcoxon Rank Sum isn’t a test of medians.
Consider the populations \(Y \sim \text{Normal}(0, 1)\) and \(X \sim \text{exp(1)}\). Repeat the simulation and use the result to argue the Wilcoxon Rank Sum isn’t an exact test of the null \(\mu_Y = \mu_X\).
Consider the populations \(Y \sim \text{Normal}(0, 1)\) and \(X \sim \text{Normal(0, 10)}\). Repeat the simulation and use the result to argue the Wilcoxon Rank Sum isn’t a consistent test of the null \(F_Y = F_X\).