library(tidyverse)
Contingency tables in R
To illustrate some of the calculations from class we’ll use the same data (you’ll need to download a fresh copy to get the ate_breakfast column):
download.file("http://st551.cwick.co.nz/data/class_data.csv",
  "class_data.csv")
class_data <- read_csv("class_data.csv")
Recall we investigated the relationship between eating breakfast and dog/cat preference. A simple graphical display might use stacked bars to compare the counts of those who ate breakfast within our two groups (cat people and dog people):
ggplot(class_data) +
  geom_bar(aes(x = cat_dog, fill = ate_breakfast))
**Why are there three values for cat_dog?**
It’s a little easier to compare proportions of those who ate breakfast by forcing the bars to sum to 1:
ggplot(class_data) +
  geom_bar(aes(x = cat_dog, fill = ate_breakfast), position = "fill")
This makes it easier to see that, based on our sample, dog people seem a little more likely to eat breakfast. The same information in tabular form comes from cross-tabulating the two variables. This can be done with the table() function (just like in the one-sample case):
table(class_data$cat_dog, class_data$ate_breakfast)
or using the xtabs() function:
xtabs( ~ cat_dog + ate_breakfast, data = class_data)
Either way you’ll notice the missing values get silently excluded from the tabulation.
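If you would rather see the missing values than have them dropped, both functions have an argument for that. The sketch below uses a small made-up data frame (toy is hypothetical, not the class data) and assumes R 3.4.0 or later for the addNA argument of xtabs():

```r
# Hypothetical toy data (NOT the class data): two respondents skipped a question
toy <- data.frame(
  cat_dog       = c("cats", "dogs", "dogs", NA, "cats"),
  ate_breakfast = c("yes", "no", "yes", "yes", NA)
)

# Default behaviour: any row with a missing value is left out of the counts
tab_default <- table(toy$cat_dog, toy$ate_breakfast)

# Keep missing values as their own category
tab_na_table <- table(toy$cat_dog, toy$ate_breakfast, useNA = "ifany")
tab_na_xtabs <- xtabs(~ cat_dog + ate_breakfast, data = toy, addNA = TRUE)

tab_default
tab_na_table
tab_na_xtabs

# Compare how many respondents each table accounts for
c(default = sum(tab_default), with_na = sum(tab_na_table))
```

Comparing the table totals is a quick way to check how many observations were dropped.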
To add row and column sums, use the addmargins() function:
tab <- xtabs( ~ cat_dog + ate_breakfast, data = class_data)
addmargins(tab)
Saving the result may be useful if you plan on calculating things by hand, since you can access the elements as you would those of a matrix. For example, to find the proportion of cat people who didn’t eat breakfast:
tab_margins <- addmargins(tab)
tab_margins["cats", "no"]/tab_margins["cats", "Sum"]
You can also easily move to proportions with the prop.table() function, which allows you to divide the row entries by the row sums:
prop.table(tab, margin = 1)
Column entries by column sums:
prop.table(tab, margin = 2)
Or cell entries by the table sum:
prop.table(tab)
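One way to keep the three prop.table() calls above straight is to check what each result sums to. A minimal sketch on made-up counts (toy_tab, group and outcome are hypothetical names, and the numbers are not from the class data):

```r
# Hypothetical 2 x 2 table of counts; matrix() fills column by column,
# so group A has 10 "no" and 20 "yes", group B has 30 "no" and 40 "yes"
toy_tab <- as.table(matrix(
  c(10, 30, 20, 40), nrow = 2,
  dimnames = list(group = c("A", "B"), outcome = c("no", "yes"))
))

row_props  <- prop.table(toy_tab, margin = 1)  # conditional on the row
col_props  <- prop.table(toy_tab, margin = 2)  # conditional on the column
cell_props <- prop.table(toy_tab)              # joint proportions

rowSums(row_props)   # each row sums to 1
colSums(col_props)   # each column sums to 1
sum(cell_props)      # the whole table sums to 1
```

Whatever sums to 1 tells you what is being conditioned on: a row, a column, or nothing at all.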
Which table has the estimate of P(Ate breakfast | Prefer cats)? Which table has the estimate of P(Prefer cats | Ate breakfast)? What about P(Prefer cats & Ate breakfast)?
The order of the rows and columns is generally determined alphabetically, e.g. c for cats comes before d for dogs. But you can control this by turning the variable into a factor (R’s object for categorical variables), and explicitly setting the levels argument:
class_data <- class_data %>%
  mutate(cat_dog_f = factor(cat_dog, levels = c("dogs", "cats")))
xtabs( ~ cat_dog_f + ate_breakfast, data = class_data)
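The same idea in base R, on a small made-up vector (pref and the value "both" are hypothetical, not taken from the class data). It also shows one thing to watch for: any value you leave out of levels is turned into NA, so it will disappear from a default tabulation.

```r
# Hypothetical preferences (NOT the class data)
pref <- c("cats", "dogs", "dogs", "cats", "dogs")

# Default: levels are sorted alphabetically
default_levels <- levels(factor(pref))

# Explicit levels control the order used by table() and xtabs()
pref_f <- factor(pref, levels = c("dogs", "cats"))
reordered <- table(pref_f)

default_levels
reordered

# Caution: a value not listed in levels becomes NA
dropped <- factor(c("cats", "dogs", "both"), levels = c("dogs", "cats"))
dropped
```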
All three test functions, prop.test(), chisq.test() and fisher.test(), will take the tabulated data as their only required argument:
prop.test(tab, correct = FALSE)
chisq.test(tab, correct = FALSE)
fisher.test(tab)
All three interpret the left column as the “successes”, and test whether the probability of success is the same between rows.
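To see the “left column is success” convention in action, here is a sketch on made-up counts (toy_tab is hypothetical, not the class data). It also illustrates that, for a 2 x 2 table with correct = FALSE, prop.test() and chisq.test() compute the same Pearson chi-squared statistic and so report the same p-value:

```r
# Hypothetical 2 x 2 table: group A has 10 "no" and 20 "yes",
# group B has 30 "no" and 40 "yes"
toy_tab <- as.table(matrix(
  c(10, 30, 20, 40), nrow = 2,
  dimnames = list(group = c("A", "B"), outcome = c("no", "yes"))
))

pt <- prop.test(toy_tab, correct = FALSE)
ct <- chisq.test(toy_tab, correct = FALSE)
ft <- fisher.test(toy_tab)

# Estimated "success" proportions use the FIRST column: 10/30 and 30/70
pt$estimate

# prop.test() and chisq.test() agree; fisher.test() is an exact test,
# so its p-value will generally differ a little
c(prop_test = pt$p.value, chisq_test = ct$p.value, fisher_test = ft$p.value)
```

Here the estimates are proportions of "no", because "no" is the left column, which is why the column order matters when replicating a calculation by hand.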
Reorder the levels for ate_breakfast to have yes in the first column, re-tabulate, then re-run the tests to replicate the calculations from class.
If you are interested in ways to visualize more than one categorical variable, you might check out Chapter 7 in Graphical Data Analysis with R.
filter() and mutate()
In Canvas, visit the Lab 8 @ DataCamp Assignment to complete Chapter 1 of Introduction to the Tidyverse on DataCamp for an overview of the dplyr functions filter() and mutate().