`library(tidyverse)`

## Contingency tables in R

To illustrate some of the calculations from class we’ll use the same data (you’ll need to download a fresh copy to get the `ate_breakfast` column):

```
download.file("http://st551.cwick.co.nz/data/class_data.csv",
  "class_data.csv")
```

`class_data <- read_csv("class_data.csv")`

Recall we investigated the relationship between eating breakfast and dog/cat preference. A simple graphical display might use stacked bars to compare the counts of those who ate breakfast within our two groups (cat people and dog people):

```
ggplot(class_data) +
  geom_bar(aes(x = cat_dog, fill = ate_breakfast))
```

**Why are there three values for `cat_dog`?**

It’s a little easier to compare proportions of those who ate breakfast by forcing the bars to sum to 1:

```
ggplot(class_data) +
  geom_bar(aes(x = cat_dog, fill = ate_breakfast), position = "fill")
```

This makes it easier to see that, based on our sample, dog people seem a little more likely to eat breakfast. The same information in tabular form comes from cross-tabulating the two variables. This can be done with the `table()` function (just like in the one-sample case):

`table(class_data$cat_dog, class_data$ate_breakfast)`

or using the `xtabs()` function:

`xtabs( ~ cat_dog + ate_breakfast, data = class_data)`

Either way you’ll notice the missing values get silently excluded from the tabulation.
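
If you would rather see the missing values in the table than have them dropped, both functions have an option for that. A minimal sketch, using a small made-up data frame (`toy` and its values are hypothetical, not the class data):

```r
# Made-up data for illustration only (not the class data)
toy <- data.frame(
  cat_dog       = c("cats", "dogs", "dogs", NA, "cats", "dogs"),
  ate_breakfast = c("yes", "no", "yes", "yes", NA, "yes")
)

# Default behaviour: rows with a missing value in either variable are dropped
tab_default <- table(toy$cat_dog, toy$ate_breakfast)
sum(tab_default)

# table(): useNA = "ifany" adds an NA row/column when missing values exist
tab_na <- table(toy$cat_dog, toy$ate_breakfast, useNA = "ifany")
tab_na

# xtabs(): addNA = TRUE plays the same role (needs R >= 3.4.0)
xtab_na <- xtabs(~ cat_dog + ate_breakfast, data = toy, addNA = TRUE)
xtab_na
```

Comparing `sum(tab_default)` with `nrow(toy)` is a quick way to check how many observations were excluded.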

To add row and column sums use the `addmargins()` function:

```
tab <- xtabs( ~ cat_dog + ate_breakfast, data = class_data)
addmargins(tab)
```

Saving the result may be useful if you plan on calculating things by hand, since you can access the elements as you would in a matrix. For example, to find the proportion of cat people who didn’t eat breakfast:

```
tab_margins <- addmargins(tab)
tab_margins["cats", "no"]/tab_margins["cats", "Sum"]
```

You can also easily move to proportions with the `prop.table()` function, which allows you to divide the row entries by the row sums:

`prop.table(tab, margin = 1)`

Column entries by column sums:

`prop.table(tab, margin = 2)`

Or cell entries by the table sum:

`prop.table(tab)`
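
To see what `margin` is doing, it can help to reproduce the three tables by hand. A sketch using a made-up 2×2 table (the counts and the name `toy_tab` are hypothetical, not the class data):

```r
# Made-up counts for illustration only (not the class data)
toy_tab <- as.table(matrix(c(3, 6, 5, 10), nrow = 2,
                           dimnames = list(cat_dog = c("cats", "dogs"),
                                           ate_breakfast = c("no", "yes"))))

# margin = 1: each entry divided by its row sum, so every row sums to 1
by_row <- prop.table(toy_tab, margin = 1)
all.equal(by_row, toy_tab / rowSums(toy_tab), check.attributes = FALSE)

# margin = 2: each entry divided by its column sum, so every column sums to 1
by_col <- prop.table(toy_tab, margin = 2)
all.equal(by_col, sweep(toy_tab, 2, colSums(toy_tab), "/"),
          check.attributes = FALSE)

# no margin: each entry divided by the grand total, so the table sums to 1
overall <- prop.table(toy_tab)
all.equal(overall, toy_tab / sum(toy_tab), check.attributes = FALSE)
```

Checking `rowSums(by_row)`, `colSums(by_col)` and `sum(overall)` is a handy way to confirm which denominator a given table of proportions is using.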

**Which table has the estimate of P(Ate breakfast | Prefer cats)? Which table has the estimate of P(Prefer cats | Ate breakfast)? What about P(Prefer cats & Ate breakfast)?**

The order of the rows and columns is generally determined alphabetically, e.g. c for **c**ats comes before d for **d**ogs. But you can control this by turning the variable into a factor (R’s object for categorical variables), and explicitly setting the `levels` argument:

```
class_data <- class_data %>%
  mutate(cat_dog_f = factor(cat_dog, levels = c("dogs", "cats")))
xtabs( ~ cat_dog_f + ate_breakfast, data = class_data)
```
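
One thing to watch for when setting `levels` by hand: any value that is not listed becomes `NA`, and so drops out of the table by default. A small sketch with a made-up vector (the values, including `"both"`, are hypothetical):

```r
# Made-up responses for illustration only (not the class data)
pref <- c("cats", "dogs", "dogs", "both", "cats", "dogs")

# Default: levels are the sorted unique values
levels(factor(pref))

# Explicit levels control the order used in tables and plots
pref_f <- factor(pref, levels = c("dogs", "cats"))
levels(pref_f)

# "both" was not listed in levels, so it has been turned into NA
table(pref_f, useNA = "ifany")
```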

All three test functions, `prop.test()`, `chisq.test()` and `fisher.test()`, will take the tabulated data as the only argument:

```
prop.test(tab, correct = FALSE)
chisq.test(tab, correct = FALSE)
fisher.test(tab)
```

All three interpret the left column as the “successes”, and test whether the probability of success is the same between rows.
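
As a quick check of the “left column is success” convention, here is a sketch on a made-up 2×2 table (the counts and object names are hypothetical, not the class data). The estimates reported by `prop.test()` are the first column divided by the row totals, and with `correct = FALSE` its test statistic matches the one from `chisq.test()`:

```r
# Made-up counts for illustration only (not the class data)
toy_tab <- as.table(matrix(c(12, 9, 8, 21), nrow = 2,
                           dimnames = list(cat_dog = c("cats", "dogs"),
                                           ate_breakfast = c("no", "yes"))))

pt <- prop.test(toy_tab, correct = FALSE)
ct <- chisq.test(toy_tab, correct = FALSE)

# Estimated "success" probabilities: first column over row totals
pt$estimate
toy_tab[, "no"] / rowSums(toy_tab)

# Without the continuity correction the two tests agree
all.equal(unname(pt$statistic), unname(ct$statistic))
all.equal(pt$p.value, ct$p.value)
```

Here “success” means `no`, because `no` is the first column. That is why the column order matters when you want the estimates to refer to those who did eat breakfast.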

**Reorder the levels for `ate_breakfast` to have `yes` in the first column, re-tabulate, then re-run the tests to replicate the calculations from class.**

If you are interested in ways to visualize more than one categorical variable, you might check out Chapter 7 in *Graphical Data Analysis in R*.

## `filter()` and `mutate()`

In Canvas, visit the Lab 8 @ DataCamp Assignment to complete Chapter 1 of *Introduction to the Tidyverse* on DataCamp for an overview of the dplyr functions `filter()` and `mutate()`.