---
title: Lab 8
author: Charlotte Wickham
date: '2017-11-14'
slug: lab-8
draft: false
output:
  blogdown::html_page:
    toc: true
---
```{r setup, include = FALSE}
knitr::opts_chunk$set(echo = TRUE, results = "hide",
  message = FALSE, fig.keep = "none")
set.seed(1918)
library(pander)
```
```{r}
library(tidyverse)
```
## Contingency tables in R
To illustrate some of the calculations from class we'll use the same data (you'll need to download a fresh copy to get the `ate_breakfast` column):
```{r, eval = FALSE}
download.file("http://st551.cwick.co.nz/data/class_data.csv",
  "class_data.csv")
```
```{r}
class_data <- read_csv("class_data.csv")
```
Recall we investigated the relationship between eating breakfast and dog/cat preference. A simple graphical display might use stacked bars to compare the counts of those who ate breakfast within our two groups (cat people and dog people):
```{r}
ggplot(class_data) +
  geom_bar(aes(x = cat_dog, fill = ate_breakfast))
```
**Why are there three values for `cat_dog`?**
It's a little easier to compare proportions of those who ate breakfast by forcing the bars to sum to 1:
```{r}
ggplot(class_data) +
  geom_bar(aes(x = cat_dog, fill = ate_breakfast), position = "fill")
```
This makes it easier to see that, in our sample, dog people seem a little more likely to eat breakfast. The same information in tabular form comes from a cross tabulation of the two variables, which can be done with the `table()` function (just like in the one-sample case):
```{r}
table(class_data$cat_dog, class_data$ate_breakfast)
```
or using the `xtabs()` function:
```{r}
xtabs( ~ cat_dog + ate_breakfast, data = class_data)
```
Either way you'll notice the missing values get silently excluded from the tabulation.
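If you would rather see the missing values counted, both functions can be asked to keep them. The sketch below uses a small made-up data frame (`toy`, a hypothetical stand-in for the class data) so that it runs on its own. It relies on the `useNA` argument to `table()` and the `addNA` argument to `xtabs()` (the latter needs R 3.4.0 or later):

```{r}
# Hypothetical data standing in for class_data; the values are made up
toy <- data.frame(
  cat_dog = c("cats", "dogs", NA, "dogs", "cats", "dogs"),
  ate_breakfast = c("yes", "no", "yes", NA, "no", "yes")
)

# Default behaviour: rows with a missing value in either variable are dropped
tab_default <- table(toy$cat_dog, toy$ate_breakfast)
sum(tab_default)

# Keep missing values as their own category in table()
tab_na <- table(toy$cat_dog, toy$ate_breakfast, useNA = "ifany")
sum(tab_na)

# The equivalent request for xtabs()
xtabs_na <- xtabs(~ cat_dog + ate_breakfast, data = toy, addNA = TRUE)
sum(xtabs_na)
```

Comparing the total of the default table to the number of rows in the data is a quick way to check how many observations were dropped.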
To add row and column sums, use the `addmargins()` function:
```{r}
tab <- xtabs( ~ cat_dog + ate_breakfast, data = class_data)
addmargins(tab)
```
Saving the result may be useful if you plan on calculating things by hand, since you can access the elements as you would those of a matrix. For example, to find the proportion of cat people who didn't eat breakfast:
```{r}
tab_margins <- addmargins(tab)
tab_margins["cats", "no"]/tab_margins["cats", "Sum"]
```
You can also easily move to proportions with the `prop.table()` function, which allows you to divide the row entries by the row sums:
```{r}
prop.table(tab, margin = 1)
```
Column entries by column sums:
```{r}
prop.table(tab, margin = 2)
```
Or cell entries by table sum:
```{r}
prop.table(tab)
```
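To see what the `margin` argument is doing, it can help to reproduce `prop.table()` by hand. This is a minimal sketch using made-up counts in a hypothetical table called `toy_counts` (not the class data):

```{r}
# Hypothetical 2 x 2 table of counts; the numbers are made up
toy_counts <- as.table(matrix(
  c(10, 30,
    15, 25),
  nrow = 2, byrow = TRUE,
  dimnames = list(cat_dog = c("cats", "dogs"),
                  ate_breakfast = c("no", "yes"))
))

# margin = 1: divide each row by its row sum, so every row sums to 1
by_row <- prop.table(toy_counts, margin = 1)
all.equal(by_row, toy_counts / rowSums(toy_counts))
rowSums(by_row)

# margin = 2: divide each column by its column sum, so every column sums to 1
by_col <- prop.table(toy_counts, margin = 2)
colSums(by_col)

# No margin: divide every cell by the grand total, so the whole table sums to 1
overall <- prop.table(toy_counts)
all.equal(overall, toy_counts / sum(toy_counts))
```

Checking which direction sums to 1 is a handy way to tell the three tables apart when you come back to saved output later.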
**Which table has the estimate of P(Ate breakfast | Prefer cats)? Which table has the estimate of P(Prefer cats | Ate breakfast)? What about P(Prefer cats & Ate breakfast)?**
The order of the rows and columns is generally determined alphabetically, e.g. c for **c**ats comes before d for **d**ogs. But you can control this by turning the variable into a factor (R's object for categorical variables), and explicitly setting the `levels` argument:
```{r}
class_data <- class_data %>%
  mutate(cat_dog_f = factor(cat_dog, levels = c("dogs", "cats")))
xtabs( ~ cat_dog_f + ate_breakfast, data = class_data)
```
All three test functions, `prop.test()`, `chisq.test()` and `fisher.test()`, will take the tabulated data as their only required argument:
```{r}
prop.test(tab, correct = FALSE)
chisq.test(tab, correct = FALSE)
fisher.test(tab)
```
All three interpret the left column as the "successes", and test whether the probability of success is the same between rows.
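To make the "left column is success" convention concrete, here is a small sketch with made-up counts (the table `toy_tab` is hypothetical, not the class data). It checks that the proportions reported by `prop.test()` are the left-column counts divided by the row totals, and that, without the continuity correction, `prop.test()` and `chisq.test()` give the same p-value for a 2 x 2 table:

```{r}
# Hypothetical 2 x 2 table of counts; the numbers are made up
toy_tab <- as.table(matrix(
  c(10, 30,
    15, 25),
  nrow = 2, byrow = TRUE,
  dimnames = list(cat_dog = c("cats", "dogs"),
                  ate_breakfast = c("no", "yes"))
))

# prop.test() treats the left column ("no" here) as the successes
pt <- prop.test(toy_tab, correct = FALSE)
pt$estimate
toy_tab[, "no"] / rowSums(toy_tab)

# Without the continuity correction the two tests agree for a 2 x 2 table
ct <- chisq.test(toy_tab, correct = FALSE)
all.equal(pt$p.value, ct$p.value)
```

In this toy table the estimated proportions are proportions of people who did *not* eat breakfast, which is why the column order matters when you interpret the output.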
**Reorder the levels for `ate_breakfast` to have `yes` in the first column, re-tabulate, then re-run tests to replicate calculations from class.**
If you are interested in ways to visualize more than one categorical variable, you might check out Chapter 7 in [Graphical Data Analysis in R](https://ebookcentral.proquest.com/lib/osu/detail.action?docID=4648053).
## `filter()` and `mutate()`
In Canvas, visit the [Lab 8 @ DataCamp Assignment](https://oregonstate.instructure.com/courses/1653112/assignments/7121047) to complete Chapter 1 of *Introduction to the Tidyverse* on DataCamp for an overview of the dplyr functions `filter()` and `mutate()`.