Data modelling and hypothesis tests

# Data modelling and hypothesis tests
## Day 2
### Jason Lerch
### 2018/09/11

---

<style>
.medium {
  font-size: 80%;
}

.footnote {
  font-size: 75%;
  color: gray;
}

.smallcode {

}
.smallcode .remark-code {
  font-size: 50%
}

.smallercode {

}
.smallercode .remark-code {
  font-size: 75%
}

</style>

# Hello World

Goals for today:

1. From populations to samples

1. Testing proportions

1. Introduction to the p value

1. The p value understood through permutations

1. Testing associations between two continuous variables

1. Testing associations between one factor and one continuous variable

1. The linear model

1. From factors to numbers (understanding contrasts)

1. Linear mixed effects models

1. The fundamental principles of analytical design

---

# From populations to samples

![](images/sample.png)

???

Talk about sources of bias

---

# Data types

![](images/data-types.png)

Data types determine choice of statistics and/or encoding.

---

# Reload the data

```r
library(tidyverse)
```

```
## ── Attaching packages ──────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
```

```
## ✔ ggplot2 3.0.0     ✔ purrr   0.2.5
## ✔ tibble  1.4.2     ✔ dplyr   0.7.6
## ✔ tidyr   0.8.1     ✔ stringr 1.3.1
## ✔ readr   1.1.1     ✔ forcats 0.3.0
```

```
## ── Conflicts ─────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
```

```r
mice <- read_csv("mice.csv") %>%
  inner_join(read_csv("volumes.csv"))
```

```
## Parsed with column specification:
## cols(
##   Age = col_double(),
##   Sex = col_character(),
##   Condition = col_character(),
##   Mouse.Genotyping = col_character(),
##   ID = col_integer(),
##   Timepoint = col_character(),
##   Genotype = col_character(),
##   DaysOfEE = col_integer(),
##   DaysOfEE0 = col_integer()
## )
```

```
## Parsed with column specification:
## cols(
##   .default = col_double(),
##   ID = col_integer(),
##   Timepoint = col_character()
## )
```

```
## See spec(...) for full column specifications.
```

```
## Joining, by = c("ID", "Timepoint")
```

---

# Sex ratios

Are the sex ratios in our data balanced?

```r
baseline <- mice %>% filter(Timepoint == "Pre1")
addmargins(with(baseline, table(Sex)))
```

```
## Sex
##   F   M Sum 
## 101 165 266
```

What should we expect?

Assume an equal probability of a mouse being male or female:

```r
nrow(baseline) / 2
```

```
## [1] 133
```

---

# How likely was our real value?

Binomial distribution - flip of a coin.

```r
rbinom(1, 1, 0.5)
```

```
## [1] 1
```

```r
rbinom(1, 1, 0.5)
```

```
## [1] 1
```

```r
rbinom(1, 1, 0.5)
```

```
## [1] 1
```

```r
rbinom(10, 1, 0.5)
```

```
##  [1] 0 0 1 1 0 1 1 0 1 0
```

---

# How likely was our real value?

```r
baseline <- mice %>% filter(Timepoint == "Pre1")
addmargins(with(baseline, table(Sex)))
```

```
## Sex
##   F   M Sum 
## 101 165 266
```

Assuming random choice of male or female:

```r
distribution <- rbinom(266, 1, 0.5)
sum(distribution==1)
```

```
## [1] 150
```

```r
rbinom(1, 266, 0.5)
```

```
## [1] 123
```

???

Get everyone in class to run it and get some answers

---

# Long run probability

We did a single experiment, and obtained 101 females and 165 males.

If we reran the experiment again and again, with each experimental mouse having a 50/50 chance of being male or female, how often would we obtain 101 or fewer females?

```r
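nexperiments <- 1000                  # number of simulated experiments
females <- numeric(nexperiments)      # one female count per experiment
for (i in seq_len(nexperiments)) {
  # number of females among 266 mice, each female with probability 0.5
  females[i] <- rbinom(1, 266, 0.5)
}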
head(females)
```

```
## [1] 121 137 138 124 129 125
```

This can be shortened to:

```r
females2 <- rbinom(nexperiments, 266, 0.5)
head(females2)
```

```
## [1] 129 137 124 103 120 128
```

---

# Long run probability

```r
head(females)
```

```
## [1] 121 137 138 124 129 125
```

```r
ggplot(data.frame(females=females)) +
  aes(x=females) + 
  geom_histogram(binwidth = 3) + 
  theme_minimal(16)
```

![](modelling_files/figure-html/unnamed-chunk-10-1.png)

---

# Long run probability

```r
ggplot(data.frame(females=females)) +
  aes(x=females) + 
  geom_histogram(binwidth = 3) + 
  theme_minimal(16)
```

![](modelling_files/figure-html/unnamed-chunk-11-1.png)

```r
sum(females<=101)
```

```
## [1] 0
```

---

# Closed form solution

```r
ggplot() +
  geom_histogram(data=data.frame(females=females),
                 aes(x=females, y=..density..),
                 binwidth = 3) + 
  geom_bar(aes(c(100:160)), stat="function", 
           fun=function(x) dbinom(round(x), 266, 0.5),
           alpha=0.5, fill="blue") +
  theme_minimal(16)
```

![](modelling_files/figure-html/unnamed-chunk-12-1.png)

---

# Closed form solution

```r
pbinom(101, 266, 0.5)
```

```
## [1] 5.223361e-05
```

```r
sum(dbinom(0:101, 266, 0.5))
```

```
## [1] 5.223361e-05
```


---

# Review

* We asked whether the sex ratio in the study was likely to be random, assuming an equal chance of an experimental mouse being male or female.

* We simulated 1000 studies, each with n=266 mice and a 50% probability of any mouse being female.

* This is the null hypothesis.

* Our random data simulations test the null hypothesis: what would happen if we ran the experiment again and again and again under the same conditions assuming random assignment of males and females?

* Our p-value, the long-run probability of obtaining 101 or fewer females under repeated experiments, was vanishingly small.

So the choice of sex was almost certainly non-random. Does it matter?

---

# Contingency table

```r
baseline <- mice %>% filter(Timepoint == "Pre1")
with(baseline, table(Sex, Genotype))
```

```
##    Genotype
## Sex CREB -/- CREB +/- CREB +/+
##   F       29       31       41
##   M       53       59       53
```

```r
addmargins(with(baseline, table(Sex, Genotype)))
```

```
##      Genotype
## Sex   CREB -/- CREB +/- CREB +/+ Sum
##   F         29       31       41 101
##   M         53       59       53 165
##   Sum       82       90       94 266
```

---
# What would we expect?

The table of observed numbers

```r
addmargins(with(baseline, table(Sex, Genotype))) %>% 
  knitr::kable(format = 'html')
```

Calculating the expected numbers

|     | CREB -/-   | CREB +/-   | CREB +/+   | Sum |
|-----|------------|------------|------------|-----|
| F   | 82*101/266 | 90*101/266 | 94*101/266 | 101 |
| M   | 82*165/266 | 90*165/266 | 94*165/266 | 165 |
| Sum | 82         | 90         | 94         | 266 |
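
The same arithmetic can be sketched in R: each expected count is the row total times the column total, divided by the grand total. The observed counts are typed in by hand from the contingency table, so the `observed` and `expected` names are introduced here for illustration only.

```r
# Observed counts, copied from the Sex by Genotype table
observed <- matrix(c(29, 31, 41,
                     53, 59, 53),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(Sex = c("F", "M"),
                                   Genotype = c("CREB -/-", "CREB +/-", "CREB +/+")))

# Expected count for each cell: row total * column total / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
round(expected, 2)
```

The expected counts keep the same row and column totals as the observed table; only the cells are redistributed.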

---

# Using the chisq.test function for these calculations

```r
xtest <- with(baseline, chisq.test(Sex, Genotype))
addmargins(xtest$observed)
```

```
##      Genotype
## Sex   CREB -/- CREB +/- CREB +/+ Sum
##   F         29       31       41 101
##   M         53       59       53 165
##   Sum       82       90       94 266
```

```r
addmargins(xtest$expected)
```

```
##      Genotype
## Sex   CREB -/- CREB +/- CREB +/+ Sum
##   F   31.13534 34.17293 35.69173 101
##   M   50.86466 55.82707 58.30827 165
##   Sum 82.00000 90.00000 94.00000 266
```

---

# `$\chi^2$` test

`$$\chi^2 = \sum^k_{i=1} \sum^l_{j=1} \frac{(n_{ij} - \tilde{n}_{ij})^2}{\tilde{n}_{ij}} = \sum^k_{i=1} \sum^l_{j=1} \frac{(n_{ij} - \frac{n_{i+} n_{+j}}{n})^2}{\frac{n_{i+} n_{+j}}{n}}$$`

![](images/chisq.png)

```r
sum( ((xtest$observed - xtest$expected)^2)/xtest$expected )
```

---

# `$\chi^2$` test

```r
sum( ((xtest$observed - xtest$expected)^2)/xtest$expected )
```

```
## [1] 1.983758
```

How do we put that number into context?

![](modelling_files/figure-html/unnamed-chunk-21-1.png)
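
One way to put the statistic in context is to ask how much of the `$\chi^2$` distribution lies at or beyond it. A minimal sketch, assuming the usual degrees of freedom for a contingency table, (rows - 1) * (columns - 1); the variable names are introduced here for illustration.

```r
chisq_stat <- 1.983758       # the statistic computed above
df <- (2 - 1) * (3 - 1)      # 2 sexes, 3 genotypes

# Upper-tail probability: chance of a statistic this large or larger
# if sex and genotype were independent
p_value <- pchisq(chisq_stat, df = df, lower.tail = FALSE)
p_value
```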

---

# `$\chi^2$` test

```r
with(baseline, chisq.test(Sex, Genotype))
```

```
## 
## 	Pearson's Chi-squared test
## 
## data:  Sex and Genotype
## X-squared = 1.9838, df = 2, p-value = 0.3709
```

---