Bayesianische Statistik

# Bayesianische Statistik
## Teil 5a <br/> Modellvergleiche & Hypothesentests
### Andrew Ellis
### Methodenkurs Neurowissenschaft im Computerlab
### 2021-04-27

---

layout: true
  

<div class="my-footer">
<span>
<a href="https://kogpsy.github.io/neuroscicomplab" target="_blank"><svg aria-hidden="true" role="img" viewBox="0 0 576 512" style="height:1em;width:1.12em;vertical-align:-0.125em;margin-left:auto;margin-right:auto;font-size:inherit;fill:#0F4C81;overflow:visible;position:relative;"><path d="M280.37 148.26L96 300.11V464a16 16 0 0 0 16 16l112.06-.29a16 16 0 0 0 15.92-16V368a16 16 0 0 1 16-16h64a16 16 0 0 1 16 16v95.64a16 16 0 0 0 16 16.05L464 480a16 16 0 0 0 16-16V300L295.67 148.26a12.19 12.19 0 0 0-15.3 0zM571.6 251.47L488 182.56V44.05a12 12 0 0 0-12-12h-56a12 12 0 0 0-12 12v72.61L318.47 43a48 48 0 0 0-61 0L4.34 251.47a12 12 0 0 0-1.6 16.9l25.5 31A12 12 0 0 0 45.15 301l235.22-193.74a12.19 12.19 0 0 1 15.3 0L530.9 301a12 12 0 0 0 16.9-1.6l25.5-31a12 12 0 0 0-1.7-16.93z"/></svg></a> Methodenkurs Neurowissenschaft im Computerlab
</span>
</div>

---

## Einführung

- Parameter geschätzt
    + Binomial (Bernoulli) Modell
    + Lineare Modelle mit kategorialem Prädiktor
- Posterior Verteilung(en) zusammengefasst
    + Credible Intervals
    + Highest Posterior Density Intervals

]

- z.B. Modell 1 (M1) erklärt die Daten besser als Modell 2 (M2).
- M1 ist wahrscheinlicher als M2

Solche Aussagen sind **Modellvergleiche**.
]

---

- Diskutieren Sie anhand dieses Beispiels, wie Sie die Hypothese testen können, dass die Intervention einen Effekt hat.
]
]

Wir haben hier einen within (messwiederholten) Faktor (`time`), und einen between Faktor (`intervention`).

]

---

## Mögliche Methoden

- Hypothesentest, z.B.

```r
t.test(diff ~ intervention,
       data = dwide)
```
]
.pull-right[

```r
m1 <- lmer(score ~ intervention + time + (1|id), 
                  data = d)
m2 <- lmer(score ~ intervention * time + (1|id), 
                  data = d)
```

]

- Modellvergleiche

- Varianzaufklärung

- Informationskriterien

- Kreuzvalidierung

Wir lernen nun Methoden kennen, um mit Bayesianischen Modellen Modellvergleiche durchzuführen.

---

---

## Bayesianische Modellvergleiche

Wir unterscheiden 3 verschiedene Methoden

1) Bayes Factors

2) Out of sample predictive accuracy:

> Approximate leave-one-out cross validation (LOO)

> Informationskriterien: widely application information criterion (WAIC)
    
3) Posterior predictive checking:
  > Konstruktion einer Teststatistik und Vergleich der aktuellen Daten mit generierten Daten aus der Posterior Predictive Distribution.

Heute schauen wir uns die erste Methode (Bayes Factors) an. Diese sind unter Statistikern sehr umstritten. Psychologen wollen Hypothesen testen (Evidenz für/gegen Hypothesen), während in anderen Bereichen dies weniger verbreitet ist.

---

]

]

]
]

---

## Bayes factor

Bayes' Theorem, mit expliziter Abhängigkeit der Parameter `$\mathbf{\theta}$`  vom Modells `$\mathcal{M}$`:

$$ p(\theta | y, \mathcal{M}) = \frac{p(y|\theta, \mathcal{M}) p(\theta | \mathcal{M})}{p(y | \mathcal{M})}$$

`$\mathcal{M}$` bezieht sich auf ein bestimmtes Modell.

`$p(y | \mathcal{M})$` ist die Wahrscheinlichkeit der Daten (marginal likelihood), über alle möglichen Parameterwerte des Modells `$\mathcal{M}$` integriert. Wir haben sie anfangs als Normierungskonstante bezeichnet.

$$ P(\theta|Data) = \frac{ P(Data|\theta) * P(\theta) } {P(Data)} $$
]
.panel[.panel-name[Bayes Theorem ohne Konstante]
$$ P(\theta|Data) \propto P(Data|\theta) * P(\theta) $$
]
.panel[.panel-name[Marginal likelihood]

$$ p(y | \mathcal{M}) = \int_{\theta}{p(y | \theta, \mathcal{M}) p(\theta|\mathcal{M})d\theta} $$
Bei der Berechnung der Marginal Likelihood muss über alle unter dem Modell möglichen Werte von `$\theta$` gemittelt werden.

]
]

---

## Komplexität

- Marginal Likelihood wird auch __Model Evidence__ genannt. Sie hängt davon ab, welche Vorhersagen ein Model machen kann (warum?).

- Ein Modell, welches viele Vorhersagen machen kann, ist ein __komplexes__ Modell. Die meisten Vorhersagen werden aber falsch sein (warum?).

Komplexität hängt unter anderem ab von:

- Anzahl Parameter im Modell

- Prior Verteilungen der Parameter

Bei frequentistischen Modellen spielt nur die Anzahl Parameter eine Rolle.

---

## Prior Verteilungen

Uninformative Priors machen viele Vorhersagen, vor allem in Gegenden des Parameterraums, in denen die Likelihood niedrig ist.

```r
n_points <- 100
theta_grid <- seq(from=0, to=1 , length.out = n_points)
likelihood <- dbinom(6, size = 9, prob = theta_grid)

compute_posterior = function(likelihood, prior){
  unstandardized_posterior <- likelihood * prior
  posterior <- unstandardized_posterior / sum(unstandardized_posterior)
  par(mfrow=c(1, 3))
  plot(theta_grid , prior, type="l", main="Prior", col = "dodgerblue3", lwd = 2)
  plot(theta_grid , likelihood, type="l", main="Likelihood", col = "firebrick3", lwd = 2)
  plot(theta_grid , posterior , type="l", main="Posterior", col = "darkorchid3", lwd = 2)
}
```

---

## Prior Verteilungen

Uninformative Priors machen viele Vorhersagen, vor allem in Gegenden des Parameterraums, in denen die Likelihood niedrig ist.

```r
prior <- dbeta(x = theta_grid, shape1 = 1, shape2 = 1)
compute_posterior(likelihood, prior)
```

<img src="05-bayesian-stats-slides_files/figure-html/unnamed-chunk-11-1.png" width="100%" />
]

```r
prior <- dbeta(x = theta_grid, shape1 = 20, shape2 = 20)
compute_posterior(likelihood, prior)
```

<img src="05-bayesian-stats-slides_files/figure-html/unnamed-chunk-12-1.png" width="100%" />
]

```r
prior <- dbeta(x = theta_grid, shape1 = 2, shape2 = 40)
compute_posterior(likelihood, prior)
```

<img src="05-bayesian-stats-slides_files/figure-html/unnamed-chunk-13-1.png" width="100%" />
]

```r
prior <- dbeta(x = theta_grid, shape1 = 48, shape2 = 30)
compute_posterior(likelihood, prior)
```

<img src="05-bayesian-stats-slides_files/figure-html/unnamed-chunk-14-1.png" width="100%" />
]
]

---

## Ockham's Razor

- Komplexe Modelle haben eine niedrigere Marginal Likelihood.

- Wenn wir bei einem Vergleich mehrerer Modelle diejenigen Modelle mit höherer Marginal Likelihood bevorzugen, wenden wir das __Prinzip der Sparsamkeit__ an.

Wir schreiben Bayes Theorem mit expliziten Modellen M1 und M2

$$ p(\mathcal{M}_1 | y) = \frac{P(y | \mathcal{M}_1) p(\mathcal{M}_1)}{p(y)} $$

und

$$ p(\mathcal{M}_2 | y) = \frac{P(y | \mathcal{M}_2) p(\mathcal{M}_2)}{p(y)} $$

Für Model `$\mathcal{M_m}$` ist die Posterior Wahrscheinlichkeit des Modells proportional zum Produkt der Marginal Likelihood und der A Priori Wahrscheinlichkeit.

---

## Modellvergleich

Verhältnis der Modellwahrscheinlichkeiten:

$$ \frac{p(\mathcal{M}_1 | y) = \frac{P(y | \mathcal{M}_1) p(\mathcal{M}_1)}{p(y)}} {p(\mathcal{M}_2 | y) = \frac{P(y | \mathcal{M}_2) p(\mathcal{M}_2)}{p(y)}} $$

`$p(y)$` kann rausgekürzt werden.

`$$\underbrace{\frac{p(\mathcal{M}_1 | y)} {p(\mathcal{M}_2 | y)}}_\text{Posterior odds} = \frac{P(y | \mathcal{M}_1)}{P(y | \mathcal{M}_2)} \cdot \underbrace{ \frac{p(\mathcal{M}_1)}{p(\mathcal{M}_2)}}_\text{Prior odds}$$`

<br>

`$\frac{p(\mathcal{M}_1)}{p(\mathcal{M}_2)}$` sind die  **prior odds**, und `$\frac{p(\mathcal{M}_1 | y)}{p(\mathcal{M}_2 | y)}$` sind die **posterior odds**.

---

Nehmen wir an, die Prior Odds seine `$1$`, dann interessiert uns nur das Verhältnis der Marginal Likelihoods:

$$ \frac{P(y | \mathcal{M}_1)}{P(y | \mathcal{M}_2)} $$

Dieser Term ist der **Bayes factor**: der Term, mit dem die Prior Odds multipliziert werden. Er gibt an, unter welchem Modell die Daten wahrscheinlicher sind.

Wir schreiben `$BF_{12}$` - dies ist der Bayes factor für Modell 1 vs Modell 2.

$$ BF_{12} = \frac{P(y | \mathcal{M}_1)}{P(y | \mathcal{M}_2)}$$

`$BF_{12}$` gibt an, in welchem Ausmass die Daten `$\mathcal{M}_1$` bevorzugen, relativ zu `$\mathcal{M}_2$`.

Beispiel: `$BF_{12} = 5$` heisst: die Daten sind unter Modell 1 5 Mal wahrscheinlicher als unter Modell 2

---

## Klassifizierung

---

## Bayes Factor

- Sehr stark abhängig von den Prior Verteilungen der Parameter.

- Ist nur für sehr simple Modelle einfach zu berechnen/schätzen.

- Schätzmethoden
    + Savage-Dickey Density Ratio mit `Stan`/`brms`
    
    + Package [BayesFactor](https://cran.r-project.org/web/packages/BayesFactor/vignettes/manual.html) (nur für allgemeine lineare Modelle)
    
    + [JASP](https://jasp-stats.org/): Im Prinzip ein GUI für `BayesFactor`
    
    + Bridge sampling mit `brms`: schwieriger zu implementieren, aber für viele Modelle möglich.

---

.pull-left-narrow[
  .huge-blue-number[2]
]
.pull-right-wide[
  .larger[
  Anwendung: Binomial Model
  ]
]

---

## Savage-Dickey Density Ratio

Wenn wir zwei genestete Modell vergleichen, wie z.B. ein Modell 1 mit einem frei geschätzten Parameter, und ein Nullmodell 
vergleichen, gibt es eine einfache Methode.

Unter dem Nullmodell (Nullhypothese): `$H_0: \theta = \theta_0$`

Unter Modell 1: `$H_1: \theta \neq \theta_0$`

Nun braucht der Parameter `$\theta$` unter Modell 1 eine Verteilung, z.B. `$\theta \sim \text{Beta}(1, 1)$`

Der Savage-Dickey Density Ratio Trick: wir betrachten nur Modell 1, und dividieren die Posterior Verteilung durch die Prior Verteilung an der Stelle `$\theta_0$`.

---

## Savage-Dickey Density Ratio

Wir schauen uns ein Beispiel aus Wagenmakers (2010) an:

Sie beobachten, dass jemand 9 Fragen von 10 richtig beantwortet.

```r
d <- tibble(s = 9, k = 10)
```

Unsere Frage: was ist die Wahrscheinlichkeit, dass das mit Glück ( `$\theta = 0.5$` ) passiert ist?

.footnote[
Wagenmakers, Eric-Jan, Tom Lodewyckx, Himanshu Kuriyal, and Raoul Grasman. “Bayesian Hypothesis Testing for Psychologists: A Tutorial on the Savage–Dickey Method.” Cognitive Psychology 60, no. 3 (May 1, 2010): 158–89. https://doi.org/10.1016/j.cogpsych.2009.12.001.
]

---

.panelset[
.panel[.panel-name[Prior]
<img src="05-bayesian-stats-slides_files/figure-html/unnamed-chunk-17-1.png" width="100%" />
]

<table class="table" style="width: auto !important; margin-left: auto; margin-right: auto;">
<caption>Bayes Factors.</caption>
 <thead>
  <tr>
   <th style="text-align:right;"> x </th>
   <th style="text-align:right;"> Prior </th>
   <th style="text-align:right;"> Posterior </th>
   <th style="text-align:right;"> BF01 </th>
   <th style="text-align:right;"> BF10 </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:right;"> 0.5 </td>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 0.107 </td>
   <td style="text-align:right;"> 0.107 </td>
   <td style="text-align:right;"> 9.309 </td>
  </tr>
</tbody>
</table>
]
]

---

## Savage-Dickey Density Ratio

```r
library(brms)

m1_formula <- bf(
  s | trials(k) ~ 0 + Intercept,
  family = binomial(link = "identity"))
```

]

```r
get_prior(m1_formula, 
          data = d)
```

```
##   prior class      coef group resp dpar nlpar bound       source
##  (flat)     b                                            default
##  (flat)     b Intercept                             (vectorized)
```
]
.panel[.panel-name[Set Priors]

```r
priors <- set_prior("beta(1, 1)", 
                   class = "b", 
                   lb = 0, ub = 1)
```

]

```r
m1 <- brm(m1_formula,
         prior = priors,
         data = d,
*        sample_prior = TRUE,
         file = "models/05-m1")
```

]
.panel[.panel-name[Summary]

```r
summary(m1)
```

```
##  Family: binomial 
##   Links: mu = identity 
## Formula: s | trials(k) ~ 0 + Intercept 
##    Data: d (Number of observations: 1) 
## Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
##          total post-warmup samples = 4000
## 
## Population-Level Effects: 
##           Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept     0.84      0.10     0.60     0.98 1.01     1273     1320
## 
## Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
```

]
]

---

## Posterior summary

```r
m1 %>% 
  mcmc_plot(c("b_Intercept", "prior_b"))
```

---

## Prior und Posterior

```r
samples <- m1 %>% 
  posterior_samples("b")
```

<table class="table" style="width: auto !important; margin-left: auto; margin-right: auto;">
<caption>Six first rows of posterior samples.</caption>
 <thead>
  <tr>
   <th style="text-align:right;"> b_Intercept </th>
   <th style="text-align:right;"> prior_b </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:right;"> 0.84 </td>
   <td style="text-align:right;"> 0.94 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 0.73 </td>
   <td style="text-align:right;"> 0.31 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 0.78 </td>
   <td style="text-align:right;"> 0.26 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 0.80 </td>
   <td style="text-align:right;"> 0.74 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 0.79 </td>
   <td style="text-align:right;"> 0.70 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 0.80 </td>
   <td style="text-align:right;"> 0.74 </td>
  </tr>
</tbody>
</table>

]

```r
samples %>% 
  pivot_longer(everything(), 
               names_to = "Type", 
               values_to = "value") %>%
  ggplot(aes(value, color = Type)) +
  geom_density(size = 1.5) +
  scale_color_viridis_d(end = 0.8) +
  labs(x = bquote(theta), y = "Density") +
  geom_vline(xintercept = .9) 
```

]

---

## Savage-Dickey Density Ratio

### Mit `brms`

```r
h <- m1 %>% 
  hypothesis("Intercept = 0.5")

print(h, digits = 4)
```

```
## Hypothesis Tests for class b:
##              Hypothesis Estimate Est.Error CI.Lower CI.Upper Evid.Ratio Post.Prob Star
## 1 (Intercept)-(0.5) = 0   0.3361    0.1007   0.0953   0.4754     0.0862    0.0793    *
## ---
## 'CI': 90%-CI for one-sided and 95%-CI for two-sided hypotheses.
## '*': For one-sided hypotheses, the posterior probability exceeds 95%;
## for two-sided hypotheses, the value tested against lies outside the 95%-CI.
## Posterior probabilities of point hypotheses assume equal prior probabilities.
```

---

## Savage-Dickey Density Ratio

```r
plot(h)
```

---

.pull-left-narrow[
  .huge-blue-number[3]
]
.pull-right-wide[
  .larger[
  Anwendung: Normalverteilung
  ]
]

---

## Bayes Factor für Vergleich von zwei Mittelwerten

```r
smart = tibble(IQ = c(101,100,102,104,102,97,105,105,98,101,100,123,105,103,
                      100,95,102,106,109,102,82,102,100,102,102,101,102,102,
                      103,103,97,97,103,101,97,104,96,103,124,101,101,100,
                      101,101,104,100,101),
               Group = "SmartDrug")

placebo = tibble(IQ = c(99,101,100,101,102,100,97,101,104,101,102,102,100,105,
                        88,101,100,104,100,100,100,101,102,103,97,101,101,100,101,
                        99,101,100,100,101,100,99,101,100,102,99,100,99),
                 Group = "Placebo")

TwoGroupIQ <- bind_rows(smart, placebo)  %>%
    mutate(Group = fct_relevel(as.factor(Group), "Placebo"))
```

---

```r
t.test(IQ ~ Group,
       data = TwoGroupIQ)
```

```
## 
## 	Welch Two Sample t-test
## 
## data:  IQ by Group
## t = -1.6222, df = 63.039, p-value = 0.1098
## alternative hypothesis: true difference in means between group Placebo and group SmartDrug is not equal to 0
## 95 percent confidence interval:
##  -3.4766863  0.3611848
## sample estimates:
##   mean in group Placebo mean in group SmartDrug 
##                100.3571                101.9149
```

---

```r
m2_formula <- bf(IQ ~ 1 + Group)
```

]

```r
get_prior(m2_formula,
          data = TwoGroupIQ)
```

```
##                   prior     class           coef group resp dpar nlpar bound       source
##                  (flat)         b                                                 default
##                  (flat)         b GroupSmartDrug                             (vectorized)
##  student_t(3, 101, 2.5) Intercept                                                 default
##    student_t(3, 0, 2.5)     sigma                                                 default
```
]

```r
priors = c(set_prior("normal(0, 1)",
                    class = "b", coef = "GroupSmartDrug"))
```

]

```r
m2 <- brm(m2_formula,
          prior = priors,
          data = TwoGroupIQ,
          cores = parallel::detectCores(),
          sample_prior = TRUE,
          file = here::here("models/05-m2-iq-bf"))
```

]

```r
summary(m2)
```

```
##  Family: gaussian 
##   Links: mu = identity; sigma = identity 
## Formula: IQ ~ 1 + Group 
##    Data: TwoGroupIQ (Number of observations: 89) 
## Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
##          total post-warmup samples = 4000
## 
## Population-Level Effects: 
##                Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept        100.76      0.61    99.58   101.96 1.00     4374     2977
## GroupSmartDrug     0.76      0.71    -0.64     2.17 1.00     4267     3012
## 
## Family Specific Parameters: 
##       Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma     4.72      0.36     4.11     5.48 1.00     3654     2814
## 
## Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
```
]
]

---

```r
m2 %>% 
  mcmc_plot("GroupSmartDrug")
```

---

```r
BF <- hypothesis(m2,
                hypothesis = 'GroupSmartDrug = 0')

BF
```

```
## Hypothesis Tests for class b:
##             Hypothesis Estimate Est.Error CI.Lower CI.Upper Evid.Ratio Post.Prob Star
## 1 (GroupSmartDrug) = 0     0.76      0.71    -0.64     2.17       0.81      0.45     
## ---
## 'CI': 90%-CI for one-sided and 95%-CI for two-sided hypotheses.
## '*': For one-sided hypotheses, the posterior probability exceeds 95%;
## for two-sided hypotheses, the value tested against lies outside the 95%-CI.
## Posterior probabilities of point hypotheses assume equal prior probabilities.
```

```r
1/BF$hypothesis$Evid.Ratio
```

```
## [1] 1.23144
```

---

```r
write_csv(TwoGroupIQ, file = "SmartDrug.csv")
```

Laden Sie [JASP](https://jasp-stats.org/download/) herunter, und öffnen Sie das gespeicherte CSV File.

Führen Sie in JASP einen **Bayesian t-Test** durch.

- Warum erhalten Sie nicht denselben Bayes Factor wie JASP?
- Probieren Sie verschiedene Prior Verteilungen aus.
- Versuchen Sie, das Ergebnis in Worten zusammenzufassen. Was sagt der Bayes Factor aus?
]