
Multivariate normal model cont’d

Dr. Olanrewaju Michael Akande

Feb 21, 2020

1 / 45

Announcements

  • Homework 5 will be online by 5pm today.

Outline

  • Chat on survey responses

  • Multivariate normal/Gaussian model

    • Inference for mean (recap)

    • Inference for covariance

    • Back to the example

    • Gibbs sampler

    • Jeffreys' prior

2 / 45

Multivariate normal model

3 / 45

Conditional inference on mean recap

  • For data \boldsymbol{Y}_i = (Y_{i1},\ldots,Y_{ip})^T \sim \mathcal{N}_p(\boldsymbol{\theta}, \Sigma),

    \begin{split} L(\boldsymbol{Y}; \boldsymbol{\theta}, \Sigma) & = \prod^n_{i=1} (2\pi)^{-\frac{p}{2}} \left|\Sigma\right|^{-\frac{1}{2}} \ \textrm{exp} \left\{-\dfrac{1}{2} (\boldsymbol{y}_i - \boldsymbol{\theta})^T \Sigma^{-1} (\boldsymbol{y}_i - \boldsymbol{\theta})\right\}\\ & \propto \textrm{exp} \left\{-\dfrac{1}{2} \boldsymbol{\theta}^T(n\Sigma^{-1})\boldsymbol{\theta} + \boldsymbol{\theta}^T (n\Sigma^{-1} \bar{\boldsymbol{y}}) \right\}. \end{split}

  • If we assume \pi(\boldsymbol{\theta}) = \mathcal{N}_p(\boldsymbol{\mu}_0, \Lambda_0), that is,

    \begin{split} \pi(\boldsymbol{\theta}) & \propto \textrm{exp} \left\{-\dfrac{1}{2} \boldsymbol{\theta}^T\Lambda_0^{-1}\boldsymbol{\theta} + \boldsymbol{\theta}^T\Lambda_0^{-1}\boldsymbol{\mu}_0 \right\}\\ \end{split}

  • Then

    \begin{split} \pi(\boldsymbol{\theta} | \Sigma, \boldsymbol{Y}) & \propto \textrm{exp} \left\{-\dfrac{1}{2} \boldsymbol{\theta}^T \left[\Lambda_0^{-1} + n\Sigma^{-1}\right] \boldsymbol{\theta} + \boldsymbol{\theta}^T \left[\Lambda_0^{-1}\boldsymbol{\mu}_0 + n\Sigma^{-1} \bar{\boldsymbol{y}} \right] \right\} \ \equiv \ \mathcal{N}_p(\boldsymbol{\mu}_n, \Lambda_n) \end{split}

    where

    \begin{split} \Lambda_n & = \left[\Lambda_0^{-1} + n\Sigma^{-1}\right]^{-1}\\ \boldsymbol{\mu}_n & = \Lambda_n \left[\Lambda_0^{-1}\boldsymbol{\mu}_0 + n\Sigma^{-1} \bar{\boldsymbol{y}} \right]. \end{split}

4 / 45

Conditional inference on mean recap

  • As in the univariate case, we once again have that

    • Posterior precision is sum of prior precision and data precision:

      \Lambda_n^{-1} = \Lambda_0^{-1} + n\Sigma^{-1}

    • Posterior expectation is weighted average of prior expectation and the sample mean:

      \begin{split} \boldsymbol{\mu}_n & = \Lambda_n \left[\Lambda_0^{-1}\boldsymbol{\mu}_0 + n\Sigma^{-1} \bar{\boldsymbol{y}} \right]\\ \\ & = \overbrace{\left[ \Lambda_n \Lambda_0^{-1} \right]}^{\textrm{weight on prior mean}} \underbrace{\boldsymbol{\mu}_0}_{\textrm{prior mean}} + \overbrace{\left[ \Lambda_n (n\Sigma^{-1}) \right]}^{\textrm{weight on sample mean}} \underbrace{ \bar{\boldsymbol{y}}}_{\textrm{sample mean}} \end{split}

  • Compare these to the results from the univariate case to gain more intuition.
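
  • For extra intuition, here is a minimal R sketch of these two updates (not from the slides; it assumes mu_0, Lambda_0, the current Sigma, ybar, and n already exist in the workspace, as they do in the Gibbs sampler later in the deck):

    Lambda_n <- solve(solve(Lambda_0) + n*solve(Sigma))   #posterior covariance: [Lambda_0^{-1} + n Sigma^{-1}]^{-1}
    mu_n <- Lambda_n %*% (solve(Lambda_0)%*%mu_0 + n*solve(Sigma)%*%ybar)
    #the two weight matrices in the weighted average; they sum to the identity
    W_prior <- Lambda_n %*% solve(Lambda_0)
    W_data <- Lambda_n %*% (n*solve(Sigma))
    W_prior + W_data   #should be (numerically) the identity matrix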

5 / 45

What about the covariance matrix?

  • A random variable \Sigma \sim \mathcal{IW}_p(\nu_0, \boldsymbol{S}_0), where \Sigma is positive definite and p \times p, has pdf

    \begin{split} p(\Sigma) \ \propto \ \left|\Sigma\right|^{\frac{-(\nu_0 + p + 1)}{2}} \textrm{exp} \left\{-\dfrac{1}{2} \text{tr}(\boldsymbol{S}_0\Sigma^{-1}) \right\}, \end{split}

    where

    • \text{tr}(\cdot) is the trace function (sum of diagonal elements),
    • \nu_0 > p - 1 is the "degrees of freedom", and
    • \boldsymbol{S}_0 is a p \times p positive definite matrix.
  • For this distribution, \mathbb{E}[\Sigma] = \dfrac{1}{\nu_0 - p - 1} \boldsymbol{S}_0, for \nu_0 > p + 1.

  • Hence, \boldsymbol{S}_0 is the scaled mean of the \mathcal{IW}_p(\nu_0, \boldsymbol{S}_0).
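
  • As a quick numerical sanity check (a sketch, not from the slides; the particular \nu_0 and \boldsymbol{S}_0 below are arbitrary), we can draw from this distribution with riwish() from the MCMCpack package, which is also used in the Gibbs code later, and compare the Monte Carlo average to \dfrac{1}{\nu_0 - p - 1} \boldsymbol{S}_0:

    library(MCMCpack)   #riwish(v, S) draws from the inverse-Wishart in this parameterization
    set.seed(1)
    p <- 2; nu_0 <- 6
    S_0 <- matrix(c(2,0.5,0.5,1),nrow=2,ncol=2)
    draws <- replicate(5000, riwish(nu_0, S_0))   #a 2 x 2 x 5000 array of draws
    apply(draws, c(1,2), mean)    #Monte Carlo average of Sigma
    S_0/(nu_0 - p - 1)            #theoretical mean S_0/(nu_0 - p - 1)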

6 / 45

Wishart distribution

  • If we are very confident in a prior guess \Sigma_0 for \Sigma, then we might set

    • \nu_0, the degrees of freedom, to be very large, and
    • \boldsymbol{S}_0 = (\nu_0 - p - 1)\Sigma_0.

    In this case, \mathbb{E}[\Sigma] = \dfrac{1}{\nu_0 - p - 1} \boldsymbol{S}_0 = \dfrac{1}{\nu_0 - p - 1}(\nu_0 - p - 1)\Sigma_0 = \Sigma_0, and \Sigma is tightly (depending on the value of \nu_0) centered around \Sigma_0.

  • If we are not at all confident but we still have a prior guess \Sigma_0, we might set

    • \nu_0 = p + 2, so that \mathbb{E}[\Sigma] = \dfrac{1}{\nu_0 - p - 1} \boldsymbol{S}_0 is finite, and
    • \boldsymbol{S}_0 = \Sigma_0.

    Here, \mathbb{E}[\Sigma] = \Sigma_0 as before, but \Sigma is only loosely centered around \Sigma_0.

7 / 45

Wishart distribution

  • Just as we had with the gamma and inverse-gamma relationship in the univariate case, we can also work in terms of the Wishart distribution (multivariate generalization of the gamma) instead.

  • The Wishart distribution provides a conditionally-conjugate prior for the precision matrix \Sigma^{-1} in a multivariate normal model.

  • Specifically, if \Sigma \sim \mathcal{IW}_p(\nu_0, \boldsymbol{S}_0), then \Phi = \Sigma^{-1} \sim \textrm{W}_p(\nu_0, \boldsymbol{S}_0^{-1}).

  • A random variable \Phi \sim \textrm{W}_p(\nu_0, \boldsymbol{S}_0^{-1}), where \Phi has dimension (p \times p), has pdf

    \begin{split} f(\Phi) \ \propto \ \left|\Phi\right|^{\frac{\nu_0 - p - 1}{2}} \textrm{exp} \left\{-\dfrac{1}{2} \text{tr}(\boldsymbol{S}_0\Phi) \right\}. \end{split}

  • Here, \mathbb{E}[\Phi] = \nu_0 \boldsymbol{S}_0^{-1}.

  • Note that the textbook writes the inverse-Wishart as \mathcal{IW}_p(\nu_0, \boldsymbol{S}_0^{-1}). I prefer \mathcal{IW}_p(\nu_0, \boldsymbol{S}_0) instead. Feel free to use either notation but try not to get confused.
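
  • A small numerical check of the inverse relationship (a sketch, not from the slides; the \nu_0 and \boldsymbol{S}_0 values are arbitrary): drawing \Sigma \sim \mathcal{IW}_p(\nu_0, \boldsymbol{S}_0) with riwish() and inverting each draw should give an average close to \nu_0 \boldsymbol{S}_0^{-1}.

    library(MCMCpack)
    set.seed(42)
    p <- 2; nu_0 <- 6
    S_0 <- matrix(c(2,0.5,0.5,1),nrow=2,ncol=2)
    Phi_draws <- replicate(5000, solve(riwish(nu_0, S_0)))   #Phi = Sigma^{-1}
    apply(Phi_draws, c(1,2), mean)   #Monte Carlo average of Phi
    nu_0*solve(S_0)                  #E[Phi] = nu_0*S_0^{-1}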

8 / 45

Back to inference on covariance

  • For inference on \boldsymbol{\Sigma}, we need to rewrite the likelihood a bit to match the inverse-Wishart kernel.

  • First, a few results from matrix algebra (verified numerically in the short R check after this list):

    1. \text{tr}(\boldsymbol{A}) = \sum^p_{j=1}a_{jj}, where a_{jj} is the jth diagonal element of a square p \times p matrix \boldsymbol{A}.
    2. Cyclic property: \text{tr}(\boldsymbol{A}\boldsymbol{B}\boldsymbol{C}) = \text{tr}(\boldsymbol{B}\boldsymbol{C}\boldsymbol{A}) = \text{tr}(\boldsymbol{C}\boldsymbol{A}\boldsymbol{B}), given that the product \boldsymbol{A}\boldsymbol{B}\boldsymbol{C} is a square matrix.
    3. If \boldsymbol{A} is a p \times p matrix, then for a p \times 1 vector \boldsymbol{x}, \boldsymbol{x}^T \boldsymbol{A} \boldsymbol{x} = \text{tr}(\boldsymbol{x}^T \boldsymbol{A} \boldsymbol{x}) holds by (1), since \boldsymbol{x}^T \boldsymbol{A} \boldsymbol{x} is a scalar.
    4. \text{tr}(\boldsymbol{A} + \boldsymbol{B}) = \text{tr}(\boldsymbol{A}) + \text{tr}(\boldsymbol{B}).
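
  • The check below is a sketch (not from the slides), using random matrices and a random vector to confirm these results numerically:

    set.seed(1)
    A <- matrix(rnorm(9),3,3); B <- matrix(rnorm(9),3,3); C <- matrix(rnorm(9),3,3)
    x <- rnorm(3)
    sum(diag(A%*%B%*%C)); sum(diag(B%*%C%*%A)); sum(diag(C%*%A%*%B))   #cyclic property (result 2)
    c(t(x)%*%A%*%x); sum(diag(x%*%t(x)%*%A))   #results 2 and 3: x^T A x = tr(x^T A x) = tr(x x^T A)
    sum(diag(A + B)); sum(diag(A)) + sum(diag(B))                      #result 4
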
9 / 45

Multivariate normal likelihood again

  • It is thus convenient to rewrite L(\boldsymbol{Y}; \boldsymbol{\theta}, \Sigma) as

    \begin{split} L(\boldsymbol{Y}; \boldsymbol{\theta}, \Sigma) & \propto \underbrace{\left|\Sigma\right|^{-\frac{n}{2}} \ \textrm{exp} \left\{-\dfrac{1}{2} \sum^n_{i=1} (\boldsymbol{y}_i - \boldsymbol{\theta})^T \Sigma^{-1} (\boldsymbol{y}_i - \boldsymbol{\theta})\right\}}_{\text{no algebra/change yet}}\\ & = \left|\Sigma\right|^{-\frac{n}{2}} \ \textrm{exp} \left\{-\dfrac{1}{2} \sum^n_{i=1} \underbrace{\text{tr}\left[(\boldsymbol{y}_i - \boldsymbol{\theta})^T \Sigma^{-1} (\boldsymbol{y}_i - \boldsymbol{\theta}) \right]}_{\text{by result 3}} \right\}\\ & = \left|\Sigma\right|^{-\frac{n}{2}} \ \textrm{exp} \left\{-\dfrac{1}{2} \sum^n_{i=1} \underbrace{\text{tr}\left[(\boldsymbol{y}_i - \boldsymbol{\theta})(\boldsymbol{y}_i - \boldsymbol{\theta})^T \Sigma^{-1} \right]}_{\text{by cyclic property}} \right\}\\ & = \left|\Sigma\right|^{-\frac{n}{2}} \ \textrm{exp} \left\{-\dfrac{1}{2} \underbrace{\text{tr}\left[\sum^n_{i=1} (\boldsymbol{y}_i - \boldsymbol{\theta})(\boldsymbol{y}_i - \boldsymbol{\theta})^T \Sigma^{-1} \right]}_{\text{by result 4}} \right\}\\ & = \left|\Sigma\right|^{-\frac{n}{2}} \ \textrm{exp} \left\{-\dfrac{1}{2}\text{tr}\left[\boldsymbol{S}_\theta \Sigma^{-1} \right] \right\},\\ \end{split}

    where \boldsymbol{S}_\theta = \sum^n_{i=1}(\boldsymbol{y}_i - \boldsymbol{\theta})(\boldsymbol{y}_i - \boldsymbol{\theta})^T is the residual sum of squares matrix.
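
  • In R, \boldsymbol{S}_\theta can be computed either as an explicit sum of outer products or in the compact matrix form used in the Gibbs sampler later in the deck. A minimal sketch (assuming a data matrix Y of dimension n \times p and a current value theta of length p are in the workspace):

    S_theta_sum <- Reduce(`+`, lapply(1:nrow(Y), function(i) tcrossprod(Y[i,] - theta)))
    S_theta_mat <- (t(Y)-c(theta))%*%t(t(Y)-c(theta))   #form used in the Gibbs sampler
    max(abs(S_theta_sum - S_theta_mat))                 #should be (numerically) zero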

10 / 45

Conditional posterior for covariance

  • Assuming \pi(\Sigma) = \mathcal{IW}_p(\nu_0, \boldsymbol{S}_0), the conditional posterior (full conditional) of \Sigma | \boldsymbol{\theta}, \boldsymbol{Y} is then

    \begin{split} \pi(\Sigma| \boldsymbol{\theta}, \boldsymbol{Y}) & \propto L(\boldsymbol{Y}; \boldsymbol{\theta}, \Sigma) \cdot \pi(\Sigma) \\ \\ & \propto \underbrace{\left|\Sigma\right|^{-\frac{n}{2}} \ \textrm{exp} \left\{-\dfrac{1}{2}\text{tr}\left[\boldsymbol{S}_\theta \Sigma^{-1} \right] \right\}}_{L(\boldsymbol{Y}; \boldsymbol{\theta}, \Sigma)} \cdot \underbrace{\left|\Sigma\right|^{\frac{-(\nu_0 + p + 1)}{2}} \textrm{exp} \left\{-\dfrac{1}{2} \text{tr}(\boldsymbol{S}_0\Sigma^{-1}) \right\}}_{\pi(\Sigma)} \\ \\ & \propto \left|\Sigma\right|^{\frac{-(\nu_0 + p + n + 1)}{2}} \textrm{exp} \left\{-\dfrac{1}{2} \text{tr}\left[\boldsymbol{S}_0\Sigma^{-1} + \boldsymbol{S}_\theta \Sigma^{-1} \right] \right\}\\ \\ & \propto \left|\Sigma\right|^{\frac{-(\nu_0 + n + p + 1)}{2}} \textrm{exp} \left\{-\dfrac{1}{2} \text{tr}\left[ \left(\boldsymbol{S}_0 + \boldsymbol{S}_\theta \right) \Sigma^{-1} \right] \right\},\\ \end{split}

    which is \mathcal{IW}_p(\nu_n, \boldsymbol{S}_n), or using the notation in the book, \mathcal{IW}_p(\nu_n, \boldsymbol{S}_n^{-1} ), with

    • \nu_n = \nu_0 + n, and
    • \boldsymbol{S}_n = \left[\boldsymbol{S}_0 + \boldsymbol{S}_\theta \right]
11 / 45

Conditional posterior for covariance

  • We once again see that the "posterior sample size" or "posterior degrees of freedom" \nu_n is the sum of the "prior degrees of freedom" \nu_0 and the data sample size n.

  • \boldsymbol{S}_n can be thought of as the "posterior sum of squares", which is the sum of "prior sum of squares" plus "sample sum of squares".

  • Recall that if \Sigma \sim \mathcal{IW}_p(\nu_0, \boldsymbol{S}_0), then \mathbb{E}[\Sigma] = \dfrac{1}{\nu_0 - p - 1} \boldsymbol{S}_0.

  • \Rightarrow the conditional posterior expectation of the population covariance is

    \begin{split} \mathbb{E}[\Sigma | \boldsymbol{\theta}, \boldsymbol{Y}] & = \dfrac{1}{\nu_0 + n - p - 1} \left[\boldsymbol{S}_0 + \boldsymbol{S}_\theta \right]\\ \\ & = \underbrace{\dfrac{\nu_0 - p - 1}{\nu_0 + n - p - 1}}_{\text{weight on prior expectation}} \overbrace{\left[\dfrac{1}{\nu_0 - p - 1} \boldsymbol{S}_0\right]}^{\text{prior expectation}} + \underbrace{\dfrac{n}{\nu_0 + n - p - 1}}_{\text{weight on sample estimate}} \overbrace{\left[\dfrac{1}{n} \boldsymbol{S}_\theta \right]}^{\text{sample estimate}},\\ \end{split}

    which is a weighted average of prior expectation and sample estimate.
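
  • As a quick algebra check (a sketch, not from the slides; it assumes p, n, \nu_0, \boldsymbol{S}_0, and a current \boldsymbol{S}_\theta exist in the workspace), the two expressions above agree:

    direct <- (S_0 + S_theta)/(nu_0 + n - p - 1)
    weighted <- ((nu_0 - p - 1)/(nu_0 + n - p - 1))*(S_0/(nu_0 - p - 1)) +
                (n/(nu_0 + n - p - 1))*(S_theta/n)
    max(abs(direct - weighted))   #should be (numerically) zero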

12 / 45

Reading comprehension example again

  • Vector of observations for each student: \boldsymbol{Y}_i = (Y_{i1},Y_{i2})^T.

    • Y_{i1}: pre-instructional score for student i.

    • Y_{i2}: post-instructional score for student i.

  • Questions of interest:

    • Do students improve in reading comprehension on average?

    • If so, by how much?

    • Can we predict post-test score from pre-test score? How correlated are they?

    • If we have students with missing pre-test scores, can we predict the scores? (Will defer this till next week!)

13 / 45

Reading comprehension example

  • Since we have bivariate continuous responses for each student, and test scores are often normally distributed, we can use a bivariate normal model.

  • Model the data as \boldsymbol{Y_i} = (Y_{i1},Y_{i2})^T \sim \mathcal{N}_2(\boldsymbol{\theta}, \Sigma), that is,

    \begin{eqnarray*} \boldsymbol{Y}_i = \begin{pmatrix}Y_{i1}\\ Y_{i2} \end{pmatrix} & \sim & \mathcal{N}_2\left[\boldsymbol{\theta} = \left(\begin{array}{c} \theta_1\\ \theta_2 \end{array}\right),\Sigma = \left(\begin{array}{cc} \sigma^2_1 & \sigma_{12} \\ \sigma_{21} & \sigma^2_2 \end{array}\right)\right].\\ \end{eqnarray*}

  • We can answer the first two questions of interest by looking at the posterior distribution of \theta_2 - \theta_1.

  • The correlation between Y_1 and Y_2 is

    \rho = \dfrac{\sigma_{12}}{\sigma_1 \sigma_2},

    so we can answer the third question by looking at the posterior distribution of \rho, which we have once we have posterior samples of \Sigma.

14 / 45

Reading example: prior on mean

  • Clearly, we first need to set the hyperparameters \boldsymbol{\mu}_0 and \Lambda_0 in \pi(\boldsymbol{\theta}) = \mathcal{N}_2(\boldsymbol{\mu}_0, \Lambda_0), based on prior belief.

  • For this example, both tests were actually designed a priori to have a mean of 50, so we can set \boldsymbol{\mu}_0 = (\mu_{0(1)},\mu_{0(2)})^T = (50,50)^T.

  • \boldsymbol{\mu}_0 = (0,0)^T is also often a common choice when there is no prior guess, especially when there is enough data to "drown out" the prior guess.

  • Next, we need to set values for elements of

    \begin{eqnarray*} \Lambda_0 = \left(\begin{array}{cc} \lambda_{11} & \lambda_{12} \\ \lambda_{21} & \lambda_{22} \end{array}\right)\\ \end{eqnarray*}

  • It is quite reasonable to believe a priori that the true means will most likely lie in the interval [25, 75] with high probability (perhaps 0.95?), since individual test scores should lie in the interval [0, 100].

  • Recall that for any normal distribution, 95% of the density will lie within two standard deviations of the mean.

15 / 45

Reading example: prior on mean

  • Therefore, we can set

    \begin{split} & \mu_{0(1)} \pm 2 \sqrt{\lambda_{11}} = (25,75) \ \ \Rightarrow \ \ 50 \pm 2\sqrt{\lambda_{11}} = (25,75) \\ \Rightarrow \ \ & 2\sqrt{\lambda_{11}} = 25 \ \ \Rightarrow \ \ \lambda_{11} = \left(\frac{25}{2}\right)^2 \approx 156. \end{split}

  • Similarly, set \lambda_{22} \approx 156.

  • Finally, we expect some prior correlation between \theta_1 and \theta_2, but suppose we don't know exactly how strong it is. We can set the prior correlation to 0.5.

    \Rightarrow 0.5 = \dfrac{\lambda_{12}}{\sqrt{\lambda_{11}}\sqrt{\lambda_{22}}} = \dfrac{\lambda_{12}}{156} \ \ \Rightarrow \ \ \lambda_{12} = 156 \times 0.5 = 78.

  • Thus,

    \begin{eqnarray*} \pi(\boldsymbol{\theta}) = \mathcal{N}_2\left(\boldsymbol{\mu}_0 = \left(\begin{array}{c} 50\\ 50 \end{array}\right),\Lambda_0 = \left(\begin{array}{cc} 156 & 78 \\ 78 & 156 \end{array}\right)\right).\\ \end{eqnarray*}
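
  • The same elicitation can be scripted. A minimal R sketch (not from the slides; the slides round 156.25 to 156 and 78.125 to 78):

    mu_0 <- c(50,50)
    half_width <- (75 - 25)/2          #the means should lie in [25, 75] with ~95% prior probability
    lambda_11 <- (half_width/2)^2      #2*sqrt(lambda_11) = 25  =>  lambda_11 = 156.25
    lambda_22 <- lambda_11
    lambda_12 <- 0.5*sqrt(lambda_11*lambda_22)   #prior correlation of 0.5  =>  78.125
    Lambda_0 <- matrix(c(lambda_11,lambda_12,lambda_12,lambda_22),nrow=2,ncol=2)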

16 / 45

Reading example: prior on covariance

  • Next we need to set the hyperparameters \nu_0 and \boldsymbol{S}_0 in \pi(\Sigma) = \mathcal{IW}_2(\nu_0, \boldsymbol{S}_0), based on prior belief.

  • First, let's start with a prior guess \Sigma_0 for \Sigma.

  • Again, since individual test scores should lie in the interval [0, 100], we should set \Sigma_0 so that values outside [0, 100] are highly unlikely.

  • Just as we did with \Lambda_0, we can use that idea to set the elements of \Sigma_0

    \begin{eqnarray*} \Sigma_0 = \left(\begin{array}{cc} \sigma^{(0)}_{11} & \sigma^{(0)}_{12} \\ \sigma^{(0)}_{21} & \sigma^{(0)}_{22} \end{array}\right)\\ \end{eqnarray*}

  • The identity matrix is also often a common choice for \Sigma_0 when there is no prior guess, especially when there is enough data to "drown out" the prior guess.

17 / 45

Reading example: prior on covariance

  • Therefore, we can set

    \begin{split} & \mu_{0(1)} \pm 2 \sqrt{\sigma^{(0)}_{11}} = (0,100) \ \ \Rightarrow \ \ 50 \pm 2\sqrt{\sigma^{(0)}_{11}} = (0,100) \\ \Rightarrow \ \ & 2\sqrt{\sigma^{(0)}_{11}} = 50 \ \ \Rightarrow \ \ \sigma^{(0)}_{11} = \left(\frac{50}{2}\right)^2 \approx 625. \end{split}

  • Similarly, set \sigma^{(0)}_{22} \approx 625.

  • Again, we expect some correlation between Y_1 and Y_2, but suppose we don't know exactly how strong. We can set the prior correlation to 0.5.

    \Rightarrow 0.5 = \dfrac{\sigma^{(0)}_{12}}{\sqrt{\sigma^{(0)}_{11}}\sqrt{\sigma^{(0)}_{22}}} = \dfrac{\sigma^{(0)}_{12}}{625} \ \ \Rightarrow \ \ \sigma^{(0)}_{12} = 625 \times 0.5 = 312.5.

  • Thus,

    \begin{eqnarray*} \Sigma_0 = \left(\begin{array}{cc} 625 & 312.5 \\ 312.5 & 625 \end{array}\right)\\ \end{eqnarray*}
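
  • As a rough sanity check of this choice (a sketch, not from the slides), scores simulated around a mean of 50 with covariance \Sigma_0 should only rarely fall outside [0, 100]:

    library(mvtnorm)
    set.seed(2)
    Sigma_0 <- matrix(c(625,312.5,312.5,625),nrow=2,ncol=2)
    scores <- rmvnorm(10000, mean=c(50,50), sigma=Sigma_0)
    mean(scores < 0 | scores > 100)   #proportion outside [0, 100]; roughly 0.05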

18 / 45

Reading example: prior on covariance

  • Recall that if we are not at all confident in a prior value for \Sigma, but we have a prior guess \Sigma_0, we can set

    • \nu_0 = p + 2, so that \mathbb{E}[\Sigma] = \dfrac{1}{\nu_0 - p - 1} \boldsymbol{S}_0 is finite, and
    • \boldsymbol{S}_0 = \Sigma_0

    so that \Sigma is only loosely centered around \Sigma_0.

  • Thus, we can set

    • \nu_0 = p + 2 = 2 + 2 = 4
    • \boldsymbol{S}_0 = \Sigma_0

    so that we have

    \begin{eqnarray*} \pi(\Sigma) = \mathcal{IW}_2\left(\nu_0 = 4,\Sigma_0 = \left(\begin{array}{cc} 625 & 312.5 \\ 312.5 & 625 \end{array}\right)\right).\\ \end{eqnarray*}

19 / 45

Reading example: data

Now, to the data (finally!)

Y <- as.matrix(dget("http://www2.stat.duke.edu/~pdh10/FCBS/Inline/Y.reading"))
dim(Y)
## [1] 22  2
head(Y)
##      pretest posttest
## [1,]      59       77
## [2,]      43       39
## [3,]      34       46
## [4,]      32       26
## [5,]      42       38
## [6,]      38       43
summary(Y)
##     pretest         posttest
##  Min.   :28.00   Min.   :26.00
##  1st Qu.:34.25   1st Qu.:43.75
##  Median :44.00   Median :52.00
##  Mean   :47.18   Mean   :53.86
##  3rd Qu.:58.00   3rd Qu.:60.00
##  Max.   :72.00   Max.   :86.00
20 / 45

Reading example: data

21 / 45

Reading example: data

22 / 45

Posterior computation

  • To recap, we have

    \begin{split} \pi(\boldsymbol{\theta} | \Sigma, \boldsymbol{Y}) = \mathcal{N}_2(\boldsymbol{\mu}_n, \Lambda_n) \end{split}

    where

    \begin{split} \Lambda_n & = \left[\Lambda_0^{-1} + n\Sigma^{-1}\right]^{-1}\\ \\ \boldsymbol{\mu}_n & = \Lambda_n \left[\Lambda_0^{-1}\boldsymbol{\mu}_0 + n\Sigma^{-1} \bar{\boldsymbol{y}} \right],\\ \end{split}

    \boldsymbol{\mu}_0 = (\mu_{0(1)},\mu_{0(2)})^T = (50,50)^T

    \begin{eqnarray*} \Lambda_0 = \left(\begin{array}{cc} 156 & 78 \\ 78 & 156 \end{array}\right)\\ \end{eqnarray*}

23 / 45

Posterior computation

  • We also have

    \begin{split} \pi(\Sigma | \boldsymbol{\theta}, \boldsymbol{Y}) = \mathcal{IW}_2(\nu_n, \boldsymbol{S}_n) \end{split}

    or using the notation in the book, \mathcal{IW}_2(\nu_n, \boldsymbol{S}_n^{-1} ), where

    \begin{split} \nu_n & = \nu_0 + n\\ \\ \boldsymbol{S}_n & = \left[\boldsymbol{S}_0 + \boldsymbol{S}_\theta \right]\\ & = \left[\boldsymbol{S}_0 + \sum^n_{i=1}(\boldsymbol{y}_i - \boldsymbol{\theta})(\boldsymbol{y}_i - \boldsymbol{\theta})^T \right]. \end{split}

    \nu_0 = p + 2 = 4

    \begin{eqnarray*} \Sigma_0 = \left(\begin{array}{cc} 625 & 312.5 \\ 312.5 & 625 \end{array}\right)\\ \end{eqnarray*}

24 / 45

Posterior computation

#Data summaries
n <- nrow(Y)
ybar <- apply(Y,2,mean)
#Hyperparameters for the priors
mu_0 <- c(50,50)
Lambda_0 <- matrix(c(156,78,78,156),nrow=2,ncol=2)
nu_0 <- 4
S_0 <- matrix(c(625,312.5,312.5,625),nrow=2,ncol=2)
#Initial values for Gibbs sampler
#No need to set initial value for theta, we can simply sample it first
Sigma <- cov(Y)
#Set null matrices to save samples
THETA <- SIGMA <- NULL

Next, we need to write the code for the Gibbs sampler.

25 / 45

Posterior computation

#Now, to the Gibbs sampler
library(mvtnorm)  #for the multivariate normal (rmvnorm)
library(MCMCpack) #for the inverse-Wishart (riwish)
#first set number of iterations and burn-in, then set seed
n_iter <- 10000; burn_in <- 0.3*n_iter
set.seed(1234)
for (s in 1:(n_iter+burn_in)){
  ##update theta using its full conditional
  Lambda_n <- solve(solve(Lambda_0) + n*solve(Sigma))
  mu_n <- Lambda_n %*% (solve(Lambda_0)%*%mu_0 + n*solve(Sigma)%*%ybar)
  theta <- rmvnorm(1,mu_n,Lambda_n)
  #update Sigma
  S_theta <- (t(Y)-c(theta))%*%t(t(Y)-c(theta))  #residual sum of squares matrix
  S_n <- S_0 + S_theta
  nu_n <- nu_0 + n
  Sigma <- riwish(nu_n, S_n)
  #save results only past burn-in
  if(s > burn_in){
    THETA <- rbind(THETA,theta)
    SIGMA <- rbind(SIGMA,c(Sigma))
  }
}
colnames(THETA) <- c("theta_1","theta_2")
colnames(SIGMA) <- c("sigma_11","sigma_12","sigma_21","sigma_22") #symmetry in sigma

Note that the text also has a function to sample from the Wishart distribution.

26 / 45

Diagnostics

library(coda)  #for MCMC diagnostics
THETA.mcmc <- mcmc(THETA,start=1); summary(THETA.mcmc)
##
## Iterations = 1:10000
## Thinning interval = 1
## Number of chains = 1
## Sample size per chain = 10000
##
## 1. Empirical mean and standard deviation for each variable,
##    plus standard error of the mean:
##
##          Mean    SD Naive SE Time-series SE
## theta_1 47.30 2.956  0.02956        0.02956
## theta_2 53.69 3.290  0.03290        0.03290
##
## 2. Quantiles for each variable:
##
##          2.5%   25%   50%   75% 97.5%
## theta_1 41.55 45.35 47.36 49.23 53.08
## theta_2 47.08 51.53 53.69 55.82 60.13
effectiveSize(THETA.mcmc)
## theta_1 theta_2
##   10000   10000
27 / 45

Diagnostics

SIGMA.mcmc <- mcmc(SIGMA,start=1); summary(SIGMA.mcmc)
##
## Iterations = 1:10000
## Thinning interval = 1
## Number of chains = 1
## Sample size per chain = 10000
##
## 1. Empirical mean and standard deviation for each variable,
##    plus standard error of the mean:
##
##           Mean    SD Naive SE Time-series SE
## sigma_11 202.3 63.39   0.6339         0.6511
## sigma_12 155.3 60.92   0.6092         0.6244
## sigma_21 155.3 60.92   0.6092         0.6244
## sigma_22 260.1 81.96   0.8196         0.8352
##
## 2. Quantiles for each variable:
##
##            2.5%   25%   50%   75% 97.5%
## sigma_11 113.50 158.2 190.8 234.8 357.3
## sigma_12  67.27 113.2 144.7 186.5 305.4
## sigma_21  67.27 113.2 144.7 186.5 305.4
## sigma_22 145.84 203.2 244.6 300.9 461.0
effectiveSize(SIGMA.mcmc)
## sigma_11 sigma_12 sigma_21 sigma_22
## 9478.710 9517.989 9517.989 9629.352
28 / 45

Diagnostics: trace plots

plot(THETA.mcmc[,"theta_1"])

Looks good!

29 / 45

Diagnostics: trace plots

plot(THETA.mcmc[,"theta_2"])

Looks good!

30 / 45

Diagnostics: trace plots

plot(SIGMA.mcmc[,"sigma_11"])

Looks good!

31 / 45

Diagnostics: trace plots

plot(SIGMA.mcmc[,"sigma_12"])

Looks good!

32 / 45

Diagnostics: trace plots

plot(SIGMA.mcmc[,"sigma_22"])

Looks good!

33 / 45

Diagnostics: autocorrelation

autocorr.plot(THETA.mcmc[,"theta_1"])

Looks good!

34 / 45

Diagnostics: autocorrelation

autocorr.plot(THETA.mcmc[,"theta_2"])

Looks good!

35 / 45

Diagnostics: autocorrelation

autocorr.plot(SIGMA.mcmc[,"sigma_11"])

Looks good!

36 / 45

Diagnostics: autocorrelation

autocorr.plot(SIGMA.mcmc[,"sigma_12"])

Looks good!

37 / 45

Diagnostics: autocorrelation

autocorr.plot(SIGMA.mcmc[,"sigma_22"])

Looks good!

38 / 45

Posterior distribution of the mean

39 / 45

Answering questions of interest

  • Questions of interest:

    • Do students improve in reading comprehension on average?
  • Need to compute \Pr[\theta_2 > \theta_1 | \boldsymbol{Y}]. In R,

    mean(THETA[,2]>THETA[,1])
    ## [1] 0.992
  • That is, the posterior probability is greater than 0.99, indicating strong evidence that test scores are higher on the second administration.
40 / 45

Answering questions of interest

  • Questions of interest:

    • If so, by how much?
  • Need posterior summaries of \theta_2 - \theta_1 | \boldsymbol{Y}. In R,

    mean(THETA[,2] - THETA[,1])
    ## [1] 6.385515
    quantile(THETA[,2] - THETA[,1], prob=c(0.025, 0.5, 0.975))
    ## 2.5% 50% 97.5%
    ## 1.233154 6.385597 11.551304
  • Mean (and median) improvement is \approx 6.39 points with 95% credible interval (1.23, 11.55).
41 / 45

Answering questions of interest

  • Questions of interest:

    • How correlated (positively) are the post-test and pre-test scores?
  • We can compute \Pr[\sigma_{12} > 0 | \boldsymbol{Y}]. In R,

    mean(SIGMA[,2]>0)
    ## [1] 1
  • The posterior probability that the covariance between them is positive is essentially 1.
42 / 45

Answering questions of interest

  • Questions of interest:

    • How correlated (positively) are the post-test and pre-test scores?
  • We can also look at the distribution of \rho instead. In R,

    CORR <- SIGMA[,2]/(sqrt(SIGMA[,1])*sqrt(SIGMA[,4]))
    quantile(CORR,prob=c(0.025, 0.5, 0.975))
    ## 2.5% 50% 97.5%
    ## 0.4046817 0.6850218 0.8458880
  • The median correlation between the two scores is 0.69, with a 95% quantile-based credible interval of (0.40, 0.85).

  • Because the density is skewed, we may prefer the 95% HPD interval, which is (0.45, 0.88).

    library(hdrcde)  #for highest density regions
    hdr(CORR,prob=95)$hdr
    ## [,1] [,2]
    ## 95% 0.4468522 0.8761174
43 / 45

Jeffreys' prior

  • Clearly, there's a lot of work to be done in specifying the hyperparameters (two of which are p \times p matrices).

  • What if we want to specify the priors so that we put in as little information as possible?

  • We already know how to do that somewhat with Jeffreys' priors.

  • For the multivariate normal model, it turns out that Jeffreys' rule for generating a prior distribution on (\boldsymbol{\theta}, \Sigma) gives

    \pi(\boldsymbol{\theta}, \Sigma) \propto \left|\Sigma\right|^{-\frac{(p+2)}{2}}.

  • Can we derive the full conditionals under this prior?

  • To be done on the board.

44 / 45

Jeffreys' prior

  • We can leverage previous work. For the likelihood we have both

    \begin{split} L(\boldsymbol{Y}; \boldsymbol{\theta}, \Sigma) & \propto \textrm{exp} \left\{-\dfrac{1}{2} \boldsymbol{\theta}^T(n\Sigma^{-1})\boldsymbol{\theta} + \boldsymbol{\theta}^T (n\Sigma^{-1} \bar{\boldsymbol{y}}) \right\} \end{split}

    and

    \begin{split} L(\boldsymbol{Y}; \boldsymbol{\theta}, \Sigma) & \propto \left|\Sigma\right|^{-\frac{n}{2}} \ \textrm{exp} \left\{-\dfrac{1}{2}\text{tr}\left[\boldsymbol{S}_\theta \Sigma^{-1} \right] \right\},\\ \end{split}

    where \boldsymbol{S}_\theta = \sum^n_{i=1}(\boldsymbol{y}_i - \boldsymbol{\theta})(\boldsymbol{y}_i - \boldsymbol{\theta})^T.

  • Also, we can rewrite any \mathcal{N}_p(\boldsymbol{\mu}_0, \Lambda_0) as

    \begin{split} p(\boldsymbol{\theta}) & \propto \textrm{exp} \left\{-\dfrac{1}{2} \boldsymbol{\theta}^T\Lambda_0^{-1}\boldsymbol{\theta} + \boldsymbol{\theta}^T\Lambda_0^{-1}\boldsymbol{\mu}_0 \right\}.\\ \end{split}

  • Finally, \Sigma \sim \mathcal{IW}_p(\nu_0, \boldsymbol{S}_0),

    \begin{split} \Rightarrow \ \ p(\Sigma) \ \propto \ \left|\Sigma\right|^{\frac{-(\nu_0 + p + 1)}{2}} \textrm{exp} \left\{-\dfrac{1}{2} \text{tr}(\boldsymbol{S}_0\Sigma^{-1}) \right\}. \end{split}
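
  • Looking ahead, here is a hedged sketch of what the Gibbs sampler might look like under Jeffreys' prior, assuming the board derivation yields \boldsymbol{\theta} | \Sigma, \boldsymbol{Y} \sim \mathcal{N}_p(\bar{\boldsymbol{y}}, \Sigma/n) and \Sigma | \boldsymbol{\theta}, \boldsymbol{Y} \sim \mathcal{IW}_p(n + 1, \boldsymbol{S}_\theta); it reuses n_iter, burn_in, ybar, n, and Y from the earlier code, and only the two update lines change:

    #hedged sketch: Gibbs sampler under Jeffreys' prior (verify the full conditionals on the board first)
    library(mvtnorm); library(MCMCpack)
    THETA <- SIGMA <- NULL
    Sigma <- cov(Y)   #initial value
    for (s in 1:(n_iter+burn_in)){
      theta <- rmvnorm(1, ybar, Sigma/n)                #theta | Sigma, Y
      S_theta <- (t(Y)-c(theta))%*%t(t(Y)-c(theta))
      Sigma <- riwish(n + 1, S_theta)                   #Sigma | theta, Y
      if(s > burn_in){
        THETA <- rbind(THETA,theta)
        SIGMA <- rbind(SIGMA,c(Sigma))
      }
    }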

45 / 45
