class: center, middle, inverse, title-slide # Multivariate normal model ### Dr. Olanrewaju Michael Akande ### Feb 19, 2020 --- ## Announcements - Take Survey I - Link: [https://duke.qualtrics.com/jfe/form/SV_54rrMwDxp3hmagt](https://duke.qualtrics.com/jfe/form/SV_54rrMwDxp3hmagt) - Responses are anonymized. -- ## Outline - Wrap up exercise from last class - Multivariate normal/Gaussian model + Motivating example + Inference for mean + Inference for covariance <!-- + Back to the examples --> --- ## Recap of conditional distributions - Partition `\(\boldsymbol{Y} = (Y_1,\ldots,Y_p)^T\)` as .block[ .small[ `\begin{eqnarray*} \boldsymbol{Y} = \begin{pmatrix}\boldsymbol{Y}_1\\ \boldsymbol{Y}_2 \end{pmatrix} & \sim & \mathcal{N}_p\left[\left(\begin{array}{c} \boldsymbol{\mu}_1\\ \boldsymbol{\mu}_2 \end{array}\right),\left(\begin{array}{cc} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{array}\right)\right],\\ \end{eqnarray*}` ] ] where + `\(\boldsymbol{Y}_1\)` and `\(\boldsymbol{\mu}_1\)` are `\(q \times 1\)`, + `\(\boldsymbol{Y}_2\)` and `\(\boldsymbol{\mu}_2\)` are `\((p-q) \times 1\)`, + `\(\Sigma_{11}\)` is `\(q \times q\)`, and + `\(\Sigma_{22}\)` is `\((p-q) \times (p-q)\)`, with `\(\Sigma_{22} > 0\)`. -- - Then, .block[ `$$\boldsymbol{Y}_1 | \boldsymbol{Y}_2 = \boldsymbol{y}_2 \sim \mathcal{N}_q\left(\boldsymbol{\mu}_1 + \Sigma_{12}\Sigma_{22}^{-1} (\boldsymbol{y}_2-\boldsymbol{\mu}_2), \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\right).$$` ] --- ## Working with normal distributions - Three real (univariate) random quantities `\(x\)`, `\(y\)` and `\(z\)` have a joint normal distribution given by `\(p(x,y,z) = p(y|x) p(x|z) p(z)\)`. -- - Suppose + `\(p(y|x) = \mathcal{N}(x, w)\)` independently of `\(z\)`, for some known variance `\(w\)`; + `\(p(x|z) = \mathcal{N}(\theta z, v)\)` for some known parameter `\(\theta\)`, and known variance `\(v\)`; and + `\(p(z) = \mathcal{N}(m, M)\)`, with some known mean `\(m\)`, and known variance `\(M\)`. -- - What is + `\(p(x)\)`? `\(p(y)\)`? + `\(p(x|y)\)`? `\(p(z|x)\)`? -- - **To be done on the board.** --- ## Multivariate data - Survey data often yield multivariate data of varied types. -- - **Typical survey data:** response vector `\(\boldsymbol{Y}_i = (Y_{i1},\ldots,Y_{ip})^T\)` for each person `\(i\)` in a sample of survey respondents, `\(i = 1,\ldots, n\)`. For example, we could have + `\(y_{i1} =\)` income + `\(y_{i2} =\)` level of education + `\(y_{i3} =\)` number of children + `\(y_{i4} =\)` age + `\(y_{i5} =\)` attitude -- - Interest is then often in inferring the potential associations among these variables. -- - *See [https://www.stat.washington.edu/people/pdhoff/public/coptalk.pdf](https://www.stat.washington.edu/people/pdhoff/public/coptalk.pdf)* --- ## GSS data <img src="img/GSS_PHoff.png" height="450px" style="display: block; margin: auto;" /> *See [https://www.stat.washington.edu/people/pdhoff/public/coptalk.pdf](https://www.stat.washington.edu/people/pdhoff/public/coptalk.pdf)* --- ## Conditional models - Interest is often in conditional relationships between pairs of variables, accounting for heterogeneity in other variables of less interest. -- - Consider the following models.
-- - GSS data: + **Model 1** .midsmall[$$\color{green}{\textrm{INC}}_i = \beta_0 + \beta_1 \color{red}{\textrm{CHILD}}_i + \beta_2 \textrm{DEG}_i + \beta_3 \textrm{AGE}_i + \beta_4 \textrm{PCHILD}_i + \beta_5 \textrm{PINC}_i + \beta_6 \textrm{PDEG}_i + \epsilon_i$$] p-value for `\(\beta_1\)` here is 0.11: "little evidence" that `\(\beta_1 \neq 0\)`. + **Model 2** .midsmall[$$\color{red}{\textrm{CHILD}}_i \sim \textrm{Poisson}\left(\textrm{exp}\left[ \beta_0 + \beta_1 \color{green}{\textrm{INC}}_i + \beta_2 \textrm{DEG}_i + \beta_3 \textrm{AGE}_i + \beta_4 \textrm{PCHILD}_i + \beta_5 \textrm{PINC}_i + \beta_6 \textrm{PDEG}_i \right]\right)$$] p-value for `\(\beta_1\)` here is 0.01: "strong evidence" that `\(\beta_1 \neq 0\)`. -- - Not satisfactory; better to use multivariate models instead to do this jointly. -- - *See [https://www.stat.washington.edu/people/pdhoff/public/coptalk.pdf](https://www.stat.washington.edu/people/pdhoff/public/coptalk.pdf)* --- ## Multivariate normal distribution recap - Recall that if `\(\boldsymbol{Y} = (Y_1,\ldots,Y_p)^T \sim \mathcal{N}_p(\boldsymbol{\theta}, \Sigma)\)`, then .block[ .small[ `$$f(\boldsymbol{y}) = (2\pi)^{-\frac{p}{2}} \left|\Sigma\right|^{-\frac{1}{2}} \ \textrm{exp} \left\{-\dfrac{1}{2} (\boldsymbol{y} - \boldsymbol{\theta})^T \Sigma^{-1} (\boldsymbol{y} - \boldsymbol{\theta})\right\}.$$` ] ] -- - `\(\boldsymbol{\theta}\)` is the `\(p \times 1\)` mean vector, that is, `\(\boldsymbol{\theta} = (\theta_1, \ldots,\theta_p)^T\)`. -- - `\(\Sigma\)` is the `\(p \times p\)` **positive definite** covariance matrix, that is, `\(\Sigma = \{\sigma_{jk}\}\)`, where `\(\sigma_{jk}\)` denotes the covariance between `\(Y_j\)` and `\(Y_k\)`. -- - For each `\(j = 1, \ldots, p\)`, `\(Y_j \sim \mathcal{N}(\theta_j, \sigma_{jj})\)`. -- - <div class="question"> How to do posterior inference if this is our sampling model? </div> --- ## Reading comprehension example - Twenty-two children are given a reading comprehension test before and after receiving a particular instruction method. + `\(Y_{i1}\)`: pre-instructional score for student `\(i\)`. + `\(Y_{i2}\)`: post-instructional score for student `\(i\)`. -- - Vector of observations for each student: `\(\boldsymbol{Y}_i = (Y_{i1},Y_{i2})^T\)`. -- - Clearly, we should expect some correlation between `\(Y_{i1}\)` and `\(Y_{i2}\)`. --- ## Reading comprehension example - Questions of interest: + Do students improve in reading comprehension on average? + If so, by how much? + Can we predict post-test score from pre-test score? + If there is a "significant" improvement, does that mean the instructional method is good? -- + If we have students with missing pre-test scores, can we predict the scores? -- - We will come back to this example. First, let's specify priors and see what the implied (conditional) posteriors look like. --- ## Multivariate normal likelihood - For data `\(\boldsymbol{Y}_i = (Y_{i1},\ldots,Y_{ip})^T \sim \mathcal{N}_p(\boldsymbol{\theta}, \Sigma)\)`, the likelihood is .block[ .small[ $$ `\begin{split} L(\boldsymbol{Y}; \boldsymbol{\theta}, \Sigma) & = \prod^n_{i=1} (2\pi)^{-\frac{p}{2}} \left|\Sigma\right|^{-\frac{1}{2}} \ \textrm{exp} \left\{-\dfrac{1}{2} (\boldsymbol{y}_i - \boldsymbol{\theta})^T \Sigma^{-1} (\boldsymbol{y}_i - \boldsymbol{\theta})\right\}\\ & \propto \left|\Sigma\right|^{-\frac{n}{2}} \ \textrm{exp} \left\{-\dfrac{1}{2} \sum^n_{i=1} (\boldsymbol{y}_i - \boldsymbol{\theta})^T \Sigma^{-1} (\boldsymbol{y}_i - \boldsymbol{\theta})\right\}. 
\end{split}` $$ ] ] -- - It will be super useful to be able to write the likelihood in two different formulations depending on whether we are focusing on the posterior of `\(\boldsymbol{\theta}\)` or `\(\Sigma\)`. --- ## Multivariate normal likelihood - For `\(\boldsymbol{\theta}\)`, it is convenient to write `\(L(\boldsymbol{Y}; \boldsymbol{\theta}, \Sigma)\)` as .block[ .small[ $$ `\begin{split} L(\boldsymbol{Y}; \boldsymbol{\theta}, \Sigma) & \propto \left|\Sigma\right|^{-\frac{n}{2}} \ \textrm{exp} \left\{-\dfrac{1}{2} \sum^n_{i=1} (\boldsymbol{y}_i - \boldsymbol{\theta})^T \Sigma^{-1} (\boldsymbol{y}_i - \boldsymbol{\theta})\right\}\\ & \propto \textrm{exp} \left\{-\dfrac{1}{2} \sum^n_{i=1} (\boldsymbol{y}_i^T - \boldsymbol{\theta}^T) \Sigma^{-1} (\boldsymbol{y}_i - \boldsymbol{\theta})\right\}\\ & = \textrm{exp} \left\{-\dfrac{1}{2} \sum^n_{i=1} \left[\boldsymbol{y}_i^T\Sigma^{-1}\boldsymbol{y}_i \underbrace{- \boldsymbol{y}_i^T\Sigma^{-1}\boldsymbol{\theta} - \boldsymbol{\theta}^T\Sigma^{-1}\boldsymbol{y}_i}_{\textrm{same term}} + \boldsymbol{\theta}^T\Sigma^{-1}\boldsymbol{\theta} \right] \right\}\\ & \propto \textrm{exp} \left\{-\dfrac{1}{2} \sum^n_{i=1} \left[\boldsymbol{\theta}^T\Sigma^{-1}\boldsymbol{\theta} -2\boldsymbol{\theta}^T\Sigma^{-1}\boldsymbol{y}_i \right] \right\}\\ & = \textrm{exp} \left\{-\dfrac{1}{2} \sum^n_{i=1} \boldsymbol{\theta}^T\Sigma^{-1}\boldsymbol{\theta} - \dfrac{1}{2}\sum^n_{i=1} (-2)\boldsymbol{\theta}^T\Sigma^{-1}\boldsymbol{y}_i \right\}\\ & = \textrm{exp} \left\{-\dfrac{1}{2} n \boldsymbol{\theta}^T\Sigma^{-1}\boldsymbol{\theta} + \boldsymbol{\theta}^T\Sigma^{-1} \sum^n_{i=1}\boldsymbol{y}_i \right\}\\ & = \textrm{exp} \left\{-\dfrac{1}{2} \boldsymbol{\theta}^T(n\Sigma^{-1})\boldsymbol{\theta} + \boldsymbol{\theta}^T (n\Sigma^{-1} \bar{\boldsymbol{y}}) \right\},\\ \end{split}` $$ ] ] where `\(\bar{\boldsymbol{y}} = (\bar{y}_1,\ldots,\bar{y}_p)^T\)` is the vector of sample means. --- ## Prior for the mean - A convenient specification of the joint prior is `\(\pi(\boldsymbol{\theta}, \Sigma) = \pi(\boldsymbol{\theta}) \pi(\Sigma)\)`. -- - As in the univariate case, a convenient conjugate prior distribution for `\(\boldsymbol{\theta}\)` is also normal (multivariate in this case). -- - Assume that `\(\pi(\boldsymbol{\theta}) = \mathcal{N}_p(\boldsymbol{\mu}_0, \Lambda_0)\)`.
-- - The pdf will be easier to work with if we write it as .block[ .small[ $$ `\begin{split} \pi(\boldsymbol{\theta}) & = (2\pi)^{-\frac{p}{2}} \left|\Lambda_0\right|^{-\frac{1}{2}} \ \textrm{exp} \left\{-\dfrac{1}{2} (\boldsymbol{\theta} - \boldsymbol{\mu}_0)^T \Lambda_0^{-1} (\boldsymbol{\theta} - \boldsymbol{\mu}_0)\right\}\\ & \propto \textrm{exp} \left\{-\dfrac{1}{2} (\boldsymbol{\theta} - \boldsymbol{\mu}_0)^T \Lambda_0^{-1} (\boldsymbol{\theta} - \boldsymbol{\mu}_0)\right\}\\ & = \textrm{exp} \left\{-\dfrac{1}{2} \left[\boldsymbol{\theta}^T\Lambda_0^{-1}\boldsymbol{\theta} \underbrace{- \boldsymbol{\theta}^T\Lambda_0^{-1}\boldsymbol{\mu}_0 - \boldsymbol{\mu}_0^T\Lambda_0^{-1}\boldsymbol{\theta}}_{\textrm{same term}} + \boldsymbol{\mu}_0^T\Lambda_0^{-1}\boldsymbol{\mu}_0 \right] \right\}\\ & \propto \textrm{exp} \left\{-\dfrac{1}{2} \left[\boldsymbol{\theta}^T\Lambda_0^{-1}\boldsymbol{\theta} - 2\boldsymbol{\theta}^T\Lambda_0^{-1}\boldsymbol{\mu}_0 \right] \right\}\\ & = \textrm{exp} \left\{-\dfrac{1}{2} \boldsymbol{\theta}^T\Lambda_0^{-1}\boldsymbol{\theta} + \boldsymbol{\theta}^T\Lambda_0^{-1}\boldsymbol{\mu}_0 \right\}\\ \end{split}` $$ ] ] --- ## Prior for the mean - So we have .block[ .small[ $$ `\begin{split} \pi(\boldsymbol{\theta}) & \propto \textrm{exp} \left\{-\dfrac{1}{2} \boldsymbol{\theta}^T\Lambda_0^{-1}\boldsymbol{\theta} + \boldsymbol{\theta}^T\Lambda_0^{-1}\boldsymbol{\mu}_0 \right\}.\\ \end{split}` $$ ] ] -- - **Key trick for combining with likelihood:** When the normal density is written in this form, note the following details in the exponent. + In the first part, the inverse of the *covariance matrix* `\(\Lambda_0^{-1}\)` is "sandwiched" between `\(\boldsymbol{\theta}^T\)` and `\(\boldsymbol{\theta}\)`. + In the second part, the `\(\boldsymbol{\theta}\)` in the first part is replaced (sort of) with the *mean* `\(\boldsymbol{\mu}_0\)`, with `\(\Lambda_0^{-1}\)` keeping its place. -- - The two points above will help us identify **updated means** and **updated covariance matrices** relatively quickly. 
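
---

## Prior for the mean

- To make the prior concrete, here is a minimal R sketch of specifying and sampling `\(\pi(\boldsymbol{\theta}) = \mathcal{N}_p(\boldsymbol{\mu}_0, \Lambda_0)\)` for the `\(p = 2\)` reading comprehension example; the prior values below are purely illustrative placeholders.

```r
library(MASS)  # for mvrnorm()

# illustrative prior guess for the (pre-test, post-test) mean scores
mu0 <- c(50, 50)

# illustrative prior covariance: variance 625 for each mean, correlation 0.5
L0 <- matrix(c(625,   312.5,
               312.5, 625), nrow = 2, ncol = 2)

# a few draws of theta from the N_2(mu0, L0) prior
set.seed(1)
mvrnorm(n = 5, mu = mu0, Sigma = L0)
```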
--- ## Conditional posterior for the mean - Our conditional posterior (full conditional) `\(\boldsymbol{\theta} | \Sigma , \boldsymbol{Y}\)`, is then .block[ .small[ $$ `\begin{split} \pi(\boldsymbol{\theta} | \Sigma, \boldsymbol{Y}) & \propto L(\boldsymbol{Y}; \boldsymbol{\theta}, \Sigma) \cdot \pi(\boldsymbol{\theta}) \\ \\ & \propto \underbrace{\textrm{exp} \left\{-\dfrac{1}{2} \boldsymbol{\theta}^T(n\Sigma^{-1})\boldsymbol{\theta} + \boldsymbol{\theta}^T (n\Sigma^{-1} \bar{\boldsymbol{y}}) \right\}}_{L(\boldsymbol{Y}; \boldsymbol{\theta}, \Sigma)} \cdot \underbrace{\textrm{exp} \left\{-\dfrac{1}{2} \boldsymbol{\theta}^T\Lambda_0^{-1}\boldsymbol{\theta} + \boldsymbol{\theta}^T\Lambda_0^{-1}\boldsymbol{\mu}_0 \right\}}_{\pi(\boldsymbol{\theta})} \\ \\ & = \textrm{exp} \left\{\underbrace{-\dfrac{1}{2} \boldsymbol{\theta}^T(n\Sigma^{-1})\boldsymbol{\theta} -\dfrac{1}{2} \boldsymbol{\theta}^T\Lambda_0^{-1}\boldsymbol{\theta}}_{\textrm{First parts from } L(\boldsymbol{Y}; \boldsymbol{\theta}, \Sigma) \textrm{ and } \pi(\boldsymbol{\theta})} + \underbrace{\boldsymbol{\theta}^T (n\Sigma^{-1} \bar{\boldsymbol{y}}) + \boldsymbol{\theta}^T\Lambda_0^{-1}\boldsymbol{\mu}_0}_{\textrm{Second parts from } L(\boldsymbol{Y}; \boldsymbol{\theta}, \Sigma) \textrm{ and } \pi(\boldsymbol{\theta})} \right\}\\ \\ & = \textrm{exp} \left\{-\dfrac{1}{2} \boldsymbol{\theta}^T \left[n\Sigma^{-1} + \Lambda_0^{-1}\right] \boldsymbol{\theta} + \boldsymbol{\theta}^T \left[ n\Sigma^{-1} \bar{\boldsymbol{y}} + \Lambda_0^{-1}\boldsymbol{\mu}_0 \right] \right\}, \end{split}` $$ ] ] which is just another multivariate normal distribution. --- ## Conditional posterior for the mean - To confirm the normal density and its parameters, compare to the prior kernel .block[ .small[ $$ `\begin{split} \pi(\boldsymbol{\theta}) & \propto \textrm{exp} \left\{-\dfrac{1}{2} \boldsymbol{\theta}^T\Lambda_0^{-1}\boldsymbol{\theta} + \boldsymbol{\theta}^T\Lambda_0^{-1}\boldsymbol{\mu}_0 \right\}\\ \end{split}` $$ ] ] and the posterior kernel we just derived, that is, .block[ .small[ $$ `\begin{split} \pi(\boldsymbol{\theta} | \Sigma, \boldsymbol{Y}) & \propto \textrm{exp} \left\{-\dfrac{1}{2} \boldsymbol{\theta}^T \left[\Lambda_0^{-1} + n\Sigma^{-1}\right] \boldsymbol{\theta} + \boldsymbol{\theta}^T \left[\Lambda_0^{-1}\boldsymbol{\mu}_0 + n\Sigma^{-1} \bar{\boldsymbol{y}} \right] \right\}. 
\end{split}` $$ ] ] -- - Easy to see (relatively) that `\(\boldsymbol{\theta} | \Sigma, \boldsymbol{Y} \sim \mathcal{N}_p(\boldsymbol{\mu}_n, \Lambda_n)\)`, with .block[ .small[ $$ \Lambda_n = \left[\Lambda_0^{-1} + n\Sigma^{-1}\right]^{-1} $$ ] ] and .block[ .small[ $$ \boldsymbol{\mu}_n = \Lambda_n \left[\Lambda_0^{-1}\boldsymbol{\mu}_0 + n\Sigma^{-1} \bar{\boldsymbol{y}} \right] $$ ] ] --- ## Bayesian inference - As in the univariate case, we once again have that + Posterior precision is sum of prior precision and data precision: .block[ .small[ $$ \Lambda_n^{-1} = \Lambda_0^{-1} + n\Sigma^{-1} $$ ] ] -- + Posterior expectation is weighted average of prior expectation and the sample mean: .block[ .small[ $$ `\begin{split} \boldsymbol{\mu}_n & = \Lambda_n \left[\Lambda_0^{-1}\boldsymbol{\mu}_0 + n\Sigma^{-1} \bar{\boldsymbol{y}} \right]\\ \\ & = \overbrace{\left[ \Lambda_n \Lambda_0^{-1} \right]}^{\textrm{weight on prior mean}} \underbrace{\boldsymbol{\mu}_0}_{\textrm{prior mean}} + \overbrace{\left[ \Lambda_n (n\Sigma^{-1}) \right]}^{\textrm{weight on sample mean}} \underbrace{ \bar{\boldsymbol{y}}}_{\textrm{sample mean}} \end{split}` $$ ] ] -- - Compare these to the results from the univariate case to gain more intuition. --- ## What about the covariance matrix? - In the univariate case with `\(y_i \sim \mathcal{N}(\mu, \sigma^2)\)`, the common choice for the prior is an inverse-gamma distribution for the variance `\(\sigma^2\)`. -- - As we have seen, we can rewrite as `\(y_i \sim \mathcal{N}(\mu, \tau^{-1})\)`, so that we have a gamma prior for the precision `\(\tau\)`. -- - In the multivariate normal case, we have a covariance matrix `\(\Sigma\)` instead of a scalar. -- - It would be appealing to have a matrix-valued extension of the inverse-gamma (and gamma) that would be conjugate. --- ## Positive definite and symmetric - One complication is that the covariance matrix `\(\Sigma\)` must be **positive definite and symmetric**. -- - "Positive definite" means that for all nonzero `\(x \in \mathcal{R}^p\)`, `\(x^T \Sigma x > 0\)`. -- - This basically ensures that the diagonal elements of `\(\Sigma\)` (corresponding to the marginal variances) are positive. -- - It also ensures that the correlation coefficients for each pair of variables are between -1 and 1. -- - Our prior for `\(\Sigma\)` should thus assign probability one to the set of positive definite matrices. -- - Analogous to the univariate case, the .hlight[inverse-Wishart distribution] is the corresponding conditionally conjugate prior for `\(\Sigma\)` (multivariate generalization of the inverse-gamma). -- - The textbook covers the construction of Wishart and inverse-Wishart random variables. We will skip the actual development in class but will write code to sample random variates. --- ## Inverse-Wishart distribution - A random variable `\(\Sigma \sim \textrm{IW}_p(\nu_0, \boldsymbol{S}_0)\)`, where `\(\Sigma\)` is positive definite and `\(p \times p\)`, has pdf .block[ .small[ $$ `\begin{split} p(\Sigma) \ \propto \ \left|\Sigma\right|^{\frac{-(\nu_0 + p + 1)}{2}} \textrm{exp} \left\{-\dfrac{1}{2} \text{tr}(\boldsymbol{S}_0\Sigma^{-1}) \right\}, \end{split}` $$ ] ] where + `\(\text{tr}(\cdot)\)` is the **trace function** (sum of diagonal elements), + `\(\nu_0 > p - 1\)` is the "degrees of freedom", and + `\(\boldsymbol{S}_0\)` is a `\(p \times p\)` positive definite matrix. -- - For this distribution, `\(\mathbb{E}[\Sigma] = \dfrac{1}{\nu_0 - p - 1} \boldsymbol{S}_0\)`, for `\(\nu_0 > p + 1\)`.
-- - Hence, `\(\boldsymbol{S}_0\)` is the scaled mean of the `\(\textrm{IW}_p(\nu_0, \boldsymbol{S}_0)\)`. --- ## Inverse-Wishart distribution - If we are very confident in a prior guess `\(\Sigma_0\)` for `\(\Sigma\)`, then we might set + `\(\nu_0\)`, the degrees of freedom, to be very large, and + `\(\boldsymbol{S}_0 = (\nu_0 - p - 1)\Sigma_0\)`. In this case, `\(\mathbb{E}[\Sigma] = \dfrac{1}{\nu_0 - p - 1} \boldsymbol{S}_0 = \dfrac{1}{\nu_0 - p - 1}(\nu_0 - p - 1)\Sigma_0 = \Sigma_0\)`, and `\(\Sigma\)` is tightly (depending on the value of `\(\nu_0\)`) centered around `\(\Sigma_0\)`. -- - If we are not at all confident but we still have a prior guess `\(\Sigma_0\)`, we might set + `\(\nu_0 = p + 2\)`, so that `\(\mathbb{E}[\Sigma] = \dfrac{1}{\nu_0 - p - 1} \boldsymbol{S}_0\)` is finite. + `\(\boldsymbol{S}_0 = \Sigma_0\)`. Here, `\(\mathbb{E}[\Sigma] = \Sigma_0\)` as before, but `\(\Sigma\)` is only loosely centered around `\(\Sigma_0\)`. --- ## Wishart distribution - Just as we had with the gamma and inverse-gamma relationship in the univariate case, we can also work in terms of the .hlight[Wishart distribution] (multivariate generalization of the gamma) instead. -- - The .hlight[Wishart distribution] provides a conditionally conjugate prior for the precision matrix `\(\Sigma^{-1}\)` in a multivariate normal model. -- - Specifically, if `\(\Sigma \sim \textrm{IW}_p(\nu_0, \boldsymbol{S}_0)\)`, then `\(\Phi = \Sigma^{-1} \sim \textrm{W}_p(\nu_0, \boldsymbol{S}_0^{-1})\)`. -- - A random variable `\(\Phi \sim \textrm{W}_p(\nu_0, \boldsymbol{S}_0^{-1})\)`, where `\(\Phi\)` has dimension `\((p \times p)\)`, has pdf .block[ .small[ $$ `\begin{split} f(\Phi) \ \propto \ \left|\Phi\right|^{\frac{\nu_0 - p - 1}{2}} \textrm{exp} \left\{-\dfrac{1}{2} \text{tr}(\boldsymbol{S}_0\Phi) \right\}. \end{split}` $$ ] ] -- - Here, `\(\mathbb{E}[\Phi] = \nu_0 \boldsymbol{S}_0^{-1}\)`. -- - Note that the textbook writes the inverse-Wishart as `\(\textrm{IW}_p(\nu_0, \boldsymbol{S}_0^{-1})\)`. I prefer `\(\textrm{IW}_p(\nu_0, \boldsymbol{S}_0)\)` instead. Feel free to use either notation but try not to get confused. --- ## Back to inference on covariance - For inference on `\(\boldsymbol{\Sigma}\)`, we need to rewrite the likelihood a bit to match the inverse-Wishart kernel. -- - First, a few results from matrix algebra: 1. `\(\text{tr}(\boldsymbol{A}) = \sum^p_{j=1}a_{jj}\)`, where `\(a_{jj}\)` is the `\(j\)`th diagonal element of a square `\(p \times p\)` matrix `\(\boldsymbol{A}\)`. -- 2. Cyclic property: `$$\text{tr}(\boldsymbol{A}\boldsymbol{B}\boldsymbol{C}) = \text{tr}(\boldsymbol{B}\boldsymbol{C}\boldsymbol{A}) = \text{tr}(\boldsymbol{C}\boldsymbol{A}\boldsymbol{B}),$$` given that the product `\(\boldsymbol{A}\boldsymbol{B}\boldsymbol{C}\)` is a square matrix. -- 3. If `\(\boldsymbol{A}\)` is a `\(p \times p\)` matrix, then for a `\(p \times 1\)` vector `\(\boldsymbol{x}\)`, `$$\boldsymbol{x}^T \boldsymbol{A} \boldsymbol{x} = \text{tr}(\boldsymbol{x}^T \boldsymbol{A} \boldsymbol{x})$$` holds by (1), since `\(\boldsymbol{x}^T \boldsymbol{A} \boldsymbol{x}\)` is a scalar. -- 4. `\(\text{tr}(\boldsymbol{A} + \boldsymbol{B}) = \text{tr}(\boldsymbol{A}) + \text{tr}(\boldsymbol{B})\)`. <!-- 2. `\(\sum^K_{k=1} \boldsymbol{b}_k^T \boldsymbol{A} \boldsymbol{b}_k = tr(\boldsymbol{B}^T\boldsymbol{B}\boldsymbol{A})\)`, where `\(\boldsymbol{B}\)` is the matrix whose `\(k\)`th row is `\(\boldsymbol{b}_k^T\)`.
--> --- ## Multivariate normal likelihood again - It is thus convenient to rewrite `\(L(\boldsymbol{Y}; \boldsymbol{\theta}, \Sigma)\)` as .block[ .small[ $$ `\begin{split} L(\boldsymbol{Y}; \boldsymbol{\theta}, \Sigma) & \propto \underbrace{\left|\Sigma\right|^{-\frac{n}{2}} \ \textrm{exp} \left\{-\dfrac{1}{2} \sum^n_{i=1} (\boldsymbol{y}_i - \boldsymbol{\theta})^T \Sigma^{-1} (\boldsymbol{y}_i - \boldsymbol{\theta})\right\}}_{\text{no algebra/change yet}}\\ & = \left|\Sigma\right|^{-\frac{n}{2}} \ \textrm{exp} \left\{-\dfrac{1}{2} \sum^n_{i=1} \underbrace{\text{tr}\left[(\boldsymbol{y}_i - \boldsymbol{\theta})^T \Sigma^{-1} (\boldsymbol{y}_i - \boldsymbol{\theta}) \right]}_{\text{by result 3}} \right\}\\ & = \left|\Sigma\right|^{-\frac{n}{2}} \ \textrm{exp} \left\{-\dfrac{1}{2} \sum^n_{i=1} \underbrace{\text{tr}\left[(\boldsymbol{y}_i - \boldsymbol{\theta})(\boldsymbol{y}_i - \boldsymbol{\theta})^T \Sigma^{-1} \right]}_{\text{by cyclic property}} \right\}\\ & = \left|\Sigma\right|^{-\frac{n}{2}} \ \textrm{exp} \left\{-\dfrac{1}{2} \underbrace{\text{tr}\left[\sum^n_{i=1} (\boldsymbol{y}_i - \boldsymbol{\theta})(\boldsymbol{y}_i - \boldsymbol{\theta})^T \Sigma^{-1} \right]}_{\text{by result 4}} \right\}\\ & = \left|\Sigma\right|^{-\frac{n}{2}} \ \textrm{exp} \left\{-\dfrac{1}{2}\text{tr}\left[\boldsymbol{S}_\theta \Sigma^{-1} \right] \right\},\\ \end{split}` $$ ] ] where `\(\boldsymbol{S}_\theta = \sum^n_{i=1}(\boldsymbol{y}_i - \boldsymbol{\theta})(\boldsymbol{y}_i - \boldsymbol{\theta})^T\)` is the residual sum of squares matrix. --- ## Conditional posterior for covariance - Assuming `\(\pi(\Sigma) = \textrm{IW}_p(\nu_0, \boldsymbol{S}_0)\)`, the conditional posterior (full conditional) `\(\Sigma | \boldsymbol{\theta}, \boldsymbol{Y}\)`, is then .block[ .small[ $$ `\begin{split} \pi(\Sigma| \boldsymbol{\theta}, \boldsymbol{Y}) & \propto L(\boldsymbol{Y}; \boldsymbol{\theta}, \Sigma) \cdot \pi(\Sigma) \\ \\ & \propto \underbrace{\left|\Sigma\right|^{-\frac{n}{2}} \ \textrm{exp} \left\{-\dfrac{1}{2}\text{tr}\left[\boldsymbol{S}_\theta \Sigma^{-1} \right] \right\}}_{L(\boldsymbol{Y}; \boldsymbol{\theta}, \Sigma)} \cdot \underbrace{\left|\Sigma\right|^{\frac{-(\nu_0 + p + 1)}{2}} \textrm{exp} \left\{-\dfrac{1}{2} \text{tr}(\boldsymbol{S}_0\Sigma^{-1}) \right\}}_{\pi(\Sigma)} \\ \\ & \propto \left|\Sigma\right|^{\frac{-(\nu_0 + p + n + 1)}{2}} \textrm{exp} \left\{-\dfrac{1}{2} \text{tr}\left[\boldsymbol{S}_0\Sigma^{-1} + \boldsymbol{S}_\theta \Sigma^{-1} \right] \right\} ,\\ \\ & \propto \left|\Sigma\right|^{\frac{-(\nu_0 + n + p + 1)}{2}} \textrm{exp} \left\{-\dfrac{1}{2} \text{tr}\left[ \left(\boldsymbol{S}_0 + \boldsymbol{S}_\theta \right) \Sigma^{-1} \right] \right\} ,\\ \end{split}` $$ ] ] which is `\(\textrm{IW}_p(\nu_n, \boldsymbol{S}_n)\)`, or using the notation in the book, `\(\textrm{IW}_p(\nu_n, \boldsymbol{S}_n^{-1} )\)`, with + `\(\nu_n = \nu_0 + n\)`, and + `\(\boldsymbol{S}_n = \left[\boldsymbol{S}_0 + \boldsymbol{S}_\theta \right]\)` --- ## Conditional posterior for covariance - We once again see that the "posterior sample size" or "posterior degrees of freedom" `\(\nu_n\)` is the sum of the "prior degrees of freedom" `\(\nu_0\)` and the data sample size `\(n\)`. -- - `\(\boldsymbol{S}_n\)` can be thought of as the "posterior sum of squares", which is the sum of "prior sum of squares" plus "sample sum of squares".
-- - Recall that if `\(\Sigma \sim \textrm{IW}_p(\nu_0, \boldsymbol{S}_0)\)`, then `\(\mathbb{E}[\Sigma] = \dfrac{1}{\nu_0 - p - 1} \boldsymbol{S}_0\)`. -- - `\(\Rightarrow\)` the conditional posterior expectation of the population covariance is .block[ .small[ $$ `\begin{split} \mathbb{E}[\Sigma | \boldsymbol{\theta}, \boldsymbol{Y}] & = \dfrac{1}{\nu_0 + n - p - 1} \left[\boldsymbol{S}_0 + \boldsymbol{S}_\theta \right]\\ \\ & = \underbrace{\dfrac{\nu_0 - p - 1}{\nu_0 + n - p - 1}}_{\text{weight on prior expectation}} \overbrace{\left[\dfrac{1}{\nu_0 - p - 1} \boldsymbol{S}_0\right]}^{\text{prior expectation}} + \underbrace{\dfrac{n}{\nu_0 + n - p - 1}}_{\text{weight on sample estimate}} \overbrace{\left[\dfrac{1}{n} \boldsymbol{S}_\theta \right]}^{\text{sample estimate}},\\ \end{split}` $$ ] ] which is a weighted average of prior expectation and sample estimate.
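
---

## Sampling from the full conditionals

- To preview the computation promised earlier, here is a minimal R sketch of one Gibbs-style update of `\((\boldsymbol{\theta}, \Sigma)\)` using the two full conditionals we derived. The function name and all inputs (`Y`, `mu0`, `L0`, `nu0`, `S0`) are illustrative placeholders rather than anything defined on these slides.

```r
library(MASS)  # for mvrnorm()

# One update of (theta, Sigma) given an n x p data matrix Y, the current
# Sigma, and prior parameters mu0, L0 (for theta) and nu0, S0 (for Sigma).
update_theta_Sigma <- function(Y, Sigma, mu0, L0, nu0, S0) {
  n      <- nrow(Y)
  ybar   <- colMeans(Y)
  iSigma <- solve(Sigma)

  # full conditional for theta: N_p(mun, Ln)
  Ln    <- solve(solve(L0) + n * iSigma)
  mun   <- drop(Ln %*% (solve(L0) %*% mu0 + n * iSigma %*% ybar))
  theta <- mvrnorm(n = 1, mu = mun, Sigma = Ln)

  # full conditional for Sigma: IW_p(nu0 + n, S0 + S_theta)
  res     <- sweep(Y, 2, theta)    # rows are y_i - theta
  S_theta <- t(res) %*% res        # residual sum of squares matrix
  Phi     <- rWishart(1, df = nu0 + n, Sigma = solve(S0 + S_theta))[, , 1]
  Sigma   <- solve(Phi)            # a draw from the inverse-Wishart

  list(theta = theta, Sigma = Sigma)
}
```

- Iterating this update gives a Gibbs sampler for `\((\boldsymbol{\theta}, \Sigma)\)`; the inverse-Wishart draw uses the fact that `\(\Sigma^{-1} \sim \textrm{W}_p(\nu_n, \boldsymbol{S}_n^{-1})\)`.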