
Multivariate normal model cont’d

Dr. Olanrewaju Michael Akande

Feb 21, 2020

1 / 45

Announcements

  • Homework 5 will be online by 5pm today.

Outline

  • Chat on survey responses

  • Multivariate normal/Gaussian model

    • Inference for mean (recap)

    • Inference for covariance

    • Back to the example

    • Gibbs sampler

    • Jeffreys' prior

2 / 45

Multivariate normal model

3 / 45

Conditional inference on mean recap

  • For data \boldsymbol{Y}_i = (Y_{i1},\ldots,Y_{ip})^T \sim \mathcal{N}_p(\boldsymbol{\theta}, \Sigma),

    \begin{split} L(\boldsymbol{Y}; \boldsymbol{\theta}, \Sigma) & = \prod^n_{i=1} (2\pi)^{-\frac{p}{2}} \left|\Sigma\right|^{-\frac{1}{2}} \ \textrm{exp} \left\{-\dfrac{1}{2} (\boldsymbol{y}_i - \boldsymbol{\theta})^T \Sigma^{-1} (\boldsymbol{y}_i - \boldsymbol{\theta})\right\}\\ & \propto \textrm{exp} \left\{-\dfrac{1}{2} \boldsymbol{\theta}^T(n\Sigma^{-1})\boldsymbol{\theta} + \boldsymbol{\theta}^T (n\Sigma^{-1} \bar{\boldsymbol{y}}) \right\}. \end{split}

  • If we assume \pi(\boldsymbol{\theta}) = \mathcal{N}_p(\boldsymbol{\mu}_0, \Lambda_0), that is,

    \begin{split} \pi(\boldsymbol{\theta}) & \propto \textrm{exp} \left\{-\dfrac{1}{2} \boldsymbol{\theta}^T\Lambda_0^{-1}\boldsymbol{\theta} + \boldsymbol{\theta}^T\Lambda_0^{-1}\boldsymbol{\mu}_0 \right\}\\ \end{split}

  • Then

    \begin{split} \pi(\boldsymbol{\theta} | \Sigma, \boldsymbol{Y}) & \propto \textrm{exp} \left\{-\dfrac{1}{2} \boldsymbol{\theta}^T \left[\Lambda_0^{-1} + n\Sigma^{-1}\right] \boldsymbol{\theta} + \boldsymbol{\theta}^T \left[\Lambda_0^{-1}\boldsymbol{\mu}_0 + n\Sigma^{-1} \bar{\boldsymbol{y}} \right] \right\} \ \equiv \ \mathcal{N}_p(\boldsymbol{\mu}_n, \Lambda_n) \end{split}

    where

    \begin{split} \Lambda_n & = \left[\Lambda_0^{-1} + n\Sigma^{-1}\right]^{-1}\\ \boldsymbol{\mu}_n & = \Lambda_n \left[\Lambda_0^{-1}\boldsymbol{\mu}_0 + n\Sigma^{-1} \bar{\boldsymbol{y}} \right]. \end{split}

4 / 45

Conditional inference on mean recap

  • As in the univariate case, we once again have that

    • Posterior precision is sum of prior precision and data precision:

      \Lambda_n^{-1} = \Lambda_0^{-1} + n\Sigma^{-1}

    • Posterior expectation is weighted average of prior expectation and the sample mean:

      \begin{split} \boldsymbol{\mu}_n & = \Lambda_n \left[\Lambda_0^{-1}\boldsymbol{\mu}_0 + n\Sigma^{-1} \bar{\boldsymbol{y}} \right]\\ \\ & = \overbrace{\left[ \Lambda_n \Lambda_0^{-1} \right]}^{\textrm{weight on prior mean}} \underbrace{\boldsymbol{\mu}_0}_{\textrm{prior mean}} + \overbrace{\left[ \Lambda_n (n\Sigma^{-1}) \right]}^{\textrm{weight on sample mean}} \underbrace{ \bar{\boldsymbol{y}}}_{\textrm{sample mean}} \end{split}

  • Compare these to the results from the univariate case to gain more intuition.
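
  • For extra intuition, here is a minimal R sketch of these two updates (not from the slides; it assumes mu_0, Lambda_0, the current Sigma, ybar, and n already exist in the workspace, as they do in the Gibbs sampler later in the deck):

    Lambda_n <- solve(solve(Lambda_0) + n*solve(Sigma))   #posterior covariance: [Lambda_0^{-1} + n Sigma^{-1}]^{-1}
    mu_n <- Lambda_n %*% (solve(Lambda_0)%*%mu_0 + n*solve(Sigma)%*%ybar)
    #the two weight matrices in the weighted average; they sum to the identity
    W_prior <- Lambda_n %*% solve(Lambda_0)
    W_data <- Lambda_n %*% (n*solve(Sigma))
    W_prior + W_data   #should be (numerically) the identity matrix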

5 / 45

What about the covariance matrix?

  • A random variable \Sigma \sim \mathcal{IW}_p(\nu_0, \boldsymbol{S}_0), where \Sigma is positive definite and p \times p, has pdf

    \begin{split} p(\Sigma) \ \propto \ \left|\Sigma\right|^{\frac{-(\nu_0 + p + 1)}{2}} \textrm{exp} \left\{-\dfrac{1}{2} \text{tr}(\boldsymbol{S}_0\Sigma^{-1}) \right\}, \end{split}

    where

    • \text{tr}(\cdot) is the trace function (sum of diagonal elements),
    • \nu_0 > p - 1 is the "degrees of freedom", and
    • \boldsymbol{S}_0 is a p \times p positive definite matrix.
  • For this distribution, \mathbb{E}[\Sigma] = \dfrac{1}{\nu_0 - p - 1} \boldsymbol{S}_0, for \nu_0 > p + 1.

  • Hence, \boldsymbol{S}_0 is the scaled mean of the \mathcal{IW}_p(\nu_0, \boldsymbol{S}_0).
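
  • As a quick numerical sanity check (a sketch, not from the slides; the particular \nu_0 and \boldsymbol{S}_0 below are arbitrary), we can draw from this distribution with riwish() from the MCMCpack package, which is also used in the Gibbs code later, and compare the Monte Carlo average to \dfrac{1}{\nu_0 - p - 1} \boldsymbol{S}_0:

    library(MCMCpack)   #riwish(v, S) draws from the inverse-Wishart in this parameterization
    set.seed(1)
    p <- 2; nu_0 <- 6
    S_0 <- matrix(c(2,0.5,0.5,1),nrow=2,ncol=2)
    draws <- replicate(5000, riwish(nu_0, S_0))   #a 2 x 2 x 5000 array of draws
    apply(draws, c(1,2), mean)    #Monte Carlo average of Sigma
    S_0/(nu_0 - p - 1)            #theoretical mean S_0/(nu_0 - p - 1)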

6 / 45

Wishart distribution

  • If we are very confident in a prior guess \Sigma_0 for \Sigma, then we might set

    • \nu_0, the degrees of freedom, to be very large, and
    • \boldsymbol{S}_0 = (\nu_0 - p - 1)\Sigma_0.

    In this case, \mathbb{E}[\Sigma] = \dfrac{1}{\nu_0 - p - 1} \boldsymbol{S}_0 = \dfrac{1}{\nu_0 - p - 1}(\nu_0 - p - 1)\Sigma_0 = \Sigma_0, and \Sigma is tightly (depending on the value of \nu_0) centered around \Sigma_0.

  • If we are not at all confident but we still have a prior guess \Sigma_0, we might set

    • \nu_0 = p + 2, so that \mathbb{E}[\Sigma] = \dfrac{1}{\nu_0 - p - 1} \boldsymbol{S}_0 is finite, and
    • \boldsymbol{S}_0 = \Sigma_0.

    Here, \mathbb{E}[\Sigma] = \Sigma_0 as before, but \Sigma is only loosely centered around \Sigma_0.

7 / 45

Wishart distribution

  • Just as we had with the gamma and inverse-gamma relationship in the univariate case, we can also work in terms of the Wishart distribution (multivariate generalization of the gamma) instead.

  • The Wishart distribution provides a conditionally-conjugate prior for the precision matrix \Sigma^{-1} in a multivariate normal model.

  • Specifically, if \Sigma \sim \mathcal{IW}_p(\nu_0, \boldsymbol{S}_0), then \Phi = \Sigma^{-1} \sim \textrm{W}_p(\nu_0, \boldsymbol{S}_0^{-1}).

  • A random variable \Phi \sim \textrm{W}_p(\nu_0, \boldsymbol{S}_0^{-1}), where \Phi has dimension (p \times p), has pdf

    \begin{split} f(\Phi) \ \propto \ \left|\Phi\right|^{\frac{\nu_0 - p - 1}{2}} \textrm{exp} \left\{-\dfrac{1}{2} \text{tr}(\boldsymbol{S}_0\Phi) \right\}. \end{split}

  • Here, \mathbb{E}[\Phi] = \nu_0 \boldsymbol{S}_0^{-1}.

  • Note that the textbook writes the inverse-Wishart as \mathcal{IW}_p(\nu_0, \boldsymbol{S}_0^{-1}). I prefer \mathcal{IW}_p(\nu_0, \boldsymbol{S}_0) instead. Feel free to use either notation but try not to get confused.
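
  • A small numerical check of the inverse relationship (a sketch, not from the slides; the \nu_0 and \boldsymbol{S}_0 values are arbitrary): drawing \Sigma \sim \mathcal{IW}_p(\nu_0, \boldsymbol{S}_0) with riwish() and inverting each draw should give an average close to \nu_0 \boldsymbol{S}_0^{-1}.

    library(MCMCpack)
    set.seed(42)
    p <- 2; nu_0 <- 6
    S_0 <- matrix(c(2,0.5,0.5,1),nrow=2,ncol=2)
    Phi_draws <- replicate(5000, solve(riwish(nu_0, S_0)))   #Phi = Sigma^{-1}
    apply(Phi_draws, c(1,2), mean)   #Monte Carlo average of Phi
    nu_0*solve(S_0)                  #E[Phi] = nu_0*S_0^{-1}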

8 / 45

Back to inference on covariance

  • For inference on \boldsymbol{\Sigma}, we need to rewrite the likelihood a bit to match the inverse-Wishart kernel.

  • First, a few results from matrix algebra (verified numerically in the short R check after this list):

    1. \text{tr}(\boldsymbol{A}) = \sum^p_{j=1}a_{jj}, where a_{jj} is the jth diagonal element of a square p \times p matrix \boldsymbol{A}.
    2. Cyclic property: \text{tr}(\boldsymbol{A}\boldsymbol{B}\boldsymbol{C}) = \text{tr}(\boldsymbol{B}\boldsymbol{C}\boldsymbol{A}) = \text{tr}(\boldsymbol{C}\boldsymbol{A}\boldsymbol{B}), given that the product \boldsymbol{A}\boldsymbol{B}\boldsymbol{C} is a square matrix.
    3. If \boldsymbol{A} is a p \times p matrix, then for a p \times 1 vector \boldsymbol{x}, \boldsymbol{x}^T \boldsymbol{A} \boldsymbol{x} = \text{tr}(\boldsymbol{x}^T \boldsymbol{A} \boldsymbol{x}) holds by (1), since \boldsymbol{x}^T \boldsymbol{A} \boldsymbol{x} is a scalar.
    4. \text{tr}(\boldsymbol{A} + \boldsymbol{B}) = \text{tr}(\boldsymbol{A}) + \text{tr}(\boldsymbol{B}).
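
  • The check below is a sketch (not from the slides), using random matrices and a random vector to confirm these results numerically:

    set.seed(1)
    A <- matrix(rnorm(9),3,3); B <- matrix(rnorm(9),3,3); C <- matrix(rnorm(9),3,3)
    x <- rnorm(3)
    sum(diag(A%*%B%*%C)); sum(diag(B%*%C%*%A)); sum(diag(C%*%A%*%B))   #cyclic property (result 2)
    c(t(x)%*%A%*%x); sum(diag(x%*%t(x)%*%A))   #results 2 and 3: x^T A x = tr(x^T A x) = tr(x x^T A)
    sum(diag(A + B)); sum(diag(A)) + sum(diag(B))                      #result 4
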
9 / 45

Multivariate normal likelihood again

  • It is thus convenient to rewrite L(\boldsymbol{Y}; \boldsymbol{\theta}, \Sigma) as

    \begin{split} L(\boldsymbol{Y}; \boldsymbol{\theta}, \Sigma) & \propto \underbrace{\left|\Sigma\right|^{-\frac{n}{2}} \ \textrm{exp} \left\{-\dfrac{1}{2} \sum^n_{i=1} (\boldsymbol{y}_i - \boldsymbol{\theta})^T \Sigma^{-1} (\boldsymbol{y}_i - \boldsymbol{\theta})\right\}}_{\text{no algebra/change yet}}\\ & = \left|\Sigma\right|^{-\frac{n}{2}} \ \textrm{exp} \left\{-\dfrac{1}{2} \sum^n_{i=1} \underbrace{\text{tr}\left[(\boldsymbol{y}_i - \boldsymbol{\theta})^T \Sigma^{-1} (\boldsymbol{y}_i - \boldsymbol{\theta}) \right]}_{\text{by result 3}} \right\}\\ & = \left|\Sigma\right|^{-\frac{n}{2}} \ \textrm{exp} \left\{-\dfrac{1}{2} \sum^n_{i=1} \underbrace{\text{tr}\left[(\boldsymbol{y}_i - \boldsymbol{\theta})(\boldsymbol{y}_i - \boldsymbol{\theta})^T \Sigma^{-1} \right]}_{\text{by cyclic property}} \right\}\\ & = \left|\Sigma\right|^{-\frac{n}{2}} \ \textrm{exp} \left\{-\dfrac{1}{2} \underbrace{\text{tr}\left[\sum^n_{i=1} (\boldsymbol{y}_i - \boldsymbol{\theta})(\boldsymbol{y}_i - \boldsymbol{\theta})^T \Sigma^{-1} \right]}_{\text{by result 4}} \right\}\\ & = \left|\Sigma\right|^{-\frac{n}{2}} \ \textrm{exp} \left\{-\dfrac{1}{2}\text{tr}\left[\boldsymbol{S}_\theta \Sigma^{-1} \right] \right\},\\ \end{split}

    where \boldsymbol{S}_\theta = \sum^n_{i=1}(\boldsymbol{y}_i - \boldsymbol{\theta})(\boldsymbol{y}_i - \boldsymbol{\theta})^T is the residual sum of squares matrix.
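
  • In R, \boldsymbol{S}_\theta can be computed either as an explicit sum of outer products or in the compact matrix form used in the Gibbs sampler later in the deck. A minimal sketch (assuming a data matrix Y of dimension n \times p and a current value theta of length p are in the workspace):

    S_theta_sum <- Reduce(`+`, lapply(1:nrow(Y), function(i) tcrossprod(Y[i,] - theta)))
    S_theta_mat <- (t(Y)-c(theta))%*%t(t(Y)-c(theta))   #form used in the Gibbs sampler
    max(abs(S_theta_sum - S_theta_mat))                 #should be (numerically) zero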

10 / 45

Conditional posterior for covariance

  • Assuming \pi(\Sigma) = \mathcal{IW}_p(\nu_0, \boldsymbol{S}_0), the conditional posterior (full conditional) of \Sigma | \boldsymbol{\theta}, \boldsymbol{Y} is then

    \begin{split} \pi(\Sigma| \boldsymbol{\theta}, \boldsymbol{Y}) & \propto L(\boldsymbol{Y}; \boldsymbol{\theta}, \Sigma) \cdot \pi(\Sigma) \\ \\ & \propto \underbrace{\left|\Sigma\right|^{-\frac{n}{2}} \ \textrm{exp} \left\{-\dfrac{1}{2}\text{tr}\left[\boldsymbol{S}_\theta \Sigma^{-1} \right] \right\}}_{L(\boldsymbol{Y}; \boldsymbol{\theta}, \Sigma)} \cdot \underbrace{\left|\Sigma\right|^{\frac{-(\nu_0 + p + 1)}{2}} \textrm{exp} \left\{-\dfrac{1}{2} \text{tr}(\boldsymbol{S}_0\Sigma^{-1}) \right\}}_{\pi(\Sigma)} \\ \\ & \propto \left|\Sigma\right|^{\frac{-(\nu_0 + p + n + 1)}{2}} \textrm{exp} \left\{-\dfrac{1}{2} \text{tr}\left[\boldsymbol{S}_0\Sigma^{-1} + \boldsymbol{S}_\theta \Sigma^{-1} \right] \right\}\\ \\ & \propto \left|\Sigma\right|^{\frac{-(\nu_0 + n + p + 1)}{2}} \textrm{exp} \left\{-\dfrac{1}{2} \text{tr}\left[ \left(\boldsymbol{S}_0 + \boldsymbol{S}_\theta \right) \Sigma^{-1} \right] \right\},\\ \end{split}

    which is \mathcal{IW}_p(\nu_n, \boldsymbol{S}_n), or using the notation in the book, \mathcal{IW}_p(\nu_n, \boldsymbol{S}_n^{-1} ), with

    • \nu_n = \nu_0 + n, and
    • \boldsymbol{S}_n = \left[\boldsymbol{S}_0 + \boldsymbol{S}_\theta \right]
11 / 45

Conditional posterior for covariance

  • We once again see that the "posterior sample size" or "posterior degrees of freedom" \nu_n is the sum of the "prior degrees of freedom" \nu_0 and the data sample size n.

  • \boldsymbol{S}_n can be thought of as the "posterior sum of squares", which is the sum of "prior sum of squares" plus "sample sum of squares".

  • Recall that if \Sigma \sim \mathcal{IW}_p(\nu_0, \boldsymbol{S}_0), then \mathbb{E}[\Sigma] = \dfrac{1}{\nu_0 - p - 1} \boldsymbol{S}_0.

  • \Rightarrow the conditional posterior expectation of the population covariance is

    \begin{split} \mathbb{E}[\Sigma | \boldsymbol{\theta}, \boldsymbol{Y}] & = \dfrac{1}{\nu_0 + n - p - 1} \left[\boldsymbol{S}_0 + \boldsymbol{S}_\theta \right]\\ \\ & = \underbrace{\dfrac{\nu_0 - p - 1}{\nu_0 + n - p - 1}}_{\text{weight on prior expectation}} \overbrace{\left[\dfrac{1}{\nu_0 - p - 1} \boldsymbol{S}_0\right]}^{\text{prior expectation}} + \underbrace{\dfrac{n}{\nu_0 + n - p - 1}}_{\text{weight on sample estimate}} \overbrace{\left[\dfrac{1}{n} \boldsymbol{S}_\theta \right]}^{\text{sample estimate}},\\ \end{split}

    which is a weighted average of prior expectation and sample estimate.
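
  • As a quick algebra check (a sketch, not from the slides; it assumes p, n, \nu_0, \boldsymbol{S}_0, and a current \boldsymbol{S}_\theta exist in the workspace), the two expressions above agree:

    direct <- (S_0 + S_theta)/(nu_0 + n - p - 1)
    weighted <- ((nu_0 - p - 1)/(nu_0 + n - p - 1))*(S_0/(nu_0 - p - 1)) +
                (n/(nu_0 + n - p - 1))*(S_theta/n)
    max(abs(direct - weighted))   #should be (numerically) zero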

12 / 45

Reading comprehension example again

  • Vector of observations for each student: \boldsymbol{Y}_i = (Y_{i1},Y_{i2})^T.

    • Y_{i1}: pre-instructional score for student i.

    • Y_{i2}: post-instructional score for student i.

  • Questions of interest:

    • Do students improve in reading comprehension on average?

    • If so, by how much?

    • Can we predict post-test score from pre-test score? How correlated are they?

    • If we have students with missing pre-test scores, can we predict the scores? (Will defer this till next week!)

13 / 45

Reading comprehension example

  • Since we have bivariate continuous responses for each student, and test scores are often normally distributed, we can use a bivariate normal model.

  • Model the data as \boldsymbol{Y_i} = (Y_{i1},Y_{i2})^T \sim \mathcal{N}_2(\boldsymbol{\theta}, \Sigma), that is,

    \begin{eqnarray*} \boldsymbol{Y}_i = \begin{pmatrix}Y_{i1}\\ Y_{i2} \end{pmatrix} & \sim & \mathcal{N}_2\left[\boldsymbol{\theta} = \left(\begin{array}{c} \theta_1\\ \theta_2 \end{array}\right),\Sigma = \left(\begin{array}{cc} \sigma^2_1 & \sigma_{12} \\ \sigma_{21} & \sigma^2_2 \end{array}\right)\right].\\ \end{eqnarray*}

  • We can answer the first two questions of interest by looking at the posterior distribution of \theta_2 - \theta_1.

  • The correlation between Y_1 and Y_2 is

    \rho = \dfrac{\sigma_{12}}{\sigma_1 \sigma_2},

    so we can answer the third question by looking at the posterior distribution of \rho, which we have once we have posterior samples of \Sigma.

14 / 45

Reading example: prior on mean

  • Clearly, we first need to set the hyperparameters \boldsymbol{\mu}_0 and \Lambda_0 in \pi(\boldsymbol{\theta}) = \mathcal{N}_2(\boldsymbol{\mu}_0, \Lambda_0), based on prior belief.

  • For this example, both tests were actually designed a priori to have a mean of 50, so we can set \boldsymbol{\mu}_0 = (\mu_{0(1)},\mu_{0(2)})^T = (50,50)^T.

  • \boldsymbol{\mu}_0 = (0,0)^T is also often a common choice when there is no prior guess, especially when there is enough data to "drown out" the prior guess.

  • Next, we need to set values for elements of

    \begin{eqnarray*} \Lambda_0 = \left(\begin{array}{cc} \lambda_{11} & \lambda_{12} \\ \lambda_{21} & \lambda_{22} \end{array}\right)\\ \end{eqnarray*}

  • It is quite reasonable to believe a priori that the true means will most likely lie in the interval [25, 75] with high probability (perhaps 0.95?), since individual test scores should lie in the interval [0, 100].

  • Recall that for any normal distribution, 95% of the density will lie within two standard deviations of the mean.

15 / 45

Reading example: prior on mean

  • Therefore, we can set

    \begin{split} & \mu_{0(1)} \pm 2 \sqrt{\lambda_{11}} = (25,75) \ \ \Rightarrow \ \ 50 \pm 2\sqrt{\lambda_{11}} = (25,75) \\ \Rightarrow \ \ & 2\sqrt{\lambda_{11}} = 25 \ \ \Rightarrow \ \ \lambda_{11} = \left(\frac{25}{2}\right)^2 \approx 156. \end{split}

  • Similarly, set \lambda_{22} \approx 156.

  • Finally, we expect some prior correlation between \theta_1 and \theta_2, but suppose we don't know exactly how strong it is. We can set the prior correlation to 0.5.

    \Rightarrow 0.5 = \dfrac{\lambda_{12}}{\sqrt{\lambda_{11}}\sqrt{\lambda_{22}}} = \dfrac{\lambda_{12}}{156} \ \ \Rightarrow \ \ \lambda_{12} = 156 \times 0.5 = 78.

  • Thus,

    \begin{eqnarray*} \pi(\boldsymbol{\theta}) = \mathcal{N}_2\left(\boldsymbol{\mu}_0 = \left(\begin{array}{c} 50\\ 50 \end{array}\right),\Lambda_0 = \left(\begin{array}{cc} 156 & 78 \\ 78 & 156 \end{array}\right)\right).\\ \end{eqnarray*}
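
  • The same elicitation can be scripted. A minimal R sketch (not from the slides; the slides round 156.25 to 156 and 78.125 to 78):

    mu_0 <- c(50,50)
    half_width <- (75 - 25)/2          #the means should lie in [25, 75] with ~95% prior probability
    lambda_11 <- (half_width/2)^2      #2*sqrt(lambda_11) = 25  =>  lambda_11 = 156.25
    lambda_22 <- lambda_11
    lambda_12 <- 0.5*sqrt(lambda_11*lambda_22)   #prior correlation of 0.5  =>  78.125
    Lambda_0 <- matrix(c(lambda_11,lambda_12,lambda_12,lambda_22),nrow=2,ncol=2)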

16 / 45

Reading example: prior on covariance

  • Next we need to set the hyperparameters \nu_0 and \boldsymbol{S}_0 in \pi(\Sigma) = \mathcal{IW}_2(\nu_0, \boldsymbol{S}_0), based on prior belief.

  • First, let's start with a prior guess \Sigma_0 for \Sigma.

  • Again, since individual test scores should lie in the interval [0, 100], we should set \Sigma_0 so that values outside [0, 100] are highly unlikely.

  • Just as we did with \Lambda_0, we can use that idea to set the elements of \Sigma_0

    \begin{eqnarray*} \Sigma_0 = \left(\begin{array}{cc} \sigma^{(0)}_{11} & \sigma^{(0)}_{12} \\ \sigma^{(0)}_{21} & \sigma^{(0)}_{22} \end{array}\right)\\ \end{eqnarray*}

  • The identity matrix is also often a common choice for \Sigma_0 when there is no prior guess, especially when there is enough data to "drown out" the prior guess.

17 / 45

Reading example: prior on covariance

  • Therefore, we can set

    \begin{split} & \mu_{0(1)} \pm 2 \sqrt{\sigma^{(0)}_{11}} = (0,100) \ \ \Rightarrow \ \ 50 \pm 2\sqrt{\sigma^{(0)}_{11}} = (0,100) \\ \Rightarrow \ \ & 2\sqrt{\sigma^{(0)}_{11}} = 50 \ \ \Rightarrow \ \ \sigma^{(0)}_{11} = \left(\frac{50}{2}\right)^2 \approx 625. \end{split}

  • Similarly, set \sigma^{(0)}_{22} \approx 625.

  • Again, we expect some correlation between Y_1 and Y_2, but suppose we don't know exactly how strong. We can set the prior correlation to 0.5.

    \Rightarrow 0.5 = \dfrac{\sigma^{(0)}_{12}}{\sqrt{\sigma^{(0)}_{11}}\sqrt{\sigma^{(0)}_{22}}} = \dfrac{\sigma^{(0)}_{12}}{625} \ \ \Rightarrow \ \ \sigma^{(0)}_{12} = 625 \times 0.5 = 312.5.

  • Thus,

    \begin{eqnarray*} \Sigma_0 = \left(\begin{array}{cc} 625 & 312.5 \\ 312.5 & 625 \end{array}\right)\\ \end{eqnarray*}
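
  • As a rough sanity check of this choice (a sketch, not from the slides), scores simulated around a mean of 50 with covariance \Sigma_0 should only rarely fall outside [0, 100]:

    library(mvtnorm)
    set.seed(2)
    Sigma_0 <- matrix(c(625,312.5,312.5,625),nrow=2,ncol=2)
    scores <- rmvnorm(10000, mean=c(50,50), sigma=Sigma_0)
    mean(scores < 0 | scores > 100)   #proportion outside [0, 100]; roughly 0.05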

18 / 45

Reading example: prior on covariance

  • Recall that if we are not at all confident in a prior value for \Sigma, but we have a prior guess \Sigma_0, we can set

    • \nu_0 = p + 2, so that \mathbb{E}[\Sigma] = \dfrac{1}{\nu_0 - p - 1} \boldsymbol{S}_0 is finite, and
    • \boldsymbol{S}_0 = \Sigma_0

    so that \Sigma is only loosely centered around \Sigma_0.

  • Thus, we can set

    • \nu_0 = p + 2 = 2 + 2 = 4
    • \boldsymbol{S}_0 = \Sigma_0

    so that we have

    \begin{eqnarray*} \pi(\Sigma) = \mathcal{IW}_2\left(\nu_0 = 4,\Sigma_0 = \left(\begin{array}{cc} 625 & 312.5 \\ 312.5 & 625 \end{array}\right)\right).\\ \end{eqnarray*}

19 / 45

Reading example: data

Now, to the data (finally!)

Y <- as.matrix(dget("http://www2.stat.duke.edu/~pdh10/FCBS/Inline/Y.reading"))
dim(Y)
## [1] 22  2
head(Y)
##      pretest posttest
## [1,]      59       77
## [2,]      43       39
## [3,]      34       46
## [4,]      32       26
## [5,]      42       38
## [6,]      38       43
summary(Y)
##     pretest         posttest
##  Min.   :28.00   Min.   :26.00
##  1st Qu.:34.25   1st Qu.:43.75
##  Median :44.00   Median :52.00
##  Mean   :47.18   Mean   :53.86
##  3rd Qu.:58.00   3rd Qu.:60.00
##  Max.   :72.00   Max.   :86.00
20 / 45

Reading example: data

21 / 45

Reading example: data

22 / 45

Posterior computation

  • To recap, we have

    \begin{split} \pi(\boldsymbol{\theta} | \Sigma, \boldsymbol{Y}) = \mathcal{N}_2(\boldsymbol{\mu}_n, \Lambda_n) \end{split}

    where

    \begin{split} \Lambda_n & = \left[\Lambda_0^{-1} + n\Sigma^{-1}\right]^{-1}\\ \\ \boldsymbol{\mu}_n & = \Lambda_n \left[\Lambda_0^{-1}\boldsymbol{\mu}_0 + n\Sigma^{-1} \bar{\boldsymbol{y}} \right],\\ \end{split}

    \boldsymbol{\mu}_0 = (\mu_{0(1)},\mu_{0(2)})^T = (50,50)^T

    \begin{eqnarray*} \Lambda_0 = \left(\begin{array}{cc} 156 & 78 \\ 78 & 156 \end{array}\right)\\ \end{eqnarray*}

23 / 45

Posterior computation

  • We also have

    \begin{split} \pi(\Sigma | \boldsymbol{\theta}, \boldsymbol{Y}) = \mathcal{IW}_2(\nu_n, \boldsymbol{S}_n) \end{split}

    or using the notation in the book, \mathcal{IW}_2(\nu_n, \boldsymbol{S}_n^{-1} ), where

    \begin{split} \nu_n & = \nu_0 + n\\ \\ \boldsymbol{S}_n & = \left[\boldsymbol{S}_0 + \boldsymbol{S}_\theta \right]\\ & = \left[\boldsymbol{S}_0 + \sum^n_{i=1}(\boldsymbol{y}_i - \boldsymbol{\theta})(\boldsymbol{y}_i - \boldsymbol{\theta})^T \right]. \end{split}

    \nu_0 = p + 2 = 4

    \begin{eqnarray*} \Sigma_0 = \left(\begin{array}{cc} 625 & 312.5 \\ 312.5 & 625 \end{array}\right)\\ \end{eqnarray*}

24 / 45

Posterior computation

#Data summaries
n <- nrow(Y)
ybar <- apply(Y,2,mean)
#Hyperparameters for the priors
mu_0 <- c(50,50)
Lambda_0 <- matrix(c(156,78,78,156),nrow=2,ncol=2)
nu_0 <- 4
S_0 <- matrix(c(625,312.5,312.5,625),nrow=2,ncol=2)
#Initial values for Gibbs sampler
#No need to set initial value for theta, we can simply sample it first
Sigma <- cov(Y)
#Set null matrices to save samples
THETA <- SIGMA <- NULL

Next, we need to write the code for the Gibbs sampler.

25 / 45

Posterior computation

#Now, to the Gibbs sampler
library(mvtnorm)  #for the multivariate normal (rmvnorm)
library(MCMCpack) #for the inverse-Wishart (riwish)
#first set number of iterations and burn-in, then set seed
n_iter <- 10000; burn_in <- 0.3*n_iter
set.seed(1234)
for (s in 1:(n_iter+burn_in)){
  ##update theta using its full conditional
  Lambda_n <- solve(solve(Lambda_0) + n*solve(Sigma))
  mu_n <- Lambda_n %*% (solve(Lambda_0)%*%mu_0 + n*solve(Sigma)%*%ybar)
  theta <- rmvnorm(1,mu_n,Lambda_n)
  #update Sigma
  S_theta <- (t(Y)-c(theta))%*%t(t(Y)-c(theta))  #residual sum of squares matrix
  S_n <- S_0 + S_theta
  nu_n <- nu_0 + n
  Sigma <- riwish(nu_n, S_n)
  #save results only past burn-in
  if(s > burn_in){
    THETA <- rbind(THETA,theta)
    SIGMA <- rbind(SIGMA,c(Sigma))
  }
}
colnames(THETA) <- c("theta_1","theta_2")
colnames(SIGMA) <- c("sigma_11","sigma_12","sigma_21","sigma_22") #symmetry in sigma

Note that the text also has a function to sample from the Wishart distribution.

26 / 45

Diagnostics

library(coda)  #for MCMC diagnostics
THETA.mcmc <- mcmc(THETA,start=1); summary(THETA.mcmc)
##
## Iterations = 1:10000
## Thinning interval = 1
## Number of chains = 1
## Sample size per chain = 10000
##
## 1. Empirical mean and standard deviation for each variable,
##    plus standard error of the mean:
##
##          Mean    SD Naive SE Time-series SE
## theta_1 47.30 2.956  0.02956        0.02956
## theta_2 53.69 3.290  0.03290        0.03290
##
## 2. Quantiles for each variable:
##
##          2.5%   25%   50%   75% 97.5%
## theta_1 41.55 45.35 47.36 49.23 53.08
## theta_2 47.08 51.53 53.69 55.82 60.13
effectiveSize(THETA.mcmc)
## theta_1 theta_2
##   10000   10000
27 / 45

Diagnostics

SIGMA.mcmc <- mcmc(SIGMA,start=1); summary(SIGMA.mcmc)
##
## Iterations = 1:10000
## Thinning interval = 1
## Number of chains = 1
## Sample size per chain = 10000
##
## 1. Empirical mean and standard deviation for each variable,
##    plus standard error of the mean:
##
##           Mean    SD Naive SE Time-series SE
## sigma_11 202.3 63.39   0.6339         0.6511
## sigma_12 155.3 60.92   0.6092         0.6244
## sigma_21 155.3 60.92   0.6092         0.6244
## sigma_22 260.1 81.96   0.8196         0.8352
##
## 2. Quantiles for each variable:
##
##            2.5%   25%   50%   75% 97.5%
## sigma_11 113.50 158.2 190.8 234.8 357.3
## sigma_12  67.27 113.2 144.7 186.5 305.4
## sigma_21  67.27 113.2 144.7 186.5 305.4
## sigma_22 145.84 203.2 244.6 300.9 461.0
effectiveSize(SIGMA.mcmc)
## sigma_11 sigma_12 sigma_21 sigma_22
## 9478.710 9517.989 9517.989 9629.352
28 / 45

Diagnostics: trace plots

plot(THETA.mcmc[,"theta_1"])

Looks good!

29 / 45

Diagnostics: trace plots

plot(THETA.mcmc[,"theta_2"])

Looks good!

30 / 45

Diagnostics: trace plots

plot(SIGMA.mcmc[,"sigma_11"])

Looks good!

31 / 45

Diagnostics: trace plots

plot(SIGMA.mcmc[,"sigma_12"])

Looks good!

32 / 45

Diagnostics: trace plots

plot(SIGMA.mcmc[,"sigma_22"])

Looks good!

33 / 45

Diagnostics: autocorrelation

autocorr.plot(THETA.mcmc[,"theta_1"])

Looks good!

34 / 45

Diagnostics: autocorrelation

autocorr.plot(THETA.mcmc[,"theta_2"])

Looks good!

35 / 45

Diagnostics: autocorrelation

autocorr.plot(SIGMA.mcmc[,"sigma_11"])

Looks good!

36 / 45

Diagnostics: autocorrelation

autocorr.plot(SIGMA.mcmc[,"sigma_12"])

Looks good!

37 / 45

Diagnostics: autocorrelation

autocorr.plot(SIGMA.mcmc[,"sigma_22"])

Looks good!

38 / 45

Posterior distribution of the mean

39 / 45

Answering questions of interest

  • Questions of interest:

    • Do students improve in reading comprehension on average?
  • Need to compute \Pr[\theta_2 > \theta_1 | \boldsymbol{Y}]. In R,

    mean(THETA[,2]>THETA[,1])
    ## [1] 0.992
  • That is, the posterior probability is greater than 0.99, indicating strong evidence that test scores are higher on the second administration.
40 / 45

Answering questions of interest

  • Questions of interest:

    • If so, by how much?
  • Need posterior summaries of \theta_2 - \theta_1 | \boldsymbol{Y}. In R,

    mean(THETA[,2] - THETA[,1])
    ## [1] 6.385515
    quantile(THETA[,2] - THETA[,1], prob=c(0.025, 0.5, 0.975))
    ## 2.5% 50% 97.5%
    ## 1.233154 6.385597 11.551304
  • Mean (and median) improvement is \approx 6.39 points with 95% credible interval (1.23, 11.55).
41 / 45

Answering questions of interest

  • Questions of interest:

    • How correlated (positively) are the post-test and pre-test scores?
  • We can compute \Pr[\sigma_{12} > 0 | \boldsymbol{Y}]. In R,

    mean(SIGMA[,2]>0)
    ## [1] 1
  • The posterior probability that the covariance between them is positive is essentially 1.
42 / 45

Answering questions of interest

  • Questions of interest:

    • How correlated (positively) are the post-test and pre-test scores?
  • We can also look at the distribution of \rho instead. In R,

    CORR <- SIGMA[,2]/(sqrt(SIGMA[,1])*sqrt(SIGMA[,4]))
    quantile(CORR,prob=c(0.025, 0.5, 0.975))
    ## 2.5% 50% 97.5%
    ## 0.4046817 0.6850218 0.8458880
  • The median correlation between the two scores is 0.69, with a 95% quantile-based credible interval of (0.40, 0.85).

  • Because the density is skewed, we may prefer the 95% HPD interval, which is (0.45, 0.88).

    library(hdrcde)  #for highest density regions
    hdr(CORR,prob=95)$hdr
    ## [,1] [,2]
    ## 95% 0.4468522 0.8761174
43 / 45

Jeffreys' prior

  • Clearly, there's a lot of work to be done in specifying the hyperparameters (two of which are p \times p matrices).

  • What if we want to specify the priors so that we put in as little information as possible?

  • We already know how to do that somewhat with Jeffreys' priors.

  • For the multivariate normal model, it turns out that Jeffreys' rule for generating a prior distribution on (\boldsymbol{\theta}, \Sigma) gives

    \pi(\boldsymbol{\theta}, \Sigma) \propto \left|\Sigma\right|^{-\frac{(p+2)}{2}}.

  • Can we derive the full conditionals under this prior?

  • To be done on the board.

44 / 45

Jeffreys' prior

  • We can leverage previous work. For the likelihood we have both

    \begin{split} L(\boldsymbol{Y}; \boldsymbol{\theta}, \Sigma) & \propto \textrm{exp} \left\{-\dfrac{1}{2} \boldsymbol{\theta}^T(n\Sigma^{-1})\boldsymbol{\theta} + \boldsymbol{\theta}^T (n\Sigma^{-1} \bar{\boldsymbol{y}}) \right\} \end{split}

    and

    \begin{split} L(\boldsymbol{Y}; \boldsymbol{\theta}, \Sigma) & \propto \left|\Sigma\right|^{-\frac{n}{2}} \ \textrm{exp} \left\{-\dfrac{1}{2}\text{tr}\left[\boldsymbol{S}_\theta \Sigma^{-1} \right] \right\},\\ \end{split}

    where \boldsymbol{S}_\theta = \sum^n_{i=1}(\boldsymbol{y}_i - \boldsymbol{\theta})(\boldsymbol{y}_i - \boldsymbol{\theta})^T.

  • Also, we can rewrite any \mathcal{N}_p(\boldsymbol{\mu}_0, \Lambda_0) as

    \begin{split} p(\boldsymbol{\theta}) & \propto \textrm{exp} \left\{-\dfrac{1}{2} \boldsymbol{\theta}^T\Lambda_0^{-1}\boldsymbol{\theta} + \boldsymbol{\theta}^T\Lambda_0^{-1}\boldsymbol{\mu}_0 \right\}.\\ \end{split}

  • Finally, \Sigma \sim \mathcal{IW}_p(\nu_0, \boldsymbol{S}_0),

    \begin{split} \Rightarrow \ \ p(\Sigma) \ \propto \ \left|\Sigma\right|^{\frac{-(\nu_0 + p + 1)}{2}} \textrm{exp} \left\{-\dfrac{1}{2} \text{tr}(\boldsymbol{S}_0\Sigma^{-1}) \right\}. \end{split}
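
  • Looking ahead, here is a hedged sketch of what the Gibbs sampler might look like under Jeffreys' prior, assuming the board derivation yields \boldsymbol{\theta} | \Sigma, \boldsymbol{Y} \sim \mathcal{N}_p(\bar{\boldsymbol{y}}, \Sigma/n) and \Sigma | \boldsymbol{\theta}, \boldsymbol{Y} \sim \mathcal{IW}_p(n + 1, \boldsymbol{S}_\theta); it reuses n_iter, burn_in, ybar, n, and Y from the earlier code, and only the two update lines change:

    #hedged sketch: Gibbs sampler under Jeffreys' prior (verify the full conditionals on the board first)
    library(mvtnorm); library(MCMCpack)
    THETA <- SIGMA <- NULL
    Sigma <- cov(Y)   #initial value
    for (s in 1:(n_iter+burn_in)){
      theta <- rmvnorm(1, ybar, Sigma/n)                #theta | Sigma, Y
      S_theta <- (t(Y)-c(theta))%*%t(t(Y)-c(theta))
      Sigma <- riwish(n + 1, S_theta)                   #Sigma | theta, Y
      if(s > burn_in){
        THETA <- rbind(THETA,theta)
        SIGMA <- rbind(SIGMA,c(Sigma))
      }
    }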

45 / 45
