Chat on survey responses
Multivariate normal/Gaussian model
Inference for mean (recap)
Inference for covariance
Back to the example
Gibbs sampler
Jeffreys' prior
For data \boldsymbol{Y}_i = (Y_{i1},\ldots,Y_{ip})^T \sim \mathcal{N}_p(\boldsymbol{\theta}, \Sigma),
\begin{split} L(\boldsymbol{Y}; \boldsymbol{\theta}, \Sigma) & = \prod^n_{i=1} (2\pi)^{-\frac{p}{2}} \left|\Sigma\right|^{-\frac{1}{2}} \ \textrm{exp} \left\{-\dfrac{1}{2} (\boldsymbol{y}_i - \boldsymbol{\theta})^T \Sigma^{-1} (\boldsymbol{y}_i - \boldsymbol{\theta})\right\}\\ & \propto \textrm{exp} \left\{-\dfrac{1}{2} \boldsymbol{\theta}^T(n\Sigma^{-1})\boldsymbol{\theta} + \boldsymbol{\theta}^T (n\Sigma^{-1} \bar{\boldsymbol{y}}) \right\}. \end{split}
If we assume \pi(\boldsymbol{\theta}) = \mathcal{N}_p(\boldsymbol{\mu}_0, \Lambda_0), that is,
\begin{split} \pi(\boldsymbol{\theta}) & \propto \textrm{exp} \left\{-\dfrac{1}{2} \boldsymbol{\theta}^T\Lambda_0^{-1}\boldsymbol{\theta} + \boldsymbol{\theta}^T\Lambda_0^{-1}\boldsymbol{\mu}_0 \right\}\\ \end{split}
Then
\begin{split} \pi(\boldsymbol{\theta} | \Sigma, \boldsymbol{Y}) & \propto \textrm{exp} \left\{-\dfrac{1}{2} \boldsymbol{\theta}^T \left[\Lambda_0^{-1} + n\Sigma^{-1}\right] \boldsymbol{\theta} + \boldsymbol{\theta}^T \left[\Lambda_0^{-1}\boldsymbol{\mu}_0 + n\Sigma^{-1} \bar{\boldsymbol{y}} \right] \right\} \ \equiv \ \mathcal{N}_p(\boldsymbol{\mu}_n, \Lambda_n) \end{split}
where
\begin{split} \Lambda_n & = \left[\Lambda_0^{-1} + n\Sigma^{-1}\right]^{-1}\\ \boldsymbol{\mu}_n & = \Lambda_n \left[\Lambda_0^{-1}\boldsymbol{\mu}_0 + n\Sigma^{-1} \bar{\boldsymbol{y}} \right]. \end{split}
As in the univariate case, we once again have that
\Lambda_n^{-1} = \Lambda_0^{-1} + n\Sigma^{-1}
\begin{split} \boldsymbol{\mu}_n & = \Lambda_n \left[\Lambda_0^{-1}\boldsymbol{\mu}_0 + n\Sigma^{-1} \bar{\boldsymbol{y}} \right]\\ \\ & = \overbrace{\left[ \Lambda_n \Lambda_0^{-1} \right]}^{\textrm{weight on prior mean}} \underbrace{\boldsymbol{\mu}_0}_{\textrm{prior mean}} + \overbrace{\left[ \Lambda_n (n\Sigma^{-1}) \right]}^{\textrm{weight on sample mean}} \underbrace{ \bar{\boldsymbol{y}}}_{\textrm{sample mean}} \end{split}
Compare these to the results from the univariate case to gain more intuition.
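To make the update concrete, here is a minimal R sketch of this full conditional with made-up values for \boldsymbol{\mu}_0, \Lambda_0, \Sigma, \bar{\boldsymbol{y}}, and n (all hypothetical; the same computation appears inside the Gibbs sampler later).
#A sketch of the full conditional update for theta (all values below are hypothetical)
library(mvtnorm)
p <- 2
mu_0 <- rep(0, p); Lambda_0 <- diag(100, p)          #hypothetical prior mean and covariance
Sigma <- matrix(c(4, 1, 1, 3), p, p)                 #hypothetical "current" value of Sigma
ybar <- c(1.2, -0.5); n <- 30                        #hypothetical sample mean and sample size
Lambda_n <- solve(solve(Lambda_0) + n*solve(Sigma))  #posterior covariance
mu_n <- Lambda_n %*% (solve(Lambda_0)%*%mu_0 + n*solve(Sigma)%*%ybar)  #posterior mean
theta <- rmvnorm(1, c(mu_n), Lambda_n)               #one draw from N_p(mu_n, Lambda_n)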
A random variable \Sigma \sim \mathcal{IW}_p(\nu_0, \boldsymbol{S}_0), where \Sigma is positive definite and p \times p, has pdf
\begin{split} p(\Sigma) \ \propto \ \left|\Sigma\right|^{\frac{-(\nu_0 + p + 1)}{2}} \textrm{exp} \left\{-\dfrac{1}{2} \text{tr}(\boldsymbol{S}_0\Sigma^{-1}) \right\}, \end{split}
where \nu_0 is the degrees of freedom (with \nu_0 > p - 1 so that the density is proper) and \boldsymbol{S}_0 is a p \times p positive definite scale matrix.
For this distribution, \mathbb{E}[\Sigma] = \dfrac{1}{\nu_0 - p - 1} \boldsymbol{S}_0, for \nu_0 > p + 1.
Hence, \boldsymbol{S}_0 is the scaled mean of the \mathcal{IW}_p(\nu_0, \boldsymbol{S}_0).
If we are very confident in a prior guess \Sigma_0 for \Sigma, then we might set \nu_0 very large and \boldsymbol{S}_0 = (\nu_0 - p - 1)\Sigma_0.
In this case, \mathbb{E}[\Sigma] = \dfrac{1}{\nu_0 - p - 1} \boldsymbol{S}_0 = \dfrac{1}{\nu_0 - p - 1}(\nu_0 - p - 1)\Sigma_0 = \Sigma_0, and \Sigma is tightly (depending on the value of \nu_0) centered around \Sigma_0.
If we are not at all confident but we still have a prior guess \Sigma_0, we might set \nu_0 = p + 2, so that \nu_0 - p - 1 = 1, and \boldsymbol{S}_0 = \Sigma_0.
Here, \mathbb{E}[\Sigma] = \Sigma_0 as before, but \Sigma is only loosely centered around \Sigma_0.
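As a quick illustration of the two choices, here is a hedged R sketch that draws from the inverse-Wishart under each setting, using riwish() from MCMCpack (the same sampler used in the Gibbs code later) and a hypothetical prior guess \Sigma_0.
#Sketch: tight vs loose inverse-Wishart centering around a hypothetical Sigma_0
library(MCMCpack)
set.seed(1)
p <- 2
Sigma_0 <- matrix(c(4, 1, 1, 3), p, p)    #hypothetical prior guess for Sigma
#very confident: nu_0 large, S_0 = (nu_0 - p - 1) * Sigma_0
nu_tight <- 100
sig11_tight <- replicate(5000, riwish(nu_tight, (nu_tight - p - 1)*Sigma_0)[1,1])
#not confident: nu_0 = p + 2, S_0 = Sigma_0
nu_loose <- p + 2
sig11_loose <- replicate(5000, riwish(nu_loose, Sigma_0)[1,1])
#both sets of draws are centered near Sigma_0[1,1], but the loose prior is far more spread out
quantile(sig11_tight, c(0.025, 0.975))
quantile(sig11_loose, c(0.025, 0.975))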
Just as we had with the gamma and inverse-gamma relationship in the univariate case, we can also work in terms of the Wishart distribution (multivariate generalization of the gamma) instead.
The Wishart distribution provides a conditionally-conjugate prior for the precision matrix \Sigma^{-1} in a multivariate normal model.
Specifically, if \Sigma \sim \mathcal{IW}_p(\nu_0, \boldsymbol{S}_0), then \Phi = \Sigma^{-1} \sim \textrm{W}_p(\nu_0, \boldsymbol{S}_0^{-1}).
A random variable \Phi \sim \textrm{W}_p(\nu_0, \boldsymbol{S}_0^{-1}), where \Phi has dimension (p \times p), has pdf
\begin{split} f(\Phi) \ \propto \ \left|\Phi\right|^{\frac{\nu_0 - p - 1}{2}} \textrm{exp} \left\{-\dfrac{1}{2} \text{tr}(\boldsymbol{S}_0\Phi) \right\}. \end{split}
Here, \mathbb{E}[\Phi] = \nu_0 \boldsymbol{S}_0^{-1}.
Note that the textbook writes the inverse-Wishart as \mathcal{IW}_p(\nu_0, \boldsymbol{S}_0^{-1}). I prefer \mathcal{IW}_p(\nu_0, \boldsymbol{S}_0) instead. Feel free to use either notation but try not to get confused.
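A small Monte Carlo sanity check of this relationship (and of the mean formula above), using riwish() and rwish() from MCMCpack with hypothetical values of \nu_0 and \boldsymbol{S}_0:
#Sketch: if Sigma ~ IW_p(nu_0, S_0), then Sigma^{-1} should behave like W_p(nu_0, S_0^{-1})
library(MCMCpack)
set.seed(1)
nu_0 <- 10
S_0 <- matrix(c(4, 1, 1, 3), 2, 2)                        #hypothetical scale matrix
Phi_draws <- replicate(10000, solve(riwish(nu_0, S_0)))   #invert inverse-Wishart draws
apply(Phi_draws, c(1,2), mean)                            #Monte Carlo estimate of E[Sigma^{-1}]
apply(replicate(10000, rwish(nu_0, solve(S_0))), c(1,2), mean)  #direct Wishart draws agree
nu_0 * solve(S_0)                                         #theoretical mean nu_0 * S_0^{-1}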
For inference on \boldsymbol{\Sigma}, we need to rewrite the likelihood a bit to match the inverse-Wishart kernel.
First, a few results from matrix algebra:
1. The trace of a square p \times p matrix \boldsymbol{A} is \text{tr}(\boldsymbol{A}) = \sum^p_{j=1} a_{jj}, the sum of its diagonal elements.
2. The cyclic property: \text{tr}(\boldsymbol{A}\boldsymbol{B}) = \text{tr}(\boldsymbol{B}\boldsymbol{A}) whenever both products are defined.
3. A scalar equals the trace of the corresponding 1 \times 1 matrix, so \boldsymbol{x}^T \boldsymbol{A} \boldsymbol{x} = \text{tr}\left(\boldsymbol{x}^T \boldsymbol{A} \boldsymbol{x}\right).
4. The trace is additive: \sum_k \text{tr}(\boldsymbol{A}_k) = \text{tr}\left(\sum_k \boldsymbol{A}_k\right).
It is thus convenient to rewrite L(\boldsymbol{Y}; \boldsymbol{\theta}, \Sigma) as
\begin{split} L(\boldsymbol{Y}; \boldsymbol{\theta}, \Sigma) & \propto \underbrace{\left|\Sigma\right|^{-\frac{n}{2}} \ \textrm{exp} \left\{-\dfrac{1}{2} \sum^n_{i=1} (\boldsymbol{y}_i - \boldsymbol{\theta})^T \Sigma^{-1} (\boldsymbol{y}_i - \boldsymbol{\theta})\right\}}_{\text{no algebra/change yet}}\\ & = \left|\Sigma\right|^{-\frac{n}{2}} \ \textrm{exp} \left\{-\dfrac{1}{2} \sum^n_{i=1} \underbrace{\text{tr}\left[(\boldsymbol{y}_i - \boldsymbol{\theta})^T \Sigma^{-1} (\boldsymbol{y}_i - \boldsymbol{\theta}) \right]}_{\text{by result 3}} \right\}\\ & = \left|\Sigma\right|^{-\frac{n}{2}} \ \textrm{exp} \left\{-\dfrac{1}{2} \sum^n_{i=1} \underbrace{\text{tr}\left[(\boldsymbol{y}_i - \boldsymbol{\theta})(\boldsymbol{y}_i - \boldsymbol{\theta})^T \Sigma^{-1} \right]}_{\text{by cyclic property}} \right\}\\ & = \left|\Sigma\right|^{-\frac{n}{2}} \ \textrm{exp} \left\{-\dfrac{1}{2} \underbrace{\text{tr}\left[\sum^n_{i=1} (\boldsymbol{y}_i - \boldsymbol{\theta})(\boldsymbol{y}_i - \boldsymbol{\theta})^T \Sigma^{-1} \right]}_{\text{by result 4}} \right\}\\ & = \left|\Sigma\right|^{-\frac{n}{2}} \ \textrm{exp} \left\{-\dfrac{1}{2}\text{tr}\left[\boldsymbol{S}_\theta \Sigma^{-1} \right] \right\},\\ \end{split}
where \boldsymbol{S}_\theta = \sum^n_{i=1}(\boldsymbol{y}_i - \boldsymbol{\theta})(\boldsymbol{y}_i - \boldsymbol{\theta})^T is the residual sum of squares matrix.
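The compact matrix expression for \boldsymbol{S}_\theta used in the Gibbs sampler code later can be checked against this definition directly; a small R sketch with hypothetical data and a hypothetical value of \boldsymbol{\theta}:
#Sketch: the residual sum of squares matrix S_theta, written two ways
set.seed(1)
Y_toy <- matrix(rnorm(10*2), ncol=2)   #hypothetical n = 10, p = 2 data
theta <- c(0.5, -0.3)                  #hypothetical current value of theta
#definition: sum over i of (y_i - theta)(y_i - theta)^T
S_loop <- matrix(0, 2, 2)
for (i in 1:nrow(Y_toy)){
  d <- Y_toy[i,] - theta
  S_loop <- S_loop + d %*% t(d)
}
#compact matrix form (the one used in the Gibbs sampler below)
S_compact <- (t(Y_toy) - c(theta)) %*% t(t(Y_toy) - c(theta))
max(abs(S_loop - S_compact))           #should be numerically zero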
Assuming \pi(\Sigma) = \mathcal{IW}_p(\nu_0, \boldsymbol{S}_0), the conditional posterior (full conditional) of \Sigma given \boldsymbol{\theta}, \boldsymbol{Y} is then
\begin{split} \pi(\Sigma| \boldsymbol{\theta}, \boldsymbol{Y}) & \propto L(\boldsymbol{Y}; \boldsymbol{\theta}, \Sigma) \cdot \pi(\Sigma) \\ \\ & \propto \underbrace{\left|\Sigma\right|^{-\frac{n}{2}} \ \textrm{exp} \left\{-\dfrac{1}{2}\text{tr}\left[\boldsymbol{S}_\theta \Sigma^{-1} \right] \right\}}_{L(\boldsymbol{Y}; \boldsymbol{\theta}, \Sigma)} \cdot \underbrace{\left|\Sigma\right|^{\frac{-(\nu_0 + p + 1)}{2}} \textrm{exp} \left\{-\dfrac{1}{2} \text{tr}(\boldsymbol{S}_0\Sigma^{-1}) \right\}}_{\pi(\Sigma)} \\ \\ & \propto \left|\Sigma\right|^{\frac{-(\nu_0 + n + p + 1)}{2}} \textrm{exp} \left\{-\dfrac{1}{2} \text{tr}\left[\boldsymbol{S}_0\Sigma^{-1} + \boldsymbol{S}_\theta \Sigma^{-1} \right] \right\} \\ \\ & = \left|\Sigma\right|^{\frac{-(\nu_0 + n + p + 1)}{2}} \textrm{exp} \left\{-\dfrac{1}{2} \text{tr}\left[ \left(\boldsymbol{S}_0 + \boldsymbol{S}_\theta \right) \Sigma^{-1} \right] \right\}, \end{split}
which is \mathcal{IW}_p(\nu_n, \boldsymbol{S}_n), or using the notation in the book, \mathcal{IW}_p(\nu_n, \boldsymbol{S}_n^{-1}), with
\begin{split} \nu_n & = \nu_0 + n\\ \boldsymbol{S}_n & = \boldsymbol{S}_0 + \boldsymbol{S}_\theta. \end{split}
We once again see that the "posterior sample size" or "posterior degrees of freedom" \nu_n is the sum of the "prior degrees of freedom" \nu_0 and the data sample size n.
\boldsymbol{S}_n can be thought of as the "posterior sum of squares", which is the sum of "prior sum of squares" plus "sample sum of squares".
Recall that if \Sigma \sim \mathcal{IW}_p(\nu_0, \boldsymbol{S}_0), then \mathbb{E}[\Sigma] = \dfrac{1}{\nu_0 - p - 1} \boldsymbol{S}_0.
\Rightarrow the conditional posterior expectation of the population covariance is
\begin{split} \mathbb{E}[\Sigma | \boldsymbol{\theta}, \boldsymbol{Y}] & = \dfrac{1}{\nu_0 + n - p - 1} \left[\boldsymbol{S}_0 + \boldsymbol{S}_\theta \right]\\ \\ & = \underbrace{\dfrac{\nu_0 - p - 1}{\nu_0 + n - p - 1}}_{\text{weight on prior expectation}} \overbrace{\left[\dfrac{1}{\nu_0 - p - 1} \boldsymbol{S}_0\right]}^{\text{prior expectation}} + \underbrace{\dfrac{n}{\nu_0 + n - p - 1}}_{\text{weight on sample estimate}} \overbrace{\left[\dfrac{1}{n} \boldsymbol{S}_\theta \right]}^{\text{sample estimate}},\\ \end{split}
which is a weighted average of prior expectation and sample estimate.
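For instance, with p = 2, \nu_0 = p + 2 = 4, and n = 22 (the values used in the example below), the prior expectation gets weight \dfrac{\nu_0 - p - 1}{\nu_0 + n - p - 1} = \dfrac{1}{23} \approx 0.04 while the sample estimate gets weight \dfrac{22}{23} \approx 0.96, so the data dominate.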
Vector of observations for each student: \boldsymbol{Y}_i = (Y_{i1},Y_{i2})^T.
Y_{i1}: pre-instructional score for student i.
Y_{i2}: post-instructional score for student i.
Questions of interest:
Do students improve in reading comprehension on average?
If so, by how much?
Can we predict post-test score from pre-test score? How correlated are they?
If we have students with missing pre-test scores, can we predict the scores? (Will defer this till next week!)
Since we have bivariate continuous responses for each student, and test scores are often normally distributed, we can use a bivariate normal model.
Model the data as \boldsymbol{Y_i} = (Y_{i1},Y_{i2})^T \sim \mathcal{N}_2(\boldsymbol{\theta}, \Sigma), that is,
\begin{eqnarray*} \boldsymbol{Y}_i = \begin{pmatrix}Y_{i1}\\ Y_{i2} \end{pmatrix} & \sim & \mathcal{N}_2\left[\boldsymbol{\theta} = \left(\begin{array}{c} \theta_1\\ \theta_2 \end{array}\right),\Sigma = \left(\begin{array}{cc} \sigma^2_1 & \sigma_{12} \\ \sigma_{21} & \sigma^2_2 \end{array}\right)\right].\\ \end{eqnarray*}
We can answer the first two questions of interest by looking at the posterior distribution of \theta_2 - \theta_1.
The correlation between Y_1 and Y_2 is
\rho = \dfrac{\sigma_{12}}{\sigma_1 \sigma_2},
so we can answer the third question by looking at the posterior distribution of \rho, which we have once we have posterior samples of \Sigma.
Clearly, we first need to set the hyperparameters \boldsymbol{\mu}_0 and \Lambda_0 in \pi(\boldsymbol{\theta}) = \mathcal{N}_2(\boldsymbol{\mu}_0, \Lambda_0), based on prior belief.
For this example, both tests were actually designed a priori to have a mean of 50, so we can set \boldsymbol{\mu}_0 = (\mu_{0(1)},\mu_{0(2)})^T = (50,50)^T.
\boldsymbol{\mu}_0 = (0,0)^T is also often a common choice when there is no prior guess, especially when there is enough data to "drown out" the prior guess.
Next, we need to set values for elements of
\begin{eqnarray*} \Lambda_0 = \left(\begin{array}{cc} \lambda_{11} & \lambda_{12} \\ \lambda_{21} & \lambda_{22} \end{array}\right)\\ \end{eqnarray*}
It is quite reasonable to believe a priori that the true means will most likely lie in the interval [25, 75] with high probability (perhaps 0.95?), since individual test scores should lie in the interval [0, 100].
Recall that for any normal distribution, approximately 95% of the density lies within two standard deviations of the mean.
Therefore, we can set
\begin{split} & \mu_{0(1)} \pm 2 \sqrt{\lambda_{11}} = (25,75) \ \ \Rightarrow \ \ 50 \pm 2\sqrt{\lambda_{11}} = (25,75) \\ \Rightarrow \ \ & 2\sqrt{\lambda_{11}} = 25 \ \ \Rightarrow \ \ \lambda_{11} = \left(\frac{25}{2}\right)^2 \approx 156. \end{split}
Similarly, set \lambda_{22} \approx 156.
Finally, we expect some prior correlation between \theta_1 and \theta_2, but suppose we don't know exactly how strong. We can set the prior correlation to 0.5.
\Rightarrow 0.5 = \dfrac{\lambda_{12}}{\sqrt{\lambda_{11}}\sqrt{\lambda_{22}}} = \dfrac{\lambda_{12}}{156} \ \ \Rightarrow \ \ \lambda_{12} = 156 \times 0.5 = 78.
Thus,
\begin{eqnarray*} \pi(\boldsymbol{\theta}) = \mathcal{N}_2\left(\boldsymbol{\mu}_0 = \left(\begin{array}{c} 50\\ 50 \end{array}\right),\Lambda_0 = \left(\begin{array}{cc} 156 & 78 \\ 78 & 156 \end{array}\right)\right).\\ \end{eqnarray*}
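A quick R sanity check that this prior keeps each mean in (25, 75) with roughly 95% probability (marginally, each \theta_j has prior \mathcal{N}(50, 156)):
#Sketch: prior 95% range for theta_1 under N(50, 156)
qnorm(c(0.025, 0.975), mean = 50, sd = sqrt(156))    #roughly (25.5, 74.5)
pnorm(75, 50, sqrt(156)) - pnorm(25, 50, sqrt(156))  #roughly 0.95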
Next we need to set the hyperparameters \nu_0 and \boldsymbol{S}_0 in \pi(\Sigma) = \mathcal{IW}_2(\nu_0, \boldsymbol{S}_0), based on prior belief.
First, let's start with a prior guess \Sigma_0 for \Sigma.
Again, since individual test scores should lie in the interval [0, 100], we should set \Sigma_0 so that values outside [0, 100] are highly unlikely.
Just as we did with \Lambda_0, we can use that idea to set the elements of \Sigma_0
\begin{eqnarray*} \Sigma_0 = \left(\begin{array}{cc} \sigma^{(0)}_{11} & \sigma^{(0)}_{12} \\ \sigma^{(0)}_{21} & \sigma^{(0)}_{22} \end{array}\right)\\ \end{eqnarray*}
The identity matrix is also often a common choice for \Sigma_0 when there is no prior guess, especially when there is enough data to "drown out" the prior guess.
Therefore, we can set
\begin{split} & \mu_{0(1)} \pm 2 \sqrt{\sigma^{(0)}_{11}} = (0,100) \ \ \Rightarrow \ \ 50 \pm 2\sqrt{\sigma^{(0)}_{11}} = (0,100) \\ \Rightarrow \ \ & 2\sqrt{\sigma^{(0)}_{11}} = 50 \ \ \Rightarrow \ \ \sigma^{(0)}_{11} = \left(\frac{50}{2}\right)^2 = 625. \end{split}
Similarly, set \sigma^{(0)}_{22} \approx 625.
Again, we expect some correlation between Y_1 and Y_2, but suppose we don't know exactly how strong. We can set the prior correlation to 0.5.
\Rightarrow 0.5 = \dfrac{\sigma^{(0)}_{12}}{\sqrt{\sigma^{(0)}_{11}}\sqrt{\sigma^{(0)}_{22}}} = \dfrac{\sigma^{(0)}_{12}}{625} \ \ \Rightarrow \ \ \sigma^{(0)}_{12} = 625 \times 0.5 = 312.5.
Thus,
\begin{eqnarray*} \Sigma_0 = \left(\begin{array}{cc} 625 & 312.5 \\ 312.5 & 625 \end{array}\right)\\ \end{eqnarray*}
Recall that if we are not at all confident in a prior value for \Sigma but we still have a prior guess \Sigma_0, we can set \nu_0 = p + 2 and \boldsymbol{S}_0 = \Sigma_0,
so that \Sigma is only loosely centered around \Sigma_0.
Thus, we can set \nu_0 = p + 2 = 4 and \boldsymbol{S}_0 = \Sigma_0,
so that we have
\begin{eqnarray*} \pi(\Sigma) = \mathcal{IW}_2\left(\nu_0 = 4, \ \boldsymbol{S}_0 = \left(\begin{array}{cc} 625 & 312.5 \\ 312.5 & 625 \end{array}\right)\right).\\ \end{eqnarray*}
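Before moving on, it can be useful to see what this prior on \Sigma implies. Here is a hedged R sketch that draws from \mathcal{IW}_2(4, \boldsymbol{S}_0) with riwish() from MCMCpack and summarizes the implied prior on the standard deviation of Y_1 and on the correlation \rho:
#Sketch: prior draws of Sigma to inspect the implied prior on sd(Y_1) and rho
library(MCMCpack)
set.seed(123)
S_0 <- matrix(c(625,312.5,312.5,625),nrow=2,ncol=2)
prior_Sigma <- replicate(5000, riwish(4, S_0))                  #2 x 2 x 5000 array
quantile(sqrt(prior_Sigma[1,1,]), c(0.025, 0.5, 0.975))         #implied prior on sigma_1
prior_rho <- prior_Sigma[1,2,]/sqrt(prior_Sigma[1,1,]*prior_Sigma[2,2,])
quantile(prior_rho, c(0.025, 0.5, 0.975))                       #implied prior on rho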
Now, to the data (finally!)
Y <- as.matrix(dget("http://www2.stat.duke.edu/~pdh10/FCBS/Inline/Y.reading"))
dim(Y)
## [1] 22 2
head(Y)
##      pretest posttest
## [1,]      59       77
## [2,]      43       39
## [3,]      34       46
## [4,]      32       26
## [5,]      42       38
## [6,]      38       43
summary(Y)
##     pretest         posttest    
##  Min.   :28.00   Min.   :26.00  
##  1st Qu.:34.25   1st Qu.:43.75  
##  Median :44.00   Median :52.00  
##  Mean   :47.18   Mean   :53.86  
##  3rd Qu.:58.00   3rd Qu.:60.00  
##  Max.   :72.00   Max.   :86.00
To recap, we have
\begin{split} \pi(\boldsymbol{\theta} | \Sigma, \boldsymbol{Y}) = \mathcal{N}_2(\boldsymbol{\mu}_n, \Lambda_n) \end{split}
where
\begin{split} \Lambda_n & = \left[\Lambda_0^{-1} + n\Sigma^{-1}\right]^{-1}\\ \\ \boldsymbol{\mu}_n & = \Lambda_n \left[\Lambda_0^{-1}\boldsymbol{\mu}_0 + n\Sigma^{-1} \bar{\boldsymbol{y}} \right],\\ \end{split}
\boldsymbol{\mu}_0 = (\mu_{0(1)},\mu_{0(2)})^T = (50,50)^T
\begin{eqnarray*} \Lambda_0 = \left(\begin{array}{cc} 156 & 78 \\ 78 & 156 \end{array}\right)\\ \end{eqnarray*}
We also have
\begin{split} \pi(\Sigma | \boldsymbol{\theta}, \boldsymbol{Y}) = \mathcal{IW}_2(\nu_n, \boldsymbol{S}_n) \end{split}
or using the notation in the book, \mathcal{IW}_2(\nu_n, \boldsymbol{S}_n^{-1} ), where
\begin{split} \nu_n & = \nu_0 + n\\ \\ \boldsymbol{S}_n & = \left[\boldsymbol{S}_0 + \boldsymbol{S}_\theta \right]\\ & = \left[\boldsymbol{S}_0 + \sum^n_{i=1}(\boldsymbol{y}_i - \boldsymbol{\theta})(\boldsymbol{y}_i - \boldsymbol{\theta})^T \right]. \end{split}
\nu_0 = p + 2 = 4
\begin{eqnarray*} \Sigma_0 = \left(\begin{array}{cc} 625 & 312.5 \\ 312.5 & 625 \end{array}\right)\\ \end{eqnarray*}
#Data summaries
n <- nrow(Y)
ybar <- apply(Y,2,mean)

#Hyperparameters for the priors
mu_0 <- c(50,50)
Lambda_0 <- matrix(c(156,78,78,156),nrow=2,ncol=2)
nu_0 <- 4
S_0 <- matrix(c(625,312.5,312.5,625),nrow=2,ncol=2)

#Initial values for Gibbs sampler
#No need to set initial value for theta, we can simply sample it first
Sigma <- cov(Y)

#Set null matrices to save samples
THETA <- SIGMA <- NULL
Next, we need to write the code for the Gibbs sampler.
#Now, to the Gibbs sampler
library(mvtnorm)  #for multivariate normal
library(MCMCpack) #for inverse-Wishart

#first set number of iterations and burn-in, then set seed
n_iter <- 10000; burn_in <- 0.3*n_iter
set.seed(1234)

for (s in 1:(n_iter+burn_in)){
  ##update theta using its full conditional
  Lambda_n <- solve(solve(Lambda_0) + n*solve(Sigma))
  mu_n <- Lambda_n %*% (solve(Lambda_0)%*%mu_0 + n*solve(Sigma)%*%ybar)
  theta <- rmvnorm(1,mu_n,Lambda_n)

  #update Sigma
  S_theta <- (t(Y)-c(theta))%*%t(t(Y)-c(theta))
  S_n <- S_0 + S_theta
  nu_n <- nu_0 + n
  Sigma <- riwish(nu_n, S_n)

  #save results only past burn-in
  if(s > burn_in){
    THETA <- rbind(THETA,theta)
    SIGMA <- rbind(SIGMA,c(Sigma))
  }
}
colnames(THETA) <- c("theta_1","theta_2")
colnames(SIGMA) <- c("sigma_11","sigma_12","sigma_21","sigma_22") #symmetry in sigma
Note that the text also has a function to sample from the Wishart distribution.
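For completeness, here is a hedged sketch of that alternative: inside the loop, we could draw the precision matrix \Phi = \Sigma^{-1} from its Wishart full conditional and then invert, which is equivalent to the riwish() draw above (rwish() is MCMCpack's Wishart sampler; note that it takes the scale matrix \boldsymbol{S}_n^{-1}):
#Sketch: equivalent Sigma update via the Wishart on the precision matrix
#(would replace the "Sigma <- riwish(nu_n, S_n)" line inside the loop)
Phi <- rwish(nu_n, solve(S_n))   #Phi = Sigma^{-1} ~ W_p(nu_n, S_n^{-1})
Sigma <- solve(Phi)              #invert the precision to recover the covariance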
library(coda)
THETA.mcmc <- mcmc(THETA,start=1); summary(THETA.mcmc)
## 
## Iterations = 1:10000
## Thinning interval = 1 
## Number of chains = 1 
## Sample size per chain = 10000 
## 
## 1. Empirical mean and standard deviation for each variable,
##    plus standard error of the mean:
## 
##          Mean    SD Naive SE Time-series SE
## theta_1 47.30 2.956  0.02956        0.02956
## theta_2 53.69 3.290  0.03290        0.03290
## 
## 2. Quantiles for each variable:
## 
##          2.5%   25%   50%   75% 97.5%
## theta_1 41.55 45.35 47.36 49.23 53.08
## theta_2 47.08 51.53 53.69 55.82 60.13
effectiveSize(THETA.mcmc)
## theta_1 theta_2 
##   10000   10000
SIGMA.mcmc <- mcmc(SIGMA,start=1); summary(SIGMA.mcmc)
## 
## Iterations = 1:10000
## Thinning interval = 1 
## Number of chains = 1 
## Sample size per chain = 10000 
## 
## 1. Empirical mean and standard deviation for each variable,
##    plus standard error of the mean:
## 
##           Mean    SD Naive SE Time-series SE
## sigma_11 202.3 63.39   0.6339         0.6511
## sigma_12 155.3 60.92   0.6092         0.6244
## sigma_21 155.3 60.92   0.6092         0.6244
## sigma_22 260.1 81.96   0.8196         0.8352
## 
## 2. Quantiles for each variable:
## 
##            2.5%   25%   50%   75% 97.5%
## sigma_11 113.50 158.2 190.8 234.8 357.3
## sigma_12  67.27 113.2 144.7 186.5 305.4
## sigma_21  67.27 113.2 144.7 186.5 305.4
## sigma_22 145.84 203.2 244.6 300.9 461.0
effectiveSize(SIGMA.mcmc)
## sigma_11 sigma_12 sigma_21 sigma_22 
## 9478.710 9517.989 9517.989 9629.352
plot(THETA.mcmc[,"theta_1"])
Looks good!
plot(THETA.mcmc[,"theta_2"])
Looks good!
plot(SIGMA.mcmc[,"sigma_11"])
Looks good!
plot(SIGMA.mcmc[,"sigma_12"])
Looks good!
plot(SIGMA.mcmc[,"sigma_22"])
Looks good!
autocorr.plot(THETA.mcmc[,"theta_1"])
Looks good!
autocorr.plot(THETA.mcmc[,"theta_2"])
Looks good!
autocorr.plot(SIGMA.mcmc[,"sigma_11"])
Looks good!
autocorr.plot(SIGMA.mcmc[,"sigma_12"])
Looks good!
autocorr.plot(SIGMA.mcmc[,"sigma_22"])
Looks good!
Questions of interest: Do students improve in reading comprehension on average?
Need to compute \Pr[\theta_2 > \theta_1 | \boldsymbol{Y}]. In R,
mean(THETA[,2]>THETA[,1])
## [1] 0.992
Questions of interest: If so, by how much?
Need posterior summaries of \theta_2 - \theta_1 | \boldsymbol{Y}. In R,
mean(THETA[,2] - THETA[,1])
## [1] 6.385515
quantile(THETA[,2] - THETA[,1], prob=c(0.025, 0.5, 0.975))
## 2.5% 50% 97.5% ## 1.233154 6.385597 11.551304
Questions of interest: Can we predict post-test score from pre-test score? How correlated are they?
We can compute \Pr[\sigma_{12} > 0 | \boldsymbol{Y}]. In R,
mean(SIGMA[,2]>0)
## [1] 1
We can also look at the distribution of \rho instead. In R,
CORR <- SIGMA[,2]/(sqrt(SIGMA[,1])*sqrt(SIGMA[,4]))quantile(CORR,prob=c(0.025, 0.5, 0.975))
## 2.5% 50% 97.5% ## 0.4046817 0.6850218 0.8458880
The median correlation between the two scores is 0.69, with a 95% quantile-based credible interval of (0.40, 0.85).
Because the density is skewed, we may prefer the 95% HPD interval, which is (0.45, 0.88).
library(hdrcde)
hdr(CORR,prob=95)$hdr
##          [,1]      [,2]
## 95% 0.4468522 0.8761174
Clearly, there's a lot of work to be done in specifying the hyperparameters (two of which are p \times p matrices).
What if we want to specify the priors so that we put in as little information as possible?
We already know how to do that somewhat with Jeffreys' priors.
For the multivariate normal model, it turns out that Jeffreys' rule for generating a prior distribution on (\boldsymbol{\theta}, \Sigma) gives
\pi(\boldsymbol{\theta}, \Sigma) \propto \left|\Sigma\right|^{-\frac{(p+2)}{2}}.
Can we derive the full conditionals under this prior?
To be done on the board.
We can leverage previous work. For the likelihood we have both
\begin{split} L(\boldsymbol{Y}; \boldsymbol{\theta}, \Sigma) & \propto \textrm{exp} \left\{-\dfrac{1}{2} \boldsymbol{\theta}^T(n\Sigma^{-1})\boldsymbol{\theta} + \boldsymbol{\theta}^T (n\Sigma^{-1} \bar{\boldsymbol{y}}) \right\} \end{split}
and
\begin{split} L(\boldsymbol{Y}; \boldsymbol{\theta}, \Sigma) & \propto \left|\Sigma\right|^{-\frac{n}{2}} \ \textrm{exp} \left\{-\dfrac{1}{2}\text{tr}\left[\boldsymbol{S}_\theta \Sigma^{-1} \right] \right\},\\ \end{split}
where \boldsymbol{S}_\theta = \sum^n_{i=1}(\boldsymbol{y}_i - \boldsymbol{\theta})(\boldsymbol{y}_i - \boldsymbol{\theta})^T.
Also, we can rewrite any \mathcal{N}_p(\boldsymbol{\mu}_0, \Lambda_0) as
\begin{split} p(\boldsymbol{\theta}) & \propto \textrm{exp} \left\{-\dfrac{1}{2} \boldsymbol{\theta}^T\Lambda_0^{-1}\boldsymbol{\theta} + \boldsymbol{\theta}^T\Lambda_0^{-1}\boldsymbol{\mu}_0 \right\}.\\ \end{split}
Finally, \Sigma \sim \mathcal{IW}_p(\nu_0, \boldsymbol{S}_0),
\begin{split} \Rightarrow \ \ p(\Sigma) \ \propto \ \left|\Sigma\right|^{\frac{-(\nu_0 + p + 1)}{2}} \textrm{exp} \left\{-\dfrac{1}{2} \text{tr}(\boldsymbol{S}_0\Sigma^{-1}) \right\}. \end{split}
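As a check on the board work: the Jeffreys' prior \pi(\boldsymbol{\theta}, \Sigma) \propto \left|\Sigma\right|^{-\frac{(p+2)}{2}} contributes no terms involving \boldsymbol{\theta}, so combining the pieces above suggests (as a sketch to verify)
\begin{split} \pi(\boldsymbol{\theta} | \Sigma, \boldsymbol{Y}) & \propto \textrm{exp} \left\{-\dfrac{1}{2} \boldsymbol{\theta}^T(n\Sigma^{-1})\boldsymbol{\theta} + \boldsymbol{\theta}^T (n\Sigma^{-1} \bar{\boldsymbol{y}}) \right\} \ \equiv \ \mathcal{N}_p\left(\bar{\boldsymbol{y}}, \dfrac{1}{n}\Sigma\right)\\ \\ \pi(\Sigma | \boldsymbol{\theta}, \boldsymbol{Y}) & \propto \left|\Sigma\right|^{-\frac{n}{2}} \textrm{exp} \left\{-\dfrac{1}{2}\text{tr}\left[\boldsymbol{S}_\theta \Sigma^{-1} \right] \right\} \cdot \left|\Sigma\right|^{-\frac{(p+2)}{2}} = \left|\Sigma\right|^{\frac{-((n+1) + p + 1)}{2}} \textrm{exp} \left\{-\dfrac{1}{2}\text{tr}\left[\boldsymbol{S}_\theta \Sigma^{-1} \right] \right\} \ \equiv \ \mathcal{IW}_p(n + 1, \boldsymbol{S}_\theta). \end{split}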