class: center, middle, inverse, title-slide

# Normal model cont’d

### Dr. Olanrewaju Michael Akande

### Jan 31, 2020

---

## Announcements

- Take "Participation Quiz II - Jan 31" on Sakai.

- Homework 3 now online.

--

## Outline

- Inference for mean, conditional on variance (cont'd)

- Noninformative and improper priors

- Joint inference for mean and variance

- Back to the examples

---

class: center, middle

# The univariate normal model (cont'd)

---

## Conditional inference for the mean (recap)

Normal data: `\(Y = (y_1,y_2,\ldots,y_n)\)`, where each
.block[
.small[
`$$y_i \sim \mathcal{N}(\mu, \sigma^2); \ \ \ \textrm{or} \ \ \ y_i \sim \mathcal{N}(\mu, \tau^{-1}).$$`
]
]

--

`\(+\)` Normal prior (when `\(\sigma^2\)`/ `\(\tau\)` is known):
.block[
.small[
`$$\mu|\sigma^2 \sim \mathcal{N}(\mu_0, \sigma_0^2); \ \ \ \textrm{or} \ \ \ \mu|\tau \sim \mathcal{N}(\mu_0, \tau_0^{-1}).$$`
]
]

--

`\(\Rightarrow\)` Normal posterior (in terms of precision):
.block[
.small[
`$$\mu|Y,\tau \sim \mathcal{N}(\mu_n, \tau_n^{-1}).$$`
]
]

--

where

+ `\(\mu_n = \dfrac{\tau n\bar{y} + \tau_0\mu_0}{\tau n + \tau_0}\)`; and

+ `\(\tau_n = \tau n + \tau_0\)`.

---

## Posterior with precision terms: combining information

- The posterior mean is a precision-weighted average of the prior mean and the sample mean:
.block[
.small[
$$
`\begin{split}
\mu_n & = \dfrac{n\tau\bar{y} + \tau_0\mu_0}{\tau n + \tau_0}\\
& = \dfrac{\tau_0}{\tau_0 + \tau n} \mu_0 + \dfrac{n\tau}{\tau_0 + \tau n} \bar{y}
\end{split}`
$$
]
]

--

- It is relatively easy to set `\(\mu_0\)` if we have a "prior" guess of the mean. What about `\(\tau_0\)`?

--

- If we think of the prior mean as being based on `\(\kappa_0\)` prior observations from a similar population as `\(Y\)`, then we might set `\(\sigma_0^2 = \dfrac{\sigma^2}{\kappa_0}\)`, which implies `\(\tau_0 = \kappa_0 \tau\)`.

--

- Then the posterior mean is given by
.block[
.small[
$$
`\begin{split}
\mu_n & = \dfrac{\kappa_0}{\kappa_0 + n} \mu_0 + \dfrac{n}{\kappa_0 + n} \bar{y}.
\end{split}`
$$
]
]

---

## Posterior with variance terms

- In terms of variances, we have
.block[
.large[
`$$\mu|Y,\sigma^2 \sim \mathcal{N}(\mu_n, \sigma_n^2)$$`
]
]

where
.block[
.small[
`$$\mu_n = \dfrac{ \dfrac{n}{\sigma^2}\bar{y} + \dfrac{1}{\sigma^2_0} \mu_0}{\dfrac{n}{\sigma^2} + \dfrac{1}{\sigma^2_0}} \ \ \ \textrm{and} \ \ \ \sigma^2_n = \dfrac{1}{\dfrac{n}{\sigma^2} + \dfrac{1}{\sigma^2_0}}.$$`
]
]

--

- It is easy to see that we can re-express the posterior information as a sum of the prior information and the information from the data.

---

## Posterior with variance terms

- If we once again set `\(\sigma_0^2 = \dfrac{\sigma^2}{\kappa_0}\)`, the posterior mean is still given by
.block[
.small[
$$
`\begin{split}
\mu_n & = \dfrac{\kappa_0}{\kappa_0 + n} \mu_0 + \dfrac{n}{\kappa_0 + n} \bar{y}.
\end{split}`
$$
]
]

--

- By the way, setting `\(\sigma_0^2 = \dfrac{\sigma^2}{\kappa_0}\)` `\(\Rightarrow\)` prior dependence between `\(\mu\)` and `\(\sigma^2\)`, whereas an arbitrary `\(\sigma_0^2\)`, independent of `\(\sigma^2\)`, `\(\Rightarrow\)` prior independence between `\(\mu\)` and `\(\sigma^2\)`.
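---

## Aside: the known-variance update in R

- A minimal sketch (not part of the original notes) of the update above, written in terms of precisions; the function name `post_mean_var` is illustrative. As a quick check, it is applied to the accelerated-group summaries from the Pygmalion example that appears later in these slides, treating the sample variance `\(s_A^2 = 22.17\)` as if it were the known `\(\sigma^2\)` and using a prior worth `\(\kappa_0 = 1\)` observation.

```r
# Known-variance normal-normal update: the posterior mean is a
# precision-weighted average of the prior mean and the sample mean.
post_mean_var <- function(ybar, n, sigma2, mu0, sigma02) {
  tau <- 1/sigma2; tau0 <- 1/sigma02    # data and prior precisions
  taun <- n*tau + tau0                  # posterior precision
  mun <- (n*tau*ybar + tau0*mu0)/taun   # precision-weighted average
  c(mu_n = mun, sigma2_n = 1/taun)
}

# illustrative numbers only: prior worth kappa0 = 1 observation,
# i.e., sigma02 = sigma2/kappa0
post_mean_var(ybar = 15.2, n = 6, sigma2 = 22.17, mu0 = 0, sigma02 = 22.17)
```

- The returned `mu_n` matches the `\(\mu_{nA} \approx 13.03\)` computed later under the mildly informative prior, since with `\(\sigma_0^2 = \sigma^2/\kappa_0\)` the posterior-mean weights involve only `\(\kappa_0\)` and `\(n\)`.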
---

## Noninformative and improper priors

- Generally, we must specify both `\(\mu_0\)` and `\(\tau_0\)` to do inference.

--

- When prior distributions have no population basis, that is, there is no justification of the prior as "prior data", prior distributions can be difficult to construct.

--

- To that end, there is often the desire to construct .hlight[noninformative priors], with the rationale being _"to let the data speak for themselves"_.

--

- For example, we could instead assume a uniform prior on `\(\mu\)` that is constant over the real line, i.e., `\(\pi(\mu) \propto 1\)` `\(\Rightarrow\)` all values on the real line are equally likely a priori.

--

- Clearly, this is not a valid pdf since it will not integrate to 1 over the real line. Such priors are known as .hlight[improper priors].

--

- An improper prior can still be very useful; we just need to ensure it results in a .hlight[proper posterior].

---

## Jeffreys' prior

- Question: is there a prior pdf (for a given model) that would be universally accepted as a noninformative prior?

--

- Laplace proposed the uniform distribution. This proposal lacks invariance under monotone transformations of the parameter.

--

- For example, a uniform prior on the binomial proportion parameter `\(\theta\)` is not the same as a uniform prior on the odds parameter `\(\phi = \dfrac{\theta}{1-\theta}\)`.

--

- A more acceptable approach was introduced by Jeffreys. For single-parameter models, the .hlight[Jeffreys' prior] defines a noninformative prior density for a parameter `\(\theta\)` as
.block[
.small[
`$$\pi(\theta) \propto \sqrt{\mathcal{I}(\theta)},$$`
]
]

where `\(\mathcal{I}(\theta)\)` is the .hlight[Fisher information] for `\(\theta\)`.

---

## Jeffreys' prior

- The Fisher information gives a way to measure the amount of information a random variable `\(Y\)` carries about an unknown parameter `\(\theta\)` of a distribution that describes `\(Y\)`.

--

- Formally, `\(\mathcal{I}(\theta)\)` is defined as
.block[
.small[
`$$\mathcal{I}(\theta) = \mathbb{E} \left[ \left( \dfrac{\partial}{\partial \theta} \textrm{log} f(y; \theta) \right)^2 \bigg| \theta \right] = \int_\mathcal{Y} \left( \dfrac{\partial}{\partial \theta} \textrm{log} f(y; \theta) \right)^2 f(y; \theta) dy.$$`
]
]

--

- Alternatively,
.block[
.small[
`$$\mathcal{I}(\theta) = - \mathbb{E} \left[ \dfrac{\partial^2}{\partial \theta^2} \textrm{log} f(y; \theta) \bigg| \theta \right] = - \int_\mathcal{Y} \left( \dfrac{\partial^2}{\partial \theta^2} \textrm{log} f(y; \theta) \right) f(y; \theta) dy.$$`
]
]

--

- It turns out that the Jeffreys' prior for `\(\mu\)` under the normal model, when `\(\sigma^2\)` is known, is
.block[
.small[
`$$\pi(\mu) \propto 1,$$`
]
]

the uniform prior over the real line. You should try to derive this.

---

## Inference for mean, conditional on variance using Jeffreys' prior

- Recall that for `\(\sigma^2\)` known, the normal likelihood simplifies to
.block[
.small[
`$$\propto \ \textrm{exp}\left\{-\dfrac{1}{2} \tau n(\mu - \bar{y})^2 \right\},$$`
]
]

ignoring everything else that does not depend on `\(\mu\)`.

- With the Jeffreys' prior `\(\pi(\mu) \propto 1\)`, can we derive the posterior distribution?

---

## Inference for mean, conditional on variance using Jeffreys' prior

- Posterior:
.block[
.small[
$$
`\begin{split}
\pi(\mu|Y,\sigma^2) \ & \propto \ \textrm{exp}\left\{-\dfrac{1}{2} \tau n(\mu - \bar{y})^2 \right\} \pi(\mu)\\
& \propto \ \textrm{exp}\left\{-\dfrac{1}{2} \tau n(\mu - \bar{y})^2 \right\}.\\
\end{split}`
$$
]
]

--

- This is the kernel of a normal distribution with

  + mean `\(\bar{y}\)`, and

  + precision `\(n\tau\)` or variance `\(\dfrac{1}{n\tau} = \dfrac{\sigma^2}{n}\)`.

--

- Written differently, we have `\(\mu|Y,\sigma^2 \sim \mathcal{N}(\bar{y},\dfrac{\sigma^2}{n})\)`.

--

- <div class="question"> This should look familiar to you. Does it? </div>

---

## Joint inference for mean and variance

- What happens when `\(\sigma^2\)`/ `\(\tau\)` is unknown?
We need a joint prior `\(\pi(\mu,\sigma^2)\)` for `\(\mu\)` and `\(\sigma^2\)`.

--

- We can write the joint prior distribution for the mean and variance as the product of a conditional and a marginal distribution, so we can take advantage of our work so far.

--

- That is,
.block[
.small[
`$$\pi(\mu,\sigma^2) \ = \ \pi(\mu | \sigma^2)\pi(\sigma^2).$$`
]
]

--

- For `\(\pi(\sigma^2)\)`, we need a distribution with support on `\((0,\infty)\)`. One such family is the gamma family, but this is NOT conjugate for the variance of a normal distribution.

--

- The gamma distribution is, however, conjugate for the precision `\(\tau\)`, and in that case, we say that `\(\sigma^2\)` has an .hlight[inverse-gamma] distribution.

---

## Conditional specification of prior

- Once again, suppose `\(Y = (y_1,y_2,\ldots,y_n)\)`, where each
.block[
.small[
`$$y_i \sim \mathcal{N}(\mu, \sigma^2); \ \ \ \textrm{or} \ \ \ y_i \sim \mathcal{N}(\mu, \tau^{-1}).$$`
]
]

--

- A conjugate joint prior is given by
.block[
.small[
$$
`\begin{split}
\tau = \dfrac{1}{\sigma^2} \ & \sim \textrm{Gamma}\left(\dfrac{\nu_0}{2}, \dfrac{\nu_0\sigma_0^2}{2}\right)\\
\mu|\sigma^2 & \sim \mathcal{N}\left(\mu_0, \dfrac{\sigma^2}{\kappa_0}\right) \ \ \ \textrm{or} \ \ \ \mu|\tau \sim \mathcal{N}\left(\mu_0, \dfrac{1}{\kappa_0 \tau}\right).\\
\end{split}`
$$
]
]

--

- This is often called a .hlight[normal-gamma] prior distribution.

--

- `\(\sigma_0^2\)` is the prior guess for `\(\sigma^2\)`, while `\(\nu_0\)` is often referred to as the "prior degrees of freedom", reflecting our degree of confidence in `\(\sigma_0^2\)`.

--

- We do not have conjugacy if we replace `\(\dfrac{\sigma^2}{\kappa_0}\)` in the normal prior with an arbitrary prior variance independent of `\(\sigma^2\)`. To do inference in that scenario, we need .hlight[Gibbs sampling] (to come next week!).

---

## Posterior for the mean given variance, under normal-gamma prior

- Based on the normal-gamma prior, we need `\(\pi(\mu|Y,\sigma^2)\)` and `\(\pi(\tau|Y)\)`.

--

- For `\(\pi(\mu|Y,\sigma^2)\)`, we can leverage our previous results. We have
.block[
.large[
`$$\mu|Y,\sigma^2 \sim \mathcal{N}(\mu_n, \dfrac{\sigma^2}{\kappa_n}) \ \ \ \textrm{or} \ \ \ \mu|Y,\tau \sim \mathcal{N}(\mu_n, \dfrac{1}{\kappa_n \tau})$$`
]
]

where
.block[
.small[
`$$\mu_n = \dfrac{ \dfrac{n}{\sigma^2}\bar{y} + \dfrac{\kappa_0}{\sigma^2} \mu_0}{\dfrac{n}{\sigma^2} + \dfrac{\kappa_0}{\sigma^2}} = \dfrac{\kappa_0 \mu_0 + n\bar{y}}{\kappa_n} = \dfrac{\kappa_0}{\kappa_n} \mu_0 + \dfrac{n}{\kappa_n} \bar{y} \ \ \ \textrm{and} \ \ \ \kappa_n = \kappa_0 + n.$$`
]
]

--

- `\(\mu_n\)` is simply the sample mean of the current and prior observations, and the posterior variance of `\(\mu\)` given `\(\sigma^2\)` is `\(\sigma^2\)` divided by the total number of observations (prior and current).

---

## Posterior derivation

- Some algebra is required to get the marginal posterior of `\(\tau\)`. Let's write the full joint posterior and go from there. We must keep some of the terms we discarded in the last lecture.
--

- Recall the likelihood
.block[
.large[
`$$L(Y; \mu,\tau) \propto \tau^{\dfrac{n}{2}} \ \textrm{exp}\left\{-\dfrac{1}{2} \tau s^2(n-1) \right\} \ \textrm{exp}\left\{-\dfrac{1}{2} \tau n(\mu - \bar{y})^2 \right\},$$`
]
]

--

- Now, `\(\mu|\tau \sim \mathcal{N}\left(\mu_0, \dfrac{1}{\kappa_0 \tau}\right)\)` `\(\Rightarrow\)`
.block[
.small[
`$$\pi(\mu|\tau) \propto \ \textrm{exp}\left\{-\dfrac{1}{2} \kappa_0 \tau (\mu - \mu_0)^2 \right\}.$$`
]
]

--

- and `\(\tau \sim \textrm{Ga}\left(\dfrac{\nu_0}{2}, \dfrac{\nu_0\sigma_0^2}{2}\right)\)` `\(\Rightarrow\)`
.block[
.small[
`$$\pi(\tau) \propto \tau^{\dfrac{\nu_0}{2}-1} \textrm{exp}\left\{-\dfrac{\tau\nu_0\sigma_0^2}{2}\right\}.$$`
]
]

---

## Posterior derivation

.block[
.small[
$$
`\begin{split}
\Rightarrow \pi(\mu,\tau | Y) \ & \boldsymbol{ \propto} \ \pi(\mu|\sigma^2) \times \pi(\tau) \times L(Y; \mu,\sigma^2)\\
\\
& \boldsymbol{ \propto} \underbrace{\textrm{exp}\left\{-\dfrac{1}{2} \kappa_0 \tau (\mu - \mu_0)^2 \right\}}_{\propto \ \pi(\mu|\sigma^2)} \times \underbrace{\tau^{\dfrac{\nu_0}{2}-1} \textrm{exp}\left\{-\dfrac{\tau\nu_0\sigma_0^2}{2}\right\}}_{\propto \ \pi(\tau)}\\
& \ \ \ \ \ \ \ \ \ \ \ \ \times \underbrace{\tau^{\dfrac{n}{2}} \ \textrm{exp}\left\{-\dfrac{1}{2} \tau s^2(n-1) \right\} \ \textrm{exp}\left\{-\dfrac{1}{2} \tau n(\mu - \bar{y})^2 \right\}}_{\propto \ L(Y; \mu,\sigma^2)}\\
& \boldsymbol{=} \underbrace{\textrm{exp}\left\{-\dfrac{1}{2} \kappa_0 \tau (\mu - \mu_0)^2 \right\} \ \textrm{exp}\left\{-\dfrac{1}{2} \tau n(\mu - \bar{y})^2 \right\}}_{\textrm{Terms involving } \mu}\\
& \ \ \ \ \ \ \ \ \ \ \ \ \times \underbrace{\tau^{\dfrac{\nu_0}{2}-1} \textrm{exp}\left\{-\dfrac{\tau\nu_0\sigma_0^2}{2}\right\} \tau^{\dfrac{n}{2}} \ \textrm{exp}\left\{-\dfrac{1}{2} \tau s^2(n-1) \right\}}_{\textrm{Terms NOT involving } \mu}\\
& \boldsymbol{=} \textrm{exp}\left\{-\dfrac{1}{2} \kappa_0 \tau (\mu^2 - 2\mu\mu_0 + \mu_0^2) \right\}\ \textrm{exp}\left\{-\dfrac{1}{2} \tau n(\mu^2 - 2\mu\bar{y} + \bar{y}^2) \right\}\\
& \ \ \ \ \ \ \ \ \ \ \ \ \times \tau^{\dfrac{\nu_0 + n}{2}-1} \textrm{exp}\left\{-\dfrac{\tau \left[\nu_0\sigma_0^2 + s^2(n-1) \right] }{2}\right\}\\
\end{split}`
$$
]
]

---

## Posterior derivation

.block[
.small[
$$
`\begin{split}
\pi(\mu,\tau | Y) \ & \boldsymbol{ \propto} \ \ \textrm{exp}\left\{-\dfrac{1}{2} \left[\kappa_0 \tau (\mu^2 - 2\mu\mu_0) + \tau n(\mu^2 - 2\mu\bar{y}) \right] \right\} \ \textrm{exp}\left\{-\dfrac{1}{2} \left[ \kappa_0 \tau \mu_0^2 + \tau n \bar{y}^2\right] \right\}\\
& \ \ \ \ \ \ \ \ \ \ \ \ \times \tau^{\dfrac{\nu_0 + n}{2}-1} \textrm{exp}\left\{-\dfrac{\tau \left[\nu_0\sigma_0^2 + s^2(n-1) \right] }{2}\right\}\\
\\
& \boldsymbol{=} \ \ \textrm{exp}\left\{-\dfrac{1}{2} \left[ \mu^2(n\tau + \kappa_0 \tau) - 2\mu(n\tau \bar{y} + \kappa_0 \tau\mu_0) \right] \right\} \ \textrm{exp}\left\{-\dfrac{1}{2} \left[ \kappa_0 \tau \mu_0^2 + \tau n \bar{y}^2\right] \right\}\\
& \ \ \ \ \ \ \ \ \ \ \ \ \times \tau^{\dfrac{\nu_0 + n}{2}-1} \textrm{exp}\left\{-\dfrac{\tau \left[\nu_0\sigma_0^2 + s^2(n-1) \right] }{2}\right\}\\
\end{split}`
$$
]
]

--

- Set `\(a^\star = (n\tau + \kappa_0 \tau)\)` and `\(b^\star = (n\tau \bar{y} + \kappa_0 \tau\mu_0)\)`, then complete the square for the first part like we did in the last lecture.
.block[
.small[
$$
`\begin{split}
\Rightarrow \ \pi(\mu,\tau | Y) \ & \boldsymbol{ \propto} \ \ \textrm{exp}\left\{-\dfrac{1}{2} \left[ \mu^2 a^\star - 2\mu b^\star \right] \right\} \ \textrm{exp}\left\{-\dfrac{1}{2} \left[ \kappa_0 \tau \mu_0^2 + \tau n \bar{y}^2\right] \right\}\\
& \ \ \ \ \ \ \ \ \ \ \ \ \times \tau^{\dfrac{\nu_0 + n}{2}-1} \textrm{exp}\left\{-\dfrac{\tau \left[\nu_0\sigma_0^2 + s^2(n-1) \right] }{2}\right\}\\
\end{split}`
$$
]
]

---

## Posterior derivation

.block[
.small[
$$
`\begin{split}
\Rightarrow \ \pi(\mu,\tau | Y) \ & \boldsymbol{ \propto} \ \ \textrm{exp}\left\{-\dfrac{1}{2} a^\star \left[ \mu- \dfrac{b^\star}{a^\star} \right]^2 + \dfrac{(b^\star)^2}{2a^\star} \right\} \ \textrm{exp}\left\{-\dfrac{1}{2} \left[ \kappa_0 \tau \mu_0^2 + \tau n \bar{y}^2\right] \right\}\\
& \ \ \ \ \ \ \ \ \ \ \ \ \times \tau^{\dfrac{\nu_0 + n}{2}-1} \textrm{exp}\left\{-\dfrac{\tau \left[\nu_0\sigma_0^2 + s^2(n-1) \right] }{2}\right\}\\
& \boldsymbol{=} \ \ \textrm{exp}\left\{-\dfrac{1}{2} a^\star \left[ \mu- \dfrac{b^\star}{a^\star} \right]^2 \right\} \ \textrm{exp}\left\{-\dfrac{1}{2} \left[ \kappa_0 \tau \mu_0^2 + \tau n \bar{y}^2 - \dfrac{(b^\star)^2}{a^\star}\right] \right\}\\
& \ \ \ \ \ \ \ \ \ \ \ \ \times \tau^{\dfrac{\nu_0 + n}{2}-1} \textrm{exp}\left\{-\dfrac{\tau \left[\nu_0\sigma_0^2 + s^2(n-1) \right] }{2}\right\}\\
& \boldsymbol{=} \ \ \textrm{exp}\left\{-\dfrac{1}{2} a^\star \left[ \mu- \dfrac{b^\star}{a^\star} \right]^2 \right\} \ \textrm{exp} \underbrace{\left\{-\dfrac{1}{2} \left[ \kappa_0 \tau \mu_0^2 + \tau n \bar{y}^2 - \dfrac{(n\tau \bar{y} + \kappa_0 \tau\mu_0)^2}{(n\tau + \kappa_0 \tau)}\right] \right\}}_{\textrm{Expand terms and recombine}} \\
& \ \ \ \ \ \ \ \ \ \ \ \ \times \tau^{\dfrac{\nu_0 + n}{2}-1} \textrm{exp}\left\{-\dfrac{\tau \left[\nu_0\sigma_0^2 + s^2(n-1) \right] }{2}\right\}\\
& \boldsymbol{=} \ \ \textrm{exp}\left\{-\dfrac{1}{2} a^\star \left[ \mu- \dfrac{b^\star}{a^\star} \right]^2 \right\} \ \textrm{exp}\left\{-\dfrac{1}{2} \left[\dfrac{n\kappa_0\tau^2 (\mu_0^2- 2\mu_0\bar{y} + \bar{y}^2)}{\tau(n + \kappa_0)}\right] \right\}\\
& \ \ \ \ \ \ \ \ \ \ \ \ \times \tau^{\dfrac{\nu_0 + n}{2}-1} \textrm{exp}\left\{-\dfrac{\tau \left[\nu_0\sigma_0^2 + s^2(n-1) \right] }{2}\right\}\\
\end{split}`
$$
]
]

---

## Posterior derivation

.block[
.small[
$$
`\begin{split}
\Rightarrow \ \pi(\mu,\tau | Y) \ & \boldsymbol{ \propto} \ \ \textrm{exp}\left\{-\dfrac{1}{2} a^\star \left[ \mu- \dfrac{b^\star}{a^\star} \right]^2 \right\} \ \textrm{exp}\left\{-\dfrac{\tau}{2} \left[\dfrac{n\kappa_0 (\bar{y} - \mu_0)^2}{(n + \kappa_0)}\right] \right\}\\
& \ \ \ \ \ \ \ \ \ \ \ \ \times \tau^{\dfrac{\nu_0 + n}{2}-1} \textrm{exp}\left\{-\dfrac{\tau \left[\nu_0\sigma_0^2 + s^2(n-1) \right] }{2}\right\}\\
\\
& \boldsymbol{=} \ \ \textrm{exp} \underbrace{\left\{-\dfrac{1}{2} a^\star \left[ \mu- \dfrac{b^\star}{a^\star} \right]^2 \right\}}_{\textrm{Substitute the values for } a^\star \textrm{ and } b^\star \textrm{ back}} \\
& \ \ \ \ \ \ \ \ \ \ \ \ \times \tau^{\dfrac{\nu_0 + n}{2}-1} \textrm{exp}\left\{-\dfrac{\tau \left[\nu_0\sigma_0^2 + s^2(n-1) \right] }{2}\right\} \textrm{exp}\left\{-\dfrac{\tau}{2} \left[\dfrac{n\kappa_0 (\bar{y} - \mu_0)^2}{(n + \kappa_0)}\right] \right\}\\
\\
& \boldsymbol{=} \ \ \underbrace{\textrm{exp}\left\{-\dfrac{1}{2} (n\tau + \kappa_0 \tau) \left[ \mu - \dfrac{(n\tau \bar{y} + \kappa_0 \tau\mu_0)}{(n\tau + \kappa_0 \tau)} \right]^2 \right\}}_{\textrm{Normal Kernel}}\\
& \ \ \ \ \ \ \ \ \ \ \ \ \times \underbrace{\tau^{\dfrac{\nu_0 + n}{2}-1} \textrm{exp}\left\{-\dfrac{\tau}{2} \left[\nu_0\sigma_0^2 + s^2(n-1) + \dfrac{n\kappa_0}{(n + \kappa_0)} (\bar{y} - \mu_0)^2 \right] \right\}}_{\textrm{Gamma Kernel}}\\
\end{split}`
$$
]
]

---

## Posterior derivation

.block[
.small[
$$
`\begin{split}
\Rightarrow \ \pi(\mu,\tau | Y) \ & \boldsymbol{ \propto} \ \ \underbrace{\textrm{exp}\left\{-\dfrac{1}{2} \tau(n + \kappa_0) \left[ \mu - \dfrac{(n \bar{y} + \kappa_0 \mu_0)}{(n + \kappa_0)} \right]^2 \right\}}_{\textrm{Normal Kernel}}\\
& \ \ \ \ \ \ \ \ \ \ \ \ \times \underbrace{\tau^{\dfrac{\nu_0 + n}{2}-1} \textrm{exp}\left\{-\dfrac{\tau}{2} \left[\nu_0\sigma_0^2 + s^2(n-1) + \dfrac{n\kappa_0}{(n + \kappa_0)} (\bar{y} - \mu_0)^2 \right] \right\}}_{\textrm{Gamma Kernel}}\\
& \boldsymbol{=} \ \ \mathcal{N}\left(\mu_n, \dfrac{1}{\kappa_n\tau} \right) \times \textrm{Gamma}\left(\dfrac{\nu_n}{2}, \dfrac{\nu_n \sigma_n^2}{2}\right) = \pi(\mu|Y, \tau) \pi(\tau|Y),
\end{split}`
$$
]
]

where
.block[
.small[
$$
`\begin{split}
\kappa_n & = \kappa_0 + n\\
\mu_n & = \dfrac{\kappa_0 \mu_0 + n\bar{y}}{\kappa_n} = \dfrac{\kappa_0}{\kappa_n} \mu_0 + \dfrac{n}{\kappa_n} \bar{y}\\
\nu_n & = \nu_0 + n\\
\sigma_n^2 & = \dfrac{1}{\nu_n}\left[\nu_0\sigma_0^2 + s^2(n-1) + \dfrac{n\kappa_0}{\kappa_n} (\bar{y} - \mu_0)^2 \right] = \dfrac{1}{\nu_n}\left[\nu_0\sigma_0^2 + \sum_{i=1}^n (y_i-\bar{y})^2 + \dfrac{n\kappa_0}{\kappa_n} (\bar{y} - \mu_0)^2 \right]\\
\end{split}`
$$
]
]

--

- It turns out that the marginal posterior of `\(\mu\)`, that is, `\(\pi(\mu|Y) = \int_0^\infty \pi(\mu, \tau|Y) d\tau\)`, is a **t-distribution**. You can derive that distribution if you are interested; we won't spend time on it in class.

---

## Back to our examples

- .hlight[Pygmalion: questions of interest]

  + Is the average improvement for the accelerated group larger than that for the no growth group?

    - What is `\(\Pr[\mu_A > \mu_N | Y_A, Y_N]\)`?

  + Is the variance of improvement scores for the accelerated group larger than that for the no growth group?

    - What is `\(\Pr[\sigma^2_A > \sigma^2_N | Y_A, Y_N]\)`?

--

- .hlight[Job training: questions of interest]

  + Is the average change in annual earnings for the training group larger than that for the no training group?

    - What is `\(\Pr[\mu_T > \mu_N | Y_T, Y_N]\)`?

  + Is the variance of change in annual earnings for the training group larger than that for the no training group?

    - What is `\(\Pr[\sigma^2_T > \sigma^2_N | Y_T, Y_N]\)`?

---

## Mildly informative priors

- We will focus on the Pygmalion study. Follow the same approach for the job training data.

--

- Suppose you have no idea whether students would improve IQ on average. Set `\(\mu_{0A} = \mu_{0N} = 0\)`.

--

- Suppose you don't have any faith in this belief, and think it is the equivalent of having only 1 prior observation in each group. Set `\(\kappa_{0A} = \kappa_{0N} = 1\)`.

--

- Based on the literature, the SD of change scores should be around 10 in each group, but still you don't have a lot of faith in this belief. Set `\(\nu_{0A} = \nu_{0N} = 1\)` and `\(\sigma^2_{0A} = \sigma^2_{0N} = 100\)`.

--

- Graph the priors to see if they accord with your beliefs. Sampling new values of `\(Y\)` from the priors offers a good check.

---

## Recall the Pygmalion data

- Data:

  + Accelerated group (A): 20, 10, 19, 15, 9, 18.

  + No growth group (N): 3, 2, 6, 10, 11, 5.

--

- Summary statistics:

  + `\(\bar{y}_A = 15.2\)`; `\(s_A = 4.71\)`.

  + `\(\bar{y}_N = 6.2\)`; `\(s_N = 3.65\)`.
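---

## Aside: posterior parameters in R

- A minimal sketch (not part of the original notes): the normal-gamma posterior hyperparameters can also be computed directly from the raw data. The helper `post_params` is just an illustrative name, and the prior settings are the mildly informative choices from the previous slides; its output should match the hand calculations on the next slide up to rounding.

```r
yA <- c(20, 10, 19, 15, 9, 18)  # accelerated group
yN <- c(3, 2, 6, 10, 11, 5)     # no growth group

# mildly informative prior settings from the previous slides
mu0 <- 0; kappa0 <- 1; nu0 <- 1; sigma02 <- 100

# normal-gamma posterior hyperparameters (kappa_n, nu_n, mu_n, sigma2_n)
post_params <- function(y, mu0, kappa0, nu0, sigma02) {
  n <- length(y); ybar <- mean(y); s2 <- var(y)
  kappan <- kappa0 + n
  nun <- nu0 + n
  mun <- (kappa0 * mu0 + n * ybar) / kappan
  sigma2n <- (nu0 * sigma02 + (n - 1) * s2 +
                n * kappa0 * (ybar - mu0)^2 / kappan) / nun
  c(kappa_n = kappan, nu_n = nun, mu_n = mun, sigma2_n = sigma2n)
}

post_params(yA, mu0, kappa0, nu0, sigma02)
post_params(yN, mu0, kappa0, nu0, sigma02)
```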
---

## Analysis with mildly informative priors

.block[
.small[
$$
`\begin{split}
\kappa_{nA} & = \kappa_{0A} + n_A = 1 + 6 = 7\\
\kappa_{nN} & = \kappa_{0N} + n_N = 1 + 6 = 7\\
\nu_{nA} & = \nu_{0A} + n_A = 1 + 6 = 7\\
\nu_{nN} & = \nu_{0N} + n_N = 1 + 6 = 7\\
\mu_{nA} & = \dfrac{\kappa_{0A} \mu_{0A} + n_A\bar{y}_A}{\kappa_{nA}} = \dfrac{ (1)(0) + (6)(15.2) }{7} \approx 13.03\\
\mu_{nN} & = \dfrac{\kappa_{0N} \mu_{0N} + n_N\bar{y}_N}{\kappa_{nN}} = \dfrac{ (1)(0) + (6)(6.2) }{7} \approx 5.31\\
\sigma_{nA}^2 & = \dfrac{1}{\nu_{nA}}\left[\nu_{0A}\sigma_{0A}^2 + s^2_A(n_A-1) + \dfrac{n_A\kappa_{0A}}{\kappa_{nA}} (\bar{y}_A - \mu_{0A})^2\right] \\
& = \dfrac{1}{7}\left[(1)(100) + (22.17)(5) + \dfrac{(6)(1)}{(7)} (15.2- 0)^2\right] \approx 58.41\\
\sigma_{nN}^2 & = \dfrac{1}{\nu_{nN}}\left[\nu_{0N}\sigma_{0N}^2 + s^2_N(n_N-1) + \dfrac{n_N\kappa_{0N}}{\kappa_{nN}} (\bar{y}_N - \mu_{0N})^2\right] \\
& = \dfrac{1}{7}\left[(1)(100) + (13.37)(5) + \dfrac{(6)(1)}{(7)} (6.2- 0)^2\right] \approx 28.54\\
\end{split}`
$$
]
]

---

## Analysis with mildly informative priors

- So our joint posterior is
.block[
.small[
$$
`\begin{split}
\mu_A | Y_A, \tau_A & \sim \ \mathcal{N}\left(\mu_{nA}, \dfrac{1}{\kappa_{nA}\tau_A} \right) = \mathcal{N}\left(13.03, \dfrac{1}{7\tau_A} \right)\\
\tau_A | Y_A & \sim \textrm{Gamma}\left(\dfrac{\nu_{nA}}{2}, \dfrac{\nu_{nA} \sigma_{nA}^2}{2}\right) = \textrm{Gamma}\left(\dfrac{7}{2}, \dfrac{7 (58.41)}{2}\right)\\
\mu_N | Y_N, \tau_N & \sim \ \mathcal{N}\left(\mu_{nN}, \dfrac{1}{\kappa_{nN}\tau_N} \right) = \mathcal{N}\left(5.31, \dfrac{1}{7\tau_N} \right)\\
\tau_N | Y_N & \sim \textrm{Gamma}\left(\dfrac{\nu_{nN}}{2}, \dfrac{\nu_{nN} \sigma_{nN}^2}{2}\right) = \textrm{Gamma}\left(\dfrac{7}{2}, \dfrac{7 (28.54)}{2}\right)\\
\end{split}`
$$
]
]

---

## Monte Carlo sampling

- To evaluate whether the accelerated group has larger IQ gains than the no growth group, we would like to estimate quantities like `\(\Pr[\mu_A > \mu_N | Y_A, Y_N]\)`, which are based on the **marginal posterior** of `\(\mu\)` rather than the **conditional distribution**.

--

- Fortunately, this is easy to do by generating samples of `\(\mu\)` and `\(\sigma^2\)` from their joint posterior.

---

## Monte Carlo sampling

- Suppose we simulate values using the following Monte Carlo procedure:
.block[
.small[
$$
`\begin{split}
\tau^{(1)} & \sim \textrm{Gamma}\left(\dfrac{\nu_{n}}{2}, \dfrac{\nu_{n} \sigma_{n}^2}{2}\right)\\
\mu^{(1)} & \sim \ \mathcal{N}\left(\mu_{n}, \dfrac{1}{\kappa_{n}\tau^{(1)}} \right)\\
\tau^{(2)} & \sim \textrm{Gamma}\left(\dfrac{\nu_{n}}{2}, \dfrac{\nu_{n} \sigma_{n}^2}{2}\right)\\
\mu^{(2)} & \sim \ \mathcal{N}\left(\mu_{n}, \dfrac{1}{\kappa_{n}\tau^{(2)}} \right)\\
& \ \ \vdots \\
& \ \ \vdots \\
& \ \ \vdots \\
\tau^{(m)} & \sim \textrm{Gamma}\left(\dfrac{\nu_{n}}{2}, \dfrac{\nu_{n} \sigma_{n}^2}{2}\right)\\
\mu^{(m)} & \sim \ \mathcal{N}\left(\mu_{n}, \dfrac{1}{\kappa_{n}\tau^{(m)}} \right)\\
\end{split}`
$$
]
]

---

## Monte Carlo sampling

- Note that we are sampling each `\(\mu^{(j)}\)`, `\(j=1,\ldots,m\)`, from its conditional distribution, not from the marginal.

--

- The pairs `\(\{(\tau, \mu)^{(1)}, \ldots, (\tau, \mu)^{(m)}\}\)` simulated using this method are independent samples from the joint posterior `\(\pi(\mu,\tau | Y)\)`.

--

- Additionally, the simulated values `\(\{\mu^{(1)}, \ldots, \mu^{(m)}\}\)` are independent samples from the **marginal posterior distribution** of `\(\mu\)`.
--

- While this may seem odd, keep in mind that although we drew the `\(\mu\)`'s as conditional samples, each was conditional on a different value of `\(\tau\)`.

--

- Thus, together they constitute marginal samples of `\(\mu\)`.

---

## Monte Carlo sampling

It is easy to sample from these posteriors:

```r
# posterior hyperparameters from the previous slides
aA <- 7/2
aN <- 7/2
bA <- (7/2)*58.41
bN <- (7/2)*28.54
muA <- 13.03
muN <- 5.31
kappaA <- 7
kappaN <- 7

# sample tau, then mu given tau, for each group
tauA_postsample <- rgamma(10000,aA,bA)
thetaA_postsample <- rnorm(10000,muA,sqrt(1/(kappaA*tauA_postsample)))
tauN_postsample <- rgamma(10000,aN,bN)
thetaN_postsample <- rnorm(10000,muN,sqrt(1/(kappaN*tauN_postsample)))

# convert precisions to variances
sigma2A_postsample <- 1/tauA_postsample
sigma2N_postsample <- 1/tauN_postsample
```

---

## Monte Carlo sampling

- Is the average improvement for the accelerated group larger than that for the no growth group?

  + What is `\(\Pr[\mu_A > \mu_N | Y_A, Y_N]\)`?

```r
mean(thetaA_postsample > thetaN_postsample)
```

```
## [1] 0.9721
```

--

- Is the variance of improvement scores for the accelerated group larger than that for the no growth group?

  + What is `\(\Pr[\sigma^2_A > \sigma^2_N | Y_A, Y_N]\)`?

```r
mean(sigma2A_postsample > sigma2N_postsample)
```

```
## [1] 0.8185
```

--

- <div class="question"> What can we conclude from this? </div>

---

## Improper prior

- Let's be very objective with the prior selection. In fact, let's be extreme!

--

  + If we let the normal variance `\(\rightarrow \infty\)` then our prior on `\(\mu\)` is `\(\propto 1\)` (recall the Jeffreys' prior on `\(\mu\)` for known `\(\sigma^2\)`).

--

  + If we let the gamma variance get very large (e.g., `\(a,b \rightarrow 0\)`), then the prior on `\(\sigma^2\)` is `\(\propto \dfrac{1}{\sigma^2}\)`.

--

- `\(\pi(\mu,\sigma^2) \propto \dfrac{1}{\sigma^2}\)` is improper (does not integrate to 1) but does lead to a proper posterior distribution that yields inferences similar to frequentist ones.

--

- For that choice, we have
.block[
.small[
$$
`\begin{split}
\mu|Y,\tau & \sim \mathcal{N}\left(\bar{y},\dfrac{1}{n \tau}\right)\\
\tau | Y & \sim \textrm{Gamma}\left(\dfrac{n-1}{2}, \dfrac{(n-1)s^2}{2}\right)\\
\end{split}`
$$
]
]

---

## Analysis with noninformative priors

- So our joint posterior is
.block[
.small[
$$
`\begin{split}
\mu_A | Y_A, \tau_A & \sim \ \mathcal{N}\left(\bar{y}_A,\dfrac{1}{n_A \tau_A}\right) = \mathcal{N}\left(15.2, \dfrac{1}{6\tau_A} \right)\\
\tau_A | Y_A & \sim \textrm{Gamma}\left(\dfrac{n_A-1}{2}, \dfrac{(n_A-1)s^2_A}{2}\right) = \textrm{Gamma}\left(\dfrac{6-1}{2}, \dfrac{(6-1)(22.17)}{2}\right)\\
\mu_N | Y_N, \tau_N & \sim \ \mathcal{N}\left(\bar{y}_N,\dfrac{1}{n_N \tau_N}\right) = \mathcal{N}\left(6.2, \dfrac{1}{6\tau_N} \right)\\
\tau_N | Y_N & \sim \textrm{Gamma}\left(\dfrac{n_N-1}{2}, \dfrac{(n_N-1)s^2_N}{2}\right) = \textrm{Gamma}\left(\dfrac{6-1}{2}, \dfrac{(6-1)(13.37)}{2}\right)\\
\end{split}`
$$
]
]

---

## Monte Carlo sampling

It is easy to sample from these posteriors:

```r
# posterior hyperparameters under the improper prior
aA <- (6-1)/2
aN <- (6-1)/2
bA <- (6-1)*22.17/2
bN <- (6-1)*13.37/2
muA <- 15.2
muN <- 6.2

# sample tau, then mu given tau, for each group
tauA_postsample_impr <- rgamma(10000,aA,bA)
thetaA_postsample_impr <- rnorm(10000,muA,sqrt(1/(6*tauA_postsample_impr)))
tauN_postsample_impr <- rgamma(10000,aN,bN)
thetaN_postsample_impr <- rnorm(10000,muN,sqrt(1/(6*tauN_postsample_impr)))

# convert precisions to variances
sigma2A_postsample_impr <- 1/tauA_postsample_impr
sigma2N_postsample_impr <- 1/tauN_postsample_impr
```

---

## Monte Carlo sampling
- Is the average improvement for the accelerated group larger than that for the no growth group?

  + What is `\(\Pr[\mu_A > \mu_N | Y_A, Y_N]\)`?

```r
mean(thetaA_postsample_impr > thetaN_postsample_impr)
```

```
## [1] 0.9941
```

--

- Is the variance of improvement scores for the accelerated group larger than that for the no growth group?

  + What is `\(\Pr[\sigma^2_A > \sigma^2_N | Y_A, Y_N]\)`?

```r
mean(sigma2A_postsample_impr > sigma2N_postsample_impr)
```

```
## [1] 0.7113
```

--

- <div class="question"> How does the new choice of prior affect our conclusions? </div>

---

## Recall the job training data

- Data:

  + No training group (N): sample size `\(n_N = 429\)`.

  + Training group (T): sample size `\(n_T = 185\)`.

--

- Summary statistics for change in annual earnings:

  + `\(\bar{y}_N = 1364.93\)`; `\(s_N = 7460.05\)`.

  + `\(\bar{y}_T = 4253.57\)`; `\(s_T = 8926.99\)`.

--

- <div class="question"> Using the same approach we used for the Pygmalion data, answer the questions of interest. </div>
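---

## Starter code for the job training data

- As a starting point (a sketch, not a full solution from the original notes): under the improper prior `\(\pi(\mu,\sigma^2) \propto \dfrac{1}{\sigma^2}\)`, the posteriors depend only on the summary statistics above, so the Monte Carlo step mirrors the Pygmalion code. The object names below are illustrative.

```r
# summary statistics from the previous slide
nN <- 429; ybarN <- 1364.93; sN <- 7460.05   # no training group
nT <- 185; ybarT <- 4253.57; sT <- 8926.99   # training group

# posterior draws under the improper prior:
# tau | Y ~ Gamma((n-1)/2, (n-1)s^2/2),  mu | Y, tau ~ N(ybar, 1/(n*tau))
tauN_post <- rgamma(10000, (nN - 1)/2, (nN - 1)*sN^2/2)
muN_post  <- rnorm(10000, ybarN, sqrt(1/(nN*tauN_post)))
tauT_post <- rgamma(10000, (nT - 1)/2, (nT - 1)*sT^2/2)
muT_post  <- rnorm(10000, ybarT, sqrt(1/(nT*tauT_post)))

mean(muT_post > muN_post)         # Pr[mu_T > mu_N | Y_T, Y_N]
mean(1/tauT_post > 1/tauN_post)   # Pr[sigma^2_T > sigma^2_N | Y_T, Y_N]
```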