class: center, middle, inverse, title-slide

# Gibbs sampling
### Dr. Olanrewaju Michael Akande
### Feb 5, 2020

---

## Announcements

- Homework 3 due tomorrow.

--

## Outline

- Non-conjugate priors

- Full conditionals

- Gibbs sampling

- A simple example: bivariate normal

- In-class exercise

---

## Bayesian inference (conjugacy recap)

- As we've seen so far, Bayesian inference is based on posterior distributions, that is,
.block[
.small[
`$$p(\theta | y) = \frac{p(\theta)L(y;\theta)}{\int_{\Theta}p(\tilde{\theta})L(y; \tilde{\theta}) \textrm{d}\tilde{\theta}} = \frac{p(\theta)L(y;\theta)}{L(y)}$$`
]
]

--

- Good news: we have the numerator in this expression.

--

- Bad news: the denominator is typically not available (it may involve a high-dimensional integral)!

--

- How have we been getting by? Conjugacy! For conjugate priors, the posterior distribution of `\(\theta\)` is available analytically.

--

- What if a conjugate prior does not represent our prior information well, or we have a more complex model, and our posterior is no longer in a convenient distributional form?

---

## Some conjugate models

- We've already seen the following conjugate models.

Prior | Likelihood | Posterior |
:----------- | :---------: |:---------: |
beta | binomial | beta
gamma | Poisson | gamma
gamma | exponential | gamma
normal-gamma | normal | normal-gamma

--

- Here are a few more we have not covered yet.

Prior | Likelihood | Posterior |
:----------- | :---------: |:---------: |
beta | negative-binomial | beta
beta | geometric | beta
Dirichlet | multinomial | Dirichlet

--

- Clearly, we cannot restrict ourselves to conjugate models only.

---

## Back to the normal model

- For conjugacy in the normal model, we had
.block[
.small[
$$
`\begin{split}
\mu|\tau & \sim \mathcal{N}\left(\mu_0, \dfrac{1}{\kappa_0 \tau}\right),\\
\tau \ & \sim \textrm{Gamma}\left(\dfrac{\nu_0}{2}, \dfrac{\nu_0\sigma_0^2}{2}\right).\\
\end{split}`
$$
]
]

--

- Suppose we wish to specify our uncertainty about `\(\mu\)` as independent of `\(\tau\)`, that is, we want `\(\pi(\mu,\tau) = \pi(\mu)\pi(\tau)\)`. For example,
.block[
.small[
$$
`\begin{split}
\mu & \sim \mathcal{N}\left(\mu_0, \sigma_0^2\right),\\
\tau \ & \sim \textrm{Gamma}\left(\dfrac{\nu_0}{2}, \dfrac{\nu_0}{2\tau_0}\right).\\
\end{split}`
$$
]
]

--

- When `\(\sigma_0^2\)` is not proportional to `\(\dfrac{1}{\tau}\)`, the marginal posterior density of `\(\tau\)` is not a gamma density (or a density we can easily sample from).

--

- Side note: for this independent prior to be conjugate, the joint posterior would also need to be a product of independent Normal and Gamma densities in `\(\mu\)` and `\(\tau\)` respectively.

---

## Non-conjugate priors

- In general, conjugate priors are not available for generalized linear models (GLMs) other than the normal linear model.

--

- One can potentially rely on an asymptotic normal approximation.

--

- As `\(n \rightarrow \infty\)`, the posterior distribution is approximately normal, centered on the MLE.

--

- However, even for moderate sample sizes, asymptotic approximations may be inaccurate.

--

- In logistic regression for example, for rare outcomes or rare binary exposures, the posterior can be highly skewed.

--

- It is appealing to avoid any reliance on large-sample assumptions and to base inferences on the **exact posterior**.

---

## Non-conjugate priors

- Even though we may not be able to sample from the marginal posterior of a particular parameter when using a non-conjugate prior, sometimes we may still be able to sample from the conditional distribution of that parameter given all other parameters and the data.
--

- These conditional distributions, known as .hlight[full conditionals], will be very important for us.

--

- In our normal example with
.block[
.small[
$$
`\begin{split}
\mu & \sim \mathcal{N}\left(\mu_0, \sigma_0^2\right),\\
\tau \ & \sim \textrm{Gamma}\left(\dfrac{\nu_0}{2}, \dfrac{\nu_0}{2\tau_0}\right),\\
\end{split}`
$$
]
]

even though we cannot sample easily from `\(\tau | Y\)`, it turns out we will be able to sample from `\(\tau | \mu, Y\)`. That is the .hlight[full conditional] for `\(\tau\)`.

--

- By the way, note that we already know the full conditional for `\(\mu\)`, i.e., `\(\mu | \tau, Y\)` (last two classes).

---

## Full conditional distributions

- .hlight[Goal]: try to take advantage of those full conditional distributions (without sampling directly from the marginal posteriors) to obtain samples from those marginal posteriors.

--

- In our example, with `\(\pi(\mu) = \mathcal{N}\left(\mu_0, \sigma_0^2\right)\)`, we have
.block[
.small[
`$$\mu|Y,\tau \sim \mathcal{N}(\mu_n, \tau_n^{-1}),$$`
]
]

--

where
  + `\(\mu_n = \dfrac{\frac{\mu_0}{\sigma^2_0} + n\tau\bar{y}}{\frac{1}{\sigma^2_0} + n\tau}\)`; and
  + `\(\tau_n = \frac{1}{\sigma^2_0} + n\tau\)`.

--

- Review the results from the previous two classes if you are not sure why this holds.

--

- Let's see if we can figure out the other full conditional `\(\tau | \mu, Y\)`.

---

## Full conditional distributions

.block[
.small[
$$
`\begin{split}
\pi(\tau| \mu,Y) & \boldsymbol{=} \dfrac{\Pr[\tau,\mu,Y]}{\Pr[\mu,Y]} = \dfrac{L(y;\mu,\tau)\pi(\mu,\tau)}{\Pr[\mu,Y]}\\
& \ \ \ \ \ \ \ \ \ \ \ \ \\
& \boldsymbol{=} \dfrac{L(y;\mu,\tau)\pi(\mu)\pi(\tau)}{\Pr[\mu,Y]}\\
& \ \ \ \ \ \ \ \ \ \ \ \ \\
& \boldsymbol{\propto} L(y;\mu,\tau)\pi(\tau)\\
& \ \ \ \ \ \ \ \ \ \ \ \ \\
& \boldsymbol{\propto} \underbrace{\tau^{\dfrac{n}{2}} \ \textrm{exp}\left\{-\dfrac{1}{2} \tau \sum^n_{i=1} (y_i - \mu)^2 \right\}}_{\propto \ L(y; \mu,\tau)} \times \underbrace{\tau^{\dfrac{\nu_0}{2}-1} \textrm{exp}\left\{-\dfrac{\tau\nu_0}{2\tau_0}\right\}}_{\propto \ \pi(\tau)}\\
& \ \ \ \ \ \ \ \ \ \ \ \ \\
& \boldsymbol{=} \underbrace{\tau^{\dfrac{\nu_0 + n}{2} - 1} \ \textrm{exp}\left\{-\dfrac{1}{2} \tau \left[ \dfrac{\nu_0}{\tau_0} + \sum^n_{i=1} (y_i - \mu)^2 \right] \right\}}_{\textrm{Gamma Kernel}}.
\end{split}`
$$
]
]

---

## Full conditional distributions

.block[
.small[
$$
`\begin{split}
\pi(\tau| \mu,Y) & \boldsymbol{\propto} \underbrace{\tau^{\dfrac{\nu_0 + n}{2} - 1} \ \textrm{exp}\left\{-\dfrac{1}{2} \tau \left[ \dfrac{\nu_0}{\tau_0} + \sum^n_{i=1} (y_i - \mu)^2 \right] \right\}}_{\textrm{Gamma Kernel}}\\
& \boldsymbol{=} \textrm{Gamma}\left(\dfrac{\nu_n}{2}, \dfrac{\nu_n}{2 \tau_n(\mu)}\right) \ \ \ \textrm{OR} \ \ \ \textrm{Gamma}\left(\dfrac{\nu_n}{2}, \dfrac{\nu_n\sigma_n^2(\mu)}{2}\right),
\end{split}`
$$
]
]

where
.block[
.small[
$$
`\begin{split}
\nu_n & = \nu_0 + n\\
& \ \ \ \ \ \ \ \ \ \ \ \ \\
\sigma_n^2(\mu) & = \dfrac{1}{\nu_n} \left[ \dfrac{\nu_0}{\tau_0} + \sum^n_{i=1} (y_i - \mu)^2 \right] = \dfrac{1}{\nu_n}\left[ \dfrac{\nu_0}{\tau_0} + ns^2_n(\mu) \right]\\
& \ \ \ \ \ \ \ \ \ \ \ \ \\
\ \ \textrm{OR} \ \ \tau_n(\mu) & = \dfrac{\nu_n}{\left[ \dfrac{\nu_0}{\tau_0} + \sum^n_{i=1} (y_i - \mu)^2 \right]} = \dfrac{\nu_n}{\left[ \dfrac{\nu_0}{\tau_0} + ns^2_n(\mu) \right]};\\
& \ \ \ \ \ \ \ \ \ \ \ \ \\
\textrm{with} \ \ \ s^2_n(\mu) & = \dfrac{1}{n} \sum^n_{i=1} (y_i - \mu)^2.\\
\end{split}`
$$
]
]
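---

## Full conditional distributions

- To make the algebra concrete, here is a minimal sketch (not from the original slides) of a single draw from each full conditional in R. The data `y` and all hyperparameter and starting values below are made up purely for illustration; note that `rgamma()` is parameterized by shape and **rate**.

```r
# Hedged illustration: one draw from each full conditional for the
# semi-conjugate normal model. Data and hyperparameter values are made up.
set.seed(360)
y <- rnorm(20, mean = 5, sd = 2)   # hypothetical data
n <- length(y); ybar <- mean(y)
mu0 <- 0; sigma02 <- 100           # prior: mu ~ N(mu0, sigma02)
nu0 <- 1; tau0 <- 1                # prior: tau ~ Gamma(nu0/2, nu0/(2*tau0))
mu <- ybar; tau <- 1/var(y)        # arbitrary current values

# tau | mu, Y ~ Gamma(nu_n/2, [nu0/tau0 + sum((y - mu)^2)]/2)
nu_n <- nu0 + n
tau <- rgamma(1, shape = nu_n/2, rate = (nu0/tau0 + sum((y - mu)^2))/2)

# mu | tau, Y ~ N(mu_n, 1/tau_n)
tau_n <- 1/sigma02 + n*tau
mu_n <- (mu0/sigma02 + n*tau*ybar)/tau_n
mu <- rnorm(1, mean = mu_n, sd = sqrt(1/tau_n))
```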
---

## Iterative scheme

- Now we have two full conditional distributions but what we really need is to sample from `\(\pi(\tau | Y)\)`.

--

- Actually, if we could sample from `\(\pi(\mu, \tau| Y)\)`, we already know that the draws for `\(\mu\)` and `\(\tau\)` will be from the two marginal posterior distributions. So, we just need a scheme to sample from `\(\pi(\mu, \tau| Y)\)`.

--

- Suppose we had a single sample, say `\(\tau^{(1)}\)`, from the marginal posterior distribution `\(\pi(\tau | Y)\)`. Then we could sample
.block[
.small[
`$$\mu^{(1)} \sim p(\mu | \tau^{(1)}, Y).$$`
]
]

--

- This is what we did in the last class, so that the pair `\(\{\mu^{(1)},\tau^{(1)}\}\)` is a sample from the joint posterior `\(\pi(\mu, \tau| Y)\)`.

--

- `\(\Rightarrow \ \mu^{(1)}\)` can be considered a sample from the marginal distribution of `\(\mu\)`, which again means we can use it to sample
.block[
.small[
`$$\tau^{(2)} \sim p(\tau | \mu^{(1)}, Y),$$`
]
]

and so forth.

---

## Gibbs sampling

- So, we can use two **full conditional distributions** to generate samples from the **joint distribution**, once we have a starting value `\(\tau^{(1)}\)`.

--

- Formally, this sampling scheme is known as .hlight[Gibbs sampling].

--

  + .hlight[Purpose]: Draw from a joint distribution, say `\(p(\mu, \tau| Y)\)`.

--

  + .hlight[Method]: Iterative conditional sampling
      - Draw `\(\tau^{(1)} \sim p(\tau | \mu^{(0)}, Y)\)`
      - Draw `\(\mu^{(1)} \sim p(\mu | \tau^{(1)}, Y)\)`

--

  + .hlight[Requirement]: Full conditional distributions have known forms, so that sampling from them is fairly easy.

--

- More generally, we can use this method to generate samples of `\(\theta = (\theta_1, \ldots, \theta_p)\)`, the vector of `\(p\)` parameters of interest, from the joint density.

---

## Gibbs sampling

- .hlight[Procedure]:
  + Start with initial value `\(\theta^{(0)} = (\theta_1^{(0)}, \ldots, \theta_p^{(0)})\)`.

--

  + For iterations `\(t = 1, \ldots, T\)`,
      1. Sample `\(\theta_1^{(t)}\)` from the conditional posterior distribution
.block[
.small[
`$$\pi(\theta_1 | \theta_2 = \theta_2^{(t-1)}, \ldots, \theta_p = \theta_p^{(t-1)}, Y)$$`
]
]
      2. Sample `\(\theta_2^{(t)}\)` from the conditional posterior distribution
.block[
.small[
`$$\pi(\theta_2 | \theta_1 = \theta_1^{(t)}, \theta_3 = \theta_3^{(t-1)}, \ldots, \theta_p = \theta_p^{(t-1)}, Y)$$`
]
]
      3. Similarly, sample `\(\theta_3^{(t)}, \ldots, \theta_p^{(t)}\)` from the conditional posterior distributions given current values of other parameters.

--

  + This generates a **dependent** sequence of parameter values.
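---

## Gibbs sampling

- Putting the procedure together for our normal model: below is a minimal sketch (not from the original slides) of a full Gibbs sampler built from the two full conditionals derived earlier. The data, hyperparameter values, starting values, and number of iterations are all made up for illustration.

```r
# Hedged sketch: Gibbs sampler for (mu, tau) under independent
# N(mu0, sigma02) and Gamma(nu0/2, nu0/(2*tau0)) priors.
set.seed(360)
y <- rnorm(20, mean = 5, sd = 2)   # hypothetical data
n <- length(y); ybar <- mean(y)
mu0 <- 0; sigma02 <- 100; nu0 <- 1; tau0 <- 1   # made-up hyperparameters

S <- 5000
draws <- matrix(NA, nrow = S, ncol = 2,
                dimnames = list(NULL, c("mu", "tau")))
mu <- ybar; tau <- 1/var(y)        # starting values

for (s in 1:S) {
  # update tau given the current mu
  tau <- rgamma(1, shape = (nu0 + n)/2,
                rate = (nu0/tau0 + sum((y - mu)^2))/2)
  # update mu given the new tau
  tau_n <- 1/sigma02 + n*tau
  mu <- rnorm(1, (mu0/sigma02 + n*tau*ybar)/tau_n, sqrt(1/tau_n))
  draws[s, ] <- c(mu, tau)
}

colMeans(draws)                    # posterior summaries from the dependent draws
apply(draws, 2, quantile, c(0.025, 0.975))
```

- Note that the `rate` argument above is just `\(\nu_n\sigma_n^2(\mu)/2\)` from the previous slides, written out explicitly.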
---

## MCMC

- Gibbs sampling is one of several flavors of .hlight[Markov chain Monte Carlo (MCMC)].

--

  + .hlight[Markov chain]: a stochastic process in which future states are independent of past states conditional on the present state.

  + .hlight[Monte Carlo]: simulation.

--

- MCMC provides an approach for generating samples from posterior distributions.

--

- From these samples, we can obtain summaries (including summaries of functions) of the posterior distribution for `\(\theta\)`, our parameter of interest.

---

## How does MCMC work?

- Let `\(\theta^{(t)} = (\theta_1^{(t)}, \ldots, \theta_p^{(t)})\)` denote the value of the `\(p \times 1\)` vector of parameters at iteration `\(t\)`.

--

- Let `\(\theta^{(0)}\)` be an initial value used to start the chain (_inference should not be sensitive to this choice_).

--

- MCMC generates `\(\theta^{(t)}\)` from a distribution that depends on the data and potentially on `\(\theta^{(t-1)}\)`, but not on `\(\theta^{(1)}, \ldots, \theta^{(t-2)}\)`.

--

- This results in a Markov chain with **stationary distribution** `\(\pi(\theta | Y)\)` under some conditions on the sampling distribution.

--

- The theory of Markov chains (structure, convergence, reversibility, detailed balance, stationarity, etc.) is well beyond the scope of this course, so we will not dive into it.

--

- If you are interested, consider taking STA 531/831 or courses on stochastic processes.

---

## Properties

- **Note**: Our Markov chain is a collection of draws of `\(\theta\)` that are (slightly, we hope!) dependent on the previous draw.

--

- The chain will wander around our parameter space, only remembering where it was in the last draw.

--

- We want our MCMC sample size, `\(T\)`, to be big enough so that we can
  + Move out of areas of low probability into regions of high probability (convergence)
  + Move between high probability regions (good mixing)
  + Know our Markov chain is stationary in time (the distribution of samples is the same for all samples, regardless of location in the chain)

--

- At the start of the sampling, the samples are **not** from the posterior distribution. It is necessary to discard the initial samples as a .hlight[burn-in] to allow convergence. We'll talk more about that in the next class.

---

## Different flavors of MCMC

- The most commonly used MCMC algorithms are:
  + Metropolis sampling (Metropolis et al., 1953).
  + Metropolis-Hastings (MH) (Hastings, 1970).
  + Gibbs sampling (Geman & Geman, 1984; Gelfand & Smith, 1990).

- Overview of Gibbs - Casella & George (1992, The American Statistician, 46, 167-174).

--

- Overview of the first two (Metropolis and MH) - Chib & Greenberg (1995, The American Statistician).

--

- We will get to Metropolis and Metropolis-Hastings later in the course.

---

## Example: bivariate normal

- Consider
.block[
.small[
`\begin{eqnarray*}
\begin{pmatrix}\theta_1\\ \theta_2 \end{pmatrix} & \sim & \mathcal{N}\left[\left(\begin{array}{c} 0\\ 0 \end{array}\right),\left(\begin{array}{cc} 1 & \rho \\ \rho & 1 \end{array}\right)\right]\\
\end{eqnarray*}`
]
]

where `\(\rho\)` is known (and is the correlation between `\(\theta_1\)` and `\(\theta_2\)`).

--

- We will review details of the multivariate normal distribution very soon, but for now, let's use this example to explore Gibbs sampling.

--

- For this density, it turns out that we have
.block[
.small[
`$$\theta_1|\theta_2 \sim \mathcal{N}\left(\rho\theta_2, 1-\rho^2 \right)$$`
]
]

and
.block[
.small[
`$$\theta_2|\theta_1 \sim \mathcal{N}\left(\rho\theta_1, 1-\rho^2 \right)$$`
]
]

--

- While we can easily sample directly from this distribution (using the `mvtnorm` or `MASS` packages in R), let's instead use the Gibbs sampler to draw samples from it.

---

## Bivariate normal

First, a few examples of the bivariate normal distribution.
.midsmall[
`\begin{eqnarray*}
\begin{pmatrix}\theta_1\\ \theta_2 \end{pmatrix} & \sim & \mathcal{N}\left[\left(\begin{array}{c} 0\\ 0 \end{array}\right),\left(\begin{array}{cc} 1 & 0 \\ 0 & 1 \end{array}\right)\right]\\
\end{eqnarray*}`
]

<img src="08-Gibbs-sampling-I_files/figure-html/unnamed-chunk-2-1.png" style="display: block; margin: auto;" />

---

## Bivariate normal

.midsmall[
`\begin{eqnarray*}
\begin{pmatrix}\theta_1\\ \theta_2 \end{pmatrix} & \sim & \mathcal{N}\left[\left(\begin{array}{c} 0\\ 0 \end{array}\right),\left(\begin{array}{cc} 1 & 0 \\ 0 & 1 \end{array}\right)\right]\\
\end{eqnarray*}`
]

<img src="08-Gibbs-sampling-I_files/figure-html/unnamed-chunk-3-1.png" style="display: block; margin: auto;" />

---

## Bivariate normal

.midsmall[
`\begin{eqnarray*}
\begin{pmatrix}\theta_1\\ \theta_2 \end{pmatrix} & \sim & \mathcal{N}\left[\left(\begin{array}{c} 0\\ 2 \end{array}\right),\left(\begin{array}{cc} 1 & 0.5 \\ 0.5 & 2 \end{array}\right)\right]\\
\end{eqnarray*}`
]

<img src="08-Gibbs-sampling-I_files/figure-html/unnamed-chunk-4-1.png" style="display: block; margin: auto;" />

---

## Bivariate normal

.midsmall[
`\begin{eqnarray*}
\begin{pmatrix}\theta_1\\ \theta_2 \end{pmatrix} & \sim & \mathcal{N}\left[\left(\begin{array}{c} 0\\ 2 \end{array}\right),\left(\begin{array}{cc} 1 & 0.5 \\ 0.5 & 2 \end{array}\right)\right]\\
\end{eqnarray*}`
]

<img src="08-Gibbs-sampling-I_files/figure-html/unnamed-chunk-5-1.png" style="display: block; margin: auto;" />

---

## Bivariate normal

.midsmall[
`\begin{eqnarray*}
\begin{pmatrix}\theta_1\\ \theta_2 \end{pmatrix} & \sim & \mathcal{N}\left[\left(\begin{array}{c} 1\\ -1 \end{array}\right),\left(\begin{array}{cc} 1 & 0.9 \\ 0.9 & 0.5 \end{array}\right)\right]\\
\end{eqnarray*}`
]

<img src="08-Gibbs-sampling-I_files/figure-html/unnamed-chunk-6-1.png" style="display: block; margin: auto;" />

---

## Bivariate normal

.midsmall[
`\begin{eqnarray*}
\begin{pmatrix}\theta_1\\ \theta_2 \end{pmatrix} & \sim & \mathcal{N}\left[\left(\begin{array}{c} 1\\ -1 \end{array}\right),\left(\begin{array}{cc} 1 & 0.9 \\ 0.9 & 0.5 \end{array}\right)\right]\\
\end{eqnarray*}`
]

<img src="08-Gibbs-sampling-I_files/figure-html/unnamed-chunk-7-1.png" style="display: block; margin: auto;" />

---

## Back to the example

- Again, we have
.block[
.small[
`$$\theta_1|\theta_2 \sim \mathcal{N}\left(\rho\theta_2, 1-\rho^2 \right); \ \ \ \ \theta_2|\theta_1 \sim \mathcal{N}\left(\rho\theta_1, 1-\rho^2 \right)$$`
]
]

- Here's some code to do Gibbs sampling using those full conditionals:

```r
rho <- #set correlation
S <- #set number of MCMC samples
thetamat <- matrix(0,nrow=S,ncol=2)
theta <- c(10,10) #initialize values of theta
for (s in 1:S) {
  theta[1] <- rnorm(1,rho*theta[2],sqrt(1-rho^2)) #sample theta1
  theta[2] <- rnorm(1,rho*theta[1],sqrt(1-rho^2)) #sample theta2
  thetamat[s,] <- theta
}
```

- Here's some code to sample directly instead:

```r
library(mvtnorm)
rho <- #set correlation; no need to set again once you've used previous code
S <- #set number of MCMC samples; no need to set again once you've used previous code
Mu <- c(0,0)
Sigma <- matrix(c(1,rho,rho,1),ncol=2)
thetamat_direct <- rmvnorm(S, mean = Mu, sigma = Sigma)
```
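---

## Comparing the two sets of samples

- Before the exercise, here is one possible way (a sketch, not part of the assigned exercise code) to compare the Gibbs draws with the direct draws, assuming `thetamat` and `thetamat_direct` from the previous slide with `rho` and `S` already set.

```r
# Quick numerical comparison: sample means and correlation under each method.
# The true values are means (0, 0) and correlation rho.
rbind(gibbs  = c(colMeans(thetamat), cor(thetamat)[1, 2]),
      direct = c(colMeans(thetamat_direct), cor(thetamat_direct)[1, 2]))

# Trace plots of theta1: the Gibbs draws are dependent and start at (10, 10),
# while the direct draws are independent.
par(mfrow = c(2, 1))
plot(thetamat[, 1], type = "l", ylab = expression(theta[1]), main = "Gibbs")
plot(thetamat_direct[, 1], type = "l", ylab = expression(theta[1]), main = "Direct")
```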
---

## Participation exercise

- You will work in groups of three. Work with the three students closest to you.

- For `\(S \in \{50,250,500\}\)` and `\(\rho \in \{0.1,0.5,0.95\}\)`, do the following:
  1. Generate `\(S\)` samples using the two methods.
  2. Make a scatter plot of the samples from each method (plot the samples from the Gibbs sampler first) and compare them.

<!-- 3. Overlay the samples from the Gibbs sampler alone on the true bivariate density. **See code on next slide!** -->

- How do the results differ between the two methods for the different combinations of `\(S\)` and `\(\rho\)`?

- Discuss within your teams, document your team's findings, and submit.

- You can have one person document the findings, but make sure to write the names of all three members at the top of the sheet.

---

## More code

See how the chain actually evolves with an overlay on the true density:

```r
library(mvtnorm) #for dmvnorm; already loaded if you ran the previous code
rho <- #set correlation
Sigma <- matrix(c(1,rho,rho,1),ncol=2); Mu <- c(0,0)
x.points <- seq(-3,3,length.out=100)
y.points <- x.points
z <- matrix(0,nrow=100,ncol=100)
for (i in 1:100) {
  for (j in 1:100) {
    z[i,j] <- dmvnorm(c(x.points[i],y.points[j]),mean=Mu,sigma=Sigma)
  }
}
contour(x.points,y.points,z,xlim=c(-3,10),ylim=c(-3,10),col="orange2",
        xlab=expression(theta[1]),ylab=expression(theta[2]))
S <- #set number of MCMC samples
thetamat <- matrix(0,nrow=S,ncol=2)
theta <- c(10,10)
points(x=theta[1],y=theta[2],col="black",pch=2)
for (s in 1:S) {
  theta[1] <- rnorm(1,rho*theta[2],sqrt(1-rho^2))
  theta[2] <- rnorm(1,rho*theta[1],sqrt(1-rho^2))
  thetamat[s,] <- theta
  if(s < 20){
    points(x=theta[1],y=theta[2],col="red4",pch=16); Sys.sleep(1)
  } else {
    points(x=theta[1],y=theta[2],col="green4",pch=16); Sys.sleep(0.1)
  }
}
```
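---

## More code

- One more optional sketch (not from the original slides): once you have run the Gibbs sampler above with `rho` and `S` set, you can check how dependent successive draws are, and compare the draws of `\(\theta_1\)` with its true `\(\mathcal{N}(0,1)\)` marginal. We'll look at diagnostics like these more carefully in the next class.

```r
# Assumes thetamat from the Gibbs sampler above (rho and S already set).
# Autocorrelation of the theta1 draws: larger rho gives stronger dependence.
acf(thetamat[, 1], main = expression("ACF of Gibbs draws of" ~ theta[1]))

# Compare the Gibbs draws of theta1 with the true N(0, 1) marginal.
hist(thetamat[, 1], freq = FALSE, breaks = 30,
     xlab = expression(theta[1]), main = "")
curve(dnorm(x), add = TRUE, col = "orange2", lwd = 2)
```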