class: center, middle, inverse, title-slide

# Poisson model (wrap-up); Monte Carlo approximation and sampling

### Dr. Olanrewaju Michael Akande

### Jan 24, 2020

---

## Announcements

- HW2 now online, due next Thursday.

  + There should be six questions.

  + If you only see five questions, refresh your browser.

---

## Outline

- Poisson model (wrap-up)

  + Example

  + Posterior prediction

  + Other parameterizations

- Finding conjugate distributions

- Monte Carlo approximation

- Sampling methods

  + Simple accept/reject

  + Importance sampling

---

class: center, middle

# Poisson model (wrap-up)

---

## Poisson-Gamma recap

Poisson data:
.block[
.small[
`$$f(y_i; \theta): y_1,\ldots,y_n \overset{iid}{\sim} \textrm{Po}(\theta)$$`
]
]

--

`\(+\)` Gamma prior:
.block[
.small[
`$$\pi(\theta) = \frac{b^a}{\Gamma(a)} \theta^{a-1}e^{-b\theta} = \textrm{Ga}(a,b)$$`
]
]

--

`\(\Rightarrow\)` Gamma posterior:
.block[
.small[
`$$\pi(\theta | \{y_i\}): \theta | \{y_i\} \sim \textrm{Ga}(a+\sum y_i,b+n).$$`
]
]

--

- Recall: for `\(\textrm{Gamma}(a,b)\)`,

  + `\(\mathbb{E}[\theta] = \dfrac{a}{b}\)`

  + `\(\mathbb{V}[\theta] = \dfrac{a}{b^2}\)`

  + `\(\textrm{Mode}[\theta] = \dfrac{a-1}{b} \ \ \textrm{for} \ \ a \geq 1\)`

---

## Hoff example: birth rates

- Survey data on educational attainment and number of children of 155 forty-year-old women during the 1990s.

--

- These women were in their 20s during the 1970s, a period of historically low fertility rates in the US.

--

- **Goal**: compare the birth rate `\(\theta_1\)` for women without bachelor's degrees to the rate `\(\theta_2\)` for women with bachelor's degrees.

--

- **Data**:

  + 111 women without a bachelor's degree had 217 children: `\((\bar{y}_1 = 1.95)\)`

  + 44 women with bachelor's degrees had 66 children: `\((\bar{y}_2 = 1.50)\)`

--

- Based on the data alone, it looks like `\(\theta_1\)` should be greater than `\(\theta_2\)`. But...how sure are we?

--

- **Priors**: `\(\theta_1, \theta_2 \sim \textrm{Ga}(2,1)\)` (not much prior information; equivalent to 1 prior woman with 2 children). Posterior means will be close to the MLEs.

---

## Hoff example: birth rates

- Then,

  + `\(\theta_1 | \{n_1=111, \sum y_{i,1}=217\} \sim \textrm{Ga}(2+217,1+111) = \textrm{Ga}(219,112).\)`

  + `\(\theta_2 | \{n_2=44, \sum y_{i,2}=66\} \sim \textrm{Ga}(2+66,1+44) = \textrm{Ga}(68,45).\)`

--

- Use R to calculate posterior means and 95% CIs for `\(\theta_1\)` and `\(\theta_2\)`.

```r
a=2; b=1; #prior
n1=111; sumy1=217; n2=44; sumy2=66 #data
(a+sumy1)/(b+n1); (a+sumy2)/(b+n2); #post means
qgamma(c(0.025, 0.975),a+sumy1,b+n1) #95% ci 1
qgamma(c(0.025, 0.975),a+sumy2,b+n2) #95% ci 2
```

--

- Posterior means: `\(\mathbb{E}[\theta_1 | \{y_{i,1}\}] = 1.955\)` and `\(\mathbb{E}[\theta_2 | \{y_{i,2}\}] = 1.511\)`.

--

- 95% credible intervals

  + `\(\theta_1\)`: [1.71, 2.22].

  + `\(\theta_2\)`: [1.17, 1.89].

---

## Hoff example: birth rates

Prior and posteriors:

<img src="05-Monte-Carlo_files/figure-html/unnamed-chunk-3-1.png" style="display: block; margin: auto;" />

---

## Hoff example: birth rates

- Posteriors indicate considerable evidence that birth rates are higher among women without bachelor's degrees.

--

- Confirms what we observed.

--

- Using sampling we can quickly calculate `\(\Pr(\theta_1 > \theta_2 | \textrm{data})\)`.

```r
mean(rgamma(10000,219,112)>rgamma(10000,68,45))
```

  We have `\(\Pr(\theta_1 > \theta_2 | \textrm{data}) = 0.97\)`.

--

- Why/how does it work?

--

- .hlight[Monte Carlo approximation] coming soon!

--

- Clearly, that probability will change with different priors.
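---

## Hoff example: birth rates

- As a quick illustration of both points, the sketch below re-runs the same comparison under the original `\(\textrm{Ga}(2,1)\)` prior and under a hypothetical, more informative `\(\textrm{Ga}(20,10)\)` prior (also centered at 2 children). The seed, the Monte Carlo sample size, and the helper `post_prob` are not from the slides and are only for illustration.

```r
# Sensitivity of Pr(theta1 > theta2 | data) to the choice of prior
set.seed(123)
m <- 100000
n1 <- 111; sumy1 <- 217; n2 <- 44; sumy2 <- 66 #data

post_prob <- function(a, b) {
  theta1 <- rgamma(m, a + sumy1, b + n1) #draws from posterior of theta1
  theta2 <- rgamma(m, a + sumy2, b + n2) #draws from posterior of theta2
  mean(theta1 > theta2)                  #Monte Carlo estimate
}

post_prob(2, 1)   #original Ga(2,1) prior: roughly 0.97
post_prob(20, 10) #hypothetical stronger prior centered at 2: somewhat smaller
```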
---

## Posterior predictive distribution

- <div class="question">
What is the posterior predictive distribution for the Poisson-gamma model?
</div>

--

- Let `\(a_n = a+\sum y_i\)` and `\(b_n = b+n\)`.

--

- We have
.block[
.small[
$$
`\begin{aligned}
f(y_{n+1}|y_{1:n}) &= \int f(y_{n+1}|\theta) \pi(\theta|y_{1:n})\,d\theta \\
&= \int \textrm{Po}(y_{n+1}; \theta) \textrm{Ga}(\theta; a_n,b_n)\,d\theta \\
&= ... \\
&= ... \\
&= \dfrac{\Gamma(a_n + y_{n+1})}{\Gamma(a_n)\Gamma(y_{n+1}+1)} \ \left(\dfrac{b_n}{b_n + 1} \right)^{a_n} \ \left(\dfrac{1}{b_n + 1} \right)^{y_{n+1}}
\end{aligned}`
$$
]
]

  which is the .hlight[negative binomial distribution], `\(\textrm{Neg-binomial}\left(a_n,\dfrac{1}{b_n + 1}\right)\)`.

--

- The .hlight[prior predictive distribution] `\(f(y_{n+1}; a,b)\)` takes a similar form.

---

## Negative binomial distribution

- Originally derived as the number of successes in a sequence of independent `\(\textrm{Bernoulli}(p)\)` trials before `\(r\)` failures occur.

--

- The negative binomial distribution `\(\textrm{Neg-binomial}\left(r,p\right)\)` is parameterized by `\(r\)` and `\(p\)` and the pmf is given by
.block[
.small[
`$$\Pr[Y = y | r, p] = {y + r - 1 \choose y} (1-p)^rp^y; \ \ \ \ y=0,1,2,\ldots; \ \ \ p \in [0,1].$$`
]
]

- Starting with this, the distribution can be extended to allow `\(r \in (0,\infty)\)` as
.block[
.small[
`$$\Pr[Y = y | r, p] = \dfrac{\Gamma(y+r)}{\Gamma(y+1)\Gamma(r)} (1-p)^rp^y; \ \ \ \ y=0,1,2,\ldots; \ \ \ p \in [0,1].$$`
]
]

--

- Some properties:

  + `\(\mathbb{E}[Y] = \dfrac{pr}{1-p}\)`

  + `\(\mathbb{V}[Y] = \dfrac{pr}{(1-p)^2}\)`

---

## Posterior predictive distribution

- The negative binomial distribution is an over-dispersed generalization of the Poisson.

--

- <div class="question">
What does over-dispersion mean?
</div>

--

- Marginalizing `\(\theta\)` out of the Poisson likelihood over a gamma distribution yields a negative binomial.

--

- For `\((y_{n+1}|y_{1:n}) \sim \textrm{Neg-binomial}\left(a_n,\dfrac{1}{b_n + 1}\right)\)`, we have

  + `\(\mathbb{E}[y_{n+1}|y_{1:n}] = \dfrac{a_n}{b_n} = \mathbb{E}[\theta|y_{1:n}] =\)` posterior mean, and

  + `\(\mathbb{V}[y_{n+1}|y_{1:n}] = \dfrac{a_n(b_n+1)}{b_n^2} = \mathbb{E}[\theta|y_{1:n}] \left(\dfrac{b_n+1}{b_n}\right)\)`, so that the variance is larger than the mean by an amount determined by `\(b_n\)`, which takes the over-dispersion into account.

---

## Predictive uncertainty

- Note that as the sample size `\(n\)` increases, the posterior density for `\(\theta\)` becomes more and more concentrated.
.block[
.small[
`$$\mathbb{V}[\theta | y_{1:n}] = \dfrac{a_n}{b_n^2} = \dfrac{a + \sum_i y_i}{(b + n)^2} \approx \dfrac{\bar{y}}{n} \rightarrow 0.$$`
]
]

--

- Also, recall that `\(\mathbb{V}[y_{n+1}|y_{1:n}] = \mathbb{E}[\theta|y_{1:n}] \left(\dfrac{b_n+1}{b_n}\right)\)`.

--

- As we have less uncertainty about `\(\theta\)`, the inflation factor
.block[
.small[
`$$\dfrac{b_n+1}{b_n} = \dfrac{b + n+1}{b + n} \rightarrow 1$$`
]
]

  and the predictive density `\(f(y_{n+1}|y_{1:n}) \rightarrow \textrm{Po}(\bar{y})\)`.

--

- Of course, in smaller samples, it is important to inflate our predictive intervals to account for uncertainty in `\(\theta\)`.

---

## Back to birth rates

- Let's compare the posterior predictive distributions for the two groups of women.

<img src="05-Monte-Carlo_files/figure-html/unnamed-chunk-5-1.png" style="display: block; margin: auto;" />
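---

## Back to birth rates

- A small sketch of how these two posterior predictive pmfs can be computed directly, using `dnbinom` with R's mean parameterization (the grid `0:8` of counts is an arbitrary choice for display):

```r
a=2; b=1 #prior
n1=111; sumy1=217; n2=44; sumy2=66 #data
y <- 0:8
p1 <- dnbinom(y, size = a + sumy1, mu = (a + sumy1)/(b + n1)) #without degree
p2 <- dnbinom(y, size = a + sumy2, mu = (a + sumy2)/(b + n2)) #with degree
round(rbind(p1, p2), 3) #posterior predictive probabilities for y = 0,...,8
```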
---

## Poisson model in terms of rate

- In many applications, it is often convenient to parameterize the Poisson model a bit differently. One option takes the form
.block[
.small[
`$$y_i \sim \textrm{Po}(x_i\theta); \ \ \ i=1,\ldots, n.$$`
]
]

  where `\(x_i\)` represents an explanatory variable and `\(\theta\)` is once again the population parameter of interest. The model is not exchangeable in the `\(y_i\)`'s but is exchangeable in the pairs `\((x,y)_i\)`.

--

- In epidemiology, `\(\theta\)` is often called the population "rate" and `\(x_i\)` is called the "exposure" of unit `\(i\)`.

--

- When dealing with mortality rates in different counties for example, `\(x_i\)` can be the population `\(n_i\)` in county `\(i\)`, with `\(\theta =\)` the overall mortality rate.

--

- The gamma distribution is still conjugate for `\(\theta\)`, with the resulting posterior taking the form
.block[
.small[
`$$\pi(\theta | \{x_i, y_i\}): \theta | \{x_i, y_i\} \sim \textrm{Ga}(a+\sum_i y_i,b+\sum_i x_i).$$`
]
]

---

## BDA example: asthma mortality rate

- Consider an example on estimating asthma mortality rates for cities in the US.

--

- Since actual mortality rates can be small on the raw scale, they are commonly reported per 100,000 or even per one million.

--

- To keep it simple, let's use "per 100,000" for this example.

--

- For inference, ideally, we collect data that count the number of asthma-related deaths per city.

--

- Note that inference is by city here, so city indexes observations in the sample.

--

- Since we basically have count data, a Poisson model would be reasonable here.

---

## Asthma mortality rate

- Since each city would be expected to have a different population, we might consider the sampling model:
.block[
.small[
`$$y_i \sim \textrm{Po}(x_i\theta); \ \ \ i=1,\ldots, n.$$`
]
]

  where

  + `\(x_i\)` is the "exposure" for city `\(i\)`, that is, the population of city `\(i\)` is `\(x_i \times 100,000\)`; and

  + `\(\theta\)` is the unknown "true" city mortality rate per 100,000.

- Suppose

  + we pick a city in the US with a population of 200,000;

--

  + we find that 3 people died of asthma, i.e., 1.5 cases per 100,000.

--

- Thus, we have one single observation with `\(x_i = 2\)` and `\(y_i = 3\)` for this city.

---

## Asthma mortality rate

- Next, we need to specify a prior. What is a sensible prior here?

--

- Perhaps we should look at mortality rates around the world or in similar countries.

--

- Suppose reviews of asthma mortality rates around the world suggest rates above 1.5 per 100,000 are very rare in Western countries, with typical rates around 0.6 per 100,000.

--

- Let's try a gamma distribution with `\(\mathbb{E}[\theta] = 0.6\)` and `\(\Pr[\theta \geq 1.5]\)` very low!

--

- A few options here, but let's go with `\(\textrm{Ga}(3,5)\)`, which has `\(\mathbb{E}[\theta] = 0.6\)` and `\(\Pr[\theta \geq 1.5] \approx 0.02\)`.

--

- Using trial and error, explore more options in R!

---

## Asthma mortality rate

- Therefore, our posterior takes the form
.block[
.small[
`$$\pi(\theta | \{x_i, y_i\}): \theta | \{x_i, y_i\} \sim \textrm{Ga}(a+\sum_i y_i,b+\sum_i x_i)$$`
]
]

  which is actually
.block[
.small[
`$$\pi(\theta | x,y) = \textrm{Ga}(a+y,b+x) = \textrm{Ga}(3+3,5+2) = \textrm{Ga}(6,7).$$`
]
]

--

- `\(\mathbb{E}[\theta | x,y] = 6/7=0.86\)`, so we expect fewer than 1 asthma-related death (0.86 to be exact) per 100,000 people in this city.

--

- In fact, the posterior probability that the long-term death rate from asthma in this city is more than 1 per 100,000, `\(\Pr[\theta > 1| x,y]\)`, is 0.3.

--

- Also, `\(\Pr[\theta \leq 2 | x,y] = 0.99\)`, so there is very little chance that the long-term rate exceeds 2 deaths per 100,000 people in this city.

--

- Use `pgamma` in R to compute the probabilities (see the sketch on the next slide).
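---

## Asthma mortality rate

- A minimal `pgamma` check of these two probabilities (a sketch; the object names are mine, the numbers are those from the previous slides):

```r
a <- 3; b <- 5 #Ga(3,5) prior
x <- 2; y <- 3 #exposure (in 100,000s) and observed deaths
1 - pgamma(1, a + y, b + x) #Pr[theta > 1 | x,y], about 0.30
pgamma(2, a + y, b + x)     #Pr[theta <= 2 | x,y], about 0.99
```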
---

## Prior vs posterior

<img src="05-Monte-Carlo_files/figure-html/unnamed-chunk-6-1.png" style="display: block; margin: auto;" />

The posterior is to the right of the prior, since the data suggest higher mortality rates than the prior does. However, we only have one data point!

---

class: center, middle

# Finding conjugate distributions

---

## Finding conjugate distributions

- In the conjugate examples we have looked at so far, how did we know the prior distributions we chose would result in conjugacy?

--

- <div class="question">
Can we figure out the family of distributions that would be conjugate for arbitrary densities?
</div>

--

- Let's explore this using the .hlight[exponential distribution]. The exponential distribution is often used to model "waiting times" or other positive random variables (with support `\((0,\infty)\)`), typically measured on a time scale.

--

- If `\(y \sim \textrm{Exp}(\theta)\)`, we have the pdf
.block[
.small[
`$$f(y) = \theta e^{-y\theta}; \ \ \ y > 0,$$`
]
]

  where `\(\theta\)` is the .hlight[rate parameter], and `\(\mathbb{E}[y] = 1/\theta\)`.

--

- Recall, if `\(Y \sim \textrm{Ga}(1,\theta)\)`, then `\(Y \sim \textrm{Exp}(\theta)\)`. What is `\(\mathbb{V}[y]\)` then?

--

- Let's figure out what the conjugate prior for this density would look like (to be done in class).

---

class: center, middle

# Monte Carlo approximation

---

## Monte Carlo approximation

- Monte Carlo integration is central to Bayesian computation and to simulation-based methods in general.

--

- While we will focus on using Monte Carlo integration for Bayesian inference, the development is general and applies to any pdf/pmf `\(p(\theta)\)`.

--

- For our purposes, we will want to evaluate expectations of the form
.block[
.small[
`$$H = \int h(\theta) p(\theta) d\theta,$$`
]
]

  for many different functions `\(h(.)\)` (usually scalar for us).

--

- Procedure:

  1. Generate a random sample `\(\theta_1, \ldots, \theta_m \overset{iid}{\sim} p(\theta)\)`.

--

  2. Estimate `\(H\)` using
.block[
.small[
`$$\bar{h} = \dfrac{1}{m} \sum_{i=1}^m h(\theta_i).$$`
]
]

---

## Monte Carlo approximation

- We have `\(\mathbb{E}[h(\theta_i)] = H\)`.

--

- Assuming `\(\mathbb{E}[h^2(\theta_i)] < \infty\)`, so that the variance of each `\(h(\theta_i)\)` is finite, we have

  1. .hlight[LLN]: `\(\bar{h} \overset{a.s.}{\rightarrow} H\)`.

--

  2. .hlight[CLT]: `\(\bar{h} - H\)` is asymptotically normal, with asymptotic variance
.block[
.small[
`$$\dfrac{1}{m} \int (h(\theta)-H)^2 p(\theta) d\theta,$$`
]
]

  which can be approximated by
.block[
.small[
`$$v_m = \dfrac{1}{m^2} \sum_{i=1}^m (h(\theta_i)-\bar{h})^2.$$`
]
]

--

- `\(\sqrt{v_m}\)` is often called the .hlight[Monte Carlo standard error].

---

## Monte Carlo approximation

- That is, generally, taking large Monte Carlo sample sizes `\(m\)` (in the thousands or tens of thousands) can yield very precise, and cheaply computed, numerical approximations to mathematically difficult integrals.

--

- .hlight[What this means for us]: we can approximate just about any aspect of the posterior distribution with a large enough Monte Carlo sample; see the sketch on the next slide.

--

- For samples `\(\theta_1, \ldots, \theta_m\)` drawn iid from `\(p(\theta|y)\)`, as `\(m \rightarrow \infty\)`, we have

--

  + `\(\bar{\theta} = \dfrac{1}{m} \sum\limits_{i=1}^m \theta_i \rightarrow \mathbb{E}[\theta | y]\)`

--

  + `\(\hat{\sigma}^2_{\theta} = \dfrac{1}{m-1} \sum\limits_{i=1}^m (\theta_i - \bar{\theta})^2 \rightarrow \mathbb{V}[\theta | y]\)`

--

  + `\(\dfrac{1}{m} \sum\limits_{i=1}^m \mathbb{1}[\theta_i \leq c] = \dfrac{\# \theta_i \leq c}{m} \rightarrow \Pr[\theta \leq c| y]\)`

--

  + `\(\left[\dfrac{\alpha}{2} \textrm{ quantile of } (\theta_1, \ldots, \theta_m), \ \left(1-\dfrac{\alpha}{2}\right) \textrm{ quantile of } (\theta_1, \ldots, \theta_m)\right]\)` `\(\rightarrow\)` `\(100 \times (1-\alpha)\%\)` quantile-based credible interval.
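---

## Monte Carlo approximation

- A small sketch of these approximations in R, using the `\(\textrm{Ga}(219,112)\)` posterior for `\(\theta_1\)` from the birth-rate example (the seed, the Monte Carlo sample size, and the cutoff 1.8 are arbitrary choices for illustration):

```r
set.seed(1)
m <- 100000
theta <- rgamma(m, 219, 112) #posterior draws for theta1

mean(theta)                      #approximates E[theta1 | y] = 219/112
var(theta)                       #approximates V[theta1 | y] = 219/112^2
mean(theta <= 1.8)               #approximates Pr[theta1 <= 1.8 | y]
quantile(theta, c(0.025, 0.975)) #approximates the 95% CI [1.71, 2.22]
```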
---

## Back to birth rates

- Suppose we randomly sample two women, one with a bachelor's degree and one without. To what extent do we expect the one without the degree to have more kids than the other, i.e., `\(\tilde{y}_1 > \tilde{y}_2\)`?

--

- Using R,

```r
set.seed(01222020)
a=2; b=1; #prior
n1=111; sumy1=217; n2=44; sumy2=66 #data
mean(rnbinom(100000,size=(a+sumy1),mu=(a+sumy1)/(b+n1)) >
       rnbinom(100000,size=(a+sumy2),mu=(a+sumy2)/(b+n2)))
```

```
## [1] 0.48218
```

```r
mean(rnbinom(100000,size=(a+sumy1),mu=(a+sumy1)/(b+n1)) ==
       rnbinom(100000,size=(a+sumy2),mu=(a+sumy2)/(b+n2)))
```

```
## [1] 0.21799
```

- That is, `\(\Pr(\tilde{y}_1 > \tilde{y}_2) \approx 0.48\)` and `\(\Pr(\tilde{y}_1 = \tilde{y}_2) \approx 0.22\)`.

--

- Strong evidence of a difference between the two populations does not imply that the difference in predictions is large.

---

## Monte Carlo approximation

- This general idea of using samples to "approximate" averages (expectations) is also useful when trying to approximate posterior predictive distributions.

--

- Quite often, we are able to sample from `\(f(y_i; \theta)\)` and `\(\pi(\theta | \{y_i\})\)` but not from `\(f(y_{n+1}|y_{1:n})\)` directly.

--

- We can do so indirectly using the following Monte Carlo procedure:
.block[
.small[
$$
`\begin{split}
\textrm{sample } \theta^{(1)} & \sim \pi(\theta | \{y_i\}), \ \textrm{ then sample } y_{n+1}^{(1)} \sim f(y_{n+1}; \theta^{(1)})\\
\textrm{sample } \theta^{(2)} & \sim \pi(\theta | \{y_i\}), \ \textrm{ then sample } y_{n+1}^{(2)} \sim f(y_{n+1}; \theta^{(2)})\\
& \ \ \vdots \hspace{133pt} \vdots \\
\textrm{sample } \theta^{(m)} & \sim \pi(\theta | \{y_i\}), \ \textrm{ then sample } y_{n+1}^{(m)} \sim f(y_{n+1}; \theta^{(m)}).\\
\end{split}`
$$
]
]

--

- The sequence `\(\{(\theta, y_{n+1})^{(1)}, \ldots, (\theta, y_{n+1})^{(m)}\}\)` constitutes `\(m\)` independent samples from the joint posterior of `\((\theta, Y_{n+1})\)`.

--

- In fact, `\(\{ y_{n+1}^{(1)}, \ldots, y_{n+1}^{(m)}\}\)` are independent draws from the posterior predictive distribution we care about.
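---

## Monte Carlo approximation

- A minimal sketch of this procedure for the birth-rate example, as an alternative to the direct `rnbinom` draws used earlier (the seed and Monte Carlo sample size are arbitrary choices):

```r
set.seed(1)
m <- 100000
a <- 2; b <- 1          #prior
n1 <- 111; sumy1 <- 217 #data: women without a bachelor's degree

theta  <- rgamma(m, a + sumy1, b + n1) #theta^(j) from the Ga(219,112) posterior
y_pred <- rpois(m, theta)              #y^(j) ~ Po(theta^(j))

mean(y_pred); var(y_pred) #approximate posterior predictive mean and variance
mean(y_pred == 0)         #approximate Pr(y_{n+1} = 0 | y_{1:n})
```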