class: center, middle, inverse, title-slide

# One parameter models cont'd

### Dr. Olanrewaju Michael Akande

### Jan 17, 2020

---

## Announcements

- Tentative dates for quizzes:
  + Quiz I: Wed, Feb 12
  + Quiz II: Wed, Apr 1

--

- Homework 1 now online, due Thursday, Jan 23.

--

- My office hours are confirmed:
  + Wed 9:00 - 10:00am and Thur 11:45 - 12:45pm

--

- No lab on Monday; MLK day.

--

- Occasional short (timed) "participation" quizzes from next week.
  + Will usually be available on Sakai under "Tests & Quizzes"
  + Must take at the beginning of class if present
  + Must take between 11:45am and 1:00pm if absent

---

## Outline

- Beta-binomial (cont'd)
  + Example
  + Cautionary tale: parameters at the boundary
  + Marginal likelihood
  + Posterior prediction
  + Truncated priors

- Introduction to the Poisson-Gamma model
  + Recap of the distributions
  + Conjugacy

---

class: center, middle

# Beta-binomial cont'd

---

## Beta-binomial recap

Binomial likelihood:
.block[
.small[
`$$L(y; \theta) = {n \choose y} \theta^y(1-\theta)^{n-y}$$`
]
]

--

`\(+\)` Beta prior:
.block[
.small[
`$$\pi(\theta) = \frac{1}{B(a,b)} \theta^{a-1} (1-\theta)^{b-1} = \textrm{Beta}(a,b)$$`
]
]

--

`\(\Rightarrow\)` Beta posterior:
.block[
.small[
`$$\pi(\theta | y) = \frac{1}{B(a+y,b+n-y)} \theta^{a+y-1} (1-\theta)^{b+n-y-1} = \textrm{Beta}(a+y,b+n-y).$$`
]
]

--

- Recall: for `\(\theta \sim \textrm{Beta}(a,b)\)`,
  + `\(\mathbb{E}[\theta] = \dfrac{a}{a+b}\)`
  + `\(\mathbb{V}[\theta] = \dfrac{ab}{(a+b)^2(a+b+1)}\)`

---

## Example: rare events

- .hlight[Step 1]. State the question:
  + What is the prevalence of an infectious disease in a small city?
  + Why? High prevalence means more public health precautions are recommended.

--

- .hlight[Step 2]. Collect the data:
  + Suppose you collect a small random sample of 20 individuals.

--

- .hlight[Step 3]. Explore the data:
  + Count the number of infected individuals in the sample.
  + Let `\(Y\)` be the corresponding random variable.

---

## Example: rare events

- .hlight[Step 4]. Formulate and state a modeling framework:
  + Parameter of interest: `\(\theta\)` is the fraction of infected individuals in the city.
  + Sampling model: a reasonable model for `\(Y\)` can be `\(\textrm{Bin}(20,\theta)\)`.

<img src="img/binomial_histograms.png" width="105" height="370px" style="display: block; margin: auto;" />

---

## Example: rare events

- .hlight[Step 4]. Formulate and state a modeling framework:
  + Prior specification: information from previous studies suggests the infection rate in "comparable cities" ranges from 0.05 to 0.20, with an average of 0.10. So maybe a standard deviation of roughly 0.05?
  + What is a good prior? One whose **expected value** is close to 0.10 and whose **variance** is close to 0.05^2.
  + Possible option: `\(\mbox{Beta}(3.5,31.5)\)` or maybe even `\(\mbox{Beta}(3,32)\)`?

<img src="03-one-parameter-models-II_files/figure-html/unnamed-chunk-3-1.png" style="display: block; margin: auto;" />

---

## Example: rare events

- .hlight[Step 4]. Formulate and state a modeling framework:
  + Under `\(\mbox{Beta}(3,32)\)`, `\(\Pr(\theta < 0.1) \approx 0.67\)`.
  + Posterior distribution for the model: `\(\theta | (Y=y) \sim \textrm{Beta}(a+y,b+n-y)\)`
  + Suppose `\(Y=0\)`. Then, `\(\theta | (Y=0) \sim \textrm{Beta}(3+0,32+20-0) = \textrm{Beta}(3,52)\)`

<img src="03-one-parameter-models-II_files/figure-html/unnamed-chunk-4-1.png" style="display: block; margin: auto;" />
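---

## Example: rare events

- A minimal R sketch for checking these posterior summaries numerically, using the `\(\mbox{Beta}(3,32)\)` prior, `\(n = 20\)` and `\(y = 0\)` from the slides above:

```r
# Posterior for the rare-events example: Beta(a + y, b + n - y)
a <- 3; b <- 32; n <- 20; y <- 0
a_post <- a + y                         # 3
b_post <- b + n - y                     # 52
a_post / (a_post + b_post)              # posterior mean: 3/55, about 0.055
(a_post - 1) / (a_post + b_post - 2)    # posterior mode: 2/53, about 0.04
pbeta(0.1, a_post, b_post)              # Pr(theta < 0.1 | Y = 0), about 0.92
pbeta(0.1, a, b)                        # prior Pr(theta < 0.1), about 0.67
```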
---

## Example: rare events

- .hlight[Step 5]. Check your models:
  + Compare performance of posterior mean and posterior probability that `\(\theta < 0.1\)`. .hlight[See pages 5 to 7 of the Hoff book (the section on sensitivity analysis)].
  + Under the posterior `\(\mbox{Beta}(3,52)\)`,
    - `\(\Pr(\theta < 0.1 | Y=y) \approx 0.92\)`. More confidence in low values of `\(\theta\)`.
    - For `\(\mathbb{E}(\theta | Y=y)\)`, we have
.block[
.small[
$$
`\begin{split}
\mathbb{E}(\theta | y) & = \dfrac{a+y}{a+b+n} = \dfrac{3}{55} \approx 0.055.\\
\end{split}`
$$
]
]
    - Recall that the prior mean is `\(a/(a+b) = 3/35 \approx 0.086\)`. Thus, we can see how the prior mean contributes to the posterior mean:
.block[
.small[
$$
`\begin{split}
\mathbb{E}(\theta | y) & = \dfrac{a+b}{a+b+n} \times \textrm{prior mean} + \dfrac{n}{a+b+n} \times \textrm{sample mean}\\
& = \dfrac{a+b}{a+b+n} \times \dfrac{a}{a+b} + \dfrac{n}{a+b+n} \times \dfrac{y}{n}\\
& = \dfrac{35}{55} \times \dfrac{3}{35} + \dfrac{20}{55} \times \dfrac{0}{20} = \dfrac{3}{55} \approx 0.055.\\
\end{split}`
$$
]
]

---

## Example: rare events

- .hlight[Step 6]. Answer the question:
  + People with low prior expectations are generally at least `\(90\%\)` certain that the infection rate is below 0.10. Again, .hlight[see pages 5 to 7 of the Hoff book].
  + `\(\pi(\theta | Y)\)` is to the left of `\(\pi(\theta)\)` because the observation `\(Y=0\)` provides evidence of a low value of `\(\theta\)`.
  + `\(\pi(\theta | Y)\)` is more peaked than `\(\pi(\theta)\)` because it combines the prior with the data, and so contains more information than `\(\pi(\theta)\)` alone.
  + The posterior expectation is 0.055.
  + The posterior mode is 0.04.
    - Note, for `\(\mbox{Beta}(a,b)\)`, the mode is `\(\dfrac{a-1}{a+b-2}\)`.
  + The posterior probability that `\(\theta < 0.1\)` is 0.92.

---

## Cautionary tale: parameters at the boundary

- Tuyl et al. (2008) discuss potential dangers of using priors that have `\(a < 1\)` with data that are all 0's (or `\(b < 1\)` with all 1's). They consider data on adverse reactions to a new radiological contrast agent.

--

- Let `\(\theta_N\)`: probability of a bad reaction using the new agent.

--

- The current standard agent causes bad reactions about 15 times in 10000, so one might think 0.0015 is a good guess for `\(\theta_N\)`.

--

- How do we choose a prior distribution?

---

## Potential prior distributions

- One might consider a variety of choices centered on `\(15/10000 = 0.0015\)`:
  + Prior 1: .hlight[Beta(1,666)] (mean 0.0015; 1 prior bad reaction in 667 administrations)
  + Prior 2: .hlight[Beta(0.05,33.33)] (mean 0.0015; 0.05 prior bad reactions in 33.38 administrations)
  + Prior 3: .hlight[Beta(1.6,407.4)] (mode 0.0015; 409 prior administrations)
  + Prior 4: .hlight[Beta(1.05,497)] (median 0.0015; 498.05 prior administrations)

--

- Does it matter which one we choose?

---

## Potential prior distributions

<img src="03-one-parameter-models-II_files/figure-html/unnamed-chunk-5-1.png" style="display: block; margin: auto;" />

---

## Potential prior distributions

<img src="03-one-parameter-models-II_files/figure-html/unnamed-chunk-6-1.png" style="display: block; margin: auto;" />

---

## Potential prior distributions

- Let's take a closer look at properties of these four prior distributions, concentrating on the probability that `\(\theta_N < 0.0015\)`.

--

- That is, the probability that the new agent is not more dangerous than the old agent.

<br />

|                      | Be(1,666) | Be(0.05,33.33) | Be(1.6,407.4) | Be(1.05,497) |
| :------------------- | :-------: | :------------: | :-----------: | :----------: |
| Prior prob           |   0.632   |     0.882      |     0.222     |    0.500     |
| Post prob (0 events) |   0.683   |     0.939      |     0.289     |    0.568     |
| Post prob (1 event)  |   0.319   |     0.162      |     0.074     |    0.213     |

--

- Suppose we have a single-arm study of 100 subjects.

--

- Consider the two most likely potential outcomes, which give the posterior probabilities in the table (a reproduction sketch follows on the next slide):
  + 0 adverse outcomes observed
  + 1 adverse outcome observed
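---

## Potential prior distributions

- A minimal sketch of how the table can be reproduced in R, assuming the single-arm study of 100 subjects; the Beta updates follow the conjugacy result from earlier:

```r
# Pr(theta_N < 0.0015) under each prior, and under the posteriors
# after 0 or 1 adverse events in a study of n = 100 subjects
priors <- list(c(1, 666), c(0.05, 33.33), c(1.6, 407.4), c(1.05, 497))
n <- 100
for (prior in priors) {
  a <- prior[1]; b <- prior[2]
  prior_prob <- pbeta(0.0015, a, b)           # prior probability
  post0 <- pbeta(0.0015, a, b + n)            # posterior, 0 events
  post1 <- pbeta(0.0015, a + 1, b + n - 1)    # posterior, 1 event
  cat(sprintf("Beta(%g, %g): %.3f  %.3f  %.3f\n",
              a, b, prior_prob, post0, post1))
}
```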
---

## Problems with the priors

- After just 100 trials with no bad reactions, the low-weight prior (33.38 prior observations) indicates one should be 94% sure the new agent is as safe as (or safer than) the old one.

--

- The low-weight prior largely assumes the conclusion we actually hope for (that the new agent is safer), so it takes very little confirmatory data to reach that conclusion.

--

- Is this the behavior we want?

--

- Take-home message: be very careful with priors that have `\(a < 1\)` with data that are all 0's (or `\(b < 1\)` with all 1's).

--

- Given that we know the adverse event rate should be small, we might try a restricted prior, e.g., Unif(0,0.1).

--

- In all cases, how many trials would we need, assuming no adverse reactions, to be 95% sure that the new agent is as safe as (or safer than) the old one? (Homework question!)

---

## Marginal likelihood

- Recall that the .hlight[marginal likelihood] is
.block[
.small[
`$$L(y) = \int L(y; \theta)p(\theta)\textrm{d}\theta.$$`
]
]

- What is the marginal likelihood for the Beta-binomial?

--

- We have
.block[
.small[
$$
`\begin{aligned}
L(y) & = \int L(y; \theta)p(\theta)\textrm{d}\theta \\
& = \int_0^1 {n \choose y} \theta^y(1-\theta)^{n-y} \frac{1}{B(a,b)}\theta^{a-1}(1-\theta)^{b-1} d\theta\\
& = {n \choose y} \frac{B(a + y,\, b + n-y)}{B(a,b)},
\end{aligned}`
$$
]
]

  by the integral definition of the Beta function.

--

- Deriving the marginal likelihood for conjugate distributions is often relatively straightforward.

---

## Posterior predictive distribution

- Let's go back to Bernoulli data. Suppose `\(y_1,\ldots,y_n \overset{iid}{\sim} \textrm{Bernoulli}(\theta)\)`.

--

- We may wish to predict a new data point `\(y_{n+1}\)`.

--

- We can do so using the posterior predictive distribution `\(p(y_{n+1}|y_{1:n})\)`.

--

- <div class="question">
Why are we not including the parameter in the posterior predictive distribution?
</div>

--

- Recall that we have conditional independence of the `\(y\)`'s given `\(\theta\)`.

--

- Generally,
.block[
.small[
$$
`\begin{aligned}
p(y_{n+1}|y_{1:n}) &= \int p(y_{n+1},\theta|y_{1:n})\,d\theta\\
&= \int p(y_{n+1}|\theta,y_{1:n})\, p(\theta|y_{1:n})\,d\theta\\
& = \int p(y_{n+1}|\theta)\, p(\theta|y_{1:n})\,d\theta.
\end{aligned}`
$$
]
]

---

## Posterior predictive distribution

- When we talk about the posterior predictive distribution for Bernoulli data, we are really referring to `\(p(y_{n+1} = 1|y_{1:n})\)` and `\(p(y_{n+1} = 0|y_{1:n})\)`.

--

- Then,
.block[
.small[
$$
`\begin{aligned}
p(y_{n+1}=1|y_{1:n}) &= \int p(y_{n+1}= 1|\theta)\, p(\theta|y_{1:n})\,d\theta \\
&= ... \\
&= ...
\end{aligned}`
$$
]
]

  <div class="question">
which simplifies to what? In class!
</div>

--

- What then is `\(p(y_{n+1} = 0|y_{1:n})\)`?

--

- The posterior predictive pmf therefore takes the form
.block[
.small[
`$$p(y_{n+1}|y_{1:n}) = \dfrac{a_n^{y_{n+1}} b_n^{1-y_{n+1}}}{a_n + b_n}; \ \ \ y_{n+1}=0,1.$$`
]
]

- What are `\(a_n\)` and `\(b_n\)`? (A sketch for checking your answer follows on the next slide.)
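---

## Posterior predictive distribution

- A minimal Monte Carlo sketch for checking the in-class derivation. The values below (a `\(\textrm{Beta}(1,1)\)` prior, `\(n = 10\)` trials with 3 successes) are purely illustrative, not from the example:

```r
# Approximate p(y_{n+1} = 1 | y_{1:n}) by simulation:
# draw theta from the posterior Beta(a + sum(y), b + n - sum(y)),
# then draw y_{n+1} ~ Bernoulli(theta), and average
set.seed(1)
a <- 1; b <- 1; n <- 10; sum_y <- 3          # illustrative values
theta <- rbeta(1e5, a + sum_y, b + n - sum_y)
y_new <- rbinom(1e5, 1, theta)
mean(y_new)   # compare with your closed-form answer
```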
---

## Priors with restricted support

- As we have seen, when dealing with rare events, we might expect the true proportion to be very small.

--

- In that case, we might want to try a restricted prior, e.g., Unif(0,0.1).

--

- Even when we don't have rare events, we might still desire truncation if we are certain the true proportion lies within `\((a,b)\)` with `\(0 < a < b < 1\)`.

--

- It is therefore often really useful to incorporate truncation.

--

- Let `\(\theta =\)` probability of a randomly-selected student making an `\(A\)` in this course.

--

- You may want to rule out very low and very high values; perhaps `\(\theta \in [0.35, 0.6]\)` with probability one.

--

- How do we choose a prior restricted to this interval?

---

## Uniform priors

- One possibility is to just choose a uniform prior.

--

- When the parameter `\(\theta\)` is a probability, the typical uniform prior corresponds to `\(\textrm{Beta}(1,1)\)`.

--

- This is uniform on the entire (0,1) interval.

--

- However, we can just as easily choose a uniform prior on a narrower interval, Unif(a,b) with `\(0 < a < b < 1\)`.

--

- Perhaps not flexible enough.

--

- It would be nice if we could pick a flexible beta density and then truncate it to `\((a,b)\)`.

---

## Truncated random variables

- Suppose we have some arbitrary random variable `\(\theta \sim f\)` with support `\(\Theta\)`.

--

- For example, `\(\theta \sim \textrm{Beta}(c,d)\)` has support on (0,1).

--

- Then, we can modify the density `\(f(\theta)\)` to have support on a sub-interval `\([a,b] \subset \Theta\)`.

--

- The density `\(f(\theta)\)` **truncated** to `\([a,b]\)` is
.block[
.small[
`$$f_{[a,b]}(\theta) = \dfrac{f(\theta)\mathbb{1}[\theta \in [a,b]]}{\int^b_a f(\theta^\star)\textrm{d}\theta^\star},$$`
]
]

  with `\(\mathbb{1}[A]\)` being the indicator function that returns 1 if A is true and 0 otherwise.

---

## Truncated beta density

- Suppose that, to characterize the prior probability of earning an A, you poll a sample of students from a former STA 602 course and find that 10 earned an A and 10 earned a B (or lower).

--

- Therefore, you go with a `\(\textrm{Beta}(10,10)\)` prior truncated to [0.35, 0.6].

--

- In R, we can calculate the truncated beta density at `p` via

```r
p  <- seq(0, 1, length = 1000)
f1 <- dbeta(p, 10, 10)                      # untruncated Beta(10,10) density
# truncate to (0.35, 0.6): zero out the density outside and renormalize
f2 <- dbeta(p, 10, 10) * as.numeric(p > 0.35 & p < 0.6) /
  (pbeta(0.6, 10, 10) - pbeta(0.35, 10, 10))
f3 <- dunif(p, 0.35, 0.6)                   # Unif(0.35, 0.6) density
plot(p, f2, type = 'l', col = 'green4', xlim = c(0, 1), ylab = 'Density',
     xlab = expression(theta), ylim = c(0, 6))
lines(p, f1, type = 'l', col = 'blue')
lines(p, f3, type = 'l', col = 'red4')
labels <- c("beta(10,10)", "truncated beta", "unif(0.35,.6)")
legend("topright", inset = .05, labels, lwd = 2, lty = c(1, 1, 1),
       col = c('blue', 'green4', 'red4'))
```

---

## Truncated beta density

What would that look like?

<img src="03-one-parameter-models-II_files/figure-html/unnamed-chunk-8-1.png" style="display: block; margin: auto;" />

---

## Truncated beta density

The truncated density by itself would look like

<img src="03-one-parameter-models-II_files/figure-html/unnamed-chunk-9-1.png" style="display: block; margin: auto;" />

---

## The inverse cdf method

- How do we sample truncated random variables?

--

- First, start with the pdf of an untruncated distribution, such as `\(\theta \sim \textrm{Beta}(c,d)\)`.

--

- Suppose we then want to sample `\(\theta \sim \textrm{Beta}_{[a,b]}(c,d)\)`. How can we do that? One popular method is the .hlight[inverse-cdf method].

--

- The inverse cdf is useful for generating random variables in general, and especially for generating truncated random variables.

--

- Suppose we have `\(\theta \sim f\)`, for some arbitrary continuous density `\(f\)`.

--

- According to the probability integral transform, for any continuous random variable `\(\theta\)`, the random variable `\(U = F(\theta)\)` has a Unif(0,1) distribution. Note that `\(F\)` is the cdf.

--

- Thus, to use the inverse-cdf method to sample `\(\theta \sim f\)`, first sample `\(u \sim \textrm{Unif}(0,1)\)`, then set `\(\theta = F^{-1}(u)\)`.
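---

## The inverse cdf method

- A quick simulation sketch of the probability integral transform, using `\(\textrm{Beta}(10,10)\)` purely as an illustration (any continuous distribution works): applying the cdf to draws from the distribution should give something that looks Unif(0,1).

```r
# If theta ~ Beta(10, 10), then U = F(theta) should be Unif(0, 1)
set.seed(42)
theta <- rbeta(1e4, 10, 10)   # draws from Beta(10, 10)
u <- pbeta(theta, 10, 10)     # apply the cdf F
hist(u, freq = FALSE)         # roughly flat on (0, 1)
abline(h = 1, col = 'red')    # the Unif(0, 1) density
```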
---

## The inverse cdf method

- As an example, suppose we want to sample `\(\theta \sim \textrm{Beta}(c,d)\)` through the inverse cdf method.

--

- Very easy. Just do the following in R (for chosen values of `c` and `d`).

```r
u <- runif(1, 0, 1)       # sample from Unif(0, 1)
theta <- qbeta(u, c, d)   # apply the inverse cdf of Beta(c, d)
```

--

- That is, first sample from a uniform distribution.

--

- Then, transform it using the inverse cdf of the `\(\textrm{Beta}(c,d)\)` distribution.

--

- Voila!

---

## The inverse cdf method

- Back to the original problem: how do we sample `\(\theta \sim \textrm{Beta}_{[a,b]}(c,d)\)`?

--

- If we had the inverse cdf of `\(\textrm{Beta}(c,d)\)` truncated to `\([a,b]\)`, then we could use the inverse cdf method. Easy enough! Let's find that inverse cdf.

--

- Let `\(f\)`, `\(F\)` and `\(F^{-1}\)` denote the pdf, cdf and inverse cdf without truncation, and let `\(A=[a,b]\)`.

--

- Recall that the density `\(f(\theta)\)` **truncated** to `\([a,b]\)` is
.block[
.small[
`$$f_{A}(\theta) = f_{[a,b]}(\theta) = \dfrac{f(\theta)\mathbb{1}[\theta \in [a,b]]}{\int^b_a f(\theta^\star)\textrm{d}\theta^\star} = \dfrac{f(\theta)\mathbb{1}[\theta \in [a,b]]}{F(b) - F(a)}.$$`
]
]

--

- Therefore, for `\(z \in [a,b]\)`, the truncated cdf is
.block[
.small[
`$$F_{A}(z) = \Pr[\theta \leq z] = \dfrac{F(z) - F(a)}{F(b) - F(a)}.$$`
]
]

--

- Not enough though. We need the truncated inverse cdf.

---

## The inverse cdf method

- To find the inverse cdf `\(F^{-1}_A(u)\)`, let `\(F_{A}(z) = u\)`. That is, set
.block[
.small[
`$$u = F_{A}(z) = \dfrac{F(z) - F(a)}{F(b) - F(a)}$$`
]
]

  and solve for `\(z\)` as a function of `\(u\)`.

--

- Re-expressing as a function of `\(F(z)\)`,
.block[
.small[
`$$F(z) = \{F(b) - F(a)\}u + F(a).$$`
]
]

--

- Applying the untruncated inverse cdf `\(F^{-1}\)` to both sides, we have
.block[
.small[
`$$z = F^{-1}[\{F(b) - F(a)\}u + F(a)] = F^{-1}_A(u).$$`
]
]

---

## The inverse cdf method

- We now have all the pieces to use the inverse-cdf method to sample `\(\theta \sim f_A\)`, that is, `\(f\)` truncated to `\(A\)`.

--

- First, draw a `\(\textrm{Unif}(0,1)\)` random variable

```r
u <- runif(1, 0, 1)
```

--

- Next, apply the linear transformation:
.block[
.small[
`$$u^\star = \{F(b) - F(a)\}u + F(a).$$`
]
]

--

- Finally, plug `\(u^\star\)` into the untruncated inverse cdf: `\(\theta = F^{-1}(u^\star)\)`.

--

- Note we can equivalently sample `\(u^\star \sim \textrm{Unif}(F(a), F(b))\)`, i.e., `runif(1, F(a), F(b))` in R. A full sketch putting the steps together follows.
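---

## The inverse cdf method

- Putting the pieces together, a minimal sketch sampling from the `\(\textrm{Beta}(10,10)\)` prior truncated to [0.35, 0.6] from the earlier slides:

```r
# Sample theta ~ Beta(10, 10) truncated to [0.35, 0.6]
# via the inverse-cdf method
a <- 0.35; b <- 0.6
Fa <- pbeta(a, 10, 10)               # F(a)
Fb <- pbeta(b, 10, 10)               # F(b)
u <- runif(1e4, 0, 1)
u_star <- (Fb - Fa) * u + Fa         # linear transformation
theta <- qbeta(u_star, 10, 10)       # untruncated inverse cdf
# equivalently: theta <- qbeta(runif(1e4, Fa, Fb), 10, 10)
range(theta)                         # all draws lie in [0.35, 0.6]
hist(theta, freq = FALSE)            # matches the truncated density
```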