class: center, middle, inverse, title-slide

# One parameter models cont’d; Loss functions and Bayes risk

### Dr. Olanrewaju Michael Akande

### Jan 22, 2020

---

## Announcements

- Add/drop today

- HW1 due tomorrow

- Take the participation quiz for today on Sakai

---

## Outline

- Loss functions and Bayes risk

- Frequentist vs Bayesian intervals

- Poisson-Gamma model

  + Recap of the distributions

  + Conjugacy

  + Example

  + Posterior prediction

  + Other parameterizations

---

class: center, middle

# Loss functions and Bayes risk

---

## Bayes estimate

- As we've seen by now, having posterior distributions instead of one-number summaries is great for capturing uncertainty.

--

- That said, it is still very appealing to have simple summaries, especially when dealing with clients or collaborators from other fields who desire one.

--

- Can we obtain a single estimate of `\(\theta\)` based on the posterior? Sure!

--

- The .hlight[Bayes estimate] is the value `\(\hat{\theta}\)` that minimizes the Bayes risk.

--

- .hlight[Bayes risk] is defined as the expected loss averaged over the posterior distribution.

--

- Put differently, a .hlight[Bayes estimate] `\(\hat{\theta}\)` has the lowest posterior expected loss.

--

- <div class="question">
That's fine, but what does expected loss mean?
</div>

--

- .hlight[Frequentist risk] also exists, but we won't go into that here.

---

## Loss functions

- A .hlight[loss function] `\(L(\theta,\delta(y))\)` is a function of a parameter `\(\theta\)`, where `\(\delta(y)\)` is some .hlight[decision] about `\(\theta\)`, based on just the data `\(y\)`.

--

- For example, `\(\delta(y) = \bar{y}\)` can be the decision to use the sample mean to estimate `\(\theta\)`, the true population mean.

--

- `\(L(\theta,\delta(y))\)` determines the penalty for making the decision `\(\delta(y)\)` if `\(\theta\)` is the true parameter; `\(L(\theta,\delta(y))\)` characterizes the price paid for errors.

--

- A common choice, for example when dealing with point estimation, is the .hlight[squared error loss], which has
.block[
.small[
$$L(\theta,\delta(y)) = (\theta - \delta(y))^2.$$
]
]

--

- Bayes risk is thus
.block[
.small[
$$\rho(\theta,\delta) = \mathbb{E}\left[\ L(\theta,\delta(y)) | \ y\right] = \int L(\theta,\delta(y)) \ p(\theta|y) \ d\theta,$$
]
]

  and we proceed to find the value `\(\hat{\theta}\)`, that is, the decision `\(\delta(y)\)`, that minimizes the Bayes risk.

---

## Bayes estimator under squared error loss

- Turns out that, under squared error loss, the decision `\(\delta(y)\)` that minimizes the posterior risk is the posterior mean.

--

- Proof: Let `\(L(\theta,\delta(y)) = (\theta - \delta(y))^2\)`. Then,
.block[
.small[
$$
`\begin{aligned}
\rho(\theta,\delta) & = \int L(\theta,\delta(y)) \ p(\theta|y) \ d\theta\\
& = \int (\theta - \delta(y))^2 \ p(\theta|y) \ d\theta.\\
\end{aligned}`
$$
]
]

--

- Expand, then take the partial derivative of `\(\rho(\theta,\delta)\)` with respect to `\(\delta(y)\)`.

--

- <div class="question">
To be continued on the board!
</div>

--

- It is then easy to see that `\(\delta(y) = \mathbb{E}[\theta | y]\)` is the minimizer.

--

- Well that's great! The posterior mean is usually very easy to calculate. In the beta-binomial case, we have
.block[
.small[
$$\hat{\theta} = \frac{a+y}{a+b+n}.$$
]
]
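---

## Bayes estimator under squared error loss

- To see this numerically, here is a minimal R sketch for the beta-binomial case. The values of `a`, `b`, `n` and `y` below are hypothetical, chosen only for illustration.

```r
# Hypothetical beta-binomial setup: beta(a, b) prior, y successes in n trials
a <- 1; b <- 1; n <- 20; y <- 12

# Closed-form Bayes estimate under squared error loss: the posterior mean
(a + y) / (a + b + n)

# Monte Carlo check: approximate the posterior expected loss over a grid of decisions
set.seed(42)
theta <- rbeta(1e5, a + y, b + n - y)     # draws from the beta posterior
delta <- seq(0.01, 0.99, by = 0.01)       # candidate decisions delta(y)
risk  <- sapply(delta, function(d) mean((theta - d)^2))
delta[which.min(risk)]                    # close to the posterior mean above
```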
---

## What about other loss functions?

- Clearly, squared error is only one possible loss function. An alternative is .hlight[absolute loss], which has
.block[
.small[
$$L(\theta,\delta(y)) = |\theta - \delta(y)|.$$
]
]

--

- Absolute loss places less of a penalty on large deviations & the resulting Bayes estimate is the **posterior median**.

--

- The median is actually relatively easy to compute.

--

- Recall that for a continuous random variable `\(Y\)` with cdf `\(F\)`, the median of the distribution is the value `\(z\)` which satisfies
.block[
.small[
$$F(z) = \Pr(Y\leq z) = \dfrac{1}{2}= \Pr(Y\geq z) = 1-F(z).$$
]
]

--

- As long as we know how to evaluate the cdf of the distribution we have, we can solve for `\(z\)`.

--

- Think R!

---

## What about other loss functions?

- For the beta-binomial model, the cdf of the beta posterior can be written as
.block[
.small[
$$F(z) = \Pr(\theta\leq z | y) = \int^z_0 \textrm{beta}(\theta; a+y, b+n-y) d\theta.$$
]
]

--

- Then, if `\(\hat{\theta}\)` is the median, we have that `\(F(\hat{\theta}) = 0.5\)`.

--

- To solve for `\(\hat{\theta}\)`, apply the inverse cdf: `\(\hat{\theta} = F^{-1}(0.5)\)`.

--

- In R, that's simply

```r
qbeta(0.5, a+y, b+n-y)
```

--

- For other popular distributions, switch out the beta.

---

## Loss functions and decisions

- Loss functions are not specific to estimation problems but are a critical part of decision making.

--

- For example, suppose you are deciding how much money to bet ($A) on Duke in the first UNC-Duke men's basketball game this year (next month).

--

- Suppose, if Duke

  + loses (y = 0), you lose the amount you bet ($A)

  + wins (y = 1), you gain B per $1 bet

--

- <div class="question">
What is a good sampling distribution for y here?
</div>

--

- Then, the loss function can be characterized as
.block[
.small[
$$L(A,y) = A(1-y) - y(BA),$$
]
]

  with your action being the amount bet, A.

--

- <div class="question">
When will your bet be "rational"?
</div>

---

## How much to bet on Duke?

- `\(y\)` is an unknown state, but we can think of it as a new prediction `\(y_{n+1}\)` given that we have data from win-loss records `\((y_{1:n})\)` that can be converted into a Bayesian posterior,
.block[
.small[
$$\theta \sim \textrm{beta}(a_n,b_n),$$
]
]

--

  with this posterior concentrated slightly to the left of 0.5, if we only use data on UNC-Duke games (UNC men lead Duke 139-112 all time).

--

- Actually, it might make more sense to focus on more recent head-to-head data and not the all-time record.

--

- In fact, we might want to build a model that predicts the outcome of the game using historical data & predictors (current team rankings, injuries, etc.).

--

- However, to keep it simple for this illustration, go with the posterior above.

---

## How much to bet on Duke?

- The Bayes risk for action A is then the expectation of the loss function,
.block[
.small[
$$\rho(A) = \mathbb{E}\left[\ L(A,y) | \ y_{1:n}\right].$$
]
]

--

- To calculate this as a function of `\(A\)` and find the optimal `\(A\)`, we need to marginalize over the **posterior predictive distribution** for `\(y\)`.

--

- <div class="question">
Why are we using the posterior predictive distribution here instead of the posterior distribution?
</div>

--

- Recall from the last class that
.block[
.small[
$$p(y_{n+1}|y_{1:n}) = \dfrac{a_n^{y_{n+1}} b_n^{1-y_{n+1}}}{a_n + b_n}; \ \ \ y_{n+1}=0,1.$$
]
]

--

- Specifically, that the posterior predictive distribution here is `\(\textrm{Bernoulli}(\hat{\theta})\)`, with
.block[
.small[
$$\hat{\theta} = \dfrac{a_n}{a_n + b_n}.$$
]
]

--

- By the way, what do `\(a_n\)` and `\(b_n\)` represent?
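---

## How much to bet on Duke?

- As a quick illustration, here is a minimal R sketch of `\(\hat{\theta}\)` that uses only the all-time record quoted above and an assumed uniform beta(1, 1) prior; both of these choices are just for illustration.

```r
# Assumed uniform prior (a = b = 1); data: 112 Duke wins, 139 Duke losses all time
a_n <- 1 + 112   # prior a plus number of Duke wins
b_n <- 1 + 139   # prior b plus number of Duke losses

# Posterior predictive probability that Duke wins the next game
theta_hat <- a_n / (a_n + b_n)
theta_hat        # roughly 0.45, i.e., slightly to the left of 0.5
```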
---

## How much to bet on Duke?

- With the loss function `\(L(A,y) = A(1-y) - y(BA)\)`, and using the notation `\(y_{n+1}\)` instead of `\(y\)` (to make it obvious the game has not been played), the Bayes risk (expected loss) for bet `\(A\)` is
.block[
.small[
$$
`\begin{split}
\rho(A) & = \mathbb{E}\left[\ L(A,y_{n+1}) | \ y_{1:n}\right] = \mathbb{E}\left[A(1-y_{n+1}) - y_{n+1}(BA) | \ y_{1:n}\right]\\
& = A \ \mathbb{E}\left[1-y_{n+1} | \ y_{1:n}\right] - (BA) \ \mathbb{E}\left[y_{n+1} | \ y_{1:n}\right]\\
& = A \ \left(1 - \mathbb{E}\left[y_{n+1} | \ y_{1:n}\right] \right) - (BA) \ \mathbb{E}\left[y_{n+1} | \ y_{1:n}\right]\\
& = A \ \left(1 -\mathbb{E}\left[y_{n+1} | \ y_{1:n}\right](1+B) \right).
\end{split}`
$$
]
]

--

- Hence, your bet is rational as long as
.block[
.small[
$$\mathbb{E}\left[y_{n+1} | \ y_{1:n}\right](1+B) > 1.$$
]
]

--

- Clearly, there is no limit to the amount you should bet if this condition is satisfied (the loss function is too simple).

--

- The loss function needs to be chosen carefully to lead to a good decision: finite resources, diminishing returns, limits on donations, etc.

--

- <div class="question">
Want more on loss functions, expected loss/utility, or decision problems in general? Consider taking a course on decision theory (STA623?).
</div>

---

class: center, middle

# Frequentist vs Bayesian intervals

---

## Frequentist confidence intervals

- Recall that a frequentist confidence interval `\([l(y); u(y)]\)` has 95% frequentist coverage for a population parameter `\(\theta\)` if, before we collect the data,
.block[
.small[
$$\Pr[l(y) < \theta < u(y) | \theta] = 0.95.$$
]
]

--

- This means that 95% of the time, our constructed interval will cover the true parameter, and 5% of the time it won't.

--

- In any given sample, you don't know whether you're in the lucky 95% or the unlucky 5%.

--

- You just know that either the interval covers the parameter, or it doesn't (true, but clearly not too helpful). There is NOT a 95% chance your interval covers the true parameter once you have collected the data.

--

- Getting the definition of a confidence interval right is tricky, even for those who know what they're doing.

---

## Bayesian intervals

- An interval `\([l(y); u(y)]\)` has 95% Bayesian coverage for `\(\theta\)` if
.block[
.small[
$$\Pr[l(y) < \theta < u(y) | Y=y] = 0.95.$$
]
]

--

- This describes our information about where `\(\theta\)` lies _after_ we observe the data.

--

- Fantastic!

--

- This is actually the interpretation people want to give to the frequentist confidence interval.

--

- Bayesian interval estimates are often called .hlight[credible intervals].

---

## Bayesian quantile-based interval

- The easiest way to obtain a Bayesian interval estimate is to use posterior quantiles.

--

- Easy since we either know the posterior densities exactly or can sample from the distributions.

--

- To make a `\(100 \times (1-\alpha)\%\)` quantile-based credible interval, find numbers (quantiles) `\(\theta_{\alpha/2} < \theta_{1-\alpha/2}\)` such that

  1. `\(\Pr(\theta < \theta_{\alpha/2} | Y=y) = \dfrac{\alpha}{2}\)`; and

  2. `\(\Pr(\theta > \theta_{1-\alpha/2} | Y=y) = \dfrac{\alpha}{2}\)`.

--

- This is an .hlight[equal-tailed interval]. Often when researchers refer to a credible interval, this is what they mean.
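---

## Bayesian quantile-based interval

- For the beta-binomial model, here is a minimal R sketch (the values of `a`, `b`, `n` and `y` are hypothetical, just for illustration):

```r
# Hypothetical beta-binomial setup: beta(a, b) prior, y successes in n trials
a <- 1; b <- 1; n <- 20; y <- 12

# 95% equal-tailed (quantile-based) credible interval:
# the alpha/2 and 1 - alpha/2 posterior quantiles
alpha <- 0.05
qbeta(c(alpha / 2, 1 - alpha / 2), a + y, b + n - y)
```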
---

## Equal-tailed quantile-based interval

<img src="img/hpd.png" height="370px" style="display: block; margin: auto;" />

- This is Figure 3.6 from the Hoff book. Focus on the quantile-based credible interval for now.

--

- Note that there are values of `\(\theta\)` outside the quantile-based credible interval with higher density than some values inside the interval. This suggests that we can do better with interval estimation.

---

## HPD region

- A `\(100 \times (1-\alpha)\%\)` highest posterior density (HPD) region is a subset `\(s(y)\)` of the parameter space `\(\Theta\)` such that

  1. `\(\Pr(\theta \in s(y) | Y=y) = 1-\alpha\)`; and

  2. If `\(\theta_a \in s(y)\)` and `\(\theta_b \notin s(y)\)`, then `\(p(\theta_a | y) > p(\theta_b | y)\)`.

--

- `\(\Rightarrow\)` **All** points in an HPD region have higher posterior density than points outside the region. *Note this region is not necessarily a single interval (e.g., in the case of a multimodal posterior).*

--

- The basic idea is to gradually move a horizontal line down across the density, including in the HPD region all values of `\(\theta\)` with a density above the horizontal line.

--

- Stop moving the line down when the posterior probability of the values of `\(\theta\)` in the region reaches `\(1-\alpha\)`.

---

## HPD region

Hoff Figure 3.6 shows how to construct an HPD region.

<img src="img/hpd.png" height="450px" style="display: block; margin: auto;" />

---

class: center, middle

# Poisson-gamma model

---

## Poisson distribution recap

- `\(Y_1,\ldots,Y_n \overset{iid}{\sim} \textrm{Po}(\theta)\)` denotes that each `\(Y_i\)` is a .hlight[Poisson random variable].

--

- The Poisson distribution is commonly used to model count data consisting of the number of events in a given time interval.

--

- Some examples: # children, # lifetime romantic partners, # songs on iPhone, # tumors on mouse, etc.

--

- The Poisson distribution is parameterized by `\(\theta\)` and the pmf is given by
.block[
.small[
$$\Pr[Y_i = y_i | \theta] = \dfrac{\theta^{y_i} e^{-\theta}}{y_i!}; \ \ \ \ y_i=0,1,2,\ldots; \ \ \ \ \theta > 0,$$
]
]

  with
.block[
.small[
$$\mathbb{E}[Y_i] = \mathbb{V}[Y_i] = \theta.$$
]
]

--

- <div class="question">
What is the joint likelihood? What is the best guess (MLE) for the Poisson parameter? What is the sufficient statistic for the Poisson parameter?
</div>

---

## Gamma density recap

- The .hlight[gamma density] will be useful as a prior for parameters that are strictly positive.

--

- If `\(\theta \sim \textrm{Ga}(a,b)\)`, we have the pdf
.block[
.small[
$$f(\theta) = \frac{b^a}{\Gamma(a)} \theta^{a-1}e^{-b\theta},$$
]
]

  where `\(a\)` is known as the .hlight[shape parameter] and `\(b\)`, the .hlight[rate parameter].

--

- Another parameterization uses the .hlight[scale parameter] `\(\phi = 1/b\)` instead of `\(b\)`.

--

- Some properties:

  + `\(\mathbb{E}[\theta] = \dfrac{a}{b}\)`

  + `\(\mathbb{V}[\theta] = \dfrac{a}{b^2}\)`

  + `\(\textrm{Mode}[\theta] = \dfrac{a-1}{b} \ \ \textrm{for} \ \ a \geq 1\)`

---

## Gamma density

- If our prior guess of the expected count is `\(\mu\)` & we have a prior "scale" `\(\phi\)`, we can let
.block[
.small[
$$\mathbb{E}[\theta] = \mu = \dfrac{a}{b}; \ \ \mathbb{V}[\theta] = \mu \phi = \dfrac{a}{b^2},$$
]
]

  and solve for `\(a\)`, `\(b\)`. We can play the same game if we have a prior variance or standard deviation; see the sketch on the next slide.

--

- More properties:

  + If `\(\theta_1,\ldots,\theta_p \overset{ind}{\sim} \textrm{Ga}(a_i,b)\)`, then `\(\sum_i \theta_i \sim \textrm{Ga}(\sum_i a_i,b)\)`.

--

  + If `\(\theta \sim \textrm{Ga}(a,b)\)`, then for any `\(c > 0\)`, `\(c\theta \sim \textrm{Ga}(a,b/c)\)`.

--

  + If `\(\theta \sim \textrm{Ga}(a,b)\)`, then `\(1/\theta\)` has an .hlight[Inverse-Gamma distribution].

--

  *We'll take advantage of these soon!*
---

## Example gamma distributions

<img src="04-one-parameter-models-III_files/figure-html/unnamed-chunk-5-1.png" style="display: block; margin: auto;" />

*R has the option to specify either the rate or the scale parameter, so always make sure you specify the right one when using "dgamma", "rgamma", etc.!*
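---

## Example gamma distributions

- For instance, here is a quick sketch of the rate vs. scale issue (the shape and rate values below are hypothetical):

```r
# In dgamma()/rgamma(), the argument that follows shape is the rate, not the scale
a <- 2; b <- 3                          # shape a, rate b

dgamma(1.5, shape = a, rate = b)        # gamma density with rate b
dgamma(1.5, shape = a, scale = 1 / b)   # same value, via the scale parameterization
dgamma(1.5, a, b)                       # unnamed third argument is treated as the rate

# One example density curve, similar to those in the figure above
curve(dgamma(x, shape = a, rate = b), from = 0, to = 5,
      ylab = "density", main = "Ga(2, 3) density")
```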