class: center, middle, inverse, title-slide

# Probability review and one parameter models

### Dr. Olanrewaju Michael Akande

### Jan 15, 2020

---

## Announcements

- No make-up for Monday's lab.

- Final exam will be either online or take home. Not in class.

- Homework 1 coming soon...but here are some readings to keep you busy:
  1. Efron, B., 1986. Why isn't everyone a Bayesian?. The American Statistician, 40(1), pp. 1-5.
  2. Gelman, A., 2008. Objections to Bayesian statistics. Bayesian Analysis, 3(3), pp. 445-449.
  3. Diaconis, P., 1977. Finite forms of de Finetti's theorem on exchangeability. Synthese, 36(2), pp. 271-281.
  4. Gelman, A., Meng, X. L. and Stern, H., 1996. Posterior predictive assessment of model fitness via realized discrepancies. Statistica Sinica, pp. 733-760.
  5. Dunson, D. B., 2018. Statistics in the big data era: Failures of the machine. Statistics & Probability Letters, 136, pp. 4-9.

---

## Outline

- Probability review
  + Random variables
  + Joint distributions
  + Independence and exchangeability

- Introduction to Bayesian Inference (Cont'd)
  + Conjugacy
  + Kernels
  + Bernoulli and binomial data
  + Selecting priors
  + Truncated priors

---

class: center, middle

# Probability review

---

## Discrete random variables

- A .hlight[random variable] is .hlight[discrete] if the set of all possible outcomes is .hlight[countable].

--

- The .hlight[probability mass function (pmf)] of a discrete random variable `\(Y\)`, `\(p(y)\)`, describes the probability associated with each possible value of `\(Y\)`.

--

- `\(p(y)\)` has the following properties:
  1. `\(0 \leq p(y) \leq 1\)` for all values `\(y \in \mathcal{Y}\)`.
  2. `\(\sum_{y \in \mathcal{Y}} p(y) = 1\)`.

---

## Bernoulli distribution

- The .hlight[Bernoulli distribution] can be used to describe an experiment with two outcomes, such as
  + Flipping a coin (heads or tails);
  + Vote turnout (vote or not); and
  + The outcome of a basketball game (win or loss).

--

- In all cases, we can represent this as a binary random variable where the probability of "success" is `\(\theta\)` and the probability of "failure" is `\(1-\theta\)`.

--

- We usually write this as: `\(Y \sim \textrm{Bernoulli}(\theta)\)`, where `\(\theta \in [0,1]\)`.

--

- It follows that
.block[
.small[
`$$\Pr(Y=y|\theta) = \theta^y(1-\theta)^{1-y}; \ \ \ y=0,1.$$`
]
]

--

- <div class="question"> What is the mean of this distribution? What is the variance? </div>

---

## Binomial distribution

- The .hlight[binomial distribution] describes the number of successes from `\(n\)` independent Bernoulli trials.

--

- That is, `\(Y =\)` number of "successes" in `\(n\)` independent trials and `\(\theta\)` is the probability of success per trial.

--

- We usually write this as: `\(Y \sim \textrm{Bin}(n,\theta)\)`, where `\(\theta \in [0,1]\)`.

--

- The pmf is
.block[
.small[
`$$\Pr(Y=y|\theta,n) = {n \choose y} \theta^y(1-\theta)^{n-y}; \ \ \ y=0,1,\ldots,n.$$`
]
]

--

- **Example**: `\(Y =\)` number of individuals with type I diabetes out of a sample of `\(n\)` surveyed.

--

- Binomial likelihoods are commonly used when collecting data on proportions.

--

- <div class="question"> What is the mean of this distribution? What is the variance? </div>
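--

- As a quick numerical check of your answers (a minimal sketch in base R; the values of `n` and `theta` below are just illustrative):

```r
# Simulate binomial draws and compare the sample mean/variance with the
# closed-form answers n*theta and n*theta*(1-theta)
set.seed(42)
n <- 20; theta <- 0.3
y <- rbinom(1e5, size = n, prob = theta)
c(mean = mean(y), var = var(y))          # should be close to 6 and 4.2
c(n * theta, n * theta * (1 - theta))    # exact values for comparison
```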
---

## Poisson distribution

- `\(Y \sim \textrm{Po}(\theta)\)` denotes that `\(Y\)` is a .hlight[Poisson random variable].

--

- The Poisson distribution is commonly used to model count data consisting of the number of events in a given time interval.

--

- The Poisson distribution is parameterized by `\(\theta\)` and the pmf is given by
.block[
.small[
`$$\Pr[Y = y | \theta] = \dfrac{\theta^y e^{-\theta}}{y!}; \ \ \ \ y=0,1,2,\ldots; \ \ \ \ \theta > 0.$$`
]
]

--

- Similar to the binomial but with no limit on the total number of counts.

--

- <div class="question"> What is the mean of this distribution? What is the variance? </div>

---

## General discrete distributions

- Useful to consider general discrete distributions having an arbitrary form.

--

- Suppose `\(Y \in \{y_1^\star,\ldots,y_k^\star\}\)`. Then define `\(\Pr(Y = y_h^\star) = \pi_h\)` for each `\(h = 1,\ldots, k\)`. That is,
.block[
.small[
`$$\Pr[Y = y| \boldsymbol{\pi}] = \prod_h \pi_h^{\mathbb{1}[y = y_h^\star]}; \ \ y \in \{y_1^\star,\ldots,y_k^\star\},$$`
]
]

  where `\(\boldsymbol{\pi} = (\pi_1,\ldots,\pi_k)\)`.

--

- `\((y_1^\star,\ldots,y_k^\star)\)` are "atoms" representing possible values for `\(Y\)`.

--

- For example, these may be words in a dictionary or values for education as a categorical variable. Useful for text data, categorical observations, etc.

--

- Can also write this as `\(Y \sim \sum^k_{h=1} \pi_h \delta_{y_h^\star}\)`, where `\(\delta_{y_h^\star}\)` denotes a unit mass at `\(y_h^\star\)`.

--

- Often called the .hlight[categorical distribution] or .hlight[generalized Bernoulli distribution]. Also, see the .hlight[multinomial distribution].

---

## Continuous random variables

- The .hlight[probability density function (pdf)], `\(p(y)\)` or `\(f(y)\)`, of a continuous random variable `\(Y\)` has slightly different properties:
  1. `\(0 \leq p(y)\)` for all `\(y \in \mathcal{Y}\)`.
  2. `\(\int_{y \in \mathbb{R}} p(y) \textrm{d}y = 1\)`.

--

- The pdf for a continuous random variable is not necessarily less than 1.

--

- Also, `\(p(y)\)` is NOT the probability of value `\(y\)`.

--

- However, if `\(p(y_1) > p(y_2)\)`, we say informally that `\(y_1\)` has a "higher probability" than `\(y_2\)`.

---

## Uniform density

- The simplest example of a continuous density is the .hlight[uniform density].

--

- `\(Y \sim \textrm{Unif}(a,b)\)` denotes that the density is uniform on the interval `\((a,b)\)`.

--

- The pdf is simply
.block[
.small[
`$$f(y) = \dfrac{1}{b-a}; \ \ \ y \in (a,b).$$`
]
]

--

- The cdf is
.block[
.small[
`$$F(y) = \Pr(Y \leq y) = \int^y_a \dfrac{1}{b-a} \textrm{d}z = \dfrac{y-a}{b-a}.$$`
]
]

--

- The mean (expectation) is
.block[
.small[
`$$\dfrac{a+b}{2}.$$`
]
]

- <div class="question"> What is the variance? Also, can you prove the formula for the mean? </div>

---

## Beta density

- The uniform density can be used as a prior for a probability if `\((a,b) \subset (0,1)\)`.

--

- However, it is clearly very inflexible. <div class="question"> Why? </div>

--

- An alternative for `\(y \in (0,1)\)` is the .hlight[beta density], written as `\(Y \sim \textrm{Beta}(a,b)\)`, with
.block[
.small[
`$$f(y) = \frac{1}{B(a,b)} y^{a-1} (1-y)^{b-1}; \ \ \ y \in (0,1), \ a > 0, \ b > 0,$$`
]
]

  where `\(B(a,b) = \dfrac{\Gamma(a)\Gamma(b)}{\Gamma(a+b)}\)`, and `\(\Gamma(n) = (n-1)!\)` for any positive integer `\(n\)`.

--

- As we have already seen, the beta density is quite flexible in characterizing a broad variety of densities on `\((0,1)\)`.

--

- <div class="question"> Beta(1,1) is the same as Unif(0,1). Work out the pdfs to convince yourself! </div>
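--

- A quick check of this last claim (a minimal sketch in base R; the grid of points is just for illustration):

```r
# Beta(1,1) and Unif(0,1) assign the same density to every point in (0,1)
y <- seq(0.01, 0.99, by = 0.01)
all.equal(dbeta(y, 1, 1), dunif(y, min = 0, max = 1))  # TRUE: both densities equal 1
```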
---

## Gamma density

- The .hlight[gamma density] will be useful as a prior for parameters that are strictly positive.

--

- For a random variable `\(Y \sim \textrm{Ga}(a,b)\)`, we have the pdf
.block[
.small[
`$$f(y) = \frac{b^a}{\Gamma(a)} y^{a-1}e^{-by}; \ \ \ y \in (0,\infty), \ a > 0, \ b > 0.$$`
]
]

--

- Properties:
.block[
.small[
`$$\mathbb{E}[Y] = \dfrac{a}{b}; \ \ \mathbb{V}[Y] = \dfrac{a}{b^2}.$$`
]
]

--

- **Note**: parameterizations of the gamma distribution vary!

--

- Under this parameterization, if `\(Y \sim \textrm{Ga}(1,\theta)\)`, then `\(Y \sim \textrm{Exp}(\theta)\)`, that is, the .hlight[exponential distribution].

---

## Continuous joint distributions

- Suppose we have two random variables `\(\theta = (\theta_1,\theta_2)\)`.

--

- Their **joint** distribution function is
.block[
.small[
`$$\Pr(\theta_1 \leq a,\theta_2 \leq b) = \int^a_{-\infty} \int^b_{-\infty} p(\theta_1,\theta_2) \textrm{d}\theta_2\textrm{d}\theta_1,$$`
]
]

  where `\(p(\theta_1,\theta_2)\)` is the joint probability density function (pdf).

--

- The **marginal** density of `\(\theta_1\)` can be obtained by
.block[
.small[
`$$p(\theta_1) = \int^\infty_{-\infty} p(\theta_1,\theta_2) \textrm{d}\theta_2,$$`
]
]

  which is referred to as marginalizing out `\(\theta_2\)`.

--

- We will be doing a lot of "marginalizations" so take note!

---

## Factorizing joint densities and independence

- The joint density `\(p(\theta_1,\theta_2)\)` can be factorized as
.block[
.small[
`$$p(\theta_1,\theta_2) = p(\theta_1|\theta_2)p(\theta_2), \ \ \ \textrm{or} \ \ \ p(\theta_1,\theta_2) = p(\theta_2|\theta_1)p(\theta_1).$$`
]
]

--

- For independent random variables, the joint density equals the product of the marginals:
.block[
.small[
`$$p(\theta_1,\theta_2) = p(\theta_1)p(\theta_2).$$`
]
]

--

- This implies that `\(p(\theta_2|\theta_1) = p(\theta_2)\)` and `\(p(\theta_1|\theta_2) = p(\theta_1)\)` under independence.

--

- These relationships extend automatically to `\(\theta = (\theta_1,\ldots,\theta_p)\)`. That is,
.block[
.small[
`$$p(\theta_1,\ldots,\theta_p) = \prod^p_{j=1} p(\theta_j),$$`
]
]

  under mutual independence of the elements of the `\(\theta\)` vector.

---

## Conditional independence

- Suppose `\(y_i \overset{iid}{\sim} f(\theta)\)` for `\(i = 1,\ldots,n\)`.

--

- Data `\(\{y_i\}\)` are independent & identically distributed draws from distribution `\(f(\theta)\)`.

--

- The data are said to be .hlight[conditionally independent] given `\(\theta\)`.
.block[
.small[
`$$L(y; \theta) = \prod^n_{i=1} f(y_i; \theta),$$`
]
]

  where `\(L(y; \theta) =\)` likelihood of the data conditionally on `\(\theta\)`.

--

- The .hlight[marginal likelihood] of the data is
.block[
.small[
`$$L(y) = \int L(y; \theta)p(\theta)\textrm{d}\theta.$$`
]
]

--

- `\(L(y)\)` can no longer be written as a product of densities as in `\(\prod^n_{i=1} h(y_i)\)`; we lose independence when we marginalize out `\(\theta\)`.

---

## Exchangeability

- In marginalizing out `\(\theta\)`, the observations `\(\{y_i\}\)` are no longer independent.

--

- `\(\{y_i\}\)` are .hlight[exchangeable] if `\(p(y_1,\ldots,y_n) = p(y_{\pi_1},\ldots,y_{\pi_n})\)`, for all permutations `\(\pi\)` of `\(\{1,\ldots,n\}\)`.

--

- .hlight[de Finetti's Theorem]: Suppose `\(\{y_i\}\)` are exchangeable under the above definition for any `\(n\)`. Then
.block[
.small[
`$$p(y_1,\ldots,y_n) = \int \left[ \prod^n_{i=1} f(y_i; \theta) \right] p(\theta)\textrm{d}\theta,$$`
]
]

  for some `\(\theta\)`, prior distribution `\(p(\theta)\)` and sampling model `\(f(y_i;\theta)\)`.

--

- Simply put, de Finetti's Theorem states that exchangeable observations are conditionally independent given some parameter.
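--

- A small simulation of this idea (a sketch only; the Beta(2,2) prior and Bernoulli draws below are illustrative choices, not part of the theorem):

```r
# Draws that are conditionally iid given theta are exchangeable,
# but marginally (with theta integrated out) they are dependent.
set.seed(1)
n_rep <- 1e5
theta <- rbeta(n_rep, 2, 2)      # one draw of theta per replicate
y1 <- rbinom(n_rep, 1, theta)    # two observations sharing the same theta
y2 <- rbinom(n_rep, 1, theta)
cor(y1, y2)                          # clearly positive: not marginally independent
c(mean(y1 == 1 & y2 == 0),           # these two probabilities match,
  mean(y1 == 0 & y2 == 1))           # consistent with exchangeability
```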
--

- de Finetti's Theorem is critical in providing a motivation for using parameters and for putting priors on parameters.

---

class: center, middle

# Introduction to Bayesian Inference (Cont'd)

---

## Frequentist inference

- Given **data** `\(\{y_i\}\)` and an **unknown parameter** `\(\theta\)`, estimate said `\(\theta\)`.

--

- How to estimate `\(\theta\)` under the frequentist paradigm?
  + Maximum likelihood estimate (MLE)
  + Method of moments
  + and so on...

--

- Frequentist ML estimation finds the one value of `\(\theta\)` that maximizes the likelihood.

--

- Typically uses large sample (asymptotic) theory to obtain confidence intervals and do hypothesis testing.

---

## Bayesian inference

- Once again, given **data** `\(\{y_i\}\)` and an **unknown parameter** `\(\theta\)`, estimate said `\(\theta\)`.

--

- Bayesians update their prior information for `\(\theta\)` with the information in the data `\(\{y_i\}\)`, to obtain the posterior density `\(p(\theta | y)\)`.

--

- Personally, I prefer being able to obtain posterior densities that describe my parameter, instead of estimated summaries (usually measures of central tendency) along with confidence intervals.

--

- Bayes' theorem - reminder:
.block[
.small[
`$$p(\theta | y) = \frac{p(\theta)L(y;\theta)}{\int_{\Theta}p(\tilde{\theta})L(y; \tilde{\theta}) \textrm{d}\tilde{\theta}} = \frac{p(\theta)L(y;\theta)}{L(y)}$$`
]
]

---

## Comments on the posterior density

- The posterior density is more concentrated than the prior & quantifies learning about `\(\theta\)`.

--

- In fact, this is the optimal way to learn from data - see the discussion in Hoff chapter 1.

--

- As more & more data become available, the posterior density will converge to a normal (Gaussian) density centered on the MLE (Bayes central limit theorem).

--

- In finite samples with limited data, the posterior can be highly skewed & noticeably non-Gaussian.

---

## Conjugacy

- Starting with an arbitrary prior density `\(p(\theta)\)` & likelihood `\(L(y;\theta)\)`, we may encounter problems in calculating the posterior density `\(p(\theta | y)\)`.

--

- In particular, notice the denominator in Bayes' rule:
.block[
.small[
`$$L(y) = \int_{\Theta}p(\tilde{\theta})L(y; \tilde{\theta}) \textrm{d}\tilde{\theta}.$$`
]
]

  This integral may not be analytically tractable!

--

- When the prior is .hlight[conjugate], however, the marginal likelihood can be calculated analytically.

--

- .hlight[Conjugacy] `\(\Rightarrow\)` the posterior has the same form as the prior.

--

- Often useful to think of the hyperparameters of a conjugate prior distribution as corresponding to having observed a certain number of (historical) pseudo-observations with properties specified by the parameters.

--

- Conjugate priors make calculations easy but may not represent our prior information well.

---

## Kernels

- In Bayesian statistics, the .hlight[kernel] of a pdf omits any multipliers that do not depend on the random variable or parameter we care about.

--

- For many distributions, the kernel is in a simple form but the normalizing constant complicates calculations.

--

- If one recognizes the kernel as that of a known distribution, then the normalizing constant can be reinstated. This is a very MAJOR TRICK we will use to calculate posterior distributions.
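--

- As a small numerical illustration (a sketch only; the beta kernel and the values `a = 3`, `b = 5` are just an example of recognizing a kernel):

```r
# Integrating an unnormalized Beta kernel and comparing with the known
# normalizing constant B(a, b): recognizing the kernel lets us skip the integral.
a <- 3; b <- 5
kernel <- function(t) t^(a - 1) * (1 - t)^(b - 1)
integrate(kernel, lower = 0, upper = 1)$value   # numerical value of the integral
beta(a, b)                                      # B(a, b); matches the integral
```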
--

- For example, the normal density is given by
.block[
.small[
`$$p(y|\mu,\sigma^2) = \dfrac{1}{\sqrt{2\pi\sigma^2}}e^{-\dfrac{(y-\mu)^2}{2\sigma^2}},$$`
]
]

  but the kernel is just
.block[
.small[
`$$p(y|\mu,\sigma^2) \propto e^{-\dfrac{(y-\mu)^2}{2\sigma^2}}.$$`
]
]

---

## Bernoulli data

- Back to our example: suppose `\(\theta \in (0,1)\)` is the population proportion of individuals with diabetes in the US.

--

- Suppose we take a sample of `\(n\)` individuals and record whether or not they have diabetes (as binary: 0,1).

--

- Then we can use the Bernoulli distribution as the sampling distribution. Also, we already established that we can use a beta prior for `\(\theta\)`.

---

## Bernoulli data

- Generally, it turns out that if
  + `\(f(y_i; \theta): y_i \overset{iid}{\sim} \textrm{Bernoulli}(\theta)\)` for `\(i = 1,\ldots,n\)`, and
  + `\(p(\theta): \theta \sim \textrm{Beta}(a,b)\)`,

  then the posterior distribution is also a beta distribution.

--

- <div class="question"> Can we derive the posterior distribution and its parameters? Let's do some work on the board! </div>

--

- Updating a beta prior with a Bernoulli likelihood leads to a beta posterior - we have conjugacy!

--

- Specifically, we have
.block[
.small[
`$$p(\theta | \{y_i\}): \theta | \{y_i\} \sim \textrm{Beta}(a+\sum y_i,b+n-\sum y_i).$$`
]
]

--

- This is the .hlight[beta-Bernoulli model]. More generally, this is just the .hlight[beta-binomial model].

---

## Beta-binomial in more detail

- Suppose the likelihood of the data is
.block[
.small[
`$$L(y; \theta) = {n \choose y} \theta^y(1-\theta)^{n-y}.$$`
]
]

--

- Suppose also that we have a `\(\textrm{Beta}(a,b)\)` prior on the probability `\(\theta\)`.

--

- Then the posterior density has the beta form
.block[
.small[
`$$\pi(\theta | y) = \textrm{Beta}(a+y,b+n-y).$$`
]
]

--

- The posterior has expectation
.block[
.small[
`$$\mathbb{E}(\theta | y) = \dfrac{a+y}{a+b+n} = \dfrac{a+b}{a+b+n} \times \textrm{prior mean} + \dfrac{n}{a+b+n} \times \textrm{sample mean}.$$`
]
]

--

- For this specification, `\(a\)` and `\(b\)` are sometimes interpreted as "prior data", with `\(a\)` interpreted as the prior number of 1's, `\(b\)` as the prior number of 0's, and `\(a + b\)` as the prior sample size.

--

- As we get more and more data, the majority of our information about `\(\theta\)` comes from the data as opposed to the prior.

---

## Binomial data

- For example, suppose you want to find the Bayesian estimate of the probability `\(\theta\)` that a coin comes up heads.

--

- Before you see the data, you express your uncertainty about `\(\theta\)` through the prior `\(p(\theta) = \textrm{Beta}(2,2)\)`.

--

- Now suppose you observe 10 tosses, of which only 1 was heads.

--

- Then, the posterior density `\(p(\theta \,|\, y, n)\)` is `\(\mbox{Beta}(3, 11)\)`.

---

## Binomial data

- Recall that the mean of `\(\mbox{Beta}(a,b)\)` is `\(a/(a+b)\)`.

--

- That means, before you saw the data, you thought the mean for `\(\theta\)` was 2/(2+2) = 0.5.

--

- However, after seeing the data, you believe it is 3/(3+11) = 0.214.

--

- The variance of `\(\mbox{Beta}(a,b)\)` is `\(ab/[(a+b)^2(a+b+1)]\)`.

--

- So before you saw the data, your uncertainty about `\(\theta\)` (i.e., your standard deviation) was `\(\sqrt{4/[4^2 \times 5]} = 0.22\)`.

--

- However, after seeing 1 head in 10 tosses, your uncertainty is 0.106.

--

- Clearly, as the number of tosses goes to infinity, your uncertainty goes to zero.
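---

## Binomial data

- The numbers on the previous two slides are easy to reproduce (a minimal sketch in base R; the helper function `beta_sd` is just for convenience and not part of any package):

```r
# Beta(2,2) prior, 1 head in 10 tosses -> Beta(3, 11) posterior
a <- 2; b <- 2; n <- 10; y <- 1
a_post <- a + y; b_post <- b + n - y
c(prior_mean = a / (a + b), post_mean = a_post / (a_post + b_post))   # 0.5 and 0.214

beta_sd <- function(a, b) sqrt(a * b / ((a + b)^2 * (a + b + 1)))
c(prior_sd = beta_sd(a, b), post_sd = beta_sd(a_post, b_post))        # 0.22 and 0.106
```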
---

## Operationalizing data analysis

We will explore another example soon, but first, how should we approach data analysis in general?

--

- .hlight[Step 1]. State the question.

--

- .hlight[Step 2]. Collect the data.

--

- .hlight[Step 3]. Explore the data.

--

- .hlight[Step 4]. Formulate and state a modeling framework.

--

- .hlight[Step 5]. Check your models.

--

- .hlight[Step 6]. Answer the question.

---

## Example: rare events

- .hlight[Step 1]. State the question:
  + What is the prevalence of an infectious disease in a small city?
  + Why? High prevalence means more public health precautions are recommended.

--

- .hlight[Step 2]. Collect the data:
  + Suppose you collect a small random sample of 20 individuals.

--

- .hlight[Step 3]. Explore the data:
  + Let `\(Y\)` denote the unknown number of infected individuals in the sample.

---

## Example: rare events

- .hlight[Step 4]. Formulate and state a modeling framework:
  + Parameter of interest: `\(\theta\)` is the fraction of infected individuals in the city.
  + Sampling model: a reasonable model for `\(Y\)` can be `\(\textrm{Bin}(20,\theta)\)`.

<img src="img/binomial_histograms.png" width="105" height="370px" style="display: block; margin: auto;" />

---

## Example: rare events

- .hlight[Step 4]. Formulate and state a modeling framework:
  + Prior specification: information from previous studies suggests the infection rate in “comparable cities” ranges from 0.05 to 0.20, with an average of 0.10. So maybe a standard deviation of roughly 0.05?
  + What is a good prior? The **expected value** of `\(\theta\)` close to 0.10 and the **standard deviation** close to 0.05.
  + Possible option: `\(\mbox{Beta}(3.5,31.5)\)` or maybe even `\(\mbox{Beta}(3,32)\)`?

<img src="02-one-parameter-models-I_files/figure-html/unnamed-chunk-3-1.png" style="display: block; margin: auto;" />

---

## Example: rare events

- .hlight[Step 4]. Formulate and state a modeling framework:
  + Under `\(\mbox{Beta}(3,32)\)`, `\(\Pr(\theta < 0.1) \approx 0.67\)`.
  + Posterior distribution for the model: `\((\theta | Y=y) = \textrm{Beta}(a+y,b+n-y)\)`.
  + Suppose `\(Y=0\)`. Then, `\((\theta | Y=y) = \textrm{Beta}(3,32+20)\)`.

<img src="02-one-parameter-models-I_files/figure-html/unnamed-chunk-4-1.png" style="display: block; margin: auto;" />

---

## Example: rare events

- .hlight[Step 5]. Check your models:
  + Compare the performance of the posterior mean and the posterior probability that `\(\theta < 0.1\)`.
  + Under `\(\mbox{Beta}(3,52)\)`,
    - `\(\Pr(\theta < 0.1 | Y=y) \approx 0.92\)`. More confidence in low values of `\(\theta\)`.
    - For `\(\mathbb{E}(\theta | Y=y)\)`, we have
.block[
.small[
$$
`\begin{split}
\mathbb{E}(\theta | y) & = \dfrac{a+y}{a+b+n} = \dfrac{3}{55} \approx 0.055.\\
\end{split}`
$$
]
]
    - Recall that the prior mean is `\(a/(a+b) \approx 0.09\)`. Thus, we can see how the prior mean contributes to the posterior mean:
.block[
.small[
$$
`\begin{split}
\mathbb{E}(\theta | y) & = \dfrac{a+b}{a+b+n} \times \textrm{prior mean} + \dfrac{n}{a+b+n} \times \textrm{sample mean}\\
& = \dfrac{a+b}{a+b+n} \times \dfrac{a}{a+b} + \dfrac{n}{a+b+n} \times \dfrac{y}{n}\\
& = \dfrac{35}{55} \times \dfrac{3}{35} + \dfrac{20}{55} \times \dfrac{0}{20} = \dfrac{3}{55} \approx 0.055.\\
\end{split}`
$$
]
]
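--

- These summaries can be reproduced directly (a minimal sketch in base R for this example; `a`, `b`, `n`, and `y` take the values used above):

```r
# Beta(3, 32) prior, y = 0 infected out of n = 20 -> Beta(3, 52) posterior
a <- 3; b <- 32; n <- 20; y <- 0
a_post <- a + y; b_post <- b + n - y
c(post_mean = a_post / (a_post + b_post),             # about 0.055
  post_mode = (a_post - 1) / (a_post + b_post - 2),   # about 0.04
  prob_below_0.1 = pbeta(0.1, a_post, b_post))        # about 0.92
```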
---

## Example: rare events

- .hlight[Step 6]. Answer the question:
  + People with low prior expectations are generally at least `\(90\%\)` certain that the infection rate is below 0.10.
  + `\(\pi(\theta | Y)\)` is to the left of `\(\pi(\theta)\)` because the observation `\(Y=0\)` provides evidence of a low value of `\(\theta\)`.
  + `\(\pi(\theta | Y)\)` is more peaked than `\(\pi(\theta)\)` because it combines the prior information with the data, and so contains more information than `\(\pi(\theta)\)` alone.
  + The posterior expectation is 0.055.
  + The posterior mode is 0.04.
    - Note: for `\(\mbox{Beta}(a,b)\)` with `\(a, b > 1\)`, the mode is `\((a-1)/(a+b-2)\)`.
  + The posterior probability that `\(\theta < 0.1\)` is 0.92.
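---

## Example: rare events

- Aside: one way to arrive at the `\(\mbox{Beta}(3.5,31.5)\)` suggestion in Step 4 is moment matching, i.e., choosing `\((a, b)\)` so that the prior mean is 0.10 and the prior standard deviation is 0.05. A minimal sketch in base R (variable names are illustrative):

```r
# Solve for (a, b) from the stated prior mean m and prior sd s:
# mean = a / (a + b) = m  and  var = m * (1 - m) / (a + b + 1) = s^2
m <- 0.10; s <- 0.05
ab <- m * (1 - m) / s^2 - 1       # this is a + b
a <- m * ab; b <- (1 - m) * ab
c(a = a, b = b)                   # 3.5 and 31.5
c(mean = a / (a + b),
  sd = sqrt(a * b / ((a + b)^2 * (a + b + 1))))   # checks: 0.10 and 0.05
```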