class: center, middle, inverse, title-slide

# Rejection sampling; introduction to the normal model
### Dr. Olanrewaju Michael Akande
### Jan 29, 2020

---

<!-- ## Announcements -->

## Outline

- Sampling methods
  + Rejection sampling
  + Importance sampling

- The univariate normal model
  + Motivating example
  + The normal distribution
      - Properties
      - Likelihood
  + Inference for mean, conditional on variance
  + Conjugacy
  + Noninformative and improper priors

---

class: center, middle

# Sampling methods

---

## Rejection sampling

- Setup:
  + `\(p(\theta)\)` is some density we are interested in sampling from;

--

  + `\(p(\theta)\)` is tough to sample from, but we are able to evaluate `\(p(\theta)\)` as a function at any point; and

--

  + `\(g(\theta)\)` is some .hlight[proposal distribution] or .hlight[importance sampling distribution] that is easier to sample from.

--

- Two key requirements:
  + `\(g(\theta)\)` is easy to sample from; and

--

  + `\(g(\theta)\)` is easy to evaluate at any point, as is the case for `\(p(\theta)\)`.

--

- Usually, the context is one in which `\(g(\theta)\)` has been derived as an analytic approximation to `\(p(\theta)\)`, and the closer the approximation, the more accurate the resulting Monte Carlo analysis will be.

---

## Rejection sampling

- Procedure:

  1. Define `\(w(\theta) = p(\theta)/g(\theta)\)`.

--

  2. Assume that `\(w(\theta) = p(\theta)/g(\theta) \leq M\)` for some constant `\(M\)`. If `\(g(\theta)\)` is a good approximation to `\(p(\theta)\)`, then `\(M\)` should not be too far from 1.

--

  3. Generate a _candidate_ value `\(\theta \sim g(\theta)\)` and **accept** it with probability `\(w(\theta)/M\)`: if accepted, `\(\theta\)` is a draw from `\(p(\theta)\)`; otherwise **reject** and try again.

--

  Equivalently, generate `\(u \sim U(0,1)\)` independently of `\(\theta\)`. Then **accept** `\(\theta\)` as a draw from `\(p(\theta)\)` if, and only if, `\(u < w(\theta)/M\)`.

--

- For those interested, the proof that all accepted `\(\theta\)` values are indeed from `\(p(\theta)\)` is on the next slide. We will not spend time on it.

--

- Clearly, we need `\(M\)` for this to work. However, in the case of truncated densities, we can actually compute `\(M\)`.

---

## Proof for simple accept/reject

- We need to show that all accepted `\(\theta\)` values are indeed from `\(p(\theta)\)`. Equivalently, show that `\(f(\theta | u < w(\theta)/M) = p(\theta)\)`.

- By Bayes' theorem,
.block[
.small[
`$$f(\theta | u < w(\theta)/M) = \dfrac{\Pr(\theta \textrm{ and } u < w(\theta)/M)}{\Pr(u < w(\theta)/M)} = \dfrac{\Pr(u < w(\theta)/M \ | \ \theta) g(\theta)}{\Pr(u < w(\theta)/M)}.$$`
]
]

- But,
  + `\(\Pr(u < w(\theta)/M \ | \ \theta) = w(\theta)/M\)` since `\(u \sim U(0,1)\)`, and
  + .block[
.small[
$$
`\begin{split}
\Pr(u < w(\theta)/M) & = \int \Pr(u < w(\theta)/M \ | \ \theta) g(\theta) d\theta\\
& = \int w(\theta)/M \ g(\theta) d\theta = 1/M \int w(\theta) g(\theta) d\theta = 1/M \int p(\theta) d\theta = 1/M.
\end{split}`
$$
]
]

- Therefore,
.block[
.small[
`$$f(\theta | u < w(\theta)/M) = \dfrac{\Pr(u < w(\theta)/M \ | \ \theta) g(\theta)}{\Pr(u < w(\theta)/M)} = \dfrac{w(\theta)/M \ g(\theta)}{1/M} = w(\theta) g(\theta) = p(\theta).$$`
]
]

---

## Rejection sampling for truncated densities

- The inverse CDF method works well for truncated densities, but what happens when we cannot write down the truncated CDF?

--

- Suppose we want to sample from `\(f_{[a,b]}(\theta)\)`, that is, a known pdf `\(f(\theta)\)` truncated to `\([a,b]\)`.
  + Recall that `\(f_{[a,b]}(\theta) \propto f(\theta)\mathbb{1}[\theta \in [a,b]]\)`.
  + Using the notation for rejection sampling, `\(p(\theta) = f_{[a,b]}(\theta)\)` and `\(g(\theta) = f(\theta)\)`.

--

  + Set `\(1/M = \int^b_a f(\theta^\star)\textrm{d}\theta^\star\)`, so that `\(M\)` is the normalizing constant of the truncated density.

--

  + Then, `\(w(\theta) = p(\theta)/g(\theta) = M \mathbb{1}[\theta \in [a,b]] \leq M\)` as required.

---

## Rejection sampling for truncated densities

- We can then use the procedure on page 5 to generate the required samples.

--

- Specifically,

--

  + For each `\(i=1,\ldots,m\)`, generate `\(\theta_i \sim f\)`. If `\(\theta_i \in [a,b]\)`, accept `\(\theta_i\)`; otherwise **reject** and try again.

--

  + It is easy to show that this is equivalent to accepting each `\(\theta_i\)` with probability `\(w(\theta_i)/M\)`.

---

## Example

```r
#Simple code for using rejection sampling to generate m samples
#from the Beta[10,10] density truncated to (0.35,0.6).
set.seed(12345)
#NOTE: there are more efficient ways to write this code!

#set sample size and create vector to store samples
m <- 10000; THETA <- rep(0,m)

#keep track of rejects
TotalRejects <- 0; Rejections <- NULL

#now the 'for loop'
for(i in 1:m){
  t <- 0
  while(t < 1){
    theta <- rbeta(1,10,10)
    if(theta > 0.35 & theta < 0.6){
      THETA[i] <- theta
      t <- 1
    } else {
      TotalRejects <- TotalRejects + 1
      Rejections <- rbind(Rejections,theta)
    }
  }
}

#How many rejections in all, to generate m=10000 samples?
TotalRejects
```

```
## [1] 3740
```

Acceptance rate `\(= 10000/(10000 + 3740) \approx 0.728\)`.

---

## Example

How does our sample compare to the true truncated density? `\(m = 100\)`

<img src="06-normal-model-I_files/figure-html/unnamed-chunk-3-1.png" style="display: block; margin: auto;" />

---

## Example

How does our sample compare to the true truncated density? `\(m = 1000\)`

<img src="06-normal-model-I_files/figure-html/unnamed-chunk-4-1.png" style="display: block; margin: auto;" />

---

## Example

How does our sample compare to the true truncated density? `\(m = 10000\)`

<img src="06-normal-model-I_files/figure-html/unnamed-chunk-5-1.png" style="display: block; margin: auto;" />

---

## Example

How does our sample compare to the true truncated density? `\(m = 100000\)`

<img src="06-normal-model-I_files/figure-html/unnamed-chunk-6-1.png" style="display: block; margin: auto;" />

---

## Comments

- This is clearly less efficient than the inverse CDF method, which we already know how to use for this particular problem.

--

- When you can write down the truncated CDF, use the inverse CDF method instead.

--

- When you cannot, rejection sampling is a possible alternative, as are many other sampling methods that we will not cover in this course.

--

- More generally, rejection sampling can still be very useful.

--

- Importance sampling is another related sampling method, but we will not spend time on it. If you are interested, take a look at the next few slides. If not, feel free to skip to the normal model.

---

## Importance sampling

- .hlight[Importance sampling] is actually one of the first steps into Monte Carlo analysis, in which simulated values from one distribution are used to explore another.

--

- Simulation from the "wrong distribution" can be incredibly useful, as we have seen with rejection sampling and will see again later in this course.
--

- It is not used as often anymore, but it is still of practical interest in
  + problems that are fairly small in terms of dimension,

--

  + in which the density of the distribution of interest can be evaluated easily but is difficult to sample from directly, and

--

  + in which it is relatively easy to identify and simulate from distributions that approximate the distribution of interest.

--

- Importance sampling and rejection sampling use the same importance ratio ideas, but the latter leads to exact corrections, and so exact samples from `\(p(\theta)\)`.

---

## Importance sampling

- Interest lies in expectations of the form (rather than the actual samples)
.block[
.small[
`$$H = \int h(\theta) p(\theta) d\theta.$$`
]
]

- Write
.block[
.small[
`$$H = \int h(\theta) w(\theta) g(\theta) d\theta \ \ \ \textrm{with} \ \ \ w(\theta) = p(\theta)/g(\theta),$$`
]
]
that is, `\(\mathbb{E}[h(\theta)]\)` under `\(p(\theta)\)` is just `\(\mathbb{E}[h(\theta) w(\theta)]\)` under `\(g(\theta)\)`.

- Using direct Monte Carlo integration,
.block[
.small[
`$$\bar{h} = \dfrac{1}{m} \sum_{i=1}^m w(\theta_i) h(\theta_i),$$`
]
]
where `\(\theta_1, \ldots, \theta_m \overset{ind}{\sim} g(\theta)\)`. We are sampling from the "wrong" distribution.

---

## Importance sampling

- The measure of "how wrong" we are at each simulated `\(\theta_i\)` value is the .hlight[importance weight]
.block[
.small[
`$$w(\theta_i) = p(\theta_i)/g(\theta_i).$$`
]
]
These ratios weight the sample estimates `\(h(\theta_i)\)` to "correct" for the fact that we sampled the wrong distribution.

- See [Lopes & Gamerman (Ch 3.4)](https://www.amazon.com/Markov-Chain-Monte-Carlo-Statistical/dp/1584885874) and [Robert and Casella (Ch. 3.3)](https://www.amazon.com/Monte-Statistical-Methods-Springer-Statistics/dp/1441919392) for discussion of convergence and optimality.

--

- Clearly, the closer `\(g\)` is to `\(p\)`, the better the results, just as we had with rejection sampling.

---

## Importance sampling

- Key considerations:
  + The MC estimate `\(\bar{h}\)` has expectation `\(H\)` and is generally almost surely convergent to `\(H\)` (under certain conditions, of course, but we will not dive into those).

--

  + `\(\mathbb{V}[\bar{h}]\)` is generally finite when `\(w(\theta) = p(\theta)/g(\theta)\)` is bounded and decays rapidly in the tails of `\(p(\theta)\)`.

--

  + Thus, superior MC approximations are achieved for choices of `\(g(\theta)\)` whose tails dominate those of the target `\(p(\theta)\)`.

--

  + That is, importance sampling distributions should be chosen to have tails at least as fat as the target (think normal distribution vs t-distribution).

--

  + We obviously require the support of `\(g(\theta)\)` to be the same as, or to contain, that of `\(p(\theta)\)`.

--

- These considerations clearly apply to rejection sampling too.

---

## Importance sampling

- Problems in which `\(w(\theta) = p(\theta)/g(\theta)\)` can be computed exactly are actually rare.

--

- As you will see when we move away from conjugate distributions, we usually only know `\(p(\theta)\)` up to a normalizing constant.

--

- When this is the case, simply "re-normalize" the importance weights, so that
.block[
.small[
`$$\bar{h} = \sum_{i=1}^m w_i h(\theta_i) \ \ \ \textrm{where} \ \ \ w_i = \dfrac{w(\theta_i)}{\sum^m_{j=1}w(\theta_j)}.$$`
]
]

--

- Generally, in importance sampling, weights that are close to uniform are desirable, and very unevenly distributed weights are not.
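---

## Importance sampling: a small example in R

- To make the idea concrete, here is a minimal sketch (a toy example chosen purely for illustration, not part of the analyses above): the target `\(p(\theta)\)` is standard normal, the proposal `\(g(\theta)\)` is a `\(t_3\)` distribution whose tails are heavier than the target's, and we estimate `\(H = \mathbb{E}[\theta^2]\)`, which we know equals 1.

```r
#importance sampling sketch: target p = N(0,1), proposal g = t with 3 df
set.seed(12345)
m <- 10000
theta <- rt(m, df = 3)                  #draws from the "wrong" distribution g
w <- dnorm(theta)/dt(theta, df = 3)     #importance weights w(theta) = p(theta)/g(theta)
h_direct <- mean(w*theta^2)             #direct estimate: (1/m) * sum of w(theta_i) h(theta_i)
w_norm <- w/sum(w)                      #re-normalized weights
h_selfnorm <- sum(w_norm*theta^2)       #self-normalized estimate
c(h_direct, h_selfnorm)                 #both should be close to the true value of 1
```

- Both estimates should land close to 1; the self-normalized version is the one to reach for when `\(p(\theta)\)` is only known up to a normalizing constant.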
---

class: center, middle

# Introduction to the univariate normal model

---

## Motivating example: job training

- In the 1970s, researchers in the U.S. ran several randomized experiments intended to evaluate public policy programs.

--

- One of the most famous experiments is the National Supported Work (NSW) Demonstration, in which researchers wanted to assess whether or not job training for disadvantaged workers had an effect on their wages.

--

- Eligible workers were randomly assigned either to receive job training or not to receive job training.

--

- Candidates eligible for the NSW were randomized into the program between March 1975 and July 1977.

--

- For more details, read Lalonde, R. J. (1986) and Dehejia, R., and Wahba, S. (1999).

---

## Motivating example: job training

- Setup:
  + **Pre-training wages**: real annual earnings in 1974 before training.

--

  + Two groups: some participants received job training and the rest did not.

--

  + **Post-training wages**: real annual earnings in 1978 upon completion of training.

--

- <div class="question">
Question of interest: is there evidence that workers who receive job training tend to earn higher wages than workers who do not receive job training?
</div>

--

- The original study really is a causal inference setup, but this example uses only a subset of the data.

--

- The data is richer than what we will use it for (i.e., there are covariates we could control for), but we will focus only on the pre- and post-training wages for the two groups.

---

## Job training: the data

- Data:
  + No training group (N): sample size `\(n_N = 429\)`.
  + Training group (T): sample size `\(n_T = 185\)`.
  + **Diff wages**: Post-training wages minus Pre-training wages.

--

- Summary statistics for change in annual earnings:
  + `\(\bar{y}_N = 1364.93\)`; `\(s_N = 7460.05\)`.
  + `\(\bar{y}_T = 4253.57\)`; `\(s_T = 8926.99\)`.

--

- Wages/income are well known to be approximately normally distributed. Let's look at the distribution of "change in annual earnings" for the two groups.

---

## Job training: the data

<img src="06-normal-model-I_files/figure-html/unnamed-chunk-7-1.png" style="display: block; margin: auto;" />

Not completely normal, but not too far off either. Lots of overlap between the two groups.

---

## Model for changes in earnings

- `$$y_i^{(T)} \sim \mathcal{N}(\mu_T,\sigma^2_T)$$`
  `$$y_i^{(N)} \sim \mathcal{N}(\mu_N,\sigma^2_N)$$`

--

- We want the posterior distribution of `\(\mu_T - \mu_N\)`. Specifically, we would like to compute `\(\Pr(\mu_T > \mu_N | Y_T, Y_N)\)` or, equivalently, `\(\Pr(\mu_T - \mu_N > 0 | Y_T, Y_N)\)`.

--

- Inference for `\(\mu_T - \mu_N\)` can be complicated in the frequentist paradigm when `\(\sigma^2_T \neq \sigma^2_N\)`.

--

- There, we use approximate `\(t\)`-distributions based on the Welch-Satterthwaite degrees of freedom.

--

- This is trivial with Bayesian inference.

--

- By the way, it is also trivial to compute `\(\Pr(\sigma^2_T > \sigma^2_N | Y_T, Y_N)\)` with Bayesian inference, which we will do later.

--

- How do we do posterior inference for such normal models?

---

## Another example: Pygmalion study

- The Pygmalion effect is a phenomenon in which expectation affects performance.

--

- <div class="question">
Question of interest: do teachers' expectations impact academic development of children?
</div>

--

- Setup:
  + Researchers gave an IQ test to elementary school children.

--

  + They randomly picked six children and told teachers that the test predicts them to **have high potential for accelerated growth**.
--

  + They randomly picked six other children and told teachers that the test predicts them to have **NO potential for growth**.

--

  + At the end of the school year, they gave the IQ test again to all students.

--

  + They recorded the change in IQ scores of each student.

---

## Another example: Pygmalion study

- Data:
  + Accelerated group (A): 20, 10, 19, 15, 9, 18.
  + No growth group (N): 3, 2, 6, 10, 11, 5.

--

- Summary statistics:
  + `\(\bar{y}_A = 15.2\)`; `\(s_A = 4.71\)`.
  + `\(\bar{y}_N = 6.2\)`; `\(s_N = 3.65\)`.

--

- IQ test scores are also well known to be approximately normally distributed.

--

- We can't really check this assumption with only `\(n = 6\)` observations per group.

---

## Model for changes in scores

- `$$y_i^{(A)} \sim \mathcal{N}(\mu_A,\sigma^2_A)$$`
  `$$y_i^{(N)} \sim \mathcal{N}(\mu_N,\sigma^2_N)$$`

--

- Once again, we want the posterior distribution of `\(\mu_A - \mu_N\)`.

--

- As before, we would like to compute `\(\Pr(\mu_A > \mu_N | Y_A, Y_N) \equiv \Pr(\mu_A - \mu_N > 0 | Y_A, Y_N)\)`.

--

- We would also like to compute `\(\Pr(\sigma^2_A > \sigma^2_N | Y_A, Y_N)\)`.

--

- To answer both questions, let's learn the Bayesian normal model.

---

## Normal distribution

- A random variable `\(Y\)` has a .hlight[normal distribution], written as `\(Y \sim \mathcal{N}(\mu, \sigma^2)\)`, if the pdf is
.block[
.small[
`$$p(y; \mu,\sigma^2) = \dfrac{1}{\sqrt{2\pi\sigma^2}} \ e^{-\dfrac{(y-\mu)^2}{2\sigma^2}}; \ \ \ y \in (-\infty,\infty), \ \ \mu \in (-\infty,\infty), \ \ \sigma \in (0,\infty),$$`
]
]
where `\(\mu\)` is the mean and `\(\sigma^2\)` is the variance.

--

- It is also common (and would often be more convenient for our purposes) to write the pdf in terms of the .hlight[precision], `\(\tau\)`, where `\(\tau = 1/\sigma^2\)`.

--

- In that case, the pdf is instead
.block[
.large[
`$$p(y; \mu,\tau) = \dfrac{1}{\sqrt{2\pi}} \tau^{\frac{1}{2}} \ e^{-\frac{1}{2} \tau (y-\mu)^2}; \ \ \ y \in (-\infty,\infty), \ \ \mu \in (-\infty,\infty), \ \ \tau \in (0,\infty).$$`
]
]

---

## Example normal distributions

<img src="06-normal-model-I_files/figure-html/unnamed-chunk-8-1.png" style="display: block; margin: auto;" />

---

## Comments on the normal distribution

- It is amazing how often real data are close to normally distributed.

--

- This is likely a consequence of the CLT: sums and means of independent random variables tend to be approximately normally distributed.

--

- It occurs under very general conditions.

--

- Examples of approximate normality:
  + Height, weight and other body measurements,
  + Income/wages/earnings,
  + Cumulative hydrologic measures such as annual rainfall or monthly river discharge,
  + Errors in astronomical or physical observations,
  + Many more examples!

---

## Properties of the normal distribution

- Mean, median and mode are all the same `\((\mu)\)`.

--

- Symmetric about the mean `\(\mu\)`.

--

- 95% of the density (95% probability) lies within `\(\pm1.96\sigma\)` (approximately two standard deviations) of the mean.

--

- If `\(X \sim \mathcal{N}(\theta,s^2)\)` and `\(Y \sim \mathcal{N}(\mu, \sigma^2)\)` with `\(X \perp Y\)`, then
.block[
.small[
`$$aX + bY \sim \mathcal{N}(a\theta + b\mu,a^2s^2+b^2\sigma^2),$$`
]
]
for constants `\(a\)` and `\(b\)`.

--

- When independence does not hold but `\(X\)` and `\(Y\)` are jointly normal, their sum is still normally distributed.

--

- However, when that is the case, we must account for the correlation in the variance term.

---

## Notes on normal distribution in R

- `rnorm`, `dnorm`, `pnorm`, `qnorm` in R take the mean and the **standard deviation** `\(\sigma\)` as arguments.
- If you use the variance `\(\sigma^2\)` instead, you will get wrong answers!

- For example, `rnorm(n,m,s)` generates `\(n\)` normal random variables with mean `\(m\)` and standard deviation `\(s\)`, that is, `\(\mathcal{N}(m,s^2)\)`.

---

## Normal model

- Suppose we have independent observations `\(Y = (y_1,y_2,\ldots,y_n)\)`, where each `\(y_i \sim \mathcal{N}(\mu, \sigma^2)\)` or `\(y_i \sim \mathcal{N}(\mu, \tau^{-1})\)`, with unknown parameters `\(\mu\)` and `\(\sigma^2\)` (or `\(\tau\)`).

--

- Then, the likelihood is
.block[
.small[
$$
`\begin{split}
L(Y; \mu,\sigma^2) & = \prod_{i=1}^n \dfrac{1}{\sqrt{2\pi}} \tau^{\frac{1}{2}} \ \textrm{exp}\left\{-\frac{1}{2} \tau (y_i-\mu)^2\right\}\\
& \propto \tau^{\frac{n}{2}} \ \textrm{exp}\left\{-\frac{1}{2} \tau \sum_{i=1}^n (y_i-\mu)^2\right\}\\
& \propto \tau^{\frac{n}{2}} \ \textrm{exp}\left\{-\frac{1}{2} \tau \sum_{i=1}^n \left[ (y_i-\bar{y}) - (\mu - \bar{y}) \right]^2 \right\}\\
& \propto \tau^{\frac{n}{2}} \ \textrm{exp}\left\{-\frac{1}{2} \tau \left[ \sum_{i=1}^n (y_i-\bar{y})^2 + \sum_{i=1}^n(\mu - \bar{y})^2 \right] \right\}\\
& \propto \tau^{\frac{n}{2}} \ \textrm{exp}\left\{-\frac{1}{2} \tau \left[ \sum_{i=1}^n (y_i-\bar{y})^2 + n(\mu - \bar{y})^2 \right] \right\}\\
& \propto \tau^{\frac{n}{2}} \ \textrm{exp}\left\{-\frac{1}{2} \tau s^2(n-1) \right\} \ \textrm{exp}\left\{-\frac{1}{2} \tau n(\mu - \bar{y})^2 \right\}.\\
\end{split}`
$$
]
]
(Going from the third to the fourth line, the cross term `\(-2(\mu - \bar{y})\sum_{i=1}^n (y_i-\bar{y})\)` vanishes because `\(\sum_{i=1}^n (y_i-\bar{y}) = 0\)`.)

---

## Likelihood for normal model

- Likelihood:
.block[
.large[
`$$L(Y; \mu,\sigma^2) \propto \tau^{\frac{n}{2}} \ \textrm{exp}\left\{-\frac{1}{2} \tau s^2(n-1) \right\} \ \textrm{exp}\left\{-\frac{1}{2} \tau n(\mu - \bar{y})^2 \right\},$$`
]
]
where
  + `\(\bar{y} = \frac{1}{n}\sum_{i=1}^n y_i\)` is the sample mean; and
  + `\(s^2 = \sum_{i=1}^n (y_i-\bar{y})^2/(n-1)\)` is the sample variance.

- Sufficient statistics:
  + Sample mean `\(\bar{y}\)`; and
  + Sample sum of squares `\(SS = s^2(n-1) = \sum_{i=1}^n (y_i-\bar{y})^2\)`.

--

- MLEs:
  + `\(\hat{\mu} = \bar{y}\)`.
  + `\(\hat{\tau} = n/SS\)`, and `\(\hat{\sigma}^2 = SS/n\)`.

---

## Inference for mean, conditional on variance

- We can break down the inference problem for this two-parameter model into two one-parameter problems.

--

- We start by developing inference for `\(\mu\)` when `\(\sigma^2\)` is known. It turns out we can use a conjugate prior for `\(\pi(\mu|\sigma^2)\)`. We will get to unknown `\(\sigma^2\)` in the next class.

--

- For `\(\sigma^2\)` known, the normal likelihood further simplifies to
.block[
.small[
`$$\propto \ \textrm{exp}\left\{-\frac{1}{2} \tau n(\mu - \bar{y})^2 \right\},$$`
]
]
leaving out everything else that does not depend on `\(\mu\)`.

--

- For `\(\pi(\mu|\sigma^2)\)`, we consider `\(\mathcal{N}(\mu_0, \sigma_0^2)\)`, i.e., `\(\mathcal{N}(\mu_0, \tau_0^{-1})\)`, where `\(\tau_0^{-1} = \sigma_0^2\)`.

--

- Let's derive the posterior `\(\pi(\mu|Y,\sigma^2)\)`.
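---

## Aside: a quick numerical check in R

- Before doing the algebra, here is a small sketch (with made-up illustrative values: a `\(\mathcal{N}(0, 2^2)\)` prior, `\(\sigma = 1\)` known, and toy data) that evaluates the unnormalized posterior `\(\pi(\mu|\sigma^2) \, L(Y; \mu,\sigma^2)\)` on a grid of `\(\mu\)` values. The closed form we derive next should match this brute-force computation.

```r
#brute-force look at the posterior of mu with sigma^2 known (illustrative values only)
set.seed(12345)
sigma <- 1                                       #known standard deviation
y <- rnorm(20, mean = 1.5, sd = sigma)           #toy data
mu_grid <- seq(-2, 4, length.out = 2000)         #grid of candidate mu values
log_post <- dnorm(mu_grid, mean = 0, sd = 2, log = TRUE) +              #log prior
  sapply(mu_grid, function(m) sum(dnorm(y, m, sigma, log = TRUE)))      #log likelihood
post <- exp(log_post - max(log_post))            #unnormalized posterior
post <- post/sum(post*(mu_grid[2] - mu_grid[1])) #normalize on the grid
mu_grid[which.max(post)]                         #posterior mode; close to ybar here
```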
---

## Inference for mean, conditional on variance

- Posterior:
.block[
.small[
`$$\pi(\mu|Y,\sigma^2) \ \propto \ \pi(\mu|\sigma^2) L(Y; \mu,\sigma^2) \ \propto \ \textrm{exp}\left\{-\frac{1}{2} \tau_0 (\mu - \mu_0)^2 \right\}\ \textrm{exp}\left\{-\frac{1}{2} \tau n(\mu - \bar{y})^2 \right\}.$$`
]
]

--

- Expanding out the squared terms,
.block[
.small[
`$$\Rightarrow \pi(\mu|Y,\sigma^2) \ \propto \ \textrm{exp}\left\{-\frac{1}{2} \tau_0 (\mu^2 - 2\mu\mu_0 + \mu_0^2) \right\}\ \textrm{exp}\left\{-\frac{1}{2} \tau n(\mu^2 - 2\mu\bar{y} + \bar{y}^2) \right\}.$$`
]
]

--

- Ignoring terms not containing `\(\mu\)`,
.block[
.small[
$$
`\begin{split}
\Rightarrow \pi(\mu|Y,\sigma^2) \ & \propto \ \textrm{exp}\left\{-\frac{1}{2} \tau_0 (\mu^2 - 2\mu\mu_0) \right\}\ \textrm{exp}\left\{-\frac{1}{2} \tau n(\mu^2 - 2\mu\bar{y}) \right\}\\
& \propto \ \textrm{exp}\left\{-\frac{1}{2} \left[\tau_0 (\mu^2 - 2\mu\mu_0) + \tau n(\mu^2 - 2\mu\bar{y}) \right] \right\}\\
& \propto \ \textrm{exp}\left\{-\frac{1}{2} \left[ \mu^2(\tau n + \tau_0) - 2\mu(\tau n\bar{y} + \tau_0\mu_0) \right] \right\}.\\
\end{split}`
$$
]
]

---

## Inference for mean, conditional on variance

- Notice that `\(\left[ \mu^2(\tau n + \tau_0) - 2\mu(\tau n\bar{y} + \tau_0\mu_0) \right]\)` is essentially a quadratic expression of the form `\(a^\star\mu^2 - 2b^\star\mu + c^\star\)`, where

--

  + `\(a^\star = \tau n + \tau_0\)`,

--

  + `\(b^\star = \tau n\bar{y} + \tau_0\mu_0\)`, and

--

  + `\(c^\star\)` does not depend on `\(\mu\)`. _Note that `\(c^\star\)` contains some of the terms we ignored on the previous slide, but we need not know its exact form here._

--

- .hlight[Goal]: Turn this quadratic expression into one of the form `\(m(\mu - r)^2\)`, for some `\(m\)` and `\(r\)`, so that we have something that resembles the kernel of a normal density.

--

- <div class="question">
How? Complete the square!
</div>

---

## Inference for mean, conditional on variance

- Recall how to complete the square. Specifically, we can write
.block[
.small[
`$$a\mu^2 + b\mu + c$$`
]
]
as
.block[
.small[
`$$a(\mu + d)^2 + e,$$`
]
]
where
  + `\(d = \dfrac{b}{2a}\)`, and
  + `\(e = c - \dfrac{b^2}{4a}\)`.

---

## Inference for mean, conditional on variance

- Completing the square and rearranging (where `\(a^\star = \tau n + \tau_0\)` and `\(b^\star = \tau n\bar{y} + \tau_0\mu_0\)`),
.block[
.small[
$$
`\begin{split}
\Rightarrow \pi(\mu|Y,\sigma^2) \ & \propto \ \textrm{exp}\left\{-\frac{1}{2} \left[ a^\star\mu^2- 2b^\star\mu\right] \right\}\\
& = \ \textrm{exp}\left\{-\frac{1}{2} a^\star \left[ \mu^2- \dfrac{2b^\star}{a^\star}\mu\right] \right\}\\
& = \ \textrm{exp}\left\{-\frac{1}{2} a^\star \left[ \mu^2- \dfrac{2b^\star}{a^\star}\mu + \dfrac{(b^\star)^2}{(a^\star)^2} \right] + \dfrac{(b^\star)^2}{2a^\star} \right\}\\
& \propto \ \textrm{exp}\left\{-\frac{1}{2} a^\star \left[ \mu- \dfrac{b^\star}{a^\star} \right]^2 \right\},\\
\end{split}`
$$
]
]
which is the kernel of a normal distribution with
  + mean `\(\dfrac{b^\star}{a^\star}\)`, and
  + precision `\(a^\star\)` or variance `\((a^\star)^{-1}\)`.

---

## Posterior with precision terms

- In terms of precision, we have
.block[
.large[
`$$\mu|Y,\sigma^2 \sim \mathcal{N}(\mu_n, \tau_n^{-1})$$`
]
]
where
.block[
.small[
`$$\mu_n = \dfrac{b^\star}{a^\star} = \dfrac{\tau n\bar{y} + \tau_0\mu_0}{\tau n + \tau_0}$$`
]
]
and
.block[
.small[
`$$\tau_n = a^\star = \tau n + \tau_0.$$`
]
]

---

## Posterior with precision terms

- As mentioned before, Bayesians often prefer to talk about precision instead of variance.
--

- We have
  + `\(\tau\)` as the sampling precision (how close the `\(y_i\)`'s are to `\(\mu\)`).
  + `\(\tau_0\)` as the prior precision (how certain we are, a priori, that `\(\mu\)` is close to our prior guess `\(\mu_0\)`).
  + `\(\tau_n\)` as the posterior precision.

--

- As we saw on the previous slide, _the posterior precision equals the prior precision plus the data precision_.

--

- That is, we see that the posterior information is a sum of the prior information and the information from the data.

---

## Posterior with precision terms: combining information

- The posterior mean is a weighted average of the prior mean and the sample mean:
.block[
.small[
$$
`\begin{split}
\mu_n & = \dfrac{n\tau\bar{y} + \tau_0\mu_0}{\tau n + \tau_0}\\
& = \dfrac{\tau_0}{\tau_0 + \tau n} \mu_0 + \dfrac{n\tau}{\tau_0 + \tau n} \bar{y}
\end{split}`
$$
]
]

--

- Recall that `\(\sigma^2\)` (and thus `\(\tau\)`) is known for now.

--

- If we think of the prior mean as being based on `\(\kappa_0\)` prior observations from a similar population as `\(y_1,y_2,\ldots,y_n\)`, then we might set `\(\sigma_0^2 = \frac{\sigma^2}{\kappa_0}\)`, which implies `\(\tau_0 = \kappa_0 \tau\)`, and then the posterior mean is given by
.block[
.small[
$$
`\begin{split}
\mu_n & = \dfrac{\kappa_0}{\kappa_0 + n} \mu_0 + \dfrac{n}{\kappa_0 + n} \bar{y}.
\end{split}`
$$
]
]

---

## Posterior with variance terms

- In terms of variances, we have
.block[
.large[
`$$\mu|Y,\sigma^2 \sim \mathcal{N}(\mu_n, \sigma_n^2)$$`
]
]
where
.block[
.l.small[
`$$\mu_n = \dfrac{b^\star}{a^\star} = \dfrac{ \dfrac{n}{\sigma^2}\bar{y} + \dfrac{1}{\sigma^2_0} \mu_0}{\dfrac{n}{\sigma^2} + \dfrac{1}{\sigma^2_0}}$$`
]
]
and
.block[
.small[
`$$\sigma^2_n = \dfrac{1}{a^\star} = \dfrac{1}{\dfrac{n}{\sigma^2} + \dfrac{1}{\sigma^2_0}}.$$`
]
]

--

- It is still easy to see that we can re-express the posterior information as a sum of the prior information and the information from the data.
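---

## Aside: the update in R

- As a small illustration (not a full analysis: `\(\sigma^2\)` is treated as known and set to the sample variance, a placeholder until we handle unknown variance next class, and the prior `\(\mu_0 = 0\)` with `\(\tau_0 = 1/100^2\)` is a vague hypothetical choice), we can apply the conditional-on-variance update to the Pygmalion summary statistics and approximate `\(\Pr(\mu_A > \mu_N | Y_A, Y_N)\)` by Monte Carlo.

```r
#posterior for mu given sigma^2, using the formulas above
post_mu <- function(ybar, s, n, mu0 = 0, tau0 = 1/100^2){
  tau <- 1/s^2                              #sampling precision (treated as known here)
  tau_n <- n*tau + tau0                     #posterior precision
  mu_n <- (n*tau*ybar + tau0*mu0)/tau_n     #posterior mean
  c(mean = mu_n, sd = sqrt(1/tau_n))
}
A <- post_mu(15.2, 4.71, 6)                 #accelerated group summary statistics
N <- post_mu(6.2, 3.65, 6)                  #no growth group summary statistics
set.seed(12345)
mu_A <- rnorm(10000, A["mean"], A["sd"])    #posterior draws for mu_A
mu_N <- rnorm(10000, N["mean"], N["sd"])    #posterior draws for mu_N
mean(mu_A > mu_N)                           #Monte Carlo estimate of Pr(mu_A > mu_N | data)
```

- Because the two groups are independent, posterior draws of `\(\mu_A\)` and `\(\mu_N\)` can be compared directly; that is what makes this kind of question so easy to answer in the Bayesian setting.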