class: center, middle, inverse, title-slide

# Normal model cont’d

### Dr. Olanrewaju Michael Akande

### Jan 31, 2020

---

## Announcements

- Take "Participation Quiz II - Jan 31" on Sakai.

- Homework 3 now online.

--

## Outline

- Inference for mean, conditional on variance (cont'd)

- Noninformative and improper priors

- Joint inference for mean and variance

- Back to the examples

---

class: center, middle

# The univariate normal model (cont'd)

---

## Conditional inference for the mean (recap)

Normal data: `\(Y = (y_1,y_2,\ldots,y_n)\)`, where each
.block[
.small[
`$$y_i \sim \mathcal{N}(\mu, \sigma^2); \ \ \ \textrm{or} \ \ \ y_i \sim \mathcal{N}(\mu, \tau^{-1}).$$`
]
]

--

`\(+\)` Normal prior (when `\(\sigma^2\)`/ `\(\tau\)` is known):
.block[
.small[
`$$\mu|\sigma^2 \sim \mathcal{N}(\mu_0, \sigma_0^2); \ \ \ \textrm{or} \ \ \ \mu|\tau \sim \mathcal{N}(\mu_0, \tau_0^{-1}).$$`
]
]

--

`\(\Rightarrow\)` Normal posterior (in terms of precision):
.block[
.small[
`$$\mu|Y,\tau \sim \mathcal{N}(\mu_n, \tau_n^{-1}).$$`
]
]

--

where

+ `\(\mu_n = \dfrac{\tau n\bar{y} + \tau_0\mu_0}{\tau n + \tau_0}\)`; and

+ `\(\tau_n = \tau n + \tau_0\)`.

---

## Posterior with precision terms: combining information

- The posterior mean is a precision-weighted average of the prior mean and the sample mean:
.block[
.small[
$$
`\begin{split}
\mu_n & = \dfrac{n\tau\bar{y} + \tau_0\mu_0}{\tau n + \tau_0}\\
& = \dfrac{\tau_0}{\tau_0 + \tau n} \mu_0 + \dfrac{n\tau}{\tau_0 + \tau n} \bar{y}
\end{split}`
$$
]
]

--

- It is relatively easy to set `\(\mu_0\)` if we have a "prior" guess of the mean. What about `\(\tau_0\)`?

--

- If we think of the prior mean as being based on `\(\kappa_0\)` prior observations from a similar population as `\(Y\)`, then we might set `\(\sigma_0^2 = \dfrac{\sigma^2}{\kappa_0}\)`, which implies `\(\tau_0 = \kappa_0 \tau\)`.

--

- Then the posterior mean is given by
.block[
.small[
$$
`\begin{split}
\mu_n & = \dfrac{\kappa_0}{\kappa_0 + n} \mu_0 + \dfrac{n}{\kappa_0 + n} \bar{y}.
\end{split}`
$$
]
]

---

## Posterior with variance terms

- In terms of variances, we have
.block[
.large[
`$$\mu|Y,\sigma^2 \sim \mathcal{N}(\mu_n, \sigma_n^2)$$`
]
]

where
.block[
.small[
`$$\mu_n = \dfrac{ \dfrac{n}{\sigma^2}\bar{y} + \dfrac{1}{\sigma^2_0} \mu_0}{\dfrac{n}{\sigma^2} + \dfrac{1}{\sigma^2_0}} \ \ \ \textrm{and} \ \ \ \sigma^2_n = \dfrac{1}{\dfrac{n}{\sigma^2} + \dfrac{1}{\sigma^2_0}}.$$`
]
]

--

- It is easy to see that we can re-express the posterior information as a sum of the prior information and the information from the data.

---

## Posterior with variance terms

- If we once again set `\(\sigma_0^2 = \dfrac{\sigma^2}{\kappa_0}\)`, the posterior mean is still given by
.block[
.small[
$$
`\begin{split}
\mu_n & = \dfrac{\kappa_0}{\kappa_0 + n} \mu_0 + \dfrac{n}{\kappa_0 + n} \bar{y}.
\end{split}`
$$
]
]

--

- By the way, setting `\(\sigma_0^2 = \dfrac{\sigma^2}{\kappa_0}\)` `\(\Rightarrow\)` prior dependence between `\(\mu\)` and `\(\sigma^2\)`, whereas an arbitrary `\(\sigma_0^2\)`, independent of `\(\sigma^2\)`, `\(\Rightarrow\)` prior independence between `\(\mu\)` and `\(\sigma^2\)`.
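---

## Aside: the known-variance update in R

- A minimal sketch (not part of the original notes) of the update above, written in terms of precisions; the function name `post_mean_var` is illustrative. As a quick check, it is applied to the accelerated-group summaries from the Pygmalion example that appears later in these slides, treating the sample variance `\(s_A^2 = 22.17\)` as if it were the known `\(\sigma^2\)` and using a prior worth `\(\kappa_0 = 1\)` observation.

```r
# Known-variance normal-normal update: the posterior mean is a
# precision-weighted average of the prior mean and the sample mean.
post_mean_var <- function(ybar, n, sigma2, mu0, sigma02) {
  tau <- 1/sigma2; tau0 <- 1/sigma02    # data and prior precisions
  taun <- n*tau + tau0                  # posterior precision
  mun <- (n*tau*ybar + tau0*mu0)/taun   # precision-weighted average
  c(mu_n = mun, sigma2_n = 1/taun)
}

# illustrative numbers only: prior worth kappa0 = 1 observation,
# i.e., sigma02 = sigma2/kappa0
post_mean_var(ybar = 15.2, n = 6, sigma2 = 22.17, mu0 = 0, sigma02 = 22.17)
```

- The returned `mu_n` matches the `\(\mu_{nA} \approx 13.03\)` computed later under the mildly informative prior, since with `\(\sigma_0^2 = \sigma^2/\kappa_0\)` the posterior-mean weights involve only `\(\kappa_0\)` and `\(n\)`.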
---

## Noninformative and improper priors

- Generally, we must specify both `\(\mu_0\)` and `\(\tau_0\)` to do inference.

--

- When prior distributions have no population basis, that is, there is no justification of the prior as "prior data", prior distributions can be difficult to construct.

--

- To that end, there is often the desire to construct .hlight[noninformative priors], with the rationale being _"to let the data speak for themselves"_.

--

- For example, we could instead assume a uniform prior on `\(\mu\)` that is constant over the real line, i.e., `\(\pi(\mu) \propto 1\)` `\(\Rightarrow\)` all values on the real line are equally likely a priori.

--

- Clearly, this is not a valid pdf since it will not integrate to 1 over the real line. Such priors are known as .hlight[improper priors].

--

- An improper prior can still be very useful; we just need to ensure it results in a .hlight[proper posterior].

---

## Jeffreys' prior

- Question: is there a prior pdf (for a given model) that would be universally accepted as a noninformative prior?

--

- Laplace proposed the uniform distribution. This proposal lacks invariance under monotone transformations of the parameter.

--

- For example, a uniform prior on the binomial proportion parameter `\(\theta\)` is not the same as a uniform prior on the odds parameter `\(\phi = \dfrac{\theta}{1-\theta}\)`.

--

- A more acceptable approach was introduced by Jeffreys. For single-parameter models, the .hlight[Jeffreys' prior] defines a noninformative prior density for a parameter `\(\theta\)` as
.block[
.small[
`$$\pi(\theta) \propto \sqrt{\mathcal{I}(\theta)},$$`
]
]

where `\(\mathcal{I}(\theta)\)` is the .hlight[Fisher information] for `\(\theta\)`.

---

## Jeffreys' prior

- The Fisher information gives a way to measure the amount of information a random variable `\(Y\)` carries about an unknown parameter `\(\theta\)` of a distribution that describes `\(Y\)`.

--

- Formally, `\(\mathcal{I}(\theta)\)` is defined as
.block[
.small[
`$$\mathcal{I}(\theta) = \mathbb{E} \left[ \left( \dfrac{\partial}{\partial \theta} \textrm{log} f(y; \theta) \right)^2 \bigg| \theta \right] = \int_\mathcal{Y} \left( \dfrac{\partial}{\partial \theta} \textrm{log} f(y; \theta) \right)^2 f(y; \theta) dy.$$`
]
]

--

- Alternatively,
.block[
.small[
`$$\mathcal{I}(\theta) = - \mathbb{E} \left[ \dfrac{\partial^2}{\partial \theta^2} \textrm{log} f(y; \theta) \bigg| \theta \right] = - \int_\mathcal{Y} \left( \dfrac{\partial^2}{\partial \theta^2} \textrm{log} f(y; \theta) \right) f(y; \theta) dy.$$`
]
]

--

- It turns out that the Jeffreys' prior for `\(\mu\)` under the normal model, when `\(\sigma^2\)` is known, is
.block[
.small[
`$$\pi(\mu) \propto 1,$$`
]
]

the uniform prior over the real line. You should try to derive this.

---

## Inference for mean, conditional on variance using Jeffreys' prior

- Recall that for `\(\sigma^2\)` known, the normal likelihood simplifies to
.block[
.small[
`$$\propto \ \textrm{exp}\left\{-\dfrac{1}{2} \tau n(\mu - \bar{y})^2 \right\},$$`
]
]

ignoring everything else that does not depend on `\(\mu\)`.

- With the Jeffreys' prior `\(\pi(\mu) \propto 1\)`, can we derive the posterior distribution?

---

## Inference for mean, conditional on variance using Jeffreys' prior

- Posterior:
.block[
.small[
$$
`\begin{split}
\pi(\mu|Y,\sigma^2) \ & \propto \ \textrm{exp}\left\{-\dfrac{1}{2} \tau n(\mu - \bar{y})^2 \right\} \pi(\mu)\\
& \propto \ \textrm{exp}\left\{-\dfrac{1}{2} \tau n(\mu - \bar{y})^2 \right\}.\\
\end{split}`
$$
]
]

--

- This is the kernel of a normal distribution with

  + mean `\(\bar{y}\)`, and

  + precision `\(n\tau\)` or variance `\(\dfrac{1}{n\tau} = \dfrac{\sigma^2}{n}\)`.

--

- Written differently, we have `\(\mu|Y,\sigma^2 \sim \mathcal{N}(\bar{y},\dfrac{\sigma^2}{n})\)`.

--

- <div class="question"> This should look familiar to you. Does it? </div>

---

## Joint inference for mean and variance

- What happens when `\(\sigma^2\)`/ `\(\tau\)` is unknown?
We need a joint prior `\(\pi(\mu,\sigma^2)\)` for `\(\mu\)` and `\(\sigma^2\)`.

--

- We can write the joint prior distribution for the mean and variance as the product of a conditional and a marginal distribution, so we can take advantage of our work so far.

--

- That is,
.block[
.small[
`$$\pi(\mu,\sigma^2) \ = \ \pi(\mu | \sigma^2)\pi(\sigma^2).$$`
]
]

--

- For `\(\pi(\sigma^2)\)`, we need a distribution with support on `\((0,\infty)\)`. One such family is the gamma family, but this is NOT conjugate for the variance of a normal distribution.

--

- The gamma distribution is, however, conjugate for the precision `\(\tau\)`, and in that case, we say that `\(\sigma^2\)` has an .hlight[inverse-gamma] distribution.

---

## Conditional specification of prior

- Once again, suppose `\(Y = (y_1,y_2,\ldots,y_n)\)`, where each
.block[
.small[
`$$y_i \sim \mathcal{N}(\mu, \sigma^2); \ \ \ \textrm{or} \ \ \ y_i \sim \mathcal{N}(\mu, \tau^{-1}).$$`
]
]

--

- A conjugate joint prior is given by
.block[
.small[
$$
`\begin{split}
\tau = \dfrac{1}{\sigma^2} \ & \sim \textrm{Gamma}\left(\dfrac{\nu_0}{2}, \dfrac{\nu_0\sigma_0^2}{2}\right)\\
\mu|\sigma^2 & \sim \mathcal{N}\left(\mu_0, \dfrac{\sigma^2}{\kappa_0}\right) \ \ \ \textrm{or} \ \ \ \mu|\tau \sim \mathcal{N}\left(\mu_0, \dfrac{1}{\kappa_0 \tau}\right).\\
\end{split}`
$$
]
]

--

- This is often called a .hlight[normal-gamma] prior distribution.

--

- `\(\sigma_0^2\)` is the prior guess for `\(\sigma^2\)`, while `\(\nu_0\)` is often referred to as the "prior degrees of freedom", reflecting our degree of confidence in `\(\sigma_0^2\)`.

--

- We do not have conjugacy if we replace `\(\dfrac{\sigma^2}{\kappa_0}\)` in the normal prior with an arbitrary prior variance independent of `\(\sigma^2\)`. To do inference in that scenario, we need .hlight[Gibbs sampling] (to come next week!).

---

## Posterior for the mean given variance, under normal-gamma prior

- Based on the normal-gamma prior, we need `\(\pi(\mu|Y,\sigma^2)\)` and `\(\pi(\tau|Y)\)`.

--

- For `\(\pi(\mu|Y,\sigma^2)\)`, we can leverage our previous results. We have
.block[
.large[
`$$\mu|Y,\sigma^2 \sim \mathcal{N}(\mu_n, \dfrac{\sigma^2}{\kappa_n}) \ \ \ \textrm{or} \ \ \ \mu|Y,\tau \sim \mathcal{N}(\mu_n, \dfrac{1}{\kappa_n \tau})$$`
]
]

where
.block[
.small[
`$$\mu_n = \dfrac{ \dfrac{n}{\sigma^2}\bar{y} + \dfrac{\kappa_0}{\sigma^2} \mu_0}{\dfrac{n}{\sigma^2} + \dfrac{\kappa_0}{\sigma^2}} = \dfrac{\kappa_0 \mu_0 + n\bar{y}}{\kappa_n} = \dfrac{\kappa_0}{\kappa_n} \mu_0 + \dfrac{n}{\kappa_n} \bar{y} \ \ \ \textrm{and} \ \ \ \kappa_n = \kappa_0 + n.$$`
]
]

--

- `\(\mu_n\)` is simply the sample mean of the current and prior observations, and the posterior variance of `\(\mu\)` given `\(\sigma^2\)` is `\(\sigma^2\)` divided by the total number of observations (prior and current).

---

## Posterior derivation

- Some algebra is required to get the marginal posterior of `\(\tau\)`. Let's write the full joint posterior and go from there. We must keep some of the terms we discarded in the last lecture.
--

- Recall the likelihood
.block[
.large[
`$$L(Y; \mu,\tau) \propto \tau^{\dfrac{n}{2}} \ \textrm{exp}\left\{-\dfrac{1}{2} \tau s^2(n-1) \right\} \ \textrm{exp}\left\{-\dfrac{1}{2} \tau n(\mu - \bar{y})^2 \right\},$$`
]
]

--

- Now, `\(\mu|\tau \sim \mathcal{N}\left(\mu_0, \dfrac{1}{\kappa_0 \tau}\right)\)` `\(\Rightarrow\)`
.block[
.small[
`$$\pi(\mu|\tau) \propto \ \textrm{exp}\left\{-\dfrac{1}{2} \kappa_0 \tau (\mu - \mu_0)^2 \right\}.$$`
]
]

--

- and `\(\tau \sim \textrm{Ga}\left(\dfrac{\nu_0}{2}, \dfrac{\nu_0\sigma_0^2}{2}\right)\)` `\(\Rightarrow\)`
.block[
.small[
`$$\pi(\tau) \propto \tau^{\dfrac{\nu_0}{2}-1} \textrm{exp}\left\{-\dfrac{\tau\nu_0\sigma_0^2}{2}\right\}.$$`
]
]

---

## Posterior derivation

.block[
.small[
$$
`\begin{split}
\Rightarrow \pi(\mu,\tau | Y) \ & \boldsymbol{ \propto} \ \pi(\mu|\sigma^2) \times \pi(\tau) \times L(Y; \mu,\sigma^2)\\
\\
& \boldsymbol{ \propto} \underbrace{\textrm{exp}\left\{-\dfrac{1}{2} \kappa_0 \tau (\mu - \mu_0)^2 \right\}}_{\propto \ \pi(\mu|\sigma^2)} \times \underbrace{\tau^{\dfrac{\nu_0}{2}-1} \textrm{exp}\left\{-\dfrac{\tau\nu_0\sigma_0^2}{2}\right\}}_{\propto \ \pi(\tau)}\\
& \ \ \ \ \ \ \ \ \ \ \ \ \times \underbrace{\tau^{\dfrac{n}{2}} \ \textrm{exp}\left\{-\dfrac{1}{2} \tau s^2(n-1) \right\} \ \textrm{exp}\left\{-\dfrac{1}{2} \tau n(\mu - \bar{y})^2 \right\}}_{\propto \ L(Y; \mu,\sigma^2)}\\
& \boldsymbol{=} \underbrace{\textrm{exp}\left\{-\dfrac{1}{2} \kappa_0 \tau (\mu - \mu_0)^2 \right\} \ \textrm{exp}\left\{-\dfrac{1}{2} \tau n(\mu - \bar{y})^2 \right\}}_{\textrm{Terms involving } \mu}\\
& \ \ \ \ \ \ \ \ \ \ \ \ \times \underbrace{\tau^{\dfrac{\nu_0}{2}-1} \textrm{exp}\left\{-\dfrac{\tau\nu_0\sigma_0^2}{2}\right\} \tau^{\dfrac{n}{2}} \ \textrm{exp}\left\{-\dfrac{1}{2} \tau s^2(n-1) \right\}}_{\textrm{Terms NOT involving } \mu}\\
& \boldsymbol{=} \textrm{exp}\left\{-\dfrac{1}{2} \kappa_0 \tau (\mu^2 - 2\mu\mu_0 + \mu_0^2) \right\}\ \textrm{exp}\left\{-\dfrac{1}{2} \tau n(\mu^2 - 2\mu\bar{y} + \bar{y}^2) \right\}\\
& \ \ \ \ \ \ \ \ \ \ \ \ \times \tau^{\dfrac{\nu_0 + n}{2}-1} \textrm{exp}\left\{-\dfrac{\tau \left[\nu_0\sigma_0^2 + s^2(n-1) \right] }{2}\right\}\\
\end{split}`
$$
]
]

---

## Posterior derivation

.block[
.small[
$$
`\begin{split}
\pi(\mu,\tau | Y) \ & \boldsymbol{ \propto} \ \ \textrm{exp}\left\{-\dfrac{1}{2} \left[\kappa_0 \tau (\mu^2 - 2\mu\mu_0) + \tau n(\mu^2 - 2\mu\bar{y}) \right] \right\} \ \textrm{exp}\left\{-\dfrac{1}{2} \left[ \kappa_0 \tau \mu_0^2 + \tau n \bar{y}^2\right] \right\}\\
& \ \ \ \ \ \ \ \ \ \ \ \ \times \tau^{\dfrac{\nu_0 + n}{2}-1} \textrm{exp}\left\{-\dfrac{\tau \left[\nu_0\sigma_0^2 + s^2(n-1) \right] }{2}\right\}\\
\\
& \boldsymbol{=} \ \ \textrm{exp}\left\{-\dfrac{1}{2} \left[ \mu^2(n\tau + \kappa_0 \tau) - 2\mu(n\tau \bar{y} + \kappa_0 \tau\mu_0) \right] \right\} \ \textrm{exp}\left\{-\dfrac{1}{2} \left[ \kappa_0 \tau \mu_0^2 + \tau n \bar{y}^2\right] \right\}\\
& \ \ \ \ \ \ \ \ \ \ \ \ \times \tau^{\dfrac{\nu_0 + n}{2}-1} \textrm{exp}\left\{-\dfrac{\tau \left[\nu_0\sigma_0^2 + s^2(n-1) \right] }{2}\right\}\\
\end{split}`
$$
]
]

--

- Set `\(a^\star = (n\tau + \kappa_0 \tau)\)` and `\(b^\star = (n\tau \bar{y} + \kappa_0 \tau\mu_0)\)`, then complete the square for the first part like we did in the last lecture.
.block[
.small[
$$
`\begin{split}
\Rightarrow \ \pi(\mu,\tau | Y) \ & \boldsymbol{ \propto} \ \ \textrm{exp}\left\{-\dfrac{1}{2} \left[ \mu^2 a^\star - 2\mu b^\star \right] \right\} \ \textrm{exp}\left\{-\dfrac{1}{2} \left[ \kappa_0 \tau \mu_0^2 + \tau n \bar{y}^2\right] \right\}\\
& \ \ \ \ \ \ \ \ \ \ \ \ \times \tau^{\dfrac{\nu_0 + n}{2}-1} \textrm{exp}\left\{-\dfrac{\tau \left[\nu_0\sigma_0^2 + s^2(n-1) \right] }{2}\right\}\\
\end{split}`
$$
]
]

---

## Posterior derivation

.block[
.small[
$$
`\begin{split}
\Rightarrow \ \pi(\mu,\tau | Y) \ & \boldsymbol{ \propto} \ \ \textrm{exp}\left\{-\dfrac{1}{2} a^\star \left[ \mu- \dfrac{b^\star}{a^\star} \right]^2 + \dfrac{(b^\star)^2}{2a^\star} \right\} \ \textrm{exp}\left\{-\dfrac{1}{2} \left[ \kappa_0 \tau \mu_0^2 + \tau n \bar{y}^2\right] \right\}\\
& \ \ \ \ \ \ \ \ \ \ \ \ \times \tau^{\dfrac{\nu_0 + n}{2}-1} \textrm{exp}\left\{-\dfrac{\tau \left[\nu_0\sigma_0^2 + s^2(n-1) \right] }{2}\right\}\\
& \boldsymbol{=} \ \ \textrm{exp}\left\{-\dfrac{1}{2} a^\star \left[ \mu- \dfrac{b^\star}{a^\star} \right]^2 \right\} \ \textrm{exp}\left\{-\dfrac{1}{2} \left[ \kappa_0 \tau \mu_0^2 + \tau n \bar{y}^2 - \dfrac{(b^\star)^2}{a^\star}\right] \right\}\\
& \ \ \ \ \ \ \ \ \ \ \ \ \times \tau^{\dfrac{\nu_0 + n}{2}-1} \textrm{exp}\left\{-\dfrac{\tau \left[\nu_0\sigma_0^2 + s^2(n-1) \right] }{2}\right\}\\
& \boldsymbol{=} \ \ \textrm{exp}\left\{-\dfrac{1}{2} a^\star \left[ \mu- \dfrac{b^\star}{a^\star} \right]^2 \right\} \ \textrm{exp} \underbrace{\left\{-\dfrac{1}{2} \left[ \kappa_0 \tau \mu_0^2 + \tau n \bar{y}^2 - \dfrac{(n\tau \bar{y} + \kappa_0 \tau\mu_0)^2}{(n\tau + \kappa_0 \tau)}\right] \right\}}_{\textrm{Expand terms and recombine}} \\
& \ \ \ \ \ \ \ \ \ \ \ \ \times \tau^{\dfrac{\nu_0 + n}{2}-1} \textrm{exp}\left\{-\dfrac{\tau \left[\nu_0\sigma_0^2 + s^2(n-1) \right] }{2}\right\}\\
& \boldsymbol{=} \ \ \textrm{exp}\left\{-\dfrac{1}{2} a^\star \left[ \mu- \dfrac{b^\star}{a^\star} \right]^2 \right\} \ \textrm{exp}\left\{-\dfrac{1}{2} \left[\dfrac{n\kappa_0\tau^2 (\mu_0^2- 2\mu_0\bar{y} + \bar{y}^2)}{\tau(n + \kappa_0)}\right] \right\}\\
& \ \ \ \ \ \ \ \ \ \ \ \ \times \tau^{\dfrac{\nu_0 + n}{2}-1} \textrm{exp}\left\{-\dfrac{\tau \left[\nu_0\sigma_0^2 + s^2(n-1) \right] }{2}\right\}\\
\end{split}`
$$
]
]

---

## Posterior derivation

.block[
.small[
$$
`\begin{split}
\Rightarrow \ \pi(\mu,\tau | Y) \ & \boldsymbol{ \propto} \ \ \textrm{exp}\left\{-\dfrac{1}{2} a^\star \left[ \mu- \dfrac{b^\star}{a^\star} \right]^2 \right\} \ \textrm{exp}\left\{-\dfrac{\tau}{2} \left[\dfrac{n\kappa_0 (\bar{y} - \mu_0)^2}{(n + \kappa_0)}\right] \right\}\\
& \ \ \ \ \ \ \ \ \ \ \ \ \times \tau^{\dfrac{\nu_0 + n}{2}-1} \textrm{exp}\left\{-\dfrac{\tau \left[\nu_0\sigma_0^2 + s^2(n-1) \right] }{2}\right\}\\
\\
& \boldsymbol{=} \ \ \textrm{exp} \underbrace{\left\{-\dfrac{1}{2} a^\star \left[ \mu- \dfrac{b^\star}{a^\star} \right]^2 \right\}}_{\textrm{Substitute the values for } a^\star \textrm{ and } b^\star \textrm{ back}} \\
& \ \ \ \ \ \ \ \ \ \ \ \ \times \tau^{\dfrac{\nu_0 + n}{2}-1} \textrm{exp}\left\{-\dfrac{\tau \left[\nu_0\sigma_0^2 + s^2(n-1) \right] }{2}\right\} \textrm{exp}\left\{-\dfrac{\tau}{2} \left[\dfrac{n\kappa_0 (\bar{y} - \mu_0)^2}{(n + \kappa_0)}\right] \right\}\\
\\
& \boldsymbol{=} \ \ \underbrace{\textrm{exp}\left\{-\dfrac{1}{2} (n\tau + \kappa_0 \tau) \left[ \mu - \dfrac{(n\tau \bar{y} + \kappa_0 \tau\mu_0)}{(n\tau + \kappa_0 \tau)} \right]^2 \right\}}_{\textrm{Normal Kernel}}\\
& \ \ \ \ \ \ \ \ \ \ \ \ \times \underbrace{\tau^{\dfrac{\nu_0 + n}{2}-1} \textrm{exp}\left\{-\dfrac{\tau}{2} \left[\nu_0\sigma_0^2 + s^2(n-1) + \dfrac{n\kappa_0}{(n + \kappa_0)} (\bar{y} - \mu_0)^2 \right] \right\}}_{\textrm{Gamma Kernel}}\\
\end{split}`
$$
]
]

---

## Posterior derivation

.block[
.small[
$$
`\begin{split}
\Rightarrow \ \pi(\mu,\tau | Y) \ & \boldsymbol{ \propto} \ \ \underbrace{\textrm{exp}\left\{-\dfrac{1}{2} \tau(n + \kappa_0) \left[ \mu - \dfrac{(n \bar{y} + \kappa_0 \mu_0)}{(n + \kappa_0)} \right]^2 \right\}}_{\textrm{Normal Kernel}}\\
& \ \ \ \ \ \ \ \ \ \ \ \ \times \underbrace{\tau^{\dfrac{\nu_0 + n}{2}-1} \textrm{exp}\left\{-\dfrac{\tau}{2} \left[\nu_0\sigma_0^2 + s^2(n-1) + \dfrac{n\kappa_0}{(n + \kappa_0)} (\bar{y} - \mu_0)^2 \right] \right\}}_{\textrm{Gamma Kernel}}\\
& \boldsymbol{=} \ \ \mathcal{N}\left(\mu_n, \dfrac{1}{\kappa_n\tau} \right) \times \textrm{Gamma}\left(\dfrac{\nu_n}{2}, \dfrac{\nu_n \sigma_n^2}{2}\right) = \pi(\mu|Y, \tau) \pi(\tau|Y),
\end{split}`
$$
]
]

where
.block[
.small[
$$
`\begin{split}
\kappa_n & = \kappa_0 + n\\
\mu_n & = \dfrac{\kappa_0 \mu_0 + n\bar{y}}{\kappa_n} = \dfrac{\kappa_0}{\kappa_n} \mu_0 + \dfrac{n}{\kappa_n} \bar{y}\\
\nu_n & = \nu_0 + n\\
\sigma_n^2 & = \dfrac{1}{\nu_n}\left[\nu_0\sigma_0^2 + s^2(n-1) + \dfrac{n\kappa_0}{\kappa_n} (\bar{y} - \mu_0)^2 \right] = \dfrac{1}{\nu_n}\left[\nu_0\sigma_0^2 + \sum_{i=1}^n (y_i-\bar{y})^2 + \dfrac{n\kappa_0}{\kappa_n} (\bar{y} - \mu_0)^2 \right]\\
\end{split}`
$$
]
]

--

- It turns out that the marginal posterior of `\(\mu\)`, that is, `\(\pi(\mu|Y) = \int_0^\infty \pi(\mu, \tau|Y) d\tau\)`, is a **t-distribution**. You can derive that distribution if you are interested; we won't spend time on it in class.

---

## Back to our examples

- .hlight[Pygmalion: questions of interest]

  + Is the average improvement for the accelerated group larger than that for the no growth group?

    - What is `\(\Pr[\mu_A > \mu_N | Y_A, Y_N]\)`?

  + Is the variance of improvement scores for the accelerated group larger than that for the no growth group?

    - What is `\(\Pr[\sigma^2_A > \sigma^2_N | Y_A, Y_N]\)`?

--

- .hlight[Job training: questions of interest]

  + Is the average change in annual earnings for the training group larger than that for the no training group?

    - What is `\(\Pr[\mu_T > \mu_N | Y_T, Y_N]\)`?

  + Is the variance of change in annual earnings for the training group larger than that for the no training group?

    - What is `\(\Pr[\sigma^2_T > \sigma^2_N | Y_T, Y_N]\)`?

---

## Mildly informative priors

- We will focus on the Pygmalion study. Follow the same approach for the job training data.

--

- Suppose you have no idea whether students would improve IQ on average. Set `\(\mu_{0A} = \mu_{0N} = 0\)`.

--

- Suppose you don't have any faith in this belief, and think it is the equivalent of having only 1 prior observation in each group. Set `\(\kappa_{0A} = \kappa_{0N} = 1\)`.

--

- Based on the literature, the SD of change scores should be around 10 in each group, but still you don't have a lot of faith in this belief. Set `\(\nu_{0A} = \nu_{0N} = 1\)` and `\(\sigma^2_{0A} = \sigma^2_{0N} = 100\)`.

--

- Graph the priors to see if they accord with your beliefs. Sampling new values of `\(Y\)` from the priors offers a good check.

---

## Recall the Pygmalion data

- Data:

  + Accelerated group (A): 20, 10, 19, 15, 9, 18.

  + No growth group (N): 3, 2, 6, 10, 11, 5.

--

- Summary statistics:

  + `\(\bar{y}_A = 15.2\)`; `\(s_A = 4.71\)`.

  + `\(\bar{y}_N = 6.2\)`; `\(s_N = 3.65\)`.
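---

## Aside: posterior parameters in R

- A minimal sketch (not part of the original notes): the normal-gamma posterior hyperparameters can also be computed directly from the raw data. The helper `post_params` is just an illustrative name, and the prior settings are the mildly informative choices from the previous slides; its output should match the hand calculations on the next slide up to rounding.

```r
yA <- c(20, 10, 19, 15, 9, 18)  # accelerated group
yN <- c(3, 2, 6, 10, 11, 5)     # no growth group

# mildly informative prior settings from the previous slides
mu0 <- 0; kappa0 <- 1; nu0 <- 1; sigma02 <- 100

# normal-gamma posterior hyperparameters (kappa_n, nu_n, mu_n, sigma2_n)
post_params <- function(y, mu0, kappa0, nu0, sigma02) {
  n <- length(y); ybar <- mean(y); s2 <- var(y)
  kappan <- kappa0 + n
  nun <- nu0 + n
  mun <- (kappa0 * mu0 + n * ybar) / kappan
  sigma2n <- (nu0 * sigma02 + (n - 1) * s2 +
                n * kappa0 * (ybar - mu0)^2 / kappan) / nun
  c(kappa_n = kappan, nu_n = nun, mu_n = mun, sigma2_n = sigma2n)
}

post_params(yA, mu0, kappa0, nu0, sigma02)
post_params(yN, mu0, kappa0, nu0, sigma02)
```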
---

## Analysis with mildly informative priors

.block[
.small[
$$
`\begin{split}
\kappa_{nA} & = \kappa_{0A} + n_A = 1 + 6 = 7\\
\kappa_{nN} & = \kappa_{0N} + n_N = 1 + 6 = 7\\
\nu_{nA} & = \nu_{0A} + n_A = 1 + 6 = 7\\
\nu_{nN} & = \nu_{0N} + n_N = 1 + 6 = 7\\
\mu_{nA} & = \dfrac{\kappa_{0A} \mu_{0A} + n_A\bar{y}_A}{\kappa_{nA}} = \dfrac{ (1)(0) + (6)(15.2) }{7} \approx 13.03\\
\mu_{nN} & = \dfrac{\kappa_{0N} \mu_{0N} + n_N\bar{y}_N}{\kappa_{nN}} = \dfrac{ (1)(0) + (6)(6.2) }{7} \approx 5.31\\
\sigma_{nA}^2 & = \dfrac{1}{\nu_{nA}}\left[\nu_{0A}\sigma_{0A}^2 + s^2_A(n_A-1) + \dfrac{n_A\kappa_{0A}}{\kappa_{nA}} (\bar{y}_A - \mu_{0A})^2\right] \\
& = \dfrac{1}{7}\left[(1)(100) + (22.17)(5) + \dfrac{(6)(1)}{(7)} (15.2- 0)^2\right] \approx 58.41\\
\sigma_{nN}^2 & = \dfrac{1}{\nu_{nN}}\left[\nu_{0N}\sigma_{0N}^2 + s^2_N(n_N-1) + \dfrac{n_N\kappa_{0N}}{\kappa_{nN}} (\bar{y}_N - \mu_{0N})^2\right] \\
& = \dfrac{1}{7}\left[(1)(100) + (13.37)(5) + \dfrac{(6)(1)}{(7)} (6.2- 0)^2\right] \approx 28.54\\
\end{split}`
$$
]
]

---

## Analysis with mildly informative priors

- So our joint posterior is
.block[
.small[
$$
`\begin{split}
\mu_A | Y_A, \tau_A & \sim \ \mathcal{N}\left(\mu_{nA}, \dfrac{1}{\kappa_{nA}\tau_A} \right) = \mathcal{N}\left(13.03, \dfrac{1}{7\tau_A} \right)\\
\tau_A | Y_A & \sim \textrm{Gamma}\left(\dfrac{\nu_{nA}}{2}, \dfrac{\nu_{nA} \sigma_{nA}^2}{2}\right) = \textrm{Gamma}\left(\dfrac{7}{2}, \dfrac{7 (58.41)}{2}\right)\\
\mu_N | Y_N, \tau_N & \sim \ \mathcal{N}\left(\mu_{nN}, \dfrac{1}{\kappa_{nN}\tau_N} \right) = \mathcal{N}\left(5.31, \dfrac{1}{7\tau_N} \right)\\
\tau_N | Y_N & \sim \textrm{Gamma}\left(\dfrac{\nu_{nN}}{2}, \dfrac{\nu_{nN} \sigma_{nN}^2}{2}\right) = \textrm{Gamma}\left(\dfrac{7}{2}, \dfrac{7 (28.54)}{2}\right)\\
\end{split}`
$$
]
]

---

## Monte Carlo sampling

- To evaluate whether the accelerated group has larger IQ gains than the no growth group, we would like to estimate quantities like `\(\Pr[\mu_A > \mu_N | Y_A, Y_N]\)`, which are based on the **marginal posterior** of `\(\mu\)` rather than the **conditional distribution**.

--

- Fortunately, this is easy to do by generating samples of `\(\mu\)` and `\(\sigma^2\)` from their joint posterior.

---

## Monte Carlo sampling

- Suppose we simulate values using the following Monte Carlo procedure:
.block[
.small[
$$
`\begin{split}
\tau^{(1)} & \sim \textrm{Gamma}\left(\dfrac{\nu_{n}}{2}, \dfrac{\nu_{n} \sigma_{n}^2}{2}\right)\\
\mu^{(1)} & \sim \ \mathcal{N}\left(\mu_{n}, \dfrac{1}{\kappa_{n}\tau^{(1)}} \right)\\
\tau^{(2)} & \sim \textrm{Gamma}\left(\dfrac{\nu_{n}}{2}, \dfrac{\nu_{n} \sigma_{n}^2}{2}\right)\\
\mu^{(2)} & \sim \ \mathcal{N}\left(\mu_{n}, \dfrac{1}{\kappa_{n}\tau^{(2)}} \right)\\
& \ \ \vdots \\
& \ \ \vdots \\
& \ \ \vdots \\
\tau^{(m)} & \sim \textrm{Gamma}\left(\dfrac{\nu_{n}}{2}, \dfrac{\nu_{n} \sigma_{n}^2}{2}\right)\\
\mu^{(m)} & \sim \ \mathcal{N}\left(\mu_{n}, \dfrac{1}{\kappa_{n}\tau^{(m)}} \right)\\
\end{split}`
$$
]
]

---

## Monte Carlo sampling

- Note that we are sampling each `\(\mu^{(j)}\)`, `\(j=1,\ldots,m\)`, from its conditional distribution, not from the marginal.

--

- The pairs `\(\{(\tau, \mu)^{(1)}, \ldots, (\tau, \mu)^{(m)}\}\)` simulated using this method are independent samples from the joint posterior `\(\pi(\mu,\tau | Y)\)`.

--

- Additionally, the simulated values `\(\{\mu^{(1)}, \ldots, \mu^{(m)}\}\)` are independent samples from the **marginal posterior distribution** of `\(\mu\)`.
--

- While this may seem odd, keep in mind that although we drew the `\(\mu\)`'s as conditional samples, each was conditional on a different value of `\(\tau\)`.

--

- Thus, together they constitute marginal samples of `\(\mu\)`.

---

## Monte Carlo sampling

It is easy to sample from these posteriors:

```r
# posterior hyperparameters from the previous slides
aA <- 7/2
aN <- 7/2
bA <- (7/2)*58.41
bN <- (7/2)*28.54
muA <- 13.03
muN <- 5.31
kappaA <- 7
kappaN <- 7

# sample tau, then mu given tau, for each group
tauA_postsample <- rgamma(10000,aA,bA)
thetaA_postsample <- rnorm(10000,muA,sqrt(1/(kappaA*tauA_postsample)))
tauN_postsample <- rgamma(10000,aN,bN)
thetaN_postsample <- rnorm(10000,muN,sqrt(1/(kappaN*tauN_postsample)))

# convert precisions to variances
sigma2A_postsample <- 1/tauA_postsample
sigma2N_postsample <- 1/tauN_postsample
```

---

## Monte Carlo sampling

- Is the average improvement for the accelerated group larger than that for the no growth group?

  + What is `\(\Pr[\mu_A > \mu_N | Y_A, Y_N]\)`?

```r
mean(thetaA_postsample > thetaN_postsample)
```

```
## [1] 0.9721
```

--

- Is the variance of improvement scores for the accelerated group larger than that for the no growth group?

  + What is `\(\Pr[\sigma^2_A > \sigma^2_N | Y_A, Y_N]\)`?

```r
mean(sigma2A_postsample > sigma2N_postsample)
```

```
## [1] 0.8185
```

--

- <div class="question"> What can we conclude from this? </div>

---

## Improper prior

- Let's be very objective with the prior selection. In fact, let's be extreme!

--

  + If we let the normal variance `\(\rightarrow \infty\)` then our prior on `\(\mu\)` is `\(\propto 1\)` (recall the Jeffreys' prior on `\(\mu\)` for known `\(\sigma^2\)`).

--

  + If we let the gamma variance get very large (e.g., `\(a,b \rightarrow 0\)`), then the prior on `\(\sigma^2\)` is `\(\propto \dfrac{1}{\sigma^2}\)`.

--

- `\(\pi(\mu,\sigma^2) \propto \dfrac{1}{\sigma^2}\)` is improper (does not integrate to 1) but does lead to a proper posterior distribution that yields inferences similar to frequentist ones.

--

- For that choice, we have
.block[
.small[
$$
`\begin{split}
\mu|Y,\tau & \sim \mathcal{N}\left(\bar{y},\dfrac{1}{n \tau}\right)\\
\tau | Y & \sim \textrm{Gamma}\left(\dfrac{n-1}{2}, \dfrac{(n-1)s^2}{2}\right)\\
\end{split}`
$$
]
]

---

## Analysis with noninformative priors

- So our joint posterior is
.block[
.small[
$$
`\begin{split}
\mu_A | Y_A, \tau_A & \sim \ \mathcal{N}\left(\bar{y}_A,\dfrac{1}{n_A \tau_A}\right) = \mathcal{N}\left(15.2, \dfrac{1}{6\tau_A} \right)\\
\tau_A | Y_A & \sim \textrm{Gamma}\left(\dfrac{n_A-1}{2}, \dfrac{(n_A-1)s^2_A}{2}\right) = \textrm{Gamma}\left(\dfrac{6-1}{2}, \dfrac{(6-1)(22.17)}{2}\right)\\
\mu_N | Y_N, \tau_N & \sim \ \mathcal{N}\left(\bar{y}_N,\dfrac{1}{n_N \tau_N}\right) = \mathcal{N}\left(6.2, \dfrac{1}{6\tau_N} \right)\\
\tau_N | Y_N & \sim \textrm{Gamma}\left(\dfrac{n_N-1}{2}, \dfrac{(n_N-1)s^2_N}{2}\right) = \textrm{Gamma}\left(\dfrac{6-1}{2}, \dfrac{(6-1)(13.37)}{2}\right)\\
\end{split}`
$$
]
]

---

## Monte Carlo sampling

It is easy to sample from these posteriors:

```r
# posterior hyperparameters under the improper prior
aA <- (6-1)/2
aN <- (6-1)/2
bA <- (6-1)*22.17/2
bN <- (6-1)*13.37/2
muA <- 15.2
muN <- 6.2

# sample tau, then mu given tau, for each group
tauA_postsample_impr <- rgamma(10000,aA,bA)
thetaA_postsample_impr <- rnorm(10000,muA,sqrt(1/(6*tauA_postsample_impr)))
tauN_postsample_impr <- rgamma(10000,aN,bN)
thetaN_postsample_impr <- rnorm(10000,muN,sqrt(1/(6*tauN_postsample_impr)))

# convert precisions to variances
sigma2A_postsample_impr <- 1/tauA_postsample_impr
sigma2N_postsample_impr <- 1/tauN_postsample_impr
```

---

## Monte Carlo sampling
- Is the average improvement for the accelerated group larger than that for the no growth group?

  + What is `\(\Pr[\mu_A > \mu_N | Y_A, Y_N]\)`?

```r
mean(thetaA_postsample_impr > thetaN_postsample_impr)
```

```
## [1] 0.9941
```

--

- Is the variance of improvement scores for the accelerated group larger than that for the no growth group?

  + What is `\(\Pr[\sigma^2_A > \sigma^2_N | Y_A, Y_N]\)`?

```r
mean(sigma2A_postsample_impr > sigma2N_postsample_impr)
```

```
## [1] 0.7113
```

--

- <div class="question"> How does the new choice of prior affect our conclusions? </div>

---

## Recall the job training data

- Data:

  + No training group (N): sample size `\(n_N = 429\)`.

  + Training group (T): sample size `\(n_T = 185\)`.

--

- Summary statistics for change in annual earnings:

  + `\(\bar{y}_N = 1364.93\)`; `\(s_N = 7460.05\)`.

  + `\(\bar{y}_T = 4253.57\)`; `\(s_T = 8926.99\)`.

--

- <div class="question"> Using the same approach we used for the Pygmalion data, answer the questions of interest. </div>
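---

## Starter code for the job training data

- As a starting point (a sketch, not a full solution from the original notes): under the improper prior `\(\pi(\mu,\sigma^2) \propto \dfrac{1}{\sigma^2}\)`, the posteriors depend only on the summary statistics above, so the Monte Carlo step mirrors the Pygmalion code. The object names below are illustrative.

```r
# summary statistics from the previous slide
nN <- 429; ybarN <- 1364.93; sN <- 7460.05   # no training group
nT <- 185; ybarT <- 4253.57; sT <- 8926.99   # training group

# posterior draws under the improper prior:
# tau | Y ~ Gamma((n-1)/2, (n-1)s^2/2),  mu | Y, tau ~ N(ybar, 1/(n*tau))
tauN_post <- rgamma(10000, (nN - 1)/2, (nN - 1)*sN^2/2)
muN_post  <- rnorm(10000, ybarN, sqrt(1/(nN*tauN_post)))
tauT_post <- rgamma(10000, (nT - 1)/2, (nT - 1)*sT^2/2)
muT_post  <- rnorm(10000, ybarT, sqrt(1/(nT*tauT_post)))

mean(muT_post > muN_post)         # Pr[mu_T > mu_N | Y_T, Y_N]
mean(1/tauT_post > 1/tauN_post)   # Pr[sigma^2_T > sigma^2_N | Y_T, Y_N]
```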