class: center, middle, inverse, title-slide

# Hierarchical models II

### Dr. Olanrewaju Michael Akande

### March 25, 2020

---

## Announcements

- Review changes to the syllabus.

--

- Any concerns from the lab meetings?

--

- Going forward, there will be 5-minute breaks (roughly) halfway through each class meeting.

--

- Some initial details on the final exam.

--

## Outline

- Hierarchical modeling of means recap

- Hierarchical modeling of means and variances

- Gibbs sampler

- ELS data

---

## Regular univariate normal model

- Recall that if we assume
.block[
.small[
`$$y_i \sim \mathcal{N}(\mu, \sigma^2), \ \ i=1, \ldots, n,$$`
]
]

--

and set our priors to be
.block[
.small[
$$
`\begin{split}
\pi(\mu) & = \mathcal{N}\left(\mu_0, \gamma_0^2\right),\\
\pi(\sigma^2) & = \mathcal{IG}\left(\dfrac{\nu_0}{2}, \dfrac{\nu_0 \sigma_0^2}{2}\right),\\
\end{split}`
$$
]
]

--

then we have
.block[
.small[
$$
`\begin{split}
\pi(\mu, \sigma^2 | Y) & \boldsymbol{\propto} \left\{ \prod_{i=1}^{n} p(y_{i} | \mu, \sigma^2 ) \right\} \cdot \pi(\mu) \cdot \pi(\sigma^2). \\
\end{split}`
$$
]
]

---

## Full conditionals

- So that
.block[
.small[
$$
`\begin{split}
\pi(\mu | \sigma^2, Y) & = \mathcal{N}\left(\mu_n, \gamma_n^2\right),\\
\end{split}`
$$
]
]

where
.block[
.small[
$$
`\begin{split}
\gamma^2_n = \dfrac{1}{ \dfrac{n}{\sigma^2} + \dfrac{1}{\gamma_0^2} }; \ \ \ \ \ \ \ \ \mu_n & = \gamma^2_n \left[ \dfrac{n}{\sigma^2} \bar{y} + \dfrac{1}{\gamma_0^2} \mu_0 \right],
\end{split}`
$$
]
]

--

- and
.block[
.small[
$$
`\begin{split}
\pi(\sigma^2 | \mu,Y) \boldsymbol{=} \mathcal{IG}\left(\dfrac{\nu_n}{2}, \dfrac{\nu_n\sigma_n^2}{2}\right),
\end{split}`
$$
]
]

where
.block[
.small[
$$
`\begin{split}
\nu_n & = \nu_0 + n; \ \ \ \ \ \ \ \sigma_n^2 = \dfrac{1}{\nu_n} \left[ \nu_0 \sigma_0^2 + \sum^n_{i=1} (y_i - \mu)^2 \right].\\
\end{split}`
$$
]
]

---

## Hierarchical modeling of means recap

- We've looked at the hierarchical normal model of the form
.block[
.small[
$$
`\begin{split}
y_{ij} | \theta_j, \sigma^2_j & \sim \mathcal{N} \left( \theta_j, \sigma^2_j \right); \ \ \ i = 1,\ldots, n_j\\
\theta_j | \mu, \tau^2 & \sim \mathcal{N} \left( \mu, \tau^2 \right); \ \ \ j = 1, \ldots, J.
\end{split}`
$$
]
]

--

- The extra layer of hierarchy, through the prior on the means, leads to sharing of information across groups when estimating the group-specific means.

--

- As before, first set `\(\sigma^2_j = \sigma^2\)` for all groups, to simplify posterior inference. We will revisit this today.

--

- Thus, we only have two variance terms, `\(\sigma^2\)` and `\(\tau^2\)`, to inform us about within-group and between-group variation, respectively.

---

## Hierarchical normal model recap

- Standard semi-conjugate priors as before:
.block[
.small[
$$
`\begin{split}
\pi(\mu) & = \mathcal{N}\left(\mu_0, \gamma^2_0\right)\\
\pi(\sigma^2) & = \mathcal{IG} \left(\dfrac{\nu_0}{2}, \dfrac{\nu_0\sigma_0^2}{2}\right)\\
\pi(\tau^2) & = \mathcal{IG} \left(\dfrac{\eta_0}{2}, \dfrac{\eta_0\tau_0^2}{2}\right),\\
\end{split}`
$$
]
]

--

with
  + `\(\mu_0\)`: best guess of the average of school averages
  + `\(\gamma^2_0\)`: set based on plausible ranges of values of `\(\mu\)`

--

  + `\(\tau_0^2\)`: best guess of the (scaled) variance of school averages
  + `\(\eta_0\)`: set based on how tight the prior for `\(\tau^2\)` is around `\(\tau_0^2\)`

--

  + `\(\sigma_0^2\)`: best guess of the (scaled) variance of individual test scores around the respective school means
  + `\(\nu_0\)`: set based on how tight the prior for `\(\sigma^2\)` is around `\(\sigma_0^2\)`.
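---

## Aside: one-sample Gibbs updates in R

- Circling back to the one-sample full conditionals from a few slides ago: here is a minimal R sketch of the two updates (not the course code; the toy data and all names below are illustrative).

```r
# Minimal sketch: Gibbs sampler for the one-sample normal model.
# Toy data and hyperparameter values are for illustration only.
set.seed(360)
y <- rnorm(50, mean = 48, sd = 9); n <- length(y); ybar <- mean(y)
mu_0 <- 50; gamma_0_sq <- 25   # prior on mu
nu_0 <- 1; sigma_0_sq <- 100   # prior on sigma^2
mu <- ybar; sigma_sq <- var(y) # initial values
MU <- SIGMA_SQ <- numeric(5000)
for(s in 1:5000){
  # update mu | sigma_sq, Y: normal with variance gamma_n_sq and mean mu_n
  gamma_n_sq <- 1/(n/sigma_sq + 1/gamma_0_sq)
  mu_n <- gamma_n_sq*(n*ybar/sigma_sq + mu_0/gamma_0_sq)
  mu <- rnorm(1, mu_n, sqrt(gamma_n_sq))
  # update sigma_sq | mu, Y: inverse-gamma, drawn as 1/gamma
  nu_n <- nu_0 + n
  sigma_sq <- 1/rgamma(1, nu_n/2, (nu_0*sigma_0_sq + sum((y - mu)^2))/2)
  MU[s] <- mu; SIGMA_SQ[s] <- sigma_sq
}
```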
---

## Posterior inference recap

- The resulting posterior is therefore:
.block[
.small[
$$
`\begin{split}
\pi(\theta_1, \ldots, \theta_J, \mu, \sigma^2,\tau^2 | Y) & \boldsymbol{\propto} p(y | \theta_1, \ldots, \theta_J, \mu, \sigma^2,\tau^2)\\
& \ \ \ \ \times p(\theta_1, \ldots, \theta_J | \mu, \sigma^2,\tau^2)\\
& \ \ \ \ \times \pi(\mu, \sigma^2,\tau^2)\\
\\
& \boldsymbol{=} p(y | \theta_1, \ldots, \theta_J, \sigma^2 )\\
& \ \ \ \ \times p(\theta_1, \ldots, \theta_J | \mu,\tau^2)\\
& \ \ \ \ \times \pi(\mu) \cdot \pi(\sigma^2) \cdot \pi(\tau^2)\\
\\
& \boldsymbol{=} \left\{ \prod_{j=1}^{J} \prod_{i=1}^{n_j} p(y_{ij} | \theta_j, \sigma^2 ) \right\}\\
& \ \ \ \ \times \left\{ \prod_{j=1}^{J} p(\theta_j | \mu,\tau^2) \right\}\\
& \ \ \ \ \times\pi(\mu) \cdot \pi(\sigma^2) \cdot \pi(\tau^2)\\
\end{split}`
$$
]
]

---

## Full conditional for grand mean recap

- .block[
.small[
$$
`\begin{split}
\pi(\mu | \theta_1, \ldots, \theta_J, \sigma^2,\tau^2, Y) & \boldsymbol{\propto} \left\{ \prod_{j=1}^{J} p(\theta_j | \mu,\tau^2) \right\} \cdot \pi(\mu).
\end{split}`
$$
]
]

--

- This looks like the full conditional distribution from the one-sample normal case, so that
.block[
.small[
$$
`\begin{split}
\pi(\mu | \theta_1, \ldots, \theta_J, \sigma^2,\tau^2, Y) & = \mathcal{N}\left(\mu_n, \gamma^2_n \right) \ \ \ \ \textrm{where}\\
\\
\gamma^2_n = \dfrac{1}{ \dfrac{J}{\tau^2} + \dfrac{1}{\gamma_0^2} } ; \ \ \ \ \ \ \ \ \mu_n = \gamma^2_n \left[ \dfrac{J}{\tau^2} \bar{\theta} + \dfrac{1}{\gamma_0^2} \mu_0 \right]
\end{split}`
$$
]
]

and `\(\bar{\theta} = \frac{1}{J} \sum\limits^J_{j=1} \theta_j\)`.

---

## Full conditionals for group means recap

- .block[
.small[
$$
`\begin{split}
\pi(\theta_j | \theta_{-j}, \mu, \sigma^2,\tau^2, Y) & \boldsymbol{\propto} \left\{ \prod_{i=1}^{n_j} p(y_{ij} | \theta_j, \sigma^2 ) \right\} \cdot p(\theta_j | \mu,\tau^2) \\
\end{split}`
$$
]
]

--

- Those terms include a normal density for `\(\theta_j\)` multiplied by a product of normal densities in which `\(\theta_j\)` is the mean, again mirroring the one-sample case, so you can show that
.block[
.small[
$$
`\begin{split}
\pi(\theta_j | \theta_{-j}, \mu, \sigma^2,\tau^2, Y) & = \mathcal{N}\left(\mu_j^\star, \tau_j^\star \right) \ \ \ \ \textrm{where}\\
\\
\tau_j^\star & = \dfrac{1}{ \dfrac{n_j}{\sigma^2} + \dfrac{1}{\tau^2} } ; \ \ \ \ \ \ \ \mu_j^\star = \tau_j^\star \left[ \dfrac{n_j}{\sigma^2} \bar{y}_j + \dfrac{1}{\tau^2} \mu \right]
\end{split}`
$$
]
]

---

## Full conditionals for within-group variance recap

- .block[
.small[
$$
`\begin{split}
\pi(\sigma^2 | \theta_1, \ldots, \theta_J, \mu, \tau^2, Y) & \boldsymbol{\propto} \left\{ \prod_{j=1}^{J} \prod_{i=1}^{n_j} p(y_{ij} | \theta_j, \sigma^2 ) \right\} \cdot \pi(\sigma^2)\\
\end{split}`
$$
]
]

--

- We can take advantage of the one-sample normal problem, so that our full conditional posterior is
.block[
.small[
$$
`\begin{split}
\pi(\sigma^2 | \theta_1, \ldots, \theta_J, \mu, \tau^2, Y) & = \mathcal{IG} \left(\dfrac{\nu_n}{2}, \dfrac{\nu_n\sigma_n^2}{2}\right) \ \ \ \ \textrm{where}\\
\\
\nu_n = \nu_0 + \sum\limits_{j=1}^{J} n_j ; \ \ \ \ \ \ \ \sigma_n^2 & = \dfrac{1}{\nu_n} \left[\nu_0\sigma_0^2 + \sum\limits_{j=1}^{J}\sum\limits_{i=1}^{n_j} (y_{ij} - \theta_j)^2 \right].\\
\end{split}`
$$
]
]

---

## Full conditionals for across-group variance recap

- .block[
.small[
$$
`\begin{split}
\pi(\tau^2 | \theta_1, \ldots, \theta_J, \mu, \sigma^2, Y) & \boldsymbol{\propto} \left\{ \prod_{j=1}^{J} p(\theta_j | \mu,\tau^2) \right\} \cdot \pi(\tau^2)\\
\end{split}`
$$
]
]

--
- Again, we have
.block[
.small[
$$
`\begin{split}
\pi(\tau^2 | \theta_1, \ldots, \theta_J, \mu, \sigma^2, Y) & = \mathcal{IG} \left(\dfrac{\eta_n}{2}, \dfrac{\eta_n\tau_n^2}{2}\right) \ \ \ \ \textrm{where}\\
\\
\eta_n = \eta_0 + J ; \ \ \ \ \ \ \ \tau_n^2 & = \dfrac{1}{\eta_n} \left[\eta_0\tau_0^2 + \sum\limits_{j=1}^{J} (\theta_j - \mu)^2 \right].\\
\end{split}`
$$
]
]

---

## Hierarchical modeling of means and variances

- Researchers often emphasize differences in means; however, variances can be just as important.

--

- If we think means vary across groups, why shouldn't we worry about variances also varying across groups?

--

- In that case, we have the model
.block[
.small[
$$
`\begin{split}
y_{ij} | \theta_j, \sigma^2_j & \sim \mathcal{N} \left( \theta_j, \sigma^2_j \right); \ \ \ i = 1,\ldots, n_j\\
\theta_j | \mu, \tau^2 & \sim \mathcal{N} \left( \mu, \tau^2 \right); \ \ \ j = 1, \ldots, J.
\end{split}`
$$
]
]

--

- However, now we also need a prior on all the `\(\sigma^2_j\)`'s that lets us borrow information across groups.

---

## Full conditionals

- Notice that our prior won't affect the full conditionals for `\(\mu\)` and `\(\tau^2\)`, since those do not depend on the `\(\sigma^2_j\)`'s.

--

- For the full conditional of each `\(\theta_j\)`, we have
.block[
.small[
$$
`\begin{split}
\pi(\theta_j | \theta_{-j}, \mu, \sigma^2_1, \ldots, \sigma^2_J,\tau^2, Y) & \boldsymbol{\propto} \left\{ \prod_{i=1}^{n_j} p(y_{ij} | \theta_j, \sigma^2_j ) \right\} \cdot p(\theta_j | \mu,\tau^2) \\
\end{split}`
$$
]
]

with the only change from before being `\(\sigma^2_j\)`.

--

- That is, those terms still include a normal density for `\(\theta_j\)` multiplied by a product of normals in which `\(\theta_j\)` is the mean, again mirroring the previous case, so you can show that
.block[
.small[
$$
`\begin{split}
\pi(\theta_j | \theta_{-j}, \mu, \sigma^2_1, \ldots, \sigma^2_J,\tau^2, Y) & = \mathcal{N}\left(\mu_j^\star, \tau_j^\star \right) \ \ \ \ \textrm{where}\\
\\
\tau_j^\star & = \dfrac{1}{ \dfrac{n_j}{\sigma^2_j} + \dfrac{1}{\tau^2} } ; \ \ \ \ \ \ \ \mu_j^\star = \tau_j^\star \left[ \dfrac{n_j}{\sigma^2_j} \bar{y}_j + \dfrac{1}{\tau^2} \mu \right]
\end{split}`
$$
]
]

---

## How about within-group variances?

- Now we need to find a semi-conjugate prior for the `\(\sigma^2_j\)`'s. Before, with one `\(\sigma^2\)`, we had
.block[
.small[
$$
`\begin{split}
\pi(\sigma^2) & = \mathcal{IG} \left(\dfrac{\nu_0}{2}, \dfrac{\nu_0\sigma_0^2}{2}\right),\\
\end{split}`
$$
]
]

which was nicely semi-conjugate.

--

- That suggests that maybe we should start with
.block[
.small[
$$
`\begin{split}
\sigma^2_1, \ldots, \sigma^2_J | \nu_0, \sigma_0^2 & \sim \mathcal{IG} \left(\dfrac{\nu_0}{2}, \dfrac{\nu_0\sigma_0^2}{2}\right).
\end{split}`
$$
]
]

--

- However, if we just fix the hyperparameters `\(\nu_0\)` and `\(\sigma_0^2\)` in advance, the prior on the `\(\sigma^2_j\)`'s does not allow borrowing of information across groups to aid estimation.

--

- Thus, we actually need to treat `\(\nu_0\)` and `\(\sigma_0^2\)` as parameters in a hierarchical model for both means and variances.

---

## How about within-group variances?

- Before we get to the choice of the priors for `\(\nu_0\)` and `\(\sigma_0^2\)`, we have enough to derive the full conditional for each `\(\sigma^2_j\)`.
This actually takes a similar form to what we had before we indexed by `\(j\)`, that is,
.block[
.small[
$$
`\begin{split}
\pi(\sigma^2_j | \sigma^2_{-j}, \theta_1, \ldots, \theta_J, \mu, \tau^2, \nu_0, \sigma_0^2,Y) & \boldsymbol{\propto} \left\{ \prod_{i=1}^{n_j} p(y_{ij} | \theta_j, \sigma^2_j ) \right\} \cdot \pi(\sigma^2_j | \nu_0, \sigma_0^2)\\
\end{split}`
$$
]
]

--

- This still looks like what we had before, that is, products of normals and one inverse-gamma, so that
.block[
.small[
$$
`\begin{split}
\pi(\sigma^2_j | \sigma^2_{-j}, \theta_1, \ldots, \theta_J, \mu, \tau^2, \nu_0, \sigma_0^2, Y) & = \mathcal{IG} \left(\dfrac{\nu_{j}^\star}{2}, \dfrac{\nu_{j}^\star\sigma_{j}^{2(\star)}}{2}\right) \ \ \ \ \textrm{where}\\
\\
\nu_{j}^\star = \nu_0 + n_j ; \ \ \ \ \ \ \ \sigma_{j}^{2(\star)} & = \dfrac{1}{\nu_{j}^\star} \left[\nu_0\sigma_0^2 + \sum\limits_{i=1}^{n_j} (y_{ij} - \theta_j)^2 \right].\\
\end{split}`
$$
]
]

---

## Remaining hyper-priors

- Now we can get back to priors for `\(\nu_0\)` and `\(\sigma_0^2\)`. It turns out that a semi-conjugate prior for `\(\sigma_0^2\)` (see question 2 on homework 2) is a gamma distribution. That is, if we set
.block[
.small[
$$
`\begin{split}
\pi(\sigma_0^2) & = \mathcal{Ga} \left(a,b\right),\\
\end{split}`
$$
]
]

then,
.block[
.small[
$$
`\begin{split}
\pi(\sigma_0^2 | \theta_1, \ldots, \theta_J, \sigma^2_1, \ldots, \sigma^2_J, \mu, \tau^2, \nu_0, Y) & \boldsymbol{\propto} \left\{ \prod_{j=1}^{J} p(\sigma^2_j | \nu_0, \sigma_0^2) \right\} \cdot \pi(\sigma_0^2)\\
& \propto \ \left\{ \prod_{j=1}^{J} \mathcal{IG} \left(\sigma^2_j; \dfrac{\nu_0}{2}, \dfrac{\nu_0\sigma_0^2}{2}\right) \right\} \ \cdot \ \mathcal{Ga} \left(\sigma_0^2 ; a,b\right) \\
\end{split}`
$$
]
]

--

- Recall that
  + `\(\mathcal{Ga}(y; a, b) \equiv \dfrac{b^a}{\Gamma(a)} y^{a - 1} e^{-b y}\)`, and
  + `\(\mathcal{IG}(y; a, b) \equiv \dfrac{b^a}{\Gamma(a)} y^{-(a+1)} e^{-\dfrac{b}{y}}\)`.
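---

## Aside: updating the within-group variances in R

- The full conditional for each `\(\sigma^2_j\)` above is easy to sample, and all `\(J\)` updates can be vectorized. A minimal sketch (not the course code; the toy inputs below stand in for the current state of a Gibbs sampler):

```r
# Minimal sketch: one Gibbs update for all sigma^2_j's at once.
# Toy inputs; in a real sampler these are the current draws.
set.seed(1)
J <- 5; n_j <- c(10, 8, 12, 9, 11)            # illustrative group sizes
y <- lapply(n_j, function(m) rnorm(m, 50, 8)) # toy data, one vector per group
theta <- sapply(y, mean)                      # current group means
nu_0 <- 2; sigma_0_sq <- 100                  # current hyperparameter values
ss <- mapply(function(yj, tj) sum((yj - tj)^2), y, theta) # within-group SS
nu_star <- nu_0 + n_j
sigma_sq <- 1/rgamma(J, nu_star/2, (nu_0*sigma_0_sq + ss)/2) # IG draws
```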
---

## Remaining hyper-priors

- So `\(\pi(\sigma_0^2 | \theta_1, \ldots, \theta_J, \sigma^2_1, \ldots, \sigma^2_J, \mu, \tau^2, \nu_0, Y)\)`
.block[
.small[
$$
`\begin{split}
& \boldsymbol{\propto} \left\{ \prod_{j=1}^{J} p(\sigma^2_j | \nu_0, \sigma_0^2) \right\} \cdot \pi(\sigma_0^2)\\
& \boldsymbol{\propto} \ \left\{ \prod_{j=1}^{J} \mathcal{IG} \left(\sigma^2_j; \dfrac{\nu_0}{2}, \dfrac{\nu_0\sigma_0^2}{2}\right) \right\} \ \cdot \ \mathcal{Ga} \left(\sigma_0^2 ; a,b\right) \\
& \boldsymbol{=} \left[ \prod_{j=1}^{J} \dfrac{\left( \dfrac{\nu_0\sigma_0^2}{2} \right)^\left(\dfrac{\nu_0}{2} \right)}{\Gamma\left(\dfrac{\nu_0}{2} \right)} (\sigma^2_j)^{-\left(\dfrac{\nu_0}{2}+1 \right)} e^{-\dfrac{\nu_0\sigma_0^2}{2 (\sigma^2_j)}} \right] \cdot \left[ \dfrac{b^a}{\Gamma(a)} (\sigma_0^2)^{a - 1} e^{-b \sigma_0^2} \right] \\
& \boldsymbol{\propto} \left[ \prod_{j=1}^{J} \left( \sigma_0^2 \right)^\left(\dfrac{\nu_0}{2} \right) e^{-\dfrac{\nu_0\sigma_0^2}{2 (\sigma^2_j)}} \right] \cdot \left[ (\sigma_0^2)^{a - 1} e^{-b \sigma_0^2} \right] \\
& \boldsymbol{\propto} \left[ \left( \sigma_0^2 \right)^\left(\dfrac{J \nu_0}{2} \right) e^{- \sigma_0^2 \left[ \dfrac{\nu_0}{2} \sum\limits_{j=1}^{J} \dfrac{1}{\sigma^2_j} \right]} \right] \cdot \left[ (\sigma_0^2)^{a - 1} e^{-b \sigma_0^2} \right] \\
\end{split}`
$$
]
]

---

## Remaining hyper-priors

- That is, the full conditional is
.block[
.small[
$$
`\begin{split}
\pi(\sigma_0^2 | \cdots \cdots ) & \boldsymbol{\propto} \left[ \left( \sigma_0^2 \right)^\left(\dfrac{J \nu_0}{2} \right) e^{- \sigma_0^2 \left[ \dfrac{\nu_0}{2} \sum\limits_{j=1}^{J} \dfrac{1}{\sigma^2_j} \right]} \right] \cdot \left[ (\sigma_0^2)^{a - 1} e^{-b \sigma_0^2} \right] \\
& \boldsymbol{\propto} \left[ \left( \sigma_0^2 \right)^\left( a + \dfrac{J \nu_0}{2} - 1 \right) e^{- \sigma_0^2 \left[ b + \dfrac{\nu_0}{2} \sum\limits_{j=1}^{J} \dfrac{1}{\sigma^2_j} \right]} \right] \\
& \equiv \mathcal{Ga} \left(\sigma_0^2 ; a_n ,b_n \right),
\end{split}`
$$
]
]

where
.block[
.small[
$$
`\begin{split}
a_n = a + \dfrac{J \nu_0}{2}; \ \ \ \ b_n = b + \dfrac{\nu_0}{2} \sum\limits_{j=1}^{J} \dfrac{1}{\sigma^2_j}.
\end{split}`
$$
]
]

---

## Remaining hyper-priors

- OK, that leaves us with one parameter to go, i.e., `\(\nu_0\)`. It turns out there is no simple conjugate/semi-conjugate prior for `\(\nu_0\)`.

--

- Common practice is to restrict `\(\nu_0\)` to be an integer (which makes sense when we think of it as degrees of freedom, and also means it cannot be zero). With this restriction, we need a discrete prior distribution with support on `\(\nu_0 = 1, 2, 3, \ldots\)`.

--

- **Poll question: Can we use either a binomial or a Poisson prior for `\(\nu_0\)`?**

--

- A popular choice is the geometric distribution with pmf `\(p(\nu_0) = (1-p)^{\nu_0-1} p\)`.

--

- However, we will rewrite the kernel as `\(\pi(\nu_0) \propto e^{-\alpha \nu_0}\)`. How did we get here from the geometric pmf and what is `\(\alpha\)`?
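---

## Aside: updating the scale hyperparameter in R

- Before deriving the full conditional for `\(\nu_0\)`, note that the `\(\mathcal{Ga}(a_n, b_n)\)` full conditional for `\(\sigma_0^2\)` from two slides back is a one-liner. A minimal sketch (our own; the toy values below stand in for the current state of a sampler):

```r
# Minimal sketch: one Gibbs update for sigma_0^2 ~ Ga(a_n, b_n).
set.seed(2)
J <- 5; nu_0 <- 2; a <- 1; b <- 1/100
sigma_sq <- 1/rgamma(J, 2, 150)   # stand-in for the current sigma^2_j draws
a_n <- a + J*nu_0/2
b_n <- b + (nu_0/2)*sum(1/sigma_sq)
sigma_0_sq <- rgamma(1, a_n, b_n) # shape a_n, rate b_n
```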
---

## Final full conditional

- With this prior, `\(\pi(\nu_0 | \theta_1, \ldots, \theta_J, \sigma^2_1, \ldots, \sigma^2_J, \mu, \tau^2, \sigma_0^2, Y)\)`
.block[
.small[
$$
`\begin{split}
& \boldsymbol{\propto} \left\{ \prod_{j=1}^{J} p(\sigma^2_j | \nu_0, \sigma_0^2) \right\} \cdot \pi(\nu_0)\\
& \boldsymbol{\propto} \ \left\{ \prod_{j=1}^{J} \mathcal{IG} \left(\sigma^2_j; \dfrac{\nu_0}{2}, \dfrac{\nu_0\sigma_0^2}{2}\right) \right\} \ \cdot \ e^{-\alpha \nu_0} \\
& \boldsymbol{=} \left[ \prod_{j=1}^{J} \dfrac{\left( \dfrac{\nu_0\sigma_0^2}{2} \right)^\left(\dfrac{\nu_0}{2} \right)}{\Gamma\left(\dfrac{\nu_0}{2} \right)} \left(\sigma^2_j\right)^{-\left(\dfrac{\nu_0}{2}+1 \right)} e^{-\dfrac{\nu_0\sigma_0^2}{2 (\sigma^2_j)}} \right] \cdot e^{-\alpha \nu_0} \\
& \boldsymbol{\propto} \left[ \left( \dfrac{\left( \dfrac{\nu_0\sigma_0^2}{2} \right)^\left(\dfrac{\nu_0}{2} \right)}{\Gamma\left(\dfrac{\nu_0}{2} \right)} \right)^J \cdot \left(\prod_{j=1}^{J} \dfrac{1}{\sigma^2_j} \right)^{\left(\dfrac{\nu_0}{2}+1 \right)} \cdot e^{- \nu_0 \left[ \dfrac{\sigma_0^2}{2} \sum\limits_{j=1}^{J} \dfrac{1}{\sigma^2_j} \right]} \right] \cdot e^{-\alpha \nu_0} \\
\end{split}`
$$
]
]

---

## Final full conditional

- That is, the full conditional is
.block[
.small[
$$
`\begin{split}
\pi(\nu_0 | \cdots \cdots ) & \boldsymbol{\propto} \left[ \left( \dfrac{\left( \dfrac{\nu_0\sigma_0^2}{2} \right)^\left(\dfrac{\nu_0}{2} \right)}{\Gamma\left(\dfrac{\nu_0}{2} \right)} \right)^J \cdot \left(\prod_{j=1}^{J} \dfrac{1}{\sigma^2_j} \right)^{\left(\dfrac{\nu_0}{2}+1 \right)} \cdot e^{- \nu_0 \left[ \alpha + \dfrac{\sigma_0^2}{2} \sum\limits_{j=1}^{J} \dfrac{1}{\sigma^2_j} \right]} \right], \\
\end{split}`
$$
]
]

which is not a known kernel and is thus unnormalized (i.e., does not sum to 1 in its current form).

--

- This sure looks like a lot, but it will be relatively easy to compute in R.

--

- Now, technically, the support is `\(\nu_0 = 1, 2, 3, \ldots\)`; however, we can evaluate the unnormalized distribution on a grid of `\(\nu_0\)` values, say `\(\nu_0 = 1, 2, \ldots, K\)` for some large `\(K\)`, normalize over the grid, and then sample.

---

## Final full conditional

- One more thing: computing these probabilities on the raw scale can be numerically problematic, particularly because of the product inside. It is a good idea to work on the log scale instead.

--

- That is,
.block[
.small[
$$
`\begin{split}
\pi(\nu_0 | \cdots \cdots ) & \boldsymbol{\propto} \left[ \left( \dfrac{\left( \dfrac{\nu_0\sigma_0^2}{2} \right)^\left(\dfrac{\nu_0}{2} \right)}{\Gamma\left(\dfrac{\nu_0}{2} \right)} \right)^J \cdot \left(\prod_{j=1}^{J} \dfrac{1}{\sigma^2_j} \right)^{\left(\dfrac{\nu_0}{2}+1 \right)} \cdot e^{- \nu_0 \left[ \alpha + \dfrac{\sigma_0^2}{2} \sum\limits_{j=1}^{J} \dfrac{1}{\sigma^2_j} \right]} \right]\\
\\
\Rightarrow \ \text{ln} \pi(\nu_0 | \cdots \cdots ) & \boldsymbol{\propto} \left(\dfrac{J\nu_0}{2} \right) \text{ln} \left( \dfrac{\nu_0\sigma_0^2}{2} \right) - J\text{ln}\left[ \Gamma\left(\dfrac{\nu_0}{2} \right) \right] \\
& \ \ \ \ + \left(\dfrac{\nu_0}{2}+1 \right) \left(\sum_{j=1}^{J} \text{ln} \left[\dfrac{1}{\sigma^2_j} \right] \right) \\
& \ \ \ \ - \nu_0 \left[ \alpha + \dfrac{\sigma_0^2}{2} \sum\limits_{j=1}^{J} \dfrac{1}{\sigma^2_j} \right] \\
\end{split}`
$$
]
]

---

## ELS data

- Finally, enough math and some data!

--

- We have data from the 2002 Educational Longitudinal Survey (ELS). This survey includes a random sample of 100 large urban public high schools, and 10th graders randomly sampled within these high schools.
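---

## Aside: the grid trick in R

- Before we look at the data, here is a toy illustration (ours, with a made-up log pmf) of the grid sampling idea from the previous slides: evaluate the unnormalized log probabilities, subtract the maximum for numerical stability, exponentiate, and sample.

```r
# Toy sketch: sampling a discrete parameter from an unnormalized log pmf.
# The log pmf below is made up; the real one for nu_0 appears later.
set.seed(3)
grid <- 1:5000
log_prob <- -0.5*grid - 2*log(grid)    # any unnormalized log pmf over the grid
prob <- exp(log_prob - max(log_prob))  # subtract the max to avoid underflow
draw <- sample(grid, 1, prob = prob)   # sample() renormalizes internally
```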
---

## ELS data

```r
Y <- as.matrix(dget("http://www2.stat.duke.edu/~pdh10/FCBS/Inline/Y.school.mathscore"))
dim(Y)
```

```
## [1] 1993 2
```

```r
head(Y)
```

```
## school mathscore
## [1,] 1 52.11
## [2,] 1 57.65
## [3,] 1 66.44
## [4,] 1 44.68
## [5,] 1 40.57
## [6,] 1 35.04
```

```r
length(unique(Y[,"school"]))
```

```
## [1] 100
```

---

## ELS data

First, some EDA:

```r
Data <- as.data.frame(Y)
Data$school <- as.factor(Data$school)
Data %>%
  group_by(school) %>%
  na.omit() %>%
  summarise(avg_mathscore = mean(mathscore)) %>%
  dplyr::ungroup() %>%
  ggplot(aes(x = avg_mathscore)) +
  geom_histogram()
```

```
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
```

<img src="16-hierarchical-models-II_files/figure-html/unnamed-chunk-3-1.png" style="display: block; margin: auto;" />

---

## ELS data

There do appear to be school-related differences in both means and variances, some of which are actually related to the sample sizes.

---

## ELS data

Consider the math scores of these children:

<img src="img/ELS2002.png" height="450px" style="display: block; margin: auto;" />

---

## ELS hypotheses

- Investigators may be interested in the following:
  + Differences in mean scores across schools
  + Differences in school-specific variances

- How do we evaluate these questions in a statistical model?

---

## Hierarchical model

- We can write out the full model we've been describing as follows.
.block[
.small[
$$
`\begin{split}
y_{ij} | \theta_j, \sigma^2_j & \sim \mathcal{N} \left( \theta_j, \sigma^2_j \right); \ \ \ i = 1,\ldots, n_j\\
\\
\theta_j | \mu, \tau^2 & \sim \mathcal{N} \left( \mu, \tau^2 \right); \ \ \ j = 1, \ldots, J\\
\sigma^2_1, \ldots, \sigma^2_J | \nu_0, \sigma_0^2 & \sim \mathcal{IG} \left(\dfrac{\nu_0}{2}, \dfrac{\nu_0\sigma_0^2}{2}\right)\\
\\
\mu & \sim \mathcal{N}\left(\mu_0, \gamma^2_0\right)\\
\tau^2 & \sim \mathcal{IG} \left(\dfrac{\eta_0}{2}, \dfrac{\eta_0\tau_0^2}{2}\right)\\
\\
\pi(\nu_0) & \propto e^{-\alpha \nu_0} \\
\sigma_0^2 & \sim \mathcal{Ga} \left(a,b\right).\\
\end{split}`
$$
]
]

--

- Now, we need to specify hyperparameters. That should be fun!

---

## Prior specification

- This exam was designed to have a national mean of 50 and standard deviation of 10. Suppose we don't have any other information.

--

- Then, we can specify
.block[
.small[
$$
`\begin{split}
\mu & \sim \mathcal{N}\left(\mu_0 = 50, \gamma^2_0 = 25\right)\\
\\
\tau^2 & \sim \mathcal{IG} \left(\dfrac{\eta_0}{2} = \dfrac{1}{2}, \dfrac{\eta_0\tau_0^2}{2} = \dfrac{100}{2}\right)\\
\\
\pi(\nu_0) & \propto e^{-\alpha \nu_0} \propto e^{- \nu_0} \\
\\
\sigma_0^2 & \sim \mathcal{Ga} \left(a = 1,b = \dfrac{1}{100} \right).\\
\end{split}`
$$
]
]

--

- <div class="question">
Are these prior distributions overly informative?
</div>

---

## Full conditionals (recap)

- .block[
.small[
$$
`\begin{split}
& \pi(\theta_j | \cdots \cdots ) = \mathcal{N}\left(\mu_j^\star, \tau_j^\star \right) \ \ \ \ \textrm{where}\\
\\
& \tau_j^\star = \dfrac{1}{ \dfrac{n_j}{\sigma^2_j} + \dfrac{1}{\tau^2} } ; \ \ \ \ \ \ \ \mu_j^\star = \tau_j^\star \left[ \dfrac{n_j}{\sigma^2_j} \bar{y}_j + \dfrac{1}{\tau^2} \mu \right]
\end{split}`
$$
]
]

--

- .block[
.small[
$$
`\begin{split}
& \pi(\sigma^2_j | \cdots \cdots ) = \mathcal{IG} \left(\dfrac{\nu_{j}^\star}{2}, \dfrac{\nu_{j}^\star\sigma_{j}^{2(\star)}}{2}\right) \ \ \ \ \textrm{where}\\
\\
& \nu_{j}^\star = \nu_0 + n_j ; \ \ \ \ \ \ \ \sigma_{j}^{2(\star)} = \dfrac{1}{\nu_{j}^\star} \left[\nu_0\sigma_0^2 + \sum\limits_{i=1}^{n_j} (y_{ij} - \theta_j)^2 \right].\\
\end{split}`
$$
]
]

--

- .block[
.small[
$$
`\begin{split}
& \pi(\mu | \cdots \cdots ) = \mathcal{N}\left(\mu_n, \gamma^2_n \right) \ \ \ \ \textrm{where}\\
\\
& \gamma^2_n = \dfrac{1}{ \dfrac{J}{\tau^2} + \dfrac{1}{\gamma_0^2} } ; \ \ \ \ \ \ \ \ \mu_n = \gamma^2_n \left[ \dfrac{J}{\tau^2} \bar{\theta} + \dfrac{1}{\gamma_0^2} \mu_0 \right]
\end{split}`
$$
]
]

---

## Full conditionals (recap)

- .block[
.small[
$$
`\begin{split}
& \pi(\tau^2 | \cdots \cdots ) = \mathcal{IG} \left(\dfrac{\eta_n}{2}, \dfrac{\eta_n\tau_n^2}{2}\right) \ \ \ \ \textrm{where}\\
\\
& \eta_n = \eta_0 + J ; \ \ \ \ \ \ \ \tau_n^2 = \dfrac{1}{\eta_n} \left[\eta_0\tau_0^2 + \sum\limits_{j=1}^{J} (\theta_j - \mu)^2 \right].\\
\end{split}`
$$
]
]

--

- .block[
.small[
$$
`\begin{split}
\text{ln} \pi(\nu_0 | \cdots \cdots ) & \boldsymbol{\propto} \left(\dfrac{J\nu_0}{2} \right) \text{ln} \left( \dfrac{\nu_0\sigma_0^2}{2} \right) - J\text{ln}\left[ \Gamma\left(\dfrac{\nu_0}{2} \right) \right] \\
& \ \ \ \ + \left(\dfrac{\nu_0}{2}+1 \right) \left(\sum_{j=1}^{J} \text{ln} \left[\dfrac{1}{\sigma^2_j} \right] \right) \\
& \ \ \ \ - \nu_0 \left[ \alpha + \dfrac{\sigma_0^2}{2} \sum\limits_{j=1}^{J} \dfrac{1}{\sigma^2_j} \right] \\
\end{split}`
$$
]
]

--

- .block[
.small[
$$
`\begin{split}
& \pi(\sigma_0^2 | \cdots \cdots ) = \mathcal{Ga} \left(\sigma_0^2 ; a_n ,b_n \right) \ \ \ \ \textrm{where}\\
\\
& a_n = a + \dfrac{J \nu_0}{2}; \ \ \ \ b_n = b + \dfrac{\nu_0}{2} \sum\limits_{j=1}^{J} \dfrac{1}{\sigma^2_j}.\\
\end{split}`
$$
]
]

---

## Side notes

- Obviously, as you have seen in the lab, we can simply use Stan (or JAGS, BUGS) to fit these models without needing to do any of this ourselves.

--

- The point here (as you should already know by now) is to learn and understand all the details, including the math!
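---

## Aside: checking the priors by simulation

- One quick way to probe the earlier question about whether these priors are overly informative (our own sketch, not part of the original slides) is to simulate from them and inspect the implied values on the test-score scale.

```r
# Quick prior check for the hyperparameters specified earlier (a sketch).
set.seed(4)
S <- 10000
mu <- rnorm(S, 50, sqrt(25))            # mu ~ N(50, 25)
tau_sq <- 1/rgamma(S, 1/2, 100/2)       # tau^2 ~ IG(1/2, 100/2)
sigma_0_sq <- rgamma(S, 1, 1/100)       # sigma_0^2 ~ Ga(1, 1/100)
quantile(mu, c(0.025, 0.975))           # plausible grand means
quantile(sqrt(tau_sq), c(0.025, 0.975)) # implied between-school SDs
quantile(sigma_0_sq, c(0.025, 0.975))   # implied within-school variances
```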
---

## Gibbs sampler

```r
#Data summaries
J <- length(unique(Y[,"school"]))
ybar <- c(by(Y[,"mathscore"],Y[,"school"],mean))
s_j_sq <- c(by(Y[,"mathscore"],Y[,"school"],var))
n <- c(table(Y[,"school"]))

#Hyperparameters for the priors
mu_0 <- 50
gamma_0_sq <- 25
eta_0 <- 1
tau_0_sq <- 100
alpha <- 1
a <- 1
b <- 1/100

#Grid of values for sampling nu_0
nu_0_grid <- 1:5000

#Initial values for Gibbs sampler
theta <- ybar
sigma_sq <- s_j_sq
mu <- mean(theta)
tau_sq <- var(theta)
nu_0 <- 1
sigma_0_sq <- 100
```

---

## Gibbs sampler

```r
#first set number of iterations and burn-in, then set seed
n_iter <- 10000; burn_in <- 0.3*n_iter
set.seed(1234)

#Set up empty matrices to save samples
SIGMA_SQ <- THETA <- matrix(nrow=n_iter, ncol=J)
OTHER_PAR <- matrix(nrow=n_iter, ncol=4)

#Now, to the Gibbs sampler
for(s in 1:(n_iter+burn_in)){

  #update the theta vector (all the theta_j's)
  tau_j_star <- 1/(n/sigma_sq + 1/tau_sq)
  mu_j_star <- tau_j_star*(ybar*n/sigma_sq + mu/tau_sq)
  theta <- rnorm(J,mu_j_star,sqrt(tau_j_star))

  #update the sigma_sq vector (all the sigma_sq_j's)
  #note: rep(theta,n) relies on the rows of Y being sorted by school
  nu_j_star <- nu_0 + n
  theta_long <- rep(theta,n)
  nu_j_star_sigma_j_sq_star <- nu_0*sigma_0_sq +
    c(by((Y[,"mathscore"] - theta_long)^2,Y[,"school"],sum))
  sigma_sq <- 1/rgamma(J,(nu_j_star/2),(nu_j_star_sigma_j_sq_star/2))

  #update mu
  gamma_n_sq <- 1/(J/tau_sq + 1/gamma_0_sq)
  mu_n <- gamma_n_sq*(J*mean(theta)/tau_sq + mu_0/gamma_0_sq)
  mu <- rnorm(1,mu_n,sqrt(gamma_n_sq))
```

---

## Gibbs sampler

```r
  #update tau_sq
  eta_n <- eta_0 + J
  eta_n_tau_n_sq <- eta_0*tau_0_sq + sum((theta-mu)^2)
  tau_sq <- 1/rgamma(1,eta_n/2,eta_n_tau_n_sq/2)

  #update sigma_0_sq
  sigma_0_sq <- rgamma(1,(a + J*nu_0/2),(b + nu_0*sum(1/sigma_sq)/2))

  #update nu_0
  log_prob_nu_0 <- (J*nu_0_grid/2)*log(nu_0_grid*sigma_0_sq/2) -
    J*lgamma(nu_0_grid/2) +
    (nu_0_grid/2+1)*sum(log(1/sigma_sq)) -
    nu_0_grid*(alpha + sigma_0_sq*sum(1/sigma_sq)/2)
  nu_0 <- sample(nu_0_grid,1, prob = exp(log_prob_nu_0 - max(log_prob_nu_0)) )
  #this last step subtracts the maximum log probability from all the others;
  #it is a neat trick that avoids numerical underflow, since without it, very
  #negative log probabilities would exponentiate to zero
  #note that the sample function will renormalize the probabilities internally

  #save results only past burn-in
  if(s > burn_in){
    THETA[(s-burn_in),] <- theta
    SIGMA_SQ[(s-burn_in),] <- sigma_sq
    OTHER_PAR[(s-burn_in),] <- c(mu,tau_sq,sigma_0_sq,nu_0)
  }
}
colnames(OTHER_PAR) <- c("mu","tau_sq","sigma_0_sq","nu_0")
```

---

## Posterior inference

The blue lines indicate the posterior median and a 95% credible interval for `\(\mu\)`. The red asterisks indicate the data values `\(\bar{y}_j\)`.

<img src="16-hierarchical-models-II_files/figure-html/unnamed-chunk-9-1.png" style="display: block; margin: auto;" />

---

## Posterior inference

Posterior summaries of `\(\sigma^2_j\)`.

<img src="16-hierarchical-models-II_files/figure-html/unnamed-chunk-10-1.png" style="display: block; margin: auto;" />

---

## Posterior inference

Shrinkage as a function of sample size.

```
## n Sample group mean Post. est. of group mean Post. est. of overall mean
## 1 31 50.81355 50.49363 48.10549
## 2 22 46.47955 46.71544 48.10549
## 3 23 48.77696 48.71578 48.10549
## 4 19 47.31632 47.44935 48.10549
## 5 21 36.58286 38.04669 48.10549
```

```
## n Sample group mean Post. est. of group mean Post. est. of overall mean
## 15 12 56.43083 54.67213 48.10549
## 16 23 55.49609 54.72904 48.10549
## 17 7 37.92714 40.86290 48.10549
## 18 14 50.45357 50.03007 48.10549
```

```
## n Sample group mean Post. est. of group mean Post. est. of overall mean
## 67 4 65.01750 56.90436 48.10549
## 68 19 44.74684 45.13522 48.10549
## 69 24 51.86917 51.31079 48.10549
## 70 27 43.47037 43.86470 48.10549
## 71 22 46.70455 46.88374 48.10549
## 72 13 36.95000 38.55704 48.10549
```

---

## How about non-normal models?

- Suppose we have `\(y_{ij} \in \{0,1,\ldots\}\)` being a count for subject `\(i\)` in group `\(j\)`.

--

- For count data, it is natural to use a Poisson likelihood, that is,
.block[
.small[
$$
y_{ij} \sim \text{Poisson}(\theta_j)
$$
]
]

where each `\(\theta_j = \mathbb{E}[y_{ij}]\)` is a group-specific mean.

--

- When there are limited data within each group, it is natural to borrow information (a quick simulation on the next slide illustrates why).

--

- How can we accomplish this with a hierarchical model?

--

- See homework 6 for a similar setup!
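---

## Why borrow? A quick simulation

- A tiny illustration (ours, not from the homework): with very small `\(n_j\)`, the group-specific sample means of counts are noisy, which is exactly when a hierarchical prior on the `\(\theta_j\)`'s helps.

```r
# Toy simulation: group-specific Poisson means with small n_j.
# All values below are made up for illustration.
set.seed(5)
J <- 20; n_j <- sample(2:6, J, replace = TRUE) # very small group sizes
theta_true <- rgamma(J, shape = 10, rate = 2)  # true group means around 5
y <- lapply(1:J, function(j) rpois(n_j[j], theta_true[j]))
theta_hat <- sapply(y, mean)                   # group-specific MLEs
round(range(theta_hat - theta_true), 2)        # errors can be large
```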