class: center, middle, inverse, title-slide

# Hierarchical models I
### Dr. Olanrewaju Michael Akande
### Feb 28, 2020

---

## Announcements

- No HW today.

- Next HW immediately after spring break on Monday, March 16.

--

- Midterm exam next Friday, March 6.

- Practice questions on Sakai later today or tomorrow.

- Review session next Wednesday, March 4.

---

## Outline

- Introduction to hierarchical models

- Shrinkage

- Comparing two groups

- BMI example

- Comparing multiple groups with same variance

---

## Motivation

- Sometimes, we may have a natural grouping in our data, for example
  + students within schools,
  + patients within hospitals,
  + voters within counties or states,
  + biology data, where animals are followed within natural populations organized geographically and, in some cases, socially.

--

- For such grouped data, we may want to do inference across all the groups, for example, comparison of the group means.

--

- Ideally, we should do so in a way that takes advantage of the relationship between observations in the same group, but we should also look to borrow information across groups when possible.

--

- .hlight[Hierarchical modeling] provides a principled way to do so.

---

## Bayes estimators and bias

- Recall the normal model:
.block[
.small[
`$$y_{i} | \mu, \sigma^2 \overset{iid}{\sim} \mathcal{N} \left( \mu, \sigma^2\right).$$`
]
]

--

- The MLE for the population mean `\(\mu\)` is just the sample mean `\(\bar{y}\)`.

--

- `\(\bar{y}\)` is unbiased for `\(\mu\)`. That is, for any data `\(y_i \overset{iid}{\sim} \mathcal{N} \left( \mu, \sigma^2\right)\)`, `\(\mathbb{E}[\bar{y}] = \mu\)`.

--

- However, recall that in the conjugate Normal-Gamma model, for example, the posterior expectation is a **weighted average** of the prior mean and the sample mean.

--

- That is, it is actually biased!

---

## Shrinkage

- Through the weighting of the sample data and the prior, the Bayes procedure tends to pull the estimate of `\(\mu\)` toward the prior mean.

--

- Of course, the magnitude of the pull depends on the sample size.

--

- This "pulling" phenomenon is referred to as .hlight[shrinkage].

--

- <div class="question">
Why would we ever want to do this? Why not just stick with the MLE?
</div>

--

- Well, in part, because shrinkage estimators are often "more accurate" in prediction problems -- i.e. they tend to do a better job of predicting a future outcome or of recovering the actual parameter values. Remember the bias-variance trade-off!

--

- The fact that a biased estimator can do a better job in many prediction problems can be proven rigorously, and is referred to as .hlight[Stein's paradox]. We will see a quick simulation of this shortly.

---

## Modern relevance

- Stein's result implies, in particular, that the sample mean is an *inadmissible* estimator of the mean of a multivariate normal distribution in more than two dimensions -- i.e. there are other estimators that will come closer to the true value in expectation.

--

- In fact, these are Bayes point estimators (the posterior expectation of the parameter `\(\mu\)`).

--

- Much of what we now do in high-dimensional statistics is develop biased estimators that perform better than unbiased ones.

--

- Examples: lasso regression, ridge regression, various kinds of hierarchical Bayesian models, etc.

--

- Today we will get a very basic introduction to .hlight[Bayesian hierarchical models], which provide a formal and coherent framework for constructing shrinkage estimators.
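---

## Aside: Stein's paradox in action

- A quick simulation (not from the course materials; all values here are made up for illustration) makes this concrete: in `\(p > 2\)` dimensions, the James-Stein estimator, which shrinks the observed vector toward zero, has smaller total mean squared error than the MLE.

```r
# Simulate p-dimensional normal observations and compare the risk of the
# MLE (the observation itself) to the James-Stein shrinkage estimator.
set.seed(360)
p <- 10; n_sims <- 5000
theta <- rnorm(p, 0, 1)                    # an arbitrary true mean vector
mse_mle <- mse_js <- numeric(n_sims)
for (s in 1:n_sims) {
  x <- rnorm(p, theta, 1)                  # one observation per coordinate
  js <- (1 - (p - 2)/sum(x^2)) * x         # shrink toward zero
  mse_mle[s] <- sum((x - theta)^2)
  mse_js[s]  <- sum((js - theta)^2)
}
c(MLE = mean(mse_mle), JS = mean(mse_js))  # JS risk should be smaller
```

- Try varying `p`: the gap closes as `p` drops toward 2 and grows with dimension.

---

## Why hierarchical models?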
- The term **Bayesian hierarchical models** is a catch-all phrase for a large class of models whose priors are built from several levels of conditional distributions.

--

- Like simpler one-level priors, they also accomplish shrinkage. However, they are much more flexible.

--

- Why use them? Several reasons:
  + We may want to exploit more complex dependence structures.

--
  + We may have many parameters relative to the amount of data that we have, and want to borrow information in estimating them.

--
  + We may want to shrink toward something other than a simple prior mean/hyper-parameter.

---

## Comparing two groups

- Suppose we want to do inference on mean body mass index (BMI) for two groups (male or female).

--

- BMI is known to often follow a normal distribution, so let's assume the same here.

--

- We should expect some relationship between the mean BMI for the two groups.

--

- We may also think the shapes of the two distributions should be roughly the same (at least as a simplifying assumption for now).

--

- Thus, a reasonable model might be
.block[
.small[
$$
`\begin{split}
y_{i,male} & \overset{iid}{\sim} \mathcal{N} \left(\theta_m, \sigma^2\right); \ \ i = 1, \ldots, n_m;\\
y_{i,female} & \overset{iid}{\sim} \mathcal{N} \left(\theta_f, \sigma^2\right); \ \ i = 1, \ldots, n_f,\\
\end{split}`
$$
]
]

  but with some relationship between `\(\theta_m\)` and `\(\theta_f\)`.

---

## Application

- First, let's do classical inference on such data. The data we will use is in the `\(\textsf{R}\)` package `rethinking`.

```r
#install.packages(c("coda","mvtnorm","devtools","loo","dagitty"))
#library(devtools)
#devtools::install_github("rmcelreath/rethinking",ref="Experimental")
#library(rethinking)
data(Howell1)
Howell1[1:15,]
```

```
##     height   weight  age male
## 1  151.765 47.82561 63.0    1
## 2  139.700 36.48581 63.0    0
## 3  136.525 31.86484 65.0    0
## 4  156.845 53.04191 41.0    1
## 5  145.415 41.27687 51.0    0
## 6  163.830 62.99259 35.0    1
## 7  149.225 38.24348 32.0    0
## 8  168.910 55.47997 27.0    1
## 9  147.955 34.86988 19.0    0
## 10 165.100 54.48774 54.0    1
## 11 154.305 49.89512 47.0    0
## 12 151.130 41.22017 66.0    1
## 13 144.780 36.03221 73.0    0
## 14 149.900 47.70000 20.0    0
## 15 150.495 33.84930 65.3    0
```

---

## Data

- For now, focus on data for individuals under age 15.

```r
htm <- Howell1$height/100
bmi <- Howell1$weight/(htm^2)
y_male <- bmi[Howell1$age<15 & Howell1$male==1]
y_female <- bmi[Howell1$age<15 & Howell1$male==0]
n_m <- length(y_male)
n_f <- length(y_female)
n_f
```

```
## [1] 84
```

```r
n_m
```

```
## [1] 77
```

```r
summary(y_male)
```

```
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   12.07   13.87   14.63   14.84   15.53   18.22
```

```r
summary(y_female)
```

```
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   9.815  13.559  14.305  14.585  15.712  18.741
```

---

## Classical inference

- No significant difference in group means.
```r
t.test(y_male,y_female)
```

```
## 
##  Welch Two Sample t-test
## 
## data:  y_male and y_female
## t = 1.1204, df = 157.87, p-value = 0.2643
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.1947946  0.7054729
## sample estimates:
## mean of x mean of y 
##  14.84037  14.58503
```

---

## Simple weighted estimator

- One parameterization that can reflect some relationship between `\(\theta_m\)` and `\(\theta_f\)` is
.block[
.small[
$$
`\begin{split}
y_{i,male} & \overset{iid}{\sim} \mathcal{N} \left(\mu + \delta, \sigma^2\right); \ \ i = 1, \ldots, n_m;\\
y_{i,female} & \overset{iid}{\sim} \mathcal{N} \left(\mu - \delta, \sigma^2\right); \ \ i = 1, \ldots, n_f,\\
\end{split}`
$$
]
]

--

  where
  + `\(\theta_m = \mu + \delta\)` and `\(\theta_f = \mu - \delta\)`,
  + `\(\mu = \dfrac{\theta_m + \theta_f}{2}\)` is the pooled average, and
  + `\(\delta = \dfrac{\theta_m - \theta_f}{2}\)` is half of the population difference in means.

---

## Simple weighted estimator

- Convenient prior:
  + `\(\pi(\mu, \delta, \sigma^2) = \pi(\mu) \cdot \pi(\delta) \cdot \pi(\sigma^2)\)`, where
    - `\(\pi(\mu) = \mathcal{N}(\mu_0, \gamma_0^2)\)`,
    - `\(\pi(\delta) = \mathcal{N}(\delta_0, \tau_0^2)\)`, and
    - `\(\pi(\sigma^2) = \mathcal{IG}(\dfrac{\nu_0}{2}, \dfrac{\nu_0 \sigma_0^2}{2})\)`.

--

- We will set the hyper-parameters as:
  + `\(\mu_0 = 15, \gamma_0 = 5\)`,
  + `\(\delta_0 = 0, \tau_0 = 3\)`,
  + `\(\nu_0 = 1, \sigma_0 = 5\)`.

--

- <div class="question">
Do these values seem reasonable to you?
</div>

---

## Simple weighted estimator

- Note that we can rewrite
.block[
.small[
$$
`\begin{split}
y_{i,male} & \overset{iid}{\sim} \mathcal{N} \left(\mu + \delta, \sigma^2\right); \ \ i = 1, \ldots, n_m;\\
y_{i,female} & \overset{iid}{\sim} \mathcal{N} \left(\mu - \delta, \sigma^2\right); \ \ i = 1, \ldots, n_f\\
\end{split}`
$$
]
]

  as
.block[
.small[
$$
`\begin{split}
(y_{i,male} - \delta) & \overset{iid}{\sim} \mathcal{N} \left(\mu, \sigma^2\right); \ \ i = 1, \ldots, n_m;\\
(y_{i,female} + \delta) & \overset{iid}{\sim} \mathcal{N} \left(\mu, \sigma^2\right); \ \ i = 1, \ldots, n_f\\
\end{split}`
$$
]
]

--

  or
.block[
.small[
$$
`\begin{split}
(y_{i,male} - \mu) & \overset{iid}{\sim} \mathcal{N} \left(\delta, \sigma^2\right); \ \ i = 1, \ldots, n_m;\\
(-1) (y_{i,female} - \mu) & \overset{iid}{\sim} \mathcal{N} \left(\delta, \sigma^2\right); \ \ i = 1, \ldots, n_f,\\
\end{split}`
$$
]
]

  as needed, so we can leverage past results for the full conditionals.

---

## Full conditionals

- For the full conditionals we will derive today, we will take advantage of previous results from the regular univariate normal model.
--

- Recall that if we assume
.block[
.small[
`$$y_i \sim \mathcal{N}(\mu, \sigma^2), \ \ i=1, \ldots, n,$$`
]
]

--

  and set our priors to be
.block[
.small[
$$
`\begin{split}
\pi(\mu) & = \mathcal{N}\left(\mu_0, \gamma_0^2\right);\\
\pi(\sigma^2) & = \mathcal{IG}\left(\dfrac{\nu_0}{2}, \dfrac{\nu_0 \sigma_0^2}{2}\right),\\
\end{split}`
$$
]
]

--

  then we have
.block[
.small[
$$
`\begin{split}
\pi(\mu, \sigma^2 | Y) & \boldsymbol{\propto} \left\{ \prod_{i=1}^{n} p(y_{i} | \mu, \sigma^2 ) \right\} \cdot \pi(\mu) \cdot \pi(\sigma^2). \\
\end{split}`
$$
]
]

---

## Full conditionals

- We have
.block[
.small[
$$
`\begin{split}
\pi(\mu | \sigma^2, Y) & = \mathcal{N}\left(\mu_n, \gamma_n^2\right),\\
\end{split}`
$$
]
]

  where
.block[
.small[
$$
`\begin{split}
\gamma^2_n = \dfrac{1}{ \dfrac{n}{\sigma^2} + \dfrac{1}{\gamma_0^2} }; \ \ \ \ \ \ \ \ \mu_n & = \gamma^2_n \left[ \dfrac{n}{\sigma^2} \bar{y} + \dfrac{1}{\gamma_0^2} \mu_0 \right],
\end{split}`
$$
]
]

--

- and
.block[
.small[
$$
`\begin{split}
\pi(\sigma^2 | \mu,Y) \boldsymbol{=} \mathcal{IG}\left(\dfrac{\nu_n}{2}, \dfrac{\nu_n\sigma_n^2}{2}\right),
\end{split}`
$$
]
]

  where
.block[
.small[
$$
`\begin{split}
\nu_n & = \nu_0 + n; \ \ \ \ \ \ \ \sigma_n^2 = \dfrac{1}{\nu_n} \left[ \nu_0 \sigma_0^2 + \sum^n_{i=1} (y_i - \mu)^2 \right].\\
\end{split}`
$$
]
]

---

## Full conditionals

- With `\(\pi(\mu) = \mathcal{N}(\mu_0, \gamma_0^2)\)`, we have
.block[
.small[
$$
`\begin{split}
\mu | Y, \delta, \sigma^2 & \sim \mathcal{N}(\mu_n, \gamma_n^2), \ \ \ \text{where}\\
\\
\gamma_n^2 & = \dfrac{1}{\dfrac{1}{\gamma_0^2} + \dfrac{n_m + n_f}{\sigma^2} }\\
\\
\mu_n & = \gamma_n^2 \left[\dfrac{\mu_0}{\gamma_0^2} + \dfrac{ n_m \overline{(y_{i,male} - \delta)} + n_f \overline{(y_{i,female} + \delta)} }{\sigma^2} \right],\\
\end{split}`
$$
]
]

  where
  + `\(\overline{(y_{i,male} - \delta)} = \dfrac{1}{n_m} \sum\limits_{i=1}^{n_m} (y_{i,male} - \delta)\)`, and
  + `\(\overline{(y_{i,female} + \delta)} = \dfrac{1}{n_f} \sum\limits_{i=1}^{n_f} (y_{i,female} + \delta)\)`.

---

## Full conditionals

- With `\(\pi(\delta) = \mathcal{N}(\delta_0, \tau_0^2)\)`, we have
.block[
.small[
$$
`\begin{split}
\delta | Y, \mu, \sigma^2 & \sim \mathcal{N}(\delta_n, \tau_n^2), \ \ \ \text{where}\\
\\
\tau_n^2 & = \dfrac{1}{\dfrac{1}{\tau_0^2} + \dfrac{n_m + n_f}{\sigma^2} }\\
\\
\delta_n & = \tau_n^2 \left[\dfrac{\delta_0}{\tau_0^2} + \dfrac{n_m \overline{(y_{i,male} - \mu)} - n_f \overline{(y_{i,female} - \mu)} }{\sigma^2} \right],\\
\end{split}`
$$
]
]

  where
  + `\(\overline{(y_{i,male} - \mu)} = \dfrac{1}{n_m} \sum\limits_{i=1}^{n_m} (y_{i,male} - \mu)\)`, and
  + `\(\overline{(y_{i,female} - \mu)} = \dfrac{1}{n_f} \sum\limits_{i=1}^{n_f} (y_{i,female} - \mu)\)`.

  (A quick numerical sanity check of this derivation follows on the next slide.)
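---

## Full conditionals: a sanity check

- A derivation like this is easy to check numerically. The sketch below is not from the course materials; all data and parameter values are made up for illustration. It evaluates the unnormalized conditional posterior of `\(\delta\)` on a grid and compares it to the derived `\(\mathcal{N}(\delta_n, \tau_n^2)\)` density.

```r
# Brute-force check of the delta full conditional on simulated data.
# The "_chk" suffixes avoid clobbering objects used elsewhere in the slides.
set.seed(42)
ym_chk <- rnorm(30, 15.2, 1.5); yf_chk <- rnorm(25, 14.8, 1.5)
nm_chk <- length(ym_chk); nf_chk <- length(yf_chk)
mu_chk <- 15; s2_chk <- 2.25       # values we condition on
d0_chk <- 0; t02_chk <- 9          # prior on delta

# derived full conditional parameters
t2n_chk <- 1/(1/t02_chk + (nm_chk + nf_chk)/s2_chk)
dn_chk  <- t2n_chk*(d0_chk/t02_chk +
                    (sum(ym_chk - mu_chk) - sum(yf_chk - mu_chk))/s2_chk)

# likelihood x prior, evaluated on a grid and normalized
grid <- seq(-2, 2, length.out = 400)
logpost <- sapply(grid, function(d)
  sum(dnorm(ym_chk, mu_chk + d, sqrt(s2_chk), log = TRUE)) +
  sum(dnorm(yf_chk, mu_chk - d, sqrt(s2_chk), log = TRUE)) +
  dnorm(d, d0_chk, sqrt(t02_chk), log = TRUE))
post <- exp(logpost - max(logpost))
post <- post/(sum(post)*(grid[2] - grid[1]))

# discrepancy between the grid density and the derived normal density;
# should be close to zero (up to grid error)
max(abs(post - dnorm(grid, dn_chk, sqrt(t2n_chk))))
```

- The maximum discrepancy should be numerically negligible, confirming the algebra before we rely on it in a Gibbs sampler.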
---

## Full conditionals

- With `\(\pi(\sigma^2) = \mathcal{IG}(\dfrac{\nu_0}{2}, \dfrac{\nu_0 \sigma_0^2}{2})\)`, we have
.block[
.small[
$$
`\begin{split}
\sigma^2 | Y, \mu, \delta & \sim \mathcal{IG}(\dfrac{\nu_n}{2}, \dfrac{\nu_n \sigma_n^2}{2}), \ \ \ \text{where}\\
\\
\nu_n & = \nu_0 + n_m + n_f\\
\\
\sigma_n^2 & = \dfrac{1}{\nu_n} \left[\nu_0\sigma_0^2 + \sum\limits_{i=1}^{n_m} (y_{i,male} - [\mu + \delta])^2 + \sum\limits_{i=1}^{n_f} (y_{i,female} - [\mu - \delta])^2 \right].\\
\end{split}`
$$
]
]

---

## Application to data

```r
#priors
mu0 <- 15; gamma02 <- 5^2
delta0 <- 0; tau02 <- 3^2
nu0 <- 1; sigma02 <- 5^2

#starting values
mu <- (mean(y_male) + mean(y_female))/2
delta <- (mean(y_male) - mean(y_female))/2
#no need for starting values for sigma_squared, we can sample it first

MU <- DELTA <- SIGMA2 <- NULL
```

---

## Application to data

```r
#set seed
set.seed(1234)

#set number of iterations and burn-in
n_iter <- 10000; burn_in <- 0.2*n_iter

##Gibbs sampler
for (s in 1:(n_iter+burn_in)) {
  #update sigma2
  sigma2 <- 1/rgamma(1,(nu0 + n_m + n_f)/2,
                     (nu0*sigma02 + sum((y_male-mu-delta)^2) +
                      sum((y_female-mu+delta)^2))/2)

  #update mu
  gamma2n <- 1/(1/gamma02 + (n_m + n_f)/sigma2)
  mun <- gamma2n*(mu0/gamma02 + sum(y_male-delta)/sigma2 +
                  sum(y_female+delta)/sigma2)
  mu <- rnorm(1,mun,sqrt(gamma2n))

  #update delta
  tau2n <- 1/(1/tau02 + (n_m+n_f)/sigma2)
  deltan <- tau2n*(delta0/tau02 + sum(y_male-mu)/sigma2 -
                   sum(y_female-mu)/sigma2)
  delta <- rnorm(1,deltan,sqrt(tau2n))

  #save parameter values
  MU <- c(MU,mu); DELTA <- c(DELTA,delta); SIGMA2 <- c(SIGMA2,sigma2)
}
```

---

## Posterior summaries

```r
#library(coda)
MU.mcmc <- mcmc(MU,start=1)
summary(MU.mcmc)
```

```
## 
## Iterations = 1:12000
## Thinning interval = 1 
## Number of chains = 1 
## Sample size per chain = 12000 
## 
## 1. Empirical mean and standard deviation for each variable,
##    plus standard error of the mean:
## 
##           Mean             SD       Naive SE Time-series SE 
##      14.712517       0.118765       0.001084       0.001089 
## 
## 2. Quantiles for each variable:
## 
##  2.5%   25%   50%   75% 97.5% 
## 14.48 14.63 14.71 14.79 14.95
```

```r
(mean(y_male) + mean(y_female))/2 #compare to data
```

```
## [1] 14.7127
```

---

## Posterior summaries

```r
DELTA.mcmc <- mcmc(DELTA,start=1)
summary(DELTA.mcmc)
```

```
## 
## Iterations = 1:12000
## Thinning interval = 1 
## Number of chains = 1 
## Sample size per chain = 12000 
## 
## 1. Empirical mean and standard deviation for each variable,
##    plus standard error of the mean:
## 
##           Mean             SD       Naive SE Time-series SE 
##       0.127657       0.119522       0.001091       0.001091 
## 
## 2. Quantiles for each variable:
## 
##     2.5%      25%      50%      75%    97.5% 
## -0.10691  0.04791  0.12743  0.20796  0.36407
```

```r
summary((2*DELTA)) #rescale as difference in group means
```

```
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -0.63464  0.09582  0.25487  0.25531  0.41592  1.23660
```

```r
mean(y_male) - mean(y_female) #compare to data
```

```
## [1] 0.2553392
```

---

## Posterior summaries

```r
SIGMA2.mcmc <- mcmc(SIGMA2,start=1)
summary(SIGMA2.mcmc)
```

```
## 
## Iterations = 1:12000
## Thinning interval = 1 
## Number of chains = 1 
## Sample size per chain = 12000 
## 
## 1. Empirical mean and standard deviation for each variable,
##    plus standard error of the mean:
## 
##           Mean             SD       Naive SE Time-series SE 
##       2.287927       0.257689       0.002352       0.002352 
## 
## 2. Quantiles for each variable:
## 
##  2.5%   25%   50%   75% 97.5% 
## 1.833 2.107 2.272 2.455 2.841
```

---

## Diagnostics

```r
plot(MU.mcmc)
```

<img src="15-hierarchical-models-I_files/figure-html/unnamed-chunk-10-1.png" style="display: block; margin: auto;" />

---

## Diagnostics

```r
autocorr.plot(MU.mcmc)
```

<img src="15-hierarchical-models-I_files/figure-html/unnamed-chunk-11-1.png" style="display: block; margin: auto;" />

---

## Diagnostics

```r
plot(DELTA.mcmc)
```

<img src="15-hierarchical-models-I_files/figure-html/unnamed-chunk-12-1.png" style="display: block; margin: auto;" />

---

## Diagnostics

```r
autocorr.plot(DELTA.mcmc)
```

<img src="15-hierarchical-models-I_files/figure-html/unnamed-chunk-13-1.png" style="display: block; margin: auto;" />

---

## Diagnostics

```r
plot(SIGMA2.mcmc)
```

<img src="15-hierarchical-models-I_files/figure-html/unnamed-chunk-14-1.png" style="display: block; margin: auto;" />

---

## Diagnostics

```r
autocorr.plot(SIGMA2.mcmc)
```

<img src="15-hierarchical-models-I_files/figure-html/unnamed-chunk-15-1.png" style="display: block; margin: auto;" />

---

## Application to data

- Posterior probability that boys have larger average BMI than girls is 0.86!

- Posterior medians and 95% credible intervals for the group means are actually quite similar to the unpooled (gender-specific) intervals from classical inference.

```r
#mean for boys
quantile((MU+DELTA),probs=c(0.025,0.5,0.975))
```

```
##     2.5%      50%    97.5% 
## 14.50255 14.84146 15.17925
```

```r
#mean for girls
quantile((MU-DELTA),probs=c(0.025,0.5,0.975))
```

```
##     2.5%      50%    97.5% 
## 14.26848 14.58276 14.90761
```

```r
#posterior probability boys have larger BMI than girls
mean(DELTA > 0)
```

```
## [1] 0.8571667
```

---

## Application to data

- Let's look at a different sub-population. For older individuals (age `\(> 75\)`), we only have 8 males and 4 females.

```r
y_male <- bmi[Howell1$age > 75 & Howell1$male==1]
y_female <- bmi[Howell1$age > 75 & Howell1$male==0]
n_m <- length(y_male)
n_f <- length(y_female)
n_m
```

```
## [1] 8
```

```r
n_f
```

```
## [1] 4
```

---

## Application to data

- A 95% confidence interval for the difference between genders in BMI (estimated as 0.24) is (-4.20, 4.68).

```r
mean(y_male) - mean(y_female)
```

```
## [1] 0.2408966
```

```r
t.test(y_male,y_female)
```

```
## 
##  Welch Two Sample t-test
## 
## data:  y_male and y_female
## t = 0.13801, df = 5.1869, p-value = 0.8954
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -4.197948  4.679741
## sample estimates:
## mean of x mean of y 
##  18.06751  17.82662
```

---

## Application to data

- Let's apply the shrinkage model with these priors:
  + `\(\mu_0 = 18, \gamma_0 = 5\)`,
  + `\(\delta_0 = 0, \tau_0 = 3\)`,
  + `\(\nu_0 = 1, \sigma_0 = 5\)`.

--

- Using the shrinkage model, the posterior mean is 0.25 with 95% CI (-3.45, 3.88).

```r
mean((DELTA*2))
```

```
## [1] 0.2493733
```

```r
quantile((DELTA*2),probs=c(0.025,0.5,0.975))
```

```
##       2.5%        50%      97.5% 
## -3.4466931  0.2758598  3.8762543
```

- Our precision has been improved by borrowing information across the groups. Of course, the prior is important here given the sample sizes.

---

## Comparing multiple groups

- Suppose we wish to investigate the mean (and distribution) of test scores for students at `\(J\)` different high schools.

--

- In each school `\(j\)`, where `\(j = 1, \ldots, J\)`, suppose we test a random sample of `\(n_j\)` students.
--

- Let `\(y_{ij}\)` be the test score for the `\(i\)`th student in school `\(j\)`, with `\(i = 1,\ldots, n_j\)`, with
.block[
.small[
`$$y_{ij} | \theta_j, \sigma^2_j \sim \mathcal{N} \left( \theta_j, \sigma^2_j\right)$$`
]
]

  where for each school `\(j\)`, `\(\theta_j\)` is the school-wide average test score, and `\(\sigma^2_j\)` is the school-wide variance of individual test scores.

--

- This is what we did for the Pygmalion study, the job training data, and the science classroom exercise on homework 3.

---

## School testing example

- .hlight[Option I]: classical inference for each school can be based on the large-sample 95% CI: `\(\bar{y}_j \pm 1.96 \sqrt{s^2_j/n_j}\)`, where `\(\bar{y}_j\)` is the sample average in school `\(j\)`, and `\(s^2_j\)` is the sample variance in school `\(j\)`.

--

- Clearly, we can overfit the data within schools; for example, what if we only have 4 students from one of the schools? `\(\bar{y}_j\)` can be a good estimate if `\(n_j\)` is large, but it may be poor if `\(n_j\)` is small.

--

- .hlight[Option II]: alternatively, we might believe that `\(\theta_j = \mu\)` for all `\(j\)`; that is, all schools have the same mean. This is the assumption (null hypothesis) in ANOVA models, for example. We can also set `\(\sigma^2_j = \sigma^2\)` for all `\(j\)`.

--

- Option I ignores that the `\(\theta_j\)`'s should be reasonably similar, whereas option II ignores any differences between them.

--

- It would be nice to find a compromise! Borrowing information across groups, and shrinking our estimates toward a **grand mean**, could be very useful here.

---

## School testing example

- For the Pygmalion study and job training data, we focused on using priors that are independent between the groups.

--

- For example, in the conjugate case, we would have
.block[
.small[
$$
`\begin{split}
\pi(\theta_j|\sigma^2_j) & = \mathcal{N}\left(\mu_0, \dfrac{\sigma^2_j}{\kappa_0}\right)\\
\pi(\sigma^2_j) & = \mathcal{IG} \left(\dfrac{\nu_0}{2}, \dfrac{\nu_0\sigma_0^2}{2}\right)\\
\end{split}`
$$
]
]

  for some hyperparameters (constants) `\(\mu_0\)`, `\(\kappa_0\)`, `\(\nu_0\)`, and `\(\sigma_0^2\)`.

--

- In the semi-conjugate case,
.block[
.small[
$$
`\begin{split}
\pi(\theta_j) & = \mathcal{N}\left(\mu_0, \sigma^2_0\right)\\
\pi(\sigma^2_j) & = \mathcal{IG} \left(\dfrac{\nu_0}{2}, \dfrac{\nu_0\gamma_0^2}{2}\right)\\
\end{split}`
$$
]
]

  for some hyperparameters (constants) `\(\mu_0\)`, `\(\sigma^2_0\)`, `\(\nu_0\)`, and `\(\gamma_0^2\)`.

---

## Hierarchical normal model

- Instead, we can assume that the `\(\theta_j\)`'s are drawn from a common distribution: conceive of the schools themselves as being a random sample from the population of all possible schools.

--

- For now, assume the variance is constant across schools. The hierarchical normal model assumes normal sampling models both within and between groups:
.block[
.small[
$$
`\begin{split}
y_{ij} | \theta_j, \sigma^2 & \sim \mathcal{N} \left( \theta_j, \sigma^2\right); \ \ \ i = 1,\ldots, n_j\\
\theta_j | \mu, \tau^2 & \sim \mathcal{N} \left( \mu, \tau^2 \right); \ \ \ j = 1, \ldots, J,\\
\end{split}`
$$
]
]

  which gives us an extra level in the prior on the means, leading to sharing of information across the groups in estimating the group-specific means.

--

- We have an extra variance parameter `\(\tau^2\)`. Comparing `\(\tau^2\)` to `\(\sigma^2\)` tells us how much of the variation in `\(Y\)` is due to within-group versus between-group variation (see the simulation sketch on the next slide).
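---

## Hierarchical normal model: simulation

- A short simulation can make the two-level structure concrete. All values below are made up for illustration, not estimated from any data; the ratio `\(\tau^2/(\tau^2 + \sigma^2)\)` is the share of total variation that is between groups.

```r
# Simulate from the hierarchical normal model:
# school means around a grand mean, then students around school means.
set.seed(360)
J_sim <- 20
n_sim <- sample(5:30, J_sim, replace = TRUE)       # students per school
mu_sim <- 50; tau2_sim <- 25; sigma2_sim <- 100    # hypothetical truth

theta_sim <- rnorm(J_sim, mu_sim, sqrt(tau2_sim))  # between-group draws
y_sim <- lapply(1:J_sim, function(j)
  rnorm(n_sim[j], theta_sim[j], sqrt(sigma2_sim))) # within-group draws

# share of total variance attributable to schools
tau2_sim/(tau2_sim + sigma2_sim)                   # = 25/125 = 0.2
```

- With these values, about 20% of the variability in scores is between schools; the rest is student-to-student noise within schools.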
---

## Hierarchical normal model

- Standard semi-conjugate priors are given by
.block[
.small[
$$
`\begin{split}
\pi(\mu) & = \mathcal{N}\left(\mu_0, \gamma^2_0\right)\\
\pi(\sigma^2) & = \mathcal{IG} \left(\dfrac{\nu_0}{2}, \dfrac{\nu_0\sigma_0^2}{2}\right)\\
\pi(\tau^2) & = \mathcal{IG} \left(\dfrac{\eta_0}{2}, \dfrac{\eta_0\tau_0^2}{2}\right),\\
\end{split}`
$$
]
]

--

  with
  + `\(\mu_0\)`: best guess of the average of school averages
  + `\(\gamma^2_0\)`: set based on plausible ranges of values of `\(\mu\)`

--
  + `\(\tau_0^2\)`: best guess of the variance of school averages
  + `\(\eta_0\)`: set based on how tight the prior for `\(\tau^2\)` is around `\(\tau_0^2\)`

--
  + `\(\sigma_0^2\)`: best guess of the variance of individual test scores around their respective school means
  + `\(\nu_0\)`: set based on how tight the prior for `\(\sigma^2\)` is around `\(\sigma_0^2\)`.

---

## Exchangeability

- This model relies heavily on exchangeability across units at each level.

--

- For example, we assume the schools are a random sample from the population of all schools, and the students within schools are a random sample of all the students in each school.

--

- This is not always completely true.

--

- Note: we can allow the variance to vary across schools if desired (and we will soon, in fact).

---

## Exchangeability

- It turns out that **conditional exchangeability** would be enough if we control for relevant variables in our modeling.

--

- For example, the schools in Chapel Hill/Carrboro are not entirely exchangeable.

--

- For example, Phoenix Academy is for students on long-term out-of-school suspension or who need to make up work due to extended absences (e.g., pregnancy), and Memorial Hospital School is for children battling serious illnesses.

--

- However, if we condition on school type (public, charter, private, special services, home), the schools may then be exchangeable.

---

## Posterior inference

- Recall the model is
.block[
.small[
$$
`\begin{split}
y_{ij} | \theta_j, \sigma^2 & \sim \mathcal{N} \left( \theta_j, \sigma^2\right); \ \ \ i = 1,\ldots, n_j\\
\theta_j | \mu, \tau^2 & \sim \mathcal{N} \left( \mu, \tau^2 \right); \ \ \ j = 1, \ldots, J.\\
\end{split}`
$$
]
]

--

- Under our prior specification, we can factor the posterior as follows:
.block[
.small[
$$
`\begin{split}
\pi(\theta_1, \ldots, \theta_J, \mu, \sigma^2,\tau^2 | Y) & \boldsymbol{\propto} p(y | \theta_1, \ldots, \theta_J, \mu, \sigma^2,\tau^2)\\
& \ \ \ \ \times p(\theta_1, \ldots, \theta_J | \mu, \sigma^2,\tau^2)\\
& \ \ \ \ \times \pi(\mu, \sigma^2,\tau^2)\\
\\
& \boldsymbol{=} p(y | \theta_1, \ldots, \theta_J, \sigma^2 )\\
& \ \ \ \ \times p(\theta_1, \ldots, \theta_J | \mu,\tau^2)\\
& \ \ \ \ \times \pi(\mu) \cdot \pi(\sigma^2) \cdot \pi(\tau^2)\\
\\
& \boldsymbol{=} \left\{ \prod_{j=1}^{J} \prod_{i=1}^{n_j} p(y_{ij} | \theta_j, \sigma^2 ) \right\}\\
& \ \ \ \ \times \left\{ \prod_{j=1}^{J} p(\theta_j | \mu,\tau^2) \right\}\\
& \ \ \ \ \times\pi(\mu) \cdot \pi(\sigma^2) \cdot \pi(\tau^2)\\
\end{split}`
$$
]
]

---

## Full conditional for grand mean

- The full conditional distribution of `\(\mu\)` is proportional to the part of the joint posterior `\(\pi(\theta_1, \ldots, \theta_J, \mu, \sigma^2,\tau^2 | Y)\)` that involves `\(\mu\)`.

--

- That is,
.block[
.small[
$$
`\begin{split}
\pi(\mu | \theta_1, \ldots, \theta_J, \sigma^2,\tau^2, Y) & \boldsymbol{\propto} \left\{ \prod_{j=1}^{J} p(\theta_j | \mu,\tau^2) \right\} \cdot \pi(\mu).
\end{split}`
$$
]
]

--

- This looks like the full conditional distribution from the one-sample normal case, so you can show that
.block[
.small[
$$
`\begin{split}
\pi(\mu | \theta_1, \ldots, \theta_J, \sigma^2,\tau^2, Y) & = \mathcal{N}\left(\mu_n, \gamma^2_n \right) \ \ \ \ \textrm{where}\\
\\
\gamma^2_n = \dfrac{1}{ \dfrac{J}{\tau^2} + \dfrac{1}{\gamma_0^2} } ; \ \ \ \ \ \ \ \ \mu_n = \gamma^2_n \left[ \dfrac{J}{\tau^2} \bar{\theta} + \dfrac{1}{\gamma_0^2} \mu_0 \right]
\end{split}`
$$
]
]

  and `\(\bar{\theta} = \frac{1}{J} \sum\limits^J_{j=1} \theta_j\)`.

---

## Full conditionals for group means

- Similarly, the full conditional distribution of each `\(\theta_j\)` is proportional to the part of the joint posterior `\(\pi(\theta_1, \ldots, \theta_J, \mu, \sigma^2,\tau^2 | Y)\)` that involves `\(\theta_j\)`.

--

- That is,
.block[
.small[
$$
`\begin{split}
\pi(\theta_j | \mu, \sigma^2,\tau^2, Y) & \boldsymbol{\propto} \left\{ \prod_{i=1}^{n_j} p(y_{ij} | \theta_j, \sigma^2 ) \right\} \cdot p(\theta_j | \mu,\tau^2) \\
\end{split}`
$$
]
]

--

- Those terms include a normal for `\(\theta_j\)` multiplied by a product of normals in which `\(\theta_j\)` is the mean, again mirroring the one-sample case, so you can show that
.block[
.small[
$$
`\begin{split}
\pi(\theta_j | \mu, \sigma^2,\tau^2, Y) & = \mathcal{N}\left(\theta_j^\star, \nu_j^\star \right) \ \ \ \ \textrm{where}\\
\\
\nu_j^\star & = \dfrac{1}{ \dfrac{n_j}{\sigma^2} + \dfrac{1}{\tau^2} } ; \ \ \ \ \ \ \ \theta_j^\star = \nu_j^\star \left[ \dfrac{n_j}{\sigma^2} \bar{y}_j + \dfrac{1}{\tau^2} \mu \right]
\end{split}`
$$
]
]

---

## Full conditionals for group means

- Our estimate for each `\(\theta_j\)` is a weighted average of `\(\bar{y}_j\)` and `\(\mu\)`, ensuring that we are borrowing information across all levels through `\(\mu\)` and `\(\tau^2\)`.

--

- The weights for the weighted average are determined by the relative precisions from the data and from the second-level model.

--

- Groups with smaller `\(n_j\)` have estimated `\(\theta_j^\star\)` closer to `\(\mu\)` than schools with larger `\(n_j\)`.

--

- Thus, the degree of shrinkage of `\(\theta_j\)` depends on the ratio of within-group to between-group variances.

---

## Full conditionals for across-group variance

- The full conditional distribution of `\(\tau^2\)` is proportional to the part of the joint posterior `\(\pi(\theta_1, \ldots, \theta_J, \mu, \sigma^2,\tau^2 | Y)\)` that involves `\(\tau^2\)`.

--

- That is,
.block[
.small[
$$
`\begin{split}
\pi(\tau^2 | \theta_1, \ldots, \theta_J, \mu, \sigma^2, Y) & \boldsymbol{\propto} \left\{ \prod_{j=1}^{J} p(\theta_j | \mu,\tau^2) \right\} \cdot \pi(\tau^2)\\
\end{split}`
$$
]
]

--

- As in the case for `\(\mu\)`, this looks like the one-sample normal problem, and our full conditional posterior is
.block[
.small[
$$
`\begin{split}
\pi(\tau^2 | \theta_1, \ldots, \theta_J, \mu, \sigma^2, Y) & = \mathcal{IG} \left(\dfrac{\eta_n}{2}, \dfrac{\eta_n\tau_n^2}{2}\right) \ \ \ \ \textrm{where}\\
\\
\eta_n = \eta_0 + J ; \ \ \ \ \ \ \ \tau_n^2 & = \dfrac{1}{\eta_n} \left[\eta_0\tau_0^2 + \sum\limits_{j=1}^{J} (\theta_j - \mu)^2 \right].\\
\end{split}`
$$
]
]

---

## Full conditionals for within-group variance

- Finally, the full conditional distribution of `\(\sigma^2\)` is proportional to the part of the joint posterior `\(\pi(\theta_1, \ldots, \theta_J, \mu, \sigma^2,\tau^2 | Y)\)` that involves `\(\sigma^2\)`.
--

- That is,
.block[
.small[
$$
`\begin{split}
\pi(\sigma^2 | \theta_1, \ldots, \theta_J, \mu, \tau^2, Y) & \boldsymbol{\propto} \left\{ \prod_{j=1}^{J} \prod_{i=1}^{n_j} p(y_{ij} | \theta_j, \sigma^2 ) \right\} \cdot \pi(\sigma^2)\\
\end{split}`
$$
]
]

--

- We can again take advantage of the one-sample normal problem, so that our full conditional posterior is
.block[
.small[
$$
`\begin{split}
\pi(\sigma^2 | \theta_1, \ldots, \theta_J, \mu, \tau^2, Y) & = \mathcal{IG} \left(\dfrac{\nu_n}{2}, \dfrac{\nu_n\sigma_n^2}{2}\right) \ \ \ \ \textrm{where}\\
\\
\nu_n = \nu_0 + \sum\limits_{j=1}^{J} n_j ; \ \ \ \ \ \ \ \sigma_n^2 & = \dfrac{1}{\nu_n} \left[\nu_0\sigma_0^2 + \sum\limits_{j=1}^{J}\sum\limits_{i=1}^{n_j} (y_{ij} - \theta_j)^2 \right].\\
\end{split}`
$$
]
]
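---

## Putting the pieces together

- The four full conditionals translate directly into a Gibbs sampler. Below is a minimal sketch, not code from the course materials: the data are simulated and all hyperparameter values are illustrative assumptions.

```r
# Gibbs sampler for the hierarchical normal model, cycling through the
# full conditionals for theta_1..theta_J, mu, tau2, and sigma2.
set.seed(360)
J <- 20; n_j <- sample(5:30, J, replace = TRUE)
theta_true <- rnorm(J, 50, 5)
y <- lapply(1:J, function(j) rnorm(n_j[j], theta_true[j], 10))
ybar <- sapply(y, mean); N <- sum(n_j)

#hyperparameters (illustrative)
mu0 <- 50; gamma02 <- 25^2
eta0 <- 1; tau02 <- 100
nu0 <- 1; sigma02 <- 100

#starting values
theta <- ybar; mu <- mean(ybar); tau2 <- var(ybar); sigma2 <- 100

n_iter <- 5000   #no burn-in dropped here, for brevity
THETA <- matrix(NA, n_iter, J)
MU <- TAU2 <- SIGMA2 <- numeric(n_iter)
for (s in 1:n_iter) {
  #update each theta_j from N(theta_star_j, nu_star_j)
  nu_star <- 1/(n_j/sigma2 + 1/tau2)
  theta <- rnorm(J, nu_star*(n_j*ybar/sigma2 + mu/tau2), sqrt(nu_star))

  #update mu from N(mu_n, gamma2_n)
  gamma2n <- 1/(J/tau2 + 1/gamma02)
  mu <- rnorm(1, gamma2n*(J*mean(theta)/tau2 + mu0/gamma02), sqrt(gamma2n))

  #update tau2 from its inverse-gamma full conditional
  tau2 <- 1/rgamma(1, (eta0 + J)/2, (eta0*tau02 + sum((theta - mu)^2))/2)

  #update sigma2 from its inverse-gamma full conditional
  ss <- sum(sapply(1:J, function(j) sum((y[[j]] - theta[j])^2)))
  sigma2 <- 1/rgamma(1, (nu0 + N)/2, (nu0*sigma02 + ss)/2)

  #save parameter values
  THETA[s,] <- theta; MU[s] <- mu; TAU2[s] <- tau2; SIGMA2[s] <- sigma2
}
colMeans(cbind(mu = MU, tau2 = TAU2, sigma2 = SIGMA2))
```

- Compare `colMeans(THETA)` to the raw group means `ybar`: the groups with the smallest `n_j` should be pulled hardest toward the estimate of `\(\mu\)`, which is exactly the shrinkage behavior derived above.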