STA 602L Lab 8: Hierarchical modeling

STA 602L: Bayesian and Modern Statistics

Mar 23, 2020

Due: 5:00pm, Mar 24, 2020

Housekeeping

R/RStudio

You all should have R and RStudio installed on your computers by now. If you do not, first install the latest version of R here: https://cran.rstudio.com (remember to select the right installer for your operating system). Next, install the latest version of RStudio here: https://www.rstudio.com/products/rstudio/download/. Scroll down to the “Installers for Supported Platforms” section and find the right installer for your operating system.

R Markdown

You are required to use R Markdown to type up this lab report. If you do not already know how to use R Markdown, here is a very basic R Markdown template: https://sta-602l-s20.github.io/Course-Website/labs/resources/LabReport.Rmd. Refer to the resources tab of the course website (here: https://sta-602l-s20.github.io/Course-Website/resources/) for links to help you learn how to use R Markdown.

Gradescope

You MUST submit both your .Rmd and .pdf files to the course site on Gradescope here: https://www.gradescope.com/courses/77790/assignments. Make sure to knit to pdf and not html; ask the TA about knitting to pdf if you cannot figure it out. Be sure to submit under the right assignment entry.

Getting started

You will need the following R packages. If you do not already have them installed, please do so first using the install.packages function.
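
The package-loading chunk was not echoed in this document; based on the functions used below, a reasonable set (an inference, not the original chunk) is:

library(mlmRev)    # provides the Gcsemv data
library(rstanarm)  # stan_glm, stan_lmer, loo, loo_compare
library(ggplot2)   # plotting
library(dplyr)     # data manipulation (%>%, mutate, group_by)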

If you run into any problems running any of the functions in the rstanarm package, first uninstall both the rstanarm and rstan packages. To be on the safe side, it is sometimes necessary to remove any existing RStan installation as well.
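
The removal code was not included in this document; a common approach (an assumption, echoing RStan's own reinstall instructions) is:

remove.packages("rstan")                          # uninstall any existing RStan
if (file.exists(".RData")) file.remove(".RData")  # clear a stale workspace image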

Next, restart R. Then, re-install a fresh copy of rstan first, then rstanarm afterwards. Make sure to install version 2.19.3 for both packages.

Hierarchical/multilevel data

We have data from a General Certificate of Secondary Education (GCSE) exam, which is an academic qualification exam taken in the UK. There are two components to the exam: a written paper and course work. Scores for both are included in the dataset, along with the school each student attended, the student's unique ID, and the student's gender. We will work only with the course work variable as our response variable:

data(Gcsemv, package = "mlmRev")
dim(Gcsemv)
## [1] 1905    5
summary(Gcsemv)
##      school        student     gender      written        course   
##  68137  : 104   77     :  14   F:1128   Min.   : 1    Min.   :  9  
##  68411  :  84   83     :  14   M: 777   1st Qu.:37    1st Qu.: 63  
##  68107  :  79   53     :  13            Median :46    Median : 76  
##  68809  :  73   66     :  13            Mean   :46    Mean   : 73  
##  22520  :  65   27     :  12            3rd Qu.:55    3rd Qu.: 86  
##  60457  :  54   110    :  12            Max.   :90    Max.   :100  
##  (Other):1446   (Other):1827            NA's   :202   NA's   :180
# Make Male the reference category and rename variable
Gcsemv$female <- relevel(Gcsemv$gender, "M")

# Use only total score on coursework paper
GCSE <- subset(x = Gcsemv,
               select = c(school, student, female, course))

# Count unique schools and total number of students
m <- length(unique(GCSE$school))
N <- nrow(GCSE)

We will use slightly different notation from what we used in class. Here, each individual \(i\), with \(i = 1,\ldots, n\), belongs to a school \(j\), with \(j = 1,\ldots, m\), and \(m =\) 73. We also have the gender variable, which we can use as a regressor/predictor/covariate for the response. A natural first step might be to model course work with a linear regression model using gender as a covariate. First, though, let's begin with a bit of exploratory analysis.


  1. Simple EDA: for each school, calculate the sample average of the course scores and plot the distribution of these averages in a histogram. What does this plot reveal? (A sketch of one way to start appears below.)
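
One way to start (a sketch using dplyr and ggplot2; the object names are my own):

school_means <- GCSE %>%
  group_by(school) %>%
  summarise(avg_course = mean(course, na.rm = TRUE))

ggplot(school_means, aes(x = avg_course)) +
  geom_histogram(bins = 20) +
  labs(x = "School-level average course score", y = "Number of schools")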

As one extreme, suppose we did not want to take the different schools into account, and simply fit the linear model \[y_i \sim N(\alpha + x_i^T \beta, \sigma^2).\]

We call this model a pooled model because we ignore differences between the groups and use common \(\alpha\) and \(\beta\) coefficients for everyone. The pooled model estimates a single model, with inference performed on \(\alpha, \beta,\) and \(\sigma\).

The other extreme is an unpooled model where we fit a separate model for each group/school. So for individual \(i\) in school \(j\), we fit the model \[y_{ij} \sim N(\alpha_{j} + x_{ij}^T \beta_j, \sigma_j^2).\]

In the unpooled framework, we estimate \(m\) separate models, and no information is shared between groups. That is, we estimate the coefficients for the first school independently of the coefficients for the second school. (Note that the unpooled fit below varies only the intercept by school; the coefficient on female is still common to all schools.) Both models are fit here:

pooled <- stan_glm(course ~ 1 + female, data = GCSE, refresh = 0)
unpooled <- stan_glm(course ~ -1 + school + female, data = GCSE, refresh = 0)

However, it seems likely that we can improve our models if we can share information about the \(\alpha\) and \(\beta\) between groups. As discussed in class, this naturally leads to a hierarchical framework where we use a prior distribution to encode relationships across the schools. We will explore several multilevel models for this data.

Model 1: varying intercept model with no predictors

If we let \(Y_{ij}\) be individual \(i\) in school \(j\)’s exam score, we can write the following model:

\[ \begin{align*} &Y_{ij} = \theta_j + \epsilon_{ij}, \quad \epsilon_{ij} \overset{iid} \sim N(0, \sigma^2) \\ &\theta_j = \mu_\theta + \omega_j, \quad \omega_j \overset{iid} \sim N(0, \tau^2) \end{align*} \]

We see that \(\theta_j\) is the school-specific intercept, and \(\mu_\theta\) is the overall mean across the \(m\) schools. We could introduce the covariates into this model as well, but we begin with a simple intercept-only regression:
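
(The fitting call is not shown in this document; from the formula and output below, it was presumably:)

mod1 <- stan_lmer(course ~ 1 + (1 | school), data = GCSE, refresh = 0)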

The stan_lmer() function allows for easy multilevel/hierarchical modeling. Take a look at the help page for stan_lmer() to see the default options. Looking at the formula, we have

\[\text{formula} = \text{course}\sim 1 + (1 \ | \text{ school}).\]

Similar to the syntax for the usual lm(), the variable on the left of the tilde is our response variable. The 1 on the right side of the formula specifies that we would like an intercept, and the (1 | school) term specifies that we would like the intercept to vary by the variable school. We will estimate an overall intercept, and then for each school we will estimate a term that provides an adjustment or deviation from that intercept. We can think of the multilevel model as introducing an additional random effect or component of variation, because we are allowing the intercept to vary across schools.

In the function call, we did not set a prior, which means that we used the default priors for \(\mu_\theta\), \(\sigma^2\), and \(\tau^2\). We can see the priors that were used by running the following code:
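
(The call itself is not echoed here; it is presumably prior_summary():)

prior_summary(mod1)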

## Priors for model 'mod1' 
## ------
## Intercept (after predictors centered)
##   Specified prior:
##     ~ normal(location = 0, scale = 10)
##   Adjusted prior:
##     ~ normal(location = 0, scale = 163)
## 
## Auxiliary (sigma)
##   Specified prior:
##     ~ exponential(rate = 1)
##   Adjusted prior:
##     ~ exponential(rate = 0.061)
## 
## Covariance
##  ~ decov(reg. = 1, conc. = 1, shape = 1, scale = 1)
## ------
## See help('prior_summary.stanreg') for more details

Notice that the priors have been rescaled (we could set autoscale = F in the function call to avoid this). Where did the adjusted scale values come from? With autoscaling, rstanarm multiplies the specified scales by the sample standard deviation of the response, so the intercept's scale becomes \(10 \times 16.3 \approx 163\) and the exponential rate becomes \(1/16.3 \approx 0.061\). Consider the observed standard deviation of the course variable:
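
(Presumably computed as:)

sd(GCSE$course, na.rm = TRUE)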

## [1] 16

Let's look at the output of the model fitting. The output contains summaries relating to \(\mu_\theta\), \(\sigma^2\), and \(\tau^2\) (note that sigma and the Std.Dev. entries are reported on the standard deviation scale). Can you identify which is which? If you cannot, ask the TA.
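
(The print call is not echoed; presumably:)

print(mod1, digits = 3)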

## stan_lmer
##  family:       gaussian [identity]
##  formula:      course ~ 1 + (1 | school)
##  observations: 1725
## ------
##             Median MAD_SD
## (Intercept) 73.763  1.112
## 
## Auxiliary parameter(s):
##       Median MAD_SD
## sigma 13.820  0.240
## 
## Error terms:
##  Groups   Name        Std.Dev.
##  school   (Intercept)  8.879  
##  Residual             13.820  
## Num. levels: school 73 
## 
## ------
## * For help interpreting the printed output see ?print.stanreg
## * For info on the priors used see ?prior_summary.stanreg

Note: The standard deviations reported (labeled MAD_SD in the print output) are proportional to the median absolute deviation (mad) from the median. Compared to the raw posterior standard deviation, the MAD_SD will be more robust for long-tailed distributions.

We can obtain posterior summaries and credible intervals as follows:
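
(The call itself was stripped from this document; given the parameters and quantiles shown below, it was presumably something like:)

summary(mod1,
        pars = c("(Intercept)", "sigma",
                 "Sigma[school:(Intercept),(Intercept)]"),
        probs = c(0.025, 0.975),
        digits = 3)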

## Model Info:
##  function:     stan_lmer
##  family:       gaussian [identity]
##  formula:      course ~ 1 + (1 | school)
##  algorithm:    sampling
##  sample:       4000 (posterior sample size)
##  priors:       see help('prior_summary')
##  observations: 1725
##  groups:       school (73)
## 
## Estimates:
##                                         mean    sd      2.5%   97.5%
## (Intercept)                            73.756   1.141  71.466  75.967
## sigma                                  13.820   0.243  13.350  14.319
## Sigma[school:(Intercept),(Intercept)]  78.835  15.522  52.773 113.840
## 
## MCMC diagnostics
##                                       mcse  Rhat  n_eff
## (Intercept)                           0.059 1.009  370 
## sigma                                 0.004 1.000 4226 
## Sigma[school:(Intercept),(Intercept)] 0.548 1.003  801 
## 
## For each parameter, mcse is Monte Carlo standard error, n_eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence Rhat=1).

  2. Report the posterior estimates of \(\mu_\theta\), \(\sigma\), and \(\tau^2\).

If we want to create plots, it is helpful to extract the posterior draws. We can extract the draws for each variable by specifying the variable name, as we've seen before. Notice that we use regex_pars, which means that we want to extract all of the variables whose names match the given regular expression.
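
(A sketch of the extraction; the object names are my own:)

mod1.sims <- as.matrix(mod1)       # 4000 draws x 76 parameters
dim(mod1.sims)
par_names <- colnames(mod1.sims)
head(par_names)                    # overall intercept, then school deviations
tail(par_names)                    # ..., sigma, Sigma[school]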

## [1] 4000   76
## [1] "(Intercept)"                 "b[(Intercept) school:20920]"
## [3] "b[(Intercept) school:22520]" "b[(Intercept) school:22710]"
## [5] "b[(Intercept) school:22738]" "b[(Intercept) school:22908]"
## [1] "b[(Intercept) school:76631]"          
## [2] "b[(Intercept) school:77207]"          
## [3] "b[(Intercept) school:84707]"          
## [4] "b[(Intercept) school:84772]"          
## [5] "sigma"                                
## [6] "Sigma[school:(Intercept),(Intercept)]"

The intercept variable is the same for all 73 schools (it corresponds to the 1 in the regression formula). The (1 | school) term in the formula gives each school's specific deviation from the overall intercept. These deviations have coefficients named b[(Intercept) school: <school number>]. The following lines of code compute the 73 varying intercepts and store the posterior means and 95% credible intervals for each intercept:
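
(The chunk is not shown in this document; a sketch, with variable names of my own choosing:)

# draws of the overall mean mu_theta (4000 x 1)
mu_theta_sims <- as.matrix(mod1, pars = "(Intercept)")
# draws of the school-level deviations omega_j (4000 x 73)
omega_sims <- as.matrix(mod1, regex_pars = "b\\[\\(Intercept\\) school\\:")
# school-specific intercepts: theta_j = mu_theta + omega_j
theta_sims <- as.numeric(mu_theta_sims) + omega_sims
# posterior means and 95% credible intervals for each intercept
theta_mean <- apply(theta_sims, 2, mean)
theta_ci <- apply(theta_sims, 2, quantile, probs = c(0.025, 0.975))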

Comparisons between schools

Now that we have sampled all the parameters and obtained the varying intercepts posterior estimates, we may be interested in comparing the intercepts of schools.


  3. Choose two schools and report on their difference in average scores with descriptive statistics, a histogram, and an interpretation. (A sketch of the mechanics appears below.)
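
A sketch of the mechanics, using theta_sims from above (the two columns picked here are arbitrary; substitute the schools you choose):

# posterior draws of the difference between two school-specific intercepts
school_diff <- theta_sims[, 1] - theta_sims[, 2]
mean(school_diff)                               # posterior mean difference
quantile(school_diff, probs = c(0.025, 0.975))  # 95% credible interval
hist(school_diff, main = "Difference in school-specific intercepts")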

Model 2: varying intercept with a single individual-level predictor

We can add a level of complexity to the model by taking advantage of the covariates provided to us. Let \(x_{ij}\) be the value of the covariate for individual \(i\) in school \(j\). Then the only modification to Model 1 is a change to the sampling model:

\[Y_{ij} \sim N(\theta_j + \beta X_{ij}, \sigma^2),\]

where \(\theta_j\) takes the same form as before. If we allow \(X_{ij}\) to represent whether or not individual \(i\) is female, how would we code this? (The coefficient for the female covariate \(\beta\) will be the same for all schools). This would be reasonable if we think that the effect of gender on our response variable is the same across schools.

See the code below for how to run this model. Notice in the code that we specify a prior for the \(\mu_\theta\) hyperparameter and for the \(\beta\) coefficient, and that we chose not to autoscale the priors. How informative is our prior for \(\beta\)?
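
(The fitting code was stripped from this document; based on the description above, it was presumably along the following lines. The normal scale of 100 is an assumption:)

mod2 <- stan_lmer(course ~ 1 + female + (1 | school),
                  data = GCSE,
                  prior = normal(location = 0, scale = 100, autoscale = FALSE),
                  prior_intercept = normal(location = 0, scale = 100,
                                           autoscale = FALSE),
                  refresh = 0)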


  4. What are the posterior means and credible intervals of \(\mu_\theta,\beta, \sigma\), and \(\tau^2\)?

Model 3: varying intercept and varying slope model with a single predictor

Now, we allow the coefficient for female to vary across the 73 schools: \[Y_{ij} \sim N(\theta_j + \beta_j X_{ij}, \sigma^2).\]

This would be reasonable if we think that the effect of gender on our response variable might actually differ across schools. When we do not allow group-specific intercepts and slopes, it is common to model the coefficients independently (that is, to specify an independent prior for each one). However, if we allow the intercept and slope to vary across the schools, we can model them as dependent in the prior:

\[ \begin{bmatrix}\theta_j \\ \beta_j \end{bmatrix} \sim N\left( \begin{bmatrix} \mu_\theta \\ \mu_\beta \end{bmatrix}, \begin{bmatrix} \sigma^2_\theta & Cov(\theta_j, \beta_j)\\ Cov(\beta_j, \theta_j) & \sigma^2_\beta \end{bmatrix}\right) \]

We will now have to specify a prior for the covariance matrix, which we will call \(\Sigma\). Setting priors for covariance parameters is always a tricky task, and here we will explain the default priors used in stan_lmer(). We will discuss this a bit more when we get to Bayesian linear regression. The method used in stan_lmer() decomposes the covariance matrix into a correlation matrix \(R\) and a diagonal matrix \(V\) of scaled standard deviations:

\[ \begin{align*} \Sigma &= \begin{bmatrix} \sigma^2_\theta & Cov(\theta_j, \beta_j)\\ Cov(\beta_j, \theta_j) & \sigma^2_\beta \end{bmatrix} \\ &= \begin{bmatrix} \sigma^2_\theta & \rho \sigma_\theta \sigma_\beta \\ \rho \sigma_\theta \sigma_\beta & \sigma^2_\beta \end{bmatrix} \\ &= \sigma^2 \begin{bmatrix} \sigma^2_\theta/\sigma^2 & \rho \sigma_\theta \sigma_\beta/\sigma^2 \\ \rho \sigma_\theta \sigma_\beta/\sigma^2 & \sigma^2_\beta/\sigma^2 \end{bmatrix} \\ &= \sigma^2 \begin{bmatrix} \sigma_\theta/\sigma & 0 \\ 0 & \sigma_\beta/\sigma \end{bmatrix} \begin{bmatrix} 1 & \rho \\ \rho & 1 \end{bmatrix}\begin{bmatrix} \sigma_\theta/\sigma & 0 \\ 0 & \sigma_\beta/\sigma \end{bmatrix}\\ &= \sigma^2 VRV \end{align*} \]

After decomposing the covariance matrix into correlation and scale matrices, the variances are further decomposed into the product of a simplex vector (its elements are non-negative and sum to \(1\)) and the trace of the covariance matrix. An LKJ prior is placed on the correlation matrix, with the default being jointly uniform over all correlation matrices of the same dimension as \(R\). A symmetric Dirichlet prior is used on the simplex vector, with the default being uniform over the space of simplex vectors of the same size. This may sound like a lot; for now, let's just focus on fitting the model. See the rstanarm help page on priors for more information.

Now, to actually fit the model, we specify that we want both the intercept and the slope of the female covariate to vary across schools. The code will take much longer to run because of the extra sampling that goes into the random slopes:
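
(A sketch of the call, inferred from the model description above:)

mod3 <- stan_lmer(course ~ 1 + female + (1 + female | school),
                  data = GCSE, refresh = 0)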

Model Comparison

Now that we've fit three different hierarchical models, we will compare them. We can use loo_compare(), as we did in the GLM lab. However, since we are comparing more than two models, instead of a single difference in expected log predictive density, the function returns a matrix arranged in descending order according to expected out-of-sample predictive accuracy.
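
(The comparison code is not echoed here; presumably it computes approximate leave-one-out cross-validation for each model and then compares:)

loo1 <- loo(mod1)
loo2 <- loo(mod2)
loo3 <- loo(mod3)
loo_compare(loo1, loo2, loo3)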

##      elpd_diff se_diff
## mod3   0.0       0.0  
## mod2 -29.4       9.9  
## mod1 -79.2      15.1
Here, we plot the regression lines for some of the schools using the following models:
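
(The plotting code is not included in this document; a minimal sketch of one way to do it, using posterior median coefficients from mod3 and the pooled fit. The three school IDs are arbitrary illustrative choices:)

# school-specific posterior median intercepts and slopes under mod3
mod3_coef <- coef(mod3)$school
schools <- c("20920", "22520", "22710")   # arbitrary schools to display
plot(NULL, xlim = c(0, 1), ylim = c(40, 100),
     xlab = "female (0 = M, 1 = F)", ylab = "course score")
for (s in schools) {
  abline(a = mod3_coef[s, "(Intercept)"], b = mod3_coef[s, "femaleF"],
         col = "blue")
}
# pooled fit, for reference
abline(a = coef(pooled)["(Intercept)"], b = coef(pooled)["femaleF"], lwd = 2)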


  5. Compare and contrast the regression lines estimated using these different methods. Also, based on the output of loo_compare, which model would you recommend?

Multilevel modeling exercise

Now that you have seen a multilevel model implemented with some of the rstanarm functionality, it’s your turn to specify one on your own.

We have an additional dataset on radon levels in houses in the state of Minnesota. Specifically, for each house we have the radon measurement on the log scale (log_radon), an indicator for whether the measurement was taken in the basement or on the first floor (floor, where 0 = basement, 1 = first floor), and which of 85 counties the house belongs to. We also have a fourth variable giving the county-level uranium level on the log scale (log_uranium).

Download the data (named radon.txt) from Sakai and save it locally to the same directory as your R markdown file. To find the data file on Sakai, go to Resources \(\rightarrow\) Datasets \(\rightarrow\) Lab Datasets \(\rightarrow\) Lab 8. Once you have downloaded the data file into the SAME folder as your R markdown file, load the data by using the following R code.

radon <- read.csv("radon.txt", header = TRUE, sep = "")
radon$county <- as.factor(radon$county)

  6. Do you think a hierarchical model is warranted here? Do some EDA! Look for differences across counties.

  7. Begin by creating an unpooled model, i.e., a model where each county has a unique intercept and there are no other predictors. Call that model radon.unpooled. Then create a hierarchical/partially-pooled model where we model each county's intercept hierarchically, again with no other predictors. Call that model radon.mod1.

Once you have fit the two models, run the following code. You should see two plots showing intervals (posterior mean ± 2 standard errors, slightly wider than 95%) for the county-level intercepts. What do you notice?

# number of houses in each county
n_county <- as.numeric(table(radon$county))

# helper: posterior mean and sd of each county intercept, plus the se used
# for the error bars below
create_df <- function(sim, model){
  mean <- apply(sim, 2, mean)
  sd <- apply(sim, 2, sd)
  df <- cbind(n_county, mean, sd) %>%
    as.data.frame() %>%
    mutate(se = sd / sqrt(n_county), model = model)
  return(df)
}

# unpooled model: the first 85 columns are the county-specific intercepts
unpooled.sim <- as.matrix(radon.unpooled)
unpooled.df <- create_df(unpooled.sim[, 1:85], model = "unpooled")

# partially pooled model: add the overall intercept (column 1) to each
# county-level deviation, then drop the overall intercept column
mod1.sim <- as.matrix(radon.mod1)[, 1:86]
mod1.sim <- (mod1.sim[, 1] + mod1.sim)[, -1]
partial.df <- create_df(mod1.sim, model = "partial")

ggplot(rbind(unpooled.df, partial.df) %>%
         mutate(model = factor(model, levels = c("unpooled", "partial"))),
       aes(x = n_county, y = mean)) +
  # draws the means
  geom_jitter() +
  # draws the interval error bars
  geom_errorbar(aes(ymin = mean - 2*se, ymax = mean + 2*se), width = .1) +
  ylim(0, 3) +
  xlim(0, 60) +
  geom_hline(aes(yintercept = mean(coef(radon.unpooled)))) +
  facet_wrap(~model)
## Warning: Removed 6 rows containing missing values (geom_point).

  8. Fit a varying-intercept model, but now add the variable floor as a fixed slope. Call the model radon.mod2. For another model radon.mod3, fit a varying-intercept and varying-slope model, again with floor as the predictor.

    Lastly, recall that we have a fourth variable which gives the county-level log uranium measurements. A really powerful aspect of hierarchical/multilevel modeling is the ability to incorporate data at different levels of coarseness. Fit a varying-intercept model that includes both floor and log_uranium as predictors. So now we have both an individual/house-level covariate and a group/county-level covariate. Group-level predictors help reduce group-level variation, which induces stronger pooling effects. Call this model radon.mod4.

    Once you have fit these five models, report on the differences in model performance. (A sketch of one comparison call appears below.)
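
A sketch of the comparison, assuming the five models have been fit with the names above:

loo_compare(loo(radon.unpooled), loo(radon.mod1), loo(radon.mod2),
            loo(radon.mod3), loo(radon.mod4))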


Grading

10 points: 3 points for question 8, 1 point for each remaining question.

Acknowledgement

This lab was adapted from this tutorial by Jordan Bryan and Becky Tang.