Due: 5:00pm, Mar 24, 2020
You all should have R and RStudio installed on your computers by now. If you do not, first install the latest version of R here: https://cran.rstudio.com (remember to select the right installer for your operating system). Next, install the latest version of RStudio here: https://www.rstudio.com/products/rstudio/download/. Scroll down to the “Installers for Supported Platforms” section and find the right installer for your operating system.
You are required to use R Markdown to type up this lab report. If you do not already know how to use R Markdown, here is a very basic R Markdown template: https://sta-602l-s20.github.io/Course-Website/labs/resources/LabReport.Rmd. Refer to the resources tab of the course website (here: https://sta-602l-s20.github.io/Course-Website/resources/) for links to help you learn how to use R Markdown.
You MUST submit both your .Rmd and .pdf files to the course site on Gradescope here: https://www.gradescope.com/courses/77790/assignments. Make sure to knit to pdf and not html; ask the TA about knitting to pdf if you cannot figure it out. Be sure to submit under the right assignment entry.
You will need the following R packages. If you do not already have them installed, please do so first using the install.packages() function, for example:
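# one-time installation of everything used in this lab
install.packages(c("tidyverse", "rstanarm", "magrittr", "ggplot2",
                   "mlmRev", "tidybayes", "ggstance", "dplyr", "modelr"))

Then load the packages: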
library(tidyverse)
library(rstanarm)
library(magrittr)
library(ggplot2)
library(mlmRev)
library(tidybayes)
library(ggstance)
library(dplyr)
library(modelr)
If you run into any problems running any of the functions in the rstanarm package, first uninstall both the rstanarm and rstan packages. To be on the safe side, it is sometimes necessary to remove any existing RStan installation first.
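For example (a sketch, assuming rstan was installed in the default library location):

# remove any installed copy of rstan
remove.packages("rstan")
# if a stale .RData file exists in your working directory, removing it can also help
if (file.exists(".RData")) file.remove(".RData")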
Next, restart R. Then re-install a fresh copy of rstan first, followed by rstanarm. Make sure to install version 2.19.3 for both packages.
We have data from a General Certificate of Secondary Education (GCSE) exam, which is an academic qualification exam taken in the UK. There are two components to the exam: a written paper and course work. Scores for both are included in the dataset, along with the school each student attended, the student’s unique ID, and the gender of the student. We will work only with the course work variable as our response variable:
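The output below can be reproduced by loading and summarizing the Gcsemv dataset from the mlmRev package:

data(Gcsemv, package = "mlmRev")
dim(Gcsemv)
summary(Gcsemv)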
## [1] 1905 5
## school student gender written course
## 68137 : 104 77 : 14 F:1128 Min. : 1 Min. : 9
## 68411 : 84 83 : 14 M: 777 1st Qu.:37 1st Qu.: 63
## 68107 : 79 53 : 13 Median :46 Median : 76
## 68809 : 73 66 : 13 Mean :46 Mean : 73
## 22520 : 65 27 : 12 3rd Qu.:55 3rd Qu.: 86
## 60457 : 54 110 : 12 Max. :90 Max. :100
## (Other):1446 (Other):1827 NA's :202 NA's :180
# Make Male the reference category and rename variable
Gcsemv$female <- relevel(Gcsemv$gender, "M")
# Use only total score on coursework paper
GCSE <- subset(x = Gcsemv,
               select = c(school, student, female, course))
# Count unique schools and students
m <- length(unique(GCSE$school))
N <- nrow(GCSE)
We will use a slightly different notation from what we used in class. Here, each individual \(i\), with \(i = 1,\ldots, n\), belongs to a certain school \(j\), with \(j = 1,\ldots, m\), and \(m =\) 73. We also have the gender variable, which we can use as a regressor/predictor/covariate for the response. A natural first step may be to model course work with a linear regression model with the gender variable as a covariate, but let’s start with a bit of exploratory analysis.
As a first exercise, compute each school’s average course score, and plot the distribution of these averages in a histogram. What does this plot reveal?

As one extreme, suppose we did not want to take the different schools into account, and simply fit the linear model \[y_i \sim N(\alpha + x_i^T \beta, \sigma^2).\]
We call this model a pooled model because we ignore differences between the groups and use common \(\alpha\) and \(\beta\) coefficients. The pooled model estimates a single model, with inference performed on \(\alpha\), \(\beta\), and \(\sigma\).
The other extreme is an unpooled model, where we fit a separate model for each group/school. So for individual \(i\) in school \(j\), we fit the model \[y_{ij} \sim N(\alpha_{j} + x_{ij}^T \beta_j, \sigma_j^2).\]
In the unpooled framework, we estimate \(m\) models, and no information is shared between groups; that is, we estimate the coefficients for the first school independently of the coefficients for the second school. Both extremes are fit here (note that, for simplicity, the unpooled code below gives each school its own intercept but shares a single \(\beta\) and \(\sigma\) across schools):
pooled <- stan_glm(course ~ 1 + female, data = GCSE, refresh = 0)
unpooled <- stan_glm(course ~ -1 + school + female, data = GCSE, refresh = 0)
However, it seems likely that we can improve our models if we can share information about the \(\alpha\) and \(\beta\) between groups. As discussed in class, this naturally leads to a hierarchical framework where we use a prior distribution to encode relationships across the schools. We will explore several multilevel models for this data.
If we let \(Y_{ij}\) be individual \(i\) in school \(j\)’s exam score, we can write the following model:
\[ \begin{align*} &Y_{ij} = \theta_j + \epsilon_{ij}, \quad \epsilon_{ij} \overset{iid} \sim N(0, \sigma^2) \\ &\theta_j = \mu_\theta + \omega_j, \quad \omega_j \overset{iid} \sim N(0, \tau^2) \end{align*} \]
We see that \(\theta_j\) is the school-specific intercept, and \(\mu_\theta\) is the overall mean across the \(m\) schools. We could introduce the covariates into this model as well, but we begin with a simple intercept-only regression:
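A sketch of the call is below (the seed is illustrative, and refresh = 0 simply suppresses the sampler’s progress messages):

# Model 1: intercept-only model with school-varying intercepts
mod1 <- stan_lmer(formula = course ~ 1 + (1 | school),
                  data = GCSE,
                  seed = 349,
                  refresh = 0)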
The stan_lmer() function allows for easy multilevel/hierarchical modeling. Take a look at the help page for stan_lmer() to see the default options. Looking at the formula, we have
\[\text{formula} = \text{course}\sim 1 + (1 \ | \text{ school}).\]
Similar to the syntax for the usual lm(), the variable on the left of the tilde is our response variable. The 1 on the right side of the formula specifies that we would like an intercept, and the (1 | school) term specifies that we would like the intercept to vary by the variable school. We will estimate an overall intercept, and then for each school we will estimate a term that provides an adjustment or deviation from that intercept. We can think of the multilevel model as introducing an additional random effect or component of variation, because we are allowing the intercept to vary across schools.
In the function call, we did not set a prior, which means that we used the default priors for \(\mu_\theta\), \(\sigma^2\), and \(\tau^2\). We can see the priors that were used by running the following code:
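# show the priors used in fitting mod1
prior_summary(object = mod1)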
## Priors for model 'mod1'
## ------
## Intercept (after predictors centered)
## Specified prior:
## ~ normal(location = 0, scale = 10)
## Adjusted prior:
## ~ normal(location = 0, scale = 163)
##
## Auxiliary (sigma)
## Specified prior:
## ~ exponential(rate = 1)
## Adjusted prior:
## ~ exponential(rate = 0.061)
##
## Covariance
## ~ decov(reg. = 1, conc. = 1, shape = 1, scale = 1)
## ------
## See help('prior_summary.stanreg') for more details
Notice that the priors have been scaled (we could set autoscale = FALSE in the function call to avoid this). Where did the adjusted scale values come from? Consider the observed standard deviation of the course variable:
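The value printed below is (up to rounding) what you get from:

sd(GCSE$course, na.rm = TRUE)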
## [1] 16
Let’s look at the output of the model fitting. The output contains summaries for \(\mu_\theta\), \(\sigma^2\), and \(\tau^2\). Can you identify which is which? If you cannot, ask the TA.
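The output below comes from printing the fitted object:

print(mod1, digits = 3)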
## stan_lmer
## family: gaussian [identity]
## formula: course ~ 1 + (1 | school)
## observations: 1725
## ------
## Median MAD_SD
## (Intercept) 73.763 1.112
##
## Auxiliary parameter(s):
## Median MAD_SD
## sigma 13.820 0.240
##
## Error terms:
## Groups Name Std.Dev.
## school (Intercept) 8.879
## Residual 13.820
## Num. levels: school 73
##
## ------
## * For help interpreting the printed output see ?print.stanreg
## * For info on the priors used see ?prior_summary.stanreg
Note: The standard deviations reported (labeled MAD_SD in the print output) are proportional to the median absolute deviation (mad) from the median. Compared to the raw posterior standard deviation, the MAD_SD will be more robust for long-tailed distributions.
We can obtain posterior summaries and credible intervals as follows:
summary(mod1,
        pars = c("(Intercept)", "sigma", "Sigma[school:(Intercept),(Intercept)]"),
        probs = c(0.025, 0.975),
        digits = 3)
##
## Model Info:
## function: stan_lmer
## family: gaussian [identity]
## formula: course ~ 1 + (1 | school)
## algorithm: sampling
## sample: 4000 (posterior sample size)
## priors: see help('prior_summary')
## observations: 1725
## groups: school (73)
##
## Estimates:
## mean sd 2.5% 97.5%
## (Intercept) 73.756 1.141 71.466 75.967
## sigma 13.820 0.243 13.350 14.319
## Sigma[school:(Intercept),(Intercept)] 78.835 15.522 52.773 113.840
##
## MCMC diagnostics
## mcse Rhat n_eff
## (Intercept) 0.059 1.009 370
## sigma 0.004 1.000 4226
## Sigma[school:(Intercept),(Intercept)] 0.548 1.003 801
##
## For each parameter, mcse is Monte Carlo standard error, n_eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence Rhat=1).
If we want to create plots, it is helpful to extract the posterior draws. We can extract the draws for each variable by specifying the variable name, as we’ve seen before. Notice that we also use regex_pars, which extracts all variables whose names match the given regular expression.
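A sketch of the extraction for mod1 (the object name mod1_sims is illustrative):

# all posterior draws as a matrix: draws x parameters
mod1_sims <- as.matrix(mod1)
dim(mod1_sims)
# inspect the first and last few parameter names
head(colnames(mod1_sims))
tail(colnames(mod1_sims))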
## [1] 4000 76
## [1] "(Intercept)" "b[(Intercept) school:20920]"
## [3] "b[(Intercept) school:22520]" "b[(Intercept) school:22710]"
## [5] "b[(Intercept) school:22738]" "b[(Intercept) school:22908]"
## [1] "b[(Intercept) school:76631]"
## [2] "b[(Intercept) school:77207]"
## [3] "b[(Intercept) school:84707]"
## [4] "b[(Intercept) school:84772]"
## [5] "sigma"
## [6] "Sigma[school:(Intercept),(Intercept)]"
# obtain draws for mu_theta
mu_theta_sims <- as.matrix(mod1, pars = "(Intercept)")
# obtain draws for each school's contribution to the intercept
theta_sims <- as.matrix(mod1,
                        regex_pars = "b\\[\\(Intercept\\) school\\:")
# obtain draws for sigma and tau^2
sig_sims <- as.matrix(mod1, pars = "sigma")
tau2_sims <- as.matrix(mod1,
                       pars = "Sigma[school:(Intercept),(Intercept)]")
The intercept term is the same for all 73 schools (it corresponds to the 1 in the regression formula). The (1 | school) term in the formula gives each school’s specific deviation from the overall intercept; these deviations have coefficients named b[(Intercept) school:<school number>]. The following lines of code compute the 73 total varying intercepts and store the posterior means and 95% credible intervals for each intercept:
int_sims <- as.numeric(mu_theta_sims) + theta_sims
# posterior mean
int_mean <- apply(int_sims, MARGIN = 2, FUN = mean)
# credible interval
int_ci <- apply(int_sims, MARGIN = 2, FUN = quantile, probs = c(0.025, 0.975))
int_ci <- data.frame(t(int_ci))
# combine into a single df
int_df <- data.frame(int_mean, int_ci)
names(int_df) <- c("post_mean","Q2.5", "Q97.5")
# sort DF according to posterior mean
int_df <- int_df[order(int_df$post_mean),]
# create variable "index" to represent order
int_df <- int_df %>% mutate(index = row_number())
# plot posterior means of school-varying intercepts, along with 95% CIs
ggplot(data = int_df, aes(x = index, y = post_mean))+
geom_pointrange(aes(ymin = Q2.5, ymax = Q97.5))+
scale_x_continuous("Index", breaks = seq(0,m, 5)) +
scale_y_continuous(expression(paste("varying intercept ", theta[j])))
Now that we have sampled all the parameters and obtained the varying intercepts posterior estimates, we may be interested in comparing the intercepts of schools.
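For example, here is a minimal sketch comparing two schools’ intercepts, using two school IDs (20920 and 22520) that appear in the parameter names above:

# posterior draws of the two schools' intercepts
school_a <- int_sims[, "b[(Intercept) school:20920]"]
school_b <- int_sims[, "b[(Intercept) school:22520]"]
# posterior probability that school 20920's mean course score exceeds school 22520's
mean(school_a > school_b)
# 95% credible interval for the difference in intercepts
quantile(school_a - school_b, probs = c(0.025, 0.975))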
We can add a level of complexity to the model by taking advantage of the covariates provided to us. Let \(X_{ij}\) be the value of the covariate for individual \(i\) in school \(j\). Then the only modification to Model 1 is a change to the sampling model:
\[Y_{ij} \sim N(\theta_j + \beta X_{ij}, \sigma^2),\]
where \(\theta_j\) takes the same form as before. If we allow \(X_{ij}\) to represent whether or not individual \(i\) is female, how would we code this? (The coefficient for the female covariate \(\beta\) will be the same for all schools). This would be reasonable if we think that the effect of gender on our response variable is the same across schools.
See the code below for how to run this model. Notice in the code that we specify a prior for \(\mu_\theta\) hyperparameter and for the \(\beta\) coefficient, and that we chose to not autoscale the data. How informative is our prior for \(\beta\)?
mod2 <- stan_lmer(formula = course ~ 1 + female + (1 | school),
                  data = GCSE,
                  prior = normal(location = 0,
                                 scale = 100,
                                 autoscale = FALSE),
                  prior_intercept = normal(location = 0,
                                           scale = 100,
                                           autoscale = FALSE),
                  seed = 349,
                  refresh = 0)
# plot varying intercepts under mod2
mod2.sims <- as.matrix(mod2)
group_int <- mean(mod2.sims[, 1])        # posterior mean of the overall intercept
mp <- mean(mod2.sims[, 2])               # posterior mean of the common female coefficient
bp <- apply(mod2.sims[, 3:75], 2, mean)  # posterior means of the 73 school-specific deviations
xvals <- seq(0, 1, 0.01)
plot(x = xvals, y = rep(0, length(xvals)),
     ylim = c(50, 90), xlim = c(-0.1, 1.1), xaxt = "n", xlab = "female", ylab = "course")
axis(side = 1, at = c(0, 1))
# one fitted line per school: school-specific intercept, shared slope
for (bi in bp){
  lines(xvals, (group_int + bi) + xvals * mp)
}
Now, we allow the coefficient for female to vary across the 73 schools: \[Y_{ij} \sim N(\theta_j + \beta_j X_{ij}, \sigma^2).\]
This would be reasonable if we think that the effect of gender on our response variable might actually differ across schools. When we do not allow group-specific intercepts and slopes, it is common to model the coefficients independently (that is, to specify an independent prior for each one). However, if we allow the intercept and slope to vary across the schools, we can model them as dependent in the prior:
\[ \begin{bmatrix}\theta_j \\ \beta_j \end{bmatrix} \sim N\left( \begin{bmatrix} \mu_\theta \\ \mu_\beta \end{bmatrix}, \begin{bmatrix} \sigma^2_\theta & Cov(\theta_j, \beta_j)\\ Cov(\beta_j, \theta_j) & \sigma^2_\beta \end{bmatrix}\right) \]
We will now have to specify a prior for the covariance matrix, which we will call \(\Sigma\). Setting priors for covariance parameters is always a tricky task, and here we will explain the default priors used in stan_lmer(). We will discuss this a bit more when we get to Bayesian linear regression. The method used in stan_lmer() decomposes the covariance matrix into a correlation matrix \(R\) and a diagonal matrix \(V\) of scaled standard deviations:
\[ \begin{align*} \Sigma &= \begin{bmatrix} \sigma^2_\theta & Cov(\theta_j, \beta_j)\\ Cov(\beta_j, \theta_j) & \sigma^2_\beta \end{bmatrix} \\ &= \begin{bmatrix} \sigma^2_\theta & \rho \sigma_\theta \sigma_\beta \\ \rho \sigma_\theta \sigma_\beta & \sigma^2_\beta \end{bmatrix} \\ &= \sigma^2 \begin{bmatrix} \sigma^2_\theta/\sigma^2 & \rho \sigma_\theta \sigma_\beta/\sigma^2 \\ \rho \sigma_\theta \sigma_\beta/\sigma^2 & \sigma^2_\beta/\sigma^2 \end{bmatrix} \\ &= \sigma^2 \begin{bmatrix} \sigma_\theta/\sigma & 0 \\ 0 & \sigma_\beta/\sigma \end{bmatrix} \begin{bmatrix} 1 & \rho \\ \rho & 1 \end{bmatrix}\begin{bmatrix} \sigma_\theta/\sigma & 0 \\ 0 & \sigma_\beta/\sigma \end{bmatrix}\\ &= \sigma^2 VRV \end{align*} \]
After decomposing the covariance matrix into correlation and variance components, the variances are further decomposed into the product of a simplex vector (its elements sum to \(1\)) and the trace of the matrix. An LKJ prior is placed on the correlation matrix, with the default being jointly uniform over all correlation matrices of the same dimension as \(R\). A symmetric Dirichlet prior is used on the simplex vector, with the default being uniform over the space of simplex vectors of the same size. This all sounds like a lot; don’t worry, and for now let’s just focus on fitting the model. See the priors help page for rstanarm for more information.
Now, to actually fit the model, we specify that we want both the intercept and the slope of the female covariate to vary across schools. The code will take much longer to run because of the extra sampling that goes into the random slopes:
mod3 <- stan_lmer(formula = course ~ 1 + female + (1 + female | school),
                  data = GCSE,
                  seed = 349,
                  refresh = 0)
mod3_sims <- as.matrix(mod3)
# obtain draws for mu_theta
mu_theta_sims <- as.matrix(mod3, pars = "(Intercept)")
fem_sims <- as.matrix(mod3, pars = "femaleF")
# obtain draws for each school's contribution to intercept
theta_sims <- as.matrix(mod3,
                        regex_pars = "b\\[\\(Intercept\\) school\\:")
beta_sims <- as.matrix(mod3,
                       regex_pars = "b\\[femaleF school\\:")
int_sims <- as.numeric(mu_theta_sims) + theta_sims
slope_sims <- as.numeric(fem_sims) + beta_sims
# posterior mean
slope_mean <- apply(slope_sims, MARGIN = 2, FUN = mean)
# credible interval
slope_ci <- apply(slope_sims, MARGIN = 2, FUN = quantile, probs = c(0.025, 0.975))
slope_ci <- data.frame(t(slope_ci))
# combine into a single df
slope_df <- data.frame(slope_mean, slope_ci, levels(GCSE$school))
names(slope_df) <- c("post_mean","Q2.5", "Q97.5", "school")
# sort DF according to posterior mean
slope_df <- slope_df[order(slope_df$post_mean),]
# create variable "index" to represent order
slope_df <- slope_df %>% mutate(index = row_number())
# plot posterior means of school-varying slopes, along with 95% CIs
ggplot(data = slope_df, aes(x = index, y = post_mean))+
geom_pointrange(aes(ymin = Q2.5, ymax = Q97.5))+
scale_x_continuous("Index", breaks = seq(1,m, 1),
labels = slope_df$school) +
scale_y_continuous(expression(paste("varying slopes ", beta[j])))+
theme(axis.text.x = element_text(angle = 90))
Now that we’ve fit three different hierarchical models, we will compare them. We can use loo_compare(), much as we used compare_models() in the GLM lab. However, since we are comparing more than two models, instead of a single difference in expected log predictive density, the function returns a matrix arranged in descending order of expected out-of-sample predictive accuracy.
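The comparison below can be produced along these lines, where each loo() call computes approximate leave-one-out cross-validation for one model:

loo_compare(loo(mod1), loo(mod2), loo(mod3))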
## elpd_diff se_diff
## mod3 0.0 0.0
## mod2 -29.4 9.9
## mod1 -79.2 15.1
Here, we plot the regression lines for some of the schools under the fitted models; a sketch of one such plot is below.
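This sketch uses the mod3 posterior draws computed above (int_sims and slope_sims); the three school IDs are illustrative:

# posterior-mean regression lines for a few example schools under mod3
sel_schools <- c("20920", "22520", "22710")
plot(NULL, xlim = c(-0.1, 1.1), ylim = c(50, 90), xaxt = "n",
     xlab = "female", ylab = "course")
axis(side = 1, at = c(0, 1))
for (s in sel_schools) {
  a <- mean(int_sims[, paste0("b[(Intercept) school:", s, "]")])   # school intercept
  b <- mean(slope_sims[, paste0("b[femaleF school:", s, "]")])     # school slope for female
  lines(c(0, 1), c(a, a + b))
}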
Based on the output of loo_compare, which model would you recommend?

Now that you have seen a multilevel model implemented with some of the rstanarm functionality, it’s your turn to specify one on your own.
We have an additional dataset on the radon levels in houses in the state of Minnesota. Specifically, for each house we have the radon measurement on the log scale (log_radon), an indicator of whether the measurement was taken in the basement or on the first floor (floor, where 0 = basement, 1 = first floor), and which of 85 counties the house belongs to. We also have a fourth variable which gives the county-level uranium level (log_uranium).
Download the data (named radon.txt) from Sakai and save it locally in the same directory as your R Markdown file. To find the data file on Sakai, go to Resources \(\rightarrow\) Datasets \(\rightarrow\) Lab Datasets \(\rightarrow\) Lab 8. Once you have downloaded the data file into the SAME folder as your R Markdown file, load the data using R code along the lines below.
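A minimal sketch of the read-in, assuming a whitespace-delimited file with a header row (adjust the arguments if the file’s format differs):

radon <- read.table("radon.txt", header = TRUE)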
Do you think a hierarchical model is warranted here? Do some EDA! Look for differences across counties.
Begin by creating an unpooled model, i.e., a model where each county has a unique intercept and there are no other predictors; call that model radon.unpooled. Then create a hierarchical/partially-pooled model where we model each county’s intercept hierarchically, again with no other predictors; call that model radon.mod1.
Once you have fit the two models, run the following code. You should see two plots which give slightly wider than 95% credible intervals for the county-level intercepts. What do you notice?
n_county <- as.numeric(table(radon$county))
create_df <- function(sim,model){
mean <- apply(sim,2,mean)
sd <- apply(sim,2,sd)
df <- cbind(n_county, mean, sd) %>%
as.data.frame()%>%
mutate(se = sd/ sqrt(n_county), model = model)
return(df)
}
unpooled.sim <- as.matrix(radon.unpooled)
unpooled.df <- create_df(unpooled.sim[,1:85], model = "unpooled")
mod1.sim <- as.matrix(radon.mod1)[,1:86]
# add the overall intercept (column 1) to each county's deviation, then drop the intercept column
mod1.sim <- (mod1.sim[,1] + mod1.sim)[,-1]
partial.df <- create_df(mod1.sim, model = "partial")
ggplot(rbind(unpooled.df, partial.df) %>%
         mutate(model = factor(model, levels = c("unpooled", "partial"))),
       aes(x = n_county, y = mean)) +
#draws the means
geom_jitter() +
#draws the CI error bars
geom_errorbar(aes(ymin=mean-2*se, ymax= mean+2*se), width=.1)+
ylim(0,3)+
xlim(0,60)+
geom_hline(aes(yintercept= mean(coef(radon.unpooled))))+
facet_wrap(~model)
## Warning: Removed 6 rows containing missing values (geom_point).
Fit a varying-intercept model, but now add the variable floor as a fixed slope; call the model radon.mod2. For another model, radon.mod3, fit a varying-intercept and varying-slope model, again with floor as the predictor.
Lastly, recall that we have a fourth variable which gives the county-level log uranium measurements. A really powerful aspect of hierarchical/multilevel modeling is the ability to incorporate data at different levels of coarseness. Fit a varying-intercept model, but now include both floor and log_uranium as predictors. So now we have both an individual/house-level covariate and a group/county-level covariate. Group-level predictors help reduce group-level variation, which induces stronger pooling effects. Call this model radon.mod4.
Once you have fit these five models, report on the differences in model performance.
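One way to do this, assuming all five models fit without issue, is the same loo_compare() workflow used earlier:

loo_compare(loo(radon.unpooled), loo(radon.mod1), loo(radon.mod2),
            loo(radon.mod3), loo(radon.mod4))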
10 points: 3 points for question 8, 1 point for each remaining question.
This lab was adapted from this tutorial by Jordan Bryan and Becky Tang.