class: center, middle, inverse, title-slide

# One parameter models cont'd

### Dr. Olanrewaju Michael Akande

### Jan 17, 2020

---

## Announcements

- Tentative dates for quizzes:
  + Quiz I: Wed, Feb 12
  + Quiz II: Wed, Apr 1

--

- Homework 1 now online, due Thursday, Jan 23.

--

- My office hours are confirmed:
  + Wed 9:00 - 10:00am and Thur 11:45 - 12:45pm

--

- No lab on Monday; MLK day.

--

- Occasional short (timed) "participation" quizzes from next week.
  + Will usually be available on Sakai under "Tests & Quizzes"
  + Must take at the beginning of class if present
  + Must take between 11:45am and 1:00pm if absent

---

## Outline

- Beta-binomial (cont'd)
  + Example
  + Cautionary tale: parameters at the boundary
  + Marginal likelihood
  + Posterior prediction
  + Truncated priors

- Introduction to the Poisson-Gamma model
  + Recap of the distributions
  + Conjugacy

---

class: center, middle

# Beta-binomial cont'd

---

## Beta-binomial recap

Binomial likelihood:
.block[
.small[
`$$L(y; \theta) = {n \choose y} \theta^y(1-\theta)^{n-y}$$`
]
]

--

`\(+\)` Beta prior:
.block[
.small[
`$$\pi(\theta) = \frac{1}{B(a,b)} \theta^{a-1} (1-\theta)^{b-1} = \textrm{Beta}(a,b)$$`
]
]

--

`\(\Rightarrow\)` Beta posterior:
.block[
.small[
`$$\pi(\theta | y) = \frac{1}{B(a+y,b+n-y)} \theta^{a+y-1} (1-\theta)^{b+n-y-1} = \textrm{Beta}(a+y,b+n-y).$$`
]
]

--

- Recall: for `\(\theta \sim \textrm{Beta}(a,b)\)`,
  + `\(\mathbb{E}[\theta] = \dfrac{a}{a+b}\)`
  + `\(\mathbb{V}[\theta] = \dfrac{ab}{(a+b)^2(a+b+1)}\)`

---

## Example: rare events

- .hlight[Step 1]. State the question:
  + What is the prevalence of an infectious disease in a small city?
  + Why? High prevalence means more public health precautions are recommended.

--

- .hlight[Step 2]. Collect the data:
  + Suppose you collect a small random sample of 20 individuals.

--

- .hlight[Step 3]. Explore the data:
  + Count the number of infected individuals in the sample.
  + Let `\(Y\)` be the corresponding random variable.

---

## Example: rare events

- .hlight[Step 4]. Formulate and state a modeling framework:
  + Parameter of interest: `\(\theta\)` is the fraction of infected individuals in the city.
  + Sampling model: a reasonable model for `\(Y\)` can be `\(\textrm{Bin}(20,\theta)\)`.

<img src="img/binomial_histograms.png" width="105" height="370px" style="display: block; margin: auto;" />

---

## Example: rare events

- .hlight[Step 4]. Formulate and state a modeling framework:
  + Prior specification: information from previous studies suggests the infection rate in "comparable cities" ranges from 0.05 to 0.20, with an average of 0.10. So maybe a standard deviation of roughly 0.05?
  + What is a good prior? One whose **expected value** is close to 0.10 and whose **variance** is close to 0.05^2.
  + Possible option: `\(\mbox{Beta}(3.5,31.5)\)` or maybe even `\(\mbox{Beta}(3,32)\)`?

<img src="03-one-parameter-models-II_files/figure-html/unnamed-chunk-3-1.png" style="display: block; margin: auto;" />

---

## Example: rare events

- .hlight[Step 4]. Formulate and state a modeling framework:
  + Under `\(\mbox{Beta}(3,32)\)`, `\(\Pr(\theta < 0.1) \approx 0.67\)`.
  + Posterior distribution for the model: `\(\theta | (Y=y) \sim \textrm{Beta}(a+y,b+n-y)\)`
  + Suppose `\(Y=0\)`. Then, `\(\theta | (Y=0) \sim \textrm{Beta}(3+0,32+20-0) = \textrm{Beta}(3,52)\)`

<img src="03-one-parameter-models-II_files/figure-html/unnamed-chunk-4-1.png" style="display: block; margin: auto;" />
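---

## Example: rare events

- A minimal R sketch for checking these posterior summaries numerically, using the `\(\mbox{Beta}(3,32)\)` prior, `\(n = 20\)` and `\(y = 0\)` from the slides above:

```r
# Posterior for the rare-events example: Beta(a + y, b + n - y)
a <- 3; b <- 32; n <- 20; y <- 0
a_post <- a + y                         # 3
b_post <- b + n - y                     # 52
a_post / (a_post + b_post)              # posterior mean: 3/55, about 0.055
(a_post - 1) / (a_post + b_post - 2)    # posterior mode: 2/53, about 0.04
pbeta(0.1, a_post, b_post)              # Pr(theta < 0.1 | Y = 0), about 0.92
pbeta(0.1, a, b)                        # prior Pr(theta < 0.1), about 0.67
```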
---

## Example: rare events

- .hlight[Step 5]. Check your models:
  + Compare performance of posterior mean and posterior probability that `\(\theta < 0.1\)`. .hlight[See pages 5 to 7 of the Hoff book (the section on sensitivity analysis)].
  + Under the posterior `\(\mbox{Beta}(3,52)\)`,
    - `\(\Pr(\theta < 0.1 | Y=y) \approx 0.92\)`. More confidence in low values of `\(\theta\)`.
    - For `\(\mathbb{E}(\theta | Y=y)\)`, we have
.block[
.small[
$$
`\begin{split}
\mathbb{E}(\theta | y) & = \dfrac{a+y}{a+b+n} = \dfrac{3}{55} \approx 0.055.\\
\end{split}`
$$
]
]
    - Recall that the prior mean is `\(a/(a+b) = 3/35 \approx 0.086\)`. Thus, we can see how the prior mean contributes to the posterior mean:
.block[
.small[
$$
`\begin{split}
\mathbb{E}(\theta | y) & = \dfrac{a+b}{a+b+n} \times \textrm{prior mean} + \dfrac{n}{a+b+n} \times \textrm{sample mean}\\
& = \dfrac{a+b}{a+b+n} \times \dfrac{a}{a+b} + \dfrac{n}{a+b+n} \times \dfrac{y}{n}\\
& = \dfrac{35}{55} \times \dfrac{3}{35} + \dfrac{20}{55} \times \dfrac{0}{20} = \dfrac{3}{55} \approx 0.055.\\
\end{split}`
$$
]
]

---

## Example: rare events

- .hlight[Step 6]. Answer the question:
  + People with low prior expectations are generally at least `\(90\%\)` certain that the infection rate is below 0.10. Again, .hlight[see pages 5 to 7 of the Hoff book].
  + `\(\pi(\theta | Y)\)` is to the left of `\(\pi(\theta)\)` because the observation `\(Y=0\)` provides evidence of a low value of `\(\theta\)`.
  + `\(\pi(\theta | Y)\)` is more peaked than `\(\pi(\theta)\)` because it combines the prior with the data, and so contains more information than `\(\pi(\theta)\)` alone.
  + The posterior expectation is 0.055.
  + The posterior mode is 0.04.
    - Note, for `\(\mbox{Beta}(a,b)\)`, the mode is `\(\dfrac{a-1}{a+b-2}\)`.
  + The posterior probability that `\(\theta < 0.1\)` is 0.92.

---

## Cautionary tale: parameters at the boundary

- Tuyl et al. (2008) discuss potential dangers of using priors that have `\(a < 1\)` with data that are all 0's (or `\(b < 1\)` with all 1's). They consider data on adverse reactions to a new radiological contrast agent.

--

- Let `\(\theta_N\)`: probability of a bad reaction using the new agent.

--

- The current standard agent causes bad reactions about 15 times in 10000, so one might think 0.0015 is a good guess for `\(\theta_N\)`.

--

- How do we choose a prior distribution?

---

## Potential prior distributions

- One might consider a variety of choices centered on `\(15/10000 = 0.0015\)`:
  + Prior 1: .hlight[Beta(1,666)] (mean 0.0015; 1 prior bad reaction in 667 administrations)
  + Prior 2: .hlight[Beta(0.05,33.33)] (mean 0.0015; 0.05 prior bad reactions in 33.38 administrations)
  + Prior 3: .hlight[Beta(1.6,407.4)] (mode 0.0015; 409 prior administrations)
  + Prior 4: .hlight[Beta(1.05,497)] (median 0.0015; 498.05 prior administrations)

--

- Does it matter which one we choose?

---

## Potential prior distributions

<img src="03-one-parameter-models-II_files/figure-html/unnamed-chunk-5-1.png" style="display: block; margin: auto;" />

---

## Potential prior distributions

<img src="03-one-parameter-models-II_files/figure-html/unnamed-chunk-6-1.png" style="display: block; margin: auto;" />

---

## Potential prior distributions

- Let's take a closer look at properties of these four prior distributions, concentrating on the probability that `\(\theta_N < 0.0015\)`.

--

- That is, the probability that the new agent is not more dangerous than the old agent.

<br />

|                      | Be(1,666) | Be(0.05,33.33) | Be(1.6,407.4) | Be(1.05,497) |
| :------------------- | :-------: | :------------: | :-----------: | :----------: |
| Prior prob           |   0.632   |     0.882      |     0.222     |    0.500     |
| Post prob (0 events) |   0.683   |     0.939      |     0.289     |    0.568     |
| Post prob (1 event)  |   0.319   |     0.162      |     0.074     |    0.213     |

--

- Suppose we have a single-arm study of 100 subjects.

--

- Consider the two most likely potential outcomes, which give the posterior probabilities in the table (a reproduction sketch follows on the next slide):
  + 0 adverse outcomes observed
  + 1 adverse outcome observed
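---

## Potential prior distributions

- A minimal sketch of how the table can be reproduced in R, assuming the single-arm study of 100 subjects; the Beta updates follow the conjugacy result from earlier:

```r
# Pr(theta_N < 0.0015) under each prior, and under the posteriors
# after 0 or 1 adverse events in a study of n = 100 subjects
priors <- list(c(1, 666), c(0.05, 33.33), c(1.6, 407.4), c(1.05, 497))
n <- 100
for (prior in priors) {
  a <- prior[1]; b <- prior[2]
  prior_prob <- pbeta(0.0015, a, b)           # prior probability
  post0 <- pbeta(0.0015, a, b + n)            # posterior, 0 events
  post1 <- pbeta(0.0015, a + 1, b + n - 1)    # posterior, 1 event
  cat(sprintf("Beta(%g, %g): %.3f  %.3f  %.3f\n",
              a, b, prior_prob, post0, post1))
}
```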
---

## Problems with the priors

- After just 100 trials with no bad reactions, the low-weight prior (33.38 prior observations) indicates one should be 94% sure the new agent is as safe as (or safer than) the old one.

--

- The low-weight prior largely assumes the conclusion we actually hope for (that the new agent is safer), so it takes very little confirmatory data to reach that conclusion.

--

- Is this the behavior we want?

--

- Take-home message: be very careful with priors that have `\(a < 1\)` with data that are all 0's (or `\(b < 1\)` with all 1's).

--

- Given that we know the adverse event rate should be small, we might try a restricted prior, e.g., Unif(0,0.1).

--

- In all cases, how many trials would we need, assuming no adverse reactions, to be 95% sure that the new agent is as safe as (or safer than) the old one? (Homework question!)

---

## Marginal likelihood

- Recall that the .hlight[marginal likelihood] is
.block[
.small[
`$$L(y) = \int L(y; \theta)p(\theta)\textrm{d}\theta.$$`
]
]

- What is the marginal likelihood for the Beta-binomial?

--

- We have
.block[
.small[
$$
`\begin{aligned}
L(y) & = \int L(y; \theta)p(\theta)\textrm{d}\theta \\
& = \int_0^1 {n \choose y} \theta^y(1-\theta)^{n-y} \frac{1}{B(a,b)}\theta^{a-1}(1-\theta)^{b-1} d\theta\\
& = {n \choose y} \frac{B(a + y,\, b + n-y)}{B(a,b)},
\end{aligned}`
$$
]
]

  by the integral definition of the Beta function.

--

- Deriving the marginal likelihood for conjugate distributions is often relatively straightforward.

---

## Posterior predictive distribution

- Let's go back to Bernoulli data. Suppose `\(y_1,\ldots,y_n \overset{iid}{\sim} \textrm{Bernoulli}(\theta)\)`.

--

- We may wish to predict a new data point `\(y_{n+1}\)`.

--

- We can do so using the posterior predictive distribution `\(p(y_{n+1}|y_{1:n})\)`.

--

- <div class="question">
Why are we not including the parameter in the posterior predictive distribution?
</div>

--

- Recall that we have conditional independence of the `\(y\)`'s given `\(\theta\)`.

--

- Generally,
.block[
.small[
$$
`\begin{aligned}
p(y_{n+1}|y_{1:n}) &= \int p(y_{n+1},\theta|y_{1:n})\,d\theta\\
&= \int p(y_{n+1}|\theta,y_{1:n})\, p(\theta|y_{1:n})\,d\theta\\
& = \int p(y_{n+1}|\theta)\, p(\theta|y_{1:n})\,d\theta.
\end{aligned}`
$$
]
]

---

## Posterior predictive distribution

- When we talk about the posterior predictive distribution for Bernoulli data, we are really referring to `\(p(y_{n+1} = 1|y_{1:n})\)` and `\(p(y_{n+1} = 0|y_{1:n})\)`.

--

- Then,
.block[
.small[
$$
`\begin{aligned}
p(y_{n+1}=1|y_{1:n}) &= \int p(y_{n+1}= 1|\theta)\, p(\theta|y_{1:n})\,d\theta \\
&= ... \\
&= ...
\end{aligned}`
$$
]
]

  <div class="question">
which simplifies to what? In class!
</div>

--

- What then is `\(p(y_{n+1} = 0|y_{1:n})\)`?

--

- The posterior predictive pmf therefore takes the form
.block[
.small[
`$$p(y_{n+1}|y_{1:n}) = \dfrac{a_n^{y_{n+1}} b_n^{1-y_{n+1}}}{a_n + b_n}; \ \ \ y_{n+1}=0,1.$$`
]
]

- What are `\(a_n\)` and `\(b_n\)`? (A sketch for checking your answer follows on the next slide.)
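---

## Posterior predictive distribution

- A minimal Monte Carlo sketch for checking the in-class derivation. The values below (a `\(\textrm{Beta}(1,1)\)` prior, `\(n = 10\)` trials with 3 successes) are purely illustrative, not from the example:

```r
# Approximate p(y_{n+1} = 1 | y_{1:n}) by simulation:
# draw theta from the posterior Beta(a + sum(y), b + n - sum(y)),
# then draw y_{n+1} ~ Bernoulli(theta), and average
set.seed(1)
a <- 1; b <- 1; n <- 10; sum_y <- 3          # illustrative values
theta <- rbeta(1e5, a + sum_y, b + n - sum_y)
y_new <- rbinom(1e5, 1, theta)
mean(y_new)   # compare with your closed-form answer
```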
---

## Priors with restricted support

- As we have seen, when dealing with rare events, we might expect the true proportion to be very small.

--

- In that case, we might want to try a restricted prior, e.g., Unif(0,0.1).

--

- Even when we don't have rare events, we might still desire truncation if we are certain the true proportion lies within `\((a,b)\)` with `\(0 < a < b < 1\)`.

--

- It is therefore often really useful to incorporate truncation.

--

- Let `\(\theta =\)` probability of a randomly-selected student making an `\(A\)` in this course.

--

- You may want to rule out very low and very high values; perhaps `\(\theta \in [0.35, 0.6]\)` with probability one.

--

- How do we choose a prior restricted to this interval?

---

## Uniform priors

- One possibility is to just choose a uniform prior.

--

- When the parameter `\(\theta\)` is a probability, the typical uniform prior corresponds to `\(\textrm{Beta}(1,1)\)`.

--

- This is uniform on the entire (0,1) interval.

--

- However, we can just as easily choose a uniform prior on a narrower interval, Unif(a,b) with `\(0 < a < b < 1\)`.

--

- Perhaps not flexible enough.

--

- It would be nice if we could pick a flexible beta density and then truncate it to `\((a,b)\)`.

---

## Truncated random variables

- Suppose we have some arbitrary random variable `\(\theta \sim f\)` with support `\(\Theta\)`.

--

- For example, `\(\theta \sim \textrm{Beta}(c,d)\)` has support on (0,1).

--

- Then, we can modify the density `\(f(\theta)\)` to have support on a sub-interval `\([a,b] \subset \Theta\)`.

--

- The density `\(f(\theta)\)` **truncated** to `\([a,b]\)` is
.block[
.small[
`$$f_{[a,b]}(\theta) = \dfrac{f(\theta)\mathbb{1}[\theta \in [a,b]]}{\int^b_a f(\theta^\star)\textrm{d}\theta^\star},$$`
]
]

  with `\(\mathbb{1}[A]\)` being the indicator function that returns 1 if A is true and 0 otherwise.

---

## Truncated beta density

- Suppose that, to characterize the prior probability of earning an A, you poll a sample of students from a former STA 602 course and find that 10 earned an A and 10 earned a B (or lower).

--

- Therefore, you go with a `\(\textrm{Beta}(10,10)\)` prior truncated to [0.35, 0.6].

--

- In R, we can calculate the truncated beta density at `p` via

```r
p  <- seq(0, 1, length = 1000)
f1 <- dbeta(p, 10, 10)                      # untruncated Beta(10,10) density
# truncate to (0.35, 0.6): zero out the density outside and renormalize
f2 <- dbeta(p, 10, 10) * as.numeric(p > 0.35 & p < 0.6) /
  (pbeta(0.6, 10, 10) - pbeta(0.35, 10, 10))
f3 <- dunif(p, 0.35, 0.6)                   # Unif(0.35, 0.6) density
plot(p, f2, type = 'l', col = 'green4', xlim = c(0, 1), ylab = 'Density',
     xlab = expression(theta), ylim = c(0, 6))
lines(p, f1, type = 'l', col = 'blue')
lines(p, f3, type = 'l', col = 'red4')
labels <- c("beta(10,10)", "truncated beta", "unif(0.35,.6)")
legend("topright", inset = .05, labels, lwd = 2, lty = c(1, 1, 1),
       col = c('blue', 'green4', 'red4'))
```

---

## Truncated beta density

What would that look like?

<img src="03-one-parameter-models-II_files/figure-html/unnamed-chunk-8-1.png" style="display: block; margin: auto;" />

---

## Truncated beta density

The truncated density by itself would look like

<img src="03-one-parameter-models-II_files/figure-html/unnamed-chunk-9-1.png" style="display: block; margin: auto;" />

---

## The inverse cdf method

- How do we sample truncated random variables?

--

- First, start with the pdf of an untruncated distribution, such as `\(\theta \sim \textrm{Beta}(c,d)\)`.

--

- Suppose we then want to sample `\(\theta \sim \textrm{Beta}_{[a,b]}(c,d)\)`. How can we do that? One popular method is the .hlight[inverse-cdf method].

--

- The inverse cdf is useful for generating random variables in general, and especially for generating truncated random variables.

--

- Suppose we have `\(\theta \sim f\)`, for some arbitrary continuous density `\(f\)`.

--

- According to the probability integral transform, for any continuous random variable `\(\theta\)`, the random variable `\(U = F(\theta)\)` has a Unif(0,1) distribution. Note that `\(F\)` is the cdf.

--

- Thus, to use the inverse-cdf method to sample `\(\theta \sim f\)`, first sample `\(u \sim \textrm{Unif}(0,1)\)`, then set `\(\theta = F^{-1}(u)\)`.
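---

## The inverse cdf method

- A quick simulation sketch of the probability integral transform, using `\(\textrm{Beta}(10,10)\)` purely as an illustration (any continuous distribution works): applying the cdf to draws from the distribution should give something that looks Unif(0,1).

```r
# If theta ~ Beta(10, 10), then U = F(theta) should be Unif(0, 1)
set.seed(42)
theta <- rbeta(1e4, 10, 10)   # draws from Beta(10, 10)
u <- pbeta(theta, 10, 10)     # apply the cdf F
hist(u, freq = FALSE)         # roughly flat on (0, 1)
abline(h = 1, col = 'red')    # the Unif(0, 1) density
```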
---

## The inverse cdf method

- As an example, suppose we want to sample `\(\theta \sim \textrm{Beta}(c,d)\)` through the inverse cdf method.

--

- Very easy. Just do the following in R (for chosen values of `c` and `d`).

```r
u <- runif(1, 0, 1)       # sample from Unif(0, 1)
theta <- qbeta(u, c, d)   # apply the inverse cdf of Beta(c, d)
```

--

- That is, first sample from a uniform distribution.

--

- Then, transform it using the inverse cdf of the `\(\textrm{Beta}(c,d)\)` distribution.

--

- Voila!

---

## The inverse cdf method

- Back to the original problem: how do we sample `\(\theta \sim \textrm{Beta}_{[a,b]}(c,d)\)`?

--

- If we had the inverse cdf of `\(\textrm{Beta}(c,d)\)` truncated to `\([a,b]\)`, then we could use the inverse cdf method. Easy enough! Let's find that inverse cdf.

--

- Let `\(f\)`, `\(F\)` and `\(F^{-1}\)` denote the pdf, cdf and inverse cdf without truncation, and let `\(A=[a,b]\)`.

--

- Recall that the density `\(f(\theta)\)` **truncated** to `\([a,b]\)` is
.block[
.small[
`$$f_{A}(\theta) = f_{[a,b]}(\theta) = \dfrac{f(\theta)\mathbb{1}[\theta \in [a,b]]}{\int^b_a f(\theta^\star)\textrm{d}\theta^\star} = \dfrac{f(\theta)\mathbb{1}[\theta \in [a,b]]}{F(b) - F(a)}.$$`
]
]

--

- Therefore, for `\(z \in [a,b]\)`, the truncated cdf is
.block[
.small[
`$$F_{A}(z) = \Pr[\theta \leq z] = \dfrac{F(z) - F(a)}{F(b) - F(a)}.$$`
]
]

--

- Not enough though. We need the truncated inverse cdf.

---

## The inverse cdf method

- To find the inverse cdf `\(F^{-1}_A(u)\)`, let `\(F_{A}(z) = u\)`. That is, set
.block[
.small[
`$$u = F_{A}(z) = \dfrac{F(z) - F(a)}{F(b) - F(a)}$$`
]
]

  and solve for `\(z\)` as a function of `\(u\)`.

--

- Re-expressing as a function of `\(F(z)\)`,
.block[
.small[
`$$F(z) = \{F(b) - F(a)\}u + F(a).$$`
]
]

--

- Applying the untruncated inverse cdf `\(F^{-1}\)` to both sides, we have
.block[
.small[
`$$z = F^{-1}[\{F(b) - F(a)\}u + F(a)] = F^{-1}_A(u).$$`
]
]

---

## The inverse cdf method

- We now have all the pieces to use the inverse-cdf method to sample `\(\theta \sim f_A\)`, that is, `\(f\)` truncated to `\(A\)`.

--

- First, draw a `\(\textrm{Unif}(0,1)\)` random variable

```r
u <- runif(1, 0, 1)
```

--

- Next, apply the linear transformation:
.block[
.small[
`$$u^\star = \{F(b) - F(a)\}u + F(a).$$`
]
]

--

- Finally, plug `\(u^\star\)` into the untruncated inverse cdf: `\(\theta = F^{-1}(u^\star)\)`.

--

- Note we can equivalently sample `\(u^\star \sim \textrm{Unif}(F(a), F(b))\)`, i.e., `runif(1, F(a), F(b))` in R. A full sketch putting the steps together follows.
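---

## The inverse cdf method

- Putting the pieces together, a minimal sketch sampling from the `\(\textrm{Beta}(10,10)\)` prior truncated to [0.35, 0.6] from the earlier slides:

```r
# Sample theta ~ Beta(10, 10) truncated to [0.35, 0.6]
# via the inverse-cdf method
a <- 0.35; b <- 0.6
Fa <- pbeta(a, 10, 10)               # F(a)
Fb <- pbeta(b, 10, 10)               # F(b)
u <- runif(1e4, 0, 1)
u_star <- (Fb - Fa) * u + Fa         # linear transformation
theta <- qbeta(u_star, 10, 10)       # untruncated inverse cdf
# equivalently: theta <- qbeta(runif(1e4, Fa, Fb), 10, 10)
range(theta)                         # all draws lie in [0.35, 0.6]
hist(theta, freq = FALSE)            # matches the truncated density
```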