class: center, middle, inverse, title-slide

# Probability review and one parameter models

### Dr. Olanrewaju Michael Akande

### Jan 15, 2020

---

## Announcements

- No make-up for Monday's lab.

- Final exam will be either online or take home. Not in class.

- Homework 1 coming soon...but here are some readings to keep you busy:
  1. Efron, B., 1986. Why isn't everyone a Bayesian?. The American Statistician, 40(1), pp. 1-5.
  2. Gelman, A., 2008. Objections to Bayesian statistics. Bayesian Analysis, 3(3), pp. 445-449.
  3. Diaconis, P., 1977. Finite forms of de Finetti's theorem on exchangeability. Synthese, 36(2), pp. 271-281.
  4. Gelman, A., Meng, X. L. and Stern, H., 1996. Posterior predictive assessment of model fitness via realized discrepancies. Statistica Sinica, pp. 733-760.
  5. Dunson, D. B., 2018. Statistics in the big data era: Failures of the machine. Statistics & Probability Letters, 136, pp. 4-9.

---

## Outline

- Probability review
  + Random variables
  + Joint distributions
  + Independence and exchangeability

- Introduction to Bayesian Inference (Cont'd)
  + Conjugacy
  + Kernels
  + Bernoulli and binomial data
  + Selecting priors
  + Truncated priors

---

class: center, middle

# Probability review

---

## Discrete random variables

- A .hlight[random variable] is .hlight[discrete] if the set of all possible outcomes is .hlight[countable].

--

- The .hlight[probability mass function (pmf)] of a discrete random variable `\(Y\)`, `\(p(y)\)`, describes the probability associated with each possible value of `\(Y\)`.

--

- `\(p(y)\)` has the following properties:
  1. `\(0 \leq p(y) \leq 1\)` for all values `\(y \in \mathcal{Y}\)`.
  2. `\(\sum_{y \in \mathcal{Y}} p(y) = 1\)`.

---

## Bernoulli distribution

- The .hlight[Bernoulli distribution] can be used to describe an experiment with two outcomes, such as
  + Flipping a coin (heads or tails);
  + Vote turnout (vote or not); and
  + The outcome of a basketball game (win or loss).

--

- In all cases, we can represent this as a binary random variable where the probability of "success" is `\(\theta\)` and the probability of "failure" is `\(1-\theta\)`.

--

- We usually write this as: `\(Y \sim \textrm{Bernoulli}(\theta)\)`, where `\(\theta \in [0,1]\)`.

--

- It follows that
.block[
.small[
`$$\Pr(Y=y|\theta) = \theta^y(1-\theta)^{1-y}; \ \ \ y=0,1.$$`
]
]

--

- <div class="question"> What is the mean of this distribution? What is the variance? </div>

---

## Binomial distribution

- The .hlight[binomial distribution] describes the number of successes from `\(n\)` independent Bernoulli trials.

--

- That is, `\(Y =\)` number of "successes" in `\(n\)` independent trials and `\(\theta\)` is the probability of success per trial.

--

- We usually write this as: `\(Y \sim \textrm{Bin}(n,\theta)\)`, where `\(\theta \in [0,1]\)`.

--

- The pmf is
.block[
.small[
`$$\Pr(Y=y|\theta,n) = {n \choose y} \theta^y(1-\theta)^{n-y}; \ \ \ y=0,1,\ldots,n.$$`
]
]

--

- **Example**: `\(Y =\)` number of individuals with type I diabetes out of a sample of `\(n\)` surveyed.

--

- Binomial likelihoods are commonly used when collecting data on proportions.

--

- <div class="question"> What is the mean of this distribution? What is the variance? </div>
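--

- As a quick numerical check of your answers (a minimal sketch in base R; the values of `n` and `theta` below are just illustrative):

```r
# Simulate binomial draws and compare the sample mean/variance with the
# closed-form answers n*theta and n*theta*(1-theta)
set.seed(42)
n <- 20; theta <- 0.3
y <- rbinom(1e5, size = n, prob = theta)
c(mean = mean(y), var = var(y))          # should be close to 6 and 4.2
c(n * theta, n * theta * (1 - theta))    # exact values for comparison
```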
---

## Poisson distribution

- `\(Y \sim \textrm{Po}(\theta)\)` denotes that `\(Y\)` is a .hlight[Poisson random variable].

--

- The Poisson distribution is commonly used to model count data consisting of the number of events in a given time interval.

--

- The Poisson distribution is parameterized by `\(\theta\)` and the pmf is given by
.block[
.small[
`$$\Pr[Y = y | \theta] = \dfrac{\theta^y e^{-\theta}}{y!}; \ \ \ \ y=0,1,2,\ldots; \ \ \ \ \theta > 0.$$`
]
]

--

- Similar to the binomial but with no limit on the total number of counts.

--

- <div class="question"> What is the mean of this distribution? What is the variance? </div>

---

## General discrete distributions

- Useful to consider general discrete distributions having an arbitrary form.

--

- Suppose `\(Y \in \{y_1^\star,\ldots,y_k^\star\}\)`. Then define `\(\Pr(Y = y_h^\star) = \pi_h\)` for each `\(h = 1,\ldots, k\)`. That is,
.block[
.small[
`$$\Pr[Y = y| \boldsymbol{\pi}] = \prod_h \pi_h^{\mathbb{1}[y = y_h^\star]}; \ \ y \in \{y_1^\star,\ldots,y_k^\star\},$$`
]
]

  where `\(\boldsymbol{\pi} = (\pi_1,\ldots,\pi_k)\)`.

--

- `\((y_1^\star,\ldots,y_k^\star)\)` are "atoms" representing possible values for `\(Y\)`.

--

- For example, these may be words in a dictionary or values for education as a categorical variable. Useful for text data, categorical observations, etc.

--

- Can also write this as `\(Y \sim \sum^k_{h=1} \pi_h \delta_{y_h^\star}\)`, where `\(\delta_{y_h^\star}\)` denotes a unit mass at `\(y_h^\star\)`.

--

- Often called the .hlight[categorical distribution] or .hlight[generalized Bernoulli distribution]. Also, see the .hlight[multinomial distribution].

---

## Continuous random variables

- The .hlight[probability density function (pdf)], `\(p(y)\)` or `\(f(y)\)`, of a continuous random variable `\(Y\)` has slightly different properties:
  1. `\(0 \leq p(y)\)` for all `\(y \in \mathcal{Y}\)`.
  2. `\(\int_{y \in \mathbb{R}} p(y) \textrm{d}y = 1\)`.

--

- The pdf for a continuous random variable is not necessarily less than 1.

--

- Also, `\(p(y)\)` is NOT the probability of value `\(y\)`.

--

- However, if `\(p(y_1) > p(y_2)\)`, we say informally that `\(y_1\)` has a "higher probability" than `\(y_2\)`.

---

## Uniform density

- The simplest example of a continuous density is the .hlight[uniform density].

--

- `\(Y \sim \textrm{Unif}(a,b)\)` denotes that the density is uniform on the interval `\((a,b)\)`.

--

- The pdf is simply
.block[
.small[
`$$f(y) = \dfrac{1}{b-a}; \ \ \ y \in (a,b).$$`
]
]

--

- The cdf is
.block[
.small[
`$$F(y) = \Pr(Y \leq y) = \int^y_a \dfrac{1}{b-a} \textrm{d}z = \dfrac{y-a}{b-a}.$$`
]
]

--

- The mean (expectation) is
.block[
.small[
`$$\dfrac{a+b}{2}.$$`
]
]

- <div class="question"> What is the variance? Also, can you prove the formula for the mean? </div>

---

## Beta density

- The uniform density can be used as a prior for a probability if `\((a,b) \subset (0,1)\)`.

--

- However, it is clearly very inflexible. <div class="question"> Why? </div>

--

- An alternative for `\(y \in (0,1)\)` is the .hlight[beta density], written as `\(Y \sim \textrm{Beta}(a,b)\)`, with
.block[
.small[
`$$f(y) = \frac{1}{B(a,b)} y^{a-1} (1-y)^{b-1}; \ \ \ y \in (0,1), \ a > 0, \ b > 0,$$`
]
]

  where `\(B(a,b) = \dfrac{\Gamma(a)\Gamma(b)}{\Gamma(a+b)}\)`, and `\(\Gamma(n) = (n-1)!\)` for any positive integer `\(n\)`.

--

- As we have already seen, the beta density is quite flexible in characterizing a broad variety of densities on `\((0,1)\)`.

--

- <div class="question"> Beta(1,1) is the same as Unif(0,1). Work out the pdfs to convince yourself! </div>
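--

- A quick check of this last claim (a minimal sketch in base R; the grid of points is just for illustration):

```r
# Beta(1,1) and Unif(0,1) assign the same density to every point in (0,1)
y <- seq(0.01, 0.99, by = 0.01)
all.equal(dbeta(y, 1, 1), dunif(y, min = 0, max = 1))  # TRUE: both densities equal 1
```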
---

## Gamma density

- The .hlight[gamma density] will be useful as a prior for parameters that are strictly positive.

--

- For a random variable `\(Y \sim \textrm{Ga}(a,b)\)`, we have the pdf
.block[
.small[
`$$f(y) = \frac{b^a}{\Gamma(a)} y^{a-1}e^{-by}; \ \ \ y \in (0,\infty), \ a > 0, \ b > 0.$$`
]
]

--

- Properties:
.block[
.small[
`$$\mathbb{E}[Y] = \dfrac{a}{b}; \ \ \mathbb{V}[Y] = \dfrac{a}{b^2}.$$`
]
]

--

- **Note**: parameterizations of the gamma distribution vary!

--

- Under this parameterization, if `\(Y \sim \textrm{Ga}(1,\theta)\)`, then `\(Y \sim \textrm{Exp}(\theta)\)`, that is, the .hlight[exponential distribution].

---

## Continuous joint distributions

- Suppose we have two random variables `\(\theta = (\theta_1,\theta_2)\)`.

--

- Their **joint** distribution function is
.block[
.small[
`$$\Pr(\theta_1 \leq a,\theta_2 \leq b) = \int^a_{-\infty} \int^b_{-\infty} p(\theta_1,\theta_2) \textrm{d}\theta_2\textrm{d}\theta_1,$$`
]
]

  where `\(p(\theta_1,\theta_2)\)` is the joint probability density function (pdf).

--

- The **marginal** density of `\(\theta_1\)` can be obtained by
.block[
.small[
`$$p(\theta_1) = \int^\infty_{-\infty} p(\theta_1,\theta_2) \textrm{d}\theta_2,$$`
]
]

  which is referred to as marginalizing out `\(\theta_2\)`.

--

- We will be doing a lot of "marginalizations" so take note!

---

## Factorizing joint densities and independence

- The joint density `\(p(\theta_1,\theta_2)\)` can be factorized as
.block[
.small[
`$$p(\theta_1,\theta_2) = p(\theta_1|\theta_2)p(\theta_2), \ \ \ \textrm{or} \ \ \ p(\theta_1,\theta_2) = p(\theta_2|\theta_1)p(\theta_1).$$`
]
]

--

- For independent random variables, the joint density equals the product of the marginals:
.block[
.small[
`$$p(\theta_1,\theta_2) = p(\theta_1)p(\theta_2).$$`
]
]

--

- This implies that `\(p(\theta_2|\theta_1) = p(\theta_2)\)` and `\(p(\theta_1|\theta_2) = p(\theta_1)\)` under independence.

--

- These relationships extend automatically to `\(\theta = (\theta_1,\ldots,\theta_p)\)`. That is,
.block[
.small[
`$$p(\theta_1,\ldots,\theta_p) = \prod^p_{j=1} p(\theta_j),$$`
]
]

  under mutual independence of the elements of the `\(\theta\)` vector.

---

## Conditional independence

- Suppose `\(y_i \overset{iid}{\sim} f(\theta)\)` for `\(i = 1,\ldots,n\)`.

--

- Data `\(\{y_i\}\)` are independent & identically distributed draws from distribution `\(f(\theta)\)`.

--

- The data are said to be .hlight[conditionally independent] given `\(\theta\)`.
.block[
.small[
`$$L(y; \theta) = \prod^n_{i=1} f(y_i; \theta),$$`
]
]

  where `\(L(y; \theta) =\)` likelihood of the data conditionally on `\(\theta\)`.

--

- The .hlight[marginal likelihood] of the data is
.block[
.small[
`$$L(y) = \int L(y; \theta)p(\theta)\textrm{d}\theta.$$`
]
]

--

- `\(L(y)\)` can no longer be written as a product of densities as in `\(\prod^n_{i=1} h(y_i)\)`; we lose independence when we marginalize out `\(\theta\)`.

---

## Exchangeability

- In marginalizing out `\(\theta\)`, the observations `\(\{y_i\}\)` are no longer independent.

--

- `\(\{y_i\}\)` are .hlight[exchangeable] if `\(p(y_1,\ldots,y_n) = p(y_{\pi_1},\ldots,y_{\pi_n})\)`, for all permutations `\(\pi\)` of `\(\{1,\ldots,n\}\)`.

--

- .hlight[de Finetti's Theorem]: Suppose `\(\{y_i\}\)` are exchangeable under the above definition for any `\(n\)`. Then
.block[
.small[
`$$p(y_1,\ldots,y_n) = \int \left[ \prod^n_{i=1} f(y_i; \theta) \right] p(\theta)\textrm{d}\theta,$$`
]
]

  for some `\(\theta\)`, prior distribution `\(p(\theta)\)` and sampling model `\(f(y_i;\theta)\)`.

--

- Simply put, de Finetti's Theorem states that exchangeable observations are conditionally independent given some parameter.
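--

- A small simulation of this idea (a sketch only; the Beta(2,2) prior and Bernoulli draws below are illustrative choices, not part of the theorem):

```r
# Draws that are conditionally iid given theta are exchangeable,
# but marginally (with theta integrated out) they are dependent.
set.seed(1)
n_rep <- 1e5
theta <- rbeta(n_rep, 2, 2)      # one draw of theta per replicate
y1 <- rbinom(n_rep, 1, theta)    # two observations sharing the same theta
y2 <- rbinom(n_rep, 1, theta)
cor(y1, y2)                          # clearly positive: not marginally independent
c(mean(y1 == 1 & y2 == 0),           # these two probabilities match,
  mean(y1 == 0 & y2 == 1))           # consistent with exchangeability
```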
--

- de Finetti's Theorem is critical in providing a motivation for using parameters and for putting priors on parameters.

---

class: center, middle

# Introduction to Bayesian Inference (Cont'd)

---

## Frequentist inference

- Given **data** `\(\{y_i\}\)` and an **unknown parameter** `\(\theta\)`, estimate said `\(\theta\)`.

--

- How to estimate `\(\theta\)` under the frequentist paradigm?
  + Maximum likelihood estimate (MLE)
  + Method of moments
  + and so on...

--

- Frequentist ML estimation finds the one value of `\(\theta\)` that maximizes the likelihood.

--

- Typically uses large sample (asymptotic) theory to obtain confidence intervals and do hypothesis testing.

---

## Bayesian inference

- Once again, given **data** `\(\{y_i\}\)` and an **unknown parameter** `\(\theta\)`, estimate said `\(\theta\)`.

--

- Bayesians update their prior information for `\(\theta\)` with the information in the data `\(\{y_i\}\)`, to obtain the posterior density `\(p(\theta | y)\)`.

--

- Personally, I prefer being able to obtain posterior densities that describe my parameter, instead of estimated summaries (usually measures of central tendency) along with confidence intervals.

--

- Bayes' theorem - reminder:
.block[
.small[
`$$p(\theta | y) = \frac{p(\theta)L(y;\theta)}{\int_{\Theta}p(\tilde{\theta})L(y; \tilde{\theta}) \textrm{d}\tilde{\theta}} = \frac{p(\theta)L(y;\theta)}{L(y)}$$`
]
]

---

## Comments on the posterior density

- The posterior density is more concentrated than the prior & quantifies learning about `\(\theta\)`.

--

- In fact, this is the optimal way to learn from data - see the discussion in Hoff chapter 1.

--

- As more & more data become available, the posterior density will converge to a normal (Gaussian) density centered on the MLE (Bayes central limit theorem).

--

- In finite samples with limited data, the posterior can be highly skewed & noticeably non-Gaussian.

---

## Conjugacy

- Starting with an arbitrary prior density `\(p(\theta)\)` & likelihood `\(L(y;\theta)\)`, we may encounter problems in calculating the posterior density `\(p(\theta | y)\)`.

--

- In particular, notice the denominator in Bayes' rule:
.block[
.small[
`$$L(y) = \int_{\Theta}p(\tilde{\theta})L(y; \tilde{\theta}) \textrm{d}\tilde{\theta}.$$`
]
]

  This integral may not be analytically tractable!

--

- When the prior is .hlight[conjugate], however, the marginal likelihood can be calculated analytically.

--

- .hlight[Conjugacy] `\(\Rightarrow\)` the posterior has the same form as the prior.

--

- Often useful to think of the hyperparameters of a conjugate prior distribution as corresponding to having observed a certain number of (historical) pseudo-observations with properties specified by the parameters.

--

- Conjugate priors make calculations easy but may not represent our prior information well.

---

## Kernels

- In Bayesian statistics, the .hlight[kernel] of a pdf omits any multipliers that do not depend on the random variable or parameter we care about.

--

- For many distributions, the kernel is in a simple form but the normalizing constant complicates calculations.

--

- If one recognizes the kernel as that of a known distribution, then the normalizing constant can be reinstated. This is a very MAJOR TRICK we will use to calculate posterior distributions.
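--

- As a small numerical illustration (a sketch only; the beta kernel and the values `a = 3`, `b = 5` are just an example of recognizing a kernel):

```r
# Integrating an unnormalized Beta kernel and comparing with the known
# normalizing constant B(a, b): recognizing the kernel lets us skip the integral.
a <- 3; b <- 5
kernel <- function(t) t^(a - 1) * (1 - t)^(b - 1)
integrate(kernel, lower = 0, upper = 1)$value   # numerical value of the integral
beta(a, b)                                      # B(a, b); matches the integral
```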
--

- For example, the normal density is given by
.block[
.small[
`$$p(y|\mu,\sigma^2) = \dfrac{1}{\sqrt{2\pi\sigma^2}}e^{-\dfrac{(y-\mu)^2}{2\sigma^2}},$$`
]
]

  but the kernel is just
.block[
.small[
`$$p(y|\mu,\sigma^2) \propto e^{-\dfrac{(y-\mu)^2}{2\sigma^2}}.$$`
]
]

---

## Bernoulli data

- Back to our example: suppose `\(\theta \in (0,1)\)` is the population proportion of individuals with diabetes in the US.

--

- Suppose we take a sample of `\(n\)` individuals and record whether or not they have diabetes (as binary: 0,1).

--

- Then we can use the Bernoulli distribution as the sampling distribution. Also, we already established that we can use a beta prior for `\(\theta\)`.

---

## Bernoulli data

- Generally, it turns out that if
  + `\(f(y_i; \theta): y_i \overset{iid}{\sim} \textrm{Bernoulli}(\theta)\)` for `\(i = 1,\ldots,n\)`, and
  + `\(p(\theta): \theta \sim \textrm{Beta}(a,b)\)`,

  then the posterior distribution is also a beta distribution.

--

- <div class="question"> Can we derive the posterior distribution and its parameters? Let's do some work on the board! </div>

--

- Updating a beta prior with a Bernoulli likelihood leads to a beta posterior - we have conjugacy!

--

- Specifically, we have
.block[
.small[
`$$p(\theta | \{y_i\}): \theta | \{y_i\} \sim \textrm{Beta}(a+\sum y_i,b+n-\sum y_i).$$`
]
]

--

- This is the .hlight[beta-Bernoulli model]. More generally, this is just the .hlight[beta-binomial model].

---

## Beta-binomial in more detail

- Suppose the likelihood of the data is
.block[
.small[
`$$L(y; \theta) = {n \choose y} \theta^y(1-\theta)^{n-y}.$$`
]
]

--

- Suppose also that we have a `\(\textrm{Beta}(a,b)\)` prior on the probability `\(\theta\)`.

--

- Then the posterior density has the beta form
.block[
.small[
`$$\pi(\theta | y) = \textrm{Beta}(a+y,b+n-y).$$`
]
]

--

- The posterior has expectation
.block[
.small[
`$$\mathbb{E}(\theta | y) = \dfrac{a+y}{a+b+n} = \dfrac{a+b}{a+b+n} \times \textrm{prior mean} + \dfrac{n}{a+b+n} \times \textrm{sample mean}.$$`
]
]

--

- For this specification, `\(a\)` and `\(b\)` are sometimes interpreted as "prior data", with `\(a\)` interpreted as the prior number of 1's, `\(b\)` as the prior number of 0's, and `\(a + b\)` as the prior sample size.

--

- As we get more and more data, the majority of our information about `\(\theta\)` comes from the data as opposed to the prior.

---

## Binomial data

- For example, suppose you want to find the Bayesian estimate of the probability `\(\theta\)` that a coin comes up heads.

--

- Before you see the data, you express your uncertainty about `\(\theta\)` through the prior `\(p(\theta) = \textrm{Beta}(2,2)\)`.

--

- Now suppose you observe 10 tosses, of which only 1 was heads.

--

- Then, the posterior density `\(p(\theta \,|\, y, n)\)` is `\(\mbox{Beta}(3, 11)\)`.

---

## Binomial data

- Recall that the mean of `\(\mbox{Beta}(a,b)\)` is `\(a/(a+b)\)`.

--

- That means, before you saw the data, you thought the mean for `\(\theta\)` was 2/(2+2) = 0.5.

--

- However, after seeing the data, you believe it is 3/(3+11) = 0.214.

--

- The variance of `\(\mbox{Beta}(a,b)\)` is `\(ab/[(a+b)^2(a+b+1)]\)`.

--

- So before you saw the data, your uncertainty about `\(\theta\)` (i.e., your standard deviation) was `\(\sqrt{4/[4^2 \times 5]} = 0.22\)`.

--

- However, after seeing 1 head in 10 tosses, your uncertainty is 0.106.

--

- Clearly, as the number of tosses goes to infinity, your uncertainty goes to zero.
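---

## Binomial data

- The numbers on the previous two slides are easy to reproduce (a minimal sketch in base R; the helper function `beta_sd` is just for convenience and not part of any package):

```r
# Beta(2,2) prior, 1 head in 10 tosses -> Beta(3, 11) posterior
a <- 2; b <- 2; n <- 10; y <- 1
a_post <- a + y; b_post <- b + n - y
c(prior_mean = a / (a + b), post_mean = a_post / (a_post + b_post))   # 0.5 and 0.214

beta_sd <- function(a, b) sqrt(a * b / ((a + b)^2 * (a + b + 1)))
c(prior_sd = beta_sd(a, b), post_sd = beta_sd(a_post, b_post))        # 0.22 and 0.106
```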
---

## Operationalizing data analysis

We will explore another example soon, but first, how should we approach data analysis in general?

--

- .hlight[Step 1]. State the question.

--

- .hlight[Step 2]. Collect the data.

--

- .hlight[Step 3]. Explore the data.

--

- .hlight[Step 4]. Formulate and state a modeling framework.

--

- .hlight[Step 5]. Check your models.

--

- .hlight[Step 6]. Answer the question.

---

## Example: rare events

- .hlight[Step 1]. State the question:
  + What is the prevalence of an infectious disease in a small city?
  + Why? High prevalence means more public health precautions are recommended.

--

- .hlight[Step 2]. Collect the data:
  + Suppose you collect a small random sample of 20 individuals.

--

- .hlight[Step 3]. Explore the data:
  + Let `\(Y\)` denote the unknown number of infected individuals in the sample.

---

## Example: rare events

- .hlight[Step 4]. Formulate and state a modeling framework:
  + Parameter of interest: `\(\theta\)` is the fraction of infected individuals in the city.
  + Sampling model: a reasonable model for `\(Y\)` can be `\(\textrm{Bin}(20,\theta)\)`.

<img src="img/binomial_histograms.png" width="105" height="370px" style="display: block; margin: auto;" />

---

## Example: rare events

- .hlight[Step 4]. Formulate and state a modeling framework:
  + Prior specification: information from previous studies suggests the infection rate in “comparable cities” ranges from 0.05 to 0.20, with an average of 0.10. So maybe a standard deviation of roughly 0.05?
  + What is a good prior? The **expected value** of `\(\theta\)` close to 0.10 and the **standard deviation** close to 0.05.
  + Possible option: `\(\mbox{Beta}(3.5,31.5)\)` or maybe even `\(\mbox{Beta}(3,32)\)`?

<img src="02-one-parameter-models-I_files/figure-html/unnamed-chunk-3-1.png" style="display: block; margin: auto;" />

---

## Example: rare events

- .hlight[Step 4]. Formulate and state a modeling framework:
  + Under `\(\mbox{Beta}(3,32)\)`, `\(\Pr(\theta < 0.1) \approx 0.67\)`.
  + Posterior distribution for the model: `\((\theta | Y=y) = \textrm{Beta}(a+y,b+n-y)\)`.
  + Suppose `\(Y=0\)`. Then, `\((\theta | Y=y) = \textrm{Beta}(3,32+20)\)`.

<img src="02-one-parameter-models-I_files/figure-html/unnamed-chunk-4-1.png" style="display: block; margin: auto;" />

---

## Example: rare events

- .hlight[Step 5]. Check your models:
  + Compare the performance of the posterior mean and the posterior probability that `\(\theta < 0.1\)`.
  + Under `\(\mbox{Beta}(3,52)\)`,
    - `\(\Pr(\theta < 0.1 | Y=y) \approx 0.92\)`. More confidence in low values of `\(\theta\)`.
    - For `\(\mathbb{E}(\theta | Y=y)\)`, we have
.block[
.small[
$$
`\begin{split}
\mathbb{E}(\theta | y) & = \dfrac{a+y}{a+b+n} = \dfrac{3}{55} \approx 0.055.\\
\end{split}`
$$
]
]
    - Recall that the prior mean is `\(a/(a+b) \approx 0.09\)`. Thus, we can see how the prior mean contributes to the posterior mean:
.block[
.small[
$$
`\begin{split}
\mathbb{E}(\theta | y) & = \dfrac{a+b}{a+b+n} \times \textrm{prior mean} + \dfrac{n}{a+b+n} \times \textrm{sample mean}\\
& = \dfrac{a+b}{a+b+n} \times \dfrac{a}{a+b} + \dfrac{n}{a+b+n} \times \dfrac{y}{n}\\
& = \dfrac{35}{55} \times \dfrac{3}{35} + \dfrac{20}{55} \times \dfrac{0}{20} = \dfrac{3}{55} \approx 0.055.\\
\end{split}`
$$
]
]
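--

- These summaries can be reproduced directly (a minimal sketch in base R for this example; `a`, `b`, `n`, and `y` take the values used above):

```r
# Beta(3, 32) prior, y = 0 infected out of n = 20 -> Beta(3, 52) posterior
a <- 3; b <- 32; n <- 20; y <- 0
a_post <- a + y; b_post <- b + n - y
c(post_mean = a_post / (a_post + b_post),             # about 0.055
  post_mode = (a_post - 1) / (a_post + b_post - 2),   # about 0.04
  prob_below_0.1 = pbeta(0.1, a_post, b_post))        # about 0.92
```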
---

## Example: rare events

- .hlight[Step 6]. Answer the question:
  + People with low prior expectations are generally at least `\(90\%\)` certain that the infection rate is below 0.10.
  + `\(\pi(\theta | Y)\)` is to the left of `\(\pi(\theta)\)` because the observation `\(Y=0\)` provides evidence of a low value of `\(\theta\)`.
  + `\(\pi(\theta | Y)\)` is more peaked than `\(\pi(\theta)\)` because it combines the prior information with the data, and so contains more information than `\(\pi(\theta)\)` alone.
  + The posterior expectation is 0.055.
  + The posterior mode is 0.04.
    - Note: for `\(\mbox{Beta}(a,b)\)` with `\(a, b > 1\)`, the mode is `\((a-1)/(a+b-2)\)`.
  + The posterior probability that `\(\theta < 0.1\)` is 0.92.
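---

## Example: rare events

- Aside: one way to arrive at the `\(\mbox{Beta}(3.5,31.5)\)` suggestion in Step 4 is moment matching, i.e., choosing `\((a, b)\)` so that the prior mean is 0.10 and the prior standard deviation is 0.05. A minimal sketch in base R (variable names are illustrative):

```r
# Solve for (a, b) from the stated prior mean m and prior sd s:
# mean = a / (a + b) = m  and  var = m * (1 - m) / (a + b + 1) = s^2
m <- 0.10; s <- 0.05
ab <- m * (1 - m) / s^2 - 1       # this is a + b
a <- m * ab; b <- (1 - m) * ab
c(a = a, b = b)                   # 3.5 and 31.5
c(mean = a / (a + b),
  sd = sqrt(a * b / ((a + b)^2 * (a + b + 1))))   # checks: 0.10 and 0.05
```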