class: center, middle, inverse, title-slide

# The multinomial model

### Dr. Olanrewaju Michael Akande

### Feb 12, 2020

---

## Announcements

- Homework 4 due tomorrow.

--

## Outline

- Categorical data

- Dirichlet distribution

- Conjugacy

---

## Categorical data (univariate)

- Suppose
  + `\(Y \in \{1,\ldots, d\}\)`;
  + `\(\Pr(Y = j) = \theta_j\)` for each `\(j = 1,\ldots, d\)`; and
  + `\(\boldsymbol{\theta} = (\theta_1,\ldots,\theta_d)\)`.

- Then the pmf of `\(Y\)` is
.block[
.small[
`$$\Pr[Y = y | \boldsymbol{\theta}] = \prod_{j=1}^d \theta_j^{\mathbb{1}[y = j]}, \ \ \ y \in \{1,\ldots, d\}.$$`
]
]

--

- We say `\(Y\)` has a .hlight[multinomial distribution] with sample size 1, or a .hlight[categorical distribution].

--

- Write this as `\(Y | \boldsymbol{\theta} \sim \textrm{Multinomial}(1,\boldsymbol{\theta})\)` or `\(Y | \boldsymbol{\theta} \sim \textrm{Categorical}(\boldsymbol{\theta})\)`.

--

- Clearly, this is just the extension of the Bernoulli distribution to more than two categories.

---

## Dirichlet distribution

- Since the elements of the probability vector `\(\boldsymbol{\theta}\)` must be nonnegative and sum to one, the support of `\(\boldsymbol{\theta}\)` is called a .hlight[simplex].

--

- A conjugate prior for categorical/multinomial data is the .hlight[Dirichlet distribution].

--

- A random vector `\(\boldsymbol{\theta}\)` has a .hlight[Dirichlet distribution] with parameter `\(\boldsymbol{\alpha}\)` if
.block[
.small[
`$$p[\boldsymbol{\theta} | \boldsymbol{\alpha}] = \dfrac{\Gamma\left(\sum_{j=1}^d \alpha_j\right)}{\prod_{j=1}^d \Gamma(\alpha_j)} \prod_{j=1}^d \theta_j^{\alpha_j-1}, \ \ \ \alpha_j > 0 \ \textrm{ for all } \ j = 1,\ldots, d,$$`
]
]

  where `\(\boldsymbol{\alpha} = (\alpha_1,\ldots,\alpha_d)\)`, and
.block[
.small[
`$$\sum_{j=1}^d \theta_j = 1, \ \ \theta_j \geq 0 \ \textrm{ for all } \ j = 1,\ldots, d.$$`
]
]

--

- We write this as `\(\boldsymbol{\theta} \sim \textrm{Dirichlet}(\boldsymbol{\alpha}) = \textrm{Dirichlet}(\alpha_1,\ldots,\alpha_d)\)`.

--

- The Dirichlet distribution is a multivariate generalization of the .hlight[beta distribution].
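---

## Dirichlet distribution

A quick numerical sketch of the two facts above (the parameter values here are arbitrary, chosen purely for illustration): Dirichlet draws land on the simplex, and for `\(d = 2\)` the first coordinate of a `\(\textrm{Dirichlet}(a, b)\)` draw is marginally `\(\textrm{Beta}(a, b)\)`. We use `rdirichlet()` from the `gtools` package, which also appears later in these slides.

```r
library(gtools)  # for rdirichlet()
set.seed(1)

# Draws from a Dirichlet lie on the simplex: each row sums to one
theta <- rdirichlet(5000, alpha = c(2, 3, 5))
range(rowSums(theta))  # both endpoints are (numerically) 1

# With d = 2, the first coordinate is marginally Beta(a, b):
# its sample mean is close to that of direct Beta(2, 3) draws
ab <- rdirichlet(5000, alpha = c(2, 3))[, 1]
c(mean(ab), mean(rbeta(5000, 2, 3)))
```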
---

## Dirichlet distribution

- Write
.block[
.small[
`$$\alpha_0 = \sum_{j=1}^d \alpha_j \ \ \ \textrm{and} \ \ \ \alpha_j^\star = \dfrac{\alpha_j}{\alpha_0}.$$`
]
]

--

- Then we can re-write the pdf slightly as
.block[
.small[
`$$p[\boldsymbol{\theta} | \boldsymbol{\alpha}] = \dfrac{\Gamma\left(\alpha_0\right)}{\prod_{j=1}^d \Gamma(\alpha_j)} \prod_{j=1}^d \theta_j^{\alpha_j-1}, \ \ \ \alpha_j > 0 \ \textrm{ for all } \ j = 1,\ldots, d.$$`
]
]

--

- Properties:

  - .block[
.small[
`$$\mathbb{E}[\theta_j] = \alpha_j^\star;$$`
]
]

--

  - .block[
.small[
`$$\textrm{Mode}[\theta_j] = \dfrac{\alpha_j - 1}{\alpha_0 - d} \ \ (\textrm{for } \alpha_j > 1);$$`
]
]

--

  - .block[
.small[
`$$\mathbb{Var}[\theta_j] = \dfrac{\alpha_j^\star(1-\alpha_j^\star)}{\alpha_0 + 1} = \dfrac{\mathbb{E}[\theta_j](1-\mathbb{E}[\theta_j])}{\alpha_0 + 1};$$`
]
]

--

  - .block[
.small[
`$$\mathbb{Cov}[\theta_j,\theta_k] = -\dfrac{\alpha_j^\star\alpha_k^\star}{\alpha_0 + 1} = -\dfrac{\mathbb{E}[\theta_j]\mathbb{E}[\theta_k]}{\alpha_0 + 1}, \ \ j \neq k.$$`
]
]

---

## Dirichlet examples

`\(\textrm{Dirichlet}(1,1,1)\)`

<img src="10-multinomial-model_files/figure-html/unnamed-chunk-2-1.png" style="display: block; margin: auto;" />

---

## Dirichlet examples

`\(\textrm{Dirichlet}(10,10,10)\)`

<img src="10-multinomial-model_files/figure-html/unnamed-chunk-3-1.png" style="display: block; margin: auto;" />

---

## Dirichlet examples

`\(\textrm{Dirichlet}(10,10,10)\)`

<img src="10-multinomial-model_files/figure-html/unnamed-chunk-4-1.png" style="display: block; margin: auto;" />

---

## Dirichlet examples

`\(\textrm{Dirichlet}(1,10,1)\)`

<img src="10-multinomial-model_files/figure-html/unnamed-chunk-5-1.png" style="display: block; margin: auto;" />

---

## Dirichlet examples

`\(\textrm{Dirichlet}(50,100,10)\)`

<img src="10-multinomial-model_files/figure-html/unnamed-chunk-6-1.png" style="display: block; margin: auto;" />

---

## Likelihood

- Let `\(Y_1, \ldots, Y_n | \boldsymbol{\theta} \sim \textrm{Categorical}(\boldsymbol{\theta})\)`, independently.

--

- Recall
.block[
.small[
`$$\Pr[Y_i = y | \boldsymbol{\theta}] = \prod_{j=1}^d \theta_j^{\mathbb{1}[y = j]}.$$`
]
]

--

- Then,
.block[
.small[
$$
`\begin{split}
L[Y; \boldsymbol{\theta}] & = \prod_{i=1}^n \prod_{j=1}^d \theta_j^{\mathbb{1}[Y_i = j]} = \prod_{j=1}^d \theta_j^{\sum_{i=1}^n \mathbb{1}[Y_i = j]} = \prod_{j=1}^d \theta_j^{n_j}
\end{split}`
$$
]
]

  where `\(n_j = \sum_{i=1}^n \mathbb{1}[Y_i = j]\)` is just the number of individuals in category `\(j\)`.

--

- The maximum likelihood estimate of `\(\theta_j\)` is
.block[
.small[
`$$\hat{\theta}_j = \dfrac{n_j}{n}, \ \ j = 1,\ldots, d.$$`
]
]

---

## Posterior

- Set `\(\pi(\boldsymbol{\theta}) = \textrm{Dirichlet}(\alpha_1,\ldots,\alpha_d)\)`. Then
.block[
.small[
$$
`\begin{split}
\pi(\boldsymbol{\theta} | Y) & \propto L[Y; \boldsymbol{\theta}] \pi[\boldsymbol{\theta}]\\
& \propto \prod_{j=1}^d \theta_j^{n_j} \prod_{j=1}^d \theta_j^{\alpha_j-1} \\
& \propto \prod_{j=1}^d \theta_j^{\alpha_j + n_j - 1},
\end{split}`
$$
]
]

  which is the kernel of a `\(\textrm{Dirichlet}(\alpha_1 + n_1,\ldots,\alpha_d + n_d)\)` distribution.

--

- Posterior expectation:
.block[
.small[
`$$\mathbb{E}[\theta_j | Y] = \dfrac{\alpha_j + n_j}{\sum_{l=1}^d (\alpha_l + n_l)}.$$`
]
]
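---

## Posterior

A minimal numerical sketch of the conjugate update (the prior parameters and category counts below are hypothetical, chosen just for illustration): the posterior parameters are simply `\(\alpha_j + n_j\)`, and the formula for `\(\mathbb{E}[\theta_j | Y]\)` can be checked against Monte Carlo draws.

```r
library(gtools)  # for rdirichlet()

alpha <- c(1, 1, 1)        # prior Dirichlet parameters (hypothetical)
n_j   <- c(20, 35, 45)     # hypothetical category counts n_1, ..., n_d

alpha_post <- alpha + n_j  # posterior: Dirichlet(alpha_1 + n_1, ...)

alpha_post / sum(alpha_post)  # posterior means E[theta_j | Y]
n_j / sum(n_j)                # MLEs n_j / n, for comparison

# Monte Carlo check: sample means of posterior draws match the formula
theta <- rdirichlet(10000, alpha = alpha_post)
colMeans(theta)
```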
---

## Combining information

- For the prior, we have
.block[
.small[
`$$\mathbb{E}[\theta_j] = \dfrac{\alpha_j}{\sum_{l=1}^d \alpha_l}.$$`
]
]

--

- We can think of
  + `\(\theta_{0j} = \mathbb{E}[\theta_j]\)` as being our **"prior guess"** about `\(\theta_j\)`, and
  + `\(n_0 = \sum_{j=1}^d \alpha_j\)` as being our **"prior sample size"**.

--

- We can then rewrite the prior as `\(\pi(\boldsymbol{\theta}) = \textrm{Dirichlet}(n_0\theta_{01},\ldots,n_0\theta_{0d})\)`.

---

## Combining information

- Since the MLE is `\(\hat{\theta}_j = n_j/n\)`, we can write the posterior expectation as
.block[
.small[
$$
`\begin{split}
\mathbb{E}[\theta_j | Y] & = \dfrac{\alpha_j + n_j}{\sum_{l=1}^d (\alpha_l + n_l)}\\
& = \dfrac{\alpha_j}{\sum_{l=1}^d \alpha_l + \sum_{l=1}^d n_l} + \dfrac{n_j}{\sum_{l=1}^d \alpha_l + \sum_{l=1}^d n_l}\\
& = \dfrac{n_0\theta_{0j}}{n_0 + n} + \dfrac{n \hat{\theta}_j}{n_0 + n}\\
& = \dfrac{n_0}{n_0 + n} \theta_{0j} + \dfrac{n}{n_0 + n} \hat{\theta}_j.
\end{split}`
$$
]
]

--

- Once again, we can express the posterior expectation as a weighted average of the prior expectation and the MLE.

--

- <div class="question">
We can also extend the Dirichlet-multinomial model to more variables (contingency tables).
</div>

---

## Example: pre-election polling

- Fox News Nov 3-6 pre-election survey of 1295 likely voters for the 2016 election.

--

- For those interested, [FiveThirtyEight](https://projects.fivethirtyeight.com/) is an interesting source for pre-election polls.

--

- Out of 1295 respondents, 622 indicated support for Clinton, 570 for Trump, and the remaining 103 for other candidates or no opinion.

--

- Drawing inference from pre-election polls is way more complicated and nuanced than this; we only use the data here for a simple illustration.

--

- Assuming no other information on the respondents, we can assume simple random sampling and use a multinomial distribution with parameter `\(\boldsymbol{\theta} = (\theta_1,\theta_2,\theta_3)\)`, the proportions, in the survey population, of Clinton supporters, Trump supporters, and other candidates or no opinion.

---

## Example: pre-election polling

- With a noninformative uniform prior, we have `\(\pi(\boldsymbol{\theta}) = \textrm{Dirichlet}(1,1,1)\)`.

--

- The resulting posterior is `\(\textrm{Dirichlet}(1 + n_1, 1 + n_2, 1 + n_3) = \textrm{Dirichlet}(623,571,104)\)`.

--

- If we wish to compare the proportions of people who would vote for Clinton versus Trump, we can examine the posterior distribution of `\(\theta_1 - \theta_2\)`.

--

- We can even compute the probability `\(\Pr(\theta_1 > \theta_2 | Y)\)`.

---

## Example: pre-election polling

```r
library(gtools)  # for rdirichlet()

# Sample 10,000 draws from the Dirichlet(623, 571, 104) posterior
PostSamples <- rdirichlet(10000, alpha = c(623, 571, 104))

# Posterior density of theta_1 - theta_2 (Clinton minus Trump support)
hist(PostSamples[, 1] - PostSamples[, 2], col = rainbow(20),
     xlab = expression(theta[1] - theta[2]),
     ylab = "Posterior Density", freq = FALSE, breaks = 50,
     main = expression(paste("Posterior density of ", theta[1] - theta[2])))
```

<img src="10-multinomial-model_files/figure-html/unnamed-chunk-7-1.png" style="display: block; margin: auto;" />

---

## Example: pre-election polling

- The posterior probability that Clinton had more support than Trump in the survey population, that is, `\(\Pr(\theta_1 > \theta_2 | Y)\)`, is

```r
# Monte Carlo estimate of Pr(theta_1 > theta_2 | Y)
mean(PostSamples[, 1] > PostSamples[, 2])
```

```
## [1] 0.9345
```

--

- Once again, this is just a simple illustration with a very small subset of the 2016 pre-election polling data.

--

- Inference for pre-election polls is way more complex and nuanced than this.
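---

## Example: pre-election polling

As a small follow-up sketch (not part of the original analysis), we can summarize the same posterior draws with an equal-tailed 95% credible interval for `\(\theta_1 - \theta_2\)`. Since `\(\Pr(\theta_1 > \theta_2 | Y) \approx 0.93\)` is below 0.975, this interval will extend slightly below zero.

```r
library(gtools)  # for rdirichlet()

# Same posterior as before: Dirichlet(623, 571, 104)
PostSamples <- rdirichlet(10000, alpha = c(623, 571, 104))

# Equal-tailed 95% credible interval for theta_1 - theta_2
quantile(PostSamples[, 1] - PostSamples[, 2], probs = c(0.025, 0.975))
```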