Take "Participation Quiz II - Jan 31" on Sakai.
Homework 3 now online.
Take "Participation Quiz II - Jan 31" on Sakai.
Homework 3 now online.
Inference for mean, conditional on variance (cont'd)
Noninformative and improper priors
Joint inference for mean and variance
Back to the examples
Normal data: Y = (y_1, y_2, \ldots, y_n), where each
y_i \sim \mathcal{N}(\mu, \sigma^2); or y_i \sim \mathcal{N}(\mu, \tau^{-1}).
+ Normal Prior (when \sigma^2/\tau is known):
\mu | \sigma^2 \sim \mathcal{N}(\mu_0, \sigma_0^2); or \mu | \tau \sim \mathcal{N}(\mu_0, \tau_0^{-1}).
⇒ Normal posterior (in terms of precision):
\mu | Y, \tau \sim \mathcal{N}(\mu_n, \tau_n^{-1}).
where \tau_n = \tau_0 + n\tau. The posterior mean is a weighted sum of the prior information and the information from the data:
\mu_n = \dfrac{n\tau\bar{y} + \tau_0\mu_0}{n\tau + \tau_0} = \dfrac{\tau_0}{\tau_0 + n\tau}\mu_0 + \dfrac{n\tau}{\tau_0 + n\tau}\bar{y}
Relatively easy to set \mu_0 if we have a "prior" guess of the mean. What about \tau_0?
If we think of the prior mean as being based on \kappa_0 prior observations from a population similar to Y, then we might set \sigma_0^2 = \sigma^2/\kappa_0, which implies \tau_0 = \kappa_0\tau.
Then the posterior mean is given by
\mu_n = \dfrac{\kappa_0}{\kappa_0 + n}\mu_0 + \dfrac{n}{\kappa_0 + n}\bar{y}.
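As a quick numerical illustration (a minimal R sketch; the values of mu0, kappa0, ybar, and n below are made up, not from the examples):

mu0    <- 0     # prior guess for the mean
kappa0 <- 5     # prior guess treated as "worth" 5 observations
ybar   <- 2.3   # sample mean
n      <- 20    # sample size

# Posterior mean: weighted average of the prior guess and the sample mean
mu_n <- (kappa0/(kappa0 + n))*mu0 + (n/(kappa0 + n))*ybar
mu_n  # here the data get weight 20/25 = 0.8, so mu_n = 0.8*2.3 = 1.84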
In terms of variances, we have
\mu | Y, \sigma^2 \sim \mathcal{N}(\mu_n, \sigma_n^2)
where
\mu_n = \dfrac{\dfrac{n}{\sigma^2}\bar{y} + \dfrac{1}{\sigma_0^2}\mu_0}{\dfrac{n}{\sigma^2} + \dfrac{1}{\sigma_0^2}} \quad \textrm{and} \quad \sigma_n^2 = \dfrac{1}{\dfrac{n}{\sigma^2} + \dfrac{1}{\sigma_0^2}}.
It is easy to see that the posterior information (precision) is the sum of the prior information and the information from the data: \dfrac{1}{\sigma_n^2} = \dfrac{1}{\sigma_0^2} + \dfrac{n}{\sigma^2}.
If we once again set \sigma_0^2 = \sigma^2/\kappa_0, the posterior mean is still given by
\mu_n = \dfrac{\kappa_0}{\kappa_0 + n}\mu_0 + \dfrac{n}{\kappa_0 + n}\bar{y}.
By the way, setting \sigma_0^2 = \sigma^2/\kappa_0 ⇒ prior dependence between \mu and \sigma^2, whereas an arbitrary \sigma_0^2 that does not depend on \sigma^2 ⇒ prior independence between \mu and \sigma^2.
Generally, we must specify both \mu_0 and \tau_0 to do inference.
When prior distributions have no population basis, that is, when there is no justification of the prior as "prior data", they can be difficult to construct.
To that end, there is often the desire to construct noninformative priors, with the rationale being "to let the data speak for themselves".
For example, we could instead assume a uniform prior on \mu that is constant over the real line, i.e., \pi(\mu) \propto 1 ⇒ all values on the real line are equally likely a priori.
Clearly, this is not a valid pdf since it will not integrate to 1 over the real line. Such priors are known as improper priors.
An improper prior can still be very useful; we just need to ensure it results in a proper posterior.
Question: is there a prior pdf (for a given model) that would be universally accepted as a noninformative prior?
Laplace proposed the uniform distribution. This proposal lacks invariance under monotone transformations of the parameter.
For example, a uniform prior on the binomial proportion parameter \theta is not the same as a uniform prior on the odds parameter \phi = \dfrac{\theta}{1-\theta}.
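To see the lack of invariance concretely, here is a short change-of-variables sketch: if \theta \sim \textrm{Uniform}(0,1) and \phi = \theta/(1-\theta), then \theta = \phi/(1+\phi) and
\pi_\phi(\phi) = \pi_\theta\left(\dfrac{\phi}{1+\phi}\right)\left|\dfrac{\textrm{d}\theta}{\textrm{d}\phi}\right| = \dfrac{1}{(1+\phi)^2}, \quad \phi > 0,
which is clearly not uniform on (0,\infty).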
A more acceptable approach was introduced by Jeffreys. For single-parameter models, Jeffreys' prior defines a noninformative prior density for a parameter \theta as
\pi(\theta) \propto \sqrt{I(\theta)},
where I(\theta) is the Fisher information for \theta.
The Fisher information gives a way to measure the amount of information a random variable Y carries about an unknown parameter \theta of a distribution that describes Y.
Formally, I(\theta) is defined as
I(\theta) = \textrm{E}\left[\left(\dfrac{\partial}{\partial\theta}\log f(y;\theta)\right)^2 \Bigg| \ \theta\right] = \int_{\mathcal{Y}} \left(\dfrac{\partial}{\partial\theta}\log f(y;\theta)\right)^2 f(y;\theta) \ \textrm{d}y.
Alternatively,
I(\theta) = -\textrm{E}\left[\dfrac{\partial^2}{\partial\theta^2}\log f(y;\theta) \Bigg| \ \theta\right] = -\int_{\mathcal{Y}} \left(\dfrac{\partial^2}{\partial\theta^2}\log f(y;\theta)\right) f(y;\theta) \ \textrm{d}y.
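As a standard single-observation example: if y \sim \textrm{Bernoulli}(\theta), then \log f(y;\theta) = y\log\theta + (1-y)\log(1-\theta), so
I(\theta) = -\textrm{E}\left[-\dfrac{y}{\theta^2} - \dfrac{1-y}{(1-\theta)^2}\right] = \dfrac{1}{\theta} + \dfrac{1}{1-\theta} = \dfrac{1}{\theta(1-\theta)},
and Jeffreys' prior is \pi(\theta) \propto \theta^{-1/2}(1-\theta)^{-1/2}, i.e., a Beta(1/2, 1/2) distribution rather than the uniform prior.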
It turns out that Jeffreys' prior for \mu under the normal model, when \sigma^2 is known, is
\pi(\mu) \propto 1,
the uniform prior over the real line. You should try to derive this.
Recall that for \sigma^2 known, the normal likelihood simplifies to
\propto \ \textrm{exp}\left\{-\dfrac{1}{2} \tau n (\mu - \bar{y})^2\right\},
ignoring everything else that does not depend on \mu.
With Jeffreys' prior \pi(\mu) \propto 1, can we derive the posterior distribution?
Posterior:
\pi(\mu | Y, \sigma^2) \ \propto \ \textrm{exp}\left\{-\dfrac{1}{2} \tau n (\mu - \bar{y})^2\right\} \pi(\mu) \propto \ \textrm{exp}\left\{-\dfrac{1}{2} \tau n (\mu - \bar{y})^2\right\}.
This is the kernel of a normal distribution with mean \bar{y} and precision n\tau.
Written differently, we have \mu | Y, \sigma^2 \sim \mathcal{N}\left(\bar{y}, \dfrac{\sigma^2}{n}\right).
What happens when \sigma^2/\tau is unknown? We need a joint prior \pi(\mu, \sigma^2) for \mu and \sigma^2.
Write the joint prior distribution for the mean and variance as the product of a conditional and a marginal distribution, so we can take advantage of our work so far.
That is,
\pi(\mu, \sigma^2) = \pi(\mu | \sigma^2)\pi(\sigma^2).
For \pi(\sigma^2), we need a distribution with support on (0, \infty). One such family is the gamma family, but this is NOT conjugate for the variance of a normal distribution.
The gamma distribution is, however, conjugate for the precision \tau, and in that case, we say that \sigma^2 has an inverse-gamma distribution.
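As a quick numerical check of this relationship (a minimal R sketch; the shape and rate values below are arbitrary), quantiles of \sigma^2 = 1/\tau computed from gamma draws of \tau should match the corresponding inverse-gamma quantiles:

set.seed(123)
a <- 3; b <- 2                  # arbitrary gamma shape and rate for tau
tau    <- rgamma(100000, a, b)  # precision draws
sigma2 <- 1/tau                 # implied variance draws

# P(sigma^2 <= x) = P(tau >= 1/x), so the p-th quantile of sigma^2 is 1/qgamma(1 - p, a, b)
p <- c(0.25, 0.5, 0.75)
cbind(empirical = quantile(sigma2, p), theoretical = 1/qgamma(1 - p, a, b))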
Once again, suppose Y = (y_1, y_2, \ldots, y_n), where each
y_i \sim \mathcal{N}(\mu, \sigma^2); or y_i \sim \mathcal{N}(\mu, \tau^{-1}).
A conjugate joint prior is given by
\begin{split} \tau = \dfrac{1}{\sigma^2} & \sim \textrm{Gamma}\left(\dfrac{\nu_0}{2}, \dfrac{\nu_0\sigma_0^2}{2}\right)\\ \mu | \sigma^2 & \sim \mathcal{N}\left(\mu_0, \dfrac{\sigma^2}{\kappa_0}\right) \ \textrm{ or } \ \ \mu | \tau \sim \mathcal{N}\left(\mu_0, \dfrac{1}{\kappa_0\tau}\right). \end{split}
This is often called a normal-gamma prior distribution.
\sigma_0^2 is the prior guess for \sigma^2, while \nu_0 is often referred to as the "prior degrees of freedom", reflecting our degree of confidence in \sigma_0^2.
We do not have conjugacy if we replace \sigma^2/\kappa_0 in the normal prior with an arbitrary prior variance independent of \sigma^2. To do inference in that scenario, we need Gibbs sampling (to come next week!).
Based on the normal-gamma prior, we need \pi(\mu | Y, \sigma^2) and \pi(\tau | Y).
For \pi(\mu | Y, \sigma^2), we can leverage our previous results. We have
\mu | Y, \sigma^2 \sim \mathcal{N}\left(\mu_n, \dfrac{\sigma^2}{\kappa_n}\right) \ \textrm{ or } \ \ \mu | Y, \tau \sim \mathcal{N}\left(\mu_n, \dfrac{1}{\kappa_n\tau}\right)
where
\mu_n = \dfrac{\dfrac{n}{\sigma^2}\bar{y} + \dfrac{\kappa_0}{\sigma^2}\mu_0}{\dfrac{n}{\sigma^2} + \dfrac{\kappa_0}{\sigma^2}} = \dfrac{\kappa_0\mu_0 + n\bar{y}}{\kappa_n} = \dfrac{\kappa_0}{\kappa_n}\mu_0 + \dfrac{n}{\kappa_n}\bar{y} \quad \textrm{and} \quad \kappa_n = \kappa_0 + n.
\mu_n is simply the sample mean of the current and prior observations, and the posterior variance of \mu given \sigma^2 is \sigma^2 divided by the total number of observations (prior and current).
Some algebra is required to get the marginal posterior of \tau. Let's write the full joint posterior and go from there. We must keep some of the terms we discarded in the last lecture.
Recall the likelihood
L(Y; \mu, \tau) \propto \tau^{\frac{n}{2}} \ \textrm{exp}\left\{-\dfrac{1}{2}\tau s^2(n-1)\right\} \ \textrm{exp}\left\{-\dfrac{1}{2}\tau n(\mu - \bar{y})^2\right\}.
Now, \mu | \tau \sim \mathcal{N}\left(\mu_0, \dfrac{1}{\kappa_0\tau}\right) ⇒
\pi(\mu|\tau) \propto \ \textrm{exp}\left\{-\dfrac{1}{2}\kappa_0\tau(\mu - \mu_0)^2\right\}.
and \tau \sim \textrm{Gamma}\left(\dfrac{\nu_0}{2}, \dfrac{\nu_0\sigma_0^2}{2}\right) ⇒
\pi(\tau) \propto \tau^{\frac{\nu_0}{2}-1}\textrm{exp}\left\{-\dfrac{\tau\nu_0\sigma_0^2}{2}\right\}.
\begin{split} \Rightarrow \pi(\mu,\tau | Y) \ & \boldsymbol{ \propto} \ \pi(\mu|\sigma^2) \times \pi(\tau) \times L(Y; \mu,\sigma^2)\\ \\ & \boldsymbol{ \propto} \underbrace{\textrm{exp}\left\{-\dfrac{1}{2} \kappa_0 \tau (\mu - \mu_0)^2 \right\}}_{\propto \ \pi(\mu|\sigma^2)} \times \underbrace{\tau^{\dfrac{\nu_0}{2}-1} \textrm{exp}\left\{-\dfrac{\tau\nu_0\sigma_0^2}{2}\right\}}_{\propto \ \pi(\tau)}\\ & \ \ \ \ \ \ \ \ \ \ \ \ \times \underbrace{\tau^{\dfrac{n}{2}} \ \textrm{exp}\left\{-\dfrac{1}{2} \tau s^2(n-1) \right\} \ \textrm{exp}\left\{-\dfrac{1}{2} \tau n(\mu - \bar{y})^2 \right\}}_{\propto \ L(Y; \mu,\sigma^2)}\\ & \boldsymbol{=} \underbrace{\textrm{exp}\left\{-\dfrac{1}{2} \kappa_0 \tau (\mu - \mu_0)^2 \right\} \ \textrm{exp}\left\{-\dfrac{1}{2} \tau n(\mu - \bar{y})^2 \right\}}_{\textrm{Terms involving } \mu}\\ & \ \ \ \ \ \ \ \ \ \ \ \ \times \underbrace{\tau^{\dfrac{\nu_0}{2}-1} \textrm{exp}\left\{-\dfrac{\tau\nu_0\sigma_0^2}{2}\right\} \tau^{\dfrac{n}{2}} \ \textrm{exp}\left\{-\dfrac{1}{2} \tau s^2(n-1) \right\}}_{\textrm{Terms NOT involving } \mu}\\ & \boldsymbol{=} \textrm{exp}\left\{-\dfrac{1}{2} \kappa_0 \tau (\mu^2 - 2\mu\mu_0 + \mu_0^2) \right\}\ \textrm{exp}\left\{-\dfrac{1}{2} \tau n(\mu^2 - 2\mu\bar{y} + \bar{y}^2) \right\}\\ & \ \ \ \ \ \ \ \ \ \ \ \ \times \tau^{\dfrac{\nu_0 + n}{2}-1} \textrm{exp}\left\{-\dfrac{\tau \left[\nu_0\sigma_0^2 + s^2(n-1) \right] }{2}\right\}\\ \end{split}
\begin{split} \pi(\mu,\tau | Y) \ & \boldsymbol{ \propto} \ \ \textrm{exp}\left\{-\dfrac{1}{2} \left[\kappa_0 \tau (\mu^2 - 2\mu\mu_0) + \tau n(\mu^2 - 2\mu\bar{y}) \right] \right\} \ \textrm{exp}\left\{-\dfrac{1}{2} \left[ \kappa_0 \tau \mu_0^2 + \tau n \bar{y}^2\right] \right\}\\\ & \ \ \ \ \ \ \ \ \ \ \ \ \times \tau^{\dfrac{\nu_0 + n}{2}-1} \textrm{exp}\left\{-\dfrac{\tau \left[\nu_0\sigma_0^2 + s^2(n-1) \right] }{2}\right\}\\ \\ & \boldsymbol{=} \ \ \textrm{exp}\left\{-\dfrac{1}{2} \left[ \mu^2(n\tau + \kappa_0 \tau) - 2\mu(n\tau \bar{y} + \kappa_0 \tau\mu_0) \right] \right\} \ \textrm{exp}\left\{-\dfrac{1}{2} \left[ \kappa_0 \tau \mu_0^2 + \tau n \bar{y}^2\right] \right\}\\\ & \ \ \ \ \ \ \ \ \ \ \ \ \times \tau^{\dfrac{\nu_0 + n}{2}-1} \textrm{exp}\left\{-\dfrac{\tau \left[\nu_0\sigma_0^2 + s^2(n-1) \right] }{2}\right\}\\ \end{split}
\begin{split} \Rightarrow \ \pi(\mu,\tau | Y) \ & \boldsymbol{ \propto} \ \ \textrm{exp}\left\{-\dfrac{1}{2} \left[ \mu^2 a^\star - 2\mu b^\star \right] \right\} \ \textrm{exp}\left\{-\dfrac{1}{2} \left[ \kappa_0 \tau \mu_0^2 + \tau n \bar{y}^2\right] \right\}\\\ & \ \ \ \ \ \ \ \ \ \ \ \ \times \tau^{\dfrac{\nu_0 + n}{2}-1} \textrm{exp}\left\{-\dfrac{\tau \left[\nu_0\sigma_0^2 + s^2(n-1) \right] }{2}\right\}\\ \end{split}
where a^\star = n\tau + \kappa_0 \tau and b^\star = n\tau \bar{y} + \kappa_0 \tau\mu_0.
\begin{split} \Rightarrow \ \pi(\mu,\tau | Y) \ & \boldsymbol{ \propto} \ \ \textrm{exp}\left\{-\dfrac{1}{2} a^\star \left[ \mu- \dfrac{b^\star}{a^\star} \right]^2 + \dfrac{(b^\star)^2}{2a^\star} \right\} \ \textrm{exp}\left\{-\dfrac{1}{2} \left[ \kappa_0 \tau \mu_0^2 + \tau n \bar{y}^2\right] \right\}\\\ & \ \ \ \ \ \ \ \ \ \ \ \ \times \tau^{\dfrac{\nu_0 + n}{2}-1} \textrm{exp}\left\{-\dfrac{\tau \left[\nu_0\sigma_0^2 + s^2(n-1) \right] }{2}\right\}\\ & \boldsymbol{=} \ \ \textrm{exp}\left\{-\dfrac{1}{2} a^\star \left[ \mu- \dfrac{b^\star}{a^\star} \right]^2 \right\} \ \textrm{exp}\left\{-\dfrac{1}{2} \left[ \kappa_0 \tau \mu_0^2 + \tau n \bar{y}^2 - \dfrac{(b^\star)^2}{a^\star}\right] \right\}\\\ & \ \ \ \ \ \ \ \ \ \ \ \ \times \tau^{\dfrac{\nu_0 + n}{2}-1} \textrm{exp}\left\{-\dfrac{\tau \left[\nu_0\sigma_0^2 + s^2(n-1) \right] }{2}\right\}\\ & \boldsymbol{=} \ \ \textrm{exp}\left\{-\dfrac{1}{2} a^\star \left[ \mu- \dfrac{b^\star}{a^\star} \right]^2 \right\} \ \textrm{exp} \underbrace{\left\{-\dfrac{1}{2} \left[ \kappa_0 \tau \mu_0^2 + \tau n \bar{y}^2 - \dfrac{(n\tau \bar{y} + \kappa_0 \tau\mu_0)^2}{(n\tau + \kappa_0 \tau)}\right] \right\}}_{\textrm{Expand terms and recombine}} \\ & \ \ \ \ \ \ \ \ \ \ \ \ \times \tau^{\dfrac{\nu_0 + n}{2}-1} \textrm{exp}\left\{-\dfrac{\tau \left[\nu_0\sigma_0^2 + s^2(n-1) \right] }{2}\right\}\\ & \boldsymbol{=} \ \ \textrm{exp}\left\{-\dfrac{1}{2} a^\star \left[ \mu- \dfrac{b^\star}{a^\star} \right]^2 \right\} \ \textrm{exp}\left\{-\dfrac{1}{2} \left[\dfrac{n\kappa_0\tau^2 (\mu_0^2- 2\mu_0\bar{y} + \bar{y}^2)}{\tau(n + \kappa_0)}\right] \right\}\\\ & \ \ \ \ \ \ \ \ \ \ \ \ \times \tau^{\dfrac{\nu_0 + n}{2}-1} \textrm{exp}\left\{-\dfrac{\tau \left[\nu_0\sigma_0^2 + s^2(n-1) \right] }{2}\right\}\\ \end{split}
\begin{split} \Rightarrow \ \pi(\mu,\tau | Y) \ & \boldsymbol{ \propto} \ \ \textrm{exp}\left\{-\dfrac{1}{2} a^\star \left[ \mu- \dfrac{b^\star}{a^\star} \right]^2 \right\} \ \textrm{exp}\left\{-\dfrac{\tau}{2} \left[\dfrac{n\kappa_0 (\bar{y} - \mu_0)^2}{(n + \kappa_0)}\right] \right\}\\ & \ \ \ \ \ \ \ \ \ \ \ \ \times \tau^{\dfrac{\nu_0 + n}{2}-1} \textrm{exp}\left\{-\dfrac{\tau \left[\nu_0\sigma_0^2 + s^2(n-1) \right] }{2}\right\}\\ \\ & \boldsymbol{=} \ \ \textrm{exp} \underbrace{\left\{-\dfrac{1}{2} a^\star \left[ \mu- \dfrac{b^\star}{a^\star} \right]^2 \right\}}_{\textrm{Substitute the values for } a^\star \textrm{ and } b^\star \textrm{ back}} \\ & \ \ \ \ \ \ \ \ \ \ \ \ \times \tau^{\dfrac{\nu_0 + n}{2}-1} \textrm{exp}\left\{-\dfrac{\tau \left[\nu_0\sigma_0^2 + s^2(n-1) \right] }{2}\right\} \textrm{exp}\left\{-\dfrac{\tau}{2} \left[\dfrac{n\kappa_0 (\bar{y} - \mu_0)^2}{(n + \kappa_0)}\right] \right\}\\ \\ & \boldsymbol{=} \ \ \underbrace{\textrm{exp}\left\{-\dfrac{1}{2} (n\tau + \kappa_0 \tau) \left[ \mu- \dfrac{(n\tau \bar{y} + \kappa_0 \tau\mu_0)}{(n\tau + \kappa_0 \tau)} \right]^2 \right\}}_{\textrm{Normal Kernel}}\\\ & \ \ \ \ \ \ \ \ \ \ \ \ \times \underbrace{\tau^{\dfrac{\nu_0 + n}{2}-1} \textrm{exp}\left\{-\dfrac{\tau}{2} \left[\nu_0\sigma_0^2 + s^2(n-1) + \dfrac{n\kappa_0}{(n + \kappa_0)} (\bar{y} - \mu_0)^2 \right] \right\}}_{\textrm{Gamma Kernel}}\\ \end{split}
\begin{split} \Rightarrow \ \pi(\mu,\tau | Y) \ & \boldsymbol{ \propto} \ \ \underbrace{\textrm{exp}\left\{-\dfrac{1}{2} \tau(n + \kappa_0) \left[ \mu- \dfrac{(n \bar{y} + \kappa_0 \mu_0)}{(n + \kappa_0)} \right]^2 \right\}}_{\textrm{Normal Kernel}}\\\ & \ \ \ \ \ \ \ \ \ \ \ \ \times \underbrace{\tau^{\dfrac{\nu_0 + n}{2}-1} \textrm{exp}\left\{-\dfrac{\tau}{2} \left[\nu_0\sigma_0^2 + s^2(n-1) + \dfrac{n\kappa_0}{(n + \kappa_0)} (\bar{y} - \mu_0)^2 \right] \right\}}_{\textrm{Gamma Kernel}}\\ & \boldsymbol{=} \ \ \mathcal{N}\left(\mu_n, \dfrac{1}{\kappa_n\tau} \right) \times \textrm{Gamma}\left(\dfrac{\nu_n}{2}, \dfrac{\nu_n \sigma_n^2}{2}\right) = \pi(\mu|Y, \tau) \pi(\tau|Y), \end{split}
where
\begin{split} \kappa_n & = \kappa_0 + n\\ \mu_n & = \dfrac{\kappa_0 \mu_0 + n\bar{y}}{\kappa_n} = \dfrac{\kappa_0}{\kappa_n} \mu_0 + \dfrac{n}{\kappa_n} \bar{y}\\ \nu_n & = \nu_0 + n\\ \sigma_n^2 & = \dfrac{1}{\nu_n}\left[\nu_0\sigma_0^2 + s^2(n-1) + \dfrac{n\kappa_0}{\kappa_n} (\bar{y} - \mu_0)^2 \right] = \dfrac{1}{\nu_n}\left[\nu_0\sigma_0^2 + \sum_{i=1}^n (y_i-\bar{y})^2 + \dfrac{n\kappa_0}{\kappa_n} (\bar{y} - \mu_0)^2 \right]\\ \end{split}
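If helpful, here is a small R helper that computes these posterior quantities from the data and prior settings (a sketch; the function name ng_posterior is ours, not from any package):

ng_posterior <- function(y, mu0, kappa0, nu0, sigma02) {
  n    <- length(y)
  ybar <- mean(y)
  s2   <- var(y)  # sample variance with divisor n - 1
  kappan  <- kappa0 + n
  mun     <- (kappa0*mu0 + n*ybar)/kappan
  nun     <- nu0 + n
  sigman2 <- (nu0*sigma02 + (n - 1)*s2 + (n*kappa0/kappan)*(ybar - mu0)^2)/nun
  list(kappa_n = kappan, mu_n = mun, nu_n = nun, sigma2_n = sigman2)
}

For example, ng_posterior(c(20, 10, 19, 15, 9, 18), mu0 = 0, kappa0 = 1, nu0 = 1, sigma02 = 100) reproduces, up to rounding, the accelerated-group values computed in the Pygmalion example below.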
We will focus on the Pygmalion study. Follow the same approach for the job training data.
Suppose you have no idea whether students would improve IQ on average. Set \mu_{0A} = \mu_{0N} = 0.
Suppose you don't have any faith in this belief, and think it is the equivalent of having only 1 prior observation in each group. Set \kappa_{0A} = \kappa_{0N} = 1.
Based on the literature, the SD of change scores should be around 10 in each group, but you still don't have a lot of faith in this belief. Set \nu_{0A} = \nu_{0N} = 1 and \sigma^2_{0A} = \sigma^2_{0N} = 100.
Graph priors to see if they accord with your beliefs. Sampling new values of Y from the priors offers a good check.
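For instance, a minimal prior-predictive check in R under the settings above (the variable names are ours):

set.seed(42)
tau_prior <- rgamma(10000, 1/2, (1*100)/2)              # nu0 = 1, sigma0^2 = 100
mu_prior  <- rnorm(10000, 0, sqrt(1/(1*tau_prior)))     # mu0 = 0, kappa0 = 1
y_prior   <- rnorm(10000, mu_prior, sqrt(1/tau_prior))  # hypothetical new change scores
summary(y_prior)  # do these look like plausible IQ change scores?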
Data:
Accelerated group (A): 20, 10, 19, 15, 9, 18.
No growth group (N): 3, 2, 6, 10, 11, 5.
Summary statistics:
\bar{y}_A = 15.2; s_A = 4.71.
\bar{y}_N = 6.2; s_N = 3.65.
\begin{split} \kappa_{nA} & = \kappa_{0A} + n_A = 1 + 6 = 7\\ \kappa_{nN} & = \kappa_{0N} + n_N = 1 + 6 = 7\\ \nu_{nA} & = \nu_{0A} + n_A = 1 + 6 = 7\\ \nu_{nN} & = \nu_{0N} + n_N = 1 + 6 = 7\\ \mu_{nA} & = \dfrac{\kappa_{0A} \mu_{0A} + n_A\bar{y}_A}{\kappa_{nA}} = \dfrac{ (1)(0) + (6)(15.2) }{7} \approx 13.03\\ \mu_{nN} & = \dfrac{\kappa_{0N} \mu_{0N} + n_N\bar{y}_N}{\kappa_{nN}} = \dfrac{ (1)(0) + (6)(6.2) }{7} \approx 5.31\\ \sigma_{nA}^2 & = \dfrac{1}{\nu_{nA}}\left[\nu_{0A}\sigma_{0A}^2 + s^2_A(n_A-1) + \dfrac{n_A\kappa_{0A}}{\kappa_{nA}} (\bar{y}_A - \mu_{0A})^2\right] \\ & = \dfrac{1}{7}\left[(1)(100) + (22.17)(5) + \dfrac{(6)(1)}{(7)} (15.2- 0)^2\right] \approx 58.41\\ \sigma_{nN}^2 & = \dfrac{1}{\nu_{nN}}\left[\nu_{0N}\sigma_{0N}^2 + s^2_N(n_N-1) + \dfrac{n_N\kappa_{0N}}{\kappa_{nN}} (\bar{y}_N - \mu_{0N})^2\right] \\ & = \dfrac{1}{7}\left[(1)(100) + (13.37)(5) + \dfrac{(6)(1)}{(7)} (6.2- 0)^2\right] \approx 28.54\\ \end{split}
\begin{split} \mu_A | Y_A, \tau_A & \sim \ \mathcal{N}\left(\mu_{nA}, \dfrac{1}{\kappa_{nA}\tau_A} \right) = \mathcal{N}\left(13.03, \dfrac{1}{7\tau_A} \right)\\ \tau_A | Y_A & \sim \textrm{Gamma}\left(\dfrac{\nu_{nA}}{2}, \dfrac{\nu_{nA} \sigma_{nA}^2}{2}\right) = \textrm{Gamma}\left(\dfrac{7}{2}, \dfrac{7 (58.41)}{2}\right)\\ \mu_N | Y_N, \tau_N & \sim \ \mathcal{N}\left(\mu_{nN}, \dfrac{1}{\kappa_{nN}\tau_N} \right) = \mathcal{N}\left(5.31, \dfrac{1}{7\tau_N} \right)\\ \tau_N | Y_N & \sim \textrm{Gamma}\left(\dfrac{\nu_{nN}}{2}, \dfrac{\nu_{nN} \sigma_{nN}^2}{2}\right) = \textrm{Gamma}\left(\dfrac{7}{2}, \dfrac{7 (28.54)}{2}\right)\\ \end{split}
To evaluate whether the accelerated group has larger IQ gains than the no growth group, we would like to estimate quantities like \Pr[\mu_A > \mu_N | Y_A, Y_N], which are based on the marginal posterior of \mu rather than the conditional distribution.
Fortunately, this is easy to do by generating samples of \mu and \sigma^2 from their joint posterior.
\begin{split} \tau^{(1)} & \sim \textrm{Gamma}\left(\dfrac{\nu_{n}}{2}, \dfrac{\nu_{n} \sigma_{n}^2}{2}\right)\\ \mu^{(1)} & \sim \ \mathcal{N}\left(\mu_{n}, \dfrac{1}{\kappa_{n}\tau^{(1)}} \right)\\ \tau^{(2)} & \sim \textrm{Gamma}\left(\dfrac{\nu_{n}}{2}, \dfrac{\nu_{n} \sigma_{n}^2}{2}\right)\\ \mu^{(2)} & \sim \ \mathcal{N}\left(\mu_{n}, \dfrac{1}{\kappa_{n}\tau^{(2)}} \right)\\ & \ \ \vdots \\ & \ \ \vdots \\ & \ \ \vdots \\ \tau^{(m)} & \sim \textrm{Gamma}\left(\dfrac{\nu_{n}}{2}, \dfrac{\nu_{n} \sigma_{n}^2}{2}\right)\\ \mu^{(m)} & \sim \ \mathcal{N}\left(\mu_{n}, \dfrac{1}{\kappa_{n}\tau^{(m)}} \right)\\ \end{split}
Note that we are sampling each \mu^{(j)}, j=1,\ldots,m, from its conditional distribution, not from the marginal.
The pairs \{(\tau, \mu)^{(1)}, \ldots, (\tau, \mu)^{(m)}\} simulated using this method are independent samples from the joint posterior \pi(\mu,\tau | Y).
Additionally, the simulated values \{\mu^{(1)}, \ldots, \mu^{(m)}\} are independent samples from the marginal posterior distribution of \mu.
While this may seem odd, keep in mind that although we drew the \mu's as conditional samples, each was conditional on a different value of \tau.
Thus, together they constitute marginal samples of \mu.
It is easy to sample from these posteriors:
# Posterior Gamma parameters for tau (shape, rate) and Normal parameters for mu
aA <- 7/2
aN <- 7/2
bA <- (7/2)*58.41
bN <- (7/2)*28.54
muA <- 13.03
muN <- 5.31
kappaA <- 7
kappaN <- 7
# Monte Carlo samples from the joint posterior: draw tau, then mu | tau
tauA_postsample <- rgamma(10000, aA, bA)
thetaA_postsample <- rnorm(10000, muA, sqrt(1/(kappaA*tauA_postsample)))
tauN_postsample <- rgamma(10000, aN, bN)
thetaN_postsample <- rnorm(10000, muN, sqrt(1/(kappaN*tauN_postsample)))
# Convert precision draws to variance draws
sigma2A_postsample <- 1/tauA_postsample
sigma2N_postsample <- 1/tauN_postsample
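These draws can be summarized however we like; for example, 95% equal-tailed credible intervals (a short follow-up sketch using the objects above):

quantile(thetaA_postsample, c(0.025, 0.975))   # 95% credible interval for mu_A
quantile(thetaN_postsample, c(0.025, 0.975))   # 95% credible interval for mu_N
quantile(sigma2A_postsample, c(0.025, 0.975))  # 95% credible interval for sigma^2_A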
Is the average improvement for the accelerated group larger than that for the no growth group?
mean(thetaA_postsample > thetaN_postsample)
## [1] 0.9721
Is the variance of improvement scores for the accelerated group larger than that for the no growth group?
mean(sigma2A_postsample > sigma2N_postsample)
## [1] 0.8185
Let's be very objective with the prior selection. In fact, let's be extreme!
\pi(\mu,\sigma^2) \propto \dfrac{1}{\sigma^2} is improper (does not integrate to 1) but does lead to a proper posterior distribution that yields inferences similar to frequentist ones.
For that choice, we have
\begin{split} \mu|Y,\tau & \sim \mathcal{N}\left(\bar{y},\dfrac{1}{n \tau}\right)\\ \tau | Y & \sim \textrm{Gamma}\left(\dfrac{n-1}{2}, \dfrac{(n-1)s^2}{2}\right)\\ \end{split}
\begin{split} \mu_A | Y_A, \tau_A & \sim \ \mathcal{N}\left(\bar{y}_A,\dfrac{1}{n_A \tau_A}\right) = \mathcal{N}\left(15.2, \dfrac{1}{6\tau_A} \right)\\ \tau_A | Y_A & \sim \textrm{Gamma}\left(\dfrac{n_A-1}{2}, \dfrac{(n_A-1)s^2_A}{2}\right) = \textrm{Gamma}\left(\dfrac{6-1}{2}, \dfrac{(6-1)(22.17)}{2}\right)\\ \mu_N | Y_N, \tau_N & \sim \ \mathcal{N}\left(\bar{y}_N,\dfrac{1}{n_N \tau_N}\right) = \mathcal{N}\left(6.2, \dfrac{1}{6\tau_N} \right)\\ \tau_N | Y_N & \sim \textrm{Gamma}\left(\dfrac{n_N-1}{2}, \dfrac{(n_N-1)s^2_N}{2}\right) = \textrm{Gamma}\left(\dfrac{6-1}{2}, \dfrac{(6-1)(13.37)}{2}\right)\\ \end{split}
It is easy to sample from these posteriors:
aA <- (6-1)/2
aN <- (6-1)/2
bA <- (6-1)*22.17/2
bN <- (6-1)*13.37/2
muA <- 15.2
muN <- 6.2
tauA_postsample_impr <- rgamma(10000, aA, bA)
thetaA_postsample_impr <- rnorm(10000, muA, sqrt(1/(6*tauA_postsample_impr)))
tauN_postsample_impr <- rgamma(10000, aN, bN)
thetaN_postsample_impr <- rnorm(10000, muN, sqrt(1/(6*tauN_postsample_impr)))
sigma2A_postsample_impr <- 1/tauA_postsample_impr
sigma2N_postsample_impr <- 1/tauN_postsample_impr
Is the average improvement for the accelerated group larger than that for the no growth group?
mean(thetaA_postsample_impr > thetaN_postsample_impr)
## [1] 0.9941
Is the variance of improvement scores for the accelerated group larger than that for the no growth group?
mean(sigma2A_postsample_impr > sigma2N_postsample_impr)
## [1] 0.7113
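We can also look directly at the posterior of the difference in means (a short sketch using the samples above):

diff_post <- thetaA_postsample_impr - thetaN_postsample_impr
mean(diff_post)                        # posterior mean of mu_A - mu_N
quantile(diff_post, c(0.025, 0.975))   # 95% credible interval for the difference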
Data:
No training group (N): sample size n_N = 429.
Training group (T): sample size n_T = 185.
Summary statistics for change in annual earnings:
\bar{y}_N = 1364.93; s_N = 7460.05
\bar{y}_T = 4253.57; s_T = 8926.99
Take "Participation Quiz II - Jan 31" on Sakai.
Homework 3 now online.