Skip to content

Monte Carlo and Gibbs Sampling

Source: Lecture 7 — Monte Carlo and Gibbs sampling (BDA Ch. 10—11).

Bayesian inference often ends with an integral.

We may want a posterior mean, a posterior probability, a predictive quantity, or a marginal distribution. In simple conjugate models these quantities can sometimes be computed exactly. In realistic models they usually cannot.

The main idea of this chapter is:

When an integral is hard, simulate from the distribution and average the simulated values.

The chapter builds this idea in three steps:

  1. Use ordinary Monte Carlo when we can draw independent samples directly.
  2. Use Gibbs sampling when we can draw from full conditional distributions.
  3. Use data augmentation to create conditional distributions that are easy to sample from.

Suppose we want

E[g(θ)].E[g(\theta)].

If we can draw independent samples

θ(1),,θ(N)p(θ),\theta^{(1)},\ldots,\theta^{(N)} \sim p(\theta),

then we can estimate the expectation by averaging the function over the draws:

E[g(θ)]^=1Nt=1Ng(θ(t)).\widehat{E[g(\theta)]} = \frac{1}{N}\sum_{t=1}^{N}g(\theta^{(t)}).

This is the Monte Carlo principle.

The target might be a prior, a posterior, or a predictive distribution. The only requirement for ordinary Monte Carlo is that the draws are independent samples from the distribution of interest.

For example, if the goal is the posterior mean of θ\theta, draw

θ(1),,θ(N)p(θy).\theta^{(1)},\ldots,\theta^{(N)} \sim p(\theta \mid y).

Step 2: Average the Quantity You Care About

Section titled “Step 2: Average the Quantity You Care About”

For the posterior mean, use

θˉ=1Nt=1Nθ(t).\bar\theta = \frac{1}{N}\sum_{t=1}^{N}\theta^{(t)}.

For a transformed quantity, use

gˉ=1Nt=1Ng(θ(t)).\bar g = \frac{1}{N}\sum_{t=1}^{N}g(\theta^{(t)}).

The law of large numbers says that these averages converge to the corresponding expectations as NN grows.

Step 3: Estimate Probabilities With Indicators

Section titled “Step 3: Estimate Probabilities With Indicators”

A probability is also an expectation. For example,

Pr(θc)=E[1(θc)].\Pr(\theta \le c) = E[\mathbf{1}(\theta \le c)].

So we estimate it by the fraction of simulated draws satisfying the event:

Pr(θc)^=1Nt=1N1(θ(t)c).\widehat{\Pr(\theta \le c)} = \frac{1}{N}\sum_{t=1}^{N}\mathbf{1}(\theta^{(t)} \le c).

This is why posterior intervals, tail probabilities, and predictive probabilities are natural simulation outputs.

For independent draws,

Var(gˉ)=Var(g(θ))N.\mathrm{Var}(\bar g) = \frac{\mathrm{Var}(g(\theta))}{N}.

The important message is practical:

Monte Carlo error decreases at rate 1/N1/\sqrt{N}.

To cut simulation error in half, we need about four times as many draws.

Monte Carlo is easy once we can sample from the target. Sometimes we can do that directly.

The inverse CDF method turns a Uniform(0,1)(0,1) draw into a draw from a chosen distribution.

Let FF be the cumulative distribution function of XX.

  1. Draw

    uUniform(0,1).u \sim \mathrm{Uniform}(0,1).

  2. Set

    x=F1(u).x = F^{-1}(u).

Then xx has distribution FF.

The uniform draw chooses a probability level. The inverse CDF maps that probability level back to the corresponding value on the xx scale.

For an Exponential(λ)(\lambda) distribution,

F(x)=1eλx.F(x)=1-e^{-\lambda x}.

Set u=F(x)u=F(x) and solve for xx:

x=log(1u)λ.x=-\frac{\log(1-u)}{\lambda}.

For a standard Cauchy distribution,

F(x)=12+1πarctan(x),F(x)=\frac{1}{2}+\frac{1}{\pi}\arctan(x),

so

x=tan{π(u1/2)}.x=\tan\{\pi(u-1/2)\}.

For a discrete distribution, the inverse idea becomes a threshold rule.

  1. Draw uUniform(0,1)u \sim \mathrm{Uniform}(0,1).

  2. Find the smallest value xx such that

    F(x)u.F(x)\ge u.

  3. Return that xx.

This is how categorical sampling works: divide the unit interval into probability-sized pieces and see where uu lands.

Sometimes a distribution can be generated from simpler distributions.

Examples:

  • If yy and zz are independent standard Normal variables, then

    yzCauchy(0,1).\frac{y}{z} \sim \mathrm{Cauchy}(0,1).

  • If x1,,xνx_1,\ldots,x_\nu are independent standard Normal variables, then

    i=1νxi2χν2.\sum_{i=1}^{\nu}x_i^2 \sim \chi^2_\nu.

These relationships are useful because many software libraries already sample accurately from basic distributions.

Direct sampling is not always available. A posterior distribution may be high-dimensional:

p(θ1,,θky).p(\theta_1,\ldots,\theta_k \mid y).

Sampling from the full joint distribution may be hard. But each parameter may be easy to sample once the others are fixed.

Gibbs sampling uses this situation.

The full conditional distribution for θj\theta_j is

p(θjθj,y),p(\theta_j \mid \theta_{-j}, y),

where θj\theta_{-j} means all parameters except θj\theta_j.

The full conditional answers:

If all other parameters were temporarily fixed, what is the distribution of this one parameter?

Gibbs sampling cycles through these conditional distributions. Each update is easy, but together the updates explore the joint posterior.

Choose initial values

θ(0)=(θ1(0),,θk(0)).\theta^{(0)}=(\theta_1^{(0)},\ldots,\theta_k^{(0)}).

The starting value does not need to be a typical posterior value, but poor starting values may require a longer warm-up period.

At iteration tt, draw

θ1(t)p(θ1θ2(t1),,θk(t1),y),\theta_1^{(t)} \sim p(\theta_1 \mid \theta_2^{(t-1)},\ldots,\theta_k^{(t-1)},y),

then

θ2(t)p(θ2θ1(t),θ3(t1),,θk(t1),y),\theta_2^{(t)} \sim p(\theta_2 \mid \theta_1^{(t)},\theta_3^{(t-1)},\ldots,\theta_k^{(t-1)},y),

and continue until the last block:

θk(t)p(θkθ1(t),,θk1(t),y).\theta_k^{(t)} \sim p(\theta_k \mid \theta_1^{(t)},\ldots,\theta_{k-1}^{(t)},y).

Each new value is used immediately in later updates in the same iteration.

After many cycles, the draws

θ(1),θ(2),\theta^{(1)},\theta^{(2)},\ldots

behave like dependent draws from the joint posterior distribution.

They are not iid. They are a Markov chain whose stationary distribution is the posterior.

Even though the sampler updates conditionally, the long-run draws of each component approximate its marginal posterior:

θj(1),,θj(N)p(θjy).\theta_j^{(1)},\ldots,\theta_j^{(N)} \approx p(\theta_j \mid y).

So posterior means, intervals, and probabilities are computed by the same simulation averages as before.

Gibbs draws are dependent. Consecutive draws often look similar, especially when parameters are strongly correlated.

For an autocorrelated chain,

Var(θˉ)=σ2N(1+2=1ρ),\mathrm{Var}(\bar\theta) = \frac{\sigma^2}{N} \left( 1+2\sum_{\ell=1}^{\infty}\rho_{\ell} \right),

where ρ\rho_{\ell} is the lag-\ell autocorrelation.

The factor

IF=1+2=1ρ\mathrm{IF} = 1+2\sum_{\ell=1}^{\infty}\rho_{\ell}

is the inefficiency factor. Larger autocorrelation means fewer effective independent draws.

Suppose

(θ1θ2)N2((μ1μ2),(1ρρ1)).\begin{pmatrix} \theta_1 \\ \theta_2 \end{pmatrix} \sim N_2 \left( \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix} \right).

For this model,

θ1θ2N ⁣(μ1+ρ(θ2μ2),1ρ2),\theta_1 \mid \theta_2 \sim N\!\left(\mu_1+\rho(\theta_2-\mu_2),\,1-\rho^2\right),

and

θ2θ1N ⁣(μ2+ρ(θ1μ1),1ρ2).\theta_2 \mid \theta_1 \sim N\!\left(\mu_2+\rho(\theta_1-\mu_1),\,1-\rho^2\right).

If ρ|\rho| is small, the two parameters are weakly related and the chain moves easily.

If ρ|\rho| is close to 1, the posterior mass lies along a narrow diagonal shape. Updating one coordinate at a time moves slowly along that shape. The sampler is still valid, but it can mix poorly.

Let

x1,,xnμ,σ2N(μ,σ2).x_1,\ldots,x_n \mid \mu,\sigma^2 \sim N(\mu,\sigma^2).

Use conditionally conjugate priors:

μN(μ0,τ02),σ2Inv-χ2(ν0,σ02).\mu \sim N(\mu_0,\tau_0^2), \qquad \sigma^2 \sim \mathrm{Inv\text{-}}\chi^2(\nu_0,\sigma_0^2).

The joint posterior for (μ,σ2)(\mu,\sigma^2) may not be from one simple standard family. But the full conditionals are standard:

μσ2,xN(μn,τn2),\mu \mid \sigma^2,x \sim N(\mu_n,\tau_n^2),

and

σ2μ,xInv-χ2(ν0+n,ν0σ02+i=1n(xiμ)2ν0+n).\sigma^2 \mid \mu,x \sim \mathrm{Inv\text{-}}\chi^2 \left( \nu_0+n,\, \frac{\nu_0\sigma_0^2+\sum_{i=1}^{n}(x_i-\mu)^2}{\nu_0+n} \right).

The sampler alternates:

  1. Draw μ\mu given the current σ2\sigma^2.
  2. Draw σ2\sigma^2 given the new μ\mu.
  3. Repeat.

Some models are hard because the likelihood is not conditionally conjugate. Data augmentation introduces latent variables that make the conditional steps easier.

The pattern is:

  1. Add unobserved variables zz.
  2. Sample parameters given zz and the data.
  3. Sample zz given parameters and the data.

The marginal posterior for the original parameters is unchanged after we ignore the auxiliary variables.

A KK-component mixture has density

p(xi)=k=1Kπkϕ(xi;μk,σk2),p(x_i) = \sum_{k=1}^{K}\pi_k\,\phi(x_i;\mu_k,\sigma_k^2),

where πk\pi_k are mixture weights and ϕ(;μk,σk2)\phi(\cdot;\mu_k,\sigma_k^2) is a Normal density.

Let

Ii{1,,K}I_i \in \{1,\ldots,K\}

be the unknown component that generated observation xix_i.

Given the indicators, the data are separated into ordinary Normal groups.

Step 2: Update Parameters Given Indicators

Section titled “Step 2: Update Parameters Given Indicators”

Conditional on I1,,InI_1,\ldots,I_n, update:

  • mixture weights π\pi from a Dirichlet distribution;
  • each component’s (μk,σk2)(\mu_k,\sigma_k^2) from its Normal—Inverse-χ2\chi^2 conditional distribution.

Step 3: Update Indicators Given Parameters

Section titled “Step 3: Update Indicators Given Parameters”

For each observation,

Pr(Ii=kxi,π,μ,σ2)πkϕ(xi;μk,σk2).\Pr(I_i=k \mid x_i,\pi,\mu,\sigma^2) \propto \pi_k\,\phi(x_i;\mu_k,\sigma_k^2).

Normalize these KK values so they sum to 1, then draw IiI_i from the resulting categorical distribution.

For binary data, probit regression can be written with a latent utility:

uiβN(xiβ,1),yi=1(ui>0).u_i \mid \beta \sim N(x_i^\top\beta,1), \qquad y_i=\mathbf{1}(u_i>0).

We observe yiy_i, not uiu_i.

  1. Given uu, update β\beta as in a Normal linear regression.

  2. Given β\beta and yiy_i, update each uiu_i from a truncated Normal distribution:

    • if yi=1y_i=1, draw ui>0u_i>0;
    • if yi=0y_i=0, draw ui0u_i\le 0.

The latent utility turns a binary-response problem into repeated Normal-regression steps.

Example: Logistic Regression With Polya-Gamma Variables

Section titled “Example: Logistic Regression With Polya-Gamma Variables”

Logistic regression is not conjugate to a Normal prior in the same direct way as linear regression. Polya-Gamma augmentation creates a conditionally Normal update for β\beta.

Introduce latent variables

ωiβPG(1,xiβ).\omega_i \mid \beta \sim \mathrm{PG}(1,x_i^\top\beta).

Given ω=(ω1,,ωn)\omega=(\omega_1,\ldots,\omega_n), the coefficient update is Normal:

βy,ωN(mω,Vω),\beta \mid y,\omega \sim N(m_{\omega},V_{\omega}),

where

Vω=(XΩX+B1)1,V_{\omega} = (X^\top\Omega X+B^{-1})^{-1},

and

mω=Vω(Xκ+B1b).m_{\omega} = V_{\omega}(X^\top\kappa+B^{-1}b).

Here Ω=diag(ω1,,ωn)\Omega=\mathrm{diag}(\omega_1,\ldots,\omega_n) and κi=yi1/2\kappa_i=y_i-1/2.

You do not need to memorize the Polya-Gamma distribution for this course. The important idea is the augmentation pattern: add latent variables so that the parameter update becomes a familiar Normal draw.

Shrinkage can be written as a hierarchical prior. For regression coefficients,

βσ2,λN(0,σ2λ1I),\beta \mid \sigma^2,\lambda \sim N(0,\sigma^2\lambda^{-1}I),

with

λGamma(η02,η02λ0).\lambda \sim \mathrm{Gamma}\left(\frac{\eta_0}{2},\frac{\eta_0}{2\lambda_0}\right).

The Gibbs sampler includes an update for λ\lambda:

λβ,σ2,yGamma(m+η02,σ2j=1mβj2+η0/λ02).\lambda \mid \beta,\sigma^2,y \sim \mathrm{Gamma} \left( \frac{m+\eta_0}{2}, \frac{\sigma^{-2}\sum_{j=1}^{m}\beta_j^2+\eta_0/\lambda_0}{2} \right).

The conditional mean is

E(λβ,σ2,y)=m+η0σ2j=1mβj2+η0/λ0.E(\lambda \mid \beta,\sigma^2,y) = \frac{m+\eta_0} {\sigma^{-2}\sum_{j=1}^{m}\beta_j^2+\eta_0/\lambda_0}.

If the coefficients are large, the denominator is large and the posterior mean of λ\lambda decreases. Smaller λ\lambda means less shrinkage. If the coefficients are close to zero, the sampler supports stronger shrinkage.

If two parameters are strongly correlated, updating them separately can be slow. A blocked Gibbs update samples them together.

The rule of thumb is:

Put strongly correlated parameters in the same block when a joint conditional update is available.

Sometimes the same model can be written with parameters that are less correlated in the posterior. A better parametrization can improve mixing without changing the underlying model.

Augmentation can make each conditional draw easy. But it can also add dependence and increase autocorrelation. A sampler can be simple per iteration but inefficient per effective draw.

Monte Carlo approximates expectations by simulation. If draws are independent, averages converge with variance decreasing like 1/N1/N. Gibbs sampling extends simulation to high-dimensional posteriors by repeatedly sampling from full conditional distributions. The resulting draws are dependent, so autocorrelation and effective sample size matter. Data augmentation introduces latent variables to make difficult conditional distributions tractable, with applications to mixtures, probit and logistic regression, and regularized regression.

  1. Why can a posterior probability be estimated by averaging indicator functions?
  2. What is a full conditional distribution?
  3. In a Gibbs sampler, why are the draws usually autocorrelated?
  4. Why does high correlation slow down one-at-a-time Gibbs updates in the bivariate Normal example?
  5. How does data augmentation help in mixture models or probit regression?