Monte Carlo and Gibbs Sampling

Source: Lecture 7, Monte Carlo and Gibbs sampling (BDA Ch. 10–11).

The Thread of This Chapter

Bayesian inference often ends with an integral.

We may want a posterior mean, a posterior probability, a predictive quantity, or a marginal distribution. In simple conjugate models these quantities can sometimes be computed exactly. In realistic models they usually cannot.

The main idea of this chapter is:

When an integral is hard, simulate from the distribution and average the simulated values.

The chapter builds this idea in three steps:

Use ordinary Monte Carlo when we can draw independent samples directly.
Use Gibbs sampling when we can draw from full conditional distributions.
Use data augmentation to create conditional distributions that are easy to sample from.

Monte Carlo Simulation

What Problem Does It Solve?

Suppose we want

$E[g(\theta)].$

If we can draw independent samples

$\theta^{(1)},\ldots,\theta^{(N)} \sim p(\theta),$

then we can estimate the expectation by averaging the function over the draws:

\widehat{E[g(\theta)]} = \frac{1}{N}\sum_{t=1}^{N}g(\theta^{(t)}).

This is the Monte Carlo principle.

Step 1: Draw From the Target Distribution

The target might be a prior, a posterior, or a predictive distribution. The only requirement for ordinary Monte Carlo is that the draws are independent samples from the distribution of interest.

For example, if the goal is the posterior mean of $\theta$, draw

$\theta^{(1)},\ldots,\theta^{(N)} \sim p(\theta \mid y).$

Step 2: Average the Quantity You Care About

For the posterior mean, use

\bar\theta = \frac{1}{N}\sum_{t=1}^{N}\theta^{(t)}.

For a transformed quantity, use

\bar g = \frac{1}{N}\sum_{t=1}^{N}g(\theta^{(t)}).

The law of large numbers says that these averages converge to the corresponding expectations as $N$ grows, provided those expectations exist.

Step 3: Estimate Probabilities With Indicators

A probability is also an expectation. For example,

$\Pr(\theta \le c) = E[\mathbf{1}(\theta \le c)].$

So we estimate it by the fraction of simulated draws satisfying the event:

\widehat{\Pr(\theta \le c)} = \frac{1}{N}\sum_{t=1}^{N}\mathbf{1}(\theta^{(t)} \le c).

This is why posterior intervals, tail probabilities, and predictive probabilities are natural simulation outputs.
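Steps 1 to 3 can be sketched in a few lines of Python. This is only an illustration: the Beta(3, 9) target, the threshold `c = 0.5`, and the function name `monte_carlo_summary` are assumptions chosen for the example, not part of the lecture.

```python
import random

def monte_carlo_summary(draws, c):
    """Return Monte Carlo estimates of E[theta] and Pr(theta <= c)."""
    n = len(draws)
    mean_est = sum(draws) / n                          # average of the draws
    prob_est = sum(1 for th in draws if th <= c) / n   # average of indicators
    return mean_est, prob_est

rng = random.Random(0)
# Hypothetical target: a Beta(3, 9) posterior, used only for illustration.
draws = [rng.betavariate(3, 9) for _ in range(20000)]
mean_est, prob_est = monte_carlo_summary(draws, c=0.5)
print(mean_est, prob_est)
```

For this particular target the exact mean is $3/12 = 0.25$, so the first estimate can be checked directly against a known value.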

Monte Carlo Error

For independent draws,

\mathrm{Var}(\bar g) = \frac{\mathrm{Var}(g(\theta))}{N}.

The important message is practical:

The Monte Carlo standard error, the square root of this variance, decreases at rate $1/\sqrt{N}$.

To cut simulation error in half, we need about four times as many draws.
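A small sketch of this scaling, assuming iid standard Normal draws as a stand-in target. The helper name `mc_standard_error` is hypothetical. It estimates the standard error as the sample standard deviation divided by $\sqrt{N}$ and compares two sample sizes that differ by a factor of four.

```python
import math
import random
import statistics

def mc_standard_error(values):
    """Estimated Monte Carlo standard error of the mean of iid values."""
    return statistics.stdev(values) / math.sqrt(len(values))

rng = random.Random(1)
small = [rng.gauss(0.0, 1.0) for _ in range(1000)]
large = [rng.gauss(0.0, 1.0) for _ in range(4000)]   # four times as many draws

se_small = mc_standard_error(small)
se_large = mc_standard_error(large)
ratio = se_small / se_large   # should be close to sqrt(4) = 2
print(se_small, se_large, ratio)
```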

Direct Sampling Methods

Monte Carlo is easy once we can sample from the target. Sometimes we can do that directly.

Inverse CDF Sampling

Purpose

The inverse CDF method turns a Uniform $(0,1)$ draw into a draw from a chosen distribution.

The Algorithm

Let $F$ be the cumulative distribution function of $X$.

Draw

$u \sim \mathrm{Uniform}(0,1).$
Set

$x = F^{-1}(u).$

Then $x$ has distribution $F$.

How to Read It

The uniform draw chooses a probability level. The inverse CDF maps that probability level back to the corresponding value on the $x$ scale.

For an Exponential $(\lambda)$ distribution,

F(x)=1-e^{-\lambda x}.

Set $u=F(x)$ and solve for $x$:

x=-\frac{\log(1-u)}{\lambda}.

For a standard Cauchy distribution,

F(x)=\frac{1}{2}+\frac{1}{\pi}\arctan(x),

x=\tan\{\pi(u-1/2)\}.
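The two inverse CDF formulas above translate directly into code. The sketch below uses only Python's standard library. The function names and the rate value 2.0 are assumptions made for the example.

```python
import math
import random

def sample_exponential(rate, rng):
    """Inverse CDF draw from Exponential(rate): x = -log(1 - u) / rate."""
    u = rng.random()                      # u in [0, 1), so 1 - u is in (0, 1]
    return -math.log(1.0 - u) / rate

def sample_cauchy(rng):
    """Inverse CDF draw from a standard Cauchy: x = tan(pi * (u - 1/2))."""
    u = rng.random()
    return math.tan(math.pi * (u - 0.5))

rng = random.Random(2)
exp_draws = [sample_exponential(2.0, rng) for _ in range(20000)]
cauchy_draws = [sample_cauchy(rng) for _ in range(20000)]

exp_mean = sum(exp_draws) / len(exp_draws)   # compare with 1 / rate = 0.5
# Pr(|X| <= 1) = 1/2 for a standard Cauchy, from the CDF above.
cauchy_central = sum(1 for x in cauchy_draws if abs(x) <= 1.0) / len(cauchy_draws)
print(exp_mean, cauchy_central)
```

The Cauchy draws are checked with a probability rather than a sample mean because the Cauchy distribution has no mean, so its sample average does not settle down.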

Discrete Sampling

For a discrete distribution, the inverse idea becomes a threshold rule.

Draw $u \sim \mathrm{Uniform}(0,1)$.
Find the smallest value $x$ such that

$F(x)\ge u.$
Return that $x$.

This is how categorical sampling works: divide the unit interval into probability-sized pieces and see where $u$ lands.
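The threshold rule can be sketched with a cumulative sum and a binary search. The function name `sample_discrete` and the three-category example are illustrative assumptions.

```python
import random
from bisect import bisect_left
from itertools import accumulate

def sample_discrete(values, probs, rng):
    """Return the smallest value whose cumulative probability is >= u."""
    cdf = list(accumulate(probs))         # F evaluated at each value
    u = rng.random()
    idx = bisect_left(cdf, u)             # smallest index with cdf[idx] >= u
    idx = min(idx, len(values) - 1)       # guard against rounding in the last sum
    return values[idx]

rng = random.Random(3)
values = ["a", "b", "c"]
probs = [0.2, 0.5, 0.3]                   # hypothetical category probabilities
draws = [sample_discrete(values, probs, rng) for _ in range(30000)]
freq_b = draws.count("b") / len(draws)    # should be near 0.5
print(freq_b)
```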

Sampling From Relationships

Sometimes a distribution can be generated from simpler distributions.

Examples:

If $y$ and $z$ are independent standard Normal variables, then

$\frac{y}{z} \sim \mathrm{Cauchy}(0,1).$
If $x_1,\ldots,x_\nu$ are independent standard Normal variables, then

$\sum_{i=1}^{\nu}x_i^2 \sim \chi^2_\nu.$

These relationships are useful because many software libraries already sample accurately from basic distributions.
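Both relationships can be sketched using only a standard Normal generator. The function names are hypothetical, and the loop guarding against a zero denominator is a defensive choice for the example rather than something the lecture specifies.

```python
import random

def sample_cauchy_ratio(rng):
    """Standard Cauchy draw as the ratio of two independent standard Normals."""
    y = rng.gauss(0.0, 1.0)
    z = rng.gauss(0.0, 1.0)
    while z == 0.0:                       # guard against an exact zero denominator
        z = rng.gauss(0.0, 1.0)
    return y / z

def sample_chi_square(nu, rng):
    """Chi-square draw with nu degrees of freedom as a sum of squared Normals."""
    return sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(nu))

rng = random.Random(4)
cauchy_draws = [sample_cauchy_ratio(rng) for _ in range(20000)]
chi_draws = [sample_chi_square(5, rng) for _ in range(20000)]

cauchy_central = sum(1 for x in cauchy_draws if abs(x) <= 1.0) / len(cauchy_draws)
chi_mean = sum(chi_draws) / len(chi_draws)   # a chi-square with nu = 5 has mean 5
print(cauchy_central, chi_mean)
```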

Why We Need Gibbs Sampling

Direct sampling is not always available. A posterior distribution may be high-dimensional:

$p(\theta_1,\ldots,\theta_k \mid y).$

Sampling from the full joint distribution may be hard. But each parameter may be easy to sample once the others are fixed.

Gibbs sampling uses this situation.

Full Conditional Distributions

Definition

The full conditional distribution for $\theta_j$ is

$p(\theta_j \mid \theta_{-j}, y),$

where $\theta_{-j}$ means all parameters except $\theta_j$.

What It Means

The full conditional answers:

If all other parameters were temporarily fixed, what is the distribution of this one parameter?

Gibbs sampling cycles through these conditional distributions. Each update is easy, but together the updates explore the joint posterior.

The Gibbs Sampling Algorithm

Step 1: Start Somewhere

Choose initial values

$\theta^{(0)}=(\theta_1^{(0)},\ldots,\theta_k^{(0)}).$

The starting value does not need to be a typical posterior value, but poor starting values may require a longer warm-up period.

Step 2: Update One Block at a Time

At iteration $t$, draw

\theta_1^{(t)} \sim p(\theta_1 \mid \theta_2^{(t-1)},\ldots,\theta_k^{(t-1)},y),

then

\theta_2^{(t)} \sim p(\theta_2 \mid \theta_1^{(t)},\theta_3^{(t-1)},\ldots,\theta_k^{(t-1)},y),

and continue until the last block:

\theta_k^{(t)} \sim p(\theta_k \mid \theta_1^{(t)},\ldots,\theta_{k-1}^{(t)},y).

Each new value is used immediately in later updates in the same iteration.

Step 3: Repeat

After many cycles, the draws

$\theta^{(1)},\theta^{(2)},\ldots$

behave like dependent draws from the joint posterior distribution.

They are not iid. They are a Markov chain whose stationary distribution is the posterior.

How to Read Gibbs Output

Marginal Posterior Draws

Even though the sampler updates conditionally, the long-run draws of each component approximate its marginal posterior:

$\theta_j^{(1)},\ldots,\theta_j^{(N)} \approx p(\theta_j \mid y).$

So posterior means, intervals, and probabilities are computed by the same simulation averages as before.

Autocorrelation

Gibbs draws are dependent. Consecutive draws often look similar, especially when parameters are strongly correlated.

For a stationary autocorrelated chain with marginal variance $\sigma^2=\mathrm{Var}(\theta)$ and large $N$,

\mathrm{Var}(\bar\theta) \approx \frac{\sigma^2}{N} \left( 1+2\sum_{\ell=1}^{\infty}\rho_{\ell} \right),

where $\rho_{\ell}$ is the lag-$\ell$ autocorrelation.

The factor

\mathrm{IF} = 1+2\sum_{\ell=1}^{\infty}\rho_{\ell}

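is the inefficiency factor. Larger autocorrelation means a larger factor and fewer effective independent draws: $N$ dependent draws carry roughly the information of $N/\mathrm{IF}$ independent ones, the effective sample size.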
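A rough sketch of how the inefficiency factor might be estimated from a chain. Two assumptions are not from the lecture: the infinite sum is truncated at the first non-positive sample autocorrelation (a simple heuristic), and an AR(1) series with coefficient 0.9 stands in for real Gibbs output. All function names are hypothetical.

```python
import random

def inefficiency_factor(chain, max_lag=100):
    """Estimate IF = 1 + 2 * sum of autocorrelations.

    Assumption: the sum stops at the first non-positive sample
    autocorrelation (or at max_lag), a heuristic truncation rule.
    """
    n = len(chain)
    mean = sum(chain) / n
    centered = [x - mean for x in chain]
    var = sum(c * c for c in centered) / n
    total = 0.0
    for lag in range(1, max_lag + 1):
        cov = sum(centered[t] * centered[t + lag] for t in range(n - lag)) / n
        rho = cov / var                   # sample lag-`lag` autocorrelation
        if rho <= 0.0:
            break
        total += rho
    return 1.0 + 2.0 * total

def ar1_chain(phi, n, rng):
    """Stand-in for dependent sampler output: x_t = phi * x_{t-1} + noise."""
    x, chain = 0.0, []
    for _ in range(n):
        x = phi * x + rng.gauss(0.0, 1.0)
        chain.append(x)
    return chain

rng = random.Random(5)
n = 20000
dependent = ar1_chain(0.9, n, rng)
independent = [rng.gauss(0.0, 1.0) for _ in range(n)]

if_dep = inefficiency_factor(dependent)
if_ind = inefficiency_factor(independent)
ess_dep = n / if_dep                      # effective sample size N / IF
ess_ind = n / if_ind
print(if_dep, ess_dep, if_ind, ess_ind)
```

For an AR(1) series the autocorrelations are $\rho_\ell=\phi^\ell$, so the exact factor is $(1+\phi)/(1-\phi)=19$ when $\phi=0.9$. The estimate should land in that neighbourhood, while the independent draws should give a factor near 1.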

Example: Bivariate Normal Gibbs Sampler

The Target Distribution

Suppose

\begin{pmatrix} \theta_1 \\ \theta_2 \end{pmatrix} \sim N_2 \left( \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix} \right).

The Full Conditionals

For this model,

\theta_1 \mid \theta_2 \sim N\!\left(\mu_1+\rho(\theta_2-\mu_2),\,1-\rho^2\right),

and

\theta_2 \mid \theta_1 \sim N\!\left(\mu_2+\rho(\theta_1-\mu_1),\,1-\rho^2\right).

What the Example Teaches

If $|\rho|$ is small, the two parameters are weakly related and the chain moves easily.

If $|\rho|$ is close to 1, the posterior mass lies along a narrow diagonal ridge. Each conditional draw has variance $1-\rho^2$, which is then tiny, so updating one coordinate at a time moves slowly along the ridge. The sampler is still valid, but it can mix poorly.
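The two full conditionals give a complete sampler in a few lines. The sketch below is one possible implementation. The function names, the starting point, the burn-in of 1000 draws, and the parameter values are all assumptions chosen for illustration.

```python
import math
import random

def gibbs_bivariate_normal(mu1, mu2, rho, n_iter, rng, start=(0.0, 0.0)):
    """Gibbs sampler for a bivariate Normal with unit variances."""
    sd = math.sqrt(1.0 - rho ** 2)        # conditional standard deviation
    th1, th2 = start
    draws = []
    for _ in range(n_iter):
        th1 = rng.gauss(mu1 + rho * (th2 - mu2), sd)   # theta_1 | theta_2
        th2 = rng.gauss(mu2 + rho * (th1 - mu1), sd)   # theta_2 | new theta_1
        draws.append((th1, th2))
    return draws

def sample_correlation(pairs):
    """Plain sample correlation of a list of (a, b) pairs."""
    n = len(pairs)
    ma = sum(a for a, _ in pairs) / n
    mb = sum(b for _, b in pairs) / n
    sab = sum((a - ma) * (b - mb) for a, b in pairs)
    saa = sum((a - ma) ** 2 for a, _ in pairs)
    sbb = sum((b - mb) ** 2 for _, b in pairs)
    return sab / math.sqrt(saa * sbb)

rng = random.Random(6)
draws = gibbs_bivariate_normal(mu1=1.0, mu2=-1.0, rho=0.5, n_iter=20000, rng=rng)
kept = draws[1000:]                       # discard an assumed warm-up period
mean1 = sum(a for a, _ in kept) / len(kept)
corr = sample_correlation(kept)
print(mean1, corr)
```

Substituting one conditional into the other shows that successive $\theta_1$ draws satisfy $\theta_1^{(t)}-\mu_1=\rho^2(\theta_1^{(t-1)}-\mu_1)+\text{noise}$, so the lag-1 autocorrelation of each coordinate is $\rho^2$. Rerunning the sketch with `rho=0.99` makes the slow movement visible.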

Gibbs Sampling for a Normal Model

The Model

Let

$x_1,\ldots,x_n \mid \mu,\sigma^2 \sim N(\mu,\sigma^2).$

Use conditionally conjugate priors:

\mu \sim N(\mu_0,\tau_0^2), \qquad \sigma^2 \sim \mathrm{Inv\text{-}}\chi^2(\nu_0,\sigma_0^2).

Why Gibbs Helps

The joint posterior for $(\mu,\sigma^2)$ is not a simple standard family under these independent priors. But the full conditionals are standard:

\mu \mid \sigma^2,x \sim N(\mu_n,\tau_n^2),

and

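\sigma^2 \mid \mu,x \sim \mathrm{Inv\text{-}}\chi^2 \left( \nu_0+n,\, \frac{\nu_0\sigma_0^2+\sum_{i=1}^{n}(x_i-\mu)^2}{\nu_0+n} \right).

Here $\mu_n$ and $\tau_n^2$ are the usual precision-weighted quantities, with $\bar x$ the sample mean:

\tau_n^2 = \left(\frac{1}{\tau_0^2}+\frac{n}{\sigma^2}\right)^{-1}, \qquad \mu_n = \tau_n^2\left(\frac{\mu_0}{\tau_0^2}+\frac{n\bar x}{\sigma^2}\right).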

The sampler alternates:

Draw $\mu$ given the current $\sigma^2$.
Draw $\sigma^2$ given the new $\mu$.
Repeat.
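The alternating scheme can be sketched as follows. Several pieces are assumptions rather than content from the lecture: the formulas used for $\mu_n$ and $\tau_n^2$ are the standard precision-weighted Normal update, the scaled inverse-$\chi^2$ draw is generated as $\nu s^2/X$ with $X\sim\chi^2_\nu$, and the simulated data, prior settings, and function name are purely illustrative.

```python
import math
import random
import statistics

def gibbs_normal(x, mu0, tau0_sq, nu0, sigma0_sq, n_iter, rng):
    """Gibbs sampler for (mu, sigma^2) in a Normal model."""
    n = len(x)
    xbar = sum(x) / n
    mu, sigma_sq = xbar, sigma0_sq        # assumed starting values
    draws = []
    for _ in range(n_iter):
        # mu | sigma^2, x ~ N(mu_n, tau_n^2), precision-weighted update
        tau_n_sq = 1.0 / (1.0 / tau0_sq + n / sigma_sq)
        mu_n = tau_n_sq * (mu0 / tau0_sq + n * xbar / sigma_sq)
        mu = rng.gauss(mu_n, math.sqrt(tau_n_sq))
        # sigma^2 | mu, x ~ scaled Inv-chi^2(nu_n, scale)
        nu_n = nu0 + n
        scale = (nu0 * sigma0_sq + sum((xi - mu) ** 2 for xi in x)) / nu_n
        chi2 = rng.gammavariate(nu_n / 2.0, 2.0)   # chi^2 with nu_n d.f. (shape, scale)
        sigma_sq = nu_n * scale / chi2
        draws.append((mu, sigma_sq))
    return draws

rng = random.Random(7)
# Hypothetical data: 200 observations from N(3, 2^2).
x = [rng.gauss(3.0, 2.0) for _ in range(200)]
draws = gibbs_normal(x, mu0=0.0, tau0_sq=100.0, nu0=1.0, sigma0_sq=1.0,
                     n_iter=3000, rng=rng)
kept = draws[500:]                        # discard an assumed warm-up period
post_mu = sum(m for m, _ in kept) / len(kept)
post_sigma_sq = sum(s for _, s in kept) / len(kept)
xbar = sum(x) / len(x)
s_sq = statistics.variance(x)
print(post_mu, xbar, post_sigma_sq, s_sq)
```

With a prior this weak, the posterior means should sit close to the sample mean and sample variance, which gives a quick sanity check on the sampler.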

Data Augmentation

What Problem Does It Solve?

Some models are hard because the likelihood is not conditionally conjugate. Data augmentation introduces latent variables that make the conditional steps easier.

The pattern is:

Add unobserved variables $z$.
Sample parameters given $z$ and the data.
Sample $z$ given parameters and the data.

Integrating $z$ out of the augmented posterior recovers the original posterior, so the draws of the original parameters remain valid posterior draws once the auxiliary variables are ignored.

Example: Mixture of Normals

The Model

A $K$-component mixture has density

p(x_i) = \sum_{k=1}^{K}\pi_k\,\phi(x_i;\mu_k,\sigma_k^2),

where $\pi_k$ are mixture weights and $\phi(\cdot;\mu_k,\sigma_k^2)$ is a Normal density.

Step 1: Introduce Membership Indicators

Let

$I_i \in \{1,\ldots,K\}$

be the unknown component that generated observation $x_i$.

Given the indicators, the data are separated into ordinary Normal groups.

Step 2: Update Parameters Given Indicators

Conditional on $I_1,\ldots,I_n$, update:

mixture weights $\pi$ from a Dirichlet distribution;
each component’s $(\mu_k,\sigma_k^2)$ from its Normal-Inverse-$\chi^2$ conditional distribution.

Step 3: Update Indicators Given Parameters

For each observation,

\Pr(I_i=k \mid x_i,\pi,\mu,\sigma^2) \propto \pi_k\,\phi(x_i;\mu_k,\sigma_k^2).

Normalize these $K$ values so they sum to 1, then draw $I_i$ from the resulting categorical distribution.
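The indicator update, together with the Dirichlet update for the weights from Step 2, can be sketched as below. The symmetric Dirichlet prior with parameter `alpha`, the normalized-Gamma construction of the Dirichlet draw, and all function names are assumptions for the example. The component mean and variance updates are omitted.

```python
import math
import random

def normal_pdf(x, mu, sigma_sq):
    """Normal density phi(x; mu, sigma^2)."""
    return math.exp(-0.5 * (x - mu) ** 2 / sigma_sq) / math.sqrt(2.0 * math.pi * sigma_sq)

def membership_probs(x_i, weights, means, variances):
    """Normalized Pr(I_i = k | x_i, pi, mu, sigma^2) for k = 1, ..., K."""
    unnorm = [w * normal_pdf(x_i, m, v) for w, m, v in zip(weights, means, variances)]
    total = sum(unnorm)   # may underflow for extreme x_i; logs are safer in practice
    return [p / total for p in unnorm]

def draw_indicators(x, weights, means, variances, rng):
    """Draw each I_i from its categorical full conditional (labels 1..K)."""
    labels = list(range(1, len(weights) + 1))
    return [rng.choices(labels, weights=membership_probs(x_i, weights, means, variances))[0]
            for x_i in x]

def draw_weights(indicators, n_components, alpha, rng):
    """Draw pi from Dirichlet(alpha + counts) via normalized Gamma draws.

    Assumption: a symmetric Dirichlet(alpha, ..., alpha) prior on pi.
    """
    counts = [indicators.count(k) for k in range(1, n_components + 1)]
    gammas = [rng.gammavariate(alpha + c, 1.0) for c in counts]
    total = sum(gammas)
    return [g / total for g in gammas]

rng = random.Random(8)
weights, means, variances = [0.5, 0.5], [0.0, 10.0], [1.0, 1.0]   # hypothetical values
probs_mid = membership_probs(5.0, weights, means, variances)       # equidistant point
indicators = draw_indicators([0.0, 10.0], weights, means, variances, rng)
new_weights = draw_weights(indicators, n_components=2, alpha=1.0, rng=rng)
print(probs_mid, indicators, new_weights)
```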

Example: Probit Regression

The Model

For binary data, probit regression can be written with a latent utility:

u_i \mid \beta \sim N(x_i^\top\beta,1), \qquad y_i=\mathbf{1}(u_i>0).

We observe $y_i$, not $u_i$.

Gibbs Updates

Given $u$, update $\beta$ as in a Normal linear regression.
Given $\beta$ and $y_i$, update each $u_i$ from a truncated Normal distribution:
- if $y_i=1$, draw $u_i>0$;
- if $y_i=0$, draw $u_i\le 0$.

The latent utility turns a binary-response problem into repeated Normal-regression steps.
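A minimal sketch of these two updates, under simplifying assumptions that are not in the lecture: a single covariate with no intercept, a flat prior on $\beta$ (so $\beta \mid u \sim N(\sum x_iu_i/\sum x_i^2,\ 1/\sum x_i^2)$), and truncated Normal draws generated by the inverse CDF method. The function names and the simulated data are hypothetical.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()                 # standard Normal, used for its inverse CDF

def draw_latent(mean, y, rng):
    """Draw u ~ N(mean, 1) truncated to u > 0 (y = 1) or u <= 0 (y = 0)."""
    s = mean if y == 1 else -mean
    phi_s = 0.5 * math.erfc(-s / math.sqrt(2.0))     # Phi(s), accurate in the lower tail
    r = 1.0 - rng.random()                           # uniform on (0, 1]
    p = min(max(r * phi_s, 1e-300), 1.0 - 1e-16)     # keep p strictly inside (0, 1)
    w = STD_NORMAL.inv_cdf(p)                        # standard Normal truncated to w <= s
    return mean - w if y == 1 else mean + w

def probit_gibbs(x, y, n_iter, rng):
    """Latent-utility Gibbs sampler for a one-coefficient probit model."""
    sxx = sum(xi * xi for xi in x)
    beta, draws = 0.0, []
    for _ in range(n_iter):
        u = [draw_latent(xi * beta, yi, rng) for xi, yi in zip(x, y)]   # u | beta, y
        sxu = sum(xi * ui for xi, ui in zip(x, u))
        beta = rng.gauss(sxu / sxx, math.sqrt(1.0 / sxx))               # beta | u
        draws.append(beta)
    return draws

rng = random.Random(9)
true_beta = 1.0                           # hypothetical value used to simulate data
x = [rng.gauss(0.0, 1.0) for _ in range(300)]
y = [1 if xi * true_beta + rng.gauss(0.0, 1.0) > 0 else 0 for xi in x]
draws = probit_gibbs(x, y, n_iter=1000, rng=rng)
kept = draws[200:]                        # discard an assumed warm-up period
beta_mean = sum(kept) / len(kept)
print(beta_mean)
```

With a Normal prior and several covariates, the $\beta$ step would become a multivariate Normal draw. The scalar case is used here only to keep the example free of matrix code.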

Example: Logistic Regression With Polya-Gamma Variables

Purpose

A Normal prior on $\beta$ is not conjugate to the logistic likelihood, so there is no direct closed-form update as in linear regression. Polya-Gamma augmentation creates a conditionally Normal update for $\beta$.

The Updates

Introduce latent variables

$\omega_i \mid \beta \sim \mathrm{PG}(1,x_i^\top\beta).$

Given $\omega=(\omega_1,\ldots,\omega_n)$, the coefficient update is Normal:

\beta \mid y,\omega \sim N(m_{\omega},V_{\omega}),

where

V_{\omega} = (X^\top\Omega X+B^{-1})^{-1},

and

m_{\omega} = V_{\omega}(X^\top\kappa+B^{-1}b).

Here $\Omega=\mathrm{diag}(\omega_1,\ldots,\omega_n)$, $\kappa_i=y_i-1/2$, and $b$ and $B$ are the mean and covariance of the Normal prior $\beta \sim N(b,B)$.

You do not need to memorize the Polya-Gamma distribution for this course. The important idea is the augmentation pattern: add latent variables so that the parameter update becomes a familiar Normal draw.

Example: Regularized Regression

The Prior

Shrinkage can be written as a hierarchical prior. For regression coefficients,

\beta \mid \sigma^2,\lambda \sim N(0,\sigma^2\lambda^{-1}I),

with

\lambda \sim \mathrm{Gamma}\left(\frac{\eta_0}{2},\frac{\eta_0}{2\lambda_0}\right).

The Conditional Update

The Gibbs sampler includes an update for $\lambda$, where $m$ is the number of coefficients:

\lambda \mid \beta,\sigma^2,y \sim \mathrm{Gamma} \left( \frac{m+\eta_0}{2}, \frac{\sigma^{-2}\sum_{j=1}^{m}\beta_j^2+\eta_0/\lambda_0}{2} \right).

The conditional mean is

E(\lambda \mid \beta,\sigma^2,y) = \frac{m+\eta_0} {\sigma^{-2}\sum_{j=1}^{m}\beta_j^2+\eta_0/\lambda_0}.

How to Read It

If the coefficients are large, the denominator is large and the conditional mean of $\lambda$ decreases. Smaller $\lambda$ means a larger prior variance $\sigma^2\lambda^{-1}$ and therefore less shrinkage. If the coefficients are close to zero, the sampler supports stronger shrinkage.
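The $\lambda$ update is a single Gamma draw. One detail to watch in this sketch: the formulas above use a shape and rate parameterization, while Python's `random.gammavariate` takes shape and scale, so the rate is inverted. The function names and the numerical values are assumptions for the example.

```python
import random

def gamma_params(beta, sigma_sq, eta0, lambda0):
    """Shape and rate of the Gamma full conditional for lambda."""
    m = len(beta)
    shape = (m + eta0) / 2.0
    rate = (sum(b * b for b in beta) / sigma_sq + eta0 / lambda0) / 2.0
    return shape, rate

def draw_lambda(beta, sigma_sq, eta0, lambda0, rng):
    """Draw lambda | beta, sigma^2 from its Gamma full conditional."""
    shape, rate = gamma_params(beta, sigma_sq, eta0, lambda0)
    return rng.gammavariate(shape, 1.0 / rate)   # gammavariate expects scale = 1 / rate

def conditional_mean_lambda(beta, sigma_sq, eta0, lambda0):
    """E(lambda | beta, sigma^2) = shape / rate."""
    shape, rate = gamma_params(beta, sigma_sq, eta0, lambda0)
    return shape / rate

rng = random.Random(10)
small_beta = [0.1] * 5                    # hypothetical coefficients near zero
large_beta = [3.0] * 5                    # hypothetical large coefficients
cm_small = conditional_mean_lambda(small_beta, 1.0, 2.0, 1.0)
cm_large = conditional_mean_lambda(large_beta, 1.0, 2.0, 1.0)
lam_draws = [draw_lambda(small_beta, 1.0, 2.0, 1.0, rng) for _ in range(20000)]
mc_mean = sum(lam_draws) / len(lam_draws)   # should be close to cm_small
print(cm_small, cm_large, mc_mean)
```

Comparing `cm_small` with `cm_large` shows the pattern described above: coefficients near zero give a larger conditional mean for $\lambda$, hence stronger shrinkage.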

Improving Gibbs Sampling

Blocking

If two parameters are strongly correlated, updating them separately can be slow. A blocked Gibbs update samples them together.

The rule of thumb is:

Put strongly correlated parameters in the same block when a joint conditional update is available.

Reparametrization

Sometimes the same model can be written with parameters that are less correlated in the posterior. A better parametrization can improve mixing without changing the underlying model.

Data Augmentation Tradeoff

Augmentation can make each conditional draw easy. But it can also add dependence and increase autocorrelation. A sampler can be simple per iteration but inefficient per effective draw.

Chapter Summary

Monte Carlo approximates expectations by simulation. If draws are independent, averages converge with variance decreasing like $1/N$. Gibbs sampling extends simulation to high-dimensional posteriors by repeatedly sampling from full conditional distributions. The resulting draws are dependent, so autocorrelation and effective sample size matter. Data augmentation introduces latent variables to make difficult conditional distributions tractable, with applications to mixtures, probit and logistic regression, and regularized regression.

Check Your Understanding

Why can a posterior probability be estimated by averaging indicator functions?
What is a full conditional distribution?
In a Gibbs sampler, why are the draws usually autocorrelated?
Why does high correlation slow down one-at-a-time Gibbs updates in the bivariate Normal example?
How does data augmentation help in mixture models or probit regression?