Skip to content

MCMC and Metropolis--Hastings

Source: Lecture 8 — MCMC and Metropolis—Hastings (BDA Ch. 11).

The previous chapter used simulation to approximate posterior expectations. Gibbs sampling worked when each full conditional distribution was easy to sample from.

This chapter handles a more general problem:

How can we sample from a posterior distribution when direct sampling is not available?

The answer is Markov chain Monte Carlo, or MCMC.

The idea is not to draw independent samples immediately. Instead, we build a Markov chain whose long-run distribution is the posterior distribution we want. After the chain has run long enough, its states can be used as dependent posterior draws.

A Markov chain is a sequence of random states

X1,X2,X_1,X_2,\ldots

where the next state depends on the current state, but not on the earlier history:

Pr(Xt+1=sjXt=si,Xt1,,X1)=Pr(Xt+1=sjXt=si).\Pr(X_{t+1}=s_j \mid X_t=s_i,X_{t-1},\ldots,X_1) = \Pr(X_{t+1}=s_j \mid X_t=s_i).

For a finite state space

S={s1,,sk},S=\{s_1,\ldots,s_k\},

write the one-step transition probabilities as

pij=Pr(Xt+1=sjXt=si).p_{ij} = \Pr(X_{t+1}=s_j \mid X_t=s_i).

The transition matrix PP contains these probabilities, and each row sums to 1.

The hh-step transition matrix is

P(h)=Ph.P^{(h)}=P^h.

Its (i,j)(i,j) entry gives the probability of being in state sjs_j after hh steps, starting from sis_i.

For MCMC, we want the chain to settle into the target distribution. That long-run distribution is called a stationary distribution.

A distribution

π=(π1,,πk)\pi=(\pi_1,\ldots,\pi_k)

is stationary for PP if

π=πP.\pi=\pi P.

This means that if XtX_t has distribution π\pi, then Xt+1X_{t+1} also has distribution π\pi.

A finite-state chain has a unique useful limiting distribution under standard regularity conditions:

  • Irreducible: every state can eventually be reached from every other state.
  • Aperiodic: the chain is not trapped in a fixed cycle.
  • Positive recurrent: the chain returns to states in finite expected time.

Under these conditions, the distribution of XtX_t approaches π\pi as tt grows.

The target distribution is usually a posterior:

p(θy).p(\theta \mid y).

MCMC constructs a Markov chain

θ(0),θ(1),θ(2),\theta^{(0)},\theta^{(1)},\theta^{(2)},\ldots

whose stationary distribution is p(θy)p(\theta \mid y).

The practical workflow is:

  1. Start at some value θ(0)\theta^{(0)}.
  2. Repeatedly propose a move.
  3. Accept or reject the move using a rule that preserves the target distribution.
  4. Use the later draws to approximate posterior summaries.

The most important accept/reject rule is Metropolis—Hastings.

Before MCMC, it helps to remember rejection sampling.

Suppose we want to sample from a target density proportional to p(x)p(x) and can find an envelope density g(x)g(x) and constant MM such that

p(x)Mg(x).p(x)\le M g(x).

The rejection sampler:

  1. draws xgx \sim g;

  2. accepts with probability

    p(x)Mg(x);\frac{p(x)}{M g(x)};

  3. repeats until enough accepted draws are collected.

Accepted draws are independent. But in high dimensions, good envelope distributions are hard to find and acceptance rates can be very low.

MCMC avoids the global envelope requirement by making local moves.

Random walk Metropolis is the simplest MCMC algorithm for continuous parameters. It proposes a new value near the current value.

Choose an initial value

θ(0).\theta^{(0)}.

At iteration ii, propose

θpθ(i1)N(θ(i1),cΣ).\theta^p \mid \theta^{(i-1)} \sim N(\theta^{(i-1)},c\Sigma).

Here Σ\Sigma controls the direction and shape of proposals, while cc controls their size.

The acceptance probability is

α=min(1,p(θpy)p(θ(i1)y)).\alpha = \min \left( 1,\, \frac{p(\theta^p \mid y)} {p(\theta^{(i-1)} \mid y)} \right).

If the proposal has higher posterior density, the ratio is above 1 and the proposal is accepted. If it has lower posterior density, it may still be accepted, but with smaller probability.

Draw uUniform(0,1)u \sim \mathrm{Uniform}(0,1).

If uαu\le \alpha, set

θ(i)=θp.\theta^{(i)}=\theta^p.

Otherwise set

θ(i)=θ(i1).\theta^{(i)}=\theta^{(i-1)}.

Rejected proposals are not discarded from the chain. They produce repeated values.

Why the Normalizing Constant Is Not Needed

Section titled “Why the Normalizing Constant Is Not Needed”

Bayesian posteriors often have the form

p(θy)=p(yθ)p(θ)p(y).p(\theta \mid y) = \frac{p(y \mid \theta)p(\theta)}{p(y)}.

The denominator p(y)p(y) can be hard to compute. Random walk Metropolis only needs a ratio:

p(θpy)p(θ(i1)y)=p(yθp)p(θp)p(yθ(i1))p(θ(i1)).\frac{p(\theta^p \mid y)} {p(\theta^{(i-1)} \mid y)} = \frac{ p(y \mid \theta^p)p(\theta^p) }{ p(y \mid \theta^{(i-1)})p(\theta^{(i-1)}) }.

The marginal likelihood p(y)p(y) cancels. This is one reason MCMC is so useful: we can sample from a distribution known only up to proportionality.

If the proposal steps are too small, almost every proposal is accepted, but the chain moves slowly.

If the proposal steps are too large, most proposals are rejected, and the chain stays in place too often.

For many random walk Metropolis problems, an average acceptance rate around 25—30% is a useful target, especially in moderate or high dimensions.

Common choices include:

  • Σ=I\Sigma=I, giving equal proposal scale in all directions;
  • Σ=Jθ^1\Sigma=J_{\hat\theta}^{-1}, using a curvature estimate near the posterior mode;
  • an adaptive estimate based on an initial run.

A good proposal is easy to sample from, easy to evaluate in the acceptance rule, and large enough to explore the posterior efficiently.

Random walk Metropolis uses a symmetric proposal:

q(ab)=q(ba).q(a \mid b)=q(b \mid a).

Metropolis—Hastings allows asymmetric proposals. This is useful when proposals are drawn from an approximation, a conditional distribution, or another distribution that does not move equally in both directions.

At iteration ii:

  1. Propose

    θpq(θ(i1)).\theta^p \sim q(\cdot \mid \theta^{(i-1)}).

  2. Compute

    α=min(1,p(yθp)p(θp)p(yθ(i1))p(θ(i1))q(θ(i1)θp)q(θpθ(i1))).\alpha = \min \left( 1,\, \frac{ p(y \mid \theta^p)p(\theta^p) }{ p(y \mid \theta^{(i-1)})p(\theta^{(i-1)}) } \frac{ q(\theta^{(i-1)} \mid \theta^p) }{ q(\theta^p \mid \theta^{(i-1)}) } \right).
  3. Accept the proposal with probability α\alpha; otherwise stay at the current value.

The first ratio compares target density:

p(yθp)p(θp)p(yθ(i1))p(θ(i1)).\frac{ p(y \mid \theta^p)p(\theta^p) }{ p(y \mid \theta^{(i-1)})p(\theta^{(i-1)}) }.

The second ratio corrects for proposal asymmetry:

q(θ(i1)θp)q(θpθ(i1)).\frac{ q(\theta^{(i-1)} \mid \theta^p) }{ q(\theta^p \mid \theta^{(i-1)}) }.

If it is easier to propose moves from the old state to the new state than from the new state back to the old state, the correction accounts for that imbalance.

The independence sampler is a Metropolis—Hastings method where the proposal does not depend on the current state:

q(θpθ(i1))=q(θp).q(\theta^p \mid \theta^{(i-1)})=q(\theta^p).

For example, qq might be a multivariate tt approximation centered at a posterior mode.

The acceptance probability becomes

α=min(1,p(yθp)p(θp)p(yθ(i1))p(θ(i1))q(θ(i1))q(θp)).\alpha = \min \left( 1,\, \frac{ p(y \mid \theta^p)p(\theta^p) }{ p(y \mid \theta^{(i-1)})p(\theta^{(i-1)}) } \frac{ q(\theta^{(i-1)}) }{ q(\theta^p) } \right).

The proposal must have tails at least as heavy as the posterior. If qq misses important posterior regions, the chain can get stuck because proposals into those regions are too rare.

Many models have some easy full conditionals and some hard ones. We do not need to choose between pure Gibbs and pure Metropolis—Hastings.

Within each Gibbs cycle:

  1. For easy full conditionals, draw directly.
  2. For hard full conditionals, use a Metropolis—Hastings update targeting that conditional distribution.

This is called Metropolis—Hastings within Gibbs.

If the proposal for a block is exactly its full conditional,

q(θjθj,y)=p(θjθj,y),q(\theta_j \mid \theta_{-j},y)=p(\theta_j \mid \theta_{-j},y),

then the Metropolis—Hastings acceptance probability is 1. A Gibbs update is therefore an always-accepted Metropolis—Hastings move.

For independent draws,

Var(θˉ)=σ2N.\mathrm{Var}(\bar\theta) = \frac{\sigma^2}{N}.

For MCMC draws,

Var(θˉ)=σ2N(1+2k=1ρk),\mathrm{Var}(\bar\theta) = \frac{\sigma^2}{N} \left( 1+2\sum_{k=1}^{\infty}\rho_k \right),

where

ρk=Corr(θ(i),θ(i+k)).\rho_k = \mathrm{Corr}(\theta^{(i)},\theta^{(i+k)}).

The term

IF=1+2k=1ρk\mathrm{IF} = 1+2\sum_{k=1}^{\infty}\rho_k

is the inefficiency factor.

The effective sample size is

ESS=NIF.\mathrm{ESS} = \frac{N}{\mathrm{IF}}.

If autocorrelations are high, many MCMC draws contain the information of far fewer independent draws. A long chain is not automatically a good chain.

Early draws may still reflect the starting value. These draws are often discarded as burn-in or warm-up.

The purpose is not to improve the stationary distribution. It is to avoid using draws from the transient period before the chain has reached the typical posterior region.

A trace plot shows sampled values against iteration number.

Useful trace plots should show chains moving around a stable region without visible drift. A chain that slowly trends upward, gets stuck, or explores different regions in different runs needs attention.

ESS estimates how many independent draws would give roughly the same Monte Carlo precision as the dependent MCMC output.

High ESS is especially important for posterior means, quantiles, and tail probabilities.

The Gelman—Rubin statistic, usually written as R^\hat R, compares variation within chains to variation between chains.

A value close to 1 suggests that multiple chains are mixing similarly. Values clearly above 1 suggest that chains have not yet settled into the same target distribution.

Thinning keeps every hhth draw.

It reduces storage and visible autocorrelation, but it also throws away draws. For estimating posterior summaries, running longer chains is often more useful than thinning unless storage or workflow constraints matter.

Suppose a random walk Metropolis chain has an acceptance rate of 95%. This may sound good, but it often means the proposal steps are too small. The chain is accepting many tiny moves and exploring slowly.

Suppose another chain has an acceptance rate of 1%. It is proposing moves that are usually too far away or in low posterior density regions. The chain repeats the same values too often.

The goal is not maximum acceptance. The goal is efficient exploration.

MCMC constructs a Markov chain whose stationary distribution is the target posterior. Random walk Metropolis proposes local symmetric moves and accepts them according to a posterior density ratio. Metropolis—Hastings generalizes this idea to asymmetric proposals by adding a proposal correction. Independence samplers, hybrid MH-within-Gibbs algorithms, and diagnostic tools all follow the same logic: valid simulation is not enough; the chain must also explore efficiently.

  1. What does it mean for a distribution to be stationary for a Markov chain?
  2. Why does Metropolis—Hastings not require the posterior normalizing constant?
  3. What does the proposal ratio correct for in the Metropolis—Hastings acceptance probability?
  4. Why can a very high random walk Metropolis acceptance rate indicate poor exploration?
  5. What is effective sample size, and why is it smaller than NN for autocorrelated chains?