MCMC and Metropolis--Hastings

Source: Lecture 8 — MCMC and Metropolis—Hastings (BDA Ch. 11).

The Thread of This Chapter

The previous chapter used simulation to approximate posterior expectations. Gibbs sampling worked when each full conditional distribution was easy to sample from.

This chapter handles a more general problem:

How can we sample from a posterior distribution when direct sampling is not available?

The answer is Markov chain Monte Carlo, or MCMC.

The idea is not to draw independent samples immediately. Instead, we build a Markov chain whose long-run distribution is the posterior distribution we want. After the chain has run long enough, its states can be used as dependent posterior draws.

Markov Chains

What Is a Markov Chain?

A Markov chain is a sequence of random states

$X_1,X_2,\ldots$

where the next state depends on the current state, but not on the earlier history:

\Pr(X_{t+1}=s_j \mid X_t=s_i,X_{t-1},\ldots,X_1) = \Pr(X_{t+1}=s_j \mid X_t=s_i).

For a finite state space

$S=\{s_1,\ldots,s_k\},$

write the one-step transition probabilities as

p_{ij} = \Pr(X_{t+1}=s_j \mid X_t=s_i).

The transition matrix $P$ contains these probabilities, and each row sums to 1.

Multiple Steps

The $h$ -step transition matrix is

$P^{(h)}=P^h.$

Its $(i,j)$ entry gives the probability of being in state $s_j$ after $h$ steps, starting from $s_i$ .

Stationary Distributions

Purpose

For MCMC, we want the chain to settle into the target distribution. That long-run distribution is called a stationary distribution.

Definition

A distribution

$\pi=(\pi_1,\ldots,\pi_k)$

is stationary for $P$ if

\pi=\pi P.

This means that if $X_t$ has distribution $\pi$ , then $X_{t+1}$ also has distribution $\pi$ .

When Does the Chain Converge?

A finite-state chain has a unique useful limiting distribution under standard regularity conditions:

Irreducible: every state can eventually be reached from every other state.
Aperiodic: the chain is not trapped in a fixed cycle.
Positive recurrent: the chain returns to states in finite expected time.

Under these conditions, the distribution of $X_t$ approaches $\pi$ as $t$ grows.

The Basic MCMC Idea

The target distribution is usually a posterior:

$p(\theta \mid y).$

MCMC constructs a Markov chain

$\theta^{(0)},\theta^{(1)},\theta^{(2)},\ldots$

whose stationary distribution is $p(\theta \mid y)$ .

The practical workflow is:

Start at some value $\theta^{(0)}$ .
Repeatedly propose a move.
Accept or reject the move using a rule that preserves the target distribution.
Use the later draws to approximate posterior summaries.

The most important accept/reject rule is Metropolis—Hastings.

Rejection Sampling as a Contrast

Before MCMC, it helps to remember rejection sampling.

Suppose we want to sample from a target density proportional to $p(x)$ and can find an envelope density $g(x)$ and constant $M$ such that

$p(x)\le M g(x).$

The rejection sampler:

draws $x \sim g$ ;
accepts with probability

$\frac{p(x)}{M g(x)};$
repeats until enough accepted draws are collected.

Accepted draws are independent. But in high dimensions, good envelope distributions are hard to find and acceptance rates can be very low.

MCMC avoids the global envelope requirement by making local moves.

Random Walk Metropolis

Purpose

Random walk Metropolis is the simplest MCMC algorithm for continuous parameters. It proposes a new value near the current value.

Step 1: Start the Chain

Choose an initial value

$\theta^{(0)}.$

Step 2: Propose a Move

At iteration $i$ , propose

\theta^p \mid \theta^{(i-1)} \sim N(\theta^{(i-1)},c\Sigma).

Here $\Sigma$ controls the direction and shape of proposals, while $c$ controls their size.

Step 3: Compare Posterior Density

The acceptance probability is

\alpha = \min \left( 1,\, \frac{p(\theta^p \mid y)} {p(\theta^{(i-1)} \mid y)} \right).

If the proposal has higher posterior density, the ratio is above 1 and the proposal is accepted. If it has lower posterior density, it may still be accepted, but with smaller probability.

Step 4: Accept or Stay

Draw $u \sim \mathrm{Uniform}(0,1)$ .

If $u\le \alpha$ , set

$\theta^{(i)}=\theta^p.$

Otherwise set

$\theta^{(i)}=\theta^{(i-1)}.$

Rejected proposals are not discarded from the chain. They produce repeated values.

Why the Normalizing Constant Is Not Needed

Bayesian posteriors often have the form

p(\theta \mid y) = \frac{p(y \mid \theta)p(\theta)}{p(y)}.

The denominator $p(y)$ can be hard to compute. Random walk Metropolis only needs a ratio:

\frac{p(\theta^p \mid y)} {p(\theta^{(i-1)} \mid y)} = \frac{ p(y \mid \theta^p)p(\theta^p) }{ p(y \mid \theta^{(i-1)})p(\theta^{(i-1)}) }.

The marginal likelihood $p(y)$ cancels. This is one reason MCMC is so useful: we can sample from a distribution known only up to proportionality.

Choosing the Random Walk Proposal

Proposal Scale

If the proposal steps are too small, almost every proposal is accepted, but the chain moves slowly.

If the proposal steps are too large, most proposals are rejected, and the chain stays in place too often.

For many random walk Metropolis problems, an average acceptance rate around 25—30% is a useful target, especially in moderate or high dimensions.

Proposal Shape

Common choices include:

$\Sigma=I$ , giving equal proposal scale in all directions;
$\Sigma=J_{\hat\theta}^{-1}$ , using a curvature estimate near the posterior mode;
an adaptive estimate based on an initial run.

A good proposal is easy to sample from, easy to evaluate in the acceptance rule, and large enough to explore the posterior efficiently.

Metropolis—Hastings

Why Generalize?

Random walk Metropolis uses a symmetric proposal:

$q(a \mid b)=q(b \mid a).$

Metropolis—Hastings allows asymmetric proposals. This is useful when proposals are drawn from an approximation, a conditional distribution, or another distribution that does not move equally in both directions.

The Algorithm

At iteration $i$ :

Propose

$\theta^p \sim q(\cdot \mid \theta^{(i-1)}).$
Compute
$\alpha = \min \left( 1,\, \frac{ p(y \mid \theta^p)p(\theta^p) }{ p(y \mid \theta^{(i-1)})p(\theta^{(i-1)}) } \frac{ q(\theta^{(i-1)} \mid \theta^p) }{ q(\theta^p \mid \theta^{(i-1)}) } \right).$
Accept the proposal with probability $\alpha$ ; otherwise stay at the current value.

How to Read the Formula

The first ratio compares target density:

\frac{ p(y \mid \theta^p)p(\theta^p) }{ p(y \mid \theta^{(i-1)})p(\theta^{(i-1)}) }.

The second ratio corrects for proposal asymmetry:

\frac{ q(\theta^{(i-1)} \mid \theta^p) }{ q(\theta^p \mid \theta^{(i-1)}) }.

If it is easier to propose moves from the old state to the new state than from the new state back to the old state, the correction accounts for that imbalance.

The Independence Sampler

Purpose

The independence sampler is a Metropolis—Hastings method where the proposal does not depend on the current state:

$q(\theta^p \mid \theta^{(i-1)})=q(\theta^p).$

For example, $q$ might be a multivariate $t$ approximation centered at a posterior mode.

Acceptance Probability

The acceptance probability becomes

\alpha = \min \left( 1,\, \frac{ p(y \mid \theta^p)p(\theta^p) }{ p(y \mid \theta^{(i-1)})p(\theta^{(i-1)}) } \frac{ q(\theta^{(i-1)}) }{ q(\theta^p) } \right).

Main Warning

The proposal must have tails at least as heavy as the posterior. If $q$ misses important posterior regions, the chain can get stuck because proposals into those regions are too rare.

Metropolis—Hastings Within Gibbs

Purpose

Many models have some easy full conditionals and some hard ones. We do not need to choose between pure Gibbs and pure Metropolis—Hastings.

The Hybrid Algorithm

Within each Gibbs cycle:

For easy full conditionals, draw directly.
For hard full conditionals, use a Metropolis—Hastings update targeting that conditional distribution.

This is called Metropolis—Hastings within Gibbs.

Gibbs as a Special Case

If the proposal for a block is exactly its full conditional,

$q(\theta_j \mid \theta_{-j},y)=p(\theta_j \mid \theta_{-j},y),$

then the Metropolis—Hastings acceptance probability is 1. A Gibbs update is therefore an always-accepted Metropolis—Hastings move.

Efficiency of MCMC

Independent Draws

For independent draws,

\mathrm{Var}(\bar\theta) = \frac{\sigma^2}{N}.

Dependent Draws

For MCMC draws,

\mathrm{Var}(\bar\theta) = \frac{\sigma^2}{N} \left( 1+2\sum_{k=1}^{\infty}\rho_k \right),

where

\rho_k = \mathrm{Corr}(\theta^{(i)},\theta^{(i+k)}).

The term

\mathrm{IF} = 1+2\sum_{k=1}^{\infty}\rho_k

is the inefficiency factor.

The effective sample size is

\mathrm{ESS} = \frac{N}{\mathrm{IF}}.

How to Read It

If autocorrelations are high, many MCMC draws contain the information of far fewer independent draws. A long chain is not automatically a good chain.

Burn-in, Warm-up, and Diagnostics

Burn-in or Warm-up

Early draws may still reflect the starting value. These draws are often discarded as burn-in or warm-up.

The purpose is not to improve the stationary distribution. It is to avoid using draws from the transient period before the chain has reached the typical posterior region.

Trace Plots

A trace plot shows sampled values against iteration number.

Useful trace plots should show chains moving around a stable region without visible drift. A chain that slowly trends upward, gets stuck, or explores different regions in different runs needs attention.

Effective Sample Size

ESS estimates how many independent draws would give roughly the same Monte Carlo precision as the dependent MCMC output.

High ESS is especially important for posterior means, quantiles, and tail probabilities.

Potential Scale Reduction

The Gelman—Rubin statistic, usually written as $\hat R$ , compares variation within chains to variation between chains.

A value close to 1 suggests that multiple chains are mixing similarly. Values clearly above 1 suggest that chains have not yet settled into the same target distribution.

Thinning

Thinning keeps every $h$ th draw.

It reduces storage and visible autocorrelation, but it also throws away draws. For estimating posterior summaries, running longer chains is often more useful than thinning unless storage or workflow constraints matter.

A Small Reading Example

Suppose a random walk Metropolis chain has an acceptance rate of 95%. This may sound good, but it often means the proposal steps are too small. The chain is accepting many tiny moves and exploring slowly.

Suppose another chain has an acceptance rate of 1%. It is proposing moves that are usually too far away or in low posterior density regions. The chain repeats the same values too often.

The goal is not maximum acceptance. The goal is efficient exploration.

Chapter Summary

MCMC constructs a Markov chain whose stationary distribution is the target posterior. Random walk Metropolis proposes local symmetric moves and accepts them according to a posterior density ratio. Metropolis—Hastings generalizes this idea to asymmetric proposals by adding a proposal correction. Independence samplers, hybrid MH-within-Gibbs algorithms, and diagnostic tools all follow the same logic: valid simulation is not enough; the chain must also explore efficiently.

Check Your Understanding

What does it mean for a distribution to be stationary for a Markov chain?
Why does Metropolis—Hastings not require the posterior normalizing constant?
What does the proposal ratio correct for in the Metropolis—Hastings acceptance probability?
Why can a very high random walk Metropolis acceptance rate indicate poor exploration?
What is effective sample size, and why is it smaller than $N$ for autocorrelated chains?