MCMC and Metropolis–Hastings
Source: Lecture 8, MCMC and Metropolis–Hastings (BDA Ch. 11).
The Thread of This Chapter
The previous chapter used simulation to approximate posterior expectations. Gibbs sampling worked when each full conditional distribution was easy to sample from.
This chapter handles a more general problem:
How can we sample from a posterior distribution when direct sampling is not available?
The answer is Markov chain Monte Carlo, or MCMC.
The idea is not to draw independent samples immediately. Instead, we build a Markov chain whose long-run distribution is the posterior distribution we want. After the chain has run long enough, its states can be used as dependent posterior draws.
Markov Chains
What Is a Markov Chain?
A Markov chain is a sequence of random states
$$X_0, X_1, X_2, \ldots$$
where the next state depends on the current state, but not on the earlier history:
$$\Pr(X_{t+1} = j \mid X_t = i, X_{t-1}, \ldots, X_0) = \Pr(X_{t+1} = j \mid X_t = i).$$
For a finite state space
$$\mathcal{X} = \{1, 2, \ldots, K\},$$
write the one-step transition probabilities as
$$P_{ij} = \Pr(X_{t+1} = j \mid X_t = i).$$
The transition matrix $P = (P_{ij})$ contains these probabilities, and each row sums to 1.
Multiple Steps
The $n$-step transition matrix is the $n$th power of the one-step transition matrix $P$:
$$P^{(n)} = P^n.$$
Its $(i, j)$ entry gives the probability of being in state $j$ after $n$ steps, starting from $i$.
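As a minimal sketch of the multi-step rule, the snippet below raises a small transition matrix to a power using plain Python lists. The two-state matrix `P` and the helper names `mat_mul` and `n_step` are hypothetical, chosen only for illustration.

```python
def mat_mul(A, B):
    """Multiply two square matrices stored as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def n_step(P, n):
    """Return the n-step transition matrix P^n for n >= 1."""
    result = P
    for _ in range(n - 1):
        result = mat_mul(result, P)
    return result


# Hypothetical two-state chain; each row sums to 1.
P = [[0.9, 0.1],
     [0.5, 0.5]]

P2 = n_step(P, 2)
# P2[0][1]: probability of being in the second state after 2 steps,
# starting from the first state (0.9 * 0.1 + 0.1 * 0.5).
print(P2)
```

Each row of `P2` still sums to 1, because a product of transition matrices is again a transition matrix.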
Stationary Distributions
Purpose
For MCMC, we want the chain to settle into the target distribution. That long-run distribution is called a stationary distribution.
Definition
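A distribution
$$\pi = (\pi_1, \ldots, \pi_K),$$
written as a row vector, is stationary for the transition matrix $P$ if
$$\pi P = \pi.$$
This means that if $X_t$ has distribution $\pi$, then $X_{t+1}$ also has distribution $\pi$.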
When Does the Chain Converge?
A finite-state chain has a unique limiting distribution under standard regularity conditions:
- Irreducible: every state can eventually be reached from every other state.
- Aperiodic: the chain is not trapped in a fixed cycle.
- Positive recurrent: the chain returns to each state in finite expected time (automatic for an irreducible finite-state chain).
Under these conditions, the distribution of $X_t$ approaches the stationary distribution $\pi$ as $t$ grows, regardless of the starting state.
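The convergence claim can be checked numerically on a toy chain. This sketch, assuming a hypothetical two-state transition matrix, pushes a starting distribution through the chain many times and then confirms that one more step leaves it unchanged. The helper name `step` is illustrative.

```python
def step(dist, P):
    """Advance a distribution one step: returns the row vector dist times P."""
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]


# Hypothetical irreducible, aperiodic two-state chain.
P = [[0.9, 0.1],
     [0.5, 0.5]]

dist = [0.0, 1.0]          # start with all mass on the second state
for _ in range(200):
    dist = step(dist, P)

next_dist = step(dist, P)  # one more step should change nothing
print(dist, next_dist)
```

For this matrix the balance condition gives a stationary distribution of (5/6, 1/6), and starting from `[1.0, 0.0]` instead leads to the same limit.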
The Basic MCMC Idea
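The target distribution is usually a posterior:
$$p(\theta \mid y) \propto p(y \mid \theta)\, p(\theta).$$
MCMC constructs a Markov chain
$$\theta^{(0)}, \theta^{(1)}, \theta^{(2)}, \ldots$$
whose stationary distribution is $p(\theta \mid y)$.
The practical workflow is:
- Start at some value $\theta^{(0)}$.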
- Repeatedly propose a move.
- Accept or reject the move using a rule that preserves the target distribution.
- Use the later draws to approximate posterior summaries.
The most important accept/reject rule is Metropolis–Hastings.
Rejection Sampling as a Contrast
Before MCMC, it helps to remember rejection sampling.
Suppose we want to sample from a target density proportional to $f(\theta)$ and can find an envelope density $g(\theta)$ and constant $M$ such that
$$f(\theta) \le M\, g(\theta) \quad \text{for all } \theta.$$
The rejection sampler:
- draws $\theta^* \sim g$;
- accepts $\theta^*$ with probability $f(\theta^*) / \big(M\, g(\theta^*)\big)$;
- repeats until enough accepted draws are collected.
Accepted draws are independent. But in high dimensions, good envelope distributions are hard to find and acceptance rates can be very low.
MCMC avoids the global envelope requirement by making local moves.
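For contrast with the MCMC code style, here is a minimal rejection sampler sketch. It assumes a hypothetical one-dimensional target, the unnormalized Beta(2, 4) kernel on [0, 1], with a Uniform(0, 1) envelope. The constant `M = 0.11` is an assumption chosen to sit just above the maximum of the kernel (about 0.105).

```python
import random


def f(x):
    """Unnormalized target: kernel of a Beta(2, 4) density on [0, 1]."""
    return x * (1.0 - x) ** 3


M = 0.11  # envelope constant: f(x) <= M * g(x), where g is the Uniform(0, 1) density


def rejection_sample(n_draws, rng):
    """Collect n_draws accepted values; also report how many proposals were needed."""
    draws, n_proposed = [], 0
    while len(draws) < n_draws:
        x = rng.random()             # draw from the envelope g
        n_proposed += 1
        if rng.random() < f(x) / M:  # accept with probability f(x) / (M * g(x))
            draws.append(x)
    return draws, n_proposed


rng = random.Random(0)
draws, n_proposed = rejection_sample(5000, rng)
mean = sum(draws) / len(draws)
accept_rate = len(draws) / n_proposed
print(f"mean={mean:.3f}, acceptance rate={accept_rate:.2f}")
```

The accepted draws are independent, and the sample mean should be near the Beta(2, 4) mean of 1/3. In one dimension a flat envelope wastes roughly half the proposals here; the waste grows quickly with dimension.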
Random Walk Metropolis
Purpose
Random walk Metropolis is the simplest MCMC algorithm for continuous parameters. It proposes a new value near the current value.
Step 1: Start the Chain
Choose an initial value $\theta^{(0)}$.
Step 2: Propose a Move
At iteration $t$, propose
$$\theta^* = \theta^{(t-1)} + \epsilon, \qquad \epsilon \sim N(0, c^2 \Sigma).$$
Here $\Sigma$ controls the direction and shape of proposals, while $c$ controls their size.
Step 3: Compare Posterior Density
The acceptance probability is
$$\alpha = \min\left\{1, \frac{p(\theta^* \mid y)}{p(\theta^{(t-1)} \mid y)}\right\}.$$
If the proposal has higher posterior density, the ratio is above 1 and the proposal is always accepted. If it has lower posterior density, it may still be accepted, with probability equal to the ratio.
Step 4: Accept or Stay
Draw $u \sim \text{Uniform}(0, 1)$.
If $u \le \alpha$, set
$$\theta^{(t)} = \theta^*.$$
Otherwise set
$$\theta^{(t)} = \theta^{(t-1)}.$$
A rejection does not remove an iteration from the chain. The current value is recorded again, so the chain contains repeated values.
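The four steps can be sketched in a few lines. This is a one-dimensional illustration under stated assumptions, not a general implementation: the target `log_target` is a hypothetical N(2, 1) kernel standing in for a real posterior, the proposal covariance reduces to a single `scale`, and the work is done on the log scale to avoid numerical underflow.

```python
import math
import random


def log_target(theta):
    """Unnormalized log posterior; a N(2, 1) kernel stands in for a real model."""
    return -0.5 * (theta - 2.0) ** 2


def random_walk_metropolis(log_target, theta0, scale, n_iter, rng):
    """Return the chain (repeats included) and the observed acceptance rate."""
    chain = [theta0]                       # Step 1: start the chain
    n_accept = 0
    for _ in range(n_iter):
        current = chain[-1]
        # Step 2: symmetric proposal centred on the current value.
        proposal = current + rng.gauss(0.0, scale)
        # Step 3: log of the posterior density ratio.
        log_ratio = log_target(proposal) - log_target(current)
        # Step 4: accept with probability min(1, ratio), else repeat the current value.
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            chain.append(proposal)
            n_accept += 1
        else:
            chain.append(current)
    return chain, n_accept / n_iter


rng = random.Random(42)
chain, accept_rate = random_walk_metropolis(log_target, theta0=0.0, scale=2.4,
                                            n_iter=20000, rng=rng)
kept = chain[1000:]                        # drop early draws as warm-up
posterior_mean = sum(kept) / len(kept)
print(f"acceptance rate={accept_rate:.2f}, posterior mean={posterior_mean:.2f}")
```

The chain has one entry per iteration plus the starting value, so rejected proposals show up as repeated values rather than gaps.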
Why the Normalizing Constant Is Not Needed
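Bayesian posteriors often have the form
$$p(\theta \mid y) = \frac{p(y \mid \theta)\, p(\theta)}{p(y)}.$$
The denominator $p(y)$ can be hard to compute. Random walk Metropolis only needs a ratio:
$$\frac{p(\theta^* \mid y)}{p(\theta^{(t-1)} \mid y)} = \frac{p(y \mid \theta^*)\, p(\theta^*)}{p(y \mid \theta^{(t-1)})\, p(\theta^{(t-1)})}.$$
The marginal likelihood $p(y)$ cancels. This is one reason MCMC is so useful: we can sample from a distribution known only up to proportionality.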
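A small numerical check makes the cancellation concrete. The sketch assumes a hypothetical binomial model with 7 successes in 10 trials and a uniform prior, so the normalized posterior is Beta(8, 4) and its constant can be computed exactly for comparison.

```python
import math


def unnormalized_posterior(theta, y=7, n=10):
    """Likelihood times a uniform prior, without the marginal likelihood."""
    return theta ** y * (1.0 - theta) ** (n - y)


def normalized_posterior(theta, y=7, n=10):
    """Beta(y + 1, n - y + 1) density, including its normalizing constant."""
    const = math.gamma(n + 2) / (math.gamma(y + 1) * math.gamma(n - y + 1))
    return const * unnormalized_posterior(theta, y, n)


current, proposal = 0.5, 0.6
ratio_unnormalized = unnormalized_posterior(proposal) / unnormalized_posterior(current)
ratio_normalized = normalized_posterior(proposal) / normalized_posterior(current)
print(ratio_unnormalized, ratio_normalized)
```

Both ratios agree up to floating-point error, so the acceptance rule never needs the constant.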
Choosing the Random Walk Proposal
Proposal Scale
If the proposal steps are too small, almost every proposal is accepted, but the chain moves slowly.
If the proposal steps are too large, most proposals are rejected, and the chain stays in place too often.
For many random walk Metropolis problems, an average acceptance rate of roughly 25 to 30% is a useful target in moderate or high dimensions. In one dimension the efficient rate is higher, around 44%.
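The trade-off can be seen by running the same sampler at three proposal scales. This is a sketch under stated assumptions: the target is a hypothetical standard normal kernel, the scales 0.05, 2.4, and 50 are illustrative, and mean squared jump distance is used as a rough stand-in for exploration speed.

```python
import math
import random


def log_target(theta):
    """Unnormalized log density of a standard normal stand-in target."""
    return -0.5 * theta * theta


def run_chain(scale, n_iter, rng):
    """Run 1-D random walk Metropolis; return (acceptance rate, mean squared jump)."""
    theta = 0.0
    n_accept = 0
    total_sq_jump = 0.0
    for _ in range(n_iter):
        proposal = theta + rng.gauss(0.0, scale)
        log_ratio = log_target(proposal) - log_target(theta)
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            total_sq_jump += (proposal - theta) ** 2
            theta = proposal
            n_accept += 1
    return n_accept / n_iter, total_sq_jump / n_iter


rng = random.Random(1)
results = {scale: run_chain(scale, 20000, rng) for scale in (0.05, 2.4, 50.0)}
for scale, (rate, jump) in results.items():
    print(f"scale={scale:>5}: acceptance={rate:.2f}, mean squared jump={jump:.4f}")
```

The smallest scale accepts almost everything but barely moves, the largest scale is rejected almost every time, and the middle scale covers the most ground per iteration.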
Proposal Shape
Common choices for the proposal covariance $\Sigma$ include:
- $\Sigma = I$, giving equal proposal scale in all directions;
- $\Sigma = \left[-\nabla^2 \log p(\hat{\theta} \mid y)\right]^{-1}$, using a curvature estimate near the posterior mode $\hat{\theta}$;
- an adaptive estimate based on an initial run.
A good proposal is easy to sample from, easy to evaluate in the acceptance rule, and large enough to explore the posterior efficiently.
Metropolis–Hastings
Why Generalize?
Random walk Metropolis uses a symmetric proposal:
$$q(\theta^* \mid \theta^{(t-1)}) = q(\theta^{(t-1)} \mid \theta^*).$$
Metropolis–Hastings allows asymmetric proposals. This is useful when proposals are drawn from an approximation, a conditional distribution, or another distribution that does not move equally in both directions.
The Algorithm
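At iteration $t$:
- Propose $\theta^* \sim q(\cdot \mid \theta^{(t-1)})$.
- Compute
$$\alpha = \min\left\{1, \frac{p(\theta^* \mid y)}{p(\theta^{(t-1)} \mid y)} \cdot \frac{q(\theta^{(t-1)} \mid \theta^*)}{q(\theta^* \mid \theta^{(t-1)})}\right\}.$$
- Accept the proposal with probability $\alpha$; otherwise stay at the current value.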
How to Read the Formula
The first ratio compares target density:
$$\frac{p(\theta^* \mid y)}{p(\theta^{(t-1)} \mid y)}.$$
The second ratio corrects for proposal asymmetry:
$$\frac{q(\theta^{(t-1)} \mid \theta^*)}{q(\theta^* \mid \theta^{(t-1)})}.$$
If it is easier to propose moves from the old state to the new state than from the new state back to the old state, this ratio is below 1 and lowers the acceptance probability to offset that imbalance.
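A minimal sketch with a genuinely asymmetric proposal: a multiplicative log-normal random walk on a positive parameter. The target is a hypothetical Gamma(shape 3, rate 1) kernel. For this proposal the correction ratio, reverse density over forward density, simplifies to proposal divided by current value, which is what `log_proposal_ratio` computes.

```python
import math
import random


def log_target(theta):
    """Unnormalized log density of a Gamma(shape=3, rate=1) target on theta > 0."""
    return 2.0 * math.log(theta) - theta


def mh_lognormal(theta0, sigma, n_iter, rng):
    """Metropolis-Hastings with proposal theta* = theta * exp(sigma * z), z ~ N(0, 1)."""
    chain = [theta0]
    for _ in range(n_iter):
        current = chain[-1]
        proposal = current * math.exp(rng.gauss(0.0, sigma))
        # First ratio: target density at the proposal versus the current value.
        log_target_ratio = log_target(proposal) - log_target(current)
        # Second ratio: q(current | proposal) / q(proposal | current) = proposal / current.
        log_proposal_ratio = math.log(proposal) - math.log(current)
        log_alpha = log_target_ratio + log_proposal_ratio
        if log_alpha >= 0 or rng.random() < math.exp(log_alpha):
            chain.append(proposal)
        else:
            chain.append(current)
    return chain


rng = random.Random(5)
chain = mh_lognormal(theta0=1.0, sigma=1.0, n_iter=30000, rng=rng)
kept = chain[1000:]
posterior_mean = sum(kept) / len(kept)
print(f"posterior mean={posterior_mean:.2f}")
```

The sample mean should be close to the Gamma(3, 1) mean of 3. Dropping `log_proposal_ratio` would make the chain target a density proportional to $\theta e^{-\theta}$ instead, whose mean is 2, so the correction is not optional here.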
The Independence Sampler
Purpose
The independence sampler is a Metropolis–Hastings method where the proposal does not depend on the current state:
$$q(\theta^* \mid \theta^{(t-1)}) = g(\theta^*).$$
For example, $g$ might be a multivariate approximation (such as a multivariate $t$) centered at a posterior mode.
Acceptance Probability
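The acceptance probability becomes
$$\alpha = \min\left\{1, \frac{p(\theta^* \mid y)\, g(\theta^{(t-1)})}{p(\theta^{(t-1)} \mid y)\, g(\theta^*)}\right\}.$$
Main Warning
The proposal $g$ must have tails at least as heavy as the posterior. If $g$ misses important posterior regions, the chain can get stuck because proposals into those regions are too rare. Once the chain does land in such a region, the ratio $p/g$ there is very large, so nearly every later proposal is rejected.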
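A minimal independence sampler sketch, assuming a hypothetical standard normal target and a fixed proposal $g = N(0, 2^2)$. The proposal is deliberately wider than the target so that its tails are heavier. Normalizing constants of both densities are dropped because they cancel in the ratio.

```python
import math
import random

PROPOSAL_SD = 2.0  # wider than the target, so the proposal has heavier tails


def log_target(theta):
    """Unnormalized log density of the target; a N(0, 1) kernel as a stand-in."""
    return -0.5 * theta * theta


def log_proposal(theta):
    """Log density of the fixed proposal g = N(0, PROPOSAL_SD^2), up to a constant."""
    return -0.5 * (theta / PROPOSAL_SD) ** 2


def independence_sampler(n_iter, rng):
    """Return the chain and the observed acceptance rate."""
    theta = 0.0
    chain = []
    n_accept = 0
    for _ in range(n_iter):
        proposal = rng.gauss(0.0, PROPOSAL_SD)  # drawn without reference to theta
        # log of [p(proposal) g(current)] / [p(current) g(proposal)]
        log_alpha = (log_target(proposal) - log_target(theta)
                     + log_proposal(theta) - log_proposal(proposal))
        if log_alpha >= 0 or rng.random() < math.exp(log_alpha):
            theta = proposal
            n_accept += 1
        chain.append(theta)
    return chain, n_accept / n_iter


rng = random.Random(11)
chain, accept_rate = independence_sampler(20000, rng)
mean = sum(chain) / len(chain)
variance = sum((x - mean) ** 2 for x in chain) / len(chain)
print(f"acceptance={accept_rate:.2f}, mean={mean:.2f}, variance={variance:.2f}")
```

Setting `PROPOSAL_SD` well below 1 gives the proposal lighter tails than the target and is a simple way to see the sticking behaviour described in the warning.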
Metropolis–Hastings Within Gibbs
Purpose
Many models have some easy full conditionals and some hard ones. We do not need to choose between pure Gibbs and pure Metropolis–Hastings.
The Hybrid Algorithm
Within each Gibbs cycle:
- For easy full conditionals, draw directly.
- For hard full conditionals, use a Metropolis–Hastings update targeting that conditional distribution.
This is called Metropolis–Hastings within Gibbs.
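The hybrid cycle can be sketched on a toy normal model with unknown mean `mu` and log standard deviation `phi`. Everything here is an assumption for illustration: the data `y` are made up, the prior is flat on `(mu, phi)`, and the conditional for `phi` is treated as the "hard" one even though it has a closed form in this simple model.

```python
import math
import random

# Hypothetical data, used only to exercise the algorithm.
y = [4.1, 5.3, 3.8, 6.0, 5.1, 4.7, 5.6, 4.4]
n = len(y)
ybar = sum(y) / n


def log_cond_phi(phi, mu):
    """Log full conditional of phi = log(sigma) given mu, up to a constant."""
    ss = sum((yi - mu) ** 2 for yi in y)
    return -n * phi - ss / (2.0 * math.exp(2.0 * phi))


def mh_within_gibbs(n_iter, step, rng):
    """Alternate a direct Gibbs draw for mu with a random walk MH update for phi."""
    mu, phi = ybar, 0.0
    draws = []
    n_accept = 0
    for _ in range(n_iter):
        # Easy conditional: mu | phi, y is N(ybar, sigma^2 / n), so draw it directly.
        sigma = math.exp(phi)
        mu = rng.gauss(ybar, sigma / math.sqrt(n))
        # Hard conditional: one MH step targeting p(phi | mu, y).
        phi_prop = phi + rng.gauss(0.0, step)
        log_ratio = log_cond_phi(phi_prop, mu) - log_cond_phi(phi, mu)
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            phi = phi_prop
            n_accept += 1
        draws.append((mu, phi))
    return draws, n_accept / n_iter


rng = random.Random(2)
draws, accept_rate = mh_within_gibbs(n_iter=20000, step=0.5, rng=rng)
kept = draws[1000:]
mean_mu = sum(d[0] for d in kept) / len(kept)
print(f"MH acceptance rate={accept_rate:.2f}, posterior mean of mu={mean_mu:.2f}")
```

Only the `phi` update has an acceptance rate to tune; the `mu` draw is always kept.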
Gibbs as a Special Case
If the proposal for a block $\theta_j$ is exactly its full conditional,
$$q(\theta_j^* \mid \theta^{(t-1)}) = p(\theta_j^* \mid \theta_{-j}^{(t-1)}, y),$$
then the Metropolis–Hastings acceptance probability is 1. A Gibbs update is therefore an always-accepted Metropolis–Hastings move.
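The cancellation behind this claim can be written out in one short derivation. The notation is assumed for this sketch: $\theta_j$ is the current value of the block, $\theta_j^*$ is the proposed value, $\theta_{-j}$ denotes the remaining components (held fixed during the update), and $r$ is the full Metropolis–Hastings ratio inside the minimum.

```latex
\begin{aligned}
r &= \frac{p(\theta_j^*, \theta_{-j} \mid y)}{p(\theta_j, \theta_{-j} \mid y)}
     \cdot
     \frac{p(\theta_j \mid \theta_{-j}, y)}{p(\theta_j^* \mid \theta_{-j}, y)} \\
  &= \frac{p(\theta_j^* \mid \theta_{-j}, y)\, p(\theta_{-j} \mid y)}
          {p(\theta_j \mid \theta_{-j}, y)\, p(\theta_{-j} \mid y)}
     \cdot
     \frac{p(\theta_j \mid \theta_{-j}, y)}{p(\theta_j^* \mid \theta_{-j}, y)} \\
  &= 1 .
\end{aligned}
```

The second line factors each joint density into a full conditional times the marginal of $\theta_{-j}$. Every factor then appears once in the numerator and once in the denominator.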
Efficiency of MCMC
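Independent Draws
For $T$ independent draws from a posterior with variance $\sigma^2$, the Monte Carlo variance of the sample mean $\bar{\theta}$ is
$$\operatorname{Var}(\bar{\theta}) = \frac{\sigma^2}{T}.$$
Dependent Draws
For $T$ MCMC draws,
$$\operatorname{Var}(\bar{\theta}) \approx \frac{\sigma^2}{T}\left(1 + 2\sum_{k=1}^{\infty} \rho_k\right),$$
where
$$\rho_k = \operatorname{Corr}\left(\theta^{(t)}, \theta^{(t+k)}\right)$$
is the lag-$k$ autocorrelation of the chain. The term
$$\tau = 1 + 2\sum_{k=1}^{\infty} \rho_k$$
is the inefficiency factor.
The effective sample size is
$$\text{ESS} = \frac{T}{\tau}.$$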
How to Read It
If autocorrelations are high, many MCMC draws contain the information of far fewer independent draws. A long chain is not automatically a good chain.
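A rough ESS estimate can be sketched directly from the definition: the number of draws divided by one plus twice the summed autocorrelations. The truncation rule used here, stopping at the first non-positive sample autocorrelation, is a simplifying assumption; production software uses more careful estimators. An AR(1) series with coefficient 0.9 stands in for autocorrelated MCMC output.

```python
import random


def autocorr(x, lag, mean, var):
    """Sample autocorrelation of x at the given lag."""
    n = len(x)
    cov = sum((x[t] - mean) * (x[t + lag] - mean) for t in range(n - lag)) / n
    return cov / var


def effective_sample_size(x, max_lag=500):
    """Estimate ESS = n / (1 + 2 * sum of autocorrelations), with simple truncation."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n
    rho_sum = 0.0
    for lag in range(1, min(max_lag, n - 1) + 1):
        rho = autocorr(x, lag, mean, var)
        if rho <= 0:  # assumed truncation rule: stop once the estimate is mostly noise
            break
        rho_sum += rho
    return n / (1.0 + 2.0 * rho_sum)


rng = random.Random(3)
n_draws = 20000

# Stand-in for a sticky chain: AR(1) with coefficient 0.9.
ar_chain = [0.0]
for _ in range(n_draws - 1):
    ar_chain.append(0.9 * ar_chain[-1] + rng.gauss(0.0, 1.0))

# Independent draws for comparison.
iid_draws = [rng.gauss(0.0, 1.0) for _ in range(n_draws)]

ess_ar = effective_sample_size(ar_chain)
ess_iid = effective_sample_size(iid_draws)
print(f"AR(1) ESS={ess_ar:.0f}, independent ESS={ess_iid:.0f}, draws={n_draws}")
```

For an AR(1) process with coefficient 0.9 the theoretical inefficiency factor is (1 + 0.9) / (1 - 0.9) = 19, so 20000 correlated draws carry roughly the information of about a thousand independent ones.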
Burn-in, Warm-up, and Diagnostics
Burn-in or Warm-up
Early draws may still reflect the starting value. These draws are often discarded as burn-in or warm-up.
The purpose is not to improve the stationary distribution. It is to avoid using draws from the transient period before the chain has reached the typical posterior region.
Trace Plots
A trace plot shows sampled values against iteration number.
Useful trace plots should show chains moving around a stable region without visible drift. A chain that slowly trends upward, gets stuck, or explores different regions in different runs needs attention.
Effective Sample Size
ESS estimates how many independent draws would give roughly the same Monte Carlo precision as the dependent MCMC output.
High ESS is especially important for posterior means, quantiles, and tail probabilities.
Potential Scale Reduction
The Gelman–Rubin statistic, usually written as $\hat{R}$, compares variation within chains to variation between chains.
A value close to 1 suggests that multiple chains are mixing similarly. Values clearly above 1 suggest that chains have not yet settled into the same target distribution.
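The comparison of within-chain and between-chain variation can be sketched as follows. This is the basic, non-split form of the statistic for a single scalar quantity; current software typically uses split-chain, rank-normalized variants. The "chains" are independent normal draws used only as stand-ins to exercise the formula.

```python
import random


def potential_scale_reduction(chains):
    """Basic (non-split) Gelman-Rubin statistic for equal-length chains of one scalar."""
    m = len(chains)
    n = len(chains[0])
    means = [sum(c) / n for c in chains]
    grand_mean = sum(means) / m
    # B: between-chain variance, scaled by the chain length.
    B = n * sum((mu - grand_mean) ** 2 for mu in means) / (m - 1)
    # W: average of the within-chain sample variances.
    W = sum(sum((x - mu) ** 2 for x in c) / (n - 1)
            for c, mu in zip(chains, means)) / m
    # Pooled estimate of the posterior variance.
    var_plus = (n - 1) / n * W + B / n
    return (var_plus / W) ** 0.5


rng = random.Random(7)
# Four stand-in chains that agree on the same distribution.
agree = [[rng.gauss(0.0, 1.0) for _ in range(1000)] for _ in range(4)]
# Four stand-in chains stuck in two different regions.
disagree = [[rng.gauss(shift, 1.0) for _ in range(1000)]
            for shift in (0.0, 0.0, 3.0, 3.0)]

rhat_agree = potential_scale_reduction(agree)
rhat_disagree = potential_scale_reduction(disagree)
print(f"agreeing chains: {rhat_agree:.3f}, disagreeing chains: {rhat_disagree:.3f}")
```

The statistic needs at least two chains, ideally started from dispersed values, because a single chain cannot reveal a region it never visited.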
Thinning
Thinning keeps every $m$th draw.
It reduces storage and visible autocorrelation, but it also throws away draws. For estimating posterior summaries, running longer chains is often more useful than thinning unless storage or workflow constraints matter.
A Small Reading Example
Suppose a random walk Metropolis chain has an acceptance rate of 95%. This may sound good, but it often means the proposal steps are too small. The chain is accepting many tiny moves and exploring slowly.
Suppose another chain has an acceptance rate of 1%. It is proposing moves that are usually too far away or in low posterior density regions. The chain repeats the same values too often.
The goal is not maximum acceptance. The goal is efficient exploration.
Chapter Summary
MCMC constructs a Markov chain whose stationary distribution is the target posterior. Random walk Metropolis proposes local symmetric moves and accepts them according to a posterior density ratio. Metropolis–Hastings generalizes this idea to asymmetric proposals by adding a proposal correction. Independence samplers, hybrid MH-within-Gibbs algorithms, and diagnostic tools all follow the same logic: valid simulation is not enough; the chain must also explore efficiently.
Check Your Understanding
- What does it mean for a distribution to be stationary for a Markov chain?
- Why does Metropolis–Hastings not require the posterior normalizing constant?
- What does the proposal ratio correct for in the Metropolis–Hastings acceptance probability?
- Why can a very high random walk Metropolis acceptance rate indicate poor exploration?
- What is effective sample size, and why is it smaller than the number of draws $T$ for positively autocorrelated chains?