Hamiltonian Monte Carlo and Stan
Source: Lecture 9 — HMC and Stan (BDA Ch. 12).
The Thread of This Chapter
Section titled “The Thread of This Chapter”Random walk Metropolis is general, but it can be slow in high dimensions. It proposes a local move without knowing much about the shape of the posterior. If the posterior is narrow, curved, or strongly correlated, many proposed moves are either too small to be useful or too large to be accepted.
Hamiltonian Monte Carlo, or HMC, uses more information.
The main idea is:
Use gradients of the log posterior to propose distant moves that still have high acceptance probability.
This chapter builds HMC in small steps:
- Add artificial momentum variables.
- Define an energy function whose density has the desired posterior as a marginal.
- Simulate Hamiltonian dynamics with the leapfrog integrator.
- Correct numerical error with a Metropolis accept/reject step.
- Use Stan to automate much of this workflow.
Why Random Walk Metropolis Struggles
Section titled “Why Random Walk Metropolis Struggles”Suppose
is high-dimensional. The posterior distribution may occupy a small and curved region of .
Random walk Metropolis has two common problems:
- Small proposals are accepted often, but move slowly.
- Large proposals move farther, but are rejected often.
In high dimensions this tradeoff becomes severe. The chain can spend many iterations taking short, undirected steps.
HMC replaces undirected random walks with guided trajectories.
Step 1: Add Momentum
Section titled “Step 1: Add Momentum”Purpose
Section titled “Purpose”HMC augments the parameter vector with a momentum vector
The momentum is artificial. It is not part of the scientific model. It is introduced to help move through the posterior.
The Joint Distribution
Section titled “The Joint Distribution”Choose a momentum distribution, usually
where is called the mass matrix.
Define the joint density
If we sample from this joint distribution and then ignore , the remaining draws of have the desired posterior distribution.
Step 2: Define Energy
Section titled “Step 2: Define Energy”Potential Energy
Section titled “Potential Energy”The potential energy is the negative log unnormalized posterior:
Low potential energy corresponds to high posterior density.
Kinetic Energy
Section titled “Kinetic Energy”The kinetic energy is determined by the momentum distribution:
The constant does not affect the algorithm because it cancels in ratios.
Hamiltonian
Section titled “Hamiltonian”The total energy is
The joint density can be written as proportional to
So preserving the Hamiltonian approximately preserves the joint density.
Step 3: Follow Hamiltonian Dynamics
Section titled “Step 3: Follow Hamiltonian Dynamics”The Differential Equations
Section titled “The Differential Equations”Hamiltonian dynamics move the pair according to
and
The first equation says momentum moves the parameters. The second says the log posterior gradient changes the momentum.
How to Read the Dynamics
Section titled “How to Read the Dynamics”If the posterior density rises in a direction, the gradient points that way. HMC uses this slope information to curve the path through parameter space instead of wandering blindly.
In exact mathematics, Hamiltonian dynamics preserve volume and energy. That is why they can propose distant moves without immediately destroying the target distribution.
Step 4: Approximate the Dynamics With Leapfrog
Section titled “Step 4: Approximate the Dynamics With Leapfrog”Exact continuous dynamics are usually impossible to compute. HMC uses a numerical approximation called the leapfrog integrator.
Let be the step size.
Half Step for Momentum
Section titled “Half Step for Momentum”Full Step for Position
Section titled “Full Step for Position”Another Half Step for Momentum
Section titled “Another Half Step for Momentum”Why Leapfrog Is Used
Section titled “Why Leapfrog Is Used”Leapfrog is popular because it is reversible and volume-preserving. These properties make the final Metropolis correction valid.
The approximation is not exact. Larger step sizes create more numerical error. The accept/reject step corrects that error.
The HMC Algorithm
Section titled “The HMC Algorithm”Step 1: Start at the Current Parameter Value
Section titled “Step 1: Start at the Current Parameter Value”At iteration , begin from
Step 2: Sample Fresh Momentum
Section titled “Step 2: Sample Fresh Momentum”Draw
This gives the trajectory a new direction.
Step 3: Simulate a Trajectory
Section titled “Step 3: Simulate a Trajectory”Starting from
run leapfrog steps with step size . This produces a proposal
Step 4: Accept or Reject
Section titled “Step 4: Accept or Reject”Use the Metropolis probability
If accepted, set
If rejected, set
As in other Metropolis methods, the normalizing constant of the posterior cancels.
How to Interpret the HMC Proposal
Section titled “How to Interpret the HMC Proposal”Random walk Metropolis asks:
What if we step randomly from the current point?
HMC asks:
If the posterior were a smooth surface and the current state had momentum, where would the trajectory go?
The gradient tells the sampler about local posterior geometry. The momentum prevents it from simply moving uphill to a mode. Together they create long, directed proposals that can cross the typical set more efficiently than a random walk.
Tuning Parameters
Section titled “Tuning Parameters”Step Size
Section titled “Step Size”The step size controls how fine the numerical simulation is.
- Too large: leapfrog error is large and proposals are rejected.
- Too small: trajectories are accurate but expensive and slow to move.
Number of Leapfrog Steps
Section titled “Number of Leapfrog Steps”The number of steps controls how long the trajectory runs.
The trajectory length is
If is too small, HMC behaves too locally. If it is too large, the trajectory may waste computation or begin to double back.
Mass Matrix
Section titled “Mass Matrix”The mass matrix controls how momentum is scaled in different parameter directions.
A good mass matrix reflects posterior scale and correlation. In many applications it is adapted during warm-up.
Purpose
Section titled “Purpose”The No-U-Turn Sampler, or NUTS, is an adaptive version of HMC used by Stan.
It avoids choosing a fixed by extending the trajectory until it detects that the path is starting to turn back toward where it came from.
What It Automates
Section titled “What It Automates”Stan’s NUTS implementation adapts:
- the step size ;
- the mass matrix ;
- the effective trajectory length.
This does not remove the need for model checking. It reduces the amount of manual tuning required to obtain useful posterior draws.
Stan as Probabilistic Programming
Section titled “Stan as Probabilistic Programming”What Stan Does
Section titled “What Stan Does”Stan is a language and computation system for Bayesian inference. The user writes the model. Stan computes gradients and runs HMC/NUTS.
A Stan program separates:
- data: observed quantities supplied by the user;
- parameters: unknown quantities to infer;
- model: priors and likelihood.
Example 1: iid Normal Model
Section titled “Example 1: iid Normal Model”The Statistical Model
Section titled “The Statistical Model”Suppose
Use priors for and .
Stan Program
Section titled “Stan Program”data { int<lower=0> N; vector[N] y;}parameters { real mu; real<lower=0> sigma;}model { mu ~ normal(0, 100); sigma ~ cauchy(0, 5); y ~ normal(mu, sigma);}How to Read It
Section titled “How to Read It”The data block declares what is observed. The parameters block declares what Stan samples. The model block adds log-density terms for priors and likelihood.
The line
y ~ normal(mu, sigma);is vectorized: it applies the Normal likelihood to all observations in y.
Example 2: Multilevel Normal Model
Section titled “Example 2: Multilevel Normal Model”The Statistical Model
Section titled “The Statistical Model”Suppose observations are grouped by person, school, region, or another unit. Let be the group for observation .
One simple hierarchical model is
with group means
Interpretation
Section titled “Interpretation”Each group gets its own mean, but the group means are partially pooled toward the population mean . Groups with little data borrow more strength from the population distribution.
In Stan, this requires a vector of group-level parameters and a prior tying them together.
Example 3: Multilevel Poisson Model
Section titled “Example 3: Multilevel Poisson Model”The Statistical Model
Section titled “The Statistical Model”For grouped count data, use a Poisson likelihood:
To keep rates positive, model group rates on the log scale:
Equivalently,
Interpretation
Section titled “Interpretation”The Normal distribution describes variation in log rates across groups. After exponentiation, the rates are positive and can vary multiplicatively.
Running Stan From R
Section titled “Running Stan From R”Basic Workflow
Section titled “Basic Workflow”In R, a typical workflow is:
library(rstan)
data <- list(N = N, y = y)
fit <- stan( file = "normal_model.stan", data = data, warmup = 1000, iter = 2000, chains = 4, cores = 2)
print(fit, digits_summary = 3)postDraws <- extract(fit)traceplot(fit)pairs(fit)What the Arguments Mean
Section titled “What the Arguments Mean”warmup: iterations used for adaptation and not kept as posterior draws.iter: total iterations per chain, including warm-up.chains: independent chains from different starting points.cores: parallel computation resources.
The printed output and diagnostic plots should be read before interpreting posterior summaries.
Reading Stan Diagnostics
Section titled “Reading Stan Diagnostics”Divergent Transitions
Section titled “Divergent Transitions”A divergent transition means the numerical Hamiltonian trajectory had trouble following the posterior geometry. Divergences can indicate step-size problems, strong curvature, or a model parametrization that is hard for HMC.
Effective Sample Size
Section titled “Effective Sample Size”Effective sample size measures how much independent-sample information the autocorrelated chains contain.
Low ESS means Monte Carlo error may be large for the reported posterior summary.
compares between-chain and within-chain variation.
Values close to 1 suggest the chains have mixed similarly. Values clearly above 1 indicate that the chains may not be sampling from the same stationary distribution.
What Stan Does Not Solve
Section titled “What Stan Does Not Solve”Stan automates HMC, but it does not choose a scientifically correct model.
The user still must:
- specify a meaningful likelihood;
- choose priors;
- check convergence diagnostics;
- perform posterior predictive checks;
- interpret the model in context.
Good computation is necessary for Bayesian inference, but it is not the same as model adequacy.
Chapter Summary
Section titled “Chapter Summary”Hamiltonian Monte Carlo improves on random walk proposals by using gradients of the log posterior to simulate directed trajectories through parameter space. Momentum variables are artificial helpers: after sampling from the joint distribution of parameters and momentum, we keep the parameter draws and discard the momentum. Leapfrog steps approximate Hamiltonian dynamics, and a Metropolis correction preserves the target distribution. Stan implements HMC through NUTS, automates much of the tuning, and provides diagnostics, but the model and its interpretation remain the analyst’s responsibility.
Check Your Understanding
Section titled “Check Your Understanding”- Why can HMC move farther than random walk Metropolis while still keeping high acceptance probabilities?
- What is the role of the momentum variable?
- What do the step size, number of leapfrog steps, and mass matrix control?
- Why does HMC still need an accept/reject step after leapfrog simulation?
- What parts of Bayesian analysis does Stan automate, and what parts remain the user’s responsibility?