Hamiltonian Monte Carlo and Stan

Source: Lecture 9 — HMC and Stan (BDA Ch. 12).

The Thread of This Chapter

Random walk Metropolis is general, but it can be slow in high dimensions. It proposes a local move without knowing much about the shape of the posterior. If the posterior is narrow, curved, or strongly correlated, many proposed moves are either too small to be useful or too large to be accepted.

Hamiltonian Monte Carlo, or HMC, uses more information.

The main idea is:

Use gradients of the log posterior to propose distant moves that still have high acceptance probability.

This chapter builds HMC in small steps:

Add artificial momentum variables.
Define an energy function whose associated joint density has the desired posterior as its marginal.
Simulate Hamiltonian dynamics with the leapfrog integrator.
Correct numerical error with a Metropolis accept/reject step.
Use Stan to automate much of this workflow.

Why Random Walk Metropolis Struggles

Suppose

$\theta=(\theta_1,\ldots,\theta_p)$

is high-dimensional. The posterior distribution may occupy a small and curved region of $\mathbb{R}^p$.

Random walk Metropolis has two common problems:

Small proposals are accepted often, but move slowly.
Large proposals move farther, but are rejected often.

In high dimensions this tradeoff becomes severe. The chain can spend many iterations taking short, undirected steps.
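The tradeoff can be seen in a small experiment. The sketch below is an illustration under assumptions, not part of the lecture: the target is a 50-dimensional standard normal, the two step sizes are arbitrary, and the function names are made up. A typical posterior draw sits about $\sqrt{50}\approx 7$ units from the mode, so the mean squared jump shows how little ground either setting covers.

```python
import math
import random

def log_target(theta):
    """Log density, up to a constant, of an assumed standard normal target."""
    return -0.5 * sum(t * t for t in theta)

def random_walk_metropolis(dim, step, n_iter=2000, seed=1):
    """Return (acceptance rate, mean squared jump distance per iteration)."""
    rng = random.Random(seed)
    theta = [0.0] * dim
    logp = log_target(theta)
    n_accept, squared_jump = 0, 0.0
    for _ in range(n_iter):
        # Undirected local proposal: add independent normal noise to every coordinate.
        proposal = [t + step * rng.gauss(0.0, 1.0) for t in theta]
        logp_prop = log_target(proposal)
        # Metropolis rule: accept with probability min(1, p(proposal) / p(current)).
        if rng.random() < math.exp(min(0.0, logp_prop - logp)):
            squared_jump += sum((a - b) ** 2 for a, b in zip(proposal, theta))
            theta, logp = proposal, logp_prop
            n_accept += 1
    return n_accept / n_iter, squared_jump / n_iter

small_accept, small_jump = random_walk_metropolis(dim=50, step=0.05)
large_accept, large_jump = random_walk_metropolis(dim=50, step=2.0)
print(f"step 0.05: acceptance {small_accept:.2f}, mean squared jump {small_jump:.3f}")
print(f"step 2.00: acceptance {large_accept:.2f}, mean squared jump {large_jump:.3f}")
```

With the small step nearly every proposal is accepted but each one barely moves. With the large step almost nothing is accepted, so the chain does not move at all.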

HMC replaces undirected random walks with guided trajectories.

Step 1: Add Momentum

Purpose

HMC augments the parameter vector $\theta$ with a momentum vector

$\phi=(\phi_1,\ldots,\phi_p).$

The momentum is artificial. It is not part of the scientific model. It is introduced to help move through the posterior.

The Joint Distribution

Choose a momentum distribution, usually

$\phi \sim N(0,M),$

where $M$ is called the mass matrix.

Define the joint density

$p(\theta,\phi \mid y) = p(\theta \mid y)p(\phi).$

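Because $\theta$ and $\phi$ are independent under this joint density, integrating out $\phi$ leaves $p(\theta \mid y)$. So if we sample from the joint distribution and then ignore $\phi$, the remaining draws of $\theta$ have the desired posterior distribution.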

Step 2: Define Energy

Potential Energy

The potential energy is the negative log unnormalized posterior:

$U(\theta) = -\log\{p(y \mid \theta)p(\theta)\}.$

Low potential energy corresponds to high posterior density.

Kinetic Energy

The kinetic energy is determined by the momentum distribution:

$K(\phi) = \frac{1}{2}\phi^\top M^{-1}\phi + \mathrm{constant}.$

The constant does not affect the algorithm: it has zero gradient, so it drops out of the dynamics, and it cancels in the acceptance ratio.

Hamiltonian

The total energy is

$H(\theta,\phi) = U(\theta)+K(\phi).$

The joint density can be written as proportional to

$\exp\{-H(\theta,\phi)\}.$

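So a trajectory along which $H$ stays constant also keeps the joint density constant, and a trajectory that nearly conserves $H$ ends at a point with nearly the same joint density.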
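The link between energy and density can be checked numerically. The sketch below assumes a one-dimensional posterior $N(1, 2^2)$ and mass $M=1$, both chosen only for illustration. It compares ratios of $\exp\{-H\}$ with ratios of the normalized joint density at two arbitrary points; the two should agree because normalizing constants cancel in a ratio.

```python
import math

def potential(theta):
    """U(theta): negative log unnormalized posterior for an assumed N(1, 2^2) target."""
    return 0.5 * ((theta - 1.0) / 2.0) ** 2

def kinetic(phi, mass=1.0):
    """K(phi) = phi^2 / (2 * mass) for momentum phi ~ N(0, mass), constant dropped."""
    return 0.5 * phi * phi / mass

def hamiltonian(theta, phi, mass=1.0):
    return potential(theta) + kinetic(phi, mass)

def normal_pdf(x, mean, sd):
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

# Two arbitrary (theta, phi) points.
a, b = (0.3, -0.7), (2.5, 1.2)
ratio_energy = math.exp(-hamiltonian(*a)) / math.exp(-hamiltonian(*b))
ratio_density = (normal_pdf(a[0], 1.0, 2.0) * normal_pdf(a[1], 0.0, 1.0)) / (
    normal_pdf(b[0], 1.0, 2.0) * normal_pdf(b[1], 0.0, 1.0))
print(ratio_energy, ratio_density)
```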

Step 3: Follow Hamiltonian Dynamics

The Differential Equations

Hamiltonian dynamics move the pair $(\theta,\phi)$ according to

$\frac{d\theta}{dt} = M^{-1}\phi,$

and

$\frac{d\phi}{dt} = \nabla_{\theta}\log p(\theta \mid y).$

The first equation says momentum moves the parameters. The second says the gradient of the log posterior, which equals $-\nabla_{\theta}U(\theta)$, changes the momentum.

How to Read the Dynamics

If the posterior density rises in a direction, the gradient points that way. HMC uses this slope information to curve the path through parameter space instead of wandering blindly.

In exact mathematics, Hamiltonian dynamics preserve volume and energy. Because $H$ is constant along an exact trajectory, its endpoint has the same joint density as its start, so even a distant endpoint would be accepted with probability 1.
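One of the few cases where the dynamics can be solved exactly is a standard normal posterior with $M=1$, assumed here purely as an illustration. The equations become $d\theta/dt=\phi$ and $d\phi/dt=-\theta$, whose solution is a rotation in the $(\theta,\phi)$ plane. The sketch below shows a move that is far from local yet leaves the energy unchanged.

```python
import math

def exact_flow(theta0, phi0, t):
    """Exact Hamiltonian flow for U(theta) = theta^2 / 2 and M = 1: a rotation by angle t."""
    c, s = math.cos(t), math.sin(t)
    return theta0 * c + phi0 * s, phi0 * c - theta0 * s

def hamiltonian(theta, phi):
    return 0.5 * theta * theta + 0.5 * phi * phi

theta0, phi0 = 0.5, 1.5
theta1, phi1 = exact_flow(theta0, phi0, t=2.0)

print("moved from", theta0, "to", round(theta1, 3))
print("energy before:", hamiltonian(theta0, phi0))
print("energy after: ", hamiltonian(theta1, phi1))
```

A rotation also has Jacobian determinant 1, which is the volume preservation property in this special case.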

Step 4: Approximate the Dynamics With Leapfrog

The exact continuous dynamics rarely have a closed-form solution. HMC uses a numerical approximation called the leapfrog integrator.

Let $\varepsilon$ be the step size.

Half Step for Momentum

$\phi\left(t+\frac{\varepsilon}{2}\right) = \phi(t) + \frac{\varepsilon}{2} \nabla_{\theta}\log p(\theta(t) \mid y).$

Full Step for Position

$\theta(t+\varepsilon) = \theta(t) + \varepsilon M^{-1} \phi\left(t+\frac{\varepsilon}{2}\right).$

Another Half Step for Momentum

$\phi(t+\varepsilon) = \phi\left(t+\frac{\varepsilon}{2}\right) + \frac{\varepsilon}{2} \nabla_{\theta}\log p(\theta(t+\varepsilon) \mid y).$

Why Leapfrog Is Used

Leapfrog is popular because it is reversible (negating the momentum and running the same steps retraces the path) and volume-preserving. These properties make the final Metropolis correction valid.

The approximation is not exact. Larger step sizes create more numerical error. The accept/reject step corrects that error.
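The three updates translate almost line for line into code. The sketch below assumes a one-dimensional standard normal posterior and a scalar mass; the function names and starting values are illustrative. It checks two things claimed above: the energy error shrinks when the step size shrinks, and the integrator retraces its path when the momentum is negated.

```python
def grad_log_post(theta):
    """Gradient of log p(theta | y) for an assumed standard normal posterior."""
    return -theta

def hamiltonian(theta, phi, mass=1.0):
    return 0.5 * theta * theta + 0.5 * phi * phi / mass

def leapfrog(theta, phi, eps, n_steps, mass=1.0):
    for _ in range(n_steps):
        phi = phi + 0.5 * eps * grad_log_post(theta)  # half step for momentum
        theta = theta + eps * phi / mass              # full step for position
        phi = phi + 0.5 * eps * grad_log_post(theta)  # another half step for momentum
    return theta, phi

start_theta, start_phi = 1.0, 0.5
h0 = hamiltonian(start_theta, start_phi)

# Same total trajectory length (2.0), coarse versus fine step size.
energy_error = {}
for eps, n_steps in [(0.4, 5), (0.1, 20)]:
    theta, phi = leapfrog(start_theta, start_phi, eps, n_steps)
    energy_error[eps] = abs(hamiltonian(theta, phi) - h0)
print(energy_error)

# Reversibility: go forward, flip the momentum, go forward again.
theta_fwd, phi_fwd = leapfrog(start_theta, start_phi, 0.4, 5)
theta_back, phi_back = leapfrog(theta_fwd, -phi_fwd, 0.4, 5)
print(theta_back, -phi_back)
```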

The HMC Algorithm

Step 1: Start at the Current Parameter Value

At iteration $i$, begin from

$\theta^{(i-1)}.$

Step 2: Sample Fresh Momentum

Draw

$\phi_s \sim N(0,M).$

This gives the trajectory a new direction.

Step 3: Simulate a Trajectory

Starting from

$(\theta^{(i-1)},\phi_s),$

run $L$ leapfrog steps with step size $\varepsilon$. This produces a proposal

$(\theta^p,\phi^p).$

Step 4: Accept or Reject

Use the Metropolis probability

$\alpha = \min \left( 1,\, \frac{ p(y \mid \theta^p)p(\theta^p)p(\phi^p) }{ p(y \mid \theta^{(i-1)})p(\theta^{(i-1)})p(\phi_s) } \right).$

If accepted, set

$\theta^{(i)}=\theta^p.$

If rejected, set

$\theta^{(i)}=\theta^{(i-1)}.$

As in other Metropolis methods, the normalizing constant of the posterior cancels.
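The four steps fit in a short loop. The following is a minimal sketch under stated assumptions, not a production sampler: the posterior is taken to be a one-dimensional $N(3, 2^2)$, the mass is a scalar, and the values of $\varepsilon$ and $L$ are picked by hand. The acceptance ratio is computed on the log scale, where it is the difference of two values of $-H$.

```python
import math
import random

def log_post(theta):
    """Unnormalized log posterior for an assumed N(3, 2^2) target."""
    return -0.5 * ((theta - 3.0) / 2.0) ** 2

def grad_log_post(theta):
    return -(theta - 3.0) / 4.0

def hmc(n_iter, eps, n_leapfrog, mass=1.0, theta0=0.0, seed=0):
    rng = random.Random(seed)
    theta = theta0                      # Step 1: start at the current value
    draws, n_accept = [], 0
    for _ in range(n_iter):
        # Step 2: fresh momentum phi_s ~ N(0, M); gauss() takes a standard deviation.
        phi_s = rng.gauss(0.0, math.sqrt(mass))
        # Step 3: L leapfrog steps starting from (theta, phi_s).
        theta_p, phi_p = theta, phi_s
        for _ in range(n_leapfrog):
            phi_p += 0.5 * eps * grad_log_post(theta_p)
            theta_p += eps * phi_p / mass
            phi_p += 0.5 * eps * grad_log_post(theta_p)
        # Step 4: Metropolis ratio on the log scale.
        log_ratio = (log_post(theta_p) - 0.5 * phi_p ** 2 / mass) - (
            log_post(theta) - 0.5 * phi_s ** 2 / mass)
        if rng.random() < math.exp(min(0.0, log_ratio)):
            theta = theta_p
            n_accept += 1
        draws.append(theta)             # on rejection the old value is repeated
    return draws, n_accept / n_iter

draws, accept_rate = hmc(n_iter=4000, eps=0.5, n_leapfrog=8)
kept = draws[500:]                      # drop early draws that depend on theta0
post_mean = sum(kept) / len(kept)
post_sd = math.sqrt(sum((d - post_mean) ** 2 for d in kept) / (len(kept) - 1))
print(f"acceptance {accept_rate:.2f}, mean {post_mean:.2f}, sd {post_sd:.2f}")
```

The momentum is discarded at the end of every iteration; only the $\theta$ draws are stored.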

How to Interpret the HMC Proposal

Random walk Metropolis asks:

What if we step randomly from the current point?

HMC asks:

If the posterior were a smooth surface and the current state had momentum, where would the trajectory go?

The gradient tells the sampler about local posterior geometry. The momentum prevents it from simply moving uphill to a mode. Together they create long, directed proposals that can cross the typical set more efficiently than a random walk.

Tuning Parameters

Step Size

The step size $\varepsilon$ controls how fine the numerical simulation is.

Too large: leapfrog error is large and proposals are rejected.
Too small: trajectories are accurate but expensive and slow to move.

Number of Leapfrog Steps

The number of steps $L$ controls how long the trajectory runs.

The trajectory length is

$\varepsilon L.$

If $L$ is too small, HMC behaves too locally. If it is too large, the trajectory may waste computation or begin to double back.
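Doubling back is easy to see for a standard normal posterior with $M=1$ (an assumption made for this illustration), where exact trajectories are periodic with period $2\pi$. The sketch below runs leapfrog for roughly a quarter, a half, and a full period and records how far the endpoint is from the start.

```python
import math

def leapfrog_end(theta, phi, eps, n_steps):
    """Position after n_steps leapfrog steps for an assumed standard normal posterior."""
    for _ in range(n_steps):
        phi -= 0.5 * eps * theta   # gradient of log p(theta | y) is -theta
        theta += eps * phi
        phi -= 0.5 * eps * theta
    return theta

theta0, phi0, eps = 1.0, 0.0, 0.05
distance = {}
for label, total_time in [("quarter", math.pi / 2), ("half", math.pi), ("full", 2 * math.pi)]:
    n_steps = round(total_time / eps)
    distance[label] = abs(leapfrog_end(theta0, phi0, eps, n_steps) - theta0)
    print(f"{label:>7} period: L = {n_steps:3d}, distance moved = {distance[label]:.3f}")
```

The full-period run costs the most gradient evaluations and ends almost exactly where it began, which is the waste that adaptive trajectory lengths are meant to avoid.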

Mass Matrix

The mass matrix $M$ controls how momentum is scaled in different parameter directions.

A good mass matrix reflects posterior scale and correlation: the sampler works well when $M^{-1}$ is roughly the posterior covariance of $\theta$. In many applications it is adapted during warm-up.
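The effect of scaling can be sketched with a two-parameter posterior whose standard deviations differ by a factor of 100. Everything here is assumed for illustration: the independent normal target, the diagonal mass, and the step size. With the identity mass the step size is too coarse for the narrow direction and the energy error explodes; with $M^{-1}$ set to the posterior variances the same step size is stable.

```python
import math

SD = (0.1, 10.0)  # assumed posterior standard deviations of the two parameters

def grad_log_post(theta):
    return [-t / (s * s) for t, s in zip(theta, SD)]

def hamiltonian(theta, phi, mass):
    potential = sum(0.5 * (t / s) ** 2 for t, s in zip(theta, SD))
    kinetic = sum(0.5 * p * p / m for p, m in zip(phi, mass))
    return potential + kinetic

def leapfrog(theta, phi, eps, n_steps, mass):
    theta, phi = list(theta), list(phi)
    for _ in range(n_steps):
        g = grad_log_post(theta)
        phi = [p + 0.5 * eps * gi for p, gi in zip(phi, g)]
        theta = [t + eps * p / m for t, p, m in zip(theta, phi, mass)]
        g = grad_log_post(theta)
        phi = [p + 0.5 * eps * gi for p, gi in zip(phi, g)]
    return theta, phi

def energy_error(mass, eps=0.5, n_steps=10):
    theta0 = list(SD)                       # start one sd from the mode
    phi0 = [math.sqrt(m) for m in mass]     # a typical momentum under N(0, M)
    theta1, phi1 = leapfrog(theta0, phi0, eps, n_steps, mass)
    return abs(hamiltonian(theta1, phi1, mass) - hamiltonian(theta0, phi0, mass))

identity_error = energy_error(mass=(1.0, 1.0))
matched_error = energy_error(mass=tuple(1.0 / (s * s) for s in SD))
print("identity mass:", identity_error)
print("matched mass: ", matched_error)
```

Shrinking $\varepsilon$ would also stabilize the identity-mass run, but the wide direction would then need many more leapfrog steps to be crossed.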

NUTS

Purpose

The No-U-Turn Sampler, or NUTS, is an adaptive version of HMC used by Stan.

It avoids choosing a fixed $L$ by extending the trajectory until it detects that the path is starting to turn back toward where it came from.
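The stopping idea alone can be sketched in a few lines. This is only the U-turn check in one dimension for an assumed standard normal posterior, with a made-up function name. It is not NUTS: the real algorithm grows the trajectory in both directions by repeated doubling and selects a point from it in a way that keeps the sampler valid, and simply proposing the endpoint found here would not.

```python
def grad_log_post(theta):
    return -theta  # assumed standard normal posterior

def steps_until_u_turn(theta0, phi0, eps, max_steps=1000):
    """Count leapfrog steps until the momentum points back toward the start."""
    theta, phi = theta0, phi0
    for step in range(1, max_steps + 1):
        phi += 0.5 * eps * grad_log_post(theta)
        theta += eps * phi
        phi += 0.5 * eps * grad_log_post(theta)
        # U-turn check: displacement from the start and current momentum disagree in sign.
        if (theta - theta0) * phi < 0.0:
            return step
    return max_steps

n_steps = steps_until_u_turn(theta0=1.0, phi0=1.0, eps=0.1)
print("stopped after", n_steps, "leapfrog steps")
```

For this starting point the exact momentum changes sign at time $\pi/4$, so the count should land near $(\pi/4)/\varepsilon$.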

What It Automates

Stan’s NUTS implementation adapts:

the step size $\varepsilon$;
the mass matrix $M$;
the effective trajectory length.

This does not remove the need for model checking. It reduces the amount of manual tuning required to obtain useful posterior draws.

Stan as Probabilistic Programming

What Stan Does

Stan is a language and computation system for Bayesian inference. The user writes the model. Stan computes the gradients by automatic differentiation and runs HMC/NUTS.

A Stan program separates:

data: observed quantities supplied by the user;
parameters: unknown quantities to infer;
model: priors and likelihood.

Example 1: iid Normal Model

The Statistical Model

Suppose

$y_i \mid \mu,\sigma^2 \sim N(\mu,\sigma^2), \qquad i=1,\ldots,N.$

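Place priors on $\mu$ and $\sigma$. Stan's normal distribution is parameterized by the standard deviation, so the program below works with $\sigma$ rather than $\sigma^2$.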

Stan Program

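data {
  int<lower=0> N;        // number of observations
  vector[N] y;           // observed values
}
parameters {
  real mu;               // mean
  real<lower=0> sigma;   // standard deviation, constrained to be positive
}
model {
  mu ~ normal(0, 100);   // prior on the mean; the second argument is an sd
  sigma ~ cauchy(0, 5);  // half-Cauchy, because sigma is declared positive
  y ~ normal(mu, sigma); // likelihood, vectorized over all N observations
}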

How to Read It

The data block declares what is observed. The parameters block declares what Stan samples. The model block adds log-density terms for priors and likelihood.

The line

y ~ normal(mu, sigma);

is vectorized: it applies the Normal likelihood to all observations in y.
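To make "adds log-density terms" concrete, the sketch below writes out, in Python, the unnormalized log posterior that the model block defines. The function name and the data values are invented for illustration. One caveat: Stan itself samples on an unconstrained scale (here $\log\sigma$) and adds the corresponding Jacobian term, which this sketch leaves out.

```python
import math

def log_posterior(mu, sigma, y):
    """Sum of the three log-density terms in the model block, constants dropped."""
    if sigma <= 0.0:
        return -math.inf                                    # outside <lower=0>
    log_prior_mu = -0.5 * (mu / 100.0) ** 2                 # mu ~ normal(0, 100)
    log_prior_sigma = -math.log(1.0 + (sigma / 5.0) ** 2)   # sigma ~ cauchy(0, 5)
    # y ~ normal(mu, sigma): one term per observation, added together.
    log_lik = sum(-math.log(sigma) - 0.5 * ((yi - mu) / sigma) ** 2 for yi in y)
    return log_prior_mu + log_prior_sigma + log_lik

y = [4.8, 5.3, 5.1, 4.6, 5.4]  # made-up observations
print(log_posterior(5.04, 0.3, y))   # near the sample mean
print(log_posterior(0.0, 0.3, y))    # far from the data
```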

Example 2: Multilevel Normal Model

The Statistical Model

Suppose observations are grouped by person, school, region, or another unit. Let $p[i]$ be the group for observation $i$.

One simple hierarchical model is

$y_i \mid \mu_{p[i]},\sigma \sim N(\mu_{p[i]},\sigma^2),$

with group means

$\mu_j \mid \mu,\tau \sim N(\mu,\tau^2).$

Interpretation

Each group gets its own mean, but the group means are partially pooled toward the population mean $\mu$. Groups with little data borrow more strength from the population distribution.

In Stan, this requires a vector of group-level parameters and a prior tying them together.
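Partial pooling has a simple closed form when $\mu$, $\tau$, and $\sigma$ are treated as known: the conditional posterior mean of $\mu_j$ is a precision-weighted average of the group's sample mean and the population mean. The sketch below evaluates that formula for a small and a large group; the function name and all numbers are made up for illustration.

```python
def partial_pool(group_mean, n_obs, sigma, mu, tau):
    """Conditional posterior mean of mu_j given mu, tau, sigma and the group's data."""
    data_precision = n_obs / sigma ** 2    # information from the group's own observations
    prior_precision = 1.0 / tau ** 2       # information from the population distribution
    weight = data_precision / (data_precision + prior_precision)
    return weight * group_mean + (1.0 - weight) * mu

# Same observed group mean (10), very different amounts of data.
small_group = partial_pool(group_mean=10.0, n_obs=2, sigma=4.0, mu=0.0, tau=2.0)
large_group = partial_pool(group_mean=10.0, n_obs=200, sigma=4.0, mu=0.0, tau=2.0)
print("2 observations:  ", round(small_group, 3))
print("200 observations:", round(large_group, 3))
```

The group with two observations is pulled most of the way toward $\mu=0$, while the group with 200 observations stays close to its own mean.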

Example 3: Multilevel Poisson Model

The Statistical Model

For grouped count data, use a Poisson likelihood:

$y_i \mid \lambda_{p[i]} \sim \mathrm{Poisson}(\lambda_{p[i]}).$

To keep rates positive, model group rates on the log scale:

$\log \lambda_j \mid \mu,\tau \sim N(\mu,\tau^2).$

Equivalently,

$\lambda_j \sim \mathrm{LogNormal}(\mu,\tau^2).$

Interpretation

The Normal distribution describes variation in log rates across groups. After exponentiation, the rates are positive and can vary multiplicatively.
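The two pieces of this model can be written as one log-density function. The sketch below is illustrative only: the data, the 0-based group index, and the function names are invented, and hyperpriors on $\mu$ and $\tau$ are left out. The group parameters are stored on the log scale, so every rate is positive by construction.

```python
import math

def poisson_logpmf(y, rate):
    return y * math.log(rate) - rate - math.lgamma(y + 1)

def normal_logpdf(x, mean, sd):
    return -0.5 * ((x - mean) / sd) ** 2 - math.log(sd) - 0.5 * math.log(2.0 * math.pi)

def log_joint(y, group, log_rate, mu, tau):
    """log p(y | lambda) + log p(log lambda | mu, tau), with lambda_j = exp(log_rate[j])."""
    rates = [math.exp(v) for v in log_rate]      # exponentiation keeps rates positive
    log_lik = sum(poisson_logpmf(yi, rates[g]) for yi, g in zip(y, group))
    log_prior = sum(normal_logpdf(v, mu, tau) for v in log_rate)
    return log_lik + log_prior

y = [0, 2, 1, 7, 9, 6]         # made-up counts
group = [0, 0, 0, 1, 1, 1]     # p[i], written 0-based for Python
good = log_joint(y, group, log_rate=[0.0, 2.0], mu=1.0, tau=1.0)
swapped = log_joint(y, group, log_rate=[2.0, 0.0], mu=1.0, tau=1.0)
print(good, swapped)
```

Adding a constant $c$ to a log rate multiplies the rate by $e^{c}$, which is the multiplicative variation described above.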

Running Stan From R

Basic Workflow

In R, a typical workflow is:

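library(rstan)

# N (sample size) and y (numeric vector of length N) are assumed to exist already.
data <- list(N = N, y = y)

fit <- stan(
  file = "normal_model.stan",  # path to the Stan program
  data = data,
  warmup = 1000,               # adaptation iterations, discarded
  iter = 2000,                 # total iterations per chain, warm-up included
  chains = 4,
  cores = 2                    # chains run in parallel on up to 2 cores
)

print(fit, digits_summary = 3)     # posterior summaries with n_eff and Rhat
postDraws <- rstan::extract(fit)   # named list of posterior draws
traceplot(fit)                     # per-chain trace plots
pairs(fit)                         # pairwise scatterplots of the draws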

What the Arguments Mean

warmup: iterations used for adaptation and not kept as posterior draws.
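iter: total iterations per chain, including warm-up. With iter = 2000 and warmup = 1000, each chain keeps 1000 draws, so 4 chains give 4000 posterior draws.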
chains: independent chains from different starting points.
cores: number of CPU cores used to run chains in parallel. With 4 chains and 2 cores, at most two chains run at the same time.

The printed output and diagnostic plots should be read before interpreting posterior summaries.

Reading Stan Diagnostics

Divergent Transitions

A divergent transition means the leapfrog trajectory departed sharply from the true Hamiltonian path, which shows up as a large error in the energy. Divergences can indicate a step size that is too large, strong curvature, or a model parametrization that is hard for HMC. Posterior summaries from a run with divergences may be biased.

Effective Sample Size

Effective sample size measures how much independent-sample information the autocorrelated chains contain.

Low ESS means Monte Carlo error may be large for the reported posterior summary.
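A rough version of the calculation divides the number of draws by $1 + 2\sum_k \rho_k$, where $\rho_k$ is the lag-$k$ autocorrelation. The sketch below is a simplified single-chain estimator with an ad hoc truncation rule, not the estimator Stan reports, and the AR(1) chains are simulated stand-ins for sampler output.

```python
import random

def autocorr(x, lag):
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n
    cov = sum((x[i] - mean) * (x[i + lag] - mean) for i in range(n - lag)) / n
    return cov / var

def effective_sample_size(x, max_lag=200):
    """n / (1 + 2 * sum of autocorrelations), truncated at the first non-positive lag."""
    total = 0.0
    for lag in range(1, max_lag + 1):
        rho = autocorr(x, lag)
        if rho <= 0.0:
            break
        total += rho
    return len(x) / (1.0 + 2.0 * total)

def ar1_chain(n, rho, seed):
    """Simulated autocorrelated chain: x[t] = rho * x[t-1] + standard normal noise."""
    rng = random.Random(seed)
    x, value = [], 0.0
    for _ in range(n):
        value = rho * value + rng.gauss(0.0, 1.0)
        x.append(value)
    return x

ess_sticky = effective_sample_size(ar1_chain(4000, rho=0.9, seed=1))
ess_iid = effective_sample_size(ar1_chain(4000, rho=0.0, seed=1))
print(f"rho = 0.9: ESS about {ess_sticky:.0f} of 4000")
print(f"rho = 0.0: ESS about {ess_iid:.0f} of 4000")
```

For an AR(1) chain the theoretical value is $n(1-\rho)/(1+\rho)$, so 4000 draws with $\rho=0.9$ carry roughly the information of a couple of hundred independent draws.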

R-hat

$\hat R$ compares between-chain and within-chain variation.

Values close to 1 suggest the chains have mixed similarly. Values clearly above 1 (a common rule of thumb is above about 1.01, with older references using 1.1) indicate that the chains may not be sampling from the same stationary distribution.
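The comparison can be sketched with the classic between/within formula from BDA. This is a simplified version for illustration: Stan reports a refined split, rank-normalized variant, and the chains below are simulated rather than produced by a sampler.

```python
import math
import random

def r_hat(chains):
    """Classic potential scale reduction factor for m chains of equal length n."""
    m, n = len(chains), len(chains[0])
    means = [sum(c) / n for c in chains]
    grand_mean = sum(means) / m
    # B: between-chain variance, scaled by n. W: average within-chain variance.
    between = n / (m - 1) * sum((mu - grand_mean) ** 2 for mu in means)
    within = sum(
        sum((v - mu) ** 2 for v in c) / (n - 1) for c, mu in zip(chains, means)
    ) / m
    var_plus = (n - 1) / n * within + between / n
    return math.sqrt(var_plus / within)

rng = random.Random(0)
# Four chains drawing from the same distribution.
mixed = [[rng.gauss(0.0, 1.0) for _ in range(1000)] for _ in range(4)]
# Four chains where two are stuck around a different location.
stuck = [[rng.gauss(shift, 1.0) for _ in range(1000)] for shift in (0.0, 0.0, 3.0, 3.0)]

print("mixed chains:", round(r_hat(mixed), 3))
print("stuck chains:", round(r_hat(stuck), 3))
```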

What Stan Does Not Solve

Stan automates HMC, but it does not choose a scientifically correct model.

The user still must:

specify a meaningful likelihood;
choose priors;
check convergence diagnostics;
perform posterior predictive checks;
interpret the model in context.

Good computation is necessary for Bayesian inference, but it is not the same as model adequacy.

Chapter Summary

Hamiltonian Monte Carlo improves on random walk proposals by using gradients of the log posterior to simulate directed trajectories through parameter space. Momentum variables are artificial helpers: after sampling from the joint distribution of parameters and momentum, we keep the parameter draws and discard the momentum. Leapfrog steps approximate Hamiltonian dynamics, and a Metropolis correction preserves the target distribution. Stan implements HMC through NUTS, automates much of the tuning, and provides diagnostics, but the model and its interpretation remain the analyst’s responsibility.

Check Your Understanding

Why can HMC move farther than random walk Metropolis while still keeping high acceptance probabilities?
What is the role of the momentum variable?
What do the step size, number of leapfrog steps, and mass matrix control?
Why does HMC still need an accept/reject step after leapfrog simulation?
What parts of Bayesian analysis does Stan automate, and what parts remain the user’s responsibility?