Skip to content

Hamiltonian Monte Carlo and Stan

Source: Lecture 9 — HMC and Stan (BDA Ch. 12).

Random walk Metropolis is general, but it can be slow in high dimensions. It proposes a local move without knowing much about the shape of the posterior. If the posterior is narrow, curved, or strongly correlated, many proposed moves are either too small to be useful or too large to be accepted.

Hamiltonian Monte Carlo, or HMC, uses more information.

The main idea is:

Use gradients of the log posterior to propose distant moves that still have high acceptance probability.

This chapter builds HMC in small steps:

  1. Add artificial momentum variables.
  2. Define an energy function whose density has the desired posterior as a marginal.
  3. Simulate Hamiltonian dynamics with the leapfrog integrator.
  4. Correct numerical error with a Metropolis accept/reject step.
  5. Use Stan to automate much of this workflow.

Suppose

θ=(θ1,,θp)\theta=(\theta_1,\ldots,\theta_p)

is high-dimensional. The posterior distribution may occupy a small and curved region of Rp\mathbb{R}^p.

Random walk Metropolis has two common problems:

  • Small proposals are accepted often, but move slowly.
  • Large proposals move farther, but are rejected often.

In high dimensions this tradeoff becomes severe. The chain can spend many iterations taking short, undirected steps.

HMC replaces undirected random walks with guided trajectories.

HMC augments the parameter vector θ\theta with a momentum vector

ϕ=(ϕ1,,ϕp).\phi=(\phi_1,\ldots,\phi_p).

The momentum is artificial. It is not part of the scientific model. It is introduced to help move through the posterior.

Choose a momentum distribution, usually

ϕN(0,M),\phi \sim N(0,M),

where MM is called the mass matrix.

Define the joint density

p(θ,ϕy)=p(θy)p(ϕ).p(\theta,\phi \mid y) = p(\theta \mid y)p(\phi).

If we sample from this joint distribution and then ignore ϕ\phi, the remaining draws of θ\theta have the desired posterior distribution.

The potential energy is the negative log unnormalized posterior:

U(θ)=log{p(yθ)p(θ)}.U(\theta) = -\log\{p(y \mid \theta)p(\theta)\}.

Low potential energy corresponds to high posterior density.

The kinetic energy is determined by the momentum distribution:

K(ϕ)=12ϕM1ϕ+constant.K(\phi) = \frac{1}{2}\phi^\top M^{-1}\phi + \mathrm{constant}.

The constant does not affect the algorithm because it cancels in ratios.

The total energy is

H(θ,ϕ)=U(θ)+K(ϕ).H(\theta,\phi) = U(\theta)+K(\phi).

The joint density can be written as proportional to

exp{H(θ,ϕ)}.\exp\{-H(\theta,\phi)\}.

So preserving the Hamiltonian approximately preserves the joint density.

Hamiltonian dynamics move the pair (θ,ϕ)(\theta,\phi) according to

dθdt=M1ϕ,\frac{d\theta}{dt} = M^{-1}\phi,

and

dϕdt=θlogp(θy).\frac{d\phi}{dt} = \nabla_{\theta}\log p(\theta \mid y).

The first equation says momentum moves the parameters. The second says the log posterior gradient changes the momentum.

If the posterior density rises in a direction, the gradient points that way. HMC uses this slope information to curve the path through parameter space instead of wandering blindly.

In exact mathematics, Hamiltonian dynamics preserve volume and energy. That is why they can propose distant moves without immediately destroying the target distribution.

Step 4: Approximate the Dynamics With Leapfrog

Section titled “Step 4: Approximate the Dynamics With Leapfrog”

Exact continuous dynamics are usually impossible to compute. HMC uses a numerical approximation called the leapfrog integrator.

Let ε\varepsilon be the step size.

ϕ(t+ε2)=ϕ(t)+ε2θlogp(θ(t)y).\phi\left(t+\frac{\varepsilon}{2}\right) = \phi(t) + \frac{\varepsilon}{2} \nabla_{\theta}\log p(\theta(t) \mid y). θ(t+ε)=θ(t)+εM1ϕ(t+ε2).\theta(t+\varepsilon) = \theta(t) + \varepsilon M^{-1} \phi\left(t+\frac{\varepsilon}{2}\right). ϕ(t+ε)=ϕ(t+ε2)+ε2θlogp(θ(t+ε)y).\phi(t+\varepsilon) = \phi\left(t+\frac{\varepsilon}{2}\right) + \frac{\varepsilon}{2} \nabla_{\theta}\log p(\theta(t+\varepsilon) \mid y).

Leapfrog is popular because it is reversible and volume-preserving. These properties make the final Metropolis correction valid.

The approximation is not exact. Larger step sizes create more numerical error. The accept/reject step corrects that error.

Step 1: Start at the Current Parameter Value

Section titled “Step 1: Start at the Current Parameter Value”

At iteration ii, begin from

θ(i1).\theta^{(i-1)}.

Draw

ϕsN(0,M).\phi_s \sim N(0,M).

This gives the trajectory a new direction.

Starting from

(θ(i1),ϕs),(\theta^{(i-1)},\phi_s),

run LL leapfrog steps with step size ε\varepsilon. This produces a proposal

(θp,ϕp).(\theta^p,\phi^p).

Use the Metropolis probability

α=min(1,p(yθp)p(θp)p(ϕp)p(yθ(i1))p(θ(i1))p(ϕs)).\alpha = \min \left( 1,\, \frac{ p(y \mid \theta^p)p(\theta^p)p(\phi^p) }{ p(y \mid \theta^{(i-1)})p(\theta^{(i-1)})p(\phi_s) } \right).

If accepted, set

θ(i)=θp.\theta^{(i)}=\theta^p.

If rejected, set

θ(i)=θ(i1).\theta^{(i)}=\theta^{(i-1)}.

As in other Metropolis methods, the normalizing constant of the posterior cancels.

Random walk Metropolis asks:

What if we step randomly from the current point?

HMC asks:

If the posterior were a smooth surface and the current state had momentum, where would the trajectory go?

The gradient tells the sampler about local posterior geometry. The momentum prevents it from simply moving uphill to a mode. Together they create long, directed proposals that can cross the typical set more efficiently than a random walk.

The step size ε\varepsilon controls how fine the numerical simulation is.

  • Too large: leapfrog error is large and proposals are rejected.
  • Too small: trajectories are accurate but expensive and slow to move.

The number of steps LL controls how long the trajectory runs.

The trajectory length is

εL.\varepsilon L.

If LL is too small, HMC behaves too locally. If it is too large, the trajectory may waste computation or begin to double back.

The mass matrix MM controls how momentum is scaled in different parameter directions.

A good mass matrix reflects posterior scale and correlation. In many applications it is adapted during warm-up.

The No-U-Turn Sampler, or NUTS, is an adaptive version of HMC used by Stan.

It avoids choosing a fixed LL by extending the trajectory until it detects that the path is starting to turn back toward where it came from.

Stan’s NUTS implementation adapts:

  • the step size ε\varepsilon;
  • the mass matrix MM;
  • the effective trajectory length.

This does not remove the need for model checking. It reduces the amount of manual tuning required to obtain useful posterior draws.

Stan is a language and computation system for Bayesian inference. The user writes the model. Stan computes gradients and runs HMC/NUTS.

A Stan program separates:

  • data: observed quantities supplied by the user;
  • parameters: unknown quantities to infer;
  • model: priors and likelihood.

Suppose

yiμ,σ2N(μ,σ2),i=1,,N.y_i \mid \mu,\sigma^2 \sim N(\mu,\sigma^2), \qquad i=1,\ldots,N.

Use priors for μ\mu and σ2\sigma^2.

data {
int<lower=0> N;
vector[N] y;
}
parameters {
real mu;
real<lower=0> sigma;
}
model {
mu ~ normal(0, 100);
sigma ~ cauchy(0, 5);
y ~ normal(mu, sigma);
}

The data block declares what is observed. The parameters block declares what Stan samples. The model block adds log-density terms for priors and likelihood.

The line

y ~ normal(mu, sigma);

is vectorized: it applies the Normal likelihood to all observations in y.

Suppose observations are grouped by person, school, region, or another unit. Let p[i]p[i] be the group for observation ii.

One simple hierarchical model is

yiμp[i],σN(μp[i],σ2),y_i \mid \mu_{p[i]},\sigma \sim N(\mu_{p[i]},\sigma^2),

with group means

μjμ,τN(μ,τ2).\mu_j \mid \mu,\tau \sim N(\mu,\tau^2).

Each group gets its own mean, but the group means are partially pooled toward the population mean μ\mu. Groups with little data borrow more strength from the population distribution.

In Stan, this requires a vector of group-level parameters and a prior tying them together.

For grouped count data, use a Poisson likelihood:

yiλp[i]Poisson(λp[i]).y_i \mid \lambda_{p[i]} \sim \mathrm{Poisson}(\lambda_{p[i]}).

To keep rates positive, model group rates on the log scale:

logλjμ,τN(μ,τ2).\log \lambda_j \mid \mu,\tau \sim N(\mu,\tau^2).

Equivalently,

λjLogNormal(μ,τ2).\lambda_j \sim \mathrm{LogNormal}(\mu,\tau^2).

The Normal distribution describes variation in log rates across groups. After exponentiation, the rates are positive and can vary multiplicatively.

In R, a typical workflow is:

library(rstan)
data <- list(N = N, y = y)
fit <- stan(
file = "normal_model.stan",
data = data,
warmup = 1000,
iter = 2000,
chains = 4,
cores = 2
)
print(fit, digits_summary = 3)
postDraws <- extract(fit)
traceplot(fit)
pairs(fit)
  • warmup: iterations used for adaptation and not kept as posterior draws.
  • iter: total iterations per chain, including warm-up.
  • chains: independent chains from different starting points.
  • cores: parallel computation resources.

The printed output and diagnostic plots should be read before interpreting posterior summaries.

A divergent transition means the numerical Hamiltonian trajectory had trouble following the posterior geometry. Divergences can indicate step-size problems, strong curvature, or a model parametrization that is hard for HMC.

Effective sample size measures how much independent-sample information the autocorrelated chains contain.

Low ESS means Monte Carlo error may be large for the reported posterior summary.

R^\hat R compares between-chain and within-chain variation.

Values close to 1 suggest the chains have mixed similarly. Values clearly above 1 indicate that the chains may not be sampling from the same stationary distribution.

Stan automates HMC, but it does not choose a scientifically correct model.

The user still must:

  • specify a meaningful likelihood;
  • choose priors;
  • check convergence diagnostics;
  • perform posterior predictive checks;
  • interpret the model in context.

Good computation is necessary for Bayesian inference, but it is not the same as model adequacy.

Hamiltonian Monte Carlo improves on random walk proposals by using gradients of the log posterior to simulate directed trajectories through parameter space. Momentum variables are artificial helpers: after sampling from the joint distribution of parameters and momentum, we keep the parameter draws and discard the momentum. Leapfrog steps approximate Hamiltonian dynamics, and a Metropolis correction preserves the target distribution. Stan implements HMC through NUTS, automates much of the tuning, and provides diagnostics, but the model and its interpretation remain the analyst’s responsibility.

  1. Why can HMC move farther than random walk Metropolis while still keeping high acceptance probabilities?
  2. What is the role of the momentum variable?
  3. What do the step size, number of leapfrog steps, and mass matrix control?
  4. Why does HMC still need an accept/reject step after leapfrog simulation?
  5. What parts of Bayesian analysis does Stan automate, and what parts remain the user’s responsibility?