Monte Carlo and Gibbs Sampling
Source: Lecture 7 — Monte Carlo and Gibbs sampling (BDA Ch. 10—11).
The Thread of This Chapter
Section titled “The Thread of This Chapter”Bayesian inference often ends with an integral.
We may want a posterior mean, a posterior probability, a predictive quantity, or a marginal distribution. In simple conjugate models these quantities can sometimes be computed exactly. In realistic models they usually cannot.
The main idea of this chapter is:
When an integral is hard, simulate from the distribution and average the simulated values.
The chapter builds this idea in three steps:
- Use ordinary Monte Carlo when we can draw independent samples directly.
- Use Gibbs sampling when we can draw from full conditional distributions.
- Use data augmentation to create conditional distributions that are easy to sample from.
Monte Carlo Simulation
Section titled “Monte Carlo Simulation”What Problem Does It Solve?
Section titled “What Problem Does It Solve?”Suppose we want
If we can draw independent samples
then we can estimate the expectation by averaging the function over the draws:
This is the Monte Carlo principle.
Step 1: Draw From the Target Distribution
Section titled “Step 1: Draw From the Target Distribution”The target might be a prior, a posterior, or a predictive distribution. The only requirement for ordinary Monte Carlo is that the draws are independent samples from the distribution of interest.
For example, if the goal is the posterior mean of , draw
Step 2: Average the Quantity You Care About
Section titled “Step 2: Average the Quantity You Care About”For the posterior mean, use
For a transformed quantity, use
The law of large numbers says that these averages converge to the corresponding expectations as grows.
Step 3: Estimate Probabilities With Indicators
Section titled “Step 3: Estimate Probabilities With Indicators”A probability is also an expectation. For example,
So we estimate it by the fraction of simulated draws satisfying the event:
This is why posterior intervals, tail probabilities, and predictive probabilities are natural simulation outputs.
Monte Carlo Error
Section titled “Monte Carlo Error”For independent draws,
The important message is practical:
Monte Carlo error decreases at rate .
To cut simulation error in half, we need about four times as many draws.
Direct Sampling Methods
Section titled “Direct Sampling Methods”Monte Carlo is easy once we can sample from the target. Sometimes we can do that directly.
Inverse CDF Sampling
Section titled “Inverse CDF Sampling”Purpose
Section titled “Purpose”The inverse CDF method turns a Uniform draw into a draw from a chosen distribution.
The Algorithm
Section titled “The Algorithm”Let be the cumulative distribution function of .
-
Draw
-
Set
Then has distribution .
How to Read It
Section titled “How to Read It”The uniform draw chooses a probability level. The inverse CDF maps that probability level back to the corresponding value on the scale.
For an Exponential distribution,
Set and solve for :
For a standard Cauchy distribution,
so
Discrete Sampling
Section titled “Discrete Sampling”For a discrete distribution, the inverse idea becomes a threshold rule.
-
Draw .
-
Find the smallest value such that
-
Return that .
This is how categorical sampling works: divide the unit interval into probability-sized pieces and see where lands.
Sampling From Relationships
Section titled “Sampling From Relationships”Sometimes a distribution can be generated from simpler distributions.
Examples:
-
If and are independent standard Normal variables, then
-
If are independent standard Normal variables, then
These relationships are useful because many software libraries already sample accurately from basic distributions.
Why We Need Gibbs Sampling
Section titled “Why We Need Gibbs Sampling”Direct sampling is not always available. A posterior distribution may be high-dimensional:
Sampling from the full joint distribution may be hard. But each parameter may be easy to sample once the others are fixed.
Gibbs sampling uses this situation.
Full Conditional Distributions
Section titled “Full Conditional Distributions”Definition
Section titled “Definition”The full conditional distribution for is
where means all parameters except .
What It Means
Section titled “What It Means”The full conditional answers:
If all other parameters were temporarily fixed, what is the distribution of this one parameter?
Gibbs sampling cycles through these conditional distributions. Each update is easy, but together the updates explore the joint posterior.
The Gibbs Sampling Algorithm
Section titled “The Gibbs Sampling Algorithm”Step 1: Start Somewhere
Section titled “Step 1: Start Somewhere”Choose initial values
The starting value does not need to be a typical posterior value, but poor starting values may require a longer warm-up period.
Step 2: Update One Block at a Time
Section titled “Step 2: Update One Block at a Time”At iteration , draw
then
and continue until the last block:
Each new value is used immediately in later updates in the same iteration.
Step 3: Repeat
Section titled “Step 3: Repeat”After many cycles, the draws
behave like dependent draws from the joint posterior distribution.
They are not iid. They are a Markov chain whose stationary distribution is the posterior.
How to Read Gibbs Output
Section titled “How to Read Gibbs Output”Marginal Posterior Draws
Section titled “Marginal Posterior Draws”Even though the sampler updates conditionally, the long-run draws of each component approximate its marginal posterior:
So posterior means, intervals, and probabilities are computed by the same simulation averages as before.
Autocorrelation
Section titled “Autocorrelation”Gibbs draws are dependent. Consecutive draws often look similar, especially when parameters are strongly correlated.
For an autocorrelated chain,
where is the lag- autocorrelation.
The factor
is the inefficiency factor. Larger autocorrelation means fewer effective independent draws.
Example: Bivariate Normal Gibbs Sampler
Section titled “Example: Bivariate Normal Gibbs Sampler”The Target Distribution
Section titled “The Target Distribution”Suppose
The Full Conditionals
Section titled “The Full Conditionals”For this model,
and
What the Example Teaches
Section titled “What the Example Teaches”If is small, the two parameters are weakly related and the chain moves easily.
If is close to 1, the posterior mass lies along a narrow diagonal shape. Updating one coordinate at a time moves slowly along that shape. The sampler is still valid, but it can mix poorly.
Gibbs Sampling for a Normal Model
Section titled “Gibbs Sampling for a Normal Model”The Model
Section titled “The Model”Let
Use conditionally conjugate priors:
Why Gibbs Helps
Section titled “Why Gibbs Helps”The joint posterior for may not be from one simple standard family. But the full conditionals are standard:
and
The sampler alternates:
- Draw given the current .
- Draw given the new .
- Repeat.
Data Augmentation
Section titled “Data Augmentation”What Problem Does It Solve?
Section titled “What Problem Does It Solve?”Some models are hard because the likelihood is not conditionally conjugate. Data augmentation introduces latent variables that make the conditional steps easier.
The pattern is:
- Add unobserved variables .
- Sample parameters given and the data.
- Sample given parameters and the data.
The marginal posterior for the original parameters is unchanged after we ignore the auxiliary variables.
Example: Mixture of Normals
Section titled “Example: Mixture of Normals”The Model
Section titled “The Model”A -component mixture has density
where are mixture weights and is a Normal density.
Step 1: Introduce Membership Indicators
Section titled “Step 1: Introduce Membership Indicators”Let
be the unknown component that generated observation .
Given the indicators, the data are separated into ordinary Normal groups.
Step 2: Update Parameters Given Indicators
Section titled “Step 2: Update Parameters Given Indicators”Conditional on , update:
- mixture weights from a Dirichlet distribution;
- each component’s from its Normal—Inverse- conditional distribution.
Step 3: Update Indicators Given Parameters
Section titled “Step 3: Update Indicators Given Parameters”For each observation,
Normalize these values so they sum to 1, then draw from the resulting categorical distribution.
Example: Probit Regression
Section titled “Example: Probit Regression”The Model
Section titled “The Model”For binary data, probit regression can be written with a latent utility:
We observe , not .
Gibbs Updates
Section titled “Gibbs Updates”-
Given , update as in a Normal linear regression.
-
Given and , update each from a truncated Normal distribution:
- if , draw ;
- if , draw .
The latent utility turns a binary-response problem into repeated Normal-regression steps.
Example: Logistic Regression With Polya-Gamma Variables
Section titled “Example: Logistic Regression With Polya-Gamma Variables”Purpose
Section titled “Purpose”Logistic regression is not conjugate to a Normal prior in the same direct way as linear regression. Polya-Gamma augmentation creates a conditionally Normal update for .
The Updates
Section titled “The Updates”Introduce latent variables
Given , the coefficient update is Normal:
where
and
Here and .
You do not need to memorize the Polya-Gamma distribution for this course. The important idea is the augmentation pattern: add latent variables so that the parameter update becomes a familiar Normal draw.
Example: Regularized Regression
Section titled “Example: Regularized Regression”The Prior
Section titled “The Prior”Shrinkage can be written as a hierarchical prior. For regression coefficients,
with
The Conditional Update
Section titled “The Conditional Update”The Gibbs sampler includes an update for :
The conditional mean is
How to Read It
Section titled “How to Read It”If the coefficients are large, the denominator is large and the posterior mean of decreases. Smaller means less shrinkage. If the coefficients are close to zero, the sampler supports stronger shrinkage.
Improving Gibbs Sampling
Section titled “Improving Gibbs Sampling”Blocking
Section titled “Blocking”If two parameters are strongly correlated, updating them separately can be slow. A blocked Gibbs update samples them together.
The rule of thumb is:
Put strongly correlated parameters in the same block when a joint conditional update is available.
Reparametrization
Section titled “Reparametrization”Sometimes the same model can be written with parameters that are less correlated in the posterior. A better parametrization can improve mixing without changing the underlying model.
Data Augmentation Tradeoff
Section titled “Data Augmentation Tradeoff”Augmentation can make each conditional draw easy. But it can also add dependence and increase autocorrelation. A sampler can be simple per iteration but inefficient per effective draw.
Chapter Summary
Section titled “Chapter Summary”Monte Carlo approximates expectations by simulation. If draws are independent, averages converge with variance decreasing like . Gibbs sampling extends simulation to high-dimensional posteriors by repeatedly sampling from full conditional distributions. The resulting draws are dependent, so autocorrelation and effective sample size matter. Data augmentation introduces latent variables to make difficult conditional distributions tractable, with applications to mixtures, probit and logistic regression, and regularized regression.
Check Your Understanding
Section titled “Check Your Understanding”- Why can a posterior probability be estimated by averaging indicator functions?
- What is a full conditional distribution?
- In a Gibbs sampler, why are the draws usually autocorrelated?
- Why does high correlation slow down one-at-a-time Gibbs updates in the bivariate Normal example?
- How does data augmentation help in mixture models or probit regression?