Bayesian Model Comparison
Source: Lecture 10 — Bayesian Model Comparison (BDA Ch. 7), with BDA Ch. 7’s emphasis on predictive accuracy before discrete model choice.
From Checking to Comparing
The previous chapter asked:
Can one fitted model reproduce the features of the data that matter?
Now suppose we have several models that have survived basic checks. We need to compare them.
The tempting shortcut is to fit each model and compare its best likelihood. For two models, this would compare
$$\max_{\theta_1} p(y \mid \theta_1, M_1)$$
with
$$\max_{\theta_2} p(y \mid \theta_2, M_2).$$
This is not fair. Bigger or more flexible models usually fit the observed data better because they had more freedom to adapt to those same data.
The main idea of this chapter is:
A model should be rewarded for predicting data well, not merely for fitting data after seeing them.
Bayesian model comparison develops this idea through marginal likelihood, Bayes factors, sequential prediction, and Log Predictive Score.
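The unfairness of comparing maximized likelihoods can be seen in a minimal sketch. It assumes a hypothetical pair of nested Bernoulli models (one with the success probability fixed at 0.5, one with it free); neither the models nor the function names come from the lecture. The free model's maximized log-likelihood is never lower, whatever the data.

```python
import math

def bernoulli_loglik(theta, s, n):
    """Log-likelihood of s successes in n ordered Bernoulli trials."""
    return s * math.log(theta) + (n - s) * math.log(1.0 - theta)

def max_loglik_fixed(s, n, theta0=0.5):
    # Smaller model: theta is fixed, so there is nothing to maximize.
    return bernoulli_loglik(theta0, s, n)

def max_loglik_free(s, n):
    # Larger model: theta is free and the MLE is s / n (needs s strictly between 0 and n).
    return bernoulli_loglik(s / n, s, n)

n = 20
# Gap in maximized log-likelihood for every possible success count 1..n-1.
gaps = [max_loglik_free(s, n) - max_loglik_fixed(s, n) for s in range(1, n)]
print(min(gaps), max(gaps))
```

The gap is zero only when the data happen to sit exactly at the fixed value, so the larger model can never lose this comparison.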
Marginal Likelihood
What Is It For?
The marginal likelihood asks:
Before seeing the data, how much probability did this model assign to data like this?
This is different from asking whether the model contains one parameter value that fits well. A model gets credit only if its prior and likelihood together made the observed data plausible.
Step 1: Start With a Model
For model $M_k$, let $\theta_k$ be its parameters.
The likelihood is
$$p(y \mid \theta_k, M_k).$$
The prior is
$$p(\theta_k \mid M_k).$$
Step 2: Average Over the Prior
The model’s prior predictive probability of the observed data is
$$p(y \mid M_k) = \int p(y \mid \theta_k, M_k)\, p(\theta_k \mid M_k)\, d\theta_k.$$
This is the marginal likelihood.
Step 3: Interpret the Average
The integral has two roles:
- It rewards parameter values that predict the data well.
- It penalizes models that spread prior mass over many parameter values that would not have predicted the data.
This is why marginal likelihood can penalize complexity automatically. A flexible model must spend its prior probability across many possible datasets. It wins only if enough of that prior predictive mass is near the observed data.
The warning is equally important:
Priors directly affect model comparison.
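The "average over the prior" reading can be checked numerically. The sketch below assumes a hypothetical Bernoulli model with a Beta(a, b) prior, chosen only because its marginal likelihood has a closed form to compare against. It draws parameter values from the prior and averages the likelihood over those draws.

```python
import math
import random

def log_beta(a, b):
    """Log of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def exact_marginal(s, n, a, b):
    """Closed-form p(y | M) for an ordered Bernoulli sequence with a Beta(a, b) prior."""
    return math.exp(log_beta(a + s, b + n - s) - log_beta(a, b))

def mc_marginal(s, n, a, b, draws=100_000, seed=0):
    """Monte Carlo estimate: average the likelihood over draws from the prior."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(draws):
        theta = rng.betavariate(a, b)          # theta drawn from the prior
        total += theta ** s * (1.0 - theta) ** (n - s)  # likelihood at that theta
    return total / draws

s, n = 7, 10  # hypothetical data: 7 successes in 10 trials
exact = exact_marginal(s, n, 1.0, 1.0)
approx = mc_marginal(s, n, 1.0, 1.0)
print(exact, approx)
```

Simple prior sampling works here because the prior is not far from the likelihood. With a much wider prior most draws contribute almost nothing, which is the numerical face of the complexity penalty described above.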
Bayes Factors
What Are They For?
A Bayes factor compares two models using their marginal likelihoods.
If model $M_1$ assigned more prior predictive probability to the observed data than model $M_2$, the Bayes factor favors $M_1$.
Definition
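Compute the marginal likelihood for model 1:
$$p(y \mid M_1) = \int p(y \mid \theta_1, M_1)\, p(\theta_1 \mid M_1)\, d\theta_1.$$
Compute the marginal likelihood for model 2:
$$p(y \mid M_2) = \int p(y \mid \theta_2, M_2)\, p(\theta_2 \mid M_2)\, d\theta_2.$$
Take the ratio:
$$\mathrm{BF}_{12} = \frac{p(y \mid M_1)}{p(y \mid M_2)}.$$
If $\mathrm{BF}_{12} > 1$, the data favor $M_1$ over $M_2$ on the marginal-likelihood scale. If it is below 1, they favor $M_2$.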
Posterior Model Probabilities
What Extra Ingredient Is Needed?
A Bayes factor compares evidence from the data. Posterior model probabilities also need prior probabilities over models.
Let
$$p(M_k)$$
be the prior probability of model $M_k$.
From Prior Odds to Posterior Odds
Prior odds are
$$\frac{p(M_1)}{p(M_2)}.$$
The Bayes factor updates those odds:
$$\frac{p(M_1 \mid y)}{p(M_2 \mid y)} = \frac{p(y \mid M_1)}{p(y \mid M_2)} \times \frac{p(M_1)}{p(M_2)},$$
where the first factor on the right is the Bayes factor.
For many models, the same idea can be written as
$$p(M_k \mid y) = \frac{p(y \mid M_k)\, p(M_k)}{\sum_j p(y \mid M_j)\, p(M_j)}.$$
This formula has two pieces:
- $p(y \mid M_k)$: how well model $M_k$ predicted the data before seeing them.
- $p(M_k)$: how plausible model $M_k$ was before seeing the data.
Bayes factors and posterior model probabilities answer a discrete-model question: how should probability be allocated across the listed models? They do not answer whether the models fit the data well in an absolute sense. That is why posterior predictive checking came first.
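As a minimal sketch of the many-model formula, the code below turns log marginal likelihoods and prior model probabilities into posterior model probabilities. The three log marginal likelihoods are made-up numbers for illustration, and `posterior_model_probs` is a hypothetical helper name. It works on the log scale because marginal likelihoods are often too small to exponentiate directly.

```python
import math

def posterior_model_probs(log_marginals, prior_probs):
    """Return p(M_k | y) from log p(y | M_k) and p(M_k)."""
    log_joint = [lm + math.log(p) for lm, p in zip(log_marginals, prior_probs)]
    shift = max(log_joint)  # log-sum-exp shift for numerical stability
    weights = [math.exp(v - shift) for v in log_joint]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical log marginal likelihoods for three models, equal prior probabilities.
log_marginals = [-104.2, -102.9, -110.5]
prior_probs = [1 / 3, 1 / 3, 1 / 3]
post = posterior_model_probs(log_marginals, prior_probs)

# Posterior odds of model 2 against model 1 equal Bayes factor times prior odds.
bayes_factor_21 = math.exp(log_marginals[1] - log_marginals[0])
prior_odds_21 = prior_probs[1] / prior_probs[0]
posterior_odds_21 = post[1] / post[0]
print(post, posterior_odds_21)
```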
Example: Bayesian Hypothesis Testing with Bernoulli Data
What Is the Question?
Hypothesis testing can be written as model comparison.
Suppose the data $y_1, \dots, y_n$ are Bernoulli observations. Let
$$s = \sum_{i=1}^{n} y_i$$
be the number of successes.
Compare two models.
Model 0: Fixed Success Probability
The point-null model fixes the success probability:
$$M_0: \theta = \theta_0.$$
Its marginal likelihood is just the likelihood at the fixed value:
$$p(y \mid M_0) = \theta_0^{s} (1 - \theta_0)^{n - s}.$$
Model 1: Unknown Success Probability
The alternative model learns $\theta$:
$$M_1: \theta \in (0, 1) \text{ unknown}.$$
Use a Beta prior:
$$\theta \sim \mathrm{Beta}(a, b).$$
After integrating over $\theta$, the marginal likelihood is
$$p(y \mid M_1) = \frac{B(a + s,\, b + n - s)}{B(a, b)},$$
where $B(\cdot, \cdot)$ is the Beta function.
The Bayes Factor
The Bayes factor comparing $M_0$ to $M_1$ is
$$\mathrm{BF}_{01} = \frac{p(y \mid M_0)}{p(y \mid M_1)}.$$
Substitute the two pieces:
$$\mathrm{BF}_{01} = \frac{\theta_0^{s} (1 - \theta_0)^{n - s}\, B(a, b)}{B(a + s,\, b + n - s)}.$$
Read this as a comparison between:
- the probability of the data under the fixed value $\theta_0$;
- the prior predictive probability of the data under the uncertain-$\theta$ model.
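The comparison can be sketched in a few lines. The code assumes a point null at a fixed success probability `theta0` against an alternative with a Beta(a, b) prior, and computes the Bayes factor of the null against the alternative from the number of successes `s` in `n` trials. The defaults (`theta0 = 0.5`, a uniform prior) and the two datasets are illustrative choices, not values from the lecture.

```python
import math

def log_beta(a, b):
    """Log of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def log_bf01(s, n, theta0=0.5, a=1.0, b=1.0):
    """Log Bayes factor of the point null against the Beta-prior alternative."""
    # Null model: likelihood evaluated at the fixed value theta0.
    log_m0 = s * math.log(theta0) + (n - s) * math.log(1.0 - theta0)
    # Alternative: likelihood averaged over the Beta(a, b) prior.
    log_m1 = log_beta(a + s, b + n - s) - log_beta(a, b)
    return log_m0 - log_m1

bf_balanced = math.exp(log_bf01(s=10, n=20))  # data close to theta0
bf_lopsided = math.exp(log_bf01(s=18, n=20))  # data far from theta0
print(bf_balanced, bf_lopsided)
```

With balanced data the null is favored, because the alternative spent prior mass on success probabilities that would have predicted very different data. With lopsided data the alternative wins.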
Why Priors Matter
The marginal likelihood averages the likelihood over the prior. Therefore the prior affects model comparison directly.
This can surprise us because weak priors are often harmless for parameter estimation. For Bayes factors, a very wide prior can be harmful: it spreads probability mass over many values that predict data far away from what was observed.
The practical rule is:
A prior used for Bayes factors should represent real prior predictive beliefs, not just weak regularization.
Improper priors cannot be used for Bayes factors: an improper prior is defined only up to an arbitrary multiplicative constant, so the marginal likelihood, and with it the Bayes factor, inherits that arbitrary constant.
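The effect of prior width can be illustrated with a small sketch. It assumes a hypothetical normal-mean setup that is not taken from the lecture: observations are normal with known standard deviation `sigma`, the null fixes the mean at 0, and the alternative puts a normal prior with mean 0 and standard deviation `tau` on the mean. Only the sample mean `ybar` matters, because the remaining part of the likelihood is identical under both models and cancels in the ratio.

```python
import math

def normal_logpdf(x, mean, var):
    """Log density of a normal distribution with the given mean and variance."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def log_bf01_normal_mean(ybar, n, sigma, tau):
    """Log Bayes factor of 'mean = 0' against 'mean ~ N(0, tau^2)'."""
    se2 = sigma ** 2 / n  # sampling variance of the sample mean
    log_m0 = normal_logpdf(ybar, 0.0, se2)             # null: ybar ~ N(0, sigma^2 / n)
    log_m1 = normal_logpdf(ybar, 0.0, se2 + tau ** 2)  # alternative: prior variance adds on
    return log_m0 - log_m1

# Same hypothetical data summary, increasingly diffuse priors under the alternative.
taus = [1.0, 10.0, 100.0, 1000.0]
bfs = [math.exp(log_bf01_normal_mean(ybar=0.5, n=10, sigma=1.0, tau=t)) for t in taus]
print(bfs)
```

Holding the data fixed, the Bayes factor in favor of the smaller model keeps growing as `tau` increases, even though the posterior for the mean under the alternative barely changes.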
Example: Geometric vs. Poisson Counts
The Competing Models
Model 1 is geometric:
$$p(y_i \mid \theta, M_1) = \theta (1 - \theta)^{y_i}, \quad y_i = 0, 1, 2, \dots$$
Model 2 is Poisson:
$$p(y_i \mid \lambda, M_2) = \frac{\lambda^{y_i} e^{-\lambda}}{y_i!}, \quad y_i = 0, 1, 2, \dots$$
The two models can be matched to have similar prior predictive means. Even then, their shapes differ.
What the Example Teaches
If the observed counts are mostly zeros, the geometric model is favored because it naturally puts more mass near repeated zeros.
If the observed counts are repeated moderate values, the Poisson model is favored because such counts are more typical under that model.
As the sample size grows, the posterior model probability concentrates on the true model if it is in the list. If the true model is not in the list, it concentrates on the model closest in KL divergence.
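The qualitative pattern can be reproduced with conjugate priors. Everything specific in this sketch is an assumption made for illustration: a Beta(2, 1) prior on the geometric success probability, a Gamma(1, rate 1) prior on the Poisson rate (both give a prior predictive mean of 1), and the two made-up datasets. The geometric model counts failures before the first success, so zeros are allowed.

```python
import math

def log_beta(a, b):
    """Log of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def log_marginal_geometric(y, a=2.0, b=1.0):
    """log p(y | M1) with p(y_i | theta) = theta * (1 - theta)**y_i, theta ~ Beta(a, b)."""
    n, total = len(y), sum(y)
    return log_beta(a + n, b + total) - log_beta(a, b)

def log_marginal_poisson(y, alpha=1.0, beta=1.0):
    """log p(y | M2) with y_i ~ Poisson(lam), lam ~ Gamma(alpha, rate=beta)."""
    n, total = len(y), sum(y)
    return (alpha * math.log(beta) - math.lgamma(alpha)
            + math.lgamma(alpha + total)
            - (alpha + total) * math.log(beta + n)
            - sum(math.lgamma(v + 1) for v in y))  # log of the product of y_i!

def log_bf12(y):
    """Log Bayes factor of the geometric model against the Poisson model."""
    return log_marginal_geometric(y) - log_marginal_poisson(y)

mostly_zeros = [0, 0, 0, 0, 0, 0, 6, 0, 0, 0]     # hypothetical data
moderate_counts = [2, 3, 2, 3, 2, 3, 2, 3, 2, 3]  # hypothetical data
print(log_bf12(mostly_zeros), log_bf12(moderate_counts))
```

A positive log Bayes factor favors the geometric model and a negative one favors the Poisson model.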
Useful Properties and Warnings
Bayesian model comparison has attractive theoretical properties:
- Coherence: pairwise Bayes factors are consistent with one another, so $\mathrm{BF}_{13} = \mathrm{BF}_{12} \times \mathrm{BF}_{23}$, where $\mathrm{BF}_{ij} = p(y \mid M_i) / p(y \mid M_j)$.
- Consistency: if the true model is in the list, its posterior probability tends to 1 as the sample size $n$ grows.
- KL-consistency: if the true model is not in the list, the best approximating model wins asymptotically.
But there are important warnings:
- Vague priors can favor smaller models.
- Improper priors cannot be used for Bayes factors.
- Severe misspecification can make posterior model probabilities misleading.
- Outliers and multivariate responses can make comparisons fragile.
Marginal Likelihood as Sequential Prediction
Why This View Helps
The marginal likelihood can feel abstract because it is an integral. A more intuitive view is sequential prediction.
Imagine observing the data one point at a time. The model must predict each new observation using only the observations already seen.
Step 1: Decompose the Joint Density
For ordered data, write $y_{1:i-1} = (y_1, \dots, y_{i-1})$. Then
$$p(y_1, \dots, y_n \mid M_k) = \prod_{i=1}^{n} p(y_i \mid y_{1:i-1}, M_k).$$
This is just the chain rule for probability.
Step 2: Interpret One Factor
The prediction for observation $i$ is
$$p(y_i \mid y_{1:i-1}, M_k).$$
Because $\theta_k$ is unknown, average over the posterior based on the previous observations:
$$p(y_i \mid y_{1:i-1}, M_k) = \int p(y_i \mid \theta_k, y_{1:i-1}, M_k)\, p(\theta_k \mid y_{1:i-1}, M_k)\, d\theta_k.$$
When the observations are conditionally independent given $\theta_k$, the first term inside the integral reduces to $p(y_i \mid \theta_k, M_k)$.
Step 3: Notice the Prior Sensitivity
The first factor
$$p(y_1 \mid M_k) = \int p(y_1 \mid \theta_k, M_k)\, p(\theta_k \mid M_k)\, d\theta_k$$
uses only the prior. It can be very sensitive to prior choices.
Later factors condition on more and more data:
$$p(y_i \mid y_{1:i-1}, M_k), \quad i \gg 1.$$
These predictions are usually less sensitive to the original prior.
This explains why marginal likelihood can be prior-sensitive: the earliest factors are prior predictive, or nearly so.
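The chain-rule identity can be verified numerically. The sketch assumes a hypothetical Bernoulli model with a Beta(a, b) prior, where each one-step-ahead predictive probability has a closed form, and checks that the log predictive terms add up to the log marginal likelihood of the whole sequence.

```python
import math

def log_beta(a, b):
    """Log of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def log_marginal(y, a=1.0, b=1.0):
    """Log marginal likelihood of an ordered 0/1 sequence under a Beta(a, b) prior."""
    s, n = sum(y), len(y)
    return log_beta(a + s, b + n - s) - log_beta(a, b)

def sequential_log_predictives(y, a=1.0, b=1.0):
    """log p(y_i | y_1..y_{i-1}) for each i, updating the Beta posterior as we go."""
    terms = []
    for obs in y:
        p_one = a / (a + b)  # predictive probability of a 1 given the data so far
        terms.append(math.log(p_one if obs == 1 else 1.0 - p_one))
        a, b = a + obs, b + (1 - obs)  # conjugate posterior update
    return terms

y = [1, 0, 1, 1, 0, 1, 1, 1]  # hypothetical data
terms = sequential_log_predictives(y)
print(sum(terms), log_marginal(y))
```

The first entry of `terms` depends only on the prior, while later entries depend on it less and less as the counts accumulate.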
Normal Example
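The Model
Assume
$$y_i \mid \theta \sim \mathcal{N}(\theta, \sigma^2), \quad i = 1, \dots, n,$$
with known $\sigma^2$, and prior
$$\theta \sim \mathcal{N}(0, \tau^2).$$
The parameter $\tau$ controls prior spread. Large $\tau$ means a diffuse prior.
Posterior Mean Before Observation i
Before seeing $y_i$, we have observed $y_1, \dots, y_{i-1}$.
Let $\bar{y}_{i-1}$ be their average. The posterior mean for $\theta$ is a shrinkage version of that average:
$$\mu_{i-1} = \frac{(i-1)\tau^2}{(i-1)\tau^2 + \sigma^2}\, \bar{y}_{i-1}.$$
The multiplier is small early and close to 1 later. For $i = 1$ it is 0, so $\mu_0 = 0$, the prior mean.
Posterior Variance Before Observation i
The posterior variance for $\theta$ is
$$v_{i-1} = \frac{\sigma^2 \tau^2}{(i-1)\tau^2 + \sigma^2}.$$
As $i$ grows, more data have been observed and this variance shrinks.
Predictive Distribution
A new observation has two sources of uncertainty:
- noise in the new observation, $\sigma^2$;
- uncertainty about $\theta$, $v_{i-1}$.
Therefore
$$y_i \mid y_1, \dots, y_{i-1} \sim \mathcal{N}\left(\mu_{i-1},\ \sigma^2 + v_{i-1}\right).$$
Substituting the two pieces gives
$$y_i \mid y_1, \dots, y_{i-1} \sim \mathcal{N}\left(\frac{(i-1)\tau^2}{(i-1)\tau^2 + \sigma^2}\, \bar{y}_{i-1},\ \sigma^2 + \frac{\sigma^2 \tau^2}{(i-1)\tau^2 + \sigma^2}\right).$$
For $i = 1$, the prediction uses only the prior and is sensitive to $\tau$. For large $i$, the prediction is close to
$$\mathcal{N}\left(\bar{y}_{i-1},\ \sigma^2\right),$$
which is much less sensitive to $\tau$.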
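The fading influence of the prior can be seen by computing the one-step-ahead predictive mean and variance under two prior widths. The sketch assumes the normal model with known noise standard deviation `sigma` and a normal prior on the mean centered at 0 with standard deviation `tau`. The data values and the function name are hypothetical.

```python
def sequential_predictives(y, sigma, tau):
    """Return (mean, variance) of p(y_i | y_1..y_{i-1}) for each i, prior mean 0."""
    out = []
    total = 0.0  # running sum of the observations seen so far
    for k, obs in enumerate(y):  # k = i - 1 observations already seen
        denom = k * tau ** 2 + sigma ** 2
        post_mean = tau ** 2 * total / denom     # shrinkage multiplier times the average
        post_var = sigma ** 2 * tau ** 2 / denom
        out.append((post_mean, sigma ** 2 + post_var))  # noise plus parameter uncertainty
        total += obs
    return out

# Hypothetical data.
y = [1.2, 0.7, 1.9, 1.1, 0.4, 1.5, 0.9, 1.3, 1.6, 0.8,
     1.0, 1.4, 0.6, 1.7, 1.1, 0.9, 1.2, 1.3, 0.5, 1.0]
narrow = sequential_predictives(y, sigma=1.0, tau=1.0)
wide = sequential_predictives(y, sigma=1.0, tau=100.0)
print(narrow[0], wide[0])    # first prediction: prior only
print(narrow[-1], wide[-1])  # last prediction: mostly data
```

The first predictive variance differs by orders of magnitude between the two priors, while the last predictions nearly coincide.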
Log Predictive Score
What Problem Does It Solve?
If marginal likelihood is too sensitive to early prior-only predictions, we can evaluate predictions only after the model has been trained on some data.
This is the idea behind Log Predictive Score.
Step 1: Split the Sequence
Use the first $m$ observations for training:
$$y_1, \dots, y_m.$$
Use the remaining $n - m$ observations for testing:
$$y_{m+1}, \dots, y_n.$$
Step 2: Predict the Test Observations Sequentially
After training on the first $m$ observations, predict
$$p(y_{m+1} \mid y_1, \dots, y_m, M_k).$$
Then include that observation and predict the next one:
$$p(y_{m+2} \mid y_1, \dots, y_{m+1}, M_k).$$
Continue until $y_n$.
Step 3: Add Log Predictive Densities
The Log Predictive Score is
$$\mathrm{LPS}_k = \sum_{i=m+1}^{n} \log p(y_i \mid y_1, \dots, y_{i-1}, M_k).$$
Higher LPS means better predictive performance on the held-out part.
For time series, the ordering is natural: train on the past and predict the future. For cross-sectional data, use cross-validation folds instead of a single time order.
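A minimal sketch of the score, assuming a normal model with known noise standard deviation `sigma` and a normal prior on the mean centered at 0 with standard deviation `tau`. The data, the training size `m = 5`, and the function names are hypothetical. It compares how much the full log marginal likelihood and the Log Predictive Score each move when the prior is made 100 times wider.

```python
import math

def normal_logpdf(x, mean, var):
    """Log density of a normal distribution with the given mean and variance."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def log_predictive_terms(y, sigma, tau):
    """log p(y_i | y_1..y_{i-1}) for each i under the normal model with prior mean 0."""
    terms, total = [], 0.0
    for k, obs in enumerate(y):  # k observations already seen
        denom = k * tau ** 2 + sigma ** 2
        mean = tau ** 2 * total / denom
        var = sigma ** 2 + sigma ** 2 * tau ** 2 / denom
        terms.append(normal_logpdf(obs, mean, var))
        total += obs
    return terms

def log_predictive_score(y, sigma, tau, m):
    """Sum the log predictive densities over the test part, observations m+1..n."""
    return sum(log_predictive_terms(y, sigma, tau)[m:])

y = [1.2, 0.7, 1.9, 1.1, 0.4, 1.5, 0.9, 1.3, 1.6, 0.8]  # hypothetical data
# Sensitivity to the prior width: full log marginal likelihood vs. score after training.
full_gap = abs(sum(log_predictive_terms(y, 1.0, 1.0))
               - sum(log_predictive_terms(y, 1.0, 100.0)))
lps_gap = abs(log_predictive_score(y, 1.0, 1.0, m=5)
              - log_predictive_score(y, 1.0, 100.0, m=5))
print(full_gap, lps_gap)
```

Dropping the first `m` terms removes the predictions that were driven almost entirely by the prior, which is why the score moves far less than the full marginal likelihood when the prior changes.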
Be Careful with Bayes Factors
Use Bayes factors with extra care when:
- The compared models are very different in structure.
- The models are severely misspecified.
- The models are very complex or black-box.
- Priors are weak defaults rather than carefully chosen prior predictive distributions.
- The data contain outliers that none of the models handle.
In such cases, posterior predictive checks and predictive scores are often easier to interpret than posterior probabilities over a fragile list of models.
Chapter Summary
Model comparison should not begin with in-sample fit. The marginal likelihood asks how well a model predicted the data before seeing them, averaging over the prior. Bayes factors compare marginal likelihoods and update prior model odds into posterior model odds. This is coherent for a fixed list of proper models, but sensitive to priors. The sequential predictive view explains why: early factors in the marginal likelihood are prior predictive. Log Predictive Score reduces that sensitivity by training first and evaluating later.
Check Your Understanding
- Why is comparing maximum likelihood values usually unfair to smaller models?
- What does the marginal likelihood average over?
- How does a Bayes factor update prior model odds?
- Why can vague priors favor smaller models in Bayes-factor comparisons?
- Why is Log Predictive Score less sensitive to the prior than the full marginal likelihood?