
Bayesian Model Comparison

Source: Lecture 10 — Bayesian Model Comparison (BDA Ch. 7), with BDA Ch. 7’s emphasis on predictive accuracy before discrete model choice.

The previous chapter asked:

Can one fitted model reproduce the features of the data that matter?

Now suppose we have several models that have survived basic checks. We need to compare them.

The tempting shortcut is to fit each model and compare its best likelihood. For two models, this would compare

$$p_1(y \mid \hat\theta_1)$$

with

$$p_2(y \mid \hat\theta_2).$$

This is not fair. Bigger or more flexible models usually fit the observed data better because they had more freedom to adapt to those same data.
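To see why the comparison is lopsided, here is a minimal sketch with a Bernoulli model and hypothetical counts (6 successes, 4 failures, chosen only for illustration). The function names are assumptions of this sketch. A model with a free success probability can never have a lower maximized likelihood than a model that fixes it:

```python
import math


def bernoulli_loglik(theta, s, f):
    """Log-likelihood of s successes and f failures at success probability theta."""
    ll = 0.0
    if s > 0:
        ll += s * math.log(theta)
    if f > 0:
        ll += f * math.log(1.0 - theta)
    return ll


def max_loglik_fixed(theta0, s, f):
    # No free parameter: the "best" likelihood is the only one available.
    return bernoulli_loglik(theta0, s, f)


def max_loglik_free(s, f):
    # Free theta: the likelihood is maximized at the MLE s / (s + f).
    theta_hat = s / (s + f)
    return bernoulli_loglik(theta_hat, s, f)


# Hypothetical data, for illustration only.
s, f = 6, 4
print("fixed theta = 0.5:", max_loglik_fixed(0.5, s, f))
print("free theta (MLE):  ", max_loglik_free(s, f))
```

The free model wins or ties on every dataset, so the maximized likelihood alone cannot tell us whether the extra flexibility was needed.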

The main idea of this chapter is:

A model should be rewarded for predicting data well, not merely for fitting data after seeing them.

Bayesian model comparison develops this idea through marginal likelihood, Bayes factors, sequential prediction, and Log Predictive Score.

The marginal likelihood asks:

Before seeing the data, how much probability did this model assign to data like this?

This is different from asking whether the model contains one parameter value that fits well. A model gets credit only if its prior and likelihood together made the observed data plausible.

For model $M_k$, let $\theta_k$ be its parameters.

The likelihood is

$$p_k(y \mid \theta_k).$$

The prior is

$$p_k(\theta_k).$$

The model’s prior predictive probability of the observed data is

$$p_k(y) = \int p_k(y \mid \theta_k)\,p_k(\theta_k)\,d\theta_k.$$

This is the marginal likelihood.

The integral has two roles:

  1. It rewards parameter values that predict the data well.
  2. It penalizes models that spread prior mass over many parameter values that would not have predicted the data.

This is why marginal likelihood can penalize complexity automatically. A flexible model must spend its prior probability across many possible datasets. It wins only if enough of that prior predictive mass is near the observed data.
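The integral can be approximated directly when the parameter is one-dimensional. The sketch below is an illustration under stated assumptions, not a general-purpose method: it uses a midpoint rule, a Bernoulli likelihood with hypothetical counts (7 successes, 3 failures), and a uniform prior on the success probability. The helper name `marginal_likelihood_grid` is made up for this example.

```python
def marginal_likelihood_grid(likelihood, prior_density, lo, hi, n_grid=20_000):
    """Midpoint-rule approximation of the integral of likelihood * prior over [lo, hi]."""
    width = (hi - lo) / n_grid
    total = 0.0
    for j in range(n_grid):
        t = lo + (j + 0.5) * width
        total += likelihood(t) * prior_density(t)
    return total * width


# Hypothetical data: 7 successes and 3 failures in a fixed order.
s, f = 7, 3


def likelihood(theta):
    return theta**s * (1.0 - theta) ** f


def uniform_prior(theta):
    # Uniform density on [0, 1] (an assumption of this sketch).
    return 1.0


evidence = marginal_likelihood_grid(likelihood, uniform_prior, 0.0, 1.0)
best_fit = likelihood(s / (s + f))  # likelihood at the best-fitting theta

print("marginal likelihood:", evidence)
print("maximized likelihood:", best_fit)
```

The marginal likelihood is smaller than the maximized likelihood because it averages over every value of the parameter the prior allowed, including the ones that predict these data poorly.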

The warning is equally important:

Priors directly affect model comparison.

A Bayes factor compares two models using their marginal likelihoods.

If model $M_1$ assigned more prior predictive probability to the observed data than model $M_2$, the Bayes factor favors $M_1$.

Compute the marginal likelihood for model 1:

$$p_1(y).$$

Compute the marginal likelihood for model 2:

$$p_2(y).$$

Take the ratio:

$$B_{12}(y) = \frac{p_1(y)}{p_2(y)}.$$

If $B_{12}(y) > 1$, the data favor $M_1$ over $M_2$ on the marginal-likelihood scale. If it is below 1, they favor $M_2$.

A Bayes factor compares evidence from the data. Posterior model probabilities also need prior probabilities over models.

Let

$$\Pr(M_k)$$

be the prior probability of model $M_k$.

Prior odds are

$$\frac{\Pr(M_1)}{\Pr(M_2)}.$$

The Bayes factor updates those odds:

$$\frac{\Pr(M_1 \mid y)}{\Pr(M_2 \mid y)} = \frac{\Pr(M_1)}{\Pr(M_2)}\,B_{12}(y).$$

For many models, the same idea can be written as

$$\Pr(M_k \mid y) \propto p(y \mid M_k)\,\Pr(M_k).$$

This formula has two pieces:

  • $p(y \mid M_k)$: how well model $M_k$ predicted the data before seeing them.
  • $\Pr(M_k)$: how plausible model $M_k$ was before seeing the data.

Bayes factors and posterior model probabilities answer a discrete-model question: how should probability be allocated across the listed models? They do not answer whether the models fit the data well in an absolute sense. That is why posterior predictive checking came first.
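Turning marginal likelihoods and prior model probabilities into posterior model probabilities is a normalization step. A minimal sketch follows, working on the log scale because marginal likelihoods are often tiny. The function name and the numbers are hypothetical, chosen only to show the arithmetic.

```python
import math


def posterior_model_probs(log_marginals, prior_probs):
    """Normalize p(y | M_k) * Pr(M_k) over the listed models.

    log_marginals: log p(y | M_k) for each model.
    prior_probs:   Pr(M_k) for each model (positive, summing to 1).
    """
    log_post = [lm + math.log(pr) for lm, pr in zip(log_marginals, prior_probs)]
    shift = max(log_post)  # subtract the max before exponentiating, for stability
    weights = [math.exp(lp - shift) for lp in log_post]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical log marginal likelihoods for three models, equal prior probabilities.
log_marginals = [-10.0, -11.0, -13.0]
prior_probs = [1 / 3, 1 / 3, 1 / 3]
probs = posterior_model_probs(log_marginals, prior_probs)
print(probs)
```

With equal prior probabilities, the ratio of any two posterior probabilities is the Bayes factor between those two models. The probabilities always sum to 1 over the list, even if every listed model fits badly.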

Example: Bayesian Hypothesis Testing with Bernoulli Data


Hypothesis testing can be written as model comparison.

Suppose the data are Bernoulli observations. Let

$$s = \text{number of successes}, \qquad f = \text{number of failures}.$$

Compare two models.

The point-null model fixes the success probability:

$$M_0: x_1,\ldots,x_n \mid \theta_0 \sim \mathrm{Bernoulli}(\theta_0).$$

Its marginal likelihood is just the likelihood at the fixed value:

$$p(x \mid M_0) = \theta_0^s(1-\theta_0)^f.$$

The alternative model learns $\theta$:

$$M_1: x_1,\ldots,x_n \mid \theta \sim \mathrm{Bernoulli}(\theta).$$

Use a Beta prior:

$$\theta \sim \mathrm{Beta}(\alpha,\beta).$$

After integrating over $\theta$, the marginal likelihood is

$$p(x \mid M_1) = \frac{B(\alpha+s,\beta+f)}{B(\alpha,\beta)}.$$

The Bayes factor comparing $M_0$ to $M_1$ is

$$BF(M_0;M_1) = \frac{p(x \mid M_0)}{p(x \mid M_1)}.$$

Substitute the two pieces:

$$BF(M_0;M_1) = \frac{\theta_0^s(1-\theta_0)^f\,B(\alpha,\beta)}{B(\alpha+s,\beta+f)}.$$

Read this as a comparison between:

  • the probability of the data under the fixed value $\theta_0$;
  • the prior predictive probability of the data under the uncertain-$\theta$ model.
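The Bayes factor above can be evaluated with log-gamma functions. This is a sketch under stated assumptions: the helper names are made up, and the counts, $\theta_0 = 0.5$, and the uniform $\mathrm{Beta}(1,1)$ prior are hypothetical choices for illustration.

```python
import math


def log_beta(a, b):
    """log B(a, b) computed from log-gamma functions."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def log_bf_point_null(s, f, theta0, alpha, beta):
    """log BF(M0; M1) for s successes and f failures.

    M0 fixes the success probability at theta0 (0 < theta0 < 1).
    M1 places a Beta(alpha, beta) prior on it.
    """
    log_m0 = s * math.log(theta0) + f * math.log(1.0 - theta0)
    log_m1 = log_beta(alpha + s, beta + f) - log_beta(alpha, beta)
    return log_m0 - log_m1


# Hypothetical data: balanced counts versus all successes.
bf_balanced = math.exp(log_bf_point_null(5, 5, 0.5, 1.0, 1.0))
bf_lopsided = math.exp(log_bf_point_null(10, 0, 0.5, 1.0, 1.0))
print("5 successes, 5 failures:", bf_balanced)
print("10 successes, 0 failures:", bf_lopsided)
```

With balanced counts the Bayes factor is about 2.7 in favor of the point null. With ten successes in a row it falls far below 1, because the fixed value $\theta_0 = 0.5$ predicted those data poorly.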

The marginal likelihood averages the likelihood over the prior. Therefore the prior affects model comparison directly.

This can surprise us because weak priors are often harmless for parameter estimation. For Bayes factors, a very wide prior can be harmful: it spreads probability mass over many values that predict data far away from what was observed.

The practical rule is:

A prior used for Bayes factors should represent real prior predictive beliefs, not just weak regularization.

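Improper priors cannot be used for Bayes factors. An improper prior is defined only up to an arbitrary multiplicative constant, so the marginal likelihood inherits that constant and the Bayes factor can be made to take any value.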
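The effect of prior width can be shown numerically. The sketch below is an illustration with an assumed setup: normal data with known variance, a point null $\theta = 0$, and an alternative with prior $\theta \sim N(0,\kappa^2\sigma^2)$. Because the likelihood depends on $\theta$ only through the sample mean, the Bayes factor reduces to a ratio of two normal densities for that mean. The data summary (mean 0.5 from 20 observations) and the function names are hypothetical.

```python
import math


def normal_logpdf(x, mean, var):
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def log_bf_null_vs_alternative(ybar, n, sigma2, kappa):
    """log Bayes factor of the point null theta = 0 against theta ~ N(0, kappa^2 sigma2).

    Under the null the sample mean is N(0, sigma2 / n).
    Under the alternative it is N(0, kappa^2 sigma2 + sigma2 / n).
    """
    var_null = sigma2 / n
    var_alt = kappa**2 * sigma2 + sigma2 / n
    return normal_logpdf(ybar, 0.0, var_null) - normal_logpdf(ybar, 0.0, var_alt)


# Hypothetical data summary: the same data, scored under increasingly wide priors.
ybar, n, sigma2 = 0.5, 20, 1.0
for kappa in (1.0, 10.0, 100.0, 1000.0):
    print(kappa, log_bf_null_vs_alternative(ybar, n, sigma2, kappa))
```

In this setup the same data favor the alternative when $\kappa = 1$ and favor the point null when $\kappa = 1000$. Nothing about the data changed, only how thinly the alternative spread its prior predictive mass.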

As a second example, compare two models for count data. Model 1 is geometric:

$$y_i \mid \theta_1 \sim \mathrm{Geo}(\theta_1), \qquad \theta_1 \sim \mathrm{Beta}(\alpha_1,\beta_1).$$

Model 2 is Poisson:

$$y_i \mid \theta_2 \sim \mathrm{Pois}(\theta_2), \qquad \theta_2 \sim \mathrm{Gamma}(\alpha_2,\beta_2).$$

The two models can be matched to have similar prior predictive means. Even then, their shapes differ.

If the data are

$$y = (0,0),$$

the geometric model is favored because it naturally puts more mass near repeated zeros.

If the data are

$$y = (3,3),$$

the Poisson model is favored because repeated moderate counts are more typical under that model.
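Both marginal likelihoods have closed forms, so the two cases can be checked directly. The sketch below rests on assumptions not fixed by the text: the geometric distribution counts failures before the first success (support starting at 0), the Gamma prior is parameterized by shape and rate, and the hyperparameters $\mathrm{Beta}(2,1)$ and $\mathrm{Gamma}(1,1)$ are hypothetical values chosen so that both prior predictive means equal 1.

```python
import math


def log_beta(a, b):
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def log_marginal_geometric(y, alpha, beta):
    """Geometric on {0, 1, 2, ...}: p(y | theta) = theta * (1 - theta)^y, theta ~ Beta."""
    n, total = len(y), sum(y)
    return log_beta(alpha + n, beta + total) - log_beta(alpha, beta)


def log_marginal_poisson(y, alpha, beta):
    """Poisson likelihood with theta ~ Gamma(shape=alpha, rate=beta)."""
    n, total = len(y), sum(y)
    return (
        alpha * math.log(beta)
        - math.lgamma(alpha)
        + math.lgamma(alpha + total)
        - (alpha + total) * math.log(beta + n)
        - sum(math.lgamma(v + 1) for v in y)  # log of the product of y_i!
    )


def log_bf_geo_vs_pois(y, a1=2.0, b1=1.0, a2=1.0, b2=1.0):
    # Positive values favor the geometric model, negative values the Poisson model.
    return log_marginal_geometric(y, a1, b1) - log_marginal_poisson(y, a2, b2)


for y in ([0, 0], [3, 3]):
    print(y, math.exp(log_bf_geo_vs_pois(y)))
```

Under these assumed hyperparameters the Bayes factor is 1.5 in favor of the geometric model for $y = (0,0)$ and about 0.26 (favoring Poisson) for $y = (3,3)$, matching the direction described above. Other hyperparameters can change the numbers.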

As the sample size grows, the posterior model probability concentrates on the true model if it is in the list. If the true model is not in the list, it concentrates on the model closest in KL divergence.

Bayesian model comparison has attractive theoretical properties:

  • Coherence: pairwise Bayes factors are consistent with one another, so $B_{12} = B_{13}B_{32}$.
  • Consistency: if the true model is in the list, its posterior probability tends to 1 as $n$ grows.
  • KL-consistency: if the true model is not in the list, the best approximating model wins asymptotically.

But there are important warnings:

  • Vague priors can favor smaller models.
  • Improper priors cannot be used for Bayes factors.
  • Severe misspecification can make posterior model probabilities misleading.
  • Outliers and multivariate responses can make comparisons fragile.

Marginal Likelihood as Sequential Prediction


The marginal likelihood can feel abstract because it is an integral. A more intuitive view is sequential prediction.

Imagine observing the data one point at a time. The model must predict each new observation using only the observations already seen.

For ordered data,

$$p(y_1,\ldots,y_n) = p(y_1)\,p(y_2 \mid y_1)\cdots p(y_n \mid y_1,\ldots,y_{n-1}).$$

This is just the chain rule for probability.

The prediction for observation $i$ is

$$p(y_i \mid y_1,\ldots,y_{i-1}).$$

Because $\theta$ is unknown, average over the posterior based on the previous observations:

$$p(y_i \mid y_1,\ldots,y_{i-1}) = \int p(y_i \mid \theta)\,p(\theta \mid y_1,\ldots,y_{i-1})\,d\theta.$$

The first factor

$$p(y_1)$$

uses only the prior. It can be very sensitive to prior choices.

Later factors condition on more and more data:

$$p(y_i \mid y_1,\ldots,y_{i-1}) \quad \text{for large } i.$$

These predictions are usually less sensitive to the original prior.

This explains why marginal likelihood can be prior-sensitive: the first factor is a pure prior prediction, and the next few factors are still dominated by the prior.

Assume

$$y_1,\ldots,y_n \mid \theta \sim N(\theta,\sigma^2),$$

with known $\sigma^2$, and prior

$$\theta \sim N(0,\kappa^2\sigma^2).$$

The parameter $\kappa$ controls prior spread. Large $\kappa$ means a diffuse prior.

Before seeing $y_i$, we have observed $y_1,\ldots,y_{i-1}$.

Let $\bar y_{i-1}$ be their average. The posterior mean for $\theta$ is a shrinkage version of that average:

$$m_{i-1} = \frac{i-1}{\kappa^{-2}+i-1}\bar y_{i-1}.$$

The multiplier is 0 before any data are seen, so the prediction falls back on the prior mean, and it approaches 1 as more observations accumulate.

The posterior variance for $\theta$ is

$$v_{i-1} = \frac{\sigma^2}{\kappa^{-2}+i-1}.$$

As $i$ grows, more data have been observed and this variance shrinks.

A new observation has two sources of uncertainty:

  1. noise in the new observation, $\sigma^2$;
  2. uncertainty about $\theta$, $v_{i-1}$.

Therefore

$$y_i \mid y_1,\ldots,y_{i-1} \sim N(m_{i-1},\,\sigma^2+v_{i-1}).$$

Substituting the two pieces gives

$$y_i \mid y_1,\ldots,y_{i-1} \sim N\!\left( \frac{i-1}{\kappa^{-2}+i-1}\bar y_{i-1},\; \sigma^2\left(1+\frac{1}{\kappa^{-2}+i-1}\right) \right).$$

For $i=1$, the prediction uses only the prior. It is $N(0,\,\sigma^2(1+\kappa^2))$, which depends strongly on $\kappa$. For large $i$, the prediction is close to

$$N(\bar y_{i-1},\sigma^2),$$

which is much less sensitive to $\kappa$.
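The sequential formulas can be checked against the joint marginal likelihood. The sketch below is an illustration, with hypothetical data values and function names. It computes each one-step-ahead log predictive density and compares the sum with the log density of the joint marginal, which for this model is a zero-mean multivariate normal with covariance $\sigma^2(I + \kappa^2 \mathbf{1}\mathbf{1}^\top)$.

```python
import math


def normal_logpdf(x, mean, var):
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def sequential_log_predictives(y, sigma2, kappa):
    """log p(y_i | y_1, ..., y_{i-1}) for i = 1, ..., n under the normal model."""
    terms = []
    running_sum = 0.0  # sum of y_1, ..., y_{i-1}
    for i, y_i in enumerate(y, start=1):
        denom = kappa**-2 + (i - 1)
        m = running_sum / denom  # equals (i-1)/(kappa^-2 + i - 1) * ybar_{i-1}
        v = sigma2 / denom       # posterior variance of theta
        terms.append(normal_logpdf(y_i, m, sigma2 + v))
        running_sum += y_i
    return terms


def joint_log_marginal(y, sigma2, kappa):
    """log p(y_1, ..., y_n), using closed forms for the determinant and inverse."""
    n = len(y)
    s = sum(y)
    ss = sum(v * v for v in y)
    log_det = n * math.log(sigma2) + math.log(1.0 + n * kappa**2)
    quad = (ss - kappa**2 * s**2 / (1.0 + n * kappa**2)) / sigma2
    return -0.5 * (n * math.log(2.0 * math.pi) + log_det + quad)


# Hypothetical data, for illustration only.
y = [0.3, -1.2, 0.8, 2.1, -0.4, 1.0, 0.2, -0.7]
tight = sequential_log_predictives(y, sigma2=1.0, kappa=1.0)
diffuse = sequential_log_predictives(y, sigma2=1.0, kappa=100.0)

print("sum of sequential terms:", sum(tight))
print("joint log marginal:     ", joint_log_marginal(y, 1.0, 1.0))
print("first term, kappa=1 vs 100:", tight[0], diffuse[0])
print("last term,  kappa=1 vs 100:", tight[-1], diffuse[-1])
```

The sum of the sequential terms agrees with the joint log marginal likelihood, as the chain rule requires. For these data the first term changes by several log units between the two values of $\kappa$, while the last term barely moves.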

If marginal likelihood is too sensitive to early prior-only predictions, we can evaluate predictions only after the model has been trained on some data.

This is the idea behind Log Predictive Score.

Use the first $n^*$ observations for training:

$$y_1,\ldots,y_{n^*}.$$

Use the remaining observations for testing:

$$y_{n^*+1},\ldots,y_n.$$

Step 2: Predict the Test Observations Sequentially


After training on the first $n^*$ observations, predict

$$p(y_{n^*+1} \mid y_1,\ldots,y_{n^*}).$$

Then include that observation and predict the next one:

$$p(y_{n^*+2} \mid y_1,\ldots,y_{n^*+1}).$$

Continue until $y_n$.

The Log Predictive Score is

$$\mathrm{LPS} = \sum_{i=n^*+1}^{n} \log p(y_i \mid y_1,\ldots,y_{i-1}).$$

Higher LPS means better predictive performance on the held-out part.
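For the normal model with known variance and prior $\theta \sim N(0,\kappa^2\sigma^2)$, the score can be computed in a single pass over the data. This is a minimal sketch with hypothetical data and a made-up function name. Setting `n_train = 0` recovers the full log marginal likelihood, which makes the two quantities easy to compare.

```python
import math


def log_predictive_score(y, n_train, sigma2, kappa):
    """Sum of log p(y_i | y_1, ..., y_{i-1}) for i = n_train + 1, ..., n."""
    if not 0 <= n_train < len(y):
        raise ValueError("n_train must satisfy 0 <= n_train < len(y)")
    lps = 0.0
    running_sum = sum(y[:n_train])  # training observations
    for i in range(n_train + 1, len(y) + 1):  # i is the 1-based observation index
        y_i = y[i - 1]
        denom = kappa**-2 + (i - 1)
        mean = running_sum / denom          # posterior mean of theta
        var = sigma2 + sigma2 / denom       # noise plus posterior uncertainty
        lps += -0.5 * (math.log(2.0 * math.pi * var) + (y_i - mean) ** 2 / var)
        running_sum += y_i                  # the test point joins the conditioning set
    return lps


# Hypothetical data, for illustration only.
y = [0.3, -1.2, 0.8, 2.1, -0.4, 1.0, 0.2, -0.7]
for kappa in (1.0, 100.0):
    full = log_predictive_score(y, n_train=0, sigma2=1.0, kappa=kappa)
    held_out = log_predictive_score(y, n_train=4, sigma2=1.0, kappa=kappa)
    print(kappa, full, held_out)
```

For these data, changing $\kappa$ from 1 to 100 shifts the full log marginal likelihood by several log units but shifts the score with `n_train = 4` only slightly. The choice of `n_train` is itself a judgment call: more training data reduces prior sensitivity but leaves fewer points to score.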

For time series, the ordering is natural: train on the past and predict the future. For cross-sectional data, use cross-validation folds instead of a single time order.

Use Bayes factors with extra care when:

  • The compared models are very different in structure.
  • The models are severely misspecified.
  • The models are very complex or black-box.
  • Priors are weak defaults rather than carefully chosen prior predictive distributions.
  • The data contain outliers that none of the models handle.

In such cases, posterior predictive checks and predictive scores are often easier to interpret than posterior probabilities over a fragile list of models.

Model comparison should not begin with in-sample fit. The marginal likelihood asks how well a model predicted the data before seeing them, averaging over the prior. Bayes factors compare marginal likelihoods and update prior model odds into posterior model odds. This is coherent for a fixed list of proper models, but sensitive to priors. The sequential predictive view explains why: early factors in the marginal likelihood are prior predictive. Log Predictive Score reduces that sensitivity by training first and evaluating later.

  1. Why is comparing maximum likelihood values usually unfair to smaller models?
  2. What does the marginal likelihood average over?
  3. How does a Bayes factor update prior model odds?
  4. Why can vague priors favor smaller models in Bayes-factor comparisons?
  5. Why is Log Predictive Score less sensitive to the prior than the full marginal likelihood?