Bayesian Model Comparison

Source: Lecture 10, Bayesian Model Comparison (BDA Ch. 7), following that chapter's emphasis on predictive accuracy before discrete model choice.

From Checking to Comparing

The previous chapter asked:

Can one fitted model reproduce the features of the data that matter?

Now suppose we have several models that have survived basic checks. We need to compare them.

The tempting shortcut is to fit each model and compare its best likelihood. For two models, this would compare

$p_1(y \mid \hat\theta_1)$

with

$p_2(y \mid \hat\theta_2).$

This is not fair. Bigger or more flexible models usually fit the observed data better because they had more freedom to adapt to those same data.
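A minimal sketch of this unfairness, assuming hypothetical Bernoulli data with 6 successes and 4 failures (these counts are not from the lecture). The small model fixes the success probability at 0.5; the flexible model tunes it to the same data it is scored on, so its maximized likelihood can never be lower.

```python
import math

def bernoulli_loglik(theta, s, f):
    """Log-likelihood of s successes and f failures at success probability theta."""
    return s * math.log(theta) + f * math.log(1.0 - theta)

s, f = 6, 4                 # hypothetical data, chosen only for illustration
theta_fixed = 0.5           # small model: no free parameter
theta_mle = s / (s + f)     # flexible model: theta fitted to these same data

loglik_fixed = bernoulli_loglik(theta_fixed, s, f)
loglik_mle = bernoulli_loglik(theta_mle, s, f)

print(f"log-likelihood, fixed theta = 0.5: {loglik_fixed:.3f}")
print(f"log-likelihood, fitted theta:      {loglik_mle:.3f}")
# The fitted value is at least as large for any data set, which is why
# the comparison says nothing about which model predicts better.
```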

The main idea of this chapter is:

A model should be rewarded for predicting data well, not merely for fitting data after seeing them.

Bayesian model comparison develops this idea through marginal likelihood, Bayes factors, sequential prediction, and Log Predictive Score.

Marginal Likelihood

What Is It For?

The marginal likelihood asks:

Before seeing the data, how much probability did this model assign to data like this?

This is different from asking whether the model contains one parameter value that fits well. A model gets credit only if its prior and likelihood together made the observed data plausible.

Step 1: Start With a Model

For model $M_k$ , let $\theta_k$ be its parameters.

The likelihood is

$p_k(y \mid \theta_k).$

The prior is

$p_k(\theta_k).$

Step 2: Average Over the Prior

The model’s prior predictive probability of the observed data is

$p_k(y) = \int p_k(y \mid \theta_k)\,p_k(\theta_k)\,d\theta_k.$

This is the marginal likelihood.

Step 3: Interpret the Average

The integral has two roles:

It rewards parameter values that predict the data well.
It penalizes models that spread prior mass over many parameter values that would not have predicted the data.

This is why marginal likelihood can penalize complexity automatically. A flexible model must spend its prior probability across many possible datasets. It wins only if enough of that prior predictive mass is near the observed data.

The warning is equally important:

Priors directly affect model comparison.
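The integral in Step 2 can be read literally as "draw $\theta$ from the prior, evaluate the likelihood, and average." The sketch below does this by Monte Carlo for a Bernoulli likelihood with a Beta prior. The data (6 successes, 4 failures), the three priors, and the function names are assumptions made for illustration only.

```python
import random

def likelihood(theta, s, f):
    """Bernoulli likelihood of one sequence with s successes and f failures."""
    return theta**s * (1.0 - theta)**f

def marginal_likelihood_mc(alpha, beta, s, f, draws=100_000, seed=0):
    """Estimate p(y) by averaging the likelihood over draws from a Beta prior."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(draws):
        theta = rng.betavariate(alpha, beta)   # theta ~ prior
        total += likelihood(theta, s, f)       # p(y | theta)
    return total / draws

s, f = 6, 4   # hypothetical data

ml_near = marginal_likelihood_mc(30, 20, s, f)  # prior concentrated near 0.6
ml_flat = marginal_likelihood_mc(1, 1, s, f)    # flat prior on (0, 1)
ml_far = marginal_likelihood_mc(2, 18, s, f)    # prior concentrated near 0.1

print(f"prior near the data : {ml_near:.2e}")
print(f"flat prior          : {ml_flat:.2e}")
print(f"prior far from data : {ml_far:.2e}")
```

The likelihood is the same in all three runs. Only the prior changes, and the marginal likelihood changes with it.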

Bayes Factors

What Are They For?

A Bayes factor compares two models using their marginal likelihoods.

If model $M_1$ assigned more prior predictive probability to the observed data than model $M_2$ , the Bayes factor favors $M_1$ .

Definition

Compute the marginal likelihood for model 1:

$p_1(y).$

Compute the marginal likelihood for model 2:

$p_2(y).$

Take the ratio:

$B_{12}(y) = \frac{p_1(y)}{p_2(y)}.$

If $B_{12}(y) > 1$ , the data favor $M_1$ over $M_2$ on the marginal-likelihood scale. If it is below 1, they favor $M_2$ .

Posterior Model Probabilities

What Extra Ingredient Is Needed?

A Bayes factor compares evidence from the data. Posterior model probabilities also need prior probabilities over models.

Let

$\Pr(M_k)$

be the prior probability of model $M_k$ .

From Prior Odds to Posterior Odds

Prior odds are

$\frac{\Pr(M_1)}{\Pr(M_2)}.$

The Bayes factor updates those odds:

$\frac{\Pr(M_1 \mid y)}{\Pr(M_2 \mid y)} = \frac{\Pr(M_1)}{\Pr(M_2)}\,B_{12}(y).$

For many models, the same idea can be written as

$\Pr(M_k \mid y) \propto p(y \mid M_k)\,\Pr(M_k).$

This formula has two pieces:

$p(y \mid M_k)$ : the marginal likelihood $p_k(y)$, that is, how well model $M_k$ predicted the data before seeing them.
$\Pr(M_k)$ : how plausible model $M_k$ was before seeing the data.

Bayes factors and posterior model probabilities answer a discrete-model question: how should probability be allocated across the listed models? They do not answer whether the models fit the data well in an absolute sense. That is why posterior predictive checking came first.
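A minimal sketch of the proportionality above, assuming the log marginal likelihoods are already available. The three values used here are hypothetical placeholders, not results from any model in this chapter. Working on the log scale and subtracting the maximum before exponentiating avoids numerical underflow.

```python
import math

def posterior_model_probs(log_marginals, prior_probs):
    """Normalize p(y | M_k) * Pr(M_k) over a finite list of models."""
    log_unnorm = [lm + math.log(p) for lm, p in zip(log_marginals, prior_probs)]
    shift = max(log_unnorm)                     # stabilizes the exponentials
    weights = [math.exp(v - shift) for v in log_unnorm]
    total = sum(weights)
    return [w / total for w in weights]

log_marginals = [-104.2, -102.9, -110.5]   # hypothetical log p(y | M_k)
prior_probs = [1 / 3, 1 / 3, 1 / 3]        # equal prior model probabilities

probs = posterior_model_probs(log_marginals, prior_probs)
log_b12 = log_marginals[0] - log_marginals[1]   # log Bayes factor of M_1 vs M_2

for k, p in enumerate(probs, start=1):
    print(f"Pr(M_{k} | y) = {p:.3f}")
print(f"B_12 = {math.exp(log_b12):.3f}")
```

With equal prior probabilities the posterior odds of any two models equal their Bayes factor. The probabilities always sum to 1 over the list, even if every listed model fits badly.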

Example: Bayesian Hypothesis Testing with Bernoulli Data

What Is the Question?

Hypothesis testing can be written as model comparison.

Suppose the data are Bernoulli observations. Let

$s = \text{number of successes}, \qquad f = \text{number of failures}.$

Compare two models.

Model 0: Fixed Success Probability

The point-null model fixes the success probability:

$M_0: x_1,\ldots,x_n \mid \theta_0 \sim \mathrm{Bernoulli}(\theta_0).$

Its marginal likelihood is just the likelihood at the fixed value:

$p(x \mid M_0) = \theta_0^s(1-\theta_0)^f.$

Model 1: Unknown Success Probability

The alternative model learns $\theta$ :

$M_1: x_1,\ldots,x_n \mid \theta \sim \mathrm{Bernoulli}(\theta).$

Use a Beta prior:

$\theta \sim \mathrm{Beta}(\alpha,\beta).$

After integrating over $\theta$ , the marginal likelihood is

$p(x \mid M_1) = \frac{B(\alpha+s,\beta+f)}{B(\alpha,\beta)}.$

The Bayes Factor

The Bayes factor comparing $M_0$ to $M_1$ is

$BF(M_0;M_1) = \frac{p(x \mid M_0)}{p(x \mid M_1)}.$

Substitute the two pieces:

$BF(M_0;M_1) = \frac{\theta_0^s(1-\theta_0)^f\,B(\alpha,\beta)}{B(\alpha+s,\beta+f)}.$

Read this as a comparison between:

the probability of the data under the fixed value $\theta_0$ ;
the prior predictive probability of the data under the uncertain- $\theta$ model.
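The formula can be evaluated on the log scale with the log-gamma function. This is a sketch under assumed inputs: $\theta_0 = 0.5$, a uniform $\mathrm{Beta}(1,1)$ prior, and two hypothetical data sets of ten trials each.

```python
import math

def log_beta(a, b):
    """Log of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def log_bf_01(s, f, theta0, alpha, beta):
    """Log Bayes factor of M_0 (theta fixed at theta0) against M_1 (Beta prior)."""
    log_m0 = s * math.log(theta0) + f * math.log(1.0 - theta0)
    log_m1 = log_beta(alpha + s, beta + f) - log_beta(alpha, beta)
    return log_m0 - log_m1

theta0, alpha, beta = 0.5, 1.0, 1.0   # assumed null value and prior

bf_balanced = math.exp(log_bf_01(6, 4, theta0, alpha, beta))  # 6 successes, 4 failures
bf_lopsided = math.exp(log_bf_01(9, 1, theta0, alpha, beta))  # 9 successes, 1 failure

print(f"BF(M_0; M_1), s=6, f=4: {bf_balanced:.3f}")
print(f"BF(M_0; M_1), s=9, f=1: {bf_lopsided:.3f}")
```

Under these assumptions the balanced data give a Bayes factor above 1 (favoring the point null), and the lopsided data give a Bayes factor below 1 (favoring the uncertain-$\theta$ model).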

Why Priors Matter

The marginal likelihood averages the likelihood over the prior. Therefore the prior affects model comparison directly.

This can surprise us because weak priors are often harmless for parameter estimation. For Bayes factors, a very wide prior can be harmful: it spreads probability mass over many values that predict data far away from what was observed.

The practical rule is:

A prior used for Bayes factors should represent real prior predictive beliefs, not just weak regularization.

Improper priors cannot be used for Bayes factors: an improper prior is defined only up to an arbitrary multiplicative constant, so the marginal likelihood, and with it the Bayes factor, inherits that arbitrary constant.
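A sketch of how prior width alone can change the verdict. It assumes a setup that is not part of the Bernoulli example: normal data with known $\sigma$, a point null $\theta = 0$, and an alternative with prior $\theta \sim N(0,\kappa^2\sigma^2)$. In that setup the Bayes factor depends on the data only through the sample mean $\bar y$, which is $N(0,\sigma^2/n)$ under the null and $N(0,\sigma^2/n+\kappa^2\sigma^2)$ under the alternative. The data summary is hypothetical.

```python
import math

def normal_logpdf(x, mean, var):
    """Log density of N(mean, var) at x."""
    return -0.5 * math.log(2.0 * math.pi * var) - 0.5 * (x - mean) ** 2 / var

def log_bf_null_vs_alt(y_bar, n, sigma, kappa):
    """Log Bayes factor of theta = 0 against theta ~ N(0, kappa^2 sigma^2)."""
    var_null = sigma**2 / n
    var_alt = sigma**2 / n + (kappa * sigma) ** 2
    return normal_logpdf(y_bar, 0.0, var_null) - normal_logpdf(y_bar, 0.0, var_alt)

y_bar, n, sigma = 0.5, 20, 1.0          # hypothetical data summary
kappas = [1.0, 10.0, 100.0, 1000.0]     # increasingly diffuse priors
log_bfs = [log_bf_null_vs_alt(y_bar, n, sigma, k) for k in kappas]

for k, lb in zip(kappas, log_bfs):
    print(f"kappa = {k:7.1f}   log BF(null; alt) = {lb:+.3f}")
```

The data never change. Widening the prior moves the log Bayes factor from negative (favoring the alternative) to positive (favoring the smaller null model).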

Example: Geometric vs. Poisson Counts

The Competing Models

Model 1 is geometric:

$y_i \mid \theta_1 \sim \mathrm{Geo}(\theta_1), \qquad \theta_1 \sim \mathrm{Beta}(\alpha_1,\beta_1).$

Model 2 is Poisson:

$y_i \mid \theta_2 \sim \mathrm{Pois}(\theta_2), \qquad \theta_2 \sim \mathrm{Gamma}(\alpha_2,\beta_2).$

The two models can be matched to have similar prior predictive means. Even then, their shapes differ.

What the Example Teaches

If the data are

$y=(0,0),$

the geometric model is favored because it naturally puts more mass near repeated zeros.

If the data are

$y=(3,3),$

the Poisson model is favored because repeated moderate counts are more typical under that model.

As the sample size grows, the posterior model probability concentrates on the true model if it is in the list. If the true model is not in the list, it concentrates on the listed model closest in KL divergence to the true data-generating distribution.
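The two small data sets above can be checked in closed form, since both models are conjugate. The sketch below rests on several assumptions not fixed by the text: the geometric distribution counts failures before the first success (support 0, 1, 2, ...), the Gamma prior is parameterized by shape and rate, and the hyperparameters are $\mathrm{Beta}(2,2)$ and $\mathrm{Gamma}(2,1)$, which give both models a prior predictive mean of 2.

```python
import math

def log_beta(a, b):
    """Log of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def log_marginal_geometric(y, alpha, beta):
    """Log p(y) for y_i ~ Geo(theta) on {0, 1, ...}, theta ~ Beta(alpha, beta)."""
    n, total = len(y), sum(y)
    return log_beta(alpha + n, beta + total) - log_beta(alpha, beta)

def log_marginal_poisson(y, shape, rate):
    """Log p(y) for y_i ~ Pois(theta), theta ~ Gamma(shape, rate)."""
    n, total = len(y), sum(y)
    return (shape * math.log(rate) - math.lgamma(shape)
            + math.lgamma(shape + total)
            - (shape + total) * math.log(rate + n)
            - sum(math.lgamma(v + 1) for v in y))

def log_bf_geo_vs_pois(y):
    """Log Bayes factor of the geometric model against the Poisson model."""
    return log_marginal_geometric(y, 2.0, 2.0) - log_marginal_poisson(y, 2.0, 1.0)

log_bf_zeros = log_bf_geo_vs_pois([0, 0])
log_bf_threes = log_bf_geo_vs_pois([3, 3])

print(f"y = (0, 0): log BF(geometric; Poisson) = {log_bf_zeros:+.3f}")
print(f"y = (3, 3): log BF(geometric; Poisson) = {log_bf_threes:+.3f}")
```

With these assumed hyperparameters the sign of the log Bayes factor matches the text: positive for $(0,0)$ and negative for $(3,3)$. Other hyperparameters can change the magnitudes.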

Useful Properties and Warnings

Bayesian model comparison has attractive theoretical properties:

Coherence: pairwise Bayes factors are consistent with one another, so $B_{12} = B_{13}B_{32}$ .
Consistency: if the true model is in the list, its posterior probability tends to 1 as $n$ grows.
KL-consistency: if the true model is not in the list, the best approximating model wins asymptotically.

But there are important warnings:

Vague priors can favor smaller models.
Improper priors cannot be used for Bayes factors.
Severe misspecification can make posterior model probabilities misleading.
Outliers and multivariate responses can make comparisons fragile.

Marginal Likelihood as Sequential Prediction

Why This View Helps

The marginal likelihood can feel abstract because it is an integral. A more intuitive view is sequential prediction.

Imagine observing the data one point at a time. The model must predict each new observation using only the observations already seen.

Step 1: Decompose the Joint Density

For ordered data,

$p(y_1,\ldots,y_n) = p(y_1)\,p(y_2 \mid y_1)\cdots p(y_n \mid y_1,\ldots,y_{n-1}).$

This is just the chain rule for probability.

Step 2: Interpret One Factor

The prediction for observation $i$ is

$p(y_i \mid y_1,\ldots,y_{i-1}).$

Because $\theta$ is unknown, average over the posterior based on the previous observations:

$p(y_i \mid y_1,\ldots,y_{i-1}) = \int p(y_i \mid \theta)\,p(\theta \mid y_1,\ldots,y_{i-1})\,d\theta.$
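A small numerical check of the decomposition, using the Bernoulli model with a Beta prior from the hypothesis-testing example. For that model each one-step-ahead predictive probability of a success is the current posterior mean of $\theta$. The data sequence and the $\mathrm{Beta}(2,2)$ prior are hypothetical choices.

```python
import math

def log_beta(a, b):
    """Log of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def sequential_log_predictives(x, alpha, beta):
    """log p(x_i | x_1, ..., x_{i-1}) for each i under a Beta-Bernoulli model."""
    terms = []
    s = f = 0                                   # successes and failures seen so far
    for xi in x:
        p_success = (alpha + s) / (alpha + beta + s + f)   # posterior mean of theta
        terms.append(math.log(p_success if xi == 1 else 1.0 - p_success))
        s += xi
        f += 1 - xi
    return terms

x = [1, 0, 1, 1, 0, 1]        # hypothetical ordered observations
alpha, beta = 2.0, 2.0        # assumed prior

terms = sequential_log_predictives(x, alpha, beta)
log_ml_sequential = sum(terms)

s, f = sum(x), len(x) - sum(x)
log_ml_closed = log_beta(alpha + s, beta + f) - log_beta(alpha, beta)

print(f"sum of sequential log predictives: {log_ml_sequential:.6f}")
print(f"closed-form log marginal:          {log_ml_closed:.6f}")
```

The two numbers agree up to floating-point error. The first term in `terms` depends only on the prior.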

Step 3: Notice the Prior Sensitivity

The first factor

$p(y_1)$

uses only the prior. It can be very sensitive to prior choices.

Later factors condition on more and more data:

$p(y_i \mid y_1,\ldots,y_{i-1}) \quad \text{for large } i.$

These predictions are usually less sensitive to the original prior.

This explains why marginal likelihood can be prior-sensitive: the first factors are almost pure prior predictions, and they enter the product with the same weight as every later factor.

Normal Example

The Model

Assume

$y_1,\ldots,y_n \mid \theta \sim N(\theta,\sigma^2),$

with known $\sigma^2$ , and prior

$\theta \sim N(0,\kappa^2\sigma^2).$

The parameter $\kappa$ controls prior spread. Large $\kappa$ means a diffuse prior.

Posterior Mean Before Observation i

Before seeing $y_i$ , we have observed $y_1,\ldots,y_{i-1}$ .

Let $\bar y_{i-1}$ be their average. The posterior mean for $\theta$ is a shrinkage version of that average:

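$m_{i-1} = \frac{i-1}{\kappa^{-2}+i-1}\bar y_{i-1}.$

The multiplier is 0 at $i=1$, where the posterior mean is the prior mean, and approaches 1 as $i$ grows. The smaller $\kappa$ is, the slower that approach.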

Posterior Variance Before Observation i

The posterior variance for $\theta$ is

$v_{i-1} = \frac{\sigma^2}{\kappa^{-2}+i-1}.$

As $i$ grows, more data have been observed and this variance shrinks.

Predictive Distribution

A new observation has two sources of uncertainty:

noise in the new observation, $\sigma^2$ ;
uncertainty about $\theta$ , $v_{i-1}$ .

Therefore

$y_i \mid y_1,\ldots,y_{i-1} \sim N(m_{i-1},\,\sigma^2+v_{i-1}).$

Substituting the two pieces gives

$y_i \mid y_1,\ldots,y_{i-1} \sim N\!\left( \frac{i-1}{\kappa^{-2}+i-1}\bar y_{i-1},\ \sigma^2\left(1+\frac{1}{\kappa^{-2}+i-1}\right) \right).$

For $i=1$ , no data have been seen, so the prediction is the prior predictive $N(0,\sigma^2(1+\kappa^2))$ , whose variance grows without bound in $\kappa$ . For large $i$ , the prediction is close to

$N(\bar y_{i-1},\sigma^2),$

which is much less sensitive to $\kappa$ .
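A sketch of this contrast, evaluating each one-step-ahead log predictive density for a tight prior ($\kappa = 1$) and a diffuse prior ($\kappa = 100$). The eight observations and $\sigma = 1$ are hypothetical. The predictive mean is computed as the sum of past observations divided by $\kappa^{-2}+i-1$, which equals the shrinkage formula above and gives 0 when no data have been seen.

```python
import math

def normal_logpdf(x, mean, var):
    """Log density of N(mean, var) at x."""
    return -0.5 * math.log(2.0 * math.pi * var) - 0.5 * (x - mean) ** 2 / var

def predictive_mean_var(past, sigma, kappa):
    """Mean and variance of the next observation given the past ones."""
    precision = kappa**-2 + len(past)            # kappa^{-2} + (i - 1)
    mean = sum(past) / precision                 # shrunk average; 0 if past is empty
    var = sigma**2 * (1.0 + 1.0 / precision)     # noise plus uncertainty about theta
    return mean, var

def sequential_log_predictives(y, sigma, kappa):
    """log p(y_i | y_1, ..., y_{i-1}) for every i."""
    terms = []
    for i in range(len(y)):
        mean, var = predictive_mean_var(y[:i], sigma, kappa)
        terms.append(normal_logpdf(y[i], mean, var))
    return terms

y = [1.2, 0.7, 1.9, 1.1, 0.4, 1.6, 1.3, 0.9]   # hypothetical data
sigma = 1.0

terms_tight = sequential_log_predictives(y, sigma, kappa=1.0)
terms_diffuse = sequential_log_predictives(y, sigma, kappa=100.0)

for i, (a, b) in enumerate(zip(terms_tight, terms_diffuse), start=1):
    print(f"i = {i}: kappa=1 -> {a:+.3f}   kappa=100 -> {b:+.3f}   gap = {a - b:+.3f}")
```

The gap between the two priors is large for the first observation and small by the last one, so most of the difference in the full marginal likelihood comes from the earliest factors.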

Log Predictive Score

What Problem Does It Solve?

If marginal likelihood is too sensitive to early prior-only predictions, we can evaluate predictions only after the model has been trained on some data.

This is the idea behind Log Predictive Score.

Step 1: Split the Sequence

Use the first $n^*$ observations for training:

$y_1,\ldots,y_{n^*}.$

Use the remaining observations for testing:

$y_{n^*+1},\ldots,y_n.$

Step 2: Predict the Test Observations Sequentially

After training on the first $n^*$ observations, predict

$p(y_{n^*+1} \mid y_1,\ldots,y_{n^*}).$

Then include that observation and predict the next one:

$p(y_{n^*+2} \mid y_1,\ldots,y_{n^*+1}).$

Continue until $y_n$ .

Step 3: Add Log Predictive Densities

The Log Predictive Score is

$\mathrm{LPS} = \sum_{i=n^*+1}^{n} \log p(y_i \mid y_1,\ldots,y_{i-1}).$

Higher LPS means better predictive performance on the held-out part.

For time series, the ordering is natural: train on the past and predict the future. For cross-sectional data, use cross-validation folds instead of a single time order.
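The three steps can be sketched for the normal model with known $\sigma$ and prior $\theta \sim N(0,\kappa^2\sigma^2)$. The data, the training size $n^* = 3$, and the two values of $\kappa$ are hypothetical. Setting $n^* = 0$ recovers the full log marginal likelihood, which makes the two scores easy to compare.

```python
import math

def normal_logpdf(x, mean, var):
    """Log density of N(mean, var) at x."""
    return -0.5 * math.log(2.0 * math.pi * var) - 0.5 * (x - mean) ** 2 / var

def log_predictive(y_new, past, sigma, kappa):
    """log p(y_new | past) under y ~ N(theta, sigma^2), theta ~ N(0, kappa^2 sigma^2)."""
    precision = kappa**-2 + len(past)
    mean = sum(past) / precision
    var = sigma**2 * (1.0 + 1.0 / precision)
    return normal_logpdf(y_new, mean, var)

def log_predictive_score(y, n_star, sigma, kappa):
    """Sum of one-step-ahead log predictive densities for observations after n_star."""
    return sum(log_predictive(y[i], y[:i], sigma, kappa)
               for i in range(n_star, len(y)))

y = [1.2, 0.7, 1.9, 1.1, 0.4, 1.6, 1.3, 0.9]   # hypothetical ordered data
sigma = 1.0

# n_star = 0 scores every observation, i.e. the full log marginal likelihood.
gap_full = (log_predictive_score(y, 0, sigma, kappa=1.0)
            - log_predictive_score(y, 0, sigma, kappa=100.0))
# n_star = 3 trains on the first three observations and scores the rest.
gap_lps = (log_predictive_score(y, 3, sigma, kappa=1.0)
           - log_predictive_score(y, 3, sigma, kappa=100.0))

print(f"prior-induced gap, full marginal likelihood: {gap_full:+.3f}")
print(f"prior-induced gap, LPS with n* = 3:          {gap_lps:+.3f}")
```

In this sketch the choice of $\kappa$ moves the full log marginal likelihood by several units but moves the LPS by only a small fraction of that. The choice of $n^*$ is itself a judgment call that the text does not fix.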

Be Careful with Bayes Factors

Use Bayes factors with extra care when:

The compared models are very different in structure.
The models are severely misspecified.
The models are very complex or black-box.
Priors are weak defaults rather than carefully chosen prior predictive distributions.
The data contain outliers that none of the models handle.

In such cases, posterior predictive checks and predictive scores are often easier to interpret than posterior probabilities over a fragile list of models.

Chapter Summary

Model comparison should not begin with in-sample fit. The marginal likelihood asks how well a model predicted the data before seeing them, averaging over the prior. Bayes factors compare marginal likelihoods and update prior model odds into posterior model odds. This is coherent for a fixed list of proper models, but sensitive to priors. The sequential predictive view explains why: early factors in the marginal likelihood are prior predictive. Log Predictive Score reduces that sensitivity by training first and evaluating later.

Check Your Understanding

Why is comparing maximum likelihood values usually unfair to smaller models?
What does the marginal likelihood average over?
How does a Bayes factor update prior model odds?
Why can vague priors favor smaller models in Bayes-factor comparisons?
Why is Log Predictive Score less sensitive to the prior than the full marginal likelihood?