Model Evaluation

Source: Lecture 12, Model evaluation (BDA Ch. 6), with the framing of BDA Ch. 6–7.

Up to this point we have mostly asked:

Given a model, what can we infer?

This part asks a different question:

Should we trust this model for the job we have in mind?

The storyline is:

  1. Evaluate one model: what does it reproduce, and what does it miss?
  2. Compare several models: which predicts future or replicated data better?
  3. Study Bayes factors: when do posterior model probabilities make sense?
  4. Scale up to many models: how do we search and average over a large model space?

The order matters. If we start directly with Bayes factors, it can feel like model comparison is only about formulas. The better starting point is more practical: before ranking models, learn how to notice what a model is doing.

A Bayesian analysis has three broad steps:

  1. Build a probability model.
  2. Compute the posterior distribution.
  3. Check whether the fitted model is adequate for the purpose.

The third step is easy to skip, but it is essential. Posterior inference assumes the whole model structure. If the likelihood or prior is badly mismatched to the problem, the posterior can be precise and misleading at the same time.

George Box summarized the practical attitude:

“All models are wrong, but some are useful.”

So the question is not:

Is the model true?

It is:

Are the model’s failures important for the conclusions we want to draw?

This is the viewpoint in BDA Ch. 6. Model checking is not a ritual. It is a way to find the parts of reality that the model has not captured.

The same fitted model can be useful for one purpose and inadequate for another. Start by naming the purpose.

If the goal is prediction, the main question is:

Will the model predict future data well?

Interpretability may be secondary. A black-box model may be acceptable. Model averaging can be helpful because it can reduce predictive risk.

If the goal is abstraction, the model is a tool for thinking about a phenomenon.

Here the main question is:

Does the model isolate the mechanism or pattern we care about?

Pure predictive accuracy may be less important. Model averaging can even be unhelpful if it blurs the interpretation.

Sometimes the model is a compact summary of a complex process.

Then the main question is:

Is the model simple enough to compute, communicate, or update in real time?

This matters in online analysis, monitoring, or other settings where a perfect model is too slow or too complicated.

Before computing any diagnostic, ask:

  • What data should the model be able to reproduce?
  • Which discrepancy would change our conclusion?
  • Which discrepancy is only a harmless simplification?

After fitting a model, we want to know whether the observed data look plausible under that fitted model.

But we cannot compare the observed data to one fixed parameter value. We are uncertain about the parameters. Bayesian checking therefore uses the posterior distribution.

The idea is:

  1. Draw plausible parameter values from the posterior.
  2. Simulate replicated datasets from those parameter values.
  3. Compare the real data with the replicated data.

If the model is useful, the observed data should not look systematically stranger than the replicated data.

Let

$$y$$

be the data we actually observed.

Let

$$y^{\mathrm{rep}}$$

be data that could have been observed if the fitted model generated a new dataset under the same conditions.

By default, replicated data are not draws from a new future population. They are model-based replications under the same data-generating setup as the observed data, which is the setup we want to check.

Step 3: Average Over Posterior Uncertainty

For a fixed parameter value $\theta$, the model can simulate replicated data from

$$p(y^{\mathrm{rep}} \mid \theta).$$

But $\theta$ is unknown. After observing $y$, uncertainty about $\theta$ is described by

$$p(\theta \mid y).$$

Average the replicate distribution over this posterior:

$$p(y^{\mathrm{rep}} \mid y) = \int p(y^{\mathrm{rep}} \mid \theta)\,p(\theta \mid y)\,d\theta.$$

This is the posterior predictive distribution.

It answers:

After learning from the observed data, what datasets does this model expect to see in a replication?
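The integral is rarely computed in closed form. As a sketch, assuming we have $S$ posterior draws (the symbol $S$ is introduced here for illustration), it can be read as a Monte Carlo average over those draws:

$$p(y^{\mathrm{rep}} \mid y) \approx \frac{1}{S} \sum_{s=1}^{S} p(y^{\mathrm{rep}} \mid \theta^{(s)}), \qquad \theta^{(s)} \sim p(\theta \mid y).$$

In practice we usually do not evaluate this density. Simulating one $y^{\mathrm{rep}}$ from each $p(y^{\mathrm{rep}} \mid \theta^{(s)})$ already gives draws from the posterior predictive distribution.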

Real datasets are often high-dimensional. A full comparison between $y$ and $y^{\mathrm{rep}}$ can be hard to interpret.

Instead, choose a lower-dimensional summary that targets one possible weakness of the model.

A test statistic or discrepancy measure is a function such as

$$T(y)$$

or, if the statistic depends on parameters,

$$T(y, \theta).$$

For replicated data, compute the same kind of quantity:

$$T(y^{\mathrm{rep}}) \quad \text{or} \quad T(y^{\mathrm{rep}}, \theta).$$

Good checks are targeted:

  • To check outliers, use a maximum or tail statistic.
  • To check dependence, use autocorrelations.
  • To check count models, use the number of zeros or a frequency table.
  • To check hierarchical models, use between-group variation.

The question should be specific. “Does the model fail?” is too vague. “Does the model underpredict zeros?” is useful.

To perform a posterior predictive check:

  1. Draw a parameter value:

    $$\theta^{(s)} \sim p(\theta \mid y).$$

  2. Simulate a replicated dataset:

    $$y^{\mathrm{rep},(s)} \sim p(y^{\mathrm{rep}} \mid \theta^{(s)}).$$

  3. Compute the replicated statistic:

    $$T(y^{\mathrm{rep},(s)}) \quad \text{or} \quad T(y^{\mathrm{rep},(s)}, \theta^{(s)}).$$

  4. Repeat for many posterior draws.

  5. Compare the observed statistic with the simulated distribution.

This procedure evaluates the full probability model: likelihood, prior, and posterior uncertainty together.
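The five steps can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the binary data, the i.i.d. Bernoulli model with a uniform Beta(1, 1) prior, the "number of switches" statistic, and the helper names `count_switches` and `replicated_statistics` are all assumptions chosen to keep the example small.

```python
import random
import statistics


def count_switches(seq):
    """Test statistic T: number of adjacent pairs that differ."""
    return sum(1 for a, b in zip(seq, seq[1:]) if a != b)


def replicated_statistics(draw_theta, simulate, statistic, n_draws, rng):
    """Steps 1-4: draw theta, simulate y_rep, compute T(y_rep), repeat."""
    stats = []
    for _ in range(n_draws):
        theta = draw_theta(rng)         # theta^(s) ~ p(theta | y)
        y_rep = simulate(theta, rng)    # y_rep^(s) ~ p(y_rep | theta^(s))
        stats.append(statistic(y_rep))  # T(y_rep^(s))
    return stats


# Illustrative data (an assumption of this sketch): 20 binary outcomes.
y = [1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
n, successes = len(y), sum(y)

rng = random.Random(0)  # fixed seed so the sketch is repeatable
t_obs = count_switches(y)
t_rep = replicated_statistics(
    # Assumed model: i.i.d. Bernoulli(theta) with a uniform Beta(1, 1) prior,
    # so the posterior is Beta(1 + successes, 1 + failures).
    draw_theta=lambda r: r.betavariate(1 + successes, 1 + n - successes),
    simulate=lambda theta, r: [1 if r.random() < theta else 0 for _ in range(n)],
    statistic=count_switches,
    n_draws=4000,
    rng=rng,
)

# Step 5: compare the observed statistic with the simulated distribution.
print("observed switches:", t_obs)
print("replicated median:", statistics.median(t_rep))
print("share of replicates with <= observed switches:",
      sum(t <= t_obs for t in t_rep) / len(t_rep))
```

If the observed count sits far in the lower tail of the replicated counts, the data switch less often than independent draws would, which points to dependence that the i.i.d. model does not represent.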

A graph is often the best check. A posterior predictive p-value gives a numerical summary of the same comparison.

It asks:

How often would the replicated statistic be at least as extreme as the observed statistic?

Start with the observed statistic:

$$T(y).$$

Compare it with the replicated statistic:

$$T(y^{\mathrm{rep}}).$$

The posterior predictive p-value is

$$p_B = \Pr[T(y^{\mathrm{rep}}) \ge T(y) \mid y].$$

In simulation, estimate it by the fraction of draws where

$$T(y^{\mathrm{rep},(s)}) \ge T(y).$$

A value near 0 or 1 means the observed statistic is unusual compared with replicated data from the fitted model.

This is not the same as a classical p-value. It is a posterior predictive comparison, conditional on the model and averaging over posterior uncertainty. It should be interpreted as evidence about the chosen discrepancy, not as a universal pass/fail label.
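As a small sketch of the simulation estimate, the functions below compute the fraction of draws in which the replicated statistic is at least as large as the observed one. The function names and the toy numbers are hypothetical. The second function covers a statistic that depends on parameters, where the comparison is made draw by draw between $T(y^{\mathrm{rep},(s)}, \theta^{(s)})$ and $T(y, \theta^{(s)})$.

```python
def posterior_predictive_p_value(t_obs, t_rep):
    """Estimate p_B as the share of replicated statistics >= the observed one."""
    if not t_rep:
        raise ValueError("need at least one replicated statistic")
    return sum(1 for t in t_rep if t >= t_obs) / len(t_rep)


def posterior_predictive_p_value_paired(t_obs_draws, t_rep_draws):
    """Same estimate when T depends on theta: compare within each draw s."""
    if not t_obs_draws or len(t_obs_draws) != len(t_rep_draws):
        raise ValueError("need one observed and one replicated value per draw")
    hits = sum(1 for obs, rep in zip(t_obs_draws, t_rep_draws) if rep >= obs)
    return hits / len(t_obs_draws)


# Toy numbers (made up for illustration, not from a fitted model).
t_rep = [2, 5, 7, 3, 9, 4, 6, 8]
print(posterior_predictive_p_value(3, t_rep))  # 7 of 8 replicates are >= 3

t_obs_draws = [1.0, 2.0, 3.0, 4.0]
t_rep_draws = [1.5, 1.0, 3.0, 5.0]
print(posterior_predictive_p_value_paired(t_obs_draws, t_rep_draws))  # 3 of 4
```

Because the estimate is a simple proportion, its Monte Carlo error shrinks as the number of posterior draws grows. A few thousand draws are usually enough to tell a value near 0 or 1 from a value in the middle.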

Suppose

$$y_1,\ldots,y_n \mid \mu,\sigma^2 \sim N(\mu,\sigma^2).$$

To check whether the Normal model can reproduce extreme observations, use

$$T(y) = \max_i |y_i|.$$

The check proceeds as follows:

  1. Fit the Normal model.
  2. Simulate many Normal replicated datasets from the posterior predictive distribution.
  3. Compute the maximum absolute value in each replicate.
  4. Compare the observed maximum with the replicated maxima.

If the data are actually heavy-tailed, such as $t$ data with few degrees of freedom or Cauchy data, the observed maximum will often be much larger than the replicated maxima. The conclusion is specific: the Normal model is not producing tails heavy enough for this dataset.
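A minimal sketch of this check follows. Two choices here are assumptions rather than part of the lecture: the data are 100 evenly spaced quantiles of a standard Cauchy distribution (a deterministic stand-in for heavy-tailed data), and the Normal model uses the noninformative prior $p(\mu, \sigma^2) \propto 1/\sigma^2$, under which the posterior can be sampled directly.

```python
import math
import random
import statistics


def max_abs(values):
    """Test statistic T(y) = max_i |y_i|."""
    return max(abs(v) for v in values)


# Stand-in heavy-tailed data (an assumption of this sketch): evenly spaced
# quantiles of a standard Cauchy distribution.
n = 100
y = [math.tan(math.pi * ((i + 0.5) / n - 0.5)) for i in range(n)]

y_bar = statistics.fmean(y)
s2 = statistics.variance(y)  # sample variance with n - 1 in the denominator
t_obs = max_abs(y)

rng = random.Random(1)  # fixed seed so the sketch is repeatable
n_draws = 2000
t_rep = []
for _ in range(n_draws):
    # Posterior draw under the assumed prior p(mu, sigma^2) ∝ 1 / sigma^2:
    #   sigma^2 | y      ~ (n - 1) s^2 / chi^2_{n-1}
    #   mu | sigma^2, y  ~ N(y_bar, sigma^2 / n)
    chi2 = rng.gammavariate((n - 1) / 2, 2.0)  # chi-square with n - 1 df
    sigma2 = (n - 1) * s2 / chi2
    mu = rng.gauss(y_bar, math.sqrt(sigma2 / n))
    # Replicated dataset of the same size from the Normal likelihood.
    sigma = math.sqrt(sigma2)
    y_rep = [rng.gauss(mu, sigma) for _ in range(n)]
    t_rep.append(max_abs(y_rep))

p_b = sum(t >= t_obs for t in t_rep) / n_draws
print(f"observed max |y| = {t_obs:.1f}")
print(f"largest replicated max |y_rep| = {max(t_rep):.1f}")
print(f"estimated p_B = {p_b:.3f}")
```

The fitted Normal model inflates its variance to accommodate the extreme points, yet its replicated maxima still fall well short of the observed maximum. That gap is what the check is designed to expose.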

For a time-series model, the feature of interest may be dependence.

A useful statistic is the autocorrelation at selected lags:

$$T(y) = \text{autocorrelation at lag } k.$$

If the observed autocorrelations are typical of the replicated series, this check gives no evidence that the model misses that dependence pattern. If they are not, the model is missing structure in time.
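The sketch below shows one way such a check might look. Everything specific in it is an assumption made for illustration: the smooth sine-wave series, the deliberately inadequate i.i.d. Normal model with prior $p(\mu, \sigma^2) \propto 1/\sigma^2$, and the helper name `autocorrelation`.

```python
import math
import random
import statistics


def autocorrelation(series, lag):
    """Sample autocorrelation of a series at the given lag."""
    n = len(series)
    mean = sum(series) / n
    denom = sum((x - mean) ** 2 for x in series)
    num = sum((series[t] - mean) * (series[t + lag] - mean)
              for t in range(n - lag))
    return num / denom


# Illustrative series (an assumption): a smooth wave, so neighbours are similar.
n, lag = 100, 1
y = [math.sin(2 * math.pi * t / 25) for t in range(n)]
r_obs = autocorrelation(y, lag)

# Assumed model: y_t i.i.d. Normal(mu, sigma^2), which ignores time order.
y_bar, s2 = statistics.fmean(y), statistics.variance(y)
rng = random.Random(2)  # fixed seed so the sketch is repeatable
n_draws = 1000
r_rep = []
for _ in range(n_draws):
    sigma2 = (n - 1) * s2 / rng.gammavariate((n - 1) / 2, 2.0)
    mu = rng.gauss(y_bar, math.sqrt(sigma2 / n))
    y_rep = [rng.gauss(mu, math.sqrt(sigma2)) for _ in range(n)]
    r_rep.append(autocorrelation(y_rep, lag))

p_b = sum(r >= r_obs for r in r_rep) / n_draws
print(f"observed lag-{lag} autocorrelation = {r_obs:.2f}")
print(f"replicated range = [{min(r_rep):.2f}, {max(r_rep):.2f}]")
print(f"estimated p_B = {p_b:.3f}")
```

Replicated series from an independent model have lag-1 autocorrelations scattered around zero, so a strongly positive observed value stands out. Repeating the comparison at several lags shows which part of the dependence the model misses.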

For count data, a common failure is underestimating zero counts.

A targeted statistic is

$$T(y) = \text{number of observations equal to zero}.$$

If the observed number of zeros is much larger than in posterior predictive replications, the model may need zero inflation, overdispersion, omitted predictors, or another likelihood.
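As a hedged illustration, the sketch below fits a Poisson model to made-up counts with many zeros and checks the number of zeros. The data, the Gamma(1, 1) prior on the rate (shape 1, rate 1, an arbitrary choice), and the helper `sample_poisson` are assumptions of this example. With that prior the posterior for the rate is Gamma with shape $1 + \sum_i y_i$ and rate $1 + n$.

```python
import math
import random


def sample_poisson(lam, rng):
    """Draw one Poisson(lam) value with Knuth's method (fine for small lam)."""
    threshold = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k


def count_zeros(values):
    """Test statistic T(y) = number of observations equal to zero."""
    return sum(1 for v in values if v == 0)


# Made-up counts (an assumption): 30 zeros plus 20 moderate positive counts.
y = [0] * 30 + [3, 4, 5, 6, 4, 5, 3, 7, 4, 5, 6, 4, 5, 3, 4, 6, 5, 4, 5, 4]
n, total = len(y), sum(y)
t_obs = count_zeros(y)

prior_shape, prior_rate = 1.0, 1.0  # assumed Gamma prior on the Poisson rate
post_shape, post_rate = prior_shape + total, prior_rate + n

rng = random.Random(3)  # fixed seed so the sketch is repeatable
n_draws = 2000
t_rep = []
for _ in range(n_draws):
    # random.gammavariate takes (shape, scale), so scale = 1 / rate.
    lam = rng.gammavariate(post_shape, 1.0 / post_rate)
    y_rep = [sample_poisson(lam, rng) for _ in range(n)]
    t_rep.append(count_zeros(y_rep))

p_b = sum(t >= t_obs for t in t_rep) / n_draws
print("observed zeros:", t_obs)
print("replicated zeros, min and max:", min(t_rep), max(t_rep))
print(f"estimated p_B = {p_b:.3f}")
```

A single Poisson rate has to explain both the zeros and the positive counts, so it settles on a compromise that produces far fewer zeros than observed. Rerunning the same check on a zero-inflated or overdispersed alternative would show whether the revision removes the discrepancy.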

A posterior predictive check is not the end of the analysis. It tells you where to go next.

  • If the model fails on a feature that matters, revise or expand the model.
  • If the model fails on a feature that does not matter, record the limitation and continue.
  • If several models pass the important checks, compare them using predictive scores or model probabilities.

This leads to the next chapter. Once we know how to inspect one model, we can ask how to compare several plausible models on a common scale.

Model evaluation starts with purpose. A model can be wrong and still useful, but its failures should not damage the conclusions we care about. Posterior predictive checking simulates replicated data from the fitted Bayesian model and compares targeted statistics between observed and replicated data. The most useful checks are specific: outliers, dependence, zero inflation, group variation, or another feature tied to the scientific or predictive goal.

  1. Why is “Is the model true?” usually the wrong question?
  2. What is the posterior predictive distribution?
  3. Why do we compare test statistics instead of full datasets?
  4. How is a posterior predictive p-value computed from simulations?
  5. What would a posterior predictive check reveal if a Normal model is fit to Cauchy-distributed data?