Prediction and Decision Theory

Source: Lecture 4, Predictions and decisions (BDA Ch. 2.5–2.6).

Earlier chapters focused on learning unknown parameters. We observed data $y$ and used Bayes’ theorem to obtain a posterior distribution

$$p(\theta \mid y).$$

This chapter asks what to do next.

There are two common next questions:

  1. Prediction: what future or unobserved data should we expect?
  2. Decision: what action should we take under uncertainty?

The two questions are connected. Prediction turns posterior uncertainty into uncertainty about observable quantities. Decision theory then adds preferences, costs, or losses so that uncertainty can guide an action.

Suppose we have observed data $y$ and want to predict a future observation $\tilde y$.

A tempting shortcut is to plug in one estimate of the parameter, for example the posterior mean, and use

$$p(\tilde y \mid \hat\theta).$$

That ignores uncertainty about $\theta$. Bayesian prediction keeps that uncertainty.

The key idea is:

Predict future data by averaging over all plausible parameter values.

Step 1: Predict If the Parameter Were Known

If $\theta$ were known, the model would give

$$p(\tilde y \mid \theta).$$

This describes future sampling variation conditional on one fixed parameter value.

Step 2: Use the Posterior for the Unknown Parameter

After observing $y$, uncertainty about $\theta$ is described by

$$p(\theta \mid y).$$

Values of $\theta$ with high posterior probability should contribute more to the prediction than values with low posterior probability.

Step 3: Average Over Parameter Uncertainty

The posterior predictive distribution is

$$p(\tilde y \mid y) = \int p(\tilde y \mid \theta)\,p(\theta \mid y)\,d\theta.$$

Equivalently,

$$p(\tilde y \mid y) = E_{\theta \mid y}[p(\tilde y \mid \theta)].$$

This formula has two layers:

  • $p(\tilde y \mid \theta)$ describes future noise.
  • $p(\theta \mid y)$ describes parameter uncertainty.

If the future data depend on the observed data even after conditioning on $\theta$, as in many time-series models, use

$$p(\tilde y \mid \theta, y)$$

inside the integral instead.

The integral is often easier to understand as an algorithm.

For $s = 1,\ldots,S$:

  1. Draw a parameter value:

    $$\theta^{(s)} \sim p(\theta \mid y).$$
  2. Draw a future observation conditional on that parameter:

    $$\tilde y^{(s)} \sim p(\tilde y \mid \theta^{(s)}).$$
  3. Use the simulated values $\tilde y^{(1)},\ldots,\tilde y^{(S)}$ to summarize the predictive distribution.

The output is not a posterior distribution for a parameter. It is a distribution for an observable future quantity.

Suppose

$$y_i \mid \theta \sim N(\theta,\sigma^2),$$

where $\sigma^2$ is known. Use a flat prior,

$$p(\theta) \propto 1.$$

Then the posterior for the mean is

$$\theta \mid y \sim N\left(\bar y,\frac{\sigma^2}{n}\right).$$

A future observation satisfies

$$\tilde y \mid \theta \sim N(\theta,\sigma^2).$$

One posterior predictive draw can be generated in two steps:

  1. Draw

    $$\theta^{(s)} \sim N\left(\bar y,\frac{\sigma^2}{n}\right).$$
  2. Draw

    $$\tilde y^{(s)} \sim N(\theta^{(s)},\sigma^2).$$

This creates predictive draws that include both uncertainty about the mean and ordinary observation-to-observation variation.
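The two steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function name `posterior_predictive_normal`, the data in `y_obs`, and the choice `sigma=1.0` are all made up for the example.

```python
import numpy as np

def posterior_predictive_normal(y, sigma, n_draws=10_000, seed=0):
    """Draw from p(y_tilde | y) for a normal model with known sigma and a flat prior."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    n = y.size
    # Step 1: theta^(s) ~ N(ybar, sigma^2 / n)
    theta = rng.normal(loc=y.mean(), scale=sigma / np.sqrt(n), size=n_draws)
    # Step 2: y_tilde^(s) ~ N(theta^(s), sigma^2), one draw per theta^(s)
    return rng.normal(loc=theta, scale=sigma)

y_obs = np.array([4.1, 5.3, 4.8, 5.9, 5.0])  # made-up data, n = 5
draws = posterior_predictive_normal(y_obs, sigma=1.0, n_draws=200_000)
print(draws.mean(), draws.var())
```

With these inputs the sample mean of the draws should sit near the sample mean of `y_obs`, and the sample variance should sit near $\sigma^2(1 + 1/n) = 1.2$ rather than $\sigma^2 = 1$.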

Write the simulated future observation as

$$\tilde y^{(s)} = \bar y+\varepsilon^{(s)}+\upsilon^{(s)},$$

where

$$\varepsilon^{(s)} \sim N\left(0,\frac{\sigma^2}{n}\right), \qquad \upsilon^{(s)} \sim N(0,\sigma^2),$$

independently.

The first error term is posterior uncertainty about $\theta$. The second is future sampling variation. Since independent normal variables add to another normal variable,

$$\tilde y \mid y \sim N\left(\bar y,\sigma^2\left(1+\frac{1}{n}\right)\right).$$

The predictive variance is larger than the variance of a single observation known to have mean $\theta$. It must also include uncertainty about the unknown mean.

If the posterior under an informative prior is

$$\theta \mid y \sim N(\mu_n,\tau_n^2),$$

and the sampling model for a future observation is still

$$\tilde y \mid \theta \sim N(\theta,\sigma^2),$$

then

$$\tilde y \mid y \sim N(\mu_n,\tau_n^2+\sigma^2).$$

The mean comes from the posterior mean of $\theta$:

$$E(\tilde y \mid y) = E(\theta \mid y)=\mu_n.$$

The variance comes from the law of total variance:

$$V(\tilde y \mid y) = E[V(\tilde y \mid \theta,y)\mid y] + V[E(\tilde y \mid \theta,y)\mid y].$$

In this model,

$$V(\tilde y \mid y) = \sigma^2+\tau_n^2.$$

For independent data, a future observation depends on $\theta$ but not directly on the observed sequence. Time series are different: even given $\theta$, the next value depends on the most recent observed values.

For example, an autoregressive model can be written as

$$y_t = \mu +\phi_1(y_{t-1}-\mu) +\cdots +\phi_p(y_{t-p}-\mu) +\varepsilon_t, \qquad \varepsilon_t \overset{\mathrm{iid}}{\sim} N(0,\sigma^2).$$

Here the parameter vector is

$$\theta=(\mu,\phi_1,\ldots,\phi_p,\sigma^2).$$

To simulate one future path:

  1. Draw

    $$\theta^{(s)} \sim p(\theta \mid y).$$
  2. Simulate the next value using the last observed values:

    $$\tilde y_{T+1}^{(s)} \sim p(y_{T+1}\mid y_T,\ldots,y_{T-p+1},\theta^{(s)}).$$
  3. Simulate the following value using the simulated value where needed:

    $$\tilde y_{T+2}^{(s)} \sim p(y_{T+2}\mid \tilde y_{T+1}^{(s)},y_T,\ldots,\theta^{(s)}).$$
  4. Continue recursively.

Each simulated path combines two sources of uncertainty:

  • uncertainty about the time-series parameters;
  • uncertainty about future shocks.
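The recursive scheme can be sketched as follows. This is only an illustration under stated assumptions: `simulate_ar_path` is a hypothetical helper, the series `y_hist` is made up, and the "posterior draws" are placeholders generated from arbitrary distributions, standing in for draws that would normally come from a fitted AR model.

```python
import numpy as np

def simulate_ar_path(y_hist, mu, phi, sigma, horizon, rng):
    """Simulate one future AR(p) path for a single parameter draw (mu, phi, sigma)."""
    phi = np.asarray(phi, dtype=float)
    p = phi.size
    # Most recent value first: [y_T, y_{T-1}, ..., y_{T-p+1}]
    recent = list(np.asarray(y_hist, dtype=float)[-p:][::-1])
    path = []
    for _ in range(horizon):
        mean = mu + phi @ (np.array(recent) - mu)
        y_next = mean + rng.normal(0.0, sigma)  # add a future shock
        path.append(y_next)
        # The simulated value becomes the newest "observation" for the next step
        recent = [y_next] + recent[:-1]
    return np.array(path)

rng = np.random.default_rng(1)
y_hist = np.array([1.2, 0.8, 1.5, 1.1, 0.9])  # made-up series

# Placeholder draws of (mu, phi_1, sigma) for an AR(1); NOT a real posterior.
S = 2000
mu_draws = rng.normal(1.0, 0.1, size=S)
phi_draws = rng.uniform(0.3, 0.7, size=(S, 1))
sigma_draws = np.abs(rng.normal(0.3, 0.05, size=S))

paths = np.array([
    simulate_ar_path(y_hist, mu_draws[s], phi_draws[s], sigma_draws[s], horizon=5, rng=rng)
    for s in range(S)
])
print(np.percentile(paths, [5, 50, 95], axis=0))  # predictive bands per horizon
```

Each row of `paths` uses its own parameter draw and its own shocks, so the percentile bands reflect both sources of uncertainty listed above.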

The same idea works even when the data-generating process has many steps.

In an auction-price model, a predictive simulation might:

  1. simulate the number of bidders;
  2. simulate each bidder’s valuation;
  3. simulate bids conditional on valuations;
  4. return the price implied by the auction mechanism.

The posterior predictive distribution is then built from many simulated auction outcomes.
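A toy version of such a simulation is sketched below. Every modeling choice here is an assumption made for illustration and is not taken from the lecture: a Poisson number of bidders (shifted so there are at least two), lognormal valuations, truthful bidding, and a second-price rule. The parameter draws are placeholders rather than a real posterior.

```python
import numpy as np

def simulate_auction_price(rate, mean_log_value, sd_log_value, rng):
    """Simulate one auction outcome for a single parameter draw (all choices hypothetical)."""
    # 1. Number of bidders: at least two, so a second-highest bid exists
    n_bidders = 2 + rng.poisson(rate)
    # 2. Each bidder's valuation
    values = rng.lognormal(mean_log_value, sd_log_value, size=n_bidders)
    # 3. Bids conditional on valuations: truthful bidding is assumed here
    bids = values
    # 4. Price implied by the mechanism: winner pays the second-highest bid
    return np.sort(bids)[-2]

rng = np.random.default_rng(2)
S = 5000
# Placeholder "posterior draws" of two parameters; NOT fitted to any data.
rate_draws = rng.gamma(shape=20.0, scale=0.2, size=S)
mean_log_draws = rng.normal(4.0, 0.1, size=S)

prices = np.array([
    simulate_auction_price(rate_draws[s], mean_log_draws[s], 0.5, rng)
    for s in range(S)
])
print(np.percentile(prices, [5, 50, 95]))
```

Swapping in a different auction mechanism would only change step 4; the surrounding loop over parameter draws stays the same.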

The lesson is not the details of auctions. The lesson is that Bayesian prediction follows the generative story of the model all the way to the observable quantity we want to predict.

Prediction describes uncertainty. A decision problem asks what to do with it.

For example:

  • Should we carry an umbrella?
  • Should a patient receive treatment?
  • Which forecast should be reported as a point estimate?

To answer such questions, we need more than probabilities. We also need to describe the consequences of actions.

A Bayesian decision problem has three ingredients.

First, there is an unknown state of nature:

$$\theta.$$

Second, there is a set of possible actions:

$$a \in \mathcal A.$$

Third, there is a utility or loss function. Utility is something to maximize:

$$U(a,\theta).$$

Loss is something to minimize:

$$L(a,\theta)=-U(a,\theta).$$

Suppose the state is either rainy or sunny, and the action is either carrying an umbrella or not.

| Action | Rainy | Sunny |
| --- | --- | --- |
| Umbrella | Loss 20 | Loss 10 |
| No umbrella | Loss 50 | Loss 0 |

If rain is very likely, the umbrella may have lower expected loss. If sun is very likely, leaving it at home may be better.

The action depends on both beliefs and losses.
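To make this concrete, the expected loss of each action can be computed for a given probability of rain. The sketch below uses the losses from the umbrella table; the function name `bayes_action` and the two example probabilities are chosen only for illustration.

```python
import numpy as np

# Rows are actions, columns are states (rainy, sunny); entries are losses from the table.
actions = ["umbrella", "no umbrella"]
loss = np.array([[20.0, 10.0],
                 [50.0, 0.0]])

def bayes_action(p_rain):
    """Return the action with the smallest expected loss, plus both expected losses."""
    probs = np.array([p_rain, 1.0 - p_rain])
    expected_loss = loss @ probs
    return actions[int(np.argmin(expected_loss))], expected_loss

print(bayes_action(0.10))  # expected losses are about 11 and 5: leave the umbrella
print(bayes_action(0.40))  # expected losses are about 14 and 20: carry the umbrella
```

With these losses the two actions tie when $10 + 10p = 50p$, that is at $p = 0.25$. Changing the loss table moves that threshold even if the forecast stays the same.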

After observing $y$, use the posterior distribution

$$p(\theta \mid y).$$

This distribution describes which states are plausible.

For each action $a$, compute its posterior expected utility:

$$E[U(a,\theta)\mid y] = \int U(a,\theta)\,p(\theta \mid y)\,d\theta.$$

Equivalently, compute posterior expected loss:

$$E[L(a,\theta)\mid y] = \int L(a,\theta)\,p(\theta \mid y)\,d\theta.$$

The Bayes action maximizes posterior expected utility:

$$a_{\mathrm{Bayes}} = \arg\max_{a\in\mathcal A} E[U(a,\theta)\mid y].$$

Equivalently, it minimizes posterior expected loss:

$$a_{\mathrm{Bayes}} = \arg\min_{a\in\mathcal A} E[L(a,\theta)\mid y].$$

If we have posterior draws

$$\theta^{(1)},\ldots,\theta^{(S)} \sim p(\theta \mid y),$$

then expected utility can be approximated by

$$E[U(a,\theta)\mid y] \approx \frac{1}{S}\sum_{s=1}^{S}U(a,\theta^{(s)}).$$

Compute this quantity for each action and choose the action with the largest value.
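The Monte Carlo recipe might look like the sketch below. The helper `mc_bayes_action`, the two actions, the utility function, and the Beta draws are all hypothetical and serve only to show the mechanics: average the utility over draws for each action, then take the maximum.

```python
import numpy as np

def mc_bayes_action(actions, utility, theta_draws):
    """Approximate E[U(a, theta) | y] by a sample mean for each action; return the best."""
    expected = {a: float(np.mean(utility(a, theta_draws))) for a in actions}
    best = max(expected, key=expected.get)
    return best, expected

def utility(action, theta):
    # Hypothetical utilities: treating yields a benefit of 10 * theta at a cost of 3;
    # not treating yields 0 whatever the value of theta.
    if action == "treat":
        return 10.0 * theta - 3.0
    return np.zeros_like(theta)

rng = np.random.default_rng(3)
theta_draws = rng.beta(8, 4, size=50_000)  # placeholder for draws from p(theta | y)

best, expected = mc_bayes_action(["treat", "do not treat"], utility, theta_draws)
print(best, expected)
```

Because the utility is passed in as a function, the same `theta_draws` can be reused with a different utility to answer a different decision question.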

Bayesian analysis separates two tasks.

First, do inference:

$$p(\theta \mid y).$$

Second, make the decision using a utility or loss function:

$$U(a,\theta) \quad \text{or} \quad L(a,\theta).$$

This separation is useful because the same posterior can support different decisions. A medical posterior, for example, could be used for treatment choice, patient communication, or future study design. The probabilities are the same, but the utilities may differ.

A point estimate is an action.

If the unknown quantity is $\theta$, reporting a number $a$ is a decision. Different loss functions lead to different optimal point estimates.

Under quadratic loss,

$$L(a,\theta)=(a-\theta)^2,$$

the posterior expected loss is minimized by the posterior mean:

$$a_{\mathrm{Bayes}}=E(\theta \mid y).$$
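A short derivation shows why, assuming the posterior variance is finite. The shorthand $m = E(\theta \mid y)$ is introduced here only for this derivation.

```latex
\begin{aligned}
E[(a-\theta)^2 \mid y]
  &= E[(a-m+m-\theta)^2 \mid y], \qquad m = E(\theta \mid y) \\
  &= (a-m)^2 + 2(a-m)\,E[m-\theta \mid y] + E[(\theta-m)^2 \mid y] \\
  &= (a-m)^2 + V(\theta \mid y).
\end{aligned}
```

The cross term vanishes because $E[m-\theta \mid y] = 0$. The posterior variance does not depend on $a$, so the expected loss is smallest when $a = m$.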

Quadratic loss penalizes large errors heavily, so the optimal estimate is the posterior's center of mass and can be pulled toward a long tail.

Under absolute loss,

$$L(a,\theta)=|a-\theta|,$$

the posterior expected loss is minimized by a posterior median.

Absolute loss treats distance linearly, so it is less sensitive to tails than quadratic loss.

For a discrete parameter space, zero-one loss is

$$L(a,\theta) = \begin{cases} 0, & a=\theta,\\ 1, & a\ne\theta. \end{cases}$$

The posterior expected loss is minimized by the posterior mode, also called the MAP estimate.

For continuous parameters, exact zero-one loss needs care because exact equality has probability zero. The MAP idea is still often used as a density-based summary.

Sometimes overestimation and underestimation have different costs. A lin-lin loss can be written as

$$L(a,\theta) = \begin{cases} c_1|\theta-a|, & a\le \theta,\\ c_2|a-\theta|, & a>\theta. \end{cases}$$

The Bayes action is a posterior quantile. With this convention, it is the quantile level

$$\frac{c_1}{c_1+c_2}.$$

If underestimating is more costly, $c_1$ is large and the chosen estimate moves upward.
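These results can be checked numerically by minimizing the Monte Carlo expected loss over a grid of candidate estimates. In the sketch below, the Gamma draws are a made-up, skewed stand-in for posterior draws, and `grid_bayes_estimate` and `linlin` are hypothetical helper names.

```python
import numpy as np

def grid_bayes_estimate(draws, loss, grid):
    """Return the grid point with the smallest average loss over the draws."""
    losses = [float(np.mean(loss(a, draws))) for a in grid]
    return float(grid[int(np.argmin(losses))])

def quadratic(a, theta):
    return (a - theta) ** 2

def absolute(a, theta):
    return np.abs(a - theta)

def linlin(c1, c2):
    # c1 prices underestimation (a <= theta), c2 prices overestimation (a > theta)
    return lambda a, theta: np.where(a <= theta, c1 * (theta - a), c2 * (a - theta))

rng = np.random.default_rng(0)
draws = rng.gamma(shape=2.0, scale=1.0, size=20_000)  # skewed stand-in for p(theta | y)
grid = np.linspace(0.0, 8.0, 801)

est_quadratic = grid_bayes_estimate(draws, quadratic, grid)
est_absolute = grid_bayes_estimate(draws, absolute, grid)
est_linlin = grid_bayes_estimate(draws, linlin(3.0, 1.0), grid)

print(est_quadratic, draws.mean())            # compare with the mean
print(est_absolute, np.median(draws))         # compare with the median
print(est_linlin, np.quantile(draws, 0.75))   # compare with the c1/(c1+c2) quantile
```

Each pair should agree up to the grid spacing. Because the stand-in distribution is right-skewed, the three estimates differ from one another, which is the practical point: the loss function changes the reported number.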

| Loss function | Bayes estimate |
| --- | --- |
| Quadratic loss $(a-\theta)^2$ | Posterior mean |
| Absolute loss $\lvert a-\theta\rvert$ | Posterior median |
| Zero-one loss, discrete case | Posterior mode |
| Asymmetric linear loss | Posterior quantile |

The phrase “best estimate” is incomplete until the loss function is named.

Posterior prediction averages future-data uncertainty over the posterior distribution of unknown parameters. This produces predictions that include both sampling variation and parameter uncertainty.

Decision theory adds actions and consequences. The Bayes action maximizes posterior expected utility or minimizes posterior expected loss. Point estimation is one special decision problem, so the posterior mean, median, mode, and quantiles are optimal under different losses.

  1. In the normal known-variance model, why is the predictive variance larger than $\sigma^2$?
  2. What are the two steps in simulating from a posterior predictive distribution?
  3. Why do probabilities alone not determine the best action?
  4. Which loss function leads to the posterior mean? Which leads to the posterior median?
  5. Why is a point estimate better understood as a decision than as a purely mathematical summary?