Prediction and Decision Theory

Source: Lecture 4, Predictions and decisions (BDA Ch. 2.5-2.6).

The Question of This Chapter

Earlier chapters focused on learning unknown parameters. We observed data $y$ and used Bayes’ theorem to obtain a posterior distribution

p(\theta \mid y).

This chapter asks what to do next.

There are two common next questions:

Prediction: what future or unobserved data should we expect?
Decision: what action should we take under uncertainty?

The two questions are connected. Prediction turns posterior uncertainty into uncertainty about observable quantities. Decision theory then adds preferences, costs, or losses so that uncertainty can guide an action.

Posterior Prediction

What Is It For?

Suppose we have observed data $y$ and want to predict a future observation $\tilde y$.

A tempting shortcut is to plug in one estimate of the parameter, for example the posterior mean, and use

p(\tilde y \mid \hat\theta).

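That shortcut treats $\hat\theta$ as if it were the true value, so it ignores uncertainty about $\theta$ and tends to understate predictive uncertainty. Bayesian prediction keeps that uncertainty.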

The key idea is:

Predict future data by averaging over all plausible parameter values.

Step 1: Predict If the Parameter Were Known

If $\theta$ were known, the model would give

p(\tilde y \mid \theta).

This describes future sampling variation conditional on one fixed parameter value.

Step 2: Use the Posterior for the Unknown Parameter

After observing $y$, uncertainty about $\theta$ is described by

p(\theta \mid y).

Values of $\theta$ with high posterior probability should contribute more to the prediction than values with low posterior probability.

Step 3: Average Over Parameter Uncertainty

The posterior predictive distribution is

p(\tilde y \mid y) = \int p(\tilde y \mid \theta)\,p(\theta \mid y)\,d\theta.

Equivalently,

p(\tilde y \mid y) = E_{\theta \mid y}[p(\tilde y \mid \theta)].

This formula has two layers:

$p(\tilde y \mid \theta)$ describes future noise.
$p(\theta \mid y)$ describes parameter uncertainty.

If the future data depend on the observed data even after conditioning on $\theta$, as in many time-series models, use

p(\tilde y \mid \theta, y)

inside the integral instead.
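Both forms come from one marginalization step. As a short derivation (a standard identity, written out here for completeness), treat $\theta$ as a nuisance variable in the joint distribution of $(\tilde y,\theta)$ given $y$ and factor the joint density:

p(\tilde y \mid y) = \int p(\tilde y,\theta \mid y)\,d\theta = \int p(\tilde y \mid \theta,y)\,p(\theta \mid y)\,d\theta.

When $\tilde y$ is conditionally independent of $y$ given $\theta$, we have $p(\tilde y \mid \theta,y)=p(\tilde y \mid \theta)$, and the expression reduces to the simpler posterior predictive formula.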

How to Simulate Predictions

The integral is often easier to understand as an algorithm.

For $s = 1,\ldots,S$:

Draw a parameter value:
$\theta^{(s)} \sim p(\theta \mid y).$
Draw a future observation conditional on that parameter:
$\tilde y^{(s)} \sim p(\tilde y \mid \theta^{(s)}).$
After the loop, use the simulated values $\tilde y^{(1)},\ldots,\tilde y^{(S)}$ to summarize the predictive distribution, for example through their mean, quantiles, or a histogram.

The output is not a posterior distribution for a parameter. It is a distribution for an observable future quantity.
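The algorithm can be sketched in a few lines of Python. This is a minimal illustration, not a general-purpose implementation: the function name `posterior_predictive_draws`, the two sampler arguments, and the toy posterior $N(2.0, 0.5^2)$ with unit observation noise are all assumptions made for the example.

```python
import numpy as np

def posterior_predictive_draws(draw_theta, draw_y_given_theta, n_draws, rng):
    """Return n_draws simulated future observations.

    draw_theta(rng) returns one draw from p(theta | y);
    draw_y_given_theta(theta, rng) returns one draw from p(y_tilde | theta).
    """
    draws = np.empty(n_draws)
    for s in range(n_draws):
        theta_s = draw_theta(rng)                    # step 1: parameter uncertainty
        draws[s] = draw_y_given_theta(theta_s, rng)  # step 2: future noise
    return draws

# Hypothetical example: theta | y ~ N(2.0, 0.5^2) and y_tilde | theta ~ N(theta, 1).
rng = np.random.default_rng(0)
y_tilde = posterior_predictive_draws(
    draw_theta=lambda rng: rng.normal(2.0, 0.5),
    draw_y_given_theta=lambda theta, rng: rng.normal(theta, 1.0),
    n_draws=20_000,
    rng=rng,
)

# Summaries of the predictive distribution for the observable quantity.
pred_mean = y_tilde.mean()
pred_interval = np.quantile(y_tilde, [0.025, 0.975])
```

Passing the two samplers as functions keeps the loop identical across models; only the way $\theta$ and $\tilde y$ are drawn changes.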

Example: Normal Model with Known Variance

The Setup

Suppose

y_i \mid \theta \sim N(\theta,\sigma^2),

where $\sigma^2$ is known and the $n$ observations are independent given $\theta$, with sample mean $\bar y$. Use a flat prior,

p(\theta) \propto 1.

Then the posterior for the mean is

\theta \mid y \sim N\left(\bar y,\frac{\sigma^2}{n}\right).

A future observation satisfies

\tilde y \mid \theta \sim N(\theta,\sigma^2).

The Simulation View

One posterior predictive draw can be generated in two steps:

Draw
$\theta^{(s)} \sim N\left(\bar y,\frac{\sigma^2}{n}\right).$
Draw
$\tilde y^{(s)} \sim N(\theta^{(s)},\sigma^2).$

This creates predictive draws that include both uncertainty about the mean and ordinary observation-to-observation variation.

How to Read the Formula

Write the simulated future observation as

\tilde y^{(s)} = \bar y+\varepsilon^{(s)}+\upsilon^{(s)},

where

\varepsilon^{(s)} \sim N\left(0,\frac{\sigma^2}{n}\right), \qquad \upsilon^{(s)} \sim N(0,\sigma^2),

independently.

The first error term is posterior uncertainty about $\theta$. The second is future sampling variation. Since the sum of independent normal variables is normal, with the variances adding,

\tilde y \mid y \sim N\left(\bar y,\sigma^2\left(1+\frac{1}{n}\right)\right).

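The predictive variance $\sigma^2(1+1/n)$ exceeds $\sigma^2$, the variance of a single observation whose mean $\theta$ is known, because it also includes the posterior uncertainty $\sigma^2/n$ about the unknown mean. As $n$ grows, that extra term shrinks toward zero, but the $\sigma^2$ term does not.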
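The two-step simulation and the closed-form variance can be checked against each other numerically. The sketch below uses made-up inputs ($\sigma=2$, $n=10$, $\bar y=5$); they are assumptions chosen only to make the comparison concrete.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical inputs: known sigma, and n observations with sample mean y_bar.
sigma, n, y_bar = 2.0, 10, 5.0
S = 200_000

theta = rng.normal(y_bar, sigma / np.sqrt(n), size=S)  # theta^(s) ~ N(y_bar, sigma^2 / n)
y_tilde = rng.normal(theta, sigma)                     # y_tilde^(s) ~ N(theta^(s), sigma^2)

simulated_var = y_tilde.var()
analytic_var = sigma**2 * (1 + 1 / n)                  # sigma^2 (1 + 1/n)

# Plug-in shortcut: fix theta at y_bar and ignore posterior uncertainty.
plugin_var = sigma**2
```

With these inputs `analytic_var` is 4.4 while `plugin_var` is 4.0; the simulated variance should land close to the former, up to Monte Carlo error.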

Informative Prior Version

If the posterior under an informative prior is

\theta \mid y \sim N(\mu_n,\tau_n^2),

and the sampling model for a future observation is still

\tilde y \mid \theta \sim N(\theta,\sigma^2),

then

\tilde y \mid y \sim N(\mu_n,\tau_n^2+\sigma^2).

The mean comes from the posterior mean of $\theta$ :

E(\tilde y \mid y) = E(\theta \mid y)=\mu_n.

The variance comes from the law of total variance:

V(\tilde y \mid y) = E[V(\tilde y \mid \theta,y)\mid y] + V[E(\tilde y \mid \theta,y)\mid y].

In this model,

V(\tilde y \mid y) = \sigma^2+\tau_n^2.
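As a hedged illustration, the sketch below computes $\mu_n$ and $\tau_n^2$ and then the predictive mean and variance. The text does not say how $\mu_n$ and $\tau_n^2$ are obtained; the code assumes the standard conjugate update for a prior $\theta \sim N(\mu_0,\tau_0^2)$, and the data and prior values are invented for the example.

```python
import numpy as np

def normal_posterior(y, sigma, mu0, tau0):
    """Conjugate update for theta ~ N(mu0, tau0^2), y_i | theta ~ N(theta, sigma^2)."""
    n = len(y)
    precision = 1 / tau0**2 + n / sigma**2       # posterior precision, 1 / tau_n^2
    tau_n2 = 1 / precision
    mu_n = tau_n2 * (mu0 / tau0**2 + np.sum(y) / sigma**2)
    return mu_n, tau_n2

# Hypothetical data and prior settings, for illustration only.
y = np.array([4.1, 5.3, 4.8, 6.0, 5.2])
sigma, mu0, tau0 = 1.0, 0.0, 10.0
mu_n, tau_n2 = normal_posterior(y, sigma, mu0, tau0)

# Law of total variance:
#   E[V(y_tilde | theta, y) | y] = sigma^2   (the conditional variance is constant)
#   V[E(y_tilde | theta, y) | y] = tau_n^2   (the conditional mean is theta itself)
pred_mean = mu_n
pred_var = sigma**2 + tau_n2

# Monte Carlo check of the same two quantities.
rng = np.random.default_rng(2)
theta = rng.normal(mu_n, np.sqrt(tau_n2), size=200_000)
y_tilde = rng.normal(theta, sigma)
```

The comments spell out which term of the law of total variance contributes $\sigma^2$ and which contributes $\tau_n^2$.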

Prediction for Time Series

What Changes?

For data that are conditionally independent given $\theta$, a future observation depends on $\theta$ but not directly on the observed sequence. Time series are different: the next value also depends directly on the most recent observed values.

For example, an autoregressive model can be written as

y_t = \mu +\phi_1(y_{t-1}-\mu) +\cdots +\phi_p(y_{t-p}-\mu) +\varepsilon_t, \qquad \varepsilon_t \overset{\mathrm{iid}}{\sim} N(0,\sigma^2).

Here the parameter vector is

\theta=(\mu,\phi_1,\ldots,\phi_p,\sigma^2).

The Forecasting Algorithm

To simulate one future path:

Draw
$\theta^{(s)} \sim p(\theta \mid y).$
Simulate the next value using the last observed values:
$\tilde y_{T+1}^{(s)} \sim p(y_{T+1}\mid y_T,\ldots,y_{T-p+1},\theta^{(s)}).$
Simulate the following value using the simulated value where needed:
$\tilde y_{T+2}^{(s)} \sim p(y_{T+2}\mid \tilde y_{T+1}^{(s)},y_T,\ldots,\theta^{(s)}).$
Continue recursively.

Each simulated path combines two sources of uncertainty:

uncertainty about the time-series parameters;
uncertainty about future shocks.
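A minimal sketch of this recursion for the AR($p$) model follows. It assumes posterior draws are already available as arrays; the function name `simulate_ar_paths`, the short series, and the stand-in "posterior draws" generated in the usage lines are hypothetical and do not come from a fitted model.

```python
import numpy as np

def simulate_ar_paths(y, mu, phi, sigma, horizon, rng):
    """Simulate one future path per posterior draw of an AR(p) model.

    y      : observed series, shape (T,), with T >= p
    mu     : posterior draws of the mean, shape (S,)
    phi    : posterior draws of the AR coefficients, shape (S, p); phi[:, 0] is phi_1
    sigma  : posterior draws of the shock standard deviation, shape (S,)
    Returns an array of shape (S, horizon).
    """
    S, p = phi.shape
    paths = np.empty((S, horizon))
    for s in range(S):
        # Most recent value first: [y_T, y_{T-1}, ..., y_{T-p+1}].
        recent = list(y[::-1][:p])
        for h in range(horizon):
            mean = mu[s] + np.dot(phi[s], np.array(recent) - mu[s])
            new = rng.normal(mean, sigma[s])     # draw the future shock
            paths[s, h] = new
            recent = [new] + recent[:-1]         # the simulated value feeds the next step
    return paths

# Usage with stand-in posterior draws (hypothetical numbers, p = 2).
rng = np.random.default_rng(3)
y = np.array([1.2, 0.8, 1.5, 1.1, 0.9])
S = 1000
mu = rng.normal(1.0, 0.1, size=S)
phi = rng.normal([0.5, 0.2], 0.05, size=(S, 2))
sigma = np.abs(rng.normal(0.3, 0.02, size=S))
paths = simulate_ar_paths(y, mu, phi, sigma, horizon=12, rng=rng)
bands = np.quantile(paths, [0.05, 0.5, 0.95], axis=0)  # pointwise forecast bands
```

Each row of `paths` uses its own parameter draw and its own shocks, so the spread across rows reflects both sources of uncertainty.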

Complex Predictive Models

The same idea works even when the data-generating process has many steps.

In an auction-price model, a predictive simulation might:

simulate the number of bidders;
simulate each bidder’s valuation;
simulate bids conditional on valuations;
return the price implied by the auction mechanism.

The posterior predictive distribution is then built from many simulated auction outcomes.
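The steps above can be sketched as a small simulator. Every modeling choice here is an assumption made for illustration, since the text does not specify the auction model: the number of bidders is two plus a Poisson count, valuations are lognormal, bidders bid their valuations, and the price is the second-highest bid. The "posterior draws" are stand-ins, not output from a fitted model.

```python
import numpy as np

def simulate_auction_price(lam, m, s, rng):
    """One simulated price, following the generative steps in order."""
    n_bidders = 2 + rng.poisson(lam)                          # number of bidders (assumed form)
    valuations = rng.lognormal(mean=m, sigma=s, size=n_bidders)
    bids = valuations                                         # truthful bidding (assumption)
    return np.sort(bids)[-2]                                  # second-price rule (assumption)

rng = np.random.default_rng(4)
S = 5000

# Stand-in posterior draws of (lam, m, s); a real analysis would take these from a fitted model.
lam_draws = rng.gamma(shape=20.0, scale=0.2, size=S)
m_draws = rng.normal(3.0, 0.05, size=S)
s_draws = np.abs(rng.normal(0.4, 0.02, size=S))

# One simulated auction per parameter draw.
prices = np.array([
    simulate_auction_price(lam_draws[i], m_draws[i], s_draws[i], rng)
    for i in range(S)
])
price_interval = np.quantile(prices, [0.05, 0.95])
```

The structure is the same two-layer idea as before: draw parameters from the posterior, then run the full generative process forward to the observable price.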

The lesson is not the details of auctions. The lesson is that Bayesian prediction follows the generative story of the model all the way to the observable quantity we want to predict.

Decision Theory

What Problem Does It Solve?

Prediction describes uncertainty. A decision problem asks what to do with it.

For example:

Should we carry an umbrella?
Should a patient receive treatment?
Which forecast should be reported as a point estimate?

To answer such questions, we need more than probabilities. We also need to describe the consequences of actions.

The Ingredients

A Bayesian decision problem has three ingredients.

First, there is an unknown state of nature:

\theta.

Second, there is a set of possible actions:

a \in \mathcal A.

Third, there is a utility or loss function. Utility is something to maximize:

U(a,\theta).

Loss is something to minimize:

L(a,\theta)=-U(a,\theta).

A Small Example

Suppose the state is either rainy or sunny, and the action is either carrying an umbrella or not.

Action	Rainy	Sunny
Umbrella	Loss 20	Loss 10
No umbrella	Loss 50	Loss 0

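With these losses and rain probability $p$, carrying the umbrella has expected loss $20p+10(1-p)=10+10p$, and going without has expected loss $50p$. The umbrella is the better action when $p>1/4$; when $p<1/4$, leaving it at home is better.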

The action depends on both beliefs and losses.
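The comparison can be written as a small expected-loss calculation. The sketch below takes the losses from the table; the rain probabilities 0.10 and 0.50 and the helper name `best_action` are illustrative assumptions.

```python
import numpy as np

# Rows: actions; columns: states (rainy, sunny). Values are the losses from the table.
actions = ["umbrella", "no umbrella"]
loss = np.array([[20.0, 10.0],
                 [50.0, 0.0]])

def best_action(p_rain):
    """Return the action with the smallest expected loss, and all expected losses."""
    probs = np.array([p_rain, 1.0 - p_rain])
    expected = loss @ probs                    # one expected loss per action
    return actions[int(np.argmin(expected))], expected

choice_low, losses_low = best_action(0.10)     # expected losses: 11.0 and 5.0
choice_high, losses_high = best_action(0.50)   # expected losses: 15.0 and 25.0
```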

The Bayesian Decision Rule

Step 1: Condition on the Data

After observing $y$, use the posterior distribution

p(\theta \mid y).

This distribution describes which states are plausible.

Step 2: Compute Expected Utility

For each action $a$, compute its posterior expected utility:

E[U(a,\theta)\mid y] = \int U(a,\theta)\,p(\theta \mid y)\,d\theta.

Equivalently, compute posterior expected loss:

E[L(a,\theta)\mid y] = \int L(a,\theta)\,p(\theta \mid y)\,d\theta.

Step 3: Choose the Best Action

The Bayes action maximizes posterior expected utility:

a_{\mathrm{Bayes}} = \arg\max_{a\in\mathcal A} E[U(a,\theta)\mid y].

Equivalently, it minimizes posterior expected loss:

a_{\mathrm{Bayes}} = \arg\min_{a\in\mathcal A} E[L(a,\theta)\mid y].

Simulation Version

If we have posterior draws

\theta^{(1)},\ldots,\theta^{(S)} \sim p(\theta \mid y),

then expected utility can be approximated by

E[U(a,\theta)\mid y] \approx \frac{1}{S}\sum_{s=1}^{S}U(a,\theta^{(s)}).

Compute this quantity for each action and choose the action with the largest value.
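A minimal sketch of this Monte Carlo rule is shown below. The treatment example is hypothetical: $\theta$ is taken to be the benefit of treatment, the utility of treating is $\theta$ minus a fixed cost of 0.5, the utility of not treating is zero, and the posterior draws are stand-ins.

```python
import numpy as np

def bayes_action(actions, utility, theta_draws):
    """Pick the action with the largest Monte Carlo estimate of expected utility."""
    expected = {a: float(np.mean(utility(a, theta_draws))) for a in actions}
    return max(expected, key=expected.get), expected

# Hypothetical utility: treating yields benefit theta minus a fixed cost.
def utility(action, theta, cost=0.5):
    return theta - cost if action == "treat" else np.zeros_like(theta)

rng = np.random.default_rng(5)
theta_draws = rng.normal(1.0, 0.8, size=50_000)   # stand-in for draws from p(theta | y)
choice, expected = bayes_action(["treat", "do not treat"], utility, theta_draws)
```

The same `theta_draws` can be reused with a different `utility` function, which is the practical payoff of keeping inference and decision separate.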

The Separation Principle

Bayesian analysis separates two tasks.

First, do inference:

p(\theta \mid y).

Second, make the decision using a utility or loss function:

U(a,\theta) \quad \text{or} \quad L(a,\theta).

This separation is useful because the same posterior can support different decisions. A medical posterior, for example, could be used for treatment choice, patient communication, or future study design. The probabilities are the same, but the utilities may differ.

Point Estimation as a Decision Problem

The Main Idea

A point estimate is an action.

If the unknown quantity is $\theta$, reporting a number $a$ is a decision. Different loss functions lead to different optimal point estimates.

Quadratic Loss

Under quadratic loss,

L(a,\theta)=(a-\theta)^2,

the posterior expected loss is minimized by the posterior mean:

a_{\mathrm{Bayes}}=E(\theta \mid y).

Quadratic loss penalizes large errors heavily, so the optimal estimate is the posterior's center of mass, and a long tail can pull it noticeably away from the bulk of the distribution.
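The result follows from a short decomposition. As a sketch, write $m=E(\theta \mid y)$ (shorthand introduced here) and expand the square around $m$:

E[(a-\theta)^2 \mid y] = E[(\theta-m)^2 \mid y] + 2(m-a)\,E[\theta-m \mid y] + (m-a)^2 = V(\theta \mid y) + (a-m)^2.

The cross term vanishes because $E[\theta-m \mid y]=0$. The posterior variance does not depend on $a$, and $(a-m)^2$ is zero only at $a=m$, so the posterior mean minimizes the expected loss and the minimum value is the posterior variance.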

Absolute Loss

Under absolute loss,

L(a,\theta)=|a-\theta|,

the posterior expected loss is minimized by a posterior median.

Absolute loss grows only linearly with the size of the error, so the resulting estimate is less sensitive to the tails of the posterior than the mean is.

Zero-One Loss

For a discrete parameter space, zero-one loss is

L(a,\theta) = \begin{cases} 0, & a=\theta,\\ 1, & a\ne\theta. \end{cases}

The posterior expected loss is minimized by the posterior mode, also called the MAP estimate.

For continuous parameters, exact zero-one loss needs care because exact equality has probability zero. The MAP idea is still often used as a density-based summary.

Asymmetric Linear Loss

Sometimes overestimation and underestimation have different costs. A lin-lin loss can be written as

L(a,\theta) = \begin{cases} c_1|\theta-a|, & a\le \theta,\\ c_2|a-\theta|, & a>\theta. \end{cases}

The Bayes action is a posterior quantile. With this convention, it is the quantile at level

\frac{c_1}{c_1+c_2}.

If underestimating is more costly, then $c_1>c_2$, the quantile level exceeds $1/2$, and the chosen estimate moves above the posterior median.
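A sketch of why a quantile appears, assuming the posterior has a continuous distribution function $F(a)=P(\theta \le a \mid y)$ (notation introduced here). The posterior expected loss is

E[L(a,\theta)\mid y] = c_1\int_a^{\infty}(\theta-a)\,p(\theta \mid y)\,d\theta + c_2\int_{-\infty}^{a}(a-\theta)\,p(\theta \mid y)\,d\theta.

Differentiating with respect to $a$ and setting the derivative to zero gives

-c_1\,[1-F(a)] + c_2\,F(a) = 0 \quad\Longrightarrow\quad F(a) = \frac{c_1}{c_1+c_2}.

Setting $c_1=c_2$ gives $F(a)=1/2$, which recovers the posterior median from the absolute-loss case.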

Summary Table

Loss function	Bayes estimate
Quadratic loss $(a-\theta)^2$	Posterior mean
Absolute loss $|a-\theta|$	Posterior median
Zero-one loss, discrete case	Posterior mode
Asymmetric linear loss	Posterior quantile

The phrase “best estimate” is incomplete until the loss function is named.
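The table can be checked numerically by minimizing a Monte Carlo estimate of expected loss over a grid of candidate estimates. The sketch below is illustrative only: the right-skewed gamma draws stand in for a posterior, and the lin-lin costs $c_1=3$, $c_2=1$ are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(6)

# Stand-in posterior draws from a right-skewed distribution (hypothetical).
theta = rng.gamma(shape=2.0, scale=1.5, size=20_000)
grid = np.linspace(0.0, 15.0, 1501)                 # candidate point estimates a

def expected_loss(loss_fn):
    """Monte Carlo posterior expected loss for every candidate in the grid."""
    return np.array([loss_fn(a, theta).mean() for a in grid])

def linlin(a, th, c1=3.0, c2=1.0):
    # c1 prices underestimation (a <= theta), c2 prices overestimation (a > theta).
    return np.where(a <= th, c1 * (th - a), c2 * (a - th))

best_quadratic = grid[np.argmin(expected_loss(lambda a, th: (a - th) ** 2))]
best_absolute = grid[np.argmin(expected_loss(lambda a, th: np.abs(a - th)))]
best_linlin = grid[np.argmin(expected_loss(linlin))]

# Closed-form counterparts computed directly from the draws.
post_mean = theta.mean()
post_median = np.median(theta)
post_q75 = np.quantile(theta, 3.0 / (3.0 + 1.0))    # level c1 / (c1 + c2) = 0.75
```

Up to the grid spacing, the three grid minimizers should match the mean, median, and 0.75 quantile of the draws. Because the stand-in posterior is skewed, the three estimates differ from one another.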

Chapter Summary

Posterior prediction averages future-data uncertainty over the posterior distribution of unknown parameters. This produces predictions that include both sampling variation and parameter uncertainty.

Decision theory adds actions and consequences. The Bayes action maximizes posterior expected utility or minimizes posterior expected loss. Point estimation is one special decision problem, so the posterior mean, median, mode, and quantiles are optimal under different losses.

Check Your Understanding

In the normal known-variance model, why is the predictive variance larger than $\sigma^2$?
What are the two steps in simulating from a posterior predictive distribution?
Why do probabilities alone not determine the best action?
Which loss function leads to the posterior mean? Which leads to the posterior median?
Why is a point estimate better understood as a decision than as a purely mathematical summary?