Prediction and Decision Theory
Source: Lecture 4, Predictions and decisions (BDA Ch. 2.5–2.6).
The Question of This Chapter
Earlier chapters focused on learning unknown parameters. We observed data $y$ and used Bayes’ theorem to obtain a posterior distribution
$$p(\theta \mid y) \propto p(y \mid \theta)\, p(\theta).$$
This chapter asks what to do next.
There are two common next questions:
- Prediction: what future or unobserved data should we expect?
- Decision: what action should we take under uncertainty?
The two questions are connected. Prediction turns posterior uncertainty into uncertainty about observable quantities. Decision theory then adds preferences, costs, or losses so that uncertainty can guide an action.
Posterior Prediction
What Is It For?
Suppose we have observed data $y$ and want to predict a future observation $\tilde{y}$.
A tempting shortcut is to plug in one estimate $\hat{\theta}$ of the parameter, for example the posterior mean, and use
$$p(\tilde{y} \mid \hat{\theta}).$$
That ignores uncertainty about $\theta$. Bayesian prediction keeps that uncertainty.
The key idea is:
Predict future data by averaging over all plausible parameter values.
Step 1: Predict If the Parameter Were Known
If $\theta$ were known, the model would give
$$p(\tilde{y} \mid \theta).$$
This describes future sampling variation conditional on one fixed parameter value.
Step 2: Use the Posterior for the Unknown Parameter
After observing $y$, uncertainty about $\theta$ is described by
$$p(\theta \mid y).$$
Values of $\theta$ with high posterior probability should contribute more to the prediction than values with low posterior probability.
Step 3: Average Over Parameter Uncertainty
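The posterior predictive distribution is
$$p(\tilde{y} \mid y) = \int p(\tilde{y} \mid \theta)\, p(\theta \mid y)\, d\theta.$$
Equivalently,
$$p(\tilde{y} \mid y) = \mathrm{E}_{\theta \mid y}\left[ p(\tilde{y} \mid \theta) \right].$$
This formula has two layers:
- $p(\tilde{y} \mid \theta)$ describes future noise.
- $p(\theta \mid y)$ describes parameter uncertainty.
If the future data depend on the observed data even after conditioning on $\theta$, as in many time-series models, use
$$p(\tilde{y} \mid \theta, y)$$
inside the integral instead.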
How to Simulate Predictions
The integral is often easier to understand as an algorithm.
For $s = 1, \dots, S$:
1. Draw a parameter value: $\theta^{(s)} \sim p(\theta \mid y)$.
2. Draw a future observation conditional on that parameter: $\tilde{y}^{(s)} \sim p(\tilde{y} \mid \theta^{(s)})$.
Then use the simulated values $\tilde{y}^{(1)}, \dots, \tilde{y}^{(S)}$ to summarize the predictive distribution.
The output is not a posterior distribution for a parameter. It is a distribution for an observable future quantity.
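In formula form, the simulation replaces the integral with an average over draws. The following is a sketch of the standard Monte Carlo approximations; the summary function $h$ is notation introduced here for illustration and does not come from the lecture.

```latex
p(\tilde{y} \mid y) \approx \frac{1}{S} \sum_{s=1}^{S} p(\tilde{y} \mid \theta^{(s)}),
\qquad
\mathrm{E}\left[ h(\tilde{y}) \mid y \right] \approx \frac{1}{S} \sum_{s=1}^{S} h\!\left(\tilde{y}^{(s)}\right),
\qquad
\theta^{(s)} \sim p(\theta \mid y), \quad \tilde{y}^{(s)} \sim p(\tilde{y} \mid \theta^{(s)}).
```

Choosing $h(\tilde{y}) = \tilde{y}$ gives the predictive mean, and choosing an indicator such as $h(\tilde{y}) = \mathbf{1}\{\tilde{y} > c\}$ gives a predictive probability.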
Example: Normal Model with Known Variance
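The Setup
Suppose
$$y_1, \dots, y_n \mid \mu \overset{\text{iid}}{\sim} N(\mu, \sigma^2),$$
where $\sigma^2$ is known. Use a flat prior,
$$p(\mu) \propto 1.$$
Then the posterior for the mean is
$$\mu \mid y \sim N\!\left(\bar{y}, \frac{\sigma^2}{n}\right).$$
A future observation satisfies
$$\tilde{y} \mid \mu \sim N(\mu, \sigma^2).$$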
The Simulation View
One posterior predictive draw can be generated in two steps:
1. Draw $\mu^{(s)} \sim N(\bar{y}, \sigma^2 / n)$.
2. Draw $\tilde{y}^{(s)} \sim N(\mu^{(s)}, \sigma^2)$.
This creates predictive draws that include both uncertainty about the mean and ordinary observation-to-observation variation.
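The two steps can be sketched in Python with NumPy. This is a minimal sketch under stated assumptions: the data values, the known standard deviation `sigma`, and the number of draws `S` are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative inputs (assumptions, not values from the lecture).
y = np.array([4.1, 5.3, 4.8, 5.9, 5.0])
sigma = 1.0      # known observation standard deviation
S = 100_000      # number of predictive draws

n = y.size
y_bar = y.mean()

# Step 1: draw the mean from its flat-prior posterior N(y_bar, sigma^2 / n).
mu_draws = rng.normal(loc=y_bar, scale=sigma / np.sqrt(n), size=S)

# Step 2: draw one future observation for each posterior draw of the mean.
y_tilde = rng.normal(loc=mu_draws, scale=sigma)

# Summarize the predictive distribution.
pred_mean = y_tilde.mean()
pred_interval = np.quantile(y_tilde, [0.025, 0.975])
print(pred_mean, pred_interval)
```

Each element of `y_tilde` uses its own draw of the mean, which is what carries parameter uncertainty into the prediction.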
How to Read the Formula
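Write the simulated future observation as
$$\tilde{y} = \bar{y} + \epsilon_{\mu} + \epsilon_{y},$$
where
$$\epsilon_{\mu} \sim N\!\left(0, \frac{\sigma^2}{n}\right), \qquad \epsilon_{y} \sim N(0, \sigma^2)$$
independently.
The first error term is posterior uncertainty about $\mu$. The second is future sampling variation. Since independent normal variables add to another normal variable,
$$\tilde{y} \mid y \sim N\!\left(\bar{y}, \sigma^2\left(1 + \frac{1}{n}\right)\right).$$
The predictive variance is larger than $\sigma^2$, the variance of a single observation known to have mean $\mu$. It must also include uncertainty about the unknown mean.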
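To see how much the extra variance matters, one can compare a plug-in interval with the predictive interval. The sketch below assumes the flat-prior result that the predictive standard deviation is $\sigma\sqrt{1 + 1/n}$; the numbers for `y_bar`, `sigma`, and `n` are illustrative assumptions.

```python
import math
from statistics import NormalDist

# Illustrative values (assumptions).
y_bar, sigma, n = 5.0, 1.0, 5

# Plug-in: treats the mean as known and equal to y_bar.
plug_in = NormalDist(mu=y_bar, sigma=sigma)
# Predictive: adds posterior uncertainty about the mean.
predictive = NormalDist(mu=y_bar, sigma=sigma * math.sqrt(1 + 1 / n))

def central_interval(dist, level=0.95):
    """Return the central interval of a normal distribution."""
    tail = (1 - level) / 2
    return dist.inv_cdf(tail), dist.inv_cdf(1 - tail)

plug_lo, plug_hi = central_interval(plug_in)
pred_lo, pred_hi = central_interval(predictive)

print(f"plug-in:    ({plug_lo:.2f}, {plug_hi:.2f})")
print(f"predictive: ({pred_lo:.2f}, {pred_hi:.2f})")
```

The predictive interval is wider by the factor $\sqrt{1 + 1/n}$, so the gap shrinks as $n$ grows but never reverses.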
Informative Prior Version
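If the posterior under an informative prior is
$$\mu \mid y \sim N(\mu_n, \tau_n^2)$$
and the sampling model for a future observation is still
$$\tilde{y} \mid \mu \sim N(\mu, \sigma^2),$$
then
$$\tilde{y} \mid y \sim N(\mu_n, \sigma^2 + \tau_n^2).$$
The mean comes from the posterior mean of $\mu$:
$$\mathrm{E}(\tilde{y} \mid y) = \mathrm{E}\left[ \mathrm{E}(\tilde{y} \mid \mu, y) \mid y \right] = \mathrm{E}(\mu \mid y) = \mu_n.$$
The variance comes from the law of total variance:
$$\mathrm{Var}(\tilde{y} \mid y) = \mathrm{E}\left[ \mathrm{Var}(\tilde{y} \mid \mu, y) \mid y \right] + \mathrm{Var}\left[ \mathrm{E}(\tilde{y} \mid \mu, y) \mid y \right].$$
In this model,
$$\mathrm{Var}(\tilde{y} \mid y) = \mathrm{E}(\sigma^2 \mid y) + \mathrm{Var}(\mu \mid y) = \sigma^2 + \tau_n^2.$$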
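The variance decomposition can be checked numerically. The sketch below uses the standard conjugate update for a normal mean with known variance to obtain the posterior mean `mu_n` and variance `tau_n2`; the prior values `mu_0` and `tau_0`, the data, and `sigma` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative inputs (assumptions).
sigma = 2.0                 # known observation standard deviation
mu_0, tau_0 = 0.0, 3.0      # prior mean and standard deviation for mu
y = np.array([1.2, 3.4, 2.2, 2.9])
n, y_bar = y.size, y.mean()

# Conjugate update: precisions add, and the mean is precision-weighted.
tau_n2 = 1 / (1 / tau_0**2 + n / sigma**2)
mu_n = tau_n2 * (mu_0 / tau_0**2 + n * y_bar / sigma**2)

# Two-step predictive simulation.
S = 200_000
mu_draws = rng.normal(mu_n, np.sqrt(tau_n2), size=S)
y_tilde = rng.normal(mu_draws, sigma)

# Law of total variance: E[Var(y~ | mu)] + Var(E[y~ | mu]).
expected_cond_var = sigma**2         # Var(y~ | mu) equals sigma^2 for every mu
var_of_cond_mean = mu_draws.var()    # Monte Carlo estimate of tau_n^2

print(y_tilde.var(), expected_cond_var + var_of_cond_mean, sigma**2 + tau_n2)
```

The three printed numbers should agree up to Monte Carlo error.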
Prediction for Time Series
What Changes?
For independent data, a future observation depends on $\theta$ but not directly on the observed sequence. Time series are different. The next value depends on the most recent values.
For example, an autoregressive model of order $p$ can be written as
$$y_t = c + \phi_1 y_{t-1} + \dots + \phi_p y_{t-p} + \epsilon_t, \qquad \epsilon_t \sim N(0, \sigma^2).$$
Here the parameter vector is
$$\theta = (c, \phi_1, \dots, \phi_p, \sigma^2).$$
The Forecasting Algorithm
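To simulate one future path after observing $y_1, \dots, y_T$:
1. Draw $\theta^{(s)} \sim p(\theta \mid y)$.
2. Simulate the next value using the last observed values: $\tilde{y}_{T+1}^{(s)} \sim p(y_{T+1} \mid y_T, y_{T-1}, \dots, \theta^{(s)})$.
3. Simulate the following value using the simulated value where needed: $\tilde{y}_{T+2}^{(s)} \sim p(y_{T+2} \mid \tilde{y}_{T+1}^{(s)}, y_T, \dots, \theta^{(s)})$.
4. Continue recursively.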
Each simulated path combines two sources of uncertainty:
- uncertainty about the time-series parameters;
- uncertainty about future shocks.
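The recursion can be sketched as follows. This is a minimal sketch under stated assumptions: the model is an AR(2) with intercept, the observed series is hypothetical, and the "posterior draws" are stand-ins generated from normal distributions rather than output from a fitted model. The function name `simulate_path` is introduced here for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

def simulate_path(history, c, phi, sigma, horizon, rng):
    """Simulate one future AR(p) path given one parameter draw.

    phi[0] multiplies the most recent value, phi[1] the one before, and so on.
    """
    p = len(phi)
    values = list(history[-p:])          # last p observed values
    path = []
    for _ in range(horizon):
        recent = values[::-1][:p]        # most recent value first
        mean = c + float(np.dot(phi, recent))
        new_value = mean + rng.normal(0.0, sigma)   # add a future shock
        path.append(new_value)
        values.append(new_value)         # simulated values feed later steps
    return np.array(path)

# Hypothetical observed series and stand-in posterior draws.
y_obs = np.array([1.0, 1.4, 1.1, 0.9, 1.3, 1.2])
S, horizon = 2000, 5
c_draws = rng.normal(0.5, 0.05, size=S)
phi_draws = rng.normal([0.4, 0.2], 0.05, size=(S, 2))
sigma_draws = np.abs(rng.normal(0.3, 0.02, size=S))

# One simulated path per parameter draw.
paths = np.array([
    simulate_path(y_obs, c_draws[s], phi_draws[s], sigma_draws[s], horizon, rng)
    for s in range(S)
])

# Pointwise 90% predictive bands for each forecast step.
lower, median, upper = np.quantile(paths, [0.05, 0.5, 0.95], axis=0)
print(np.round(median, 2))
```

Because each path uses a different parameter draw and fresh shocks, the spread of `paths` at each horizon reflects both sources of uncertainty.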
Complex Predictive Models
The same idea works even when the data-generating process has many steps.
In an auction-price model, a predictive simulation might:
- simulate the number of bidders;
- simulate each bidder’s valuation;
- simulate bids conditional on valuations;
- return the price implied by the auction mechanism.
The posterior predictive distribution is then built from many simulated auction outcomes.
The lesson is not the details of auctions. The lesson is that Bayesian prediction follows the generative story of the model all the way to the observable quantity we want to predict.
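As a rough illustration of following a generative story to the end, here is a sketch of an auction simulation. Every modeling choice is an assumption made for this example, not the lecture's model: a Poisson number of bidders, lognormal valuations, truthful bidding, and a second-price rule. The "posterior draws" are stand-ins.

```python
import numpy as np

rng = np.random.default_rng(3)

def simulate_auction_price(rate, mu, sd, rng):
    """Simulate one auction price from one parameter draw (illustrative model)."""
    n_bidders = 2 + rng.poisson(rate)                 # at least two bidders
    valuations = rng.lognormal(mean=mu, sigma=sd, size=n_bidders)
    bids = valuations                                 # assume truthful bidding
    return np.sort(bids)[-2]                          # winner pays the second-highest bid

# Stand-in posterior draws for (rate, mu, sd); a real analysis would use fitted output.
S = 5000
rate_draws = rng.gamma(shape=20.0, scale=0.2, size=S)
mu_draws = rng.normal(3.0, 0.1, size=S)
sd_draws = np.abs(rng.normal(0.5, 0.05, size=S))

# One simulated auction per parameter draw.
prices = np.array([
    simulate_auction_price(rate_draws[s], mu_draws[s], sd_draws[s], rng)
    for s in range(S)
])
print(np.quantile(prices, [0.1, 0.5, 0.9]))
```

The array `prices` is a sample from the posterior predictive distribution of the price under these assumptions, even though that distribution has no simple closed form.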
Decision Theory
What Problem Does It Solve?
Prediction describes uncertainty. A decision problem asks what to do with it.
For example:
- Should we carry an umbrella?
- Should a patient receive treatment?
- Which forecast should be reported as a point estimate?
To answer such questions, we need more than probabilities. We also need to describe the consequences of actions.
The Ingredients
A Bayesian decision problem has three ingredients.
First, there is an unknown state of nature:
$$\theta \in \Theta.$$
Second, there is a set of possible actions:
$$a \in \mathcal{A}.$$
Third, there is a utility or loss function. Utility is something to maximize:
$$U(a, \theta).$$
Loss is something to minimize:
$$L(a, \theta).$$
A Small Example
Suppose the state is either rainy or sunny, and the action is either carrying an umbrella or not.
| Action | Rainy | Sunny |
|---|---|---|
| Umbrella | Loss 20 | Loss 10 |
| No umbrella | Loss 50 | Loss 0 |
If rain is very likely, the umbrella may have lower expected loss. If sun is very likely, leaving it at home may be better.
The action depends on both beliefs and losses.
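The umbrella table can be turned into a small calculation. The sketch below uses the losses from the table; the rain probabilities tried in the loop are illustrative assumptions.

```python
# Loss table from the text: outer keys are actions, inner keys are states.
losses = {
    "umbrella":    {"rainy": 20, "sunny": 10},
    "no umbrella": {"rainy": 50, "sunny": 0},
}

def expected_losses(p_rain):
    """Expected loss of each action when P(rainy) = p_rain."""
    return {
        action: p_rain * row["rainy"] + (1 - p_rain) * row["sunny"]
        for action, row in losses.items()
    }

def best_action(p_rain):
    """Action with the smallest expected loss (first listed action wins ties)."""
    exp = expected_losses(p_rain)
    return min(exp, key=exp.get)

for p in (0.1, 0.25, 0.6):
    print(p, expected_losses(p), best_action(p))
```

With these losses the two actions tie at a rain probability of 0.25, since $20p + 10(1 - p) = 50p$ gives $p = 0.25$. Above that value the umbrella has the lower expected loss.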
The Bayesian Decision Rule
Step 1: Condition on the Data
After observing $y$, use the posterior distribution
$$p(\theta \mid y).$$
This distribution describes which states are plausible.
Step 2: Compute Expected Utility
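For each action $a$, compute its posterior expected utility:
$$\mathrm{E}\left[ U(a, \theta) \mid y \right] = \int U(a, \theta)\, p(\theta \mid y)\, d\theta.$$
Equivalently, compute posterior expected loss:
$$\mathrm{E}\left[ L(a, \theta) \mid y \right] = \int L(a, \theta)\, p(\theta \mid y)\, d\theta.$$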
Step 3: Choose the Best Action
The Bayes action maximizes posterior expected utility:
$$a^* = \arg\max_{a} \mathrm{E}\left[ U(a, \theta) \mid y \right].$$
Equivalently, it minimizes posterior expected loss:
$$a^* = \arg\min_{a} \mathrm{E}\left[ L(a, \theta) \mid y \right].$$
Simulation Version
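If we have posterior draws
$$\theta^{(1)}, \dots, \theta^{(S)} \sim p(\theta \mid y),$$
then expected utility can be approximated by
$$\mathrm{E}\left[ U(a, \theta) \mid y \right] \approx \frac{1}{S} \sum_{s=1}^{S} U\!\left(a, \theta^{(s)}\right).$$
Compute this quantity for each action and choose the action with the largest value.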
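A sketch of the simulation version for a two-action problem follows. The treatment-effect posterior, the action names, and the utility function are all hypothetical choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)

# Stand-in posterior draws for a treatment effect theta (assumption).
theta_draws = rng.normal(loc=0.8, scale=1.0, size=50_000)

def utility(action, theta):
    """Hypothetical utilities: treating yields the effect minus a fixed cost."""
    if action == "treat":
        return theta - 0.5
    return np.zeros_like(theta)      # "wait" has utility 0 in every state

# Average the utility over posterior draws for each action.
actions = ["treat", "wait"]
expected_utility = {a: utility(a, theta_draws).mean() for a in actions}

# The Bayes action has the largest estimated expected utility.
bayes_action = max(expected_utility, key=expected_utility.get)
print(expected_utility, bayes_action)
```

The same `theta_draws` could be reused with a different `utility` function, which is the practical payoff of keeping inference and decision separate.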
The Separation Principle
Bayesian analysis separates two tasks.
First, do inference:
$$p(\theta \mid y) \propto p(y \mid \theta)\, p(\theta).$$
Second, make the decision using a utility or loss function:
$$a^* = \arg\max_{a} \mathrm{E}\left[ U(a, \theta) \mid y \right].$$
This separation is useful because the same posterior can support different decisions. A medical posterior, for example, could be used for treatment choice, patient communication, or future study design. The probabilities are the same, but the utilities may differ.
Point Estimation as a Decision Problem
The Main Idea
A point estimate is an action.
If the unknown quantity is $\theta$, reporting a number $a$ is a decision. Different loss functions lead to different optimal point estimates.
Quadratic Loss
Under quadratic loss,
$$L(a, \theta) = (a - \theta)^2,$$
the posterior expected loss is minimized by the posterior mean:
$$a^* = \mathrm{E}(\theta \mid y).$$
Quadratic loss punishes large errors strongly, so the optimal estimate sits at the center of mass of the posterior and is pulled toward long tails.
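A short derivation shows why the mean is optimal under quadratic loss. It is a sketch of the standard argument; the shorthand $m = \mathrm{E}(\theta \mid y)$ is notation introduced here.

```latex
\begin{aligned}
\mathrm{E}\left[ (a - \theta)^2 \mid y \right]
  &= \mathrm{E}\left[ (a - m)^2 + 2 (a - m)(m - \theta) + (m - \theta)^2 \mid y \right] \\
  &= (a - m)^2 + 2 (a - m)\, \mathrm{E}\left[ m - \theta \mid y \right] + \mathrm{Var}(\theta \mid y) \\
  &= (a - m)^2 + \mathrm{Var}(\theta \mid y).
\end{aligned}
```

The middle term vanishes because $\mathrm{E}(m - \theta \mid y) = 0$. The posterior variance does not depend on $a$, so the expected loss is smallest when $a = m$.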
Absolute Loss
Under absolute loss,
$$L(a, \theta) = |a - \theta|,$$
the posterior expected loss is minimized by a posterior median.
Absolute loss treats distance linearly, so it is less sensitive to tails than quadratic loss.
Zero-One Loss
For a discrete parameter space, zero-one loss is
$$L(a, \theta) = \begin{cases} 0, & a = \theta, \\ 1, & a \neq \theta. \end{cases}$$
The posterior expected loss equals $1 - \Pr(\theta = a \mid y)$, so it is minimized by the posterior mode, also called the MAP estimate.
For continuous parameters, exact zero-one loss needs care because exact equality has probability zero. The MAP idea is still often used as a density-based summary.
Asymmetric Linear Loss
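Sometimes overestimation and underestimation have different costs. A lin-lin loss can be written as
$$L(a, \theta) = \begin{cases} c_u (\theta - a), & a < \theta \ \text{(underestimate)}, \\ c_o (a - \theta), & a \ge \theta \ \text{(overestimate)}, \end{cases}$$
with costs $c_u, c_o > 0$.
The Bayes action is a posterior quantile. With this convention, it is the quantile level
$$q = \frac{c_u}{c_u + c_o}.$$
If underestimating is more costly ($c_u > c_o$), $q$ exceeds $1/2$ and the chosen estimate moves upward.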
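The quantile result can be checked by brute force. The sketch below assumes per-unit costs `c_under` for underestimation and `c_over` for overestimation, so that the optimal quantile level is `c_under / (c_under + c_over)`; the cost values and the skewed stand-in posterior draws are illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)

# Stand-in skewed posterior draws (assumption).
theta_draws = rng.gamma(shape=2.0, scale=1.5, size=20_000)

c_under, c_over = 3.0, 1.0     # hypothetical costs per unit of error

def expected_linlin_loss(a):
    """Monte Carlo estimate of the posterior expected lin-lin loss of action a."""
    under = np.maximum(theta_draws - a, 0.0)   # shortfall when a is below theta
    over = np.maximum(a - theta_draws, 0.0)    # excess when a is above theta
    return np.mean(c_under * under + c_over * over)

# Quantile level implied by the costs, and the matching posterior quantile.
level = c_under / (c_under + c_over)
a_quantile = np.quantile(theta_draws, level)

# Brute-force check over a grid of candidate actions.
grid = np.linspace(theta_draws.min(), theta_draws.max(), 2001)
a_grid = grid[np.argmin([expected_linlin_loss(a) for a in grid])]
print(level, a_quantile, a_grid)
```

The grid minimizer and the quantile should agree up to the grid spacing, and both sit above the posterior median because underestimation is the costlier error here.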
Summary Table
| Loss function | Bayes estimate |
|---|---|
| Quadratic loss $(a-\theta)^2$ | Posterior mean |
| Absolute loss $\lvert a-\theta \rvert$ | Posterior median |
| Zero-one loss, discrete case | Posterior mode |
| Asymmetric linear loss | Posterior quantile |
The phrase “best estimate” is incomplete until the loss function is named.
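To see that different losses really do pick different numbers, one can minimize the Monte Carlo expected loss directly. This is a minimal sketch: the right-skewed lognormal draws stand in for a posterior sample, and the helper `bayes_estimate` is a name introduced here.

```python
import numpy as np

rng = np.random.default_rng(6)

# Stand-in right-skewed posterior draws (assumption).
theta_draws = rng.lognormal(mean=0.0, sigma=0.75, size=20_000)

def bayes_estimate(loss, grid):
    """Return the grid value with the smallest Monte Carlo expected loss."""
    expected = [np.mean(loss(a, theta_draws)) for a in grid]
    return grid[int(np.argmin(expected))]

grid = np.linspace(0.0, 5.0, 1001)
est_quadratic = bayes_estimate(lambda a, t: (a - t) ** 2, grid)
est_absolute = bayes_estimate(lambda a, t: np.abs(a - t), grid)

print("quadratic:", est_quadratic, "posterior mean:", theta_draws.mean())
print("absolute: ", est_absolute, "posterior median:", np.median(theta_draws))
```

For a right-skewed posterior the quadratic-loss estimate lands above the absolute-loss estimate, matching the gap between the mean and the median.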
Chapter Summary
Posterior prediction averages future-data uncertainty over the posterior distribution of unknown parameters. This produces predictions that include both sampling variation and parameter uncertainty.
Decision theory adds actions and consequences. The Bayes action maximizes posterior expected utility or minimizes posterior expected loss. Point estimation is one special decision problem, so the posterior mean, median, mode, and quantiles are optimal under different losses.
Check Your Understanding
- In the normal known-variance model, why is the predictive variance larger than $\sigma^2$?
- What are the two steps in simulating from a posterior predictive distribution?
- Why do probabilities alone not determine the best action?
- Which loss function leads to the posterior mean? Which leads to the posterior median?
- Why is a point estimate better understood as a decision than as a purely mathematical summary?