Variable Selection and Model Averaging

Source: Lecture 11 — Bayesian Model Comparison. Variable selection (BDA Ch. 7).

From Two Models to a Model Space

The previous chapter compared a small number of models. Variable selection makes the same problem much larger.

In a regression with $p$ possible predictors, each predictor can be either included or excluded. That gives

$2^p$

possible subsets.

So the question changes from:

Which of two models is better?

to:

How do we compare, search, and average over many plausible models?

This chapter moves in four steps:

Use information criteria for quick predictive comparisons.
Compute or approximate marginal likelihoods when we want Bayes factors.
Use indicator variables to represent selected predictors.
Use model averaging so uncertainty about the selected model is not forgotten.

Information Criteria: Fast Predictive Comparisons

What Are They For?

Information criteria estimate out-of-sample predictive accuracy.

They are useful when we want a quick comparison but do not want to assign posterior probability to every model in a discrete list.

All information criteria follow the same rough pattern:

Measure how well the model fits.
Subtract a penalty for fitting too flexibly.
Use the result as an estimate of predictive performance.

Different criteria define “fit” and “penalty” differently.

AIC

Purpose

AIC is a classical large-sample approximation to predictive accuracy. It is useful when the model is regular, priors are flat or negligible, and $n$ is much larger than the number of parameters.

Ingredients

First compute the maximum likelihood estimate:

$\hat\theta_{\mathrm{ML}}.$

Then compute the fitted log likelihood:

$\ln p(y \mid \hat\theta_{\mathrm{ML}}).$

Let

$p$

be the number of fitted parameters. Here $p$ counts parameters; in the variable-selection sections it counts candidate predictors.

Formula

AIC is

$\mathrm{AIC} = -2\ln p(y \mid \hat\theta_{\mathrm{ML}}) +2p.$

The first term rewards fit. The second term penalizes model size. Lower AIC means better estimated predictive performance on the deviance scale.
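As a minimal sketch of the formula, assume an iid Normal model with unknown mean and variance. The data and the function name `normal_aic` are made up for illustration.

```python
import math

def normal_aic(y):
    """AIC for an iid Normal(mu, sigma^2) model fitted by maximum likelihood."""
    n = len(y)
    mu_hat = sum(y) / n
    var_hat = sum((v - mu_hat) ** 2 for v in y) / n  # the ML estimate divides by n
    # At the MLE the Normal log likelihood simplifies to this closed form.
    loglik = -0.5 * n * (math.log(2 * math.pi * var_hat) + 1.0)
    n_params = 2  # mu and sigma^2 (this is p in the AIC formula)
    return -2.0 * loglik + 2 * n_params

y = [1.2, 0.7, 1.9, 1.4, 0.3, 1.1]  # made-up data
print(normal_aic(y))
```

Only differences in AIC between models fitted to the same data are meaningful; the absolute value depends on the scale of $y$.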

DIC

Purpose

DIC is a Bayesian analogue of AIC. It uses posterior draws rather than only a maximum likelihood estimate.

It is most useful when the posterior is approximately normal and the model is not too irregular.

Ingredient 1: A Bayesian Point Estimate

Let

$\hat\theta_{\mathrm{Bayes}}$

be a posterior summary, often the posterior mean.

Compute the log likelihood at that estimate:

$\ln p(y \mid \hat\theta_{\mathrm{Bayes}}).$

Ingredient 2: Average Posterior Fit

Suppose posterior simulation gives draws

$\theta^{(1)},\ldots,\theta^{(S)}.$

The average posterior log likelihood is

$E_{\mathrm{post}}[\ln p(y \mid \theta)] \approx \frac{1}{S}\sum_{s=1}^{S}\ln p(y \mid \theta^{(s)}).$

Ingredient 3: Effective Number of Parameters

DIC estimates model flexibility by comparing fit at the posterior summary with average posterior fit:

$p_{\mathrm{DIC}} = 2\left[ \ln p(y \mid \hat\theta_{\mathrm{Bayes}}) - E_{\mathrm{post}}[\ln p(y \mid \theta)] \right].$

Formula

DIC is

$\mathrm{DIC} = -2\ln p(y \mid \hat\theta_{\mathrm{Bayes}}) +2p_{\mathrm{DIC}}.$

Again, lower is better on this scale.
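The three ingredients can be sketched in a few lines. The example assumes a Normal model with known unit variance and a flat prior on the mean, so exact posterior draws are available. The data and function names are illustrative and not part of the lecture.

```python
import math
import random

def loglik(theta, y):
    """Log likelihood of iid Normal(theta, 1) data."""
    return sum(-0.5 * math.log(2 * math.pi) - 0.5 * (v - theta) ** 2 for v in y)

def dic(y, draws):
    """Return (DIC, p_DIC) from posterior draws of a scalar parameter."""
    theta_bayes = sum(draws) / len(draws)                      # posterior mean
    fit_at_mean = loglik(theta_bayes, y)                       # ingredient 1
    avg_fit = sum(loglik(t, y) for t in draws) / len(draws)    # ingredient 2
    p_dic = 2.0 * (fit_at_mean - avg_fit)                      # ingredient 3
    return -2.0 * fit_at_mean + 2.0 * p_dic, p_dic

random.seed(1)
y = [0.4, 1.3, -0.2, 0.9, 1.7, 0.5, 1.1, 0.0]  # made-up data
n, ybar = len(y), sum(y) / len(y)
# Flat prior and known unit variance give theta | y ~ Normal(ybar, 1/n).
draws = [random.gauss(ybar, 1 / math.sqrt(n)) for _ in range(20000)]
dic_value, p_dic = dic(y, draws)
print(dic_value, p_dic)  # p_dic should be close to 1: one free parameter
```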

WAIC

Purpose

WAIC is more fully Bayesian than AIC and DIC because it uses the whole posterior distribution point by point.

It is useful when we can split the data into meaningful pieces:

$y_1,\ldots,y_n.$

This pointwise split is natural for iid data, but less obvious for time series, spatial data, network data, or hierarchical models.

Ingredient 1: Pointwise Predictive Density

For each observation $y_i$, average its likelihood over posterior draws:

$\frac{1}{S}\sum_{s=1}^{S}p(y_i \mid \theta^{(s)}).$

Then take the logarithm:

$\log\left(\frac{1}{S}\sum_{s=1}^{S}p(y_i \mid \theta^{(s)})\right).$

Ingredient 2: Add Over Observations

The log pointwise predictive density is

$\mathrm{lppd} = \sum_{i=1}^{n} \log\left(\frac{1}{S}\sum_{s=1}^{S}p(y_i \mid \theta^{(s)})\right).$

Ingredient 3: Estimate Flexibility

For each observation, compute the sample variance of its log likelihood across posterior draws, written $V_{s=1}^{S}$:

$V_{s=1}^{S}\left(\log p(y_i \mid \theta^{(s)})\right).$

Add these variances:

$p_{\mathrm{WAIC}} = \sum_{i=1}^{n} V_{s=1}^{S}\left(\log p(y_i \mid \theta^{(s)})\right).$

Formula

WAIC is

$\mathrm{WAIC} = -2\,\mathrm{lppd} +2p_{\mathrm{WAIC}}.$

The first term rewards posterior predictive fit. The second term corrects for effective flexibility.
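A possible implementation works from an $S \times n$ table of pointwise log likelihoods. The sketch below is one way to organize the computation, again assuming a Normal model with known unit variance and a flat prior. The helper names are hypothetical.

```python
import math
import random

def log_mean_exp(values):
    """Numerically stable log of the average of exp(values)."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values) / len(values))

def sample_variance(values):
    """Sample variance with the S - 1 denominator."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)

def waic(loglik):
    """loglik[s][i] = log p(y_i | theta^(s)); returns (WAIC, lppd, p_WAIC)."""
    n = len(loglik[0])
    columns = [[row[i] for row in loglik] for i in range(n)]  # one list per observation
    lppd = sum(log_mean_exp(col) for col in columns)
    p_waic = sum(sample_variance(col) for col in columns)
    return -2.0 * lppd + 2.0 * p_waic, lppd, p_waic

random.seed(2)
y = [0.4, 1.3, -0.2, 0.9, 1.7, 0.5, 1.1, 0.0]  # made-up data
n, ybar = len(y), sum(y) / len(y)
# Flat prior and known unit variance give theta | y ~ Normal(ybar, 1/n).
draws = [random.gauss(ybar, 1 / math.sqrt(n)) for _ in range(4000)]
table = [[-0.5 * math.log(2 * math.pi) - 0.5 * (v - t) ** 2 for v in y] for t in draws]
print(waic(table))
```

Averaging on the probability scale before taking the log is what `log_mean_exp` does; averaging the log likelihoods directly would give a different quantity.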

Bayesian LOO-CV

Purpose

Leave-one-out cross-validation asks a direct predictive question:

How well does the model predict each observation when that observation was not used to fit the model?

Ingredient 1: Leave One Observation Out

For observation $i$, fit or approximate the posterior using all data except $y_i$:

$p(\theta \mid y_{-i}).$

Ingredient 2: Predict the Held-Out Observation

The leave-one-out predictive density is

$p(y_i \mid y_{-i}) = \int p(y_i \mid \theta)\,p(\theta \mid y_{-i})\,d\theta.$

Ingredient 3: Add Log Predictive Densities

The leave-one-out log score is

$\mathrm{lppd}_{\mathrm{loo}} = \sum_{i=1}^{n}\log p(y_i \mid y_{-i}).$

Formula

On the deviance scale,

$\mathrm{LOO\text{-}CV} = -2\,\mathrm{lppd}_{\mathrm{loo}}.$

It is often written in a form parallel to WAIC:

$\mathrm{LOO\text{-}CV} = -2\,\mathrm{lppd} +2p_{\mathrm{loo}},$

where

$p_{\mathrm{loo}} = \mathrm{lppd} - \mathrm{lppd}_{\mathrm{loo}}.$

WAIC and LOO-CV are closely related asymptotically. Both require a meaningful data partition.
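In a conjugate model the leave-one-out predictive density is available in closed form, which makes the definitions easy to check. The sketch below assumes Bernoulli data with a Beta$(a,b)$ prior on the success probability, so both predictive probabilities are posterior means. The function name and data are illustrative.

```python
import math

def loo_beta_bernoulli(y, a=1.0, b=1.0):
    """Exact LOO-CV for 0/1 data; returns (LOO-CV, lppd, lppd_loo, p_loo)."""
    n, s = len(y), sum(y)
    lppd = 0.0
    lppd_loo = 0.0
    for yi in y:
        # Predictive probability of a success given all n observations.
        p_full = (a + s) / (a + b + n)
        # Predictive probability of a success given all observations except y_i.
        p_held_out = (a + s - yi) / (a + b + n - 1)
        lppd += math.log(p_full if yi == 1 else 1.0 - p_full)
        lppd_loo += math.log(p_held_out if yi == 1 else 1.0 - p_held_out)
    p_loo = lppd - lppd_loo
    return -2.0 * lppd_loo, lppd, lppd_loo, p_loo

y = [1, 0, 1, 1, 0, 1, 1, 1]  # made-up data
loo_cv, lppd, lppd_loo, p_loo = loo_beta_bernoulli(y)
print(loo_cv, -2.0 * lppd + 2.0 * p_loo)  # the two forms agree
```

For non-conjugate models each $p(\theta \mid y_{-i})$ has to be refitted or approximated, which is where most of the cost of LOO-CV comes from.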

Computing the Marginal Likelihood

Why Return to Marginal Likelihood?

Information criteria estimate predictive performance. Bayes factors require marginal likelihoods.

The marginal likelihood is

$p(y) = \int p(y \mid \theta)\,p(\theta)\,d\theta.$

The rest of this section is about ways to compute or approximate this integral.

Conjugate Shortcut

Purpose

In conjugate models, the posterior has the same family as the prior. This sometimes gives a shortcut for the marginal likelihood.

The Identity

Bayes’ theorem says

$p(\theta \mid y) = \frac{p(y \mid \theta)p(\theta)}{p(y)}.$

Rearrange it:

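$p(y) = \frac{p(y \mid \theta)p(\theta)}{p(\theta \mid y)}.$

The left-hand side does not depend on $\theta$, so the right-hand side takes the same value at every $\theta$ with positive posterior density. We may therefore evaluate it at any convenient value.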

Bernoulli Example

Suppose the data are a Bernoulli sequence with $s$ successes and $f$ failures, and the prior is

$\theta \sim \mathrm{Beta}(\alpha,\beta).$

The posterior is

$\theta \mid y \sim \mathrm{Beta}(\alpha+s,\beta+f).$

The marginal likelihood of the observed sequence, with $B(\cdot,\cdot)$ denoting the Beta function, is

$p(y) = \frac{B(\alpha+s,\beta+f)}{B(\alpha,\beta)}.$
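The Bernoulli result can be checked numerically against the rearranged Bayes identity. This is a small sketch with made-up counts and prior values; the function names are illustrative.

```python
import math

def log_beta(a, b):
    """Log of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def log_beta_pdf(theta, a, b):
    """Log density of Beta(a, b) at theta."""
    return (a - 1) * math.log(theta) + (b - 1) * math.log(1 - theta) - log_beta(a, b)

def log_marginal_closed_form(s, f, a, b):
    """log p(y) = log B(a + s, b + f) - log B(a, b)."""
    return log_beta(a + s, b + f) - log_beta(a, b)

def log_marginal_via_identity(theta, s, f, a, b):
    """log p(y) = log likelihood + log prior - log posterior, evaluated at theta."""
    loglik = s * math.log(theta) + f * math.log(1 - theta)
    return loglik + log_beta_pdf(theta, a, b) - log_beta_pdf(theta, a + s, b + f)

s, f, a, b = 7, 3, 2.0, 2.0  # made-up counts and prior
print(log_marginal_closed_form(s, f, a, b))
for theta in (0.2, 0.5, 0.9):
    # The same value appears for every theta.
    print(theta, log_marginal_via_identity(theta, s, f, a, b))
```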

Monte Carlo From the Prior

Purpose

The marginal likelihood is an expectation under the prior:

$p(y) = E_{\theta \sim p(\theta)}[p(y \mid \theta)].$

That suggests a simple simulation estimate.

Algorithm

Draw prior samples:

$\theta^{(1)},\ldots,\theta^{(N)} \sim p(\theta).$

Compute the likelihood for each draw:

$p(y \mid \theta^{(i)}).$

Average:

$\hat p(y) = \frac{1}{N}\sum_{i=1}^{N}p(y \mid \theta^{(i)}).$

Warning

This can be unstable when the posterior is much narrower than the prior. Most prior draws then land in regions with almost zero likelihood.
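The algorithm can be sketched for the Bernoulli model, where the exact answer is known and the estimate can be checked. The counts, the uniform prior, and the function name are assumptions made for illustration.

```python
import math
import random

def prior_mc_marginal(s, f, a, b, n_draws, rng):
    """Estimate p(y) by averaging the Bernoulli likelihood over Beta(a, b) prior draws."""
    total = 0.0
    for _ in range(n_draws):
        theta = rng.betavariate(a, b)             # draw from the prior
        total += theta ** s * (1 - theta) ** f    # likelihood of the observed sequence
    return total / n_draws

s, f, a, b = 7, 3, 1.0, 1.0  # made-up counts, uniform prior
estimate = prior_mc_marginal(s, f, a, b, 50000, random.Random(0))
exact = math.exp(math.lgamma(a + s) + math.lgamma(b + f) - math.lgamma(a + b + s + f)
                 - (math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)))
print(estimate, exact)
```

With ten observations the posterior is still wide and the estimate is usable. Increasing the counts while keeping the same prior makes more draws fall where the likelihood is negligible, and the estimate degrades.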

Importance Sampling

Purpose

Importance sampling improves on prior simulation by drawing from a proposal distribution that puts more mass near important parameter values.

Let

$g(\theta)$

be the proposal density.

Step 1: Rewrite the Integral

Start with

$p(y) = \int p(y \mid \theta)p(\theta)\,d\theta.$

Multiply and divide by $g(\theta)$:

$p(y) = \int \frac{p(y \mid \theta)p(\theta)}{g(\theta)} g(\theta)\,d\theta.$

Step 2: Estimate by Simulation

Draw

$\theta^{(1)},\ldots,\theta^{(N)} \sim g(\theta).$

Average the weights:

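$\hat p(y) = \frac{1}{N}\sum_{i=1}^{N} \frac{p(y \mid \theta^{(i)})p(\theta^{(i)})}{g(\theta^{(i)})}.$

The proposal should have tails at least as heavy as those of $p(y \mid \theta)p(\theta)$. If its tails are too thin, a few draws receive enormous weights and the estimate becomes unstable.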
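The two steps can be sketched for the same Bernoulli setting. The Beta(6, 3) proposal is an arbitrary illustrative choice that sits near the posterior but is somewhat wider. It is not taken from the lecture.

```python
import math
import random

def log_beta_pdf(theta, a, b):
    """Log density of Beta(a, b) at theta."""
    log_norm = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return (a - 1) * math.log(theta) + (b - 1) * math.log(1 - theta) - log_norm

def importance_sampling_marginal(s, f, a, b, ga, gb, n_draws, rng):
    """Estimate p(y) with a Beta(ga, gb) proposal g and a Beta(a, b) prior."""
    total = 0.0
    for _ in range(n_draws):
        theta = rng.betavariate(ga, gb)                   # draw from the proposal
        log_weight = (s * math.log(theta) + f * math.log(1 - theta)  # log likelihood
                      + log_beta_pdf(theta, a, b)                    # log prior
                      - log_beta_pdf(theta, ga, gb))                 # minus log proposal
        total += math.exp(log_weight)
    return total / n_draws

estimate = importance_sampling_marginal(7, 3, 1.0, 1.0, 6.0, 3.0, 20000, random.Random(0))
exact = math.exp(math.lgamma(8) + math.lgamma(4) - math.lgamma(12))
print(estimate, exact)
```

Here the weight is a bounded function of $\theta$, because the proposal is more spread out than the likelihood times prior. That is the tail condition in action.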

Modified Harmonic Mean

Purpose

The modified harmonic mean estimator is another way to use posterior simulation to estimate the marginal likelihood. The lecture version uses a Normal proposal centered on the posterior mean and truncated to a high-density region.

Components

Let

$\tilde\theta$

be the posterior mean, and let

$\tilde\Sigma$

be the posterior covariance matrix.

Define an indicator for an ellipsoid:

$I_c(\theta)=1 \quad \text{if} \quad (\theta-\tilde\theta)'\tilde\Sigma^{-1}(\theta-\tilde\theta)\le c,$ and $I_c(\theta)=0$ otherwise.

Use the proposal

$g(\theta)=N(\tilde\theta,\tilde\Sigma)\,I_c(\theta).$

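The point is to focus the proposal on the posterior region that contributes most to the integral. For $g$ to be a proper density, the truncated Normal has to be rescaled by the probability that the untruncated Normal assigns to the ellipsoid.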
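The text does not spell out the estimator itself. A common form, assumed here, averages $g(\theta^{(s)})/[p(y \mid \theta^{(s)})p(\theta^{(s)})]$ over posterior draws and inverts the result. The sketch below applies it to a one-parameter Bernoulli model with a uniform prior, where exact posterior draws and the exact answer are available. The function names and the choice $c=1$ are illustrative.

```python
import math
import random

def mhm_log_marginal(draws, log_lik_times_prior, c=1.0):
    """Modified harmonic mean estimate of log p(y) for a scalar parameter.

    draws: posterior draws; log_lik_times_prior(theta) = log p(y|theta) + log p(theta).
    """
    S = len(draws)
    mean = sum(draws) / S
    var = sum((t - mean) ** 2 for t in draws) / (S - 1)
    # Probability that a Normal draw falls inside the region (chi-square with 1 dof).
    tau = math.erf(math.sqrt(c / 2.0))
    total = 0.0
    for t in draws:
        z2 = (t - mean) ** 2 / var
        if z2 <= c:  # indicator I_c(theta)
            log_g = -0.5 * math.log(2 * math.pi * var) - 0.5 * z2 - math.log(tau)
            total += math.exp(log_g - log_lik_times_prior(t))
    # The posterior average of g / (likelihood * prior) estimates 1 / p(y).
    return -math.log(total / S)

s, f = 7, 3  # made-up counts, uniform prior so log p(theta) = 0
rng = random.Random(0)
posterior_draws = [rng.betavariate(1 + s, 1 + f) for _ in range(20000)]
log_target = lambda t: s * math.log(t) + f * math.log(1 - t)
print(math.exp(mhm_log_marginal(posterior_draws, log_target)), 1.0 / 1320.0)
```

In several dimensions `tau` would be the chi-square probability with as many degrees of freedom as there are parameters, and the ratio would normally be accumulated on the log scale to avoid overflow.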

Laplace and BIC Approximations

Why Bring This In?

The marginal likelihood is easy to define but often hard to compute. We need an approximation when:

The model is not conjugate.
Direct numerical integration is too expensive.
We want a quick approximation for Bayes factors or posterior model probabilities.

The Laplace approximation says: if most posterior mass is near one high-probability point, approximate the log density near that point by a quadratic curve. A quadratic log density corresponds to a Normal density, which is easy to integrate.

Step 1: Start With the Log Integrand

The integral combines likelihood and prior. On the log scale, define

$h(\theta) = \ln p(y \mid \theta) + \ln p(\theta).$

Then

$p(y)=\int \exp(h(\theta))\,d\theta.$

Step 2: Find the Best Local Point

Let

$\hat\theta = \arg\max_\theta h(\theta).$

This is the posterior mode, or the mode of the unnormalized posterior.

Step 3: Measure Local Curvature

Let $J_{\hat\theta,y}$ measure the curvature of the negative log integrand at the mode:

$J_{\hat\theta,y}=-h''(\hat\theta).$

In several dimensions, this is the negative Hessian matrix. Large curvature means the posterior mass is concentrated. Small curvature means it is spread out.

Step 4: Approximate by a Quadratic

Near $\hat\theta$,

$h(\theta) \approx h(\hat\theta) - \frac{1}{2}J_{\hat\theta,y}(\theta-\hat\theta)^2.$

This turns the integrand into a Normal-shaped curve.

Step 5: Assemble the Approximation

The approximation has three pieces:

fit at the mode: $\ln p(y \mid \hat\theta)$;
prior density at the mode: $\ln p(\hat\theta)$;
posterior volume: $\frac{1}{2}\ln|J_{\hat\theta,y}^{-1}|+\frac{p}{2}\ln(2\pi)$.

Putting the pieces together:

$\ln\hat p(y) = \ln p(y \mid \hat\theta) +\ln p(\hat\theta) +\frac{1}{2}\ln|J_{\hat\theta,y}^{-1}| +\frac{p}{2}\ln(2\pi).$

Here $p$ is the number of unrestricted parameters.
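To see how the pieces fit together, the approximation can be worked through for a one-parameter Bernoulli model with a Beta prior, where the mode and curvature have closed forms and the exact marginal likelihood is known. The counts and prior below are made up.

```python
import math

def log_beta(a, b):
    """Log of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def laplace_log_marginal(s, f, a, b):
    """Laplace approximation to log p(y) for Bernoulli data with a Beta(a, b) prior."""
    A, B = s + a - 1, f + b - 1     # exponents of theta and (1 - theta) in exp(h)
    theta_hat = A / (A + B)         # posterior mode; requires A > 0 and B > 0
    # h(theta_hat) = log likelihood + log prior density at the mode.
    h_hat = A * math.log(theta_hat) + B * math.log(1 - theta_hat) - log_beta(a, b)
    # Curvature J = -h''(theta_hat).
    J = A / theta_hat ** 2 + B / (1 - theta_hat) ** 2
    n_params = 1
    return h_hat + 0.5 * math.log(1.0 / J) + 0.5 * n_params * math.log(2 * math.pi)

s, f, a, b = 70, 30, 2.0, 2.0
exact = log_beta(a + s, b + f) - log_beta(a, b)
print(laplace_log_marginal(s, f, a, b), exact)
```

With 100 observations the two values differ only in the second or third decimal place on the log scale; with very few observations the quadratic approximation is cruder.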

Step 6: BIC as a Large-Sample Shortcut

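For large $n$, the observed information grows roughly linearly in the sample size:

$J_{\hat\theta,y}\approx nJ_{\hat\theta},$

where $J_{\hat\theta}$ is the average information per observation. Then $|J_{\hat\theta,y}^{-1}|\approx n^{-p}|J_{\hat\theta}^{-1}|$, so the volume term $\frac{1}{2}\ln|J_{\hat\theta,y}^{-1}|$ contributes the main complexity penalty: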

$-\frac{p}{2}\ln n.$

After dropping smaller terms, the log marginal likelihood is approximated by

$\ln\hat p(y) \approx \ln p(y \mid \hat\theta) - \frac{p}{2}\ln n.$

Multiplying by $-2$ gives the familiar form $\mathrm{BIC} = -2\ln p(y \mid \hat\theta) + p\ln n$. BIC is therefore a rough marginal-likelihood approximation. It is connected to Bayes factors, while AIC, DIC, WAIC, and LOO-CV are motivated by predictive accuracy.
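A quick numerical check shows how rough the shortcut is. The sketch below assumes Bernoulli data with a uniform prior, so the exact log marginal likelihood is available for comparison. The counts are made up.

```python
import math

def bernoulli_bic(s, f):
    """BIC = -2 * log likelihood at the MLE + (number of parameters) * log n."""
    n = s + f
    theta_ml = s / n
    loglik = s * math.log(theta_ml) + f * math.log(1 - theta_ml)
    n_params = 1
    return -2.0 * loglik + n_params * math.log(n)

def exact_log_marginal_uniform_prior(s, f):
    """log B(s + 1, f + 1): exact log p(y) under a Beta(1, 1) prior."""
    return math.lgamma(s + 1) + math.lgamma(f + 1) - math.lgamma(s + f + 2)

s, f = 70, 30
print(-0.5 * bernoulli_bic(s, f), exact_log_marginal_uniform_prior(s, f))
```

The two numbers differ by a term that does not shrink as $n$ grows, which is the price of dropping the prior and curvature constants.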

Bayesian Variable Selection

What Problem Are We Solving?

In linear regression,

$y=\beta_0+\beta_1x_1+\cdots+\beta_px_p+\varepsilon,$

we may not know which predictors belong in the model.

The question is:

Which coefficients should be treated as nonzero?

Indicator Variables

Introduce one indicator per predictor:

$I_j = \begin{cases} 1, & \beta_j \ne 0,\\ 0, & \beta_j = 0. \end{cases}$

Collect them into

$I=(I_1,\ldots,I_p).$

For example,

$I=(1,1,0)$

means that $x_1$ and $x_2$ are included, while $x_3$ is excluded.

Prior on Inclusion

Let $\theta$ be the prior probability that a predictor is included.

Then a simple prior is

$I_1,\ldots,I_p \mid \theta \overset{\mathrm{iid}}{\sim} \mathrm{Bernoulli}(\theta).$

This prior controls how large we expect the model to be before seeing the data.
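For a small $p$ the whole model space and its prior can be listed explicitly. The values $p=3$ and $\theta=0.25$ below are arbitrary choices for illustration.

```python
from itertools import product

def prior_prob(indicators, theta):
    """p(I) under iid Bernoulli(theta) inclusion indicators."""
    k = sum(indicators)  # number of included predictors
    return theta ** k * (1 - theta) ** (len(indicators) - k)

p, theta = 3, 0.25                           # illustrative values
models = list(product([0, 1], repeat=p))     # all 2^p indicator vectors
for I in models:
    print(I, prior_prob(I, theta))

# Prior expected model size; equals p * theta under this prior.
expected_size = sum(sum(I) * prior_prob(I, theta) for I in models)
print(expected_size)
```

A small $\theta$ puts most prior mass on sparse models, while $\theta=0.5$ makes all $2^p$ subsets equally likely a priori.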

Posterior Over Models

For a selected model $I$, the posterior probability is proportional to two pieces:

$p(I \mid y,X) \propto p(y \mid X,I)\,p(I).$

The first piece,

$p(y \mid X,I),$

is the marginal likelihood of the submodel.

The second piece,

$p(I),$

is the prior probability of that subset.

Marginal Likelihood for a Selected Submodel

Purpose

To compare two predictor subsets, we need a score for each subset. That score is the marginal likelihood after integrating out regression coefficients and variance.

For a selected subset $I$, let:

$X_I$ be the design matrix using only the included predictors;
$\beta_I$ be the included coefficients.

Prior

Use

$\beta_I \mid \sigma^2 \sim N(0,\sigma^2\Omega_{I,0}^{-1}),$

and

$\sigma^2 \sim \mathrm{Inv\text{-}}\chi^2(\nu_0,\sigma_0^2).$

Component 1: Posterior Precision

Define

$A_I = X_I'X_I+\Omega_{I,0}.$

This combines information from the data and the prior.

Component 2: Residual Fit

Define

$\mathrm{RSS}_I = y'y - y'X_I(X_I'X_I+\Omega_{I,0})^{-1}X_I'y.$

This measures how well the selected submodel fits after accounting for the prior structure.

Component 3: Three Factors

The marginal likelihood is proportional to three factors:

posterior volume: $|A_I|^{-1/2}$;
prior volume: $|\Omega_{I,0}|^{1/2}$;
fit after integrating out variance: $(\nu_0\sigma_0^2+\mathrm{RSS}_I)^{-(\nu_0+n-1)/2}$.

Full Formula

Putting the pieces together:

$p(y \mid X,I) \propto |A_I|^{-1/2}\,|\Omega_{I,0}|^{1/2}\,(\nu_0\sigma_0^2+\mathrm{RSS}_I)^{-(\nu_0+n-1)/2}.$

This gives a score for one subset $I$.
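One way to compute this score on the log scale is sketched below with NumPy. It follows the formula as stated, with $A_I$ formed from the same matrix $X_I'X_I+\Omega_{I,0}$ that appears inside $\mathrm{RSS}_I$. The prior precision $\Omega_{I,0}=\omega I$, the hyperparameter defaults, and the simulated data are assumptions made for illustration, and no intercept is included.

```python
import numpy as np

def log_marginal_submodel(y, X, include, omega=1.0, nu0=1.0, sigma0_sq=1.0):
    """Unnormalized log p(y | X, I) for the predictors flagged in boolean `include`."""
    n = y.shape[0]
    k = int(np.sum(include))
    exponent = -0.5 * (nu0 + n - 1)             # exponent as written in the text
    if k == 0:                                  # empty model: RSS_I = y'y, no volume terms
        return exponent * np.log(nu0 * sigma0_sq + y @ y)
    X_I = X[:, include]
    Omega = omega * np.eye(k)                   # assumed prior precision Omega_{I,0}
    A = X_I.T @ X_I + Omega                     # posterior precision up to sigma^2
    Xty = X_I.T @ y
    rss = y @ y - Xty @ np.linalg.solve(A, Xty)
    logdet_A = np.linalg.slogdet(A)[1]
    logdet_Omega = np.linalg.slogdet(Omega)[1]
    return (-0.5 * logdet_A + 0.5 * logdet_Omega
            + exponent * np.log(nu0 * sigma0_sq + rss))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 2.0 * X[:, 0] + rng.normal(size=100)        # only the first predictor matters
for include in ([True, False, False], [False, False, True], [True, True, True]):
    print(include, log_marginal_submodel(y, X, np.array(include)))
```

Working with log determinants and a linear solve avoids forming explicit inverses and keeps the score finite when $n$ is large.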

Gibbs Sampling for Variable Selection

Why Do We Need Sampling?

There are $2^p$ possible subsets. For even moderately large $p$, enumerating all models is expensive: $p=30$ already gives more than a billion subsets.

But most subsets have tiny posterior probability. A sampler can spend most of its time in the important part of the model space.

One Indicator at a Time

The Gibbs sampler updates one indicator while holding the others fixed.

Let

$I_{-j}$

mean all indicators except $I_j$.

For predictor $j$, compare two nearby models:

the model with $I_j=0$;
the model with $I_j=1$.

Full Conditional

The probability of including predictor $j$ is proportional to:

$\Pr(I_j=1 \mid I_{-j},y,X) \propto p(y \mid X,I_j=1,I_{-j})\,\Pr(I_j=1).$

The probability of excluding it is proportional to:

$\Pr(I_j=0 \mid I_{-j},y,X) \propto p(y \mid X,I_j=0,I_{-j})\,\Pr(I_j=0).$

Normalize these two numbers so they sum to 1, then draw $I_j$.

One full sweep updates $I_1$, then $I_2$, and so on. Across many sweeps, the sampler visits models in proportion to their posterior probability.
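The sweep can be sketched as follows. This is a minimal illustration, not the lecture's implementation: it assumes an identity prior precision (so the prior-volume term vanishes), fixed hyperparameters, a constant prior inclusion probability, no intercept, and simulated data.

```python
import numpy as np

def log_marginal(y, X, include, nu0=1.0, sigma0_sq=1.0):
    """Unnormalized log p(y | X, I), assuming Omega_{I,0} is the identity."""
    n = y.shape[0]
    rss = y @ y
    logdet_A = 0.0
    k = int(include.sum())
    if k > 0:
        X_I = X[:, include]
        A = X_I.T @ X_I + np.eye(k)
        Xty = X_I.T @ y
        rss = rss - Xty @ np.linalg.solve(A, Xty)
        logdet_A = np.linalg.slogdet(A)[1]
    return -0.5 * logdet_A - 0.5 * (nu0 + n - 1) * np.log(nu0 * sigma0_sq + rss)

def gibbs_variable_selection(y, X, n_sweeps=500, prior_incl=0.5, seed=0):
    """Return an (n_sweeps, p) boolean array of sampled indicator vectors."""
    rng = np.random.default_rng(seed)
    p = X.shape[1]
    include = np.zeros(p, dtype=bool)            # start from the empty model
    samples = np.empty((n_sweeps, p), dtype=bool)
    for sweep in range(n_sweeps):
        for j in range(p):
            include[j] = True
            log_w1 = log_marginal(y, X, include) + np.log(prior_incl)
            include[j] = False
            log_w0 = log_marginal(y, X, include) + np.log1p(-prior_incl)
            # Normalize on the log scale: Pr(I_j = 1 | rest) = w1 / (w0 + w1).
            prob1 = np.exp(-np.logaddexp(0.0, log_w0 - log_w1))
            include[j] = rng.random() < prob1
        samples[sweep] = include
    return samples

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))
y = 2.0 * X[:, 0] - 1.5 * X[:, 2] + rng.normal(size=200)  # predictors 1 and 3 matter
samples = gibbs_variable_selection(y, X, n_sweeps=300, seed=1)
print(samples[50:].mean(axis=0))  # posterior inclusion frequencies after burn-in
```

The column means of the retained samples estimate the posterior inclusion probability of each predictor, and the frequency of each distinct row estimates the posterior probability of that subset.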

General Variable Selection via Metropolis-Hastings

When Is Gibbs Not Enough?

The Gibbs strategy above is convenient when we can compute

$p(y \mid X,I)$

for each proposed subset.

In more complex models, that marginal likelihood may not be available analytically.

A Broader Strategy

Use a Metropolis-Hastings proposal that changes both:

the model indicators $I$ ;
the nonzero coefficients $\beta$ .

A simple approach is to approximate the posterior from the full model:

$\beta \mid y,X \approx N(\hat\beta,J_y^{-1}(\hat\beta)).$

Then propose coefficients conditional on the zero restrictions implied by the proposed indicator vector.

The purpose is not to enumerate every model, but to move through model space in a way that respects both parameter uncertainty and model uncertainty.

Model Averaging

Why Not Just Pick One Model?

If the data clearly identify one model, selecting it may be harmless. Often they do not.

Several predictor subsets may have similar posterior probability. If we choose one and ignore the rest, uncertainty is understated.

Model averaging keeps this uncertainty.

A Shared Quantity

Suppose $\gamma$ has the same interpretation under two models. For example, $\gamma$ could be a future prediction.

Under model $M_1$, its posterior is

$p_1(\gamma \mid y).$

Under model $M_2$, its posterior is

$p_2(\gamma \mid y).$

Weighted Average

Weight each model-specific posterior by the posterior probability of the model:

$p(\gamma \mid y) = p(M_1 \mid y)p_1(\gamma \mid y) + p(M_2 \mid y)p_2(\gamma \mid y).$

For $K$ models, the same idea extends to $p(\gamma \mid y)=\sum_{k=1}^{K}p(M_k \mid y)\,p_k(\gamma \mid y)$.
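As a rough sketch, the weights can be computed from log marginal likelihoods and prior model probabilities, and the averaged mean and variance of $\gamma$ then follow from the mixture. All numbers and function names below are made up for illustration.

```python
import math

def posterior_model_probs(log_marglik, prior_probs):
    """p(M_k | y) proportional to p(y | M_k) p(M_k), normalized on the log scale."""
    log_post = [lm + math.log(pr) for lm, pr in zip(log_marglik, prior_probs)]
    m = max(log_post)                            # subtract the max for stability
    weights = [math.exp(v - m) for v in log_post]
    total = sum(weights)
    return [w / total for w in weights]

def averaged_mean_and_variance(weights, means, variances):
    """Mean and variance of the model-averaged posterior (a mixture)."""
    mean = sum(w * m for w, m in zip(weights, means))
    second_moment = sum(w * (v + m * m) for w, m, v in zip(weights, means, variances))
    return mean, second_moment - mean * mean

weights = posterior_model_probs([-10.0, -10.0], [0.5, 0.5])
# Each model gives gamma a posterior with variance 1 but a different mean.
mean, var = averaged_mean_and_variance(weights, means=[0.0, 2.0], variances=[1.0, 1.0])
print(weights, mean, var)
```

The averaged variance exceeds the within-model variance whenever the models disagree about the mean of $\gamma$; that extra spread is the model uncertainty.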

Three Sources of Predictive Uncertainty

The averaged predictive distribution includes:

Future noise: the randomness in future observations.
Parameter uncertainty: uncertainty about parameters within each model.
Model uncertainty: uncertainty about which model or subset is appropriate.

This is often a better final answer than reporting only the single most probable subset.

Selection, Expansion, and Humility

Model comparison does not have to end with choosing one model from a fixed menu.

If model checks reveal a systematic failure, the better response may be model expansion:

Build a model that includes the missing structure.

In practice:

Use posterior predictive checks to understand what each model misses.
Use WAIC, LOO-CV, or log predictive scores (LPS) when predictive performance is the target.
Use Bayes factors only when the model list and priors are scientifically meaningful.
Use model averaging when model uncertainty should propagate into predictions or other shared quantities.

Chapter Summary

Variable selection turns model comparison into a large model-space problem. Information criteria provide fast estimates of predictive performance. Marginal likelihood methods support Bayes factors when posterior model probabilities are the goal. Indicator variables turn predictor inclusion into a Bayesian inference problem. Gibbs and Metropolis-Hastings samplers explore the model space without enumerating every subset. Model averaging carries model uncertainty into the final inference.

Check Your Understanding

Why does variable selection create a $2^p$ model comparison problem?
What is the difference between predictive information criteria and Bayes-factor-based comparison?
Why can prior Monte Carlo estimates of the marginal likelihood be unstable?
What are the main pieces of the Laplace approximation?
How does Gibbs sampling update one predictor indicator at a time?
What are the three sources of uncertainty captured by Bayesian model averaging?