Skip to content

Variable Selection and Model Averaging

Source: Lecture 11 — Bayesian Model Comparison. Variable selection (BDA Ch. 7).

The previous chapter compared a small number of models. Variable selection makes the same problem much larger.

In a regression with pp possible predictors, each predictor can be either included or excluded. That gives

2p2^p

possible subsets.

So the question changes from:

Which of two models is better?

to:

How do we compare, search, and average over many plausible models?

This chapter moves in four steps:

  1. Use information criteria for quick predictive comparisons.
  2. Compute or approximate marginal likelihoods when we want Bayes factors.
  3. Use indicator variables to represent selected predictors.
  4. Use model averaging so uncertainty about the selected model is not forgotten.

Information Criteria: Fast Predictive Comparisons

Section titled “Information Criteria: Fast Predictive Comparisons”

Information criteria estimate out-of-sample predictive accuracy.

They are useful when we want a quick comparison but do not want to assign posterior probability to every model in a discrete list.

All information criteria follow the same rough pattern:

  1. Measure how well the model fits.
  2. Subtract a penalty for fitting too flexibly.
  3. Use the result as an estimate of predictive performance.

Different criteria define “fit” and “penalty” differently.

AIC is a classical large-sample approximation to predictive accuracy. It is useful when the model is regular, priors are flat or negligible, and nn is much larger than the number of parameters.

First compute the maximum likelihood estimate:

θ^ML.\hat\theta_{\mathrm{ML}}.

Then compute the fitted log likelihood:

lnp(yθ^ML).\ln p(y \mid \hat\theta_{\mathrm{ML}}).

Let

pp

be the number of fitted parameters.

AIC is

AIC=2lnp(yθ^ML)+2p.\mathrm{AIC} = -2\ln p(y \mid \hat\theta_{\mathrm{ML}}) +2p.

The first term rewards fit. The second term penalizes model size. Lower AIC means better estimated predictive performance on the deviance scale.

DIC is a Bayesian analogue of AIC. It uses posterior draws rather than only a maximum likelihood estimate.

It is most useful when the posterior is approximately normal and the model is not too irregular.

Let

θ^Bayes\hat\theta_{\mathrm{Bayes}}

be a posterior summary, often the posterior mean.

Compute the log likelihood at that estimate:

lnp(yθ^Bayes).\ln p(y \mid \hat\theta_{\mathrm{Bayes}}).

Suppose posterior simulation gives draws

θ(1),,θ(S).\theta^{(1)},\ldots,\theta^{(S)}.

The average posterior log likelihood is

Epost[lnp(yθ)]1Ss=1Slnp(yθ(s)).E_{\mathrm{post}}[\ln p(y \mid \theta)] \approx \frac{1}{S}\sum_{s=1}^{S}\ln p(y \mid \theta^{(s)}).

Ingredient 3: Effective Number of Parameters

Section titled “Ingredient 3: Effective Number of Parameters”

DIC estimates model flexibility by comparing fit at the posterior summary with average posterior fit:

pDIC=2[lnp(yθ^Bayes)Epost[lnp(yθ)]].p_{\mathrm{DIC}} = 2\left[ \ln p(y \mid \hat\theta_{\mathrm{Bayes}}) - E_{\mathrm{post}}[\ln p(y \mid \theta)] \right].

DIC is

DIC=2lnp(yθ^Bayes)+2pDIC.\mathrm{DIC} = -2\ln p(y \mid \hat\theta_{\mathrm{Bayes}}) +2p_{\mathrm{DIC}}.

Again, lower is better on this scale.

WAIC is more fully Bayesian than AIC and DIC because it uses the whole posterior distribution point by point.

It is useful when we can split the data into meaningful pieces:

y1,,yn.y_1,\ldots,y_n.

This pointwise split is natural for iid data, but less obvious for time series, spatial data, network data, or hierarchical models.

Ingredient 1: Pointwise Predictive Density

Section titled “Ingredient 1: Pointwise Predictive Density”

For each observation yiy_i, average its likelihood over posterior draws:

1Ss=1Sp(yiθ(s)).\frac{1}{S}\sum_{s=1}^{S}p(y_i \mid \theta^{(s)}).

Then take the logarithm:

log(1Ss=1Sp(yiθ(s))).\log\left(\frac{1}{S}\sum_{s=1}^{S}p(y_i \mid \theta^{(s)})\right).

The log pointwise predictive density is

lppd=i=1nlog(1Ss=1Sp(yiθ(s))).\mathrm{lppd} = \sum_{i=1}^{n} \log\left(\frac{1}{S}\sum_{s=1}^{S}p(y_i \mid \theta^{(s)})\right).

For each observation, look at how much the log likelihood varies across posterior draws:

Vs=1S(logp(yiθ(s))).V_{s=1}^{S}\left(\log p(y_i \mid \theta^{(s)})\right).

Add these variances:

pWAIC=i=1nVs=1S(logp(yiθ(s))).p_{\mathrm{WAIC}} = \sum_{i=1}^{n} V_{s=1}^{S}\left(\log p(y_i \mid \theta^{(s)})\right).

WAIC is

WAIC=2lppd+2pWAIC.\mathrm{WAIC} = -2\,\mathrm{lppd} +2p_{\mathrm{WAIC}}.

The first term rewards posterior predictive fit. The second term corrects for effective flexibility.

Leave-one-out cross-validation asks a direct predictive question:

How well does the model predict each observation when that observation was not used to fit the model?

For observation ii, fit or approximate the posterior using all data except yiy_i:

p(θyi).p(\theta \mid y_{-i}).

Ingredient 2: Predict the Held-Out Observation

Section titled “Ingredient 2: Predict the Held-Out Observation”

The leave-one-out predictive density is

p(yiyi)=p(yiθ)p(θyi)dθ.p(y_i \mid y_{-i}) = \int p(y_i \mid \theta)\,p(\theta \mid y_{-i})\,d\theta.

Ingredient 3: Add Log Predictive Densities

Section titled “Ingredient 3: Add Log Predictive Densities”

The leave-one-out log score is

lppdloo=i=1nlogp(yiyi).\mathrm{lppd}_{\mathrm{loo}} = \sum_{i=1}^{n}\log p(y_i \mid y_{-i}).

On the deviance scale,

LOO-CV=2lppdloo.\mathrm{LOO\text{-}CV} = -2\,\mathrm{lppd}_{\mathrm{loo}}.

It is often written in a form parallel to WAIC:

LOO-CV=2lppd+2ploo,\mathrm{LOO\text{-}CV} = -2\,\mathrm{lppd} +2p_{\mathrm{loo}},

where

ploo=lppdlppdloo.p_{\mathrm{loo}} = \mathrm{lppd} - \mathrm{lppd}_{\mathrm{loo}}.

WAIC and LOO-CV are closely related asymptotically. Both require a meaningful data partition.

Information criteria estimate predictive performance. Bayes factors require marginal likelihoods.

The marginal likelihood is

p(y)=p(yθ)p(θ)dθ.p(y) = \int p(y \mid \theta)\,p(\theta)\,d\theta.

The rest of this section is about ways to compute or approximate this integral.

In conjugate models, the posterior has the same family as the prior. This sometimes gives a shortcut for the marginal likelihood.

Bayes’ theorem says

p(θy)=p(yθ)p(θ)p(y).p(\theta \mid y) = \frac{p(y \mid \theta)p(\theta)}{p(y)}.

Rearrange it:

p(y)=p(yθ)p(θ)p(θy).p(y) = \frac{p(y \mid \theta)p(\theta)}{p(\theta \mid y)}.

This expression is constant in θ\theta, so we may evaluate it at a convenient value.

For Bernoulli data with ss successes and ff failures,

θBeta(α,β).\theta \sim \mathrm{Beta}(\alpha,\beta).

The posterior is

θyBeta(α+s,β+f).\theta \mid y \sim \mathrm{Beta}(\alpha+s,\beta+f).

The marginal likelihood is

p(y)=B(α+s,β+f)B(α,β).p(y) = \frac{B(\alpha+s,\beta+f)}{B(\alpha,\beta)}.

The marginal likelihood is an expectation under the prior:

p(y)=Eθp(θ)[p(yθ)].p(y) = E_{\theta \sim p(\theta)}[p(y \mid \theta)].

That suggests a simple simulation estimate.

Draw prior samples:

θ(1),,θ(N)p(θ).\theta^{(1)},\ldots,\theta^{(N)} \sim p(\theta).

Compute the likelihood for each draw:

p(yθ(i)).p(y \mid \theta^{(i)}).

Average:

p^(y)=1Ni=1Np(yθ(i)).\hat p(y) = \frac{1}{N}\sum_{i=1}^{N}p(y \mid \theta^{(i)}).

This can be unstable when the posterior is much narrower than the prior. Most prior draws then land in regions with almost zero likelihood.

Importance sampling improves on prior simulation by drawing from a proposal distribution that puts more mass near important parameter values.

Let

g(θ)g(\theta)

be the proposal density.

Start with

p(y)=p(yθ)p(θ)dθ.p(y) = \int p(y \mid \theta)p(\theta)\,d\theta.

Multiply and divide by g(θ)g(\theta):

p(y)=p(yθ)p(θ)g(θ)g(θ)dθ.p(y) = \int \frac{p(y \mid \theta)p(\theta)}{g(\theta)} g(\theta)\,d\theta.

Draw

θ(1),,θ(N)g(θ).\theta^{(1)},\ldots,\theta^{(N)} \sim g(\theta).

Average the weights:

p^(y)=1Ni=1Np(yθ(i))p(θ(i))g(θ(i)).\hat p(y) = \frac{1}{N}\sum_{i=1}^{N} \frac{p(y \mid \theta^{(i)})p(\theta^{(i)})} {g(\theta^{(i)})}.

The proposal must have good tail behavior. A bad proposal can give unstable estimates.

The modified harmonic mean estimator is another way to use posterior simulation to estimate marginal likelihood. The lecture version uses a proposal centered on posterior draws and truncated to a high-density region.

Let

θ~\tilde\theta

be the posterior mean, and let

Σ~\tilde\Sigma

be the posterior covariance matrix.

Define an indicator for an ellipsoid:

Ic(θ)=1if(θθ~)Σ~1(θθ~)c.I_c(\theta)=1 \quad \text{if} \quad (\theta-\tilde\theta)'\tilde\Sigma^{-1}(\theta-\tilde\theta)\le c.

Use the proposal

g(θ)=N(θ~,Σ~)Ic(θ).g(\theta)=N(\tilde\theta,\tilde\Sigma)\,I_c(\theta).

The point is to focus the proposal on the posterior region that contributes most to the integral.

The marginal likelihood is easy to define but often hard to compute. We need an approximation when:

  • The model is not conjugate.
  • Direct numerical integration is too expensive.
  • We want a quick approximation for Bayes factors or posterior model probabilities.

The Laplace approximation says: if most posterior mass is near one high-probability point, approximate the log density near that point by a quadratic curve. A quadratic log density corresponds to a Normal density, which is easy to integrate.

The integral combines likelihood and prior. On the log scale, define

h(θ)=lnp(yθ)+lnp(θ).h(\theta) = \ln p(y \mid \theta) + \ln p(\theta).

Then

p(y)=exp(h(θ))dθ.p(y)=\int \exp(h(\theta))\,d\theta.

Let

θ^=argmaxθh(θ).\hat\theta = \arg\max_\theta h(\theta).

This is the posterior mode, or the mode of the unnormalized posterior.

Let Jθ^,yJ_{\hat\theta,y} measure the curvature of the negative log integrand at the mode:

Jθ^,y=h(θ^).J_{\hat\theta,y}=-h''(\hat\theta).

In several dimensions, this is the negative Hessian matrix. Large curvature means the posterior mass is concentrated. Small curvature means it is spread out.

Near θ^\hat\theta,

h(θ)h(θ^)12Jθ^,y(θθ^)2.h(\theta) \approx h(\hat\theta) - \frac{1}{2}J_{\hat\theta,y}(\theta-\hat\theta)^2.

This turns the integrand into a Normal-shaped curve.

The approximation has three pieces:

  1. fit at the mode: lnp(yθ^)\ln p(y \mid \hat\theta);

  2. prior density at the mode: lnp(θ^)\ln p(\hat\theta);

  3. posterior volume:

    12lnJθ^,y1+p2ln(2π).\frac{1}{2}\ln|J_{\hat\theta,y}^{-1}|+\frac{p}{2}\ln(2\pi).

Putting the pieces together:

lnp^(y)=lnp(yθ^)+lnp(θ^)+12lnJθ^,y1+p2ln(2π).\ln\hat p(y) = \ln p(y \mid \hat\theta) +\ln p(\hat\theta) +\frac{1}{2}\ln|J_{\hat\theta,y}^{-1}| +\frac{p}{2}\ln(2\pi).

Here pp is the number of unrestricted parameters.

For large nn, the observed information often scales like

Jθ^,ynJθ^.J_{\hat\theta,y}\approx nJ_{\hat\theta}.

This creates the main complexity penalty:

p2lnn.-\frac{p}{2}\ln n.

After dropping smaller terms, the log marginal likelihood is approximated by

lnp^(y)lnp(yθ^)p2lnn.\ln\hat p(y) \approx \ln p(y \mid \hat\theta) - \frac{p}{2}\ln n.

BIC is therefore a rough marginal-likelihood approximation. It is connected to Bayes factors, while AIC, DIC, WAIC, and LOO-CV are motivated by predictive accuracy.

In linear regression,

y=β0+β1x1++βpxp+ε,y=\beta_0+\beta_1x_1+\cdots+\beta_px_p+\varepsilon,

we may not know which predictors belong in the model.

The question is:

Which coefficients should be treated as nonzero?

Introduce one indicator per predictor:

Ij={1,βj0,0,βj=0.I_j = \begin{cases} 1, & \beta_j \ne 0,\\ 0, & \beta_j = 0. \end{cases}

Collect them into

I=(I1,,Ip).I=(I_1,\ldots,I_p).

For example,

I=(1,1,0)I=(1,1,0)

means that x1x_1 and x2x_2 are included, while x3x_3 is excluded.

Let θ\theta be the prior probability that a predictor is included.

Then a simple prior is

I1,,IpθiidBernoulli(θ).I_1,\ldots,I_p \mid \theta \overset{\mathrm{iid}}{\sim} \mathrm{Bernoulli}(\theta).

This prior controls how large we expect the model to be before seeing the data.

For a selected model II, the posterior probability is proportional to two pieces:

p(Iy,X)p(yX,I)p(I).p(I \mid y,X) \propto p(y \mid X,I)\,p(I).

The first piece,

p(yX,I),p(y \mid X,I),

is the marginal likelihood of the submodel.

The second piece,

p(I),p(I),

is the prior probability of that subset.

Marginal Likelihood for a Selected Submodel

Section titled “Marginal Likelihood for a Selected Submodel”

To compare two predictor subsets, we need a score for each subset. That score is the marginal likelihood after integrating out regression coefficients and variance.

For a selected subset II, let:

  • XIX_I be the design matrix using only the included predictors;
  • βI\beta_I be the included coefficients.

Use

βIσ2N(0,σ2ΩI,01),\beta_I \mid \sigma^2 \sim N(0,\sigma^2\Omega_{I,0}^{-1}),

and

σ2Inv-χ2(ν0,σ02).\sigma^2 \sim \mathrm{Inv\text{-}}\chi^2(\nu_0,\sigma_0^2).

Define

AI=XIXI+ΩI,01.A_I = X_I'X_I+\Omega_{I,0}^{-1}.

This combines information from the data and the prior.

Define

RSSI=yyyXI(XIXI+ΩI,0)1XIy.\mathrm{RSS}_I = y'y - y'X_I(X_I'X_I+\Omega_{I,0})^{-1}X_I'y.

This measures how well the selected submodel fits after accounting for the prior structure.

The marginal likelihood is proportional to three factors:

  1. posterior volume:

    AI1/2;|A_I|^{-1/2};

  2. prior volume:

    ΩI,01/2;|\Omega_{I,0}|^{1/2};

  3. fit after integrating out variance:

    (ν0σ02+RSSI)(ν0+n1)/2.(\nu_0\sigma_0^2+\mathrm{RSS}_I)^{-(\nu_0+n-1)/2}.

Putting the pieces together:

p(yX,I)AI1/2ΩI,01/2(ν0σ02+RSSI)(ν0+n1)/2.\begin{aligned} p(y \mid X,I) \propto &|A_I|^{-1/2} \cdot |\Omega_{I,0}|^{1/2} \\ &\cdot (\nu_0\sigma_0^2+\mathrm{RSS}_I)^{-(\nu_0+n-1)/2}. \end{aligned}

This gives a score for one subset II.

There are 2p2^p possible subsets. For even moderately large pp, enumerating all models is expensive.

But most subsets have tiny posterior probability. A sampler can spend most of its time in the important part of the model space.

The Gibbs sampler updates one indicator while holding the others fixed.

Let

IjI_{-j}

mean all indicators except IjI_j.

For predictor jj, compare two nearby models:

  1. the model with Ij=0I_j=0;
  2. the model with Ij=1I_j=1.

The probability of including predictor jj is proportional to:

Pr(Ij=1Ij,y,X)p(yX,Ij=1,Ij)Pr(Ij=1).\Pr(I_j=1 \mid I_{-j},y,X) \propto p(y \mid X,I_j=1,I_{-j})\,\Pr(I_j=1).

The probability of excluding it is proportional to:

Pr(Ij=0Ij,y,X)p(yX,Ij=0,Ij)Pr(Ij=0).\Pr(I_j=0 \mid I_{-j},y,X) \propto p(y \mid X,I_j=0,I_{-j})\,\Pr(I_j=0).

Normalize these two numbers so they sum to 1, then draw IjI_j.

One full sweep updates I1I_1, then I2I_2, and so on. Across many sweeps, the sampler visits models in proportion to their posterior probability.

General Variable Selection via Metropolis-Hastings

Section titled “General Variable Selection via Metropolis-Hastings”

The Gibbs strategy above is convenient when we can compute

p(yX,I)p(y \mid X,I)

for each proposed subset.

In more complex models, that marginal likelihood may not be available analytically.

Use a Metropolis-Hastings proposal that changes both:

  1. the model indicators II;
  2. the nonzero coefficients β\beta.

A simple approach is to approximate the posterior from the full model:

βy,XN(β^,Jy1(β^)).\beta \mid y,X \approx N(\hat\beta,J_y^{-1}(\hat\beta)).

Then propose coefficients conditional on the zero restrictions implied by the proposed indicator vector.

The purpose is not to enumerate every model, but to move through model space in a way that respects both parameter uncertainty and model uncertainty.

If the data clearly identify one model, selecting it may be harmless. Often they do not.

Several predictor subsets may have similar posterior probability. If we choose one and ignore the rest, uncertainty is understated.

Model averaging keeps this uncertainty.

Suppose γ\gamma has the same interpretation under two models. For example, γ\gamma could be a future prediction.

Under model M1M_1, its posterior is

p1(γy).p_1(\gamma \mid y).

Under model M2M_2, its posterior is

p2(γy).p_2(\gamma \mid y).

Weight each model-specific posterior by the posterior probability of the model:

p(γy)=p(M1y)p1(γy)+p(M2y)p2(γy).p(\gamma \mid y) = p(M_1 \mid y)p_1(\gamma \mid y) + p(M_2 \mid y)p_2(\gamma \mid y).

For many models, the same idea extends by summing over all models.

The averaged predictive distribution includes:

  1. Future noise: the randomness in future observations.
  2. Parameter uncertainty: uncertainty about parameters within each model.
  3. Model uncertainty: uncertainty about which model or subset is appropriate.

This is often a better final answer than reporting only the single most probable subset.

Model comparison does not have to end with choosing one model from a fixed menu.

If model checks reveal a systematic failure, the better response may be model expansion:

Build a model that includes the missing structure.

In practice:

  • Use posterior predictive checks to understand what each model misses.
  • Use WAIC, LOO-CV, or LPS when predictive performance is the target.
  • Use Bayes factors only when the model list and priors are scientifically meaningful.
  • Use model averaging when model uncertainty should propagate into predictions or other shared quantities.

Variable selection turns model comparison into a large model-space problem. Information criteria provide fast estimates of predictive performance. Marginal likelihood methods support Bayes factors when posterior model probabilities are the goal. Indicator variables turn predictor inclusion into a Bayesian inference problem. Gibbs and Metropolis-Hastings samplers explore the model space without enumerating every subset. Model averaging carries model uncertainty into the final inference.

  1. Why does variable selection create a 2p2^p model comparison problem?
  2. What is the difference between predictive information criteria and Bayes-factor-based comparison?
  3. Why can prior Monte Carlo estimates of the marginal likelihood be unstable?
  4. What are the main pieces of the Laplace approximation?
  5. How does Gibbs sampling update one predictor indicator at a time?
  6. What are the three sources of uncertainty captured by Bayesian model averaging?