Skip to content

Regression and Regularization

Source: Lecture 5 — Regression and Regularization (BDA Ch. 14, 20).

Regression models describe how an outcome changes with predictors.

In a Bayesian course, regression is useful for two reasons:

  1. It is one of the most common applied models.
  2. It shows how priors can act as regularization.

The storyline is:

  1. Start with the linear regression likelihood.
  2. Add priors for the coefficients and error variance.
  3. Extend the linear model with basis functions such as polynomials and splines.
  4. Use shrinkage priors to control overfitting.
  5. Connect Bayesian priors to ridge regression and lasso.

Linear regression predicts a continuous outcome yiy_i from predictors xix_i.

For observation ii, the model says

yi=xiβ+εi,εiN(0,σ2).y_i = x_i'\beta+\varepsilon_i, \qquad \varepsilon_i \sim N(0,\sigma^2).

The coefficient vector β\beta describes how the mean of yiy_i changes with the predictors. The variance σ2\sigma^2 describes residual variation around that mean.

Stack the nn observations into a vector yy. Stack the predictors into an n×kn\times k matrix XX. Usually the first column of XX is all ones, giving an intercept.

Then the model is

y=Xβ+ε,εN(0,σ2In).y=X\beta+\varepsilon, \qquad \varepsilon \sim N(0,\sigma^2 I_n).

Equivalently, the likelihood is

yβ,σ2,XN(Xβ,σ2In).y \mid \beta,\sigma^2,X \sim N(X\beta,\sigma^2 I_n).

This is the sampling model. Bayesian regression is obtained by adding priors for β\beta and σ2\sigma^2.

Regression with the Standard Noninformative Prior

Section titled “Regression with the Standard Noninformative Prior”

A common default prior for linear regression is

p(β,σ2)σ2.p(\beta,\sigma^2)\propto \sigma^{-2}.

This prior is improper, so it is mainly used for parameter estimation, not for marginal-likelihood model comparison.

The ordinary least squares estimate is

β^=(XX)1Xy.\hat\beta=(X'X)^{-1}X'y.

The residual variance estimate is

s2=1nk(yXβ^)(yXβ^).s^2 = \frac{1}{n-k}(y-X\hat\beta)'(y-X\hat\beta).

Here kk is the number of regression coefficients, including the intercept if one is present.

Under the standard prior,

βσ2,yN(β^,σ2(XX)1),\beta \mid \sigma^2,y \sim N\left(\hat\beta,\sigma^2(X'X)^{-1}\right),

and

σ2yScale-inv-χ2(nk,s2).\sigma^2 \mid y \sim \mathrm{Scale\text{-}inv\text{-}}\chi^2(n-k,s^2).

After integrating out σ2\sigma^2, the marginal posterior for β\beta is a multivariate Student distribution:

βytnk(β^,s2(XX)1).\beta \mid y \sim t_{n-k}\left(\hat\beta,s^2(X'X)^{-1}\right).

The posterior center for β\beta is the OLS estimate. The posterior spread is larger when:

  • the residual variance s2s^2 is large;
  • the predictors provide weak information, reflected in (XX)1(X'X)^{-1};
  • the degrees of freedom nkn-k are small.

So the Bayesian result keeps the familiar OLS center but turns uncertainty into a posterior distribution.

The noninformative prior is useful as a default, but it does not encode real prior information.

A conjugate prior lets us express prior beliefs about plausible coefficient values and residual variation while keeping closed-form posterior updates.

Use

βσ2N(μ0,σ2Ω01),\beta \mid \sigma^2 \sim N(\mu_0,\sigma^2\Omega_0^{-1}),

and

σ2Scale-inv-χ2(ν0,σ02).\sigma^2 \sim \mathrm{Scale\text{-}inv\text{-}}\chi^2(\nu_0,\sigma_0^2).

The vector μ0\mu_0 is the prior center for β\beta. The matrix Ω0\Omega_0 is prior precision. Larger values in Ω0\Omega_0 mean stronger prior information.

The posterior precision is

Ωn=XX+Ω0.\Omega_n=X'X+\Omega_0.

This combines information from the likelihood, XXX'X, with information from the prior, Ω0\Omega_0.

The posterior mean for β\beta conditional on σ2\sigma^2 is

μn=(XX+Ω0)1(XXβ^+Ω0μ0).\mu_n = (X'X+\Omega_0)^{-1}(X'X\hat\beta+\Omega_0\mu_0).

This is a weighted compromise between the data estimate β^\hat\beta and the prior mean μ0\mu_0.

When the data are very informative, XXX'X dominates. When the prior precision is large, Ω0\Omega_0 dominates.

The conditional posterior for the coefficients is

βσ2,yN(μn,σ2Ωn1).\beta \mid \sigma^2,y \sim N(\mu_n,\sigma^2\Omega_n^{-1}).

The posterior degrees of freedom are

νn=ν0+n.\nu_n=\nu_0+n.

The posterior scale satisfies

νnσn2=ν0σ02+yy+μ0Ω0μ0μnΩnμn.\nu_n\sigma_n^2 = \nu_0\sigma_0^2 +y'y +\mu_0'\Omega_0\mu_0 -\mu_n'\Omega_n\mu_n.

Therefore

σ2yScale-inv-χ2(νn,σn2).\sigma^2 \mid y \sim \mathrm{Scale\text{-}inv\text{-}}\chi^2(\nu_n,\sigma_n^2).

Linear regression is linear in the coefficients, not necessarily linear in the original predictor.

We can create new columns of XX from transformations of xx and still use the same regression machinery.

This is called a basis expansion.

For one predictor xx, a polynomial of degree kk uses

1,x,x2,,xk.1,x,x^2,\ldots,x^k.

The regression function is

f(xi)=β0+β1xi+β2xi2++βkxik.f(x_i) = \beta_0+\beta_1x_i+\beta_2x_i^2+\cdots+\beta_kx_i^k.

Step 2: Put the Basis in the Design Matrix

Section titled “Step 2: Put the Basis in the Design Matrix”

Let XPX_P contain columns

1,x,x2,,xk.1,x,x^2,\ldots,x^k.

Then the model is still

y=XPβ+ε.y=X_P\beta+\varepsilon.

Since the model is linear in β\beta, the posterior formulas from linear regression still apply.

High-degree polynomials are global. Changing one coefficient affects the fitted curve everywhere.

This can make polynomial regression unstable at the edges of the data and overly sensitive to the chosen degree. Splines address this by using more local basis functions.

Splines allow flexible curves without making every part of the curve depend equally on every coefficient.

They do this by adding basis functions that change behavior at selected points called knots.

Let the knots be

κ1,,κm.\kappa_1,\ldots,\kappa_m.

For a polynomial order pp, define the truncated basis function

bj(xi)={(xiκj)p,xi>κj,0,xiκj.b_j(x_i) = \begin{cases} (x_i-\kappa_j)^p, & x_i>\kappa_j,\\ 0, & x_i\le \kappa_j. \end{cases}

Then a spline regression can be written as

y=Xbβ+ε,y=X_b\beta+\varepsilon,

where XbX_b contains ordinary polynomial terms and the knot-based columns.

Again, this is still a linear regression in the coefficient vector β\beta.

Flexible bases can fit many shapes. If we add many polynomial terms or many knots, the model can follow noise rather than signal.

In frequentist language, this is overfitting. In Bayesian language, it means the likelihood alone allows too many rough or unstable functions.

Add a prior that expresses the belief that large unnecessary coefficients are unlikely.

For spline coefficients, a common shrinkage prior is

βjσ2,λiidN(0,σ2λ),j=1,,m.\beta_j \mid \sigma^2,\lambda \overset{\mathrm{iid}}{\sim} N\left(0,\frac{\sigma^2}{\lambda}\right), \qquad j=1,\ldots,m.

The parameter λ\lambda is the shrinkage strength:

  • small λ\lambda gives a weak penalty and allows a flexible fit;
  • large λ\lambda gives a strong penalty and produces a smoother fit.

Ridge Regression as a Bayesian Posterior Mode

Section titled “Ridge Regression as a Bayesian Posterior Mode”

Consider the normal shrinkage prior

βσ2,λN(0,σ2λ1I).\beta \mid \sigma^2,\lambda \sim N(0,\sigma^2\lambda^{-1}I).

This corresponds to a prior precision matrix

Ω0=λI.\Omega_0=\lambda I.

With μ0=0\mu_0=0 and known σ2\sigma^2, the posterior mean is

β~=(XX+λI)1Xy.\tilde\beta = (X'X+\lambda I)^{-1}X'y.

For a normal posterior, the posterior mean and posterior mode are the same. This is the ridge regression estimator.

Compare ridge with OLS:

β^=(XX)1Xy.\hat\beta=(X'X)^{-1}X'y.

Ridge adds λI\lambda I before inversion:

β~=(XX+λI)1Xy.\tilde\beta=(X'X+\lambda I)^{-1}X'y.

This makes the fitted coefficients smaller and more stable, especially when predictors are correlated or when there are many predictors.

If the columns are orthogonal and scaled so that

XX=I,X'X=I,

then

β~=11+λβ^.\tilde\beta = \frac{1}{1+\lambda}\hat\beta.

Every coefficient is shrunk toward zero by the same factor.

Ridge shrinks coefficients, but it usually does not set them exactly to zero.

Lasso uses a different penalty that can produce sparse estimates.

The lasso posterior mode corresponds to an independent Laplace prior on coefficients:

βjσ2,λiidLaplace(0,σ2λ).\beta_j \mid \sigma^2,\lambda \overset{\mathrm{iid}}{\sim} \mathrm{Laplace}\left(0,\frac{\sigma^2}{\lambda}\right).

The Laplace density has a sharper peak at zero than the normal density. This makes exact zeros possible for the posterior mode.

PriorPenalized estimatorTypical behavior
Normal priorRidgeShrinks coefficients toward zero
Laplace priorLassoShrinks and can set coefficients to zero

The important distinction is about the mode. The full Bayesian posterior under a Laplace prior still represents uncertainty about coefficients; the exact sparsity is a property of the posterior mode.

The shrinkage parameter controls the bias-variance tradeoff. Choosing it by hand can be difficult.

A Bayesian model can put a prior on λ\lambda and learn about it from the data.

One possible hierarchy is

yβ,σ2N(Xβ,σ2In),y \mid \beta,\sigma^2 \sim N(X\beta,\sigma^2I_n), βσ2,λN(0,σ2λ1Im),\beta \mid \sigma^2,\lambda \sim N(0,\sigma^2\lambda^{-1}I_m), σ2Scale-inv-χ2(ν0,σ02),\sigma^2 \sim \mathrm{Scale\text{-}inv\text{-}}\chi^2(\nu_0,\sigma_0^2),

and

λScale-inv-χ2(η0,λ0).\lambda \sim \mathrm{Scale\text{-}inv\text{-}}\chi^2(\eta_0,\lambda_0).

The data then inform λ\lambda through the estimated size and variability of the coefficients.

With λ\lambda unknown, the posterior for

(β,σ2,λ)(\beta,\sigma^2,\lambda)

usually needs simulation. Gibbs sampling is a natural approach when the conditional distributions have convenient forms.

The conceptual point is simple: regularization can be learned as part of the model rather than selected outside it.

The same framework can be extended in several directions:

  • Treat knot locations as unknown parameters.
  • Model nonconstant error variance with another regression or spline.
  • Use Student-tt or mixture errors for robustness.
  • Add autocorrelation when residuals are time-dependent.

Each extension changes the model, but the basic Bayesian workflow remains the same: write the likelihood, choose priors, compute or simulate the posterior, and interpret predictions or fitted functions.

Bayesian linear regression combines a normal likelihood with priors for coefficients and residual variance. With the standard noninformative prior, the posterior is centered on the OLS estimate. With a conjugate prior, the posterior mean is a compromise between the data estimate and the prior mean.

Polynomial and spline regressions are basis expansions: they remain linear in the coefficients. Regularization enters through priors that shrink coefficients toward zero. Normal priors connect to ridge regression, while Laplace priors connect to lasso. A hierarchical Bayesian model can also estimate the shrinkage strength from the data.

  1. What does the matrix XXX'X represent in the posterior for linear regression?
  2. How does the conjugate posterior mean combine β^\hat\beta and μ0\mu_0?
  3. Why can high-degree polynomial regression behave poorly near the edges of the data?
  4. What does a larger value of λ\lambda do in a normal shrinkage prior?
  5. Why does lasso produce sparse posterior modes while ridge usually does not?