Regression and Regularization

Source: Lecture 5 — Regression and Regularization (BDA Ch. 14, 20).

The Question of This Chapter

Regression models describe how an outcome changes with predictors.

In a Bayesian course, regression is useful for two reasons:

It is one of the most common applied models.
It shows how priors can act as regularization.

The storyline is:

Start with the linear regression likelihood.
Add priors for the coefficients and error variance.
Extend the linear model with basis functions such as polynomials and splines.
Use shrinkage priors to control overfitting.
Connect Bayesian priors to ridge regression and lasso.

Linear Regression

What Is It For?

Linear regression predicts a continuous outcome $y_i$ from predictors $x_i$.

For observation $i$, the model says

y_i = x_i'\beta+\varepsilon_i, \qquad \varepsilon_i \sim N(0,\sigma^2).

The coefficient vector $\beta$ describes how the mean of $y_i$ changes with the predictors. The variance $\sigma^2$ describes residual variation around that mean.

Matrix Form

Stack the $n$ observations into a vector $y$. Stack the predictors into an $n\times k$ matrix $X$. Usually the first column of $X$ is all ones, giving an intercept.

Then the model is

y=X\beta+\varepsilon, \qquad \varepsilon \sim N(0,\sigma^2 I_n).

Equivalently, the likelihood is

y \mid \beta,\sigma^2,X \sim N(X\beta,\sigma^2 I_n).

This is the sampling model. Bayesian regression is obtained by adding priors for $\beta$ and $\sigma^2$.
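As a minimal sketch of the sampling model, the following simulates one data set from $y \mid \beta,\sigma^2,X \sim N(X\beta,\sigma^2 I_n)$. The function name `simulate_linear_model`, the single predictor, and the parameter values are illustrative assumptions, not part of the lecture.

```python
import numpy as np

def simulate_linear_model(X, beta, sigma, rng):
    """Draw y ~ N(X beta, sigma^2 I_n) for a given design matrix X."""
    n = X.shape[0]
    eps = rng.normal(loc=0.0, scale=sigma, size=n)  # independent N(0, sigma^2) errors
    return X @ beta + eps

rng = np.random.default_rng(0)
n = 50
x = rng.uniform(-1.0, 1.0, size=n)        # one hypothetical predictor
X = np.column_stack([np.ones(n), x])      # first column of ones gives the intercept
beta_true = np.array([1.0, 2.0])          # illustrative values only
y = simulate_linear_model(X, beta_true, sigma=0.5, rng=rng)
```

Here `X` is $n\times k$ with $k=2$: the intercept column plus one predictor.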

Regression with the Standard Noninformative Prior

The Prior

A common default prior for linear regression is

p(\beta,\sigma^2)\propto \sigma^{-2}.

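This prior is improper: it does not integrate to a finite value. The posterior is still proper when $n>k$ and $X$ has full column rank, so the prior works for parameter estimation. Marginal likelihoods are not well defined under an improper prior, so it should not be used for marginal-likelihood model comparison.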

The Familiar OLS Quantities

The ordinary least squares estimate is

\hat\beta=(X'X)^{-1}X'y.

The residual variance estimate is

s^2 = \frac{1}{n-k}(y-X\hat\beta)'(y-X\hat\beta).

Here $k$ is the number of regression coefficients, including the intercept if one is present.

The Posterior

Under the standard prior,

\beta \mid \sigma^2,y \sim N\left(\hat\beta,\sigma^2(X'X)^{-1}\right),

and

\sigma^2 \mid y \sim \mathrm{Scale\text{-}inv\text{-}}\chi^2(n-k,s^2).

After integrating out $\sigma^2$, the marginal posterior for $\beta$ is a multivariate Student distribution:

\beta \mid y \sim t_{n-k}\left(\hat\beta,s^2(X'X)^{-1}\right).

How to Read This Result

The posterior center for $\beta$ is the OLS estimate. The posterior spread is larger when:

the residual variance $s^2$ is large;
the predictors provide weak information, reflected in $(X'X)^{-1}$;
the degrees of freedom $n-k$ are small.

So the Bayesian result keeps the familiar OLS center but turns uncertainty into a posterior distribution.
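The posterior above can be simulated directly: draw $\sigma^2$ from its scaled inverse chi-squared marginal, then draw $\beta$ from its conditional normal. The sketch below assumes simulated data and uses hypothetical helper names (`ols_summary`, `sample_posterior`); it is one way to implement the two-step draw, not a reference implementation.

```python
import numpy as np

def ols_summary(X, y):
    """Return beta_hat, s^2 and X'X for the regression of y on X."""
    n, k = X.shape
    XtX = X.T @ X
    beta_hat = np.linalg.solve(XtX, X.T @ y)   # (X'X)^{-1} X'y without an explicit inverse
    resid = y - X @ beta_hat
    s2 = resid @ resid / (n - k)               # residual variance with n - k degrees of freedom
    return beta_hat, s2, XtX

def sample_posterior(X, y, n_draws, rng):
    """Draw (beta, sigma^2) under the prior proportional to sigma^{-2}."""
    n, k = X.shape
    beta_hat, s2, XtX = ols_summary(X, y)
    # sigma^2 | y ~ Scale-inv-chi^2(n - k, s^2), simulated as (n - k) s^2 / chi^2_{n-k}
    sigma2 = (n - k) * s2 / rng.chisquare(n - k, size=n_draws)
    # beta | sigma^2, y ~ N(beta_hat, sigma^2 (X'X)^{-1})
    L = np.linalg.cholesky(np.linalg.inv(XtX))
    z = rng.standard_normal((n_draws, k))
    beta = beta_hat + np.sqrt(sigma2)[:, None] * (z @ L.T)
    return beta, sigma2

# Hypothetical simulated data, for illustration only
rng = np.random.default_rng(1)
n = 100
X = np.column_stack([np.ones(n), rng.uniform(-1.0, 1.0, size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(scale=0.5, size=n)

beta_draws, sigma2_draws = sample_posterior(X, y, n_draws=4000, rng=rng)
beta_hat, s2, _ = ols_summary(X, y)
```

The rows of `beta_draws` are draws from the marginal Student posterior for $\beta$, so their average should sit close to `beta_hat`.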

Conjugate Prior for Regression

What Is It For?

The noninformative prior is useful as a default, but it does not encode real prior information.

A conjugate prior lets us express prior beliefs about plausible coefficient values and residual variation while keeping closed-form posterior updates.

The Prior

Use

\beta \mid \sigma^2 \sim N(\mu_0,\sigma^2\Omega_0^{-1}),

and

\sigma^2 \sim \mathrm{Scale\text{-}inv\text{-}}\chi^2(\nu_0,\sigma_0^2).

The vector $\mu_0$ is the prior center for $\beta$. The matrix $\Omega_0$ is prior precision. Larger values in $\Omega_0$ mean stronger prior information.

The Posterior Precision

The posterior precision is

\Omega_n=X'X+\Omega_0.

This combines information from the likelihood, $X'X$, with information from the prior, $\Omega_0$.

The Posterior Mean

The posterior mean for $\beta$ conditional on $\sigma^2$ is

\mu_n = (X'X+\Omega_0)^{-1}(X'X\hat\beta+\Omega_0\mu_0).

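This is a matrix-weighted compromise between the data estimate $\hat\beta$ and the prior mean $\mu_0$, with weights given by the two precisions.

When the data are very informative, $X'X$ dominates and $\mu_n$ is close to $\hat\beta$. When the prior precision is large, $\Omega_0$ dominates and $\mu_n$ is close to $\mu_0$.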

The Full Posterior

The conditional posterior for the coefficients is

\beta \mid \sigma^2,y \sim N(\mu_n,\sigma^2\Omega_n^{-1}).

The posterior degrees of freedom are

\nu_n=\nu_0+n.

The posterior scale satisfies

\nu_n\sigma_n^2 = \nu_0\sigma_0^2 +y'y +\mu_0'\Omega_0\mu_0 -\mu_n'\Omega_n\mu_n.

Therefore

\sigma^2 \mid y \sim \mathrm{Scale\text{-}inv\text{-}}\chi^2(\nu_n,\sigma_n^2).
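The conjugate update is a few lines of linear algebra. The sketch below computes $\mu_n$, $\Omega_n$, $\nu_n$ and $\sigma_n^2$ from the formulas above, using the identity $X'X\hat\beta=X'y$ to avoid forming $\hat\beta$. The function name `conjugate_posterior`, the simulated data, and the prior settings are assumptions chosen for illustration.

```python
import numpy as np

def conjugate_posterior(X, y, mu0, Omega0, nu0, sigma02):
    """Posterior parameters (mu_n, Omega_n, nu_n, sigma_n^2) for the conjugate prior."""
    n = X.shape[0]
    Omega_n = X.T @ X + Omega0                      # posterior precision
    # X'X beta_hat equals X'y, so the OLS estimate is never formed explicitly
    mu_n = np.linalg.solve(Omega_n, X.T @ y + Omega0 @ mu0)
    nu_n = nu0 + n                                  # posterior degrees of freedom
    sigma_n2 = (nu0 * sigma02 + y @ y
                + mu0 @ Omega0 @ mu0 - mu_n @ Omega_n @ mu_n) / nu_n
    return mu_n, Omega_n, nu_n, sigma_n2

# Hypothetical simulated data and prior settings
rng = np.random.default_rng(3)
n = 30
X = np.column_stack([np.ones(n), rng.uniform(-1.0, 1.0, size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(scale=0.5, size=n)

mu0 = np.zeros(2)
Omega0 = 5.0 * np.eye(2)
mu_n, Omega_n, nu_n, sigma_n2 = conjugate_posterior(
    X, y, mu0, Omega0, nu0=4.0, sigma02=1.0
)
```

Rerunning with a much smaller or much larger multiple of the identity for `Omega0` moves `mu_n` toward the OLS estimate or toward `mu0`, respectively.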

Basis Expansions

The Main Idea

Linear regression is linear in the coefficients, not necessarily linear in the original predictor.

We can create new columns of $X$ from transformations of $x$ and still use the same regression machinery.

This is called a basis expansion.

Polynomial Regression

Step 1: Choose Basis Functions

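In this section $k$ denotes the polynomial degree rather than the number of coefficients. For one predictor $x$, a polynomial of degree $k$ uses the $k+1$ basis functions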

1,x,x^2,\ldots,x^k.

The regression function is

f(x_i) = \beta_0+\beta_1x_i+\beta_2x_i^2+\cdots+\beta_kx_i^k.

Step 2: Put the Basis in the Design Matrix

Let $X_P$ contain columns

1,x,x^2,\ldots,x^k.

Then the model is still

y=X_P\beta+\varepsilon.

Since the model is linear in $\beta$, the posterior formulas from linear regression still apply.
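Building $X_P$ is a one-liner with NumPy's Vandermonde helper. This is a small sketch; the function name `polynomial_design` is hypothetical. Note that a degree-$k$ polynomial gives $k+1$ columns, including the constant.

```python
import numpy as np

def polynomial_design(x, degree):
    """Design matrix with columns 1, x, x^2, ..., x^degree."""
    x = np.asarray(x, dtype=float)
    # increasing=True orders the columns from x^0 up to x^degree
    return np.vander(x, N=degree + 1, increasing=True)

X_P = polynomial_design([2.0, 3.0], degree=2)
# rows are (1, 2, 4) and (1, 3, 9)
```

In practice, centering and scaling $x$ before taking powers tends to keep $X_P'X_P$ better conditioned.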

What Can Go Wrong?

High-degree polynomials are global. Changing one coefficient affects the fitted curve everywhere.

This can make polynomial regression unstable at the edges of the data and overly sensitive to the chosen degree. Splines address this by using more local basis functions.

Spline Regression

What Is It For?

Splines allow flexible curves without making every part of the curve depend equally on every coefficient.

They do this by adding basis functions that change behavior at selected points called knots.

Truncated Power Basis

Let the knots be

\kappa_1,\ldots,\kappa_m.

For a polynomial order $p$, define the truncated basis function

b_j(x_i) = \begin{cases} (x_i-\kappa_j)^p, & x_i>\kappa_j,\\ 0, & x_i\le \kappa_j. \end{cases}

Then a spline regression can be written as

y=X_b\beta+\varepsilon,

where $X_b$ contains ordinary polynomial terms and the knot-based columns.

Again, this is still a linear regression in the coefficient vector $\beta$.
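The design matrix $X_b$ can be assembled as follows. This sketch assumes that the "ordinary polynomial terms" are $1,x,\ldots,x^p$ for the same order $p$ as the truncated terms; the function name `truncated_power_basis` is hypothetical.

```python
import numpy as np

def truncated_power_basis(x, knots, p=1):
    """Columns 1, x, ..., x^p followed by (x - kappa_j)_+^p for each knot."""
    x = np.asarray(x, dtype=float)
    knots = np.asarray(knots, dtype=float)
    poly_cols = np.vander(x, N=p + 1, increasing=True)
    diff = x[:, None] - knots[None, :]
    # (x - kappa_j)^p where x > kappa_j, exactly 0 otherwise
    knot_cols = np.where(diff > 0, np.maximum(diff, 0.0) ** p, 0.0)
    return np.hstack([poly_cols, knot_cols])

X_b = truncated_power_basis([0.0, 1.0, 2.0], knots=[1.0], p=2)
# columns: 1, x, x^2, (x - 1)_+^2
```

Each knot column is zero to the left of its knot, which is what makes the basis more local than a pure polynomial.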

Why Regularization Is Needed

The Problem

Flexible bases can fit many shapes. If we add many polynomial terms or many knots, the model can follow noise rather than signal.

In frequentist language, this is overfitting. In Bayesian language, it means the likelihood alone allows too many rough or unstable functions.

The Bayesian Response

Add a prior that expresses the belief that large unnecessary coefficients are unlikely.

For spline coefficients, a common shrinkage prior is

\beta_j \mid \sigma^2,\lambda \overset{\mathrm{iid}}{\sim} N\left(0,\frac{\sigma^2}{\lambda}\right), \qquad j=1,\ldots,m.

The parameter $\lambda$ is the shrinkage strength:

small $\lambda$ gives a weak penalty and allows a flexible fit;
large $\lambda$ gives a strong penalty and produces a smoother fit.

Ridge Regression as a Bayesian Posterior Mode

The Prior

Consider the normal shrinkage prior

\beta \mid \sigma^2,\lambda \sim N(0,\sigma^2\lambda^{-1}I).

This corresponds to a prior precision matrix

\Omega_0=\lambda I.

The Posterior Mean and Mode

With $\mu_0=0$ and known $\sigma^2$, the posterior mean is

\tilde\beta = (X'X+\lambda I)^{-1}X'y.

The conditional posterior is normal, so its mean and mode coincide. That common value is the ridge regression estimator, which minimizes $\|y-X\beta\|^2+\lambda\|\beta\|^2$.

How to Read the Formula

Compare ridge with OLS:

\hat\beta=(X'X)^{-1}X'y.

Ridge adds $\lambda I$ before inversion:

\tilde\beta=(X'X+\lambda I)^{-1}X'y.

This makes the fitted coefficients smaller and more stable, especially when predictors are correlated or when there are many predictors.

If the columns are orthogonal and scaled so that

X'X=I,

then

\tilde\beta = \frac{1}{1+\lambda}\hat\beta.

Every coefficient is shrunk toward zero by the same factor.
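The orthonormal case can be checked numerically. The sketch below builds a design with $X'X=I$ from a QR decomposition of random numbers and compares the ridge estimate with $\hat\beta/(1+\lambda)$. The data are simulated and `ridge_estimate` is a hypothetical helper; like the formula above, it penalizes every column, including any intercept.

```python
import numpy as np

def ridge_estimate(X, y, lam):
    """Posterior mean and mode (X'X + lam I)^{-1} X'y under the normal shrinkage prior."""
    k = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(k), X.T @ y)

rng = np.random.default_rng(2)
n, k, lam = 40, 3, 0.5
# QR gives a design with orthonormal columns, so X'X = I (hypothetical data)
X, _ = np.linalg.qr(rng.standard_normal((n, k)))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.3, size=n)

beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
beta_ridge = ridge_estimate(X, y, lam)
shrink_ok = np.allclose(beta_ridge, beta_ols / (1.0 + lam))
```

With correlated columns the shrinkage is no longer uniform: directions in which $X'X$ has small eigenvalues are shrunk more.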

Lasso and the Laplace Prior

What Changes?

Ridge shrinks coefficients, but it usually does not set them exactly to zero.

Lasso uses a different penalty that can produce sparse estimates.

Bayesian Interpretation

The lasso posterior mode corresponds to an independent Laplace prior on coefficients:

\beta_j \mid \sigma^2,\lambda \overset{\mathrm{iid}}{\sim} \mathrm{Laplace}\left(0,\frac{\sigma^2}{\lambda}\right).

The Laplace density has a sharper peak at zero than the normal density: its log has a kink at zero rather than a smooth quadratic maximum. That kink is what lets the posterior mode set some coefficients exactly to zero.
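A small numerical illustration of the difference, under two assumptions that the text does not spell out: the second Laplace argument is a scale parameter, so the posterior mode minimizes $\tfrac12\|y-X\beta\|^2+\lambda\sum_j|\beta_j|$, and the design is orthonormal ($X'X=I$). In that special case the lasso mode is the soft-thresholded OLS estimate, a standard closed form. The function names are hypothetical.

```python
import numpy as np

def soft_threshold(beta_hat, lam):
    """Lasso mode when X'X = I: move each coefficient toward 0 by lam, stopping at 0."""
    return np.sign(beta_hat) * np.maximum(np.abs(beta_hat) - lam, 0.0)

def ridge_shrink(beta_hat, lam):
    """Ridge mode when X'X = I: scale every coefficient by 1 / (1 + lam)."""
    return beta_hat / (1.0 + lam)

beta_hat = np.array([3.0, -0.5, 1.5])            # illustrative OLS estimates
lasso_mode = soft_threshold(beta_hat, lam=1.0)   # the small coefficient becomes exactly zero
ridge_mode = ridge_shrink(beta_hat, lam=1.0)     # every coefficient is halved, none is zero
```

The lasso moves each coefficient by a fixed amount and stops at zero, while ridge rescales proportionally and therefore never reaches zero for a nonzero $\hat\beta_j$.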

Ridge Versus Lasso

Prior	Penalized estimator	Typical behavior
Normal prior	Ridge	Shrinks coefficients toward zero
Laplace prior	Lasso	Shrinks and can set coefficients to zero

The important distinction is about the mode. The full Bayesian posterior under a Laplace prior still represents uncertainty about coefficients; the exact sparsity is a property of the posterior mode.

Estimating the Shrinkage Parameter

Why Not Always Fix $\lambda$?

The shrinkage parameter controls the bias-variance tradeoff. Choosing it by hand can be difficult.

A Bayesian model can put a prior on $\lambda$ and learn about it from the data.

A Hierarchical Model

One possible hierarchy is

y \mid \beta,\sigma^2 \sim N(X\beta,\sigma^2I_n),

\beta \mid \sigma^2,\lambda \sim N(0,\sigma^2\lambda^{-1}I_m),

\sigma^2 \sim \mathrm{Scale\text{-}inv\text{-}}\chi^2(\nu_0,\sigma_0^2),

and

\lambda \sim \mathrm{Scale\text{-}inv\text{-}}\chi^2(\eta_0,\lambda_0).

The data then inform $\lambda$ through the estimated size and variability of the coefficients.

What Computation Is Needed?

With $\lambda$ unknown, the posterior for

(\beta,\sigma^2,\lambda)

usually needs simulation. Gibbs sampling is a natural approach when the conditional distributions have convenient forms.
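The following is a sketch of a Gibbs sampler for a closely related model, not for the exact hierarchy above. It swaps the scaled inverse chi-squared prior on $\lambda$ for a Gamma prior with shape `a` and rate `b`, because that choice makes all three full conditionals standard distributions. The hyperparameter names `a`, `b`, their default values, and the function name `gibbs_ridge` are assumptions for illustration.

```python
import numpy as np

def gibbs_ridge(X, y, n_iter, rng, nu0=1.0, sigma02=1.0, a=1.0, b=1.0):
    """Gibbs sampler for (beta, sigma^2, lambda), assuming lambda ~ Gamma(a, rate=b)."""
    n, m = X.shape
    XtX, Xty = X.T @ X, X.T @ y
    sigma2, lam = 1.0, 1.0                       # arbitrary starting values
    beta_draws = np.empty((n_iter, m))
    sigma2_draws = np.empty(n_iter)
    lam_draws = np.empty(n_iter)
    for t in range(n_iter):
        # beta | sigma^2, lambda, y ~ N(mu_n, sigma^2 Omega_n^{-1}), Omega_n = X'X + lambda I
        Omega_n = XtX + lam * np.eye(m)
        mu_n = np.linalg.solve(Omega_n, Xty)
        L = np.linalg.cholesky(Omega_n)          # Omega_n = L L'
        z = rng.standard_normal(m)
        beta = mu_n + np.sqrt(sigma2) * np.linalg.solve(L.T, z)
        # sigma^2 | beta, lambda, y ~ Scale-inv-chi^2 with nu0 + n + m degrees of freedom
        resid = y - X @ beta
        scale_sum = nu0 * sigma02 + resid @ resid + lam * (beta @ beta)
        sigma2 = scale_sum / rng.chisquare(nu0 + n + m)
        # lambda | beta, sigma^2 ~ Gamma(a + m/2, rate = b + beta'beta / (2 sigma^2))
        rate = b + (beta @ beta) / (2.0 * sigma2)
        lam = rng.gamma(shape=a + m / 2.0, scale=1.0 / rate)
        beta_draws[t], sigma2_draws[t], lam_draws[t] = beta, sigma2, lam
    return beta_draws, sigma2_draws, lam_draws

# Hypothetical simulated data
rng = np.random.default_rng(4)
X = rng.standard_normal((200, 3))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.5, size=200)

beta_draws, sigma2_draws, lam_draws = gibbs_ridge(X, y, n_iter=2000, rng=rng)
beta_post_mean = beta_draws[500:].mean(axis=0)   # discard the first 500 draws as burn-in
```

With the scaled inverse chi-squared prior on $\lambda$ from the hierarchy above, the conditional for $\lambda$ is not a standard family under this parameterization, so that step would need a different update, for example a grid or Metropolis step.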

The conceptual point is simple: regularization can be learned as part of the model rather than selected outside it.

Extensions

The same framework can be extended in several directions:

Treat knot locations as unknown parameters.
Model nonconstant error variance with another regression or spline.
Use Student-$t$ or mixture errors for robustness.
Add autocorrelation when residuals are time-dependent.

Each extension changes the model, but the basic Bayesian workflow remains the same: write the likelihood, choose priors, compute or simulate the posterior, and interpret predictions or fitted functions.

Chapter Summary

Bayesian linear regression combines a normal likelihood with priors for coefficients and residual variance. With the standard noninformative prior, the posterior is centered on the OLS estimate. With a conjugate prior, the posterior mean is a compromise between the data estimate and the prior mean.

Polynomial and spline regressions are basis expansions: they remain linear in the coefficients. Regularization enters through priors that shrink coefficients toward zero. Normal priors connect to ridge regression, while Laplace priors connect to lasso. A hierarchical Bayesian model can also estimate the shrinkage strength from the data.

Check Your Understanding

What does the matrix $X'X$ represent in the posterior for linear regression?
How does the conjugate posterior mean combine $\hat\beta$ and $\mu_0$?
Why can high-degree polynomial regression behave poorly near the edges of the data?
What does a larger value of $\lambda$ do in a normal shrinkage prior?
Why does lasso produce sparse posterior modes while ridge usually does not?
