Regression and Regularization
Source: Lecture 5 — Regression and Regularization (BDA Ch. 14, 20).
The Question of This Chapter
Regression models describe how an outcome changes with predictors.
In a Bayesian course, regression is useful for two reasons:
- It is one of the most common applied models.
- It shows how priors can act as regularization.
The storyline is:
- Start with the linear regression likelihood.
- Add priors for the coefficients and error variance.
- Extend the linear model with basis functions such as polynomials and splines.
- Use shrinkage priors to control overfitting.
- Connect Bayesian priors to ridge regression and lasso.
Linear Regression
What Is It For?
Linear regression predicts a continuous outcome $y$ from predictors $x$.
For observation $i$, the model says
$$
y_i = x_i^\top \beta + \varepsilon_i, \qquad \varepsilon_i \overset{\text{iid}}{\sim} N(0, \sigma^2).
$$
The coefficient vector $\beta$ describes how the mean of $y$ changes with the predictors. The variance $\sigma^2$ describes residual variation around that mean.
Matrix Form
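Stack the observations into a vector $y \in \mathbb{R}^n$. Stack the predictors into an $n \times k$ matrix $X$. Usually the first column of $X$ is all ones, giving an intercept.
Then the model is
$$
y = X\beta + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2 I_n).
$$
Equivalently, the likelihood is
$$
p(y \mid \beta, \sigma^2, X) = (2\pi\sigma^2)^{-n/2} \exp\!\left(-\frac{1}{2\sigma^2}(y - X\beta)^\top (y - X\beta)\right).
$$
This is the sampling model. Bayesian regression is obtained by adding priors for $\beta$ and $\sigma^2$.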
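As a minimal sketch of the sampling model, the snippet below simulates data from $y = X\beta + \varepsilon$ and evaluates the Gaussian log-likelihood. The function names, the sample size, and the coefficient values are illustrative assumptions, not part of the lecture.

```python
import numpy as np

def simulate_linear_model(X, beta, sigma2, rng):
    """Draw y = X beta + eps with eps ~ N(0, sigma2 * I)."""
    n = X.shape[0]
    return X @ beta + rng.normal(scale=np.sqrt(sigma2), size=n)

def gaussian_log_likelihood(y, X, beta, sigma2):
    """Log density of y under N(X beta, sigma2 * I)."""
    n = y.shape[0]
    resid = y - X @ beta
    return -0.5 * n * np.log(2.0 * np.pi * sigma2) - 0.5 * (resid @ resid) / sigma2

rng = np.random.default_rng(0)
n = 50
# Design matrix: a column of ones (intercept) and one simulated predictor
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([1.0, 2.0])  # hypothetical values chosen for illustration
y = simulate_linear_model(X, beta_true, sigma2=0.25, rng=rng)

loglik_at_truth = gaussian_log_likelihood(y, X, beta_true, 0.25)
loglik_far_away = gaussian_log_likelihood(y, X, beta_true + 5.0, 0.25)
print(loglik_at_truth > loglik_far_away)
```

The likelihood is much higher near the coefficients that generated the data than far from them, which is the information the posterior later combines with the prior.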
Regression with the Standard Noninformative Prior
The Prior
A common default prior for linear regression is
$$
p(\beta, \sigma^2) \propto \sigma^{-2}.
$$
This prior is improper, so it is mainly used for parameter estimation, not for marginal-likelihood model comparison.
The Familiar OLS Quantities
The ordinary least squares estimate is
$$
\hat\beta = (X^\top X)^{-1} X^\top y.
$$
The residual variance estimate is
$$
s^2 = \frac{1}{n - k}\,(y - X\hat\beta)^\top (y - X\hat\beta).
$$
Here $k$ is the number of regression coefficients, including the intercept if one is present.
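The two OLS quantities can be computed in a few lines. This is a sketch using NumPy's least-squares solver instead of an explicit matrix inverse; the helper name `ols_quantities` and the tiny data set are made up for illustration.

```python
import numpy as np

def ols_quantities(X, y):
    """Return the OLS estimate beta_hat and the residual variance s^2."""
    n, k = X.shape
    # lstsq solves the least-squares problem without forming (X'X)^{-1}
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta_hat
    s2 = resid @ resid / (n - k)  # divide by n - k, not n
    return beta_hat, s2

# Three observations, intercept plus one predictor
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
y = np.array([0.0, 1.0, 1.0])
beta_hat, s2 = ols_quantities(X, y)
print(beta_hat, s2)
```

For this data set the normal equations can be solved by hand, giving $\hat\beta = (1/6,\ 1/2)$ and $s^2 = 1/6$.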
The Posterior
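Under the standard prior,
$$
\beta \mid \sigma^2, y \sim N\!\left(\hat\beta,\ \sigma^2 (X^\top X)^{-1}\right)
$$
and
$$
\sigma^2 \mid y \sim \text{Inv-}\chi^2\!\left(n - k,\ s^2\right).
$$
After integrating out $\sigma^2$, the marginal posterior for $\beta$ is a multivariate Student $t$ distribution:
$$
\beta \mid y \sim t_{n-k}\!\left(\hat\beta,\ s^2 (X^\top X)^{-1}\right).
$$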
How to Read This Result
The posterior center for $\beta$ is the OLS estimate $\hat\beta$. The posterior spread is larger when:
- the residual variance $s^2$ is large;
- the predictors provide weak information, reflected in $(X^\top X)^{-1}$;
- the degrees of freedom $n - k$ are small.
So the Bayesian result keeps the familiar OLS center but turns uncertainty into a posterior distribution.
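One way to turn the posterior under the standard noninformative prior into numbers is to simulate from it in two steps: draw $\sigma^2$ from its scaled inverse-$\chi^2$ marginal, then draw $\beta$ from its normal conditional. The sketch below assumes that factorization; the function name and the synthetic data are illustrative assumptions.

```python
import numpy as np

def sample_posterior_noninformative(X, y, n_draws, rng):
    """Draw (beta, sigma2) from the posterior under p(beta, sigma2) ∝ 1/sigma2."""
    n, k = X.shape
    XtX = X.T @ X
    beta_hat = np.linalg.solve(XtX, X.T @ y)
    resid = y - X @ beta_hat
    s2 = resid @ resid / (n - k)
    # sigma2 | y ~ Inv-chi2(n - k, s2), drawn as (n - k) * s2 / chi2_{n-k}
    sigma2 = (n - k) * s2 / rng.chisquare(n - k, size=n_draws)
    # beta | sigma2, y ~ N(beta_hat, sigma2 * (X'X)^{-1})
    L = np.linalg.cholesky(np.linalg.inv(XtX))  # L @ L.T = (X'X)^{-1}
    z = rng.standard_normal((n_draws, k))
    beta = beta_hat + np.sqrt(sigma2)[:, None] * (z @ L.T)
    return beta, sigma2

rng = np.random.default_rng(0)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)  # synthetic data

beta_draws, sigma2_draws = sample_posterior_noninformative(X, y, 4000, rng)
print(beta_draws.mean(axis=0))                             # close to the OLS estimate
print(np.percentile(beta_draws[:, 1], [2.5, 97.5]))        # 95% interval for the slope
```

The marginal draws of $\beta$ follow the multivariate Student $t$ distribution without that density ever being coded explicitly.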
Conjugate Prior for Regression
What Is It For?
The noninformative prior is useful as a default, but it does not encode real prior information.
A conjugate prior lets us express prior beliefs about plausible coefficient values and residual variation while keeping closed-form posterior updates.
The Prior
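Use
$$
\beta \mid \sigma^2 \sim N\!\left(\mu_0,\ \sigma^2 \Omega_0^{-1}\right)
$$
and
$$
\sigma^2 \sim \text{Inv-}\chi^2\!\left(\nu_0,\ \sigma_0^2\right).
$$
The vector $\mu_0$ is the prior center for $\beta$. The matrix $\Omega_0$ is the prior precision. Larger values in $\Omega_0$ mean stronger prior information.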
The Posterior Precision
The posterior precision is
$$
\Omega_n = X^\top X + \Omega_0.
$$
This combines information from the likelihood, $X^\top X$, with information from the prior, $\Omega_0$.
The Posterior Mean
The posterior mean for $\beta$ conditional on $\sigma^2$ is
$$
\mu_n = (X^\top X + \Omega_0)^{-1}\left(X^\top X \hat\beta + \Omega_0 \mu_0\right).
$$
This is a weighted compromise between the data estimate $\hat\beta$ and the prior mean $\mu_0$.
When the data are very informative, $\hat\beta$ dominates. When the prior precision is large, $\mu_0$ dominates.
The Full Posterior
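The conditional posterior for the coefficients is
$$
\beta \mid \sigma^2, y \sim N\!\left(\mu_n,\ \sigma^2 \Omega_n^{-1}\right).
$$
The posterior degrees of freedom are
$$
\nu_n = \nu_0 + n.
$$
The posterior scale satisfies
$$
\nu_n \sigma_n^2 = \nu_0 \sigma_0^2 + \left(y^\top y + \mu_0^\top \Omega_0 \mu_0 - \mu_n^\top \Omega_n \mu_n\right).
$$
Therefore
$$
\sigma^2 \mid y \sim \text{Inv-}\chi^2\!\left(\nu_n,\ \sigma_n^2\right).
$$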
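The conjugate update can be sketched as one function. The code assumes the prior is written as $\beta \mid \sigma^2 \sim N(\mu_0, \sigma^2 \Omega_0^{-1})$ and $\sigma^2 \sim \text{Inv-}\chi^2(\nu_0, \sigma_0^2)$; the function name, argument names, and the two-observation data set are illustrative assumptions.

```python
import numpy as np

def conjugate_posterior(X, y, mu0, Omega0, nu0, sigma0_sq):
    """Return (mu_n, Omega_n, nu_n, sigma_n_sq) for the conjugate regression prior."""
    n = X.shape[0]
    Omega_n = X.T @ X + Omega0
    # X'X beta_hat equals X'y, so beta_hat never has to be formed explicitly
    mu_n = np.linalg.solve(Omega_n, X.T @ y + Omega0 @ mu0)
    nu_n = nu0 + n
    nu_sigma_n_sq = (nu0 * sigma0_sq
                     + y @ y
                     + mu0 @ Omega0 @ mu0
                     - mu_n @ Omega_n @ mu_n)
    return mu_n, Omega_n, nu_n, nu_sigma_n_sq / nu_n

# Intercept-only model with two observations, small enough to check by hand
X = np.array([[1.0], [1.0]])
y = np.array([1.0, 3.0])
mu0 = np.array([0.0])
Omega0 = np.array([[2.0]])

mu_n, Omega_n, nu_n, sigma_n_sq = conjugate_posterior(X, y, mu0, Omega0, nu0=2.0, sigma0_sq=1.0)
print(mu_n, Omega_n, nu_n, sigma_n_sq)
```

Here the sample mean is 2 and the prior mean is 0. With $X^\top X = 2$ and $\Omega_0 = 2$ the two sources carry equal weight, so the posterior mean lands halfway between them, at 1.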
Basis Expansions
The Main Idea
Linear regression is linear in the coefficients, not necessarily linear in the original predictor.
We can create new columns of $X$ from transformations of $x$ and still use the same regression machinery.
This is called a basis expansion.
Polynomial Regression
Step 1: Choose Basis Functions
For one predictor $x$, a polynomial of degree $p$ uses
$$
1,\ x,\ x^2,\ \ldots,\ x^p.
$$
The regression function is
$$
f(x) = \beta_0 + \beta_1 x + \beta_2 x^2 + \cdots + \beta_p x^p.
$$
Step 2: Put the Basis in the Design Matrix
Let $X_P$ contain columns
$$
\left(1,\ x,\ x^2,\ \ldots,\ x^p\right).
$$
Then the model is still
$$
y = X_P \beta + \varepsilon.
$$
Since the model is linear in $\beta$, the posterior formulas from linear regression still apply.
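A polynomial design matrix can be built with `numpy.vander`. The sketch below is one way to do it; the helper name `polynomial_design` and the noise-free cubic used as data are assumptions made for illustration.

```python
import numpy as np

def polynomial_design(x, degree):
    """Design matrix with columns 1, x, x^2, ..., x^degree."""
    return np.vander(x, N=degree + 1, increasing=True)

x = np.linspace(-1.0, 1.0, 25)
y = 1.0 - 2.0 * x + 0.5 * x**3           # noise-free cubic, for checking only
X_poly = polynomial_design(x, degree=3)

# The same least-squares machinery applies because the model is linear in beta
beta_hat, *_ = np.linalg.lstsq(X_poly, y, rcond=None)
print(np.round(beta_hat, 6))
```

Because the data come exactly from a cubic, the fitted coefficients recover the intercept, the linear term, a zero quadratic term, and the cubic term.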
What Can Go Wrong?
High-degree polynomials are global. Changing one coefficient affects the fitted curve everywhere.
This can make polynomial regression unstable at the edges of the data and overly sensitive to the chosen degree. Splines address this by using more local basis functions.
Spline Regression
What Is It For?
Splines allow flexible curves without making every part of the curve depend equally on every coefficient.
They do this by adding basis functions that change behavior at selected points called knots.
Truncated Power Basis
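Let the knots be
$$
\kappa_1 < \kappa_2 < \cdots < \kappa_m.
$$
For a polynomial order $p$, define the truncated basis function
$$
(x - \kappa_j)_+^p =
\begin{cases}
(x - \kappa_j)^p, & x > \kappa_j, \\
0, & \text{otherwise.}
\end{cases}
$$
Then a spline regression can be written as
$$
y = X_p \beta_p + X_b \beta_b + \varepsilon,
$$
where $X_p$ contains ordinary polynomial terms and $X_b$ the knot-based columns.
Again, this is still a linear regression in the coefficient vector $\beta = (\beta_p^\top, \beta_b^\top)^\top$.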
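A truncated power basis is easy to construct directly. The following sketch stacks the ordinary polynomial columns and one truncated column per knot; the function name, the knot locations, and the default degree are illustrative assumptions, and the code assumes `degree >= 1`.

```python
import numpy as np

def truncated_power_design(x, knots, degree=1):
    """Columns 1, x, ..., x^degree followed by (x - knot)_+^degree for each knot."""
    poly = np.vander(x, N=degree + 1, increasing=True)
    # (x - knot)_+ is zero left of the knot and x - knot to the right of it
    diffs = np.clip(x[:, None] - np.asarray(knots, dtype=float)[None, :], 0.0, None)
    return np.hstack([poly, diffs ** degree])

x = np.array([0.0, 1.0, 2.0])
X_spline = truncated_power_design(x, knots=[1.0], degree=1)
print(X_spline)
```

Each knot column is zero until its knot is reached, so the corresponding coefficient only changes the fitted curve to the right of that knot.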
Why Regularization Is Needed
The Problem
Flexible bases can fit many shapes. If we add many polynomial terms or many knots, the model can follow noise rather than signal.
In frequentist language, this is overfitting. In Bayesian language, it means the likelihood alone allows too many rough or unstable functions.
The Bayesian Response
Add a prior that expresses the belief that large unnecessary coefficients are unlikely.
For spline coefficients, a common shrinkage prior is
$$
\beta_j \mid \sigma^2 \overset{\text{iid}}{\sim} N\!\left(0,\ \frac{\sigma^2}{\lambda}\right).
$$
The parameter $\lambda$ is the shrinkage strength:
- small $\lambda$ gives a weak penalty and allows a flexible fit;
- large $\lambda$ gives a strong penalty and produces a smoother fit.
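The effect of the shrinkage strength can be seen on a spline basis with many knots. The sketch below assumes the prior $\beta_j \mid \sigma^2 \sim N(0, \sigma^2/\lambda)$ with $\sigma^2$ known, for which the posterior mean is $(X^\top X + \lambda I)^{-1} X^\top y$. For simplicity it shrinks every coefficient, including the intercept and slope. The synthetic data, the knot grid, and the crude roughness measure are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0.0, 1.0, size=80))
y = np.sin(2.0 * np.pi * x) + rng.normal(scale=0.3, size=x.size)  # synthetic data

knots = np.linspace(0.05, 0.95, 20)
# Linear truncated power basis: intercept, x, and one (x - knot)_+ column per knot
X = np.column_stack([np.ones_like(x), x, np.clip(x[:, None] - knots, 0.0, None)])

def shrunken_coefficients(X, y, lam):
    """Posterior mean under beta_j ~ N(0, sigma2 / lam) with sigma2 known."""
    k = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(k), X.T @ y)

lambdas = [1e-4, 1e-2, 1.0, 100.0]
norms = []
for lam in lambdas:
    beta = shrunken_coefficients(X, y, lam)
    fitted = X @ beta
    roughness = np.sum(np.diff(fitted, n=2) ** 2)  # crude measure of wiggliness
    norms.append(np.linalg.norm(beta))
    print(f"lambda={lam:g}  ||beta||={norms[-1]:.3f}  roughness={roughness:.4f}")
```

The coefficient norm falls as $\lambda$ grows, which is the sense in which a larger $\lambda$ gives a stronger penalty.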
Ridge Regression as a Bayesian Posterior Mode
The Prior
Consider the normal shrinkage prior
$$
\beta_j \mid \sigma^2 \overset{\text{iid}}{\sim} N\!\left(0,\ \frac{\sigma^2}{\lambda}\right), \qquad j = 1, \ldots, k.
$$
This corresponds to a prior precision matrix
$$
\Omega_0 = \lambda I.
$$
The Posterior Mean and Mode
With $\mu_0 = 0$ and known $\sigma^2$, the posterior mean is
$$
\tilde\beta = (X^\top X + \lambda I)^{-1} X^\top y.
$$
For a normal posterior, the posterior mean and posterior mode are the same. This is the ridge regression estimator.
How to Read the Formula
Compare ridge with OLS:
$$
\hat\beta = (X^\top X)^{-1} X^\top y.
$$
Ridge adds $\lambda I$ before inversion:
$$
\tilde\beta = (X^\top X + \lambda I)^{-1} X^\top y.
$$
This makes the fitted coefficients smaller and more stable, especially when predictors are correlated or when there are many predictors.
If the columns are orthogonal and scaled so that
$$
X^\top X = I,
$$
then
$$
\tilde\beta = \frac{1}{1 + \lambda}\,\hat\beta.
$$
Every coefficient is shrunk toward zero by the same factor.
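The uniform shrinkage factor can be checked numerically. This sketch builds a design with orthonormal columns from a QR decomposition, so that $X^\top X = I$, and compares the ridge estimate with the rescaled OLS estimate. The function name, the random data, and the value of $\lambda$ are illustrative assumptions.

```python
import numpy as np

def ridge_estimate(X, y, lam):
    """Ridge estimator (X'X + lam * I)^{-1} X'y."""
    k = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(k), X.T @ y)

rng = np.random.default_rng(2)
# The Q factor of a QR decomposition has orthonormal columns, so Q'Q = I
Q, _ = np.linalg.qr(rng.normal(size=(30, 4)))
y = rng.normal(size=30)

beta_ols = Q.T @ y                 # OLS simplifies to X'y when X'X = I
lam = 3.0
beta_ridge = ridge_estimate(Q, y, lam)

# Every coefficient should equal the OLS value divided by (1 + lam)
print(np.allclose(beta_ridge, beta_ols / (1.0 + lam)))
```

With correlated columns the shrinkage is no longer uniform: directions that the data determine poorly are shrunk more than well-determined ones.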
Lasso and the Laplace Prior
What Changes?
Ridge shrinks coefficients, but it usually does not set them exactly to zero.
Lasso uses a different penalty that can produce sparse estimates.
Bayesian Interpretation
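The lasso posterior mode corresponds to an independent Laplace prior on coefficients:
$$
\beta_j \mid \sigma^2 \overset{\text{iid}}{\sim} \text{Laplace}\!\left(0,\ \frac{\sigma^2}{\lambda}\right), \qquad p(\beta_j \mid \sigma^2) \propto \exp\!\left(-\frac{\lambda\,|\beta_j|}{\sigma^2}\right).
$$
The Laplace density has a sharper peak at zero than the normal density. This makes exact zeros possible for the posterior mode.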
Ridge Versus Lasso
| Prior | Penalized estimator | Typical behavior |
|---|---|---|
| Normal prior | Ridge | Shrinks coefficients toward zero |
| Laplace prior | Lasso | Shrinks and can set coefficients to zero |
The important distinction is about the mode. The full Bayesian posterior under a Laplace prior still represents uncertainty about coefficients; the exact sparsity is a property of the posterior mode.
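The difference between the two modes is easiest to see in the special case $X^\top X = I$. Under that assumption, minimizing $\tfrac{1}{2}\lVert y - X\beta \rVert^2 + \lambda \sum_j |\beta_j|$ gives the soft-thresholding rule $\operatorname{sign}(\hat\beta_j)\max(|\hat\beta_j| - \lambda, 0)$, while ridge gives $\hat\beta_j / (1 + \lambda)$. The sketch below applies both rules to a made-up vector of OLS estimates; the function names and numbers are illustrative assumptions.

```python
import numpy as np

def soft_threshold(beta_ols, lam):
    """Lasso mode when X'X = I: move each coefficient toward 0 by lam, stopping at 0."""
    return np.sign(beta_ols) * np.maximum(np.abs(beta_ols) - lam, 0.0)

def ridge_shrink(beta_ols, lam):
    """Ridge mode when X'X = I: scale each coefficient by 1 / (1 + lam)."""
    return beta_ols / (1.0 + lam)

beta_ols = np.array([3.0, -0.4, 0.2, -2.0])  # hypothetical OLS estimates
lam = 0.5

lasso_mode = soft_threshold(beta_ols, lam)   # entries with |beta| <= lam become 0
ridge_mode = ridge_shrink(beta_ols, lam)     # every entry shrinks, none becomes 0

print("lasso:", lasso_mode)
print("ridge:", ridge_mode)
print("exact zeros:", np.sum(lasso_mode == 0), "vs", np.sum(ridge_mode == 0))
```

The two small coefficients are set exactly to zero by the lasso rule, while the ridge rule leaves all four nonzero.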
Estimating the Shrinkage Parameter
Why Not Always Fix $\lambda$?
The shrinkage parameter controls the bias-variance tradeoff. Choosing it by hand can be difficult.
A Bayesian model can put a prior on $\lambda$ and learn about it from the data.
A Hierarchical Model
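One possible hierarchy is
$$
\beta \mid \sigma^2, \lambda \sim N\!\left(0,\ \sigma^2 \lambda^{-1} I\right)
$$
and
$$
\lambda \sim \text{Inv-}\chi^2\!\left(\eta_0,\ \lambda_0\right).
$$
The data then inform $\lambda$ through the estimated size and variability of the coefficients.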
What Computation Is Needed?
With $\lambda$ unknown, the posterior for
$$
(\beta,\ \sigma^2,\ \lambda)
$$
usually needs simulation. Gibbs sampling is a natural approach when the conditional distributions have convenient forms.
The conceptual point is simple: regularization can be learned as part of the model rather than selected outside it.
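A Gibbs sampler for this kind of hierarchy can be sketched as follows. To keep every full conditional in a standard family, the code assumes a Gamma prior on $\lambda$ (shape `a`, rate `b`), which may differ from the prior used in the lecture. It also assumes $\beta \mid \sigma^2, \lambda \sim N(0, \sigma^2 \lambda^{-1} I)$ and $\sigma^2 \sim \text{Inv-}\chi^2(\nu_0, \sigma_0^2)$, and it shrinks all coefficients alike. The function name, the hyperparameter defaults, and the synthetic data are illustrative assumptions.

```python
import numpy as np

def gibbs_ridge(X, y, n_iter, rng, nu0=1.0, s0_sq=1.0, a=1.0, b=1.0):
    """Gibbs sampler for (beta, sigma2, lam); Gamma(a, rate=b) prior on lam is assumed."""
    n, k = X.shape
    XtX, Xty = X.T @ X, X.T @ y
    sigma2, lam = 1.0, 1.0  # arbitrary starting values
    out = {"beta": np.empty((n_iter, k)),
           "sigma2": np.empty(n_iter),
           "lam": np.empty(n_iter)}
    for t in range(n_iter):
        # beta | sigma2, lam, y ~ N(mu_n, sigma2 * Omega_n^{-1})
        Omega_n = XtX + lam * np.eye(k)
        mu_n = np.linalg.solve(Omega_n, Xty)
        L = np.linalg.cholesky(Omega_n)  # Omega_n = L @ L.T
        beta = mu_n + np.sqrt(sigma2) * np.linalg.solve(L.T, rng.standard_normal(k))
        # sigma2 | beta, lam, y ~ scaled Inv-chi2 with nu0 + n + k degrees of freedom
        resid = y - X @ beta
        scale_sum = nu0 * s0_sq + resid @ resid + lam * (beta @ beta)
        sigma2 = scale_sum / rng.chisquare(nu0 + n + k)
        # lam | beta, sigma2 ~ Gamma(a + k/2, rate = b + beta'beta / (2 * sigma2))
        rate = b + (beta @ beta) / (2.0 * sigma2)
        lam = rng.gamma(a + 0.5 * k, 1.0 / rate)  # NumPy uses scale = 1 / rate
        out["beta"][t], out["sigma2"][t], out["lam"][t] = beta, sigma2, lam
    return out

rng = np.random.default_rng(3)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta_true = np.array([1.0, -2.0, 0.5])          # hypothetical values
y = X @ beta_true + rng.normal(scale=0.5, size=n)

draws = gibbs_ridge(X, y, n_iter=2000, rng=rng)
burn_in = 500
print(draws["beta"][burn_in:].mean(axis=0))
print(draws["lam"][burn_in:].mean())
```

Each sweep updates one block given the others, so the draws of $\lambda$ reflect uncertainty about the shrinkage strength instead of a single hand-picked value.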
Extensions
The same framework can be extended in several directions:
- Treat knot locations as unknown parameters.
- Model nonconstant error variance with another regression or spline.
- Use Student-$t$ or mixture errors for robustness.
- Add autocorrelation when residuals are time-dependent.
Each extension changes the model, but the basic Bayesian workflow remains the same: write the likelihood, choose priors, compute or simulate the posterior, and interpret predictions or fitted functions.
Chapter Summary
Bayesian linear regression combines a normal likelihood with priors for coefficients and residual variance. With the standard noninformative prior, the posterior is centered on the OLS estimate. With a conjugate prior, the posterior mean is a compromise between the data estimate and the prior mean.
Polynomial and spline regressions are basis expansions: they remain linear in the coefficients. Regularization enters through priors that shrink coefficients toward zero. Normal priors connect to ridge regression, while Laplace priors connect to lasso. A hierarchical Bayesian model can also estimate the shrinkage strength from the data.
Check Your Understanding
- What does the matrix $(X^\top X)^{-1}$ represent in the posterior for linear regression?
- How does the conjugate posterior mean combine $\hat\beta$ and $\mu_0$?
- Why can high-degree polynomial regression behave poorly near the edges of the data?
- What does a larger value of $\lambda$ do in a normal shrinkage prior?
- Why does lasso produce sparse posterior modes while ridge usually does not?