Classification and Large-Sample Approximations
Source: Lecture 6 — Classification. Large sample approximations (BDA Ch. 16).
The Question of This Chapter
So far, many examples have used continuous outcomes. Classification changes the target.
Instead of predicting a number, we predict a class label:
$$y \in \{1, 2, \ldots, C\}.$$
This chapter also introduces a practical computational idea. Some useful Bayesian models do not have closed-form posteriors. When the sample size is large, the posterior can often be approximated by a normal distribution near its mode.
The storyline is:
- Turn classification into posterior class probabilities.
- Build binary logistic and probit regression models.
- Extend logistic regression to multiple classes.
- Approximate nonstandard posteriors by a normal distribution.
- Improve approximations by reparametrizing constrained parameters.
Bayesian Classification
What Is It For?
In classification, the goal is to assign a label $y \in \{1, \ldots, C\}$ to an observation with covariates $x$.
Bayesian classification starts by computing class probabilities:
$$p(y = c \mid x), \qquad c = 1, \ldots, C.$$
The simplest classifier chooses the class with largest posterior probability:
$$\hat{y} = \arg\max_{c} \; p(y = c \mid x).$$
This rule minimizes posterior expected loss when every wrong classification carries the same loss (0-1 loss).
What If Mistakes Have Different Costs?
If different mistakes have different costs, the largest-probability class may not be the best action.
For example, falsely missing a disease can be more costly than falsely flagging a healthy patient. In that case, combine class probabilities with a loss table and choose the action with smallest posterior expected loss.
Classification is therefore another decision problem.
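The disease example can be sketched as a small decision problem. The loss values and class probabilities below are made up for illustration, and the names `LOSS`, `expected_loss`, `bayes_action`, and `map_class` are hypothetical helpers, not part of the lecture.

```python
# Hypothetical loss table: LOSS[action][true_class].
# Classes: 0 = healthy, 1 = diseased. Values are illustrative only.
LOSS = {
    0: {0: 0.0, 1: 10.0},  # predict healthy: missing a disease costs 10
    1: {0: 1.0, 1: 0.0},   # predict diseased: a false alarm costs 1
}


def expected_loss(action, class_probs):
    """Posterior expected loss of an action given p(y = c | x)."""
    return sum(LOSS[action][c] * p for c, p in class_probs.items())


def bayes_action(class_probs):
    """Action with the smallest posterior expected loss."""
    return min(LOSS, key=lambda a: expected_loss(a, class_probs))


def map_class(class_probs):
    """Class with the largest posterior probability."""
    return max(class_probs, key=class_probs.get)


probs = {0: 0.8, 1: 0.2}    # assumed posterior class probabilities
print(map_class(probs))     # most probable class
print(bayes_action(probs))  # action minimizing expected loss
```

With these assumed numbers the most probable class is "healthy", yet flagging the patient has the lower expected loss (0.8 against 2.0), so the two rules disagree.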
Two Ways to Model Class Probabilities
Discriminative Models
A discriminative model directly models
$$p(y = c \mid x).$$
Logistic regression is the main example in this chapter.
The model focuses on the boundary between classes. It does not need to model the full distribution of covariates.
Generative Models
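A generative model describes how covariates are distributed within each class:
$$p(x \mid y = c).$$
It also assigns prior class probabilities:
$$p(y = c).$$
Bayes’ theorem then gives
$$p(y = c \mid x) = \frac{p(x \mid y = c) \, p(y = c)}{\sum_{k=1}^{C} p(x \mid y = k) \, p(y = k)}.$$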
Naive Bayes is a common example.
The contrast is:
- discriminative models learn class probabilities directly;
- generative models learn a data-generating story for each class and then apply Bayes’ theorem.
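A minimal sketch of the generative route, assuming a single covariate whose distribution within each class is normal. The class means, standard deviations, and priors are hypothetical values chosen for illustration, not estimates from data.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of N(mean, sd^2) at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def class_posterior(x, class_models, class_priors):
    """Bayes' theorem: p(y = c | x) is proportional to p(x | y = c) p(y = c)."""
    joint = {
        c: normal_pdf(x, mean, sd) * class_priors[c]
        for c, (mean, sd) in class_models.items()
    }
    total = sum(joint.values())  # normalizing constant p(x)
    return {c: v / total for c, v in joint.items()}


# Hypothetical class-conditional models (mean, sd) and prior class probabilities.
models = {0: (0.0, 1.0), 1: (2.0, 1.0)}
priors = {0: 0.5, 1: 0.5}
post = class_posterior(1.0, models, priors)
print(post)
```

A discriminative model would skip `normal_pdf` entirely and parametrize the returned probabilities directly.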
Binary Logistic Regression
What Is It For?
Binary logistic regression models an outcome
$$y_i \in \{0, 1\}, \qquad i = 1, \ldots, n.$$
The model connects predictors $x_i$ to the probability that $y_i = 1$.
Step 1: Build a Linear Predictor
Start with the same linear predictor used in regression:
$$\eta_i = x_i^\top \beta.$$
The value $\eta_i$ can be any real number.
Step 2: Convert It to a Probability
Use the logistic function
$$\sigma(\eta) = \frac{1}{1 + e^{-\eta}}.$$
Then
$$\Pr(y_i = 1 \mid x_i, \beta) = \sigma(x_i^\top \beta) = \frac{\exp(x_i^\top \beta)}{1 + \exp(x_i^\top \beta)}.$$
This keeps the probability between 0 and 1.
Step 3: Write the Likelihood
For a Bernoulli outcome, the likelihood contribution for observation $i$ is
$$p(y_i \mid x_i, \beta) = \sigma(x_i^\top \beta)^{y_i} \, [1 - \sigma(x_i^\top \beta)]^{1 - y_i}.$$
Multiplying over observations gives
$$p(y \mid X, \beta) = \prod_{i=1}^{n} \sigma(x_i^\top \beta)^{y_i} \, [1 - \sigma(x_i^\top \beta)]^{1 - y_i}.$$
Step 4: Add a Prior
A common prior is a normal shrinkage prior:
$$\beta \sim N(0, \tau^2 I).$$
This prior keeps very large coefficients from being too plausible unless the data strongly support them.
What Makes the Posterior Nonstandard?
The posterior is proportional to likelihood times prior:
$$p(\beta \mid y, X) \propto p(y \mid X, \beta) \, p(\beta).$$
Unlike normal linear regression, this posterior is not conjugate. It does not simplify to a familiar closed-form distribution.
That is why logistic regression motivates approximation methods and simulation.
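The four modeling steps can be collected into one function that evaluates the unnormalized log posterior, which is the object that approximation and simulation methods work with. This is a sketch under assumptions: the prior is taken to be independent normal with standard deviation `tau`, and the data set is made up.

```python
import math


def log_sigmoid(eta):
    """Numerically stable log of the logistic function."""
    if eta >= 0:
        return -math.log1p(math.exp(-eta))
    return eta - math.log1p(math.exp(eta))


def log_posterior(beta, X, y, tau=10.0):
    """Unnormalized log posterior: Bernoulli log likelihood plus N(0, tau^2 I) log prior."""
    log_lik = 0.0
    for x_i, y_i in zip(X, y):
        eta = sum(b * x for b, x in zip(beta, x_i))  # linear predictor
        # log(1 - sigmoid(eta)) equals log_sigmoid(-eta)
        log_lik += y_i * log_sigmoid(eta) + (1 - y_i) * log_sigmoid(-eta)
    log_prior = -0.5 * sum(b * b for b in beta) / tau**2
    return log_lik + log_prior


# Tiny made-up data set: an intercept column and one predictor.
X = [(1.0, -2.0), (1.0, -1.0), (1.0, 0.5), (1.0, 1.5), (1.0, 3.0)]
y = [0, 0, 1, 0, 1]
print(log_posterior((0.0, 0.0), X, y))  # equals 5 * log(0.5) at beta = 0
```

The function returns a number for any coefficient vector, but there is no closed form for the normalizing constant, which is the practical meaning of "nonstandard" here.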
How to Interpret Logistic Coefficients
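The logistic model can also be written in odds form.
The odds of class 1 are
$$\frac{\Pr(y_i = 1 \mid x_i, \beta)}{\Pr(y_i = 0 \mid x_i, \beta)} = \exp(x_i^\top \beta).$$
Taking logs gives
$$\log \frac{\Pr(y_i = 1 \mid x_i, \beta)}{\Pr(y_i = 0 \mid x_i, \beta)} = x_i^\top \beta.$$
So a coefficient $\beta_j$ is the change in log-odds for a one-unit change in the corresponding predictor, holding other predictors fixed. Equivalently, a one-unit change multiplies the odds by $e^{\beta_j}$; for example, $\beta_j = 0.7$ roughly doubles the odds, since $e^{0.7} \approx 2.01$.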
Probit Regression
What Changes?
Probit regression uses a different link function. Instead of the logistic CDF, it uses the standard normal CDF:
$$\Pr(y_i = 1 \mid x_i, \beta) = \Phi(x_i^\top \beta).$$
Here $\Phi$ is the cumulative distribution function of a standard normal random variable.
Why Use It?
Logistic and probit regression often give similar fitted probabilities. Probit is important in Bayesian computation because it has a useful latent-variable representation.
That representation supports data augmentation methods, which are introduced later with simulation algorithms.
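The latent-variable representation says that the probit probability equals the chance that a latent normal variable with mean equal to the linear predictor and unit variance is positive. The sketch below checks this by simulation for one hypothetical value of the linear predictor; the function names are illustrative only.

```python
import math
import random


def std_normal_cdf(z):
    """Standard normal CDF, written with the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def probit_prob(eta):
    """Probit model: Pr(y = 1) = Phi(eta)."""
    return std_normal_cdf(eta)


def latent_prob(eta, n_draws=100_000, seed=1):
    """Monte Carlo estimate of Pr(z > 0) for a latent z ~ N(eta, 1)."""
    rng = random.Random(seed)
    hits = sum(rng.gauss(eta, 1.0) > 0.0 for _ in range(n_draws))
    return hits / n_draws


eta = 0.8  # hypothetical value of the linear predictor
print(probit_prob(eta), latent_prob(eta))
```

The two printed numbers should agree up to Monte Carlo error.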
Multi-Class Logistic Regression
The Setup
Now suppose
$$y_i \in \{1, 2, \ldots, C\}.$$
Each class gets its own coefficient vector $\beta_c$.
The Probability Formula
The multi-class logistic model is
$$\Pr(y_i = c \mid x_i) = \frac{\exp(x_i^\top \beta_c)}{\sum_{k=1}^{C} \exp(x_i^\top \beta_k)}.$$
The denominator makes the class probabilities add to 1.
Identifiability
Adding the same value to every class score does not change the probabilities. To make the parameters identifiable, choose a baseline class and set its coefficients to zero, for example
$$\beta_1 = 0.$$
The other coefficients are then interpreted relative to that baseline class.
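The invariance behind the identifiability problem is easy to verify numerically. In this sketch the class scores are hypothetical numbers, and the first class is arbitrarily used as the baseline.

```python
import math


def softmax(scores):
    """Convert class scores into probabilities that sum to 1."""
    m = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


scores = [1.0, 2.5, -0.3]                   # hypothetical class scores
shifted = [s + 7.0 for s in scores]         # same constant added to every class
baseline = [s - scores[0] for s in scores]  # first class score fixed at 0

p = softmax(scores)
print(p)
print(softmax(shifted))
print(softmax(baseline))
```

All three calls print the same probabilities, so the data cannot distinguish between these score vectors without a constraint such as a zero baseline.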
Large-Sample Normal Approximation
What Problem Does It Solve?
In models like logistic regression, the posterior is often nonstandard.
One approach is MCMC. Another is to approximate the posterior by a normal distribution near its mode. This is especially useful when:
- the sample size is large;
- the posterior is unimodal;
- the posterior mass is concentrated near the mode.
Step 1: Find the Posterior Mode
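Let $\tilde{\theta}$ be the posterior mode:
$$\tilde{\theta} = \arg\max_{\theta} \; p(\theta \mid y).$$
Equivalently, it maximizes the log posterior.
Step 2: Approximate the Log Posterior Locally
Use a second-order Taylor expansion of $\log p(\theta \mid y)$ around $\tilde{\theta}$.
At the mode, the first derivative is zero. Keeping terms up to second order gives
$$\log p(\theta \mid y) \approx \log p(\tilde{\theta} \mid y) - \frac{1}{2} (\theta - \tilde{\theta})^\top J_y(\tilde{\theta}) (\theta - \tilde{\theta}).$$
The matrix
$$J_y(\tilde{\theta}) = - \left. \frac{\partial^2 \log p(\theta \mid y)}{\partial \theta \, \partial \theta^\top} \right|_{\theta = \tilde{\theta}}$$
is the observed information matrix at the posterior mode.
Step 3: Recognize the Normal Kernel
Exponentiating the quadratic approximation gives the kernel of a normal distribution:
$$p(\theta \mid y) \approx p(\tilde{\theta} \mid y) \exp\Big( -\frac{1}{2} (\theta - \tilde{\theta})^\top J_y(\tilde{\theta}) (\theta - \tilde{\theta}) \Big).$$
Therefore,
$$\theta \mid y \overset{\text{approx}}{\sim} N\big( \tilde{\theta}, \; J_y^{-1}(\tilde{\theta}) \big).$$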
The inverse information matrix is the approximate posterior covariance matrix.
How to Use the Approximation
The normal approximation gives a practical workflow:
- Write the log posterior up to an additive constant.
- Numerically maximize it to find the posterior mode $\tilde{\theta}$.
- Compute the negative Hessian at the mode.
- Invert the observed information matrix to get an approximate covariance matrix.
- Use the resulting normal distribution for summaries or simulation.
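In R, optim returns the optimum and, when called with hessian = TRUE, a numerically differentiated Hessian at that point. Because optim minimizes by default, a convenient choice is to pass the negative log posterior, in which case the returned Hessian is the observed information itself.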
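The workflow above can be sketched in Python for a logistic regression with a single coefficient and no intercept. This is an illustration under assumptions, not the lecture's implementation: it swaps a general-purpose optimizer for a hand-written Newton iteration with analytic derivatives, the prior is assumed to be normal with standard deviation `tau`, and the data are made up.

```python
import math


def sigmoid(eta):
    return 1.0 / (1.0 + math.exp(-eta))


def grad_and_info(beta, x, y, tau):
    """Gradient of the log posterior and observed information (negative second derivative)."""
    grad = -beta / tau**2   # contribution of the N(0, tau^2) prior
    info = 1.0 / tau**2
    for x_i, y_i in zip(x, y):
        p = sigmoid(beta * x_i)
        grad += (y_i - p) * x_i
        info += p * (1.0 - p) * x_i**2
    return grad, info


def laplace_approximation(x, y, tau=10.0, tol=1e-10, max_iter=100):
    """Newton's method for the mode, then variance = 1 / observed information."""
    beta = 0.0
    for _ in range(max_iter):
        grad, info = grad_and_info(beta, x, y, tau)
        step = grad / info
        beta += step
        if abs(step) < tol:
            break
    _, info = grad_and_info(beta, x, y, tau)
    return beta, 1.0 / info


# Made-up data: one predictor, no intercept.
x = [-2.0, -1.0, 0.5, 1.5, 3.0, -0.5, 1.0, 2.0]
y = [0, 0, 1, 0, 1, 0, 1, 1]
mode, var = laplace_approximation(x, y)
print(mode, var)  # center and variance of the approximating normal
```

With several coefficients the same steps apply, with the scalar `info` replaced by the observed information matrix and the division replaced by a linear solve.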
Example: Gamma Posterior
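The Exact Posterior
For a Poisson model with a $\mathrm{Gamma}(\alpha, \beta)$ prior and observations $y_1, \ldots, y_n$, the posterior can be written as
$$\theta \mid y \sim \mathrm{Gamma}\Big( \alpha + \sum_{i=1}^{n} y_i, \; \beta + n \Big),$$
where the second parameter is the rate.
Let
$$\alpha_n = \alpha + \sum_{i=1}^{n} y_i, \qquad \beta_n = \beta + n.$$
Then
$$\log p(\theta \mid y) = (\alpha_n - 1) \log \theta - \beta_n \theta + \text{const}.$$
The Mode and Information
For $\alpha_n > 1$, the posterior mode is
$$\tilde{\theta} = \frac{\alpha_n - 1}{\beta_n}.$$
The observed information at the mode is
$$J_y(\tilde{\theta}) = \frac{\alpha_n - 1}{\tilde{\theta}^2} = \frac{\beta_n^2}{\alpha_n - 1}.$$
The Normal Approximation
The normal approximation is therefore
$$\theta \mid y \overset{\text{approx}}{\sim} N\Big( \frac{\alpha_n - 1}{\beta_n}, \; \frac{\alpha_n - 1}{\beta_n^2} \Big).$$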
This approximation improves as the posterior becomes more concentrated and less skewed.
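For a Gamma density with shape greater than 1, the mode is (shape - 1) / rate and the negative second derivative of the log density at the mode is rate^2 / (shape - 1). The sketch below computes both and compares the exact and approximate log densities; the shape and rate values are hypothetical.

```python
import math


def gamma_logpdf(theta, shape, rate):
    """Exact log density of Gamma(shape, rate) at theta > 0."""
    return (shape * math.log(rate) - math.lgamma(shape)
            + (shape - 1.0) * math.log(theta) - rate * theta)


def normal_logpdf(theta, mean, var):
    """Log density of N(mean, var) at theta."""
    return -0.5 * math.log(2.0 * math.pi * var) - 0.5 * (theta - mean) ** 2 / var


def gamma_normal_approx(shape, rate):
    """Mode and inverse observed information of Gamma(shape, rate), for shape > 1."""
    mode = (shape - 1.0) / rate
    info = rate**2 / (shape - 1.0)  # negative second derivative of the log density at the mode
    return mode, 1.0 / info


# Hypothetical posterior parameters, for illustration only.
shape, rate = 21.0, 5.0
mode, var = gamma_normal_approx(shape, rate)
for theta in (2.5, 4.0, 6.0):
    print(theta, gamma_logpdf(theta, shape, rate), normal_logpdf(theta, mode, var))
```

The two log densities are closest near the mode and drift apart in the tails, where the skewness of the Gamma density shows.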
Reparametrization
Why Reparametrize?
A normal approximation on the original parameter scale can be poor when the parameter is constrained.
For example:
- a variance must be positive;
- a probability must lie between 0 and 1.
A normal approximation can put probability outside the allowed range.
Common Transformations
For a positive parameter $\theta > 0$, use
$$\phi = \log \theta.$$
For a probability parameter $\theta \in (0, 1)$, use
$$\phi = \log \frac{\theta}{1 - \theta}.$$
Approximate the posterior on the transformed scale, then transform back.
Remember the Jacobian
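When transforming a density, include the Jacobian.
If
$$\phi = \log \theta,$$
then
$$\theta = e^{\phi}$$
and
$$p_{\phi}(\phi \mid y) = p_{\theta}(e^{\phi} \mid y) \, e^{\phi}.$$
The extra factor $e^{\phi} = |d\theta / d\phi|$ is the Jacobian.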
Gamma Example on the Log Scale
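For the Gamma posterior
$$\theta \mid y \sim \mathrm{Gamma}(\alpha_n, \beta_n),$$
use
$$\phi = \log \theta.$$
Including the Jacobian, $\log p(\phi \mid y) = \alpha_n \phi - \beta_n e^{\phi} + \text{const}$, which has mode $\log(\alpha_n / \beta_n)$ and observed information $\alpha_n$. A normal approximation on the $\phi$ scale gives
$$\phi \mid y \overset{\text{approx}}{\sim} N\Big( \log \frac{\alpha_n}{\beta_n}, \; \frac{1}{\alpha_n} \Big).$$
Transforming back gives a log-normal approximation:
$$\theta \mid y \overset{\text{approx}}{\sim} \mathrm{LogNormal}\Big( \log \frac{\alpha_n}{\beta_n}, \; \frac{1}{\alpha_n} \Big).$$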
This often respects the shape of the exact Gamma posterior better than a normal approximation directly on .
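A short numerical sketch of the log-scale approximation for a Gamma(shape, rate) density. On the log scale the log density is shape * phi - rate * exp(phi) plus a constant, so the mode is log(shape / rate) and the observed information equals the shape. The parameter values are hypothetical and deliberately chosen to give a skewed posterior.

```python
import math


def log_scale_approx(shape, rate):
    """Normal approximation for phi = log(theta) when theta ~ Gamma(shape, rate)."""
    mode = math.log(shape / rate)
    var = 1.0 / shape  # inverse of the observed information on the phi scale
    return mode, var


def lognormal_logpdf(theta, mu, var):
    """Log density of theta when log(theta) ~ N(mu, var); the -log(theta) term is the Jacobian."""
    z = math.log(theta) - mu
    return -math.log(theta) - 0.5 * math.log(2.0 * math.pi * var) - 0.5 * z * z / var


def normal_cdf(x, mean, var):
    return 0.5 * (1.0 + math.erf((x - mean) / math.sqrt(2.0 * var)))


shape, rate = 3.0, 1.0  # hypothetical, deliberately skewed posterior
mu, var = log_scale_approx(shape, rate)

# Mass that a normal approximation on the original scale, centered at the
# mode (shape - 1) / rate with variance (shape - 1) / rate^2, puts below zero.
mass_below_zero = normal_cdf(0.0, (shape - 1.0) / rate, (shape - 1.0) / rate**2)
print(mu, var, mass_below_zero)
```

The log-normal approximation assigns no probability to negative values by construction, whereas `mass_below_zero` shows how much the direct normal approximation misplaces in this example.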
Heavy-Tailed Approximation
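The normal approximation can be too confident if the posterior has heavier tails than a normal distribution.
A simple robust alternative is a Student-$t$ approximation centered at the posterior mode $\tilde{\theta}$, with the inverse observed information $J_y^{-1}(\tilde{\theta})$ as scale matrix:
$$\theta \mid y \overset{\text{approx}}{\sim} t_{\nu}\big( \tilde{\theta}, \; J_y^{-1}(\tilde{\theta}) \big).$$
The degrees of freedom $\nu$ control tail thickness. Smaller $\nu$ gives heavier tails.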
Derived Quantities
The Problem
Even if $\theta \mid y$ is approximately normal, a function
$$g(\theta)$$
may not be approximately normal.
For example, a ratio, probability, or inequality measure can have a skewed distribution even when the underlying parameters look close to normal.
The Simulation Solution
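Use the normal approximation as a simulation distribution for a derived quantity $g(\theta)$:
- Draw $\theta^{(1)}, \ldots, \theta^{(S)}$ from the normal approximation to the posterior.
- Compute $g(\theta^{(s)})$ for each draw $s = 1, \ldots, S$.
- Summarize the values $g(\theta^{(1)}), \ldots, g(\theta^{(S)})$.
This avoids forcing a normal approximation onto $g(\theta)$ directly.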
Example: Gini Coefficient
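Suppose incomes are modeled as
$$x_1, \ldots, x_n \mid \mu, \sigma^2 \overset{\text{iid}}{\sim} \mathrm{LogNormal}(\mu, \sigma^2).$$
If approximate posterior draws of $\sigma^2$ are available, compute the Gini coefficient for each draw:
$$G = 2 \, \Phi\big( \sigma / \sqrt{2} \big) - 1,$$
where $\Phi$ is the standard normal CDF.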
The simulated values give an approximate posterior distribution for the Gini coefficient.
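The draw, compute, summarize recipe can be sketched for the Gini coefficient. This sketch assumes a log-normal income model, for which the Gini coefficient is 2 * Phi(sigma / sqrt(2)) - 1, and it assumes a normal approximation on the log(sigma^2) scale whose center and variance are hypothetical numbers rather than fitted values.

```python
import math
import random
import statistics


def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def gini_lognormal(sigma2):
    """Gini coefficient of a log-normal distribution with log-scale variance sigma2."""
    return 2.0 * normal_cdf(math.sqrt(sigma2 / 2.0)) - 1.0


def gini_draws(phi_mode, phi_var, n_draws=10_000, seed=42):
    """Simulate phi = log(sigma2) from its normal approximation and transform each draw."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n_draws):
        phi = rng.gauss(phi_mode, math.sqrt(phi_var))  # step 1: draw
        draws.append(gini_lognormal(math.exp(phi)))    # step 2: compute g
    return draws


# Hypothetical approximation for log(sigma2); not fitted to real data.
draws = sorted(gini_draws(phi_mode=math.log(0.5), phi_var=0.02))
summary = {                                            # step 3: summarize
    "mean": statistics.fmean(draws),
    "q05": draws[int(0.05 * len(draws))],
    "q95": draws[int(0.95 * len(draws))],
}
print(summary)
```

Every simulated value lies between 0 and 1, which a normal approximation placed directly on the Gini coefficient would not guarantee.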
Chapter Summary
Bayesian classification uses posterior class probabilities. With equal misclassification costs, the chosen class is the one with largest posterior probability. Logistic regression models binary probabilities through the logistic link, while probit regression uses the normal CDF. Multi-class logistic regression extends the same idea by assigning scores to all classes and normalizing them.
Large-sample normal approximation replaces a difficult posterior with a normal distribution centered at the posterior mode. The covariance matrix is the inverse observed information. Reparametrization can improve the approximation for constrained parameters, and simulation from the approximation is often the easiest way to study derived quantities.
Check Your Understanding
- Why does the largest posterior class probability give the optimal classifier only under equal misclassification loss?
- What role does the logistic function play in binary logistic regression?
- Why is the logistic regression posterior not conjugate?
- What are the mode and covariance matrix in the large-sample normal approximation?
- Why can a log-scale approximation be better for a positive parameter?