Classification and Large-Sample Approximations

Source: Lecture 6 — Classification. Large sample approximations (BDA Ch. 16).

The Question of This Chapter

So far, many examples have used continuous outcomes. Classification changes the target.

Instead of predicting a number, we predict a class label:

c \in \{1,\ldots,C\}.

This chapter also introduces a practical computational idea. Some useful Bayesian models do not have closed-form posteriors. When the sample size is large, the posterior can often be approximated by a normal distribution near its mode.

The storyline is:

Turn classification into posterior class probabilities.
Build binary logistic and probit regression models.
Extend logistic regression to multiple classes.
Approximate nonstandard posteriors by a normal distribution.
Improve approximations by reparametrizing constrained parameters.

Bayesian Classification

What Is It For?

In classification, the goal is to assign a label to an observation with covariates $x$.

Bayesian classification starts by computing class probabilities:

p(c \mid x).

The simplest classifier chooses the class with largest posterior probability:

\hat c = \arg\max_{c\in\mathcal C} p(c\mid x).

This rule is optimal under 0-1 loss: every wrong classification costs the same and a correct classification costs nothing.

What If Mistakes Have Different Costs?

If different mistakes have different costs, the largest-probability class may not be the best action.

For example, falsely missing a disease can be more costly than falsely flagging a healthy patient. In that case, combine class probabilities with a loss table and choose the action with smallest posterior expected loss.

Classification is therefore another decision problem.
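A minimal sketch of the expected-loss rule, assuming a hypothetical two-class disease example. The class probabilities and the loss values are made up for illustration; only the rule itself (pick the action with smallest posterior expected loss) comes from the text.

```python
# Hypothetical posterior class probabilities p(c | x) for one patient.
posterior = {"healthy": 0.85, "disease": 0.15}

# Hypothetical loss table: loss[action][true_class].
# Missing a disease (action "healthy", truth "disease") costs far more
# than a false alarm (action "disease", truth "healthy").
loss = {
    "healthy": {"healthy": 0.0, "disease": 20.0},
    "disease": {"healthy": 1.0, "disease": 0.0},
}

def expected_loss(action):
    """Posterior expected loss of an action: sum over c of loss(action, c) * p(c | x)."""
    return sum(loss[action][c] * p for c, p in posterior.items())

map_class = max(posterior, key=posterior.get)   # class with largest probability
best_action = min(loss, key=expected_loss)      # action with smallest expected loss

print("largest-probability class:", map_class)
print("smallest-expected-loss action:", best_action)
```

With these numbers the two rules disagree: the most probable class is "healthy", but flagging the patient has the smaller expected loss.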

Two Ways to Model Class Probabilities

Discriminative Models

A discriminative model directly models

p(c \mid x).

Logistic regression is the main example in this chapter.

The model focuses on the boundary between classes. It does not need to model the full distribution of covariates.

Generative Models

A generative model describes how covariates are distributed within each class:

p(x\mid c).

It also assigns prior class probabilities:

p(c).

Bayes’ theorem then gives

p(c\mid x) \propto p(x\mid c)p(c).

Naive Bayes is a common example.

The contrast is:

discriminative models learn class probabilities directly;
generative models learn a data-generating story for each class and then apply Bayes’ theorem.
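The generative route can be sketched with one covariate and a normal distribution within each class. The class names, prior probabilities, and normal parameters below are hypothetical; the normalization step is Bayes' theorem as stated above.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of N(mean, sd^2) at x."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

# Hypothetical generative model: p(c) and the (mean, sd) of p(x | c).
class_prior = {"A": 0.7, "B": 0.3}
class_params = {"A": (0.0, 1.0), "B": (2.0, 1.0)}

def class_posterior(x):
    """p(c | x): normalize p(x | c) p(c) over the classes."""
    unnorm = {c: normal_pdf(x, *class_params[c]) * class_prior[c] for c in class_prior}
    total = sum(unnorm.values())
    return {c: u / total for c, u in unnorm.items()}

for x in (0.0, 1.5, 3.0):
    print(x, class_posterior(x))
```

A discriminative model would skip `class_params` entirely and parametrize `class_posterior` directly.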

Binary Logistic Regression

What Is It For?

Binary logistic regression models an outcome

y_i\in\{0,1\}.

The model connects predictors $x_i$ to the probability that $y_i=1$.

Step 1: Build a Linear Predictor

Start with the same linear predictor used in regression:

\eta_i=x_i'\beta.

The value $\eta_i$ can be any real number.

Step 2: Convert It to a Probability

Use the logistic function

\Lambda(z) = \frac{1}{1+\exp(-z)}.

Then

\Pr(y_i=1\mid x_i,\beta) = \Lambda(x_i'\beta) = \frac{\exp(x_i'\beta)}{1+\exp(x_i'\beta)}.

This keeps the probability between 0 and 1.
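A small sketch of the logistic function in Python. The branch on the sign of `z` is an implementation choice (not from the source) that avoids overflow in `exp` for linear predictors of large magnitude.

```python
import math

def logistic(z):
    """Lambda(z) = 1 / (1 + exp(-z)), evaluated without overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)          # safe because z < 0 implies exp(z) < 1
    return ez / (1.0 + ez)

for eta in (-800.0, -2.0, 0.0, 2.0, 800.0):
    print(eta, logistic(eta))
```

Every linear predictor, however extreme, maps to a value in the interval from 0 to 1.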

Step 3: Write the Likelihood

For a Bernoulli outcome, the likelihood contribution for observation $i$ is

\Lambda(x_i'\beta)^{y_i} \left[1-\Lambda(x_i'\beta)\right]^{1-y_i}.

Multiplying over observations gives

p(y\mid X,\beta) = \prod_{i=1}^{n} \Lambda(x_i'\beta)^{y_i} \left[1-\Lambda(x_i'\beta)\right]^{1-y_i}.

Step 4: Add a Prior

A common prior is a normal shrinkage prior:

\beta \sim N(0,\lambda^{-1}I).

Here $\lambda$ is the prior precision, so a larger $\lambda$ shrinks the coefficients more strongly toward zero. The prior makes very large coefficients implausible unless the data strongly support them.

What Makes the Posterior Nonstandard?

The posterior is proportional to likelihood times prior:

p(\beta\mid y,X) \propto p(y\mid X,\beta)p(\beta).

Unlike normal linear regression, the normal prior is not conjugate to the Bernoulli-logistic likelihood, so the posterior does not simplify to a familiar closed-form distribution.

That is why logistic regression motivates approximation methods and simulation.
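Although the posterior has no closed form, its logarithm is easy to evaluate up to an additive constant, which is all that optimization and simulation need. The sketch below assumes the normal shrinkage prior from Step 4; the toy data and the helper name `log1p_exp` are hypothetical.

```python
import math

def log1p_exp(z):
    """log(1 + exp(z)), computed stably for large |z|."""
    return z + math.log1p(math.exp(-z)) if z > 0 else math.log1p(math.exp(z))

def log_posterior(beta, X, y, lam):
    """Unnormalized log posterior: logistic log likelihood plus N(0, I / lam) log prior."""
    log_lik = 0.0
    for x_i, y_i in zip(X, y):
        eta = sum(b * x for b, x in zip(beta, x_i))
        # y*log(Lambda) + (1 - y)*log(1 - Lambda) simplifies to y*eta - log(1 + exp(eta)).
        log_lik += y_i * eta - log1p_exp(eta)
    log_prior = -0.5 * lam * sum(b * b for b in beta)   # constant term dropped
    return log_lik + log_prior

# Hypothetical toy data: an intercept column plus one predictor.
X = [(1.0, -1.0), (1.0, 0.0), (1.0, 1.0), (1.0, 2.0)]
y = [0, 0, 1, 1]

print(log_posterior((0.0, 0.0), X, y, lam=1.0))
print(log_posterior((0.0, 1.0), X, y, lam=1.0))
```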

How to Interpret Logistic Coefficients

The logistic model can also be written in odds form.

The odds of class 1 are

\frac{\Pr(y_i=1\mid x_i,\beta)} {\Pr(y_i=0\mid x_i,\beta)}.

Taking logs gives

\log \frac{\Pr(y_i=1\mid x_i,\beta)} {\Pr(y_i=0\mid x_i,\beta)} = x_i'\beta.

So a coefficient $\beta_j$ is the change in log-odds for a one-unit increase in the corresponding predictor, holding the other predictors fixed. Equivalently, a one-unit increase multiplies the odds by $\exp(\beta_j)$.
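A short numerical check of this interpretation, with hypothetical coefficient values. The odds ratio for a one-unit increase is the same at every starting point, while the change in probability is not.

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients: intercept and one slope.
beta0, beta1 = -1.0, 0.7

def odds(x):
    """Odds of class 1 at predictor value x."""
    p = logistic(beta0 + beta1 * x)
    return p / (1.0 - p)

# The odds ratio for a one-unit increase equals exp(beta1) at every x.
for x in (0.0, 1.0, 2.0):
    print(x, odds(x + 1.0) / odds(x), math.exp(beta1))

# The change in probability depends on the starting point.
change_low = logistic(beta0 + beta1 * 1.0) - logistic(beta0)
change_high = logistic(beta0 + beta1 * 5.0) - logistic(beta0 + beta1 * 4.0)
print(change_low, change_high)
```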

Probit Regression

What Changes?

Probit regression uses a different link function. Instead of the logistic CDF, it uses the standard normal CDF:

\Pr(y_i=1\mid x_i,\beta) = \Phi(x_i'\beta).

Here $\Phi$ is the cumulative distribution function of a standard normal random variable.

Why Use It?

Logistic and probit regression often give similar fitted probabilities. Probit is important in Bayesian computation because it has a useful latent-variable representation.

That representation supports data augmentation methods, which are introduced later with simulation algorithms.
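To see how close the two links are, the sketch below compares the probit link with a rescaled logistic link on a grid. The rescaling constant 1.702 is an assumption brought in for the comparison, not something stated in the lecture; the two models put their coefficients on different scales, so some rescaling is needed before the curves can be compared.

```python
import math
from statistics import NormalDist

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

probit = NormalDist().cdf     # standard normal CDF, Phi

SCALE = 1.702                 # assumed rescaling constant, not from the source
grid = [i / 100.0 for i in range(-400, 401)]
max_gap = max(abs(probit(z) - logistic(SCALE * z)) for z in grid)

print("largest gap between the two curves on the grid:", max_gap)
```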

Multi-Class Logistic Regression

The Setup

Now suppose

y_i\in\{1,\ldots,C\}.

Each class gets its own coefficient vector $\beta_c$.

The Probability Formula

The multi-class logistic model is

\Pr(y_i=c\mid x_i,\beta_1,\ldots,\beta_C) = \frac{\exp(x_i'\beta_c)} {\sum_{k=1}^{C}\exp(x_i'\beta_k)}.

The denominator makes the class probabilities add to 1.

Identifiability

Adding the same vector to every $\beta_c$ shifts every class score $x_i'\beta_c$ by the same amount, which does not change the probabilities. To make the parameters identifiable, choose a baseline class and set its coefficients to zero, for example

\beta_1=0.

The other coefficients are then interpreted relative to that baseline class.
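The probability formula and the identifiability point can be sketched together. The covariates and coefficient vectors are hypothetical; subtracting the maximum score before exponentiating is a standard numerical safeguard that relies on the same shift invariance.

```python
import math

def class_probs(x, betas):
    """Multi-class logistic probabilities for one observation x."""
    scores = [sum(b * xi for b, xi in zip(beta_c, x)) for beta_c in betas]
    m = max(scores)                       # shift scores for numerical stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]

x = (1.0, 0.5)                                    # hypothetical covariates with intercept
betas = [(0.2, -1.0), (1.0, 0.3), (-0.5, 2.0)]    # hypothetical beta_1, beta_2, beta_3

# Subtracting beta_1 from every class vector sets the baseline to zero
# without changing any probability.
baseline = betas[0]
betas_id = [tuple(b - b0 for b, b0 in zip(beta_c, baseline)) for beta_c in betas]

print(class_probs(x, betas))
print(class_probs(x, betas_id))
```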

Large-Sample Normal Approximation

What Problem Does It Solve?

In models like logistic regression, the posterior is often nonstandard.

One approach is MCMC. Another is to approximate the posterior by a normal distribution near its mode. This is especially useful when:

the sample size is large;
the posterior is unimodal;
the posterior mass is concentrated near the mode.

Step 1: Find the Posterior Mode

Let $\tilde\theta$ be the posterior mode:

\tilde\theta = \arg\max_\theta p(\theta\mid y).

Equivalently, it maximizes the log posterior.

Step 2: Approximate the Log Posterior Locally

Use a second-order Taylor expansion around $\tilde\theta$.

At the mode, the gradient of the log posterior is zero, so the first-order term vanishes. Keeping terms up to second order gives

\log p(\theta\mid y) \approx \log p(\tilde\theta\mid y) - \frac{1}{2} (\theta-\tilde\theta)' J_y(\tilde\theta) (\theta-\tilde\theta).

The matrix

J_y(\tilde\theta) = - \left. \frac{\partial^2\log p(\theta\mid y)} {\partial\theta\,\partial\theta'} \right|_{\theta=\tilde\theta}

is the observed information matrix at the posterior mode.

Step 3: Recognize the Normal Kernel

Exponentiating the quadratic approximation gives the kernel of a normal distribution:

p(\theta\mid y) \approx \text{constant} \times \exp\left\{ - \frac{1}{2} (\theta-\tilde\theta)' J_y(\tilde\theta) (\theta-\tilde\theta) \right\}.

Therefore,

\theta\mid y \stackrel{\mathrm{approx}}{\sim} N\left(\tilde\theta,J_y(\tilde\theta)^{-1}\right).

The inverse information matrix is the approximate posterior covariance matrix.

How to Use the Approximation

The normal approximation gives a practical workflow:

Write the log posterior up to an additive constant.
Numerically maximize it to find $\tilde\theta$.
Compute the negative Hessian at the mode.
Invert the observed information matrix to get an approximate covariance matrix.
Use the resulting normal distribution for summaries or simulation.

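In R, optim called with hessian = TRUE returns both the optimum and a numerically differentiated Hessian at that point. Because optim minimizes by default, pass it the negative log posterior; the returned Hessian is then the observed information itself.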
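The workflow above can be sketched in plain Python for a two-parameter logistic regression. The data, the prior precision `lam`, and the use of Newton's method with analytic derivatives (instead of a general-purpose optimizer) are all assumptions made for the illustration.

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical data: x_i = (1, t_i) and binary outcomes y_i.
t = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0, 2.5]
y = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]
lam = 0.1   # assumed prior precision in beta ~ N(0, I / lam)

def gradient_and_information(b0, b1):
    """Gradient of the log posterior and its negative Hessian J at (b0, b1)."""
    g0, g1 = -lam * b0, -lam * b1
    j00, j01, j11 = lam, 0.0, lam
    for ti, yi in zip(t, y):
        p = logistic(b0 + b1 * ti)
        w = p * (1.0 - p)
        g0 += yi - p
        g1 += (yi - p) * ti
        j00 += w
        j01 += w * ti
        j11 += w * ti * ti
    return (g0, g1), (j00, j01, j11)

def inverse_2x2(j00, j01, j11):
    """Inverse of the symmetric matrix [[j00, j01], [j01, j11]]."""
    det = j00 * j11 - j01 * j01
    return (j11 / det, -j01 / det, j00 / det)

# Newton's method: beta <- beta + J^{-1} gradient, started at zero.
b0, b1 = 0.0, 0.0
for _ in range(50):
    (g0, g1), J = gradient_and_information(b0, b1)
    c00, c01, c11 = inverse_2x2(*J)
    b0, b1 = b0 + c00 * g0 + c01 * g1, b1 + c01 * g0 + c11 * g1

mode = (b0, b1)
(g0, g1), J = gradient_and_information(*mode)
cov = inverse_2x2(*J)    # approximate posterior covariance as (c00, c01, c11)

print("posterior mode:", mode)
print("approximate posterior sd:", math.sqrt(cov[0]), math.sqrt(cov[2]))
```

The pair `mode` and `cov` defines the approximating normal distribution, which can then be used for summaries or simulation.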

Example: Gamma Posterior

The Exact Posterior

For a Poisson model with a Gamma prior, the posterior can be written as

\theta\mid y \sim \mathrm{Gamma}\left(\alpha+\sum_{i=1}^{n}y_i,\beta+n\right),

where the second parameter is the rate.

Let

A=\alpha+\sum_{i=1}^{n}y_i, \qquad B=\beta+n.

Then

\theta\mid y \sim \mathrm{Gamma}(A,B).

The Mode and Information

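Up to an additive constant, the log posterior is $(A-1)\log\theta - B\theta$. For $A>1$, setting its derivative to zero gives the posterior mode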

\tilde\theta = \frac{A-1}{B}.

The second derivative of the log posterior is $-(A-1)/\theta^2$, so the observed information at the mode is

J_y(\tilde\theta) = \frac{B^2}{A-1}.

The Normal Approximation

The normal approximation is therefore

\theta\mid y \stackrel{\mathrm{approx}}{\sim} N\left( \frac{A-1}{B}, \frac{A-1}{B^2} \right).

This approximation improves as the posterior becomes more concentrated and less skewed, which happens as $A$ grows: the skewness of a $\mathrm{Gamma}(A,B)$ distribution is $2/\sqrt{A}$.
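The mode and information formulas can be checked numerically with finite differences. The prior parameters and data summary below are hypothetical values chosen for the sketch.

```python
import math

# Hypothetical prior and data summary: alpha, beta, sum of y_i, and n.
alpha, beta, sum_y, n = 2.0, 1.0, 18, 6
A, B = alpha + sum_y, beta + n          # posterior is Gamma(A, B) with rate B

def log_post(theta):
    """Log Gamma(A, B) density up to an additive constant."""
    return (A - 1.0) * math.log(theta) - B * theta

mode = (A - 1.0) / B
info = B ** 2 / (A - 1.0)               # observed information at the mode

# Central finite differences for the first and second derivative at the mode.
h = 1e-4
slope = (log_post(mode + h) - log_post(mode - h)) / (2 * h)
curvature = (log_post(mode + h) - 2 * log_post(mode) + log_post(mode - h)) / h ** 2

print("mode:", mode, "slope at mode:", slope)
print("information:", info, "numerical value:", -curvature)
print("normal approximation: mean", mode, "variance", 1.0 / info)
```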

Reparametrization

Why Reparametrize?

A normal approximation on the original parameter scale can be poor when the parameter is constrained.

For example:

a variance must be positive;
a probability must lie between 0 and 1.

A normal approximation can put probability outside the allowed range.

Common Transformations

For a positive parameter, use

\phi=\log\theta.

For a probability parameter, use

\phi=\mathrm{logit}(\theta) = \log\frac{\theta}{1-\theta}.

Approximate the posterior on the transformed scale, then transform back.

Remember the Jacobian

When transforming a density, include the Jacobian. If

\phi=\log\theta,

then

\theta=\exp(\phi)

and

p(\phi\mid y) = p(\theta=\exp(\phi)\mid y)\exp(\phi).

The extra factor $\exp(\phi)=|d\theta/d\phi|$ is the Jacobian.

Gamma Example on the Log Scale

For the Gamma posterior

\theta\mid y \sim \mathrm{Gamma}(A,B),

use

\phi=\log\theta.

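Including the Jacobian, the log posterior on the $\phi$ scale is $A\phi - B\exp(\phi)$ up to an additive constant. Its mode is $\log(A/B)$ and the observed information there is $A$, so the normal approximation on the $\phi$ scale is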

\phi\mid y \stackrel{\mathrm{approx}}{\sim} N\left(\log\frac{A}{B},\frac{1}{A}\right).

Transforming back gives a log-normal approximation:

\theta\mid y \stackrel{\mathrm{approx}}{\sim} \mathrm{LogNormal}\left(\log\frac{A}{B},\frac{1}{A}\right).

This often respects the shape of the exact Gamma posterior better than a normal approximation directly on $\theta$.
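One way to compare the two approximations is the integrated absolute difference between each approximating density and the exact Gamma density. The values of `A` and `B`, the integration range, and the choice of this particular error measure are assumptions made for the sketch.

```python
import math

A, B = 5.0, 2.0      # hypothetical Gamma(A, B) posterior with rate B

def gamma_pdf(t):
    return math.exp(A * math.log(B) + (A - 1) * math.log(t) - B * t - math.lgamma(A))

def normal_pdf(t, mean, var):
    return math.exp(-0.5 * (t - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)

def lognormal_pdf(t, mu, var):
    # Normal density for log(t) times the Jacobian 1 / t.
    return normal_pdf(math.log(t), mu, var) / t

def l1_distance(approx_pdf, upper=20.0, steps=20000):
    """Midpoint-rule integral of |exact - approx| over (0, upper)."""
    h = upper / steps
    return sum(abs(gamma_pdf((i + 0.5) * h) - approx_pdf((i + 0.5) * h)) * h
               for i in range(steps))

# Normal approximation on the theta scale and log-normal approximation via phi.
normal_err = l1_distance(lambda t: normal_pdf(t, (A - 1) / B, (A - 1) / B ** 2))
lognormal_err = l1_distance(lambda t: lognormal_pdf(t, math.log(A / B), 1 / A))

print("normal approximation error:", normal_err)
print("log-normal approximation error:", lognormal_err)
```

The integral covers only positive values, so it does not count the probability that the normal approximation places below zero, which is an additional error on the original scale.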

Heavy-Tailed Approximation

The normal approximation can be too confident if the posterior has heavier tails than a normal distribution.

A simple robust alternative is a Student-$t$ approximation:

\theta\mid y \stackrel{\mathrm{approx}}{\sim} t_\nu\left(\tilde\theta,J_y(\tilde\theta)^{-1}\right).

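The degrees of freedom $\nu$ control tail thickness: smaller $\nu$ gives heavier tails, and the approximation approaches the normal one as $\nu\to\infty$. Here $J_y(\tilde\theta)^{-1}$ acts as the scale matrix; for $\nu>2$ the covariance is larger by a factor $\nu/(\nu-2)$.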
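A sketch of the scalar case, comparing tail behavior of Student-$t$ and normal draws that share the same location and scale. The mode, the information value, and $\nu=4$ are hypothetical; the draws use the standard construction of a $t$ variate as a normal divided by the square root of a scaled chi-square.

```python
import math
import random

def draw_t(nu, loc, scale, rng):
    """One draw from a scalar Student-t with nu degrees of freedom, location loc, scale scale."""
    z = rng.gauss(0.0, 1.0)
    chi2 = rng.gammavariate(nu / 2.0, 2.0)     # chi-square with nu degrees of freedom
    return loc + scale * z / math.sqrt(chi2 / nu)

rng = random.Random(0)
mode, info = 2.0, 4.0                          # hypothetical mode and observed information
scale = math.sqrt(1.0 / info)

S = 50_000
t_draws = [draw_t(4, mode, scale, rng) for _ in range(S)]
n_draws = [rng.gauss(mode, scale) for _ in range(S)]

def tail_fraction(draws):
    """Share of draws more than three scale units away from the mode."""
    return sum(abs(d - mode) > 3 * scale for d in draws) / len(draws)

t_tail = tail_fraction(t_draws)
n_tail = tail_fraction(n_draws)
print("t tail share:", t_tail, "normal tail share:", n_tail)
```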

Derived Quantities

The Problem

Even if $\theta\mid y$ is approximately normal, a function

g(\theta)

may not be approximately normal.

For example, a ratio, probability, or inequality measure can have a skewed distribution even when the underlying parameters look close to normal.

The Simulation Solution

Use the normal approximation as a simulation distribution:

Draw $\theta^{(s)} \sim N\left(\tilde\theta,J_y(\tilde\theta)^{-1}\right)$.
Compute $g^{(s)}=g(\theta^{(s)})$.
Summarize $g^{(1)},\ldots,g^{(S)}$.

This avoids forcing a normal approximation onto $g(\theta)$ directly.

Example: Gini Coefficient

Suppose incomes are modeled as

y\sim \mathrm{LogNormal}(\mu,\sigma^2).

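If approximate posterior draws of $(\mu,\sigma^2)$ are available, compute the Gini coefficient for each draw. For a log-normal distribution it depends only on $\sigma^{(s)}=\sqrt{\sigma^{2(s)}}$, not on $\mu$: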

G^{(s)} = 2\Phi\left(\frac{\sigma^{(s)}}{\sqrt{2}}\right)-1.

The simulated values $G^{(1)},\ldots,G^{(S)}$ give an approximate posterior distribution for the Gini coefficient.
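A sketch of the simulation recipe for the Gini coefficient. It assumes, purely for illustration, that a normal approximation for $\log\sigma^2$ is already available; the mode and standard deviation used below are made up and do not come from any real data set.

```python
import math
import random
from statistics import NormalDist, mean, quantiles

rng = random.Random(1)
Phi = NormalDist().cdf     # standard normal CDF

# Hypothetical normal approximation for log(sigma^2): made-up mode and sd.
log_var_mode, log_var_sd = math.log(0.5), 0.15

S = 10_000
gini_draws = []
for _ in range(S):
    sigma2 = math.exp(rng.gauss(log_var_mode, log_var_sd))   # draw of sigma^2, always positive
    sigma = math.sqrt(sigma2)
    gini_draws.append(2.0 * Phi(sigma / math.sqrt(2.0)) - 1.0)

cuts = quantiles(gini_draws, n=40)     # cut points at 2.5%, 5%, ..., 97.5%
print("posterior mean of the Gini coefficient:", mean(gini_draws))
print("central 95% interval:", cuts[0], cuts[-1])
```

Working on the log scale for $\sigma^2$ keeps every draw positive, in line with the reparametrization advice earlier in the chapter.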

Chapter Summary

Bayesian classification uses posterior class probabilities. With equal misclassification costs, the chosen class is the one with largest posterior probability. Logistic regression models binary probabilities through the logistic link, while probit regression uses the normal CDF. Multi-class logistic regression extends the same idea by assigning scores to all classes and normalizing them.

Large-sample normal approximation replaces a difficult posterior with a normal distribution centered at the posterior mode. The covariance matrix is the inverse observed information. Reparametrization can improve the approximation for constrained parameters, and simulation from the approximation is often the easiest way to study derived quantities.

Check Your Understanding

Why does the largest posterior class probability give the optimal classifier only under equal misclassification loss?
What role does the logistic function play in binary logistic regression?
Why is the logistic regression posterior not conjugate?
What are the mode and covariance matrix in the large-sample normal approximation?
Why can a log-scale approximation be better for a positive parameter?