Classification and Large-Sample Approximations

Source: Lecture 6 — Classification. Large sample approximations (BDA Ch. 16).

So far, many examples have used continuous outcomes. Classification changes the target.

Instead of predicting a number, we predict a class label:

$$c \in \{1,\ldots,C\}.$$

This chapter also introduces a practical computational idea. Some useful Bayesian models do not have closed-form posteriors. When the sample size is large, the posterior can often be approximated by a normal distribution near its mode.

The storyline is:

  1. Turn classification into posterior class probabilities.
  2. Build binary logistic and probit regression models.
  3. Extend logistic regression to multiple classes.
  4. Approximate nonstandard posteriors by a normal distribution.
  5. Improve approximations by reparametrizing constrained parameters.

In classification, the goal is to assign a label to an observation with covariates $x$.

Bayesian classification starts by computing class probabilities:

$$p(c \mid x).$$

The simplest classifier chooses the class with largest posterior probability:

$$\hat c = \arg\max_{c\in\mathcal C} p(c\mid x).$$

This rule minimizes posterior expected loss when every wrong classification carries the same loss and correct classifications cost nothing (0-1 loss).

If different mistakes have different costs, the largest-probability class may not be the best action.

For example, falsely missing a disease can be more costly than falsely flagging a healthy patient. In that case, combine class probabilities with a loss table and choose the action with smallest posterior expected loss.

Classification is therefore another decision problem.
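The decision rule can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method: the class probabilities and the loss table below are hypothetical numbers chosen so that the most probable class and the minimum-expected-loss action disagree.

```python
# Posterior class probabilities for one patient (hypothetical numbers).
probs = {"healthy": 0.85, "disease": 0.15}

# loss[action][true_class]: hypothetical costs. Missing a disease is
# assumed to be ten times as costly as a false alarm.
loss = {
    "declare_healthy": {"healthy": 0.0, "disease": 10.0},
    "declare_disease": {"healthy": 1.0, "disease": 0.0},
}

def expected_loss(action):
    """Posterior expected loss: sum over classes of loss times probability."""
    return sum(loss[action][c] * p for c, p in probs.items())

# Class with the largest posterior probability.
map_class = max(probs, key=probs.get)

# Action with the smallest posterior expected loss.
best_action = min(loss, key=expected_loss)

print(map_class, best_action)
```

With these numbers the most probable class is `healthy`, yet declaring disease has the smaller expected loss (0.85 versus 1.5), so the two rules differ.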

A discriminative model directly models

$$p(c \mid x).$$

Logistic regression is the main example in this chapter.

The model focuses on the boundary between classes. It does not need to model the full distribution of covariates.

A generative model describes how covariates are distributed within each class:

$$p(x\mid c).$$

It also assigns prior class probabilities:

$$p(c).$$

Bayes’ theorem then gives

$$p(c\mid x) \propto p(x\mid c)\,p(c).$$

Naive Bayes is a common example.

The contrast is:

  • discriminative models learn class probabilities directly;
  • generative models learn a data-generating story for each class and then apply Bayes’ theorem.
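As a rough sketch of the generative route, the following Python snippet applies Bayes' theorem to a single covariate. The normal class-conditional densities, the class names, and the prior probabilities are all assumptions made for illustration.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

# Hypothetical class-conditional models p(x | c) as (mean, sd) pairs,
# and hypothetical prior class probabilities p(c).
class_models = {"A": (0.0, 1.0), "B": (2.0, 1.0)}
class_priors = {"A": 0.7, "B": 0.3}

def posterior_class_probs(x):
    """p(c | x) is proportional to p(x | c) p(c); normalize over classes."""
    unnorm = {c: normal_pdf(x, *class_models[c]) * class_priors[c]
              for c in class_models}
    total = sum(unnorm.values())
    return {c: u / total for c, u in unnorm.items()}

post = posterior_class_probs(1.0)
print(post)
```

At `x = 1.0` both class densities are equal, so the posterior class probabilities coincide with the priors. Moving `x` toward either class mean shifts the posterior toward that class.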

Binary logistic regression models an outcome

$$y_i\in\{0,1\}.$$

The model connects predictors $x_i$ to the probability that $y_i=1$.

Start with the same linear predictor used in regression:

$$\eta_i=x_i'\beta.$$

The value $\eta_i$ can be any real number.

Use the logistic function

$$\Lambda(z) = \frac{1}{1+\exp(-z)}.$$

Then

$$\Pr(y_i=1\mid x_i,\beta) = \Lambda(x_i'\beta) = \frac{\exp(x_i'\beta)}{1+\exp(x_i'\beta)}.$$

This keeps the probability between 0 and 1.
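A small Python sketch of this mapping follows. The function names and the example coefficients are illustrative assumptions; the branch on the sign of `z` is one common way to avoid overflow in `exp`.

```python
import math

def logistic(z):
    """Logistic function 1 / (1 + exp(-z)), written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def prob_y1(x, beta):
    """Pr(y = 1 | x, beta) for one observation with linear predictor x'beta."""
    eta = sum(xj * bj for xj, bj in zip(x, beta))
    return logistic(eta)

# Hypothetical covariates (with a leading 1 for the intercept) and coefficients.
p = prob_y1([1.0, 2.0], [-1.0, 0.5])   # eta = -1 + 2 * 0.5 = 0
print(p)
```

A linear predictor of zero corresponds to a probability of one half. Very large positive or negative values of the linear predictor stay inside the unit interval.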

For a Bernoulli outcome, the likelihood contribution for observation $i$ is

$$\Lambda(x_i'\beta)^{y_i} \left[1-\Lambda(x_i'\beta)\right]^{1-y_i}.$$

Multiplying over observations gives

$$p(y\mid X,\beta) = \prod_{i=1}^{n} \Lambda(x_i'\beta)^{y_i} \left[1-\Lambda(x_i'\beta)\right]^{1-y_i}.$$

A common prior is a normal shrinkage prior:

$$\beta \sim N(0,\lambda^{-1}I).$$

This prior keeps very large coefficients from being too plausible unless the data strongly support them.

The posterior is proportional to likelihood times prior:

$$p(\beta\mid y,X) \propto p(y\mid X,\beta)\,p(\beta).$$

Unlike normal linear regression, this posterior is not conjugate. It does not simplify to a familiar closed-form distribution.

That is why logistic regression motivates approximation methods and simulation.
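Although the posterior has no closed form, its logarithm is easy to evaluate up to an additive constant, which is all that optimization and simulation need. The sketch below assumes the normal shrinkage prior with precision `lam`; the data and function names are hypothetical. It uses the identity $y\log\Lambda(\eta) + (1-y)\log[1-\Lambda(\eta)] = y\eta - \log(1+e^{\eta})$.

```python
import math

def log1pexp(z):
    """Numerically stable log(1 + exp(z))."""
    if z > 0:
        return z + math.log1p(math.exp(-z))
    return math.log1p(math.exp(z))

def log_posterior(beta, X, y, lam):
    """Unnormalized log posterior: Bernoulli log-likelihood
    plus the log density of a N(0, I / lam) prior (constants dropped)."""
    loglik = 0.0
    for xi, yi in zip(X, y):
        eta = sum(xj * bj for xj, bj in zip(xi, beta))
        loglik += yi * eta - log1pexp(eta)
    logprior = -0.5 * lam * sum(b * b for b in beta)
    return loglik + logprior

# Hypothetical data: intercept column plus one predictor.
X = [[1.0, 0.5], [1.0, -1.0], [1.0, 2.0], [1.0, 0.0]]
y = [1, 0, 1, 0]

value_at_zero = log_posterior([0.0, 0.0], X, y, lam=1.0)
print(value_at_zero)
```

At `beta = 0` every observation has probability one half, so the value is `n * log(1/2)` with no prior penalty.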

The logistic model can also be written in odds form.

The odds of class 1 are

$$\frac{\Pr(y_i=1\mid x_i,\beta)}{\Pr(y_i=0\mid x_i,\beta)}.$$

Taking logs gives

$$\log \frac{\Pr(y_i=1\mid x_i,\beta)}{\Pr(y_i=0\mid x_i,\beta)} = x_i'\beta.$$

So each coefficient $\beta_j$ is the change in log-odds for a one-unit increase in the corresponding predictor, holding the other predictors fixed. Equivalently, a one-unit increase multiplies the odds by $\exp(\beta_j)$.
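The odds interpretation can be checked numerically. In this sketch the intercept and slope are hypothetical values; the point is only that the odds ratio for a one-unit increase equals the exponentiated slope, whatever the starting value of the predictor.

```python
import math

beta = [-1.0, 0.8]   # hypothetical intercept and slope

def prob(x1):
    """Pr(y = 1) for predictor value x1 under the logistic model."""
    eta = beta[0] + beta[1] * x1
    return 1.0 / (1.0 + math.exp(-eta))

def odds(x1):
    """Odds of class 1 against class 0."""
    p = prob(x1)
    return p / (1.0 - p)

# Odds ratio for a one-unit increase in the predictor, from 2 to 3.
odds_ratio = odds(3.0) / odds(2.0)
print(odds_ratio, math.exp(beta[1]))
```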

Probit regression uses a different link function. Instead of the logistic CDF, it uses the standard normal CDF:

$$\Pr(y_i=1\mid x_i,\beta) = \Phi(x_i'\beta).$$

Here $\Phi$ is the cumulative distribution function of a standard normal random variable.

Logistic and probit regression often give similar fitted probabilities. Probit is important in Bayesian computation because it has a useful latent-variable representation.

That representation supports data augmentation methods, which are introduced later with simulation algorithms.
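The similarity of the two links can be illustrated numerically. The sketch below compares $\Lambda(z)$ with $\Phi(z/1.702)$ on a grid; the rescaling constant 1.702 is a commonly quoted rule of thumb and is an assumption of this example, not something taken from the lecture.

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def std_normal_cdf(z):
    """Standard normal CDF written with the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

SCALE = 1.702   # rule-of-thumb constant for matching the two curves (assumption)

# Compare the two link functions on a grid from -6 to 6.
grid = [i / 100.0 for i in range(-600, 601)]
max_gap = max(abs(logistic(z) - std_normal_cdf(z / SCALE)) for z in grid)
print(max_gap)
```

After rescaling, the two curves stay within about one percentage point of each other over this grid. This is why probit coefficients tend to be smaller in magnitude than logistic coefficients fitted to the same data while the fitted probabilities are close.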

Now suppose

$$y_i\in\{1,\ldots,C\}.$$

Each class gets its own coefficient vector $\beta_c$.

The multi-class logistic model is

$$\Pr(y_i=c\mid x_i,\beta_1,\ldots,\beta_C) = \frac{\exp(x_i'\beta_c)}{\sum_{k=1}^{C}\exp(x_i'\beta_k)}.$$

The denominator makes the class probabilities add to 1.

Adding the same value to every class score does not change the probabilities. To make the parameters identifiable, choose a baseline class and set its coefficients to zero, for example

$$\beta_1=0.$$

The other coefficients are then interpreted relative to that baseline class.
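The invariance behind the baseline constraint is easy to see in code. This sketch uses hypothetical coefficient vectors for three classes, with the first class as baseline, and shows that adding the same vector to every class leaves the probabilities unchanged.

```python
import math

def softmax(scores):
    """Normalize exponentiated scores; subtracting the max avoids overflow."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def class_probs(x, betas):
    """Multi-class logistic probabilities for one observation."""
    scores = [sum(xj * bj for xj, bj in zip(x, beta_c)) for beta_c in betas]
    return softmax(scores)

x = [1.0, 0.5]                                    # hypothetical covariates
betas = [[0.0, 0.0], [0.2, -1.0], [-0.5, 2.0]]    # class 1 is the baseline

# Add the same vector (3, 3) to every class's coefficients.
shifted = [[b + 3.0 for b in beta_c] for beta_c in betas]

p = class_probs(x, betas)
p_shifted = class_probs(x, shifted)
print(p, p_shifted)
```

Both parameter sets give identical probabilities, so the data cannot distinguish them. Fixing the baseline coefficients at zero picks one representative from each such family.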

In models like logistic regression, the posterior is often nonstandard.

One approach is MCMC. Another is to approximate the posterior by a normal distribution near its mode. This is especially useful when:

  • the sample size is large;
  • the posterior is unimodal;
  • the posterior mass is concentrated near the mode.

Let $\tilde\theta$ be the posterior mode:

$$\tilde\theta = \arg\max_\theta p(\theta\mid y).$$

Equivalently, it maximizes the log posterior.

Step 2: Approximate the Log Posterior Locally


Use a second-order Taylor expansion around $\tilde\theta$.

At the mode, the first derivative is zero. Keeping terms up to second order gives

$$\log p(\theta\mid y) \approx \log p(\tilde\theta\mid y) - \frac{1}{2} (\theta-\tilde\theta)' J_y(\tilde\theta) (\theta-\tilde\theta).$$

The matrix

$$J_y(\tilde\theta) = - \left. \frac{\partial^2\log p(\theta\mid y)}{\partial\theta\,\partial\theta'} \right|_{\theta=\tilde\theta}$$

is the observed information matrix at the posterior mode.

Exponentiating the quadratic approximation gives the kernel of a normal distribution:

$$p(\theta\mid y) \approx \text{constant} \times \exp\left\{ - \frac{1}{2} (\theta-\tilde\theta)' J_y(\tilde\theta) (\theta-\tilde\theta) \right\}.$$

Therefore,

$$\theta\mid y \stackrel{\mathrm{approx}}{\sim} N\left(\tilde\theta,J_y(\tilde\theta)^{-1}\right).$$

The inverse information matrix is the approximate posterior covariance matrix.

The normal approximation gives a practical workflow:

  1. Write the log posterior up to an additive constant.
  2. Numerically maximize it to find $\tilde\theta$.
  3. Compute the negative Hessian at the mode.
  4. Invert the observed information matrix to get an approximate covariance matrix.
  5. Use the resulting normal distribution for summaries or simulation.

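In R, `optim` can return both the mode and a numerical Hessian when called with `hessian = TRUE`. Because `optim` minimizes by default, pass it the negative log posterior; the Hessian it returns is then a numerical estimate of the observed information.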
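The same workflow can be sketched in Python without any optimization library. The example below is an illustration under stated assumptions: a logistic regression with a single coefficient and no intercept, a $N(0,\lambda^{-1})$ prior, hypothetical data, and Newton's method in place of a general-purpose optimizer.

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical data: one predictor, no intercept.
x = [-2.0, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0, 3.0]
y = [0, 0, 1, 0, 1, 1, 1, 1]
lam = 1.0   # prior precision (assumption)

def gradient(beta):
    """First derivative of the log posterior."""
    return sum(xi * (yi - logistic(xi * beta)) for xi, yi in zip(x, y)) - lam * beta

def information(beta):
    """Negative second derivative of the log posterior."""
    return sum(xi * xi * logistic(xi * beta) * (1.0 - logistic(xi * beta))
               for xi in x) + lam

# Step 2: Newton iterations to find the posterior mode.
beta = 0.0
for _ in range(50):
    step = gradient(beta) / information(beta)
    beta += step
    if abs(step) < 1e-10:
        break

# Steps 3 and 4: observed information at the mode and its inverse.
mode = beta
approx_var = 1.0 / information(mode)
approx_sd = math.sqrt(approx_var)

# Step 5: an approximate 95% posterior interval from the normal approximation.
interval = (mode - 1.96 * approx_sd, mode + 1.96 * approx_sd)
print(mode, approx_sd, interval)
```

With several coefficients the same steps apply, with the gradient as a vector and the information as a matrix that is inverted to give the covariance.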

For a Poisson model with a Gamma prior, the posterior can be written as

$$\theta\mid y \sim \mathrm{Gamma}\left(\alpha+\sum_{i=1}^{n}y_i,\ \beta+n\right),$$

where the second parameter is the rate.

Let

$$A=\alpha+\sum_{i=1}^{n}y_i, \qquad B=\beta+n.$$

Then

$$\theta\mid y \sim \mathrm{Gamma}(A,B).$$

For $A>1$, the posterior mode is

$$\tilde\theta = \frac{A-1}{B}.$$

The observed information at the mode is

$$J_y(\tilde\theta) = \frac{B^2}{A-1}.$$

The normal approximation is therefore

$$\theta\mid y \stackrel{\mathrm{approx}}{\sim} N\left( \frac{A-1}{B}, \frac{A-1}{B^2} \right).$$

This approximation improves as the posterior becomes more concentrated and less skewed. For the Gamma posterior the skewness is $2/\sqrt{A}$, so the approximation improves as $A$ grows.
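These formulas can be checked numerically. The sketch below uses hypothetical prior parameters and counts, compares the closed-form observed information with a finite-difference second derivative of the log posterior, and reports how much probability the normal approximation places below zero.

```python
import math
from statistics import NormalDist

# Hypothetical Gamma(alpha, beta) prior (rate beta) and Poisson counts.
alpha, beta = 2.0, 1.0
y = [3, 1, 4, 2, 2]

A = alpha + sum(y)    # posterior shape
B = beta + len(y)     # posterior rate

def log_post(theta):
    """Gamma(A, B) log density up to an additive constant."""
    return (A - 1.0) * math.log(theta) - B * theta

mode = (A - 1.0) / B
info = B ** 2 / (A - 1.0)

# Finite-difference check of the observed information at the mode.
h = 1e-4
info_numeric = -(log_post(mode + h) - 2.0 * log_post(mode) + log_post(mode - h)) / h ** 2

approx = NormalDist(mu=mode, sigma=math.sqrt(1.0 / info))

# Exact Gamma mean and standard deviation for comparison.
exact_mean = A / B
exact_sd = math.sqrt(A) / B

# Probability the normal approximation assigns to impossible negative values.
prob_negative = approx.cdf(0.0)
print(approx.mean, approx.stdev, exact_mean, exact_sd, prob_negative)
```

The approximation is centered at the mode, which lies below the exact posterior mean because the Gamma distribution is right-skewed.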

A normal approximation on the original parameter scale can be poor when the parameter is constrained.

For example:

  • a variance must be positive;
  • a probability must lie between 0 and 1.

A normal approximation can put probability outside the allowed range.

For a positive parameter, use

$$\phi=\log\theta.$$

For a probability parameter, use

$$\phi=\mathrm{logit}(\theta) = \log\frac{\theta}{1-\theta}.$$

Approximate the posterior on the transformed scale, then transform back.

When transforming a density, include the Jacobian.

If

$$\phi=\log\theta,$$

then

$$\theta=\exp(\phi)$$

and

$$p(\phi\mid y) = p(\theta=\exp(\phi)\mid y)\,\exp(\phi).$$

The extra factor $\exp(\phi)$ is the Jacobian.

For the Gamma posterior

$$\theta\mid y \sim \mathrm{Gamma}(A,B),$$

use

$$\phi=\log\theta.$$

A normal approximation on the $\phi$ scale gives

$$\phi\mid y \stackrel{\mathrm{approx}}{\sim} N\left(\log\frac{A}{B},\frac{1}{A}\right).$$

Transforming back gives a log-normal approximation:

$$\theta\mid y \stackrel{\mathrm{approx}}{\sim} \mathrm{LogNormal}\left(\log\frac{A}{B},\frac{1}{A}\right).$$

This often respects the shape of the exact Gamma posterior better than a normal approximation directly on $\theta$.
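The log-scale result follows from the same mode-and-curvature recipe once the Jacobian is included: the log density of $\phi$ is $A\phi - Be^{\phi}$ up to a constant, which peaks at $\phi=\log(A/B)$ with curvature $A$. The sketch below checks the curvature by finite differences and compares lower quantiles of the two approximations; the values of `A` and `B` are hypothetical and deliberately small so that the skewness is visible.

```python
import math
from statistics import NormalDist

A, B = 3.0, 2.0   # hypothetical Gamma(A, B) posterior, B is the rate

def log_post_phi(phi):
    """log p(phi | y) up to a constant: Gamma log kernel at exp(phi)
    plus the log Jacobian, which is phi itself."""
    theta = math.exp(phi)
    return (A - 1.0) * math.log(theta) - B * theta + phi

# Mode and curvature on the log scale.
phi_mode = math.log(A / B)
h = 1e-4
curvature = -(log_post_phi(phi_mode + h) - 2.0 * log_post_phi(phi_mode)
              + log_post_phi(phi_mode - h)) / h ** 2

# Normal approximation on theta and on phi = log(theta).
normal_theta = NormalDist((A - 1.0) / B, math.sqrt((A - 1.0) / B ** 2))
normal_phi = NormalDist(phi_mode, math.sqrt(1.0 / A))

# Lower 5% quantiles on the original scale.
lower_normal = normal_theta.inv_cdf(0.05)
lower_lognormal = math.exp(normal_phi.inv_cdf(0.05))
print(curvature, lower_normal, lower_lognormal)
```

With these values the direct normal approximation puts its 5% quantile below zero, which is impossible for a Poisson rate, while the log-normal approximation keeps all its mass on positive values.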

The normal approximation can be too confident if the posterior has heavier tails than a normal distribution.

A simple robust alternative is a Student-$t$ approximation:

$$\theta\mid y \stackrel{\mathrm{approx}}{\sim} t_\nu\left(\tilde\theta,J_y(\tilde\theta)^{-1}\right).$$

The degrees of freedom $\nu$ control tail thickness. Smaller $\nu$ gives heavier tails.
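One way to draw from such an approximation in one dimension is to divide a normal draw by the square root of an independent scaled chi-square draw. The sketch below assumes a hypothetical mode, scale, and $\nu=4$. Note that $J_y(\tilde\theta)^{-1}$ acts as a scale here; for $\nu>2$ the variance of the $t$ distribution is $\nu/(\nu-2)$ times larger.

```python
import math
import random

random.seed(3)

# Hypothetical mode, scale (square root of the inverse information),
# and degrees of freedom.
mode, scale, nu = 0.0, 1.0, 4

def draw_t():
    """One draw from a location-scale Student-t distribution."""
    z = random.gauss(0.0, 1.0)
    w = random.gammavariate(nu / 2.0, 2.0)   # chi-square with nu degrees of freedom
    return mode + scale * z / math.sqrt(w / nu)

S = 20000
t_draws = [draw_t() for _ in range(S)]
normal_draws = [random.gauss(mode, scale) for _ in range(S)]

# Share of draws more than three scale units from the mode.
tail_t = sum(abs(d - mode) > 3.0 * scale for d in t_draws) / S
tail_normal = sum(abs(d - mode) > 3.0 * scale for d in normal_draws) / S
print(tail_t, tail_normal)
```

The $t$ draws land far from the mode much more often than the normal draws, which is the sense in which the approximation is less overconfident.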

Even if $\theta\mid y$ is approximately normal, a function

$$g(\theta)$$

may not be approximately normal.

For example, a ratio, probability, or inequality measure can have a skewed distribution even when the underlying parameters look close to normal.

Use the normal approximation as a simulation distribution:

  1. Draw

    $$\theta^{(s)} \sim N\left(\tilde\theta,J_y(\tilde\theta)^{-1}\right).$$
  2. Compute

    $$g^{(s)}=g(\theta^{(s)}).$$
  3. Summarize

    $$g^{(1)},\ldots,g^{(S)}.$$

This avoids forcing a normal approximation onto $g(\theta)$ directly.
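The three steps can be sketched as follows. The mode, standard deviation, and the choice $g(\theta)=\exp(\theta)$ are hypothetical; they are picked only because the resulting distribution is visibly skewed.

```python
import math
import random
import statistics

random.seed(1)

# Hypothetical normal approximation for a scalar parameter.
mode, sd = 0.0, 1.0
S = 10000

# Step 1: draw from the normal approximation.
theta_draws = [random.gauss(mode, sd) for _ in range(S)]

# Step 2: transform each draw.
g_draws = [math.exp(t) for t in theta_draws]

# Step 3: summarize the transformed draws.
g_mean = statistics.fmean(g_draws)
g_median = statistics.median(g_draws)
cuts = statistics.quantiles(g_draws, n=40)   # cut points at 2.5% steps
interval = (cuts[0], cuts[-1])               # central 95% interval
print(g_mean, g_median, interval)
```

The mean of the transformed draws sits well above their median, and the interval is asymmetric around both. A normal summary of $g(\theta)$ would hide that skewness.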

Suppose incomes are modeled as

$$y\sim \mathrm{LogNormal}(\mu,\sigma^2).$$

If approximate posterior draws of $(\mu,\sigma^2)$ are available, compute the Gini coefficient for each draw:

$$G^{(s)} = 2\Phi\left(\frac{\sigma^{(s)}}{\sqrt{2}}\right)-1.$$

The simulated values $G^{(1)},\ldots,G^{(S)}$ give an approximate posterior distribution for the Gini coefficient.
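A minimal sketch of this calculation follows. Since the Gini coefficient depends only on $\sigma$, the example draws $\sigma^2$ alone. It assumes, purely for illustration, that a normal approximation is available for $\phi=\log\sigma^2$ with a hypothetical mode and standard deviation.

```python
import math
import random
from statistics import NormalDist, median

random.seed(2)
std_normal = NormalDist()

# Hypothetical normal approximation for phi = log(sigma^2).
phi_mode, phi_sd = math.log(0.5), 0.1

def gini_lognormal(sigma):
    """Gini coefficient of a LogNormal(mu, sigma^2) distribution."""
    return 2.0 * std_normal.cdf(sigma / math.sqrt(2.0)) - 1.0

gini_draws = []
for _ in range(5000):
    # Draw on the log scale, then transform back so sigma^2 stays positive.
    sigma2 = math.exp(random.gauss(phi_mode, phi_sd))
    gini_draws.append(gini_lognormal(math.sqrt(sigma2)))

print(median(gini_draws), min(gini_draws), max(gini_draws))
```

Drawing on the log scale and transforming back keeps every simulated variance positive, so every simulated Gini coefficient lies between 0 and 1.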

Bayesian classification uses posterior class probabilities. With equal misclassification costs, the chosen class is the one with largest posterior probability. Logistic regression models binary probabilities through the logistic link, while probit regression uses the normal CDF. Multi-class logistic regression extends the same idea by assigning scores to all classes and normalizing them.

Large-sample normal approximation replaces a difficult posterior with a normal distribution centered at the posterior mode. The covariance matrix is the inverse observed information. Reparametrization can improve the approximation for constrained parameters, and simulation from the approximation is often the easiest way to study derived quantities.

  1. Why does the largest posterior class probability give the optimal classifier only under equal misclassification loss?
  2. What role does the logistic function play in binary logistic regression?
  3. Why is the logistic regression posterior not conjugate?
  4. What are the mode and covariance matrix in the large-sample normal approximation?
  5. Why can a log-scale approximation be better for a positive parameter?