Multiparameter Models

Source: Lecture 3 — Multiparameter models (BDA Ch. 3).

Earlier chapters mostly used one unknown parameter:

$$\theta.$$

Many real models have several unknown quantities. A Normal model may have both a mean and a variance. A multinomial model has several category probabilities. A multivariate Normal model has a vector mean.

The main Bayesian idea does not change:

$$p(\theta \mid x) \propto p(x \mid \theta)p(\theta).$$

What changes is that $\theta$ may now be a vector.

Suppose

$$\theta=(\theta_1,\theta_2).$$

The Bayesian update gives a joint posterior:

$$p(\theta_1,\theta_2 \mid x) \propto p(x \mid \theta_1,\theta_2)p(\theta_1,\theta_2).$$

This joint distribution describes uncertainty about both parameters and their dependence after seeing the data.

Often one parameter is the main target and the other is a nuisance parameter.

If $\theta_1$ is the parameter of interest, remove $\theta_2$ by integration:

$$p(\theta_1 \mid x) = \int p(\theta_1,\theta_2 \mid x)\,d\theta_2.$$

This is called marginalization.

The same marginal posterior can be written as

$$p(\theta_1 \mid x) = \int p(\theta_1 \mid \theta_2,x)p(\theta_2 \mid x)\,d\theta_2.$$

This says:

  1. condition on a possible value of $\theta_2$;
  2. describe uncertainty about $\theta_1$ at that value;
  3. average over posterior uncertainty in $\theta_2$.

This interpretation is central to Bayesian computation.
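The two ways of writing the marginal posterior can be checked on a toy discrete example, where the integral becomes a sum. The sketch below is only illustrative: the joint probabilities and the function names `marginal_theta1` and `marginal_by_averaging` are made up for this example and do not come from the lecture.

```python
# Toy joint posterior p(theta1, theta2 | x) on a small discrete grid.
# The probabilities are invented for illustration and sum to 1.
joint = {
    (0, 0): 0.10, (0, 1): 0.20, (0, 2): 0.10,
    (1, 0): 0.30, (1, 1): 0.10, (1, 2): 0.20,
}
theta1_values = [0, 1]
theta2_values = [0, 1, 2]


def marginal_theta1(joint):
    """Sum the joint over theta2 (the discrete analogue of the integral)."""
    return {
        t1: sum(joint[(t1, t2)] for t2 in theta2_values)
        for t1 in theta1_values
    }


def marginal_by_averaging(joint):
    """Average p(theta1 | theta2, x) over p(theta2 | x)."""
    # Marginal posterior of the nuisance parameter theta2.
    p_theta2 = {
        t2: sum(joint[(t1, t2)] for t1 in theta1_values)
        for t2 in theta2_values
    }
    result = {}
    for t1 in theta1_values:
        total = 0.0
        for t2 in theta2_values:
            conditional = joint[(t1, t2)] / p_theta2[t2]  # p(theta1 | theta2, x)
            total += conditional * p_theta2[t2]           # weight by p(theta2 | x)
        result[t1] = total
    return result


direct = marginal_theta1(joint)
averaged = marginal_by_averaging(joint)
```

Both routes give the same marginal for $\theta_1$. The second route is the one that simulation methods exploit: draw the nuisance parameter first, then draw the parameter of interest given that value.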

Previously, the Normal variance $\sigma^2$ was known. Now both the mean and the variance are unknown:

$$X_1,\ldots,X_n \mid \theta,\sigma^2 \overset{\mathrm{iid}}{\sim} N(\theta,\sigma^2).$$

The two unknowns are:

  • $\theta$: the population mean;
  • $\sigma^2$: the population variance.

A common prior for this model is

$$p(\theta,\sigma^2) \propto \frac{1}{\sigma^2}.$$

This prior is improper (it does not integrate to a finite value), but it leads to a proper posterior once there are at least two observations and they are not all identical.

It can be understood as flat in $\theta$ and flat in $\log\sigma$.
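A short change-of-variables sketch shows why. The symbol $u$ is introduced here only for this derivation. Let $u=\log\sigma=\tfrac{1}{2}\log\sigma^2$ and take the prior to be flat in $(\theta,u)$, so that $p(\theta,u)\propto 1$. Since $du/d\sigma^2 = 1/(2\sigma^2)$,

$$p(\theta,\sigma^2) = p(\theta,u)\left|\frac{du}{d\sigma^2}\right| \propto \frac{1}{2\sigma^2} \propto \frac{1}{\sigma^2}.$$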

If $\sigma^2$ were known, the posterior for $\theta$ would be Normal. The same conditional result appears here:

$$\theta \mid \sigma^2,x \sim N\left(\bar{x},\frac{\sigma^2}{n}\right).$$

For each possible value of $\sigma^2$, uncertainty about $\theta$ is centered at the sample mean.

Larger $\sigma^2$ gives a wider conditional posterior for $\theta$. Smaller $\sigma^2$ gives a narrower one.

Let

$$s^2 = \frac{1}{n-1} \sum_{i=1}^n (x_i-\bar{x})^2.$$

Under the standard non-informative prior,

$$\sigma^2 \mid x \sim \mathrm{ScaleInvChiSq}(n-1,s^2).$$

The degrees of freedom are $n-1$, matching the usual sample variance calculation.

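The joint posterior can be simulated in two stages: draw $\sigma^2$ from its marginal posterior (steps 1 and 2 below), then draw $\theta$ given that value of $\sigma^2$ (step 3). This is often easier than manipulating the full joint density directly.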

Repeat the following:

  1. Draw

    $$U \sim \chi^2_{n-1}.$$
  2. Set

    $$\sigma^2=\frac{(n-1)s^2}{U}.$$
  3. Draw

    $$\theta \sim N\left(\bar{x},\frac{\sigma^2}{n}\right).$$

The paired draws $(\theta,\sigma^2)$ are draws from the joint posterior.
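The three steps above can be sketched in Python with the standard library alone. This is a minimal illustration rather than a reference implementation: the function name `sample_joint_posterior`, the seed, and the example data are assumptions made for this sketch. It relies on the fact that a $\chi^2_\nu$ variable is a Gamma variable with shape $\nu/2$ and scale 2.

```python
import math
import random
import statistics


def sample_joint_posterior(x, n_draws, seed=0):
    """Draw (theta, sigma^2) pairs under the prior p(theta, sigma^2) ∝ 1/sigma^2."""
    rng = random.Random(seed)
    n = len(x)
    xbar = statistics.fmean(x)
    s2 = statistics.variance(x)  # sample variance, divides by n - 1

    draws = []
    for _ in range(n_draws):
        # Step 1: U ~ chi-square with n - 1 degrees of freedom,
        # generated as Gamma(shape=(n - 1) / 2, scale=2).
        u = rng.gammavariate((n - 1) / 2, 2.0)
        # Step 2: turn U into a draw of sigma^2.
        sigma2 = (n - 1) * s2 / u
        # Step 3: draw theta given sigma^2 (gauss takes a standard deviation).
        theta = rng.gauss(xbar, math.sqrt(sigma2 / n))
        draws.append((theta, sigma2))
    return draws


# Hypothetical data, used only to exercise the function.
example_data = [4.1, 5.3, 4.8, 6.0, 5.2]
posterior_draws = sample_joint_posterior(example_data, n_draws=20000)
theta_draws = [theta for theta, _ in posterior_draws]
```

Summaries such as posterior means or intervals for $\theta$ can then be computed directly from `theta_draws`, without ever writing down the marginal density.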

After averaging over uncertainty in $\sigma^2$, the marginal posterior for $\theta$ is a Student $t$ distribution:

$$\theta \mid x \sim t_{n-1}\left(\bar{x},\frac{s^2}{n}\right).$$

If $\sigma^2$ were known, the posterior for $\theta$ would be Normal.

When $\sigma^2$ is unknown, extra uncertainty remains. The $t$ distribution has heavier tails than the Normal, which reflects this additional variance uncertainty.

As $n$ grows, $\sigma^2$ is estimated more precisely and the $t$ distribution becomes closer to a Normal distribution.
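The heavier tails can be seen numerically. The sketch below estimates the probability that a standardized $t$ variable lies more than a chosen cutoff from zero and compares it with the exact Normal value. The function names, the cutoff of 3, and the degrees of freedom tried are illustrative assumptions. The $t$ draws are built as a standard Normal divided by the square root of an independent $\chi^2_\nu/\nu$.

```python
import math
import random


def tail_fraction_t(df, cutoff, n_draws, seed=0):
    """Monte Carlo estimate of P(|T| > cutoff) for a standard t with df degrees of freedom."""
    rng = random.Random(seed)
    count = 0
    for _ in range(n_draws):
        z = rng.gauss(0.0, 1.0)
        u = rng.gammavariate(df / 2, 2.0)  # chi-square with df degrees of freedom
        t = z / math.sqrt(u / df)
        if abs(t) > cutoff:
            count += 1
    return count / n_draws


def tail_fraction_normal(cutoff):
    """Exact P(|Z| > cutoff) for a standard Normal, via the complementary error function."""
    return math.erfc(cutoff / math.sqrt(2))


small_n_tail = tail_fraction_t(df=4, cutoff=3.0, n_draws=50000)    # like n = 5
large_n_tail = tail_fraction_t(df=100, cutoff=3.0, n_draws=50000)  # like n = 101
normal_tail = tail_fraction_normal(3.0)
```

With few degrees of freedom the tail probability is many times the Normal value, and with many degrees of freedom the two are close.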

Conjugate Normal-Inverse-Chi-Squared Prior

The non-informative prior is useful, but sometimes we have real prior information about both the mean and the variance.

A conjugate prior is

$$\theta \mid \sigma^2 \sim N\left(\mu_0,\frac{\sigma^2}{\kappa_0}\right),$$

and

$$\sigma^2 \sim \mathrm{ScaleInvChiSq}(\nu_0,\sigma_0^2).$$

The hyperparameters have useful interpretations:

  • $\mu_0$ is the prior center for the mean;
  • $\kappa_0$ is the prior strength for the mean;
  • $\nu_0$ is the prior degrees of freedom for the variance;
  • $\sigma_0^2$ is the prior scale for the variance.

The posterior has the same family. From here on the data are written $y$, with sample mean $\bar{y}$:

$$\theta \mid \sigma^2,y \sim N\left(\mu_n,\frac{\sigma^2}{\kappa_n}\right),$$

and

$$\sigma^2 \mid y \sim \mathrm{ScaleInvChiSq}(\nu_n,\sigma_n^2).$$

The updated parameters are

$$\kappa_n=\kappa_0+n, \qquad \nu_n=\nu_0+n,$$

and

$$\mu_n = \frac{\kappa_0}{\kappa_0+n}\mu_0 + \frac{n}{\kappa_0+n}\bar{y}.$$

The updated variance scale satisfies

$$\nu_n\sigma_n^2 = \nu_0\sigma_0^2 + (n-1)s^2 + \frac{\kappa_0 n}{\kappa_0+n}(\bar{y}-\mu_0)^2.$$

The posterior mean $\mu_n$ is a weighted average of the prior mean and the sample mean.

The variance update has three contributions:

  1. prior variance information;
  2. within-sample variation;
  3. disagreement between the prior mean and the sample mean.

The marginal posterior for the mean is

$$\theta \mid y \sim t_{\nu_n}\left(\mu_n,\frac{\sigma_n^2}{\kappa_n}\right).$$
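The update formulas translate directly into a few lines of code. The sketch below is a plain transcription of the equations in this section. The function name `nix_update`, the argument names, and the example numbers are assumptions chosen for illustration.

```python
def nix_update(mu0, kappa0, nu0, sigma0_sq, y):
    """Return (mu_n, kappa_n, nu_n, sigma_n_sq) for the Normal-inverse-chi-squared prior."""
    n = len(y)
    ybar = sum(y) / n
    # Sum of squared deviations, which equals (n - 1) * s^2.
    sum_sq = sum((yi - ybar) ** 2 for yi in y)

    kappa_n = kappa0 + n
    nu_n = nu0 + n
    # Weighted average of the prior mean and the sample mean.
    mu_n = (kappa0 * mu0 + n * ybar) / kappa_n
    # Three contributions: prior scale, within-sample variation,
    # and disagreement between the prior mean and the sample mean.
    nu_sigma_sq = (
        nu0 * sigma0_sq
        + sum_sq
        + (kappa0 * n / kappa_n) * (ybar - mu0) ** 2
    )
    sigma_n_sq = nu_sigma_sq / nu_n
    return mu_n, kappa_n, nu_n, sigma_n_sq


# Hypothetical prior and data, used only to exercise the function.
mu_n, kappa_n, nu_n, sigma_n_sq = nix_update(
    mu0=0.0, kappa0=1.0, nu0=1.0, sigma0_sq=1.0, y=[1.0, 2.0, 3.0]
)
```

In this example the prior counts as one observation ($\kappa_0=1$) against three data points, so $\mu_n$ sits three quarters of the way from $\mu_0$ to $\bar{y}$.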

The multinomial model handles counts in several categories.

Suppose there are $K$ categories and observed counts

$$y=(y_1,\ldots,y_K), \qquad \sum_{k=1}^K y_k=n.$$

Let

$$\theta=(\theta_1,\ldots,\theta_K)$$

be the category probabilities, where

$$\sum_{k=1}^K \theta_k=1.$$

Ignoring constants that do not depend on $\theta$, the likelihood is

$$p(y \mid \theta) \propto \prod_{k=1}^K \theta_k^{y_k}.$$

The Dirichlet distribution is a distribution over probability vectors. It is the multivariate analogue of the Beta distribution.

Write

$$\theta \sim \mathrm{Dirichlet}(\alpha_1,\ldots,\alpha_K).$$

Its density kernel is

$$p(\theta) \propto \prod_{k=1}^K \theta_k^{\alpha_k-1}.$$

The prior mean for category $k$ is

$$E(\theta_k) = \frac{\alpha_k}{\sum_{j=1}^K \alpha_j}.$$

The total

$$\sum_{j=1}^K \alpha_j$$

controls how concentrated the prior is.

Multiply likelihood and prior:

$$p(\theta \mid y) \propto \prod_{k=1}^K \theta_k^{y_k} \prod_{k=1}^K \theta_k^{\alpha_k-1}.$$

Collect powers:

$$p(\theta \mid y) \propto \prod_{k=1}^K \theta_k^{\alpha_k+y_k-1}.$$

Therefore

$$\theta \mid y \sim \mathrm{Dirichlet}(\alpha_1+y_1,\ldots,\alpha_K+y_K).$$

Each category's parameter is updated using only that category's count:

$$\alpha_k \quad \longrightarrow \quad \alpha_k+y_k.$$

The prior count for a category is increased by the observed count in that category.

To simulate

$$z \sim \mathrm{Dirichlet}(\alpha_1,\ldots,\alpha_K),$$

use independent Gamma draws:

  1. Draw

    $$x_k \sim \mathrm{Gamma}(\alpha_k,1), \qquad k=1,\ldots,K.$$
  2. Normalize:

    $$z_k = \frac{x_k}{\sum_{j=1}^K x_j}.$$

Then

$$z=(z_1,\ldots,z_K)$$

has the desired Dirichlet distribution.

The Gamma draws create positive category weights. Dividing by their sum turns those weights into probabilities that add to one.
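The Gamma-then-normalize recipe can be sketched with Python's standard `random` module, whose `gammavariate(alpha, beta)` takes a shape and a scale. The function name `sample_dirichlet`, the seed, and the example parameters are assumptions made for this illustration.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one probability vector from Dirichlet(alpha) via normalized Gamma draws."""
    # Step 1: independent Gamma(alpha_k, 1) weights, all positive.
    weights = [rng.gammavariate(a, 1.0) for a in alpha]
    # Step 2: divide by the total so the entries sum to one.
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical parameters, used only to exercise the function.
rng = random.Random(1)
example_alpha = [2.0, 3.0, 5.0]
dirichlet_draws = [sample_dirichlet(example_alpha, rng) for _ in range(20000)]

# Monte Carlo estimate of E(z_1); the exact value is 2 / (2 + 3 + 5) = 0.2.
first_component_mean = sum(z[0] for z in dirichlet_draws) / len(dirichlet_draws)
```

Averaging many draws recovers the Dirichlet mean $\alpha_k / \sum_j \alpha_j$ up to simulation error, which is a quick sanity check on the sampler.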

A survey of 513 smartphone owners gives:

| Category | Count |
| --- | --- |
| iPhone | 180 |
| Android | 230 |
| Windows | 62 |
| Other | 41 |

An older survey suggested shares of 30%, 30%, 20%, and 20%. Represent that prior as a 50-person pseudo-survey:

$$\alpha=(15,15,10,10).$$

The posterior is

$$\theta \mid y \sim \mathrm{Dirichlet}(195,245,72,51).$$

The interpretation is direct: prior pseudo-counts plus observed counts.
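The same arithmetic can be written out in a few lines. The sketch below uses the survey counts and prior pseudo-counts given above. The variable names are illustrative. The posterior means apply the Dirichlet mean formula $\alpha_k / \sum_j \alpha_j$ to the updated parameters.

```python
# Observed survey counts and prior pseudo-counts from the example.
counts = {"iPhone": 180, "Android": 230, "Windows": 62, "Other": 41}
prior = {"iPhone": 15, "Android": 15, "Windows": 10, "Other": 10}

# Conjugate update: add each observed count to its prior pseudo-count.
posterior = {category: prior[category] + counts[category] for category in counts}

# Posterior mean share for each category.
posterior_total = sum(posterior.values())
posterior_mean = {
    category: alpha / posterior_total for category, alpha in posterior.items()
}

for category, share in posterior_mean.items():
    print(f"{category}: posterior alpha = {posterior[category]}, mean share = {share:.3f}")
```

Because the 513 observed responses outweigh the 50 pseudo-responses, the posterior mean shares sit much closer to the observed proportions than to the prior shares.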

Now each observation is a vector:

$$y_i=(y_{i1},\ldots,y_{ip}).$$

Assume

$$y_1,\ldots,y_n \overset{\mathrm{iid}}{\sim} N_p(\mu,\Sigma),$$

where $\Sigma$ is known and $\mu$ is unknown.

The goal is to learn the vector mean $\mu$.

Let

$$\bar{y} = \frac{1}{n}\sum_{i=1}^n y_i.$$

As a function of $\mu$, the likelihood has the same shape as a multivariate Normal density centered at $\bar{y}$:

$$p(y_1,\ldots,y_n \mid \mu,\Sigma) \propto \exp\left[ -\frac{1}{2} \sum_{i=1}^n (y_i-\mu)^\mathsf{T}\Sigma^{-1}(y_i-\mu) \right].$$

The matrix $\Sigma^{-1}$ is the data precision matrix for one observation.

Use the conjugate prior

$$\mu \sim N_p(\mu_0,\Lambda_0).$$

Here $\mu_0$ is the prior mean vector and $\Lambda_0$ is the prior covariance matrix.

The posterior is

$$\mu \mid y \sim N_p(\mu_n,\Lambda_n).$$

The posterior precision matrix is

$$\Lambda_n^{-1} = \Lambda_0^{-1} + n\Sigma^{-1}.$$

The posterior mean is

$$\mu_n = \Lambda_n \left( \Lambda_0^{-1}\mu_0 + n\Sigma^{-1}\bar{y} \right).$$

This is the vector version of the one-dimensional Normal update.

The posterior precision equals:

$$\text{prior precision} + \text{data precision}.$$

The posterior mean combines:

  • prior information, $\Lambda_0^{-1}\mu_0$;
  • data information, $n\Sigma^{-1}\bar{y}$.

If the prior is made very weak, then $\Lambda_0^{-1}$ approaches zero and

$$\mu \mid y \sim N_p\left(\bar{y},\frac{\Sigma}{n}\right).$$

So the posterior centers on the sample mean vector with covariance shrinking at rate $1/n$.
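The precision-weighted update can be sketched for the two-dimensional case ($p=2$) in plain Python, using small hand-written 2×2 matrix helpers so that no external library is needed. All function names and the example numbers are assumptions made for this illustration. In practice a linear algebra library would usually replace the helpers.

```python
def inv2(m):
    """Inverse of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def add2(a, b):
    """Elementwise sum of two 2x2 matrices."""
    return [[a[i][j] + b[i][j] for j in range(2)] for i in range(2)]


def scale2(k, m):
    """Multiply a 2x2 matrix by the scalar k."""
    return [[k * m[i][j] for j in range(2)] for i in range(2)]


def matvec2(m, v):
    """Product of a 2x2 matrix and a length-2 vector."""
    return [m[i][0] * v[0] + m[i][1] * v[1] for i in range(2)]


def mvn_mean_update(mu0, lambda0, sigma, ybar, n):
    """Posterior mean and covariance of mu when the data covariance sigma is known."""
    prior_precision = inv2(lambda0)
    data_precision = scale2(n, inv2(sigma))
    # Posterior precision = prior precision + data precision.
    lambda_n = inv2(add2(prior_precision, data_precision))
    prior_term = matvec2(prior_precision, mu0)
    data_term = matvec2(data_precision, ybar)
    combined = [prior_term[0] + data_term[0], prior_term[1] + data_term[1]]
    mu_n = matvec2(lambda_n, combined)
    return mu_n, lambda_n


# Hypothetical inputs with diagonal covariances, so each coordinate
# can be checked against the one-dimensional Normal update by hand.
mu_n, lambda_n = mvn_mean_update(
    mu0=[0.0, 0.0],
    lambda0=[[1.0, 0.0], [0.0, 1.0]],
    sigma=[[1.0, 0.0], [0.0, 4.0]],
    ybar=[2.0, 4.0],
    n=4,
)

# A very diffuse prior: the posterior mean should be close to ybar.
weak_mu_n, weak_lambda_n = mvn_mean_update(
    mu0=[0.0, 0.0],
    lambda0=[[1e9, 0.0], [0.0, 1e9]],
    sigma=[[1.0, 0.0], [0.0, 4.0]],
    ybar=[2.0, 4.0],
    n=4,
)
```

With diagonal matrices the two coordinates update separately, which makes the link to the one-dimensional precision-weighted average easy to verify. With the diffuse prior the result approaches $\bar{y}$ and $\Sigma/n$.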

  1. What does it mean to marginalize over a nuisance parameter?
  2. Why does the Normal mean have a $t$ posterior when $\sigma^2$ is unknown?
  3. In the Normal-inverse-chi-squared update, what does $\kappa_0$ control?
  4. How does the Dirichlet prior update after multinomial counts are observed?
  5. How is the multivariate Normal update similar to the one-dimensional Normal update?

Multiparameter Bayesian models use joint posterior distributions. When one parameter is not the main target, it is removed by marginalization. In the Normal model with unknown variance, uncertainty about $\sigma^2$ makes the marginal posterior for the mean a Student $t$ distribution. The conjugate Normal-inverse-chi-squared prior adds prior information about both mean and variance. For categorical counts, the Dirichlet prior updates by adding observed counts to prior pseudo-counts. For a multivariate Normal model with known covariance, the vector mean update is the same precision-weighted idea as in the one-dimensional Normal model.