Multiparameter Models

Source: Lecture 3 — Multiparameter models (BDA Ch. 3).

The Main Change

Earlier chapters mostly used one unknown parameter:

\theta.

Many real models have several unknown quantities. A Normal model may have both a mean and a variance. A multinomial model has several category probabilities. A multivariate Normal model has a vector mean.

The main Bayesian idea does not change:

p(\theta \mid x) \propto p(x \mid \theta)p(\theta).

What changes is that $\theta$ may now be a vector.

Joint and Marginal Posteriors

Step 1: Keep the Parameters Together

Suppose

\theta=(\theta_1,\theta_2).

The Bayesian update gives a joint posterior:

p(\theta_1,\theta_2 \mid x) \propto p(x \mid \theta_1,\theta_2)p(\theta_1,\theta_2).

This joint distribution describes uncertainty about both parameters and their dependence after seeing the data.

Step 2: Focus on One Parameter

Often one parameter is the main target and the other is a nuisance parameter.

If $\theta_1$ is the parameter of interest, remove $\theta_2$ by integration:

p(\theta_1 \mid x) = \int p(\theta_1,\theta_2 \mid x)\,d\theta_2.

This is called marginalization.

Step 3: Read the Conditional Form

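Factoring the joint posterior as $p(\theta_1,\theta_2 \mid x) = p(\theta_1 \mid \theta_2,x)\,p(\theta_2 \mid x)$, the same marginal posterior can be written as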

p(\theta_1 \mid x) = \int p(\theta_1 \mid \theta_2,x)p(\theta_2 \mid x)\,d\theta_2.

This says:

condition on a possible value of $\theta_2$;
describe uncertainty about $\theta_1$ at that value;
average over posterior uncertainty in $\theta_2$.

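This interpretation is central to Bayesian computation: to simulate from the marginal posterior of $\theta_1$, draw $\theta_2$ from $p(\theta_2 \mid x)$ and then draw $\theta_1$ from $p(\theta_1 \mid \theta_2,x)$.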
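As a minimal sketch of Steps 2 and 3, the Python snippet below uses a made-up joint posterior on a 3-by-3 discrete grid. The numbers are hypothetical and chosen only so that they sum to one. Sums stand in for the integrals, and the two routes to the marginal of $\theta_1$ give the same answer.

```python
# Toy joint posterior p(theta1, theta2 | x) on a discrete grid (hypothetical values).
# Rows index theta1 values, columns index theta2 values.
joint = [
    [0.10, 0.05, 0.05],
    [0.20, 0.25, 0.05],
    [0.05, 0.10, 0.15],
]
n_rows, n_cols = len(joint), len(joint[0])

# Step 2: marginalize by summing the joint over theta2.
marg_theta1 = [sum(row) for row in joint]

# Step 3: the conditional form.
# First the marginal of theta2, then p(theta1 | theta2, x) column by column.
marg_theta2 = [sum(joint[i][j] for i in range(n_rows)) for j in range(n_cols)]
cond_theta1 = [
    [joint[i][j] / marg_theta2[j] for j in range(n_cols)]
    for i in range(n_rows)
]

# Average the conditionals over the posterior of theta2.
via_conditional = [
    sum(cond_theta1[i][j] * marg_theta2[j] for j in range(n_cols))
    for i in range(n_rows)
]

print(marg_theta1)
print(via_conditional)
```

Both printed lists agree up to floating-point rounding, which is the discrete version of the identity above.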

Normal Model with Unknown Variance

What Problem Are We Solving?

Previously, the Normal variance $\sigma^2$ was known. Now both the mean and the variance are unknown:

X_1,\ldots,X_n \mid \theta,\sigma^2 \overset{\mathrm{iid}}{\sim} N(\theta,\sigma^2).

The two unknowns are:

$\theta$: the population mean;
$\sigma^2$: the population variance.

A Standard Non-Informative Prior

A common prior for this model is

p(\theta,\sigma^2) \propto \frac{1}{\sigma^2}.

This prior is improper, but it leads to a proper posterior as long as there are at least two observations and they are not all identical.

It can be understood as flat in $\theta$ and flat in $\log\sigma$.

Conditional Posterior for the Mean

If $\sigma^2$ were known, the posterior for $\theta$ would be Normal. The same conditional result appears here:

\theta \mid \sigma^2,x \sim N\left(\bar{x},\frac{\sigma^2}{n}\right).

How to Read It

For each possible value of $\sigma^2$, uncertainty about $\theta$ is centered at the sample mean.

Larger $\sigma^2$ gives a wider conditional posterior for $\theta$. Smaller $\sigma^2$ gives a narrower one.

Posterior for the Variance

Let

s^2 = \frac{1}{n-1} \sum_{i=1}^n (x_i-\bar{x})^2.

Under the standard non-informative prior,

\sigma^2 \mid x \sim \mathrm{ScaleInvChiSq}(n-1,s^2).

The degrees of freedom are $n-1$, matching the divisor in the sample variance $s^2$.

The Simulation Algorithm

Purpose

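The joint posterior can be simulated in two simple steps: draw $\sigma^2$ from its marginal posterior, then draw $\theta$ from its conditional posterior given $\sigma^2$. This is often easier than manipulating the full joint density directly.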

Algorithm

Repeat the following:

Draw $U \sim \chi^2_{n-1}$.
Set $\sigma^2=\frac{(n-1)s^2}{U}$.
Draw $\theta \sim N\left(\bar{x},\frac{\sigma^2}{n}\right)$.

The paired draws $(\theta,\sigma^2)$ are draws from the joint posterior.
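The algorithm can be sketched in Python using only the standard library. This is an illustration under stated assumptions, not a reference implementation: the function name `simulate_joint_posterior` and the data values are hypothetical. The chi-squared draw uses the fact that $\chi^2_\nu$ is a Gamma distribution with shape $\nu/2$ and scale 2.

```python
import random
import statistics

def simulate_joint_posterior(x, n_draws, seed=0):
    """Draw (theta, sigma2) pairs under the prior p(theta, sigma2) proportional to 1/sigma2."""
    rng = random.Random(seed)
    n = len(x)
    xbar = statistics.fmean(x)
    s2 = statistics.variance(x)  # sample variance with divisor n - 1
    draws = []
    for _ in range(n_draws):
        # U ~ chi-squared with n - 1 degrees of freedom, as Gamma(shape=(n-1)/2, scale=2)
        u = rng.gammavariate((n - 1) / 2, 2.0)
        sigma2 = (n - 1) * s2 / u
        # theta | sigma2, x ~ N(xbar, sigma2 / n); gauss takes a standard deviation
        theta = rng.gauss(xbar, (sigma2 / n) ** 0.5)
        draws.append((theta, sigma2))
    return draws

x = [4.9, 5.6, 5.1, 6.2, 4.4, 5.8, 5.3, 5.0]  # hypothetical data
draws = simulate_joint_posterior(x, n_draws=20000)
theta_mean = statistics.fmean(t for t, _ in draws)
print(theta_mean, statistics.fmean(x))
```

Keeping each $\theta$ paired with the $\sigma^2$ that generated it is what makes the output a sample from the joint posterior rather than from two separate marginals.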

Marginal Posterior for the Mean

After averaging over uncertainty in $\sigma^2$, the marginal posterior for $\theta$ is a Student $t$ distribution:

\theta \mid x \sim t_{n-1}\left(\bar{x},\frac{s^2}{n}\right).

Why a t Distribution Appears

If $\sigma^2$ were known, the posterior for $\theta$ would be Normal.

When $\sigma^2$ is unknown, extra uncertainty remains. The $t$ distribution has heavier tails than the Normal, which reflects this additional variance uncertainty.

As $n$ grows, $\sigma^2$ is estimated more precisely and the $t$ distribution becomes closer to a Normal distribution.
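A small simulation can make the heavier tails visible. The sketch below assumes hypothetical summary statistics ($n=10$, $\bar{x}=0$, $s^2=1$) and compares the variance of simulated $\theta$ draws with two benchmarks: the plug-in Normal variance $s^2/n$, and the variance of a $t_\nu$ distribution with squared scale $s^2/n$, which is $\frac{s^2}{n}\cdot\frac{\nu}{\nu-2}$ for $\nu>2$.

```python
import random
import statistics

rng = random.Random(1)
n, xbar, s2 = 10, 0.0, 1.0  # hypothetical summary statistics
nu = n - 1

theta_draws = []
for _ in range(100_000):
    # sigma2 | x via a chi-squared draw (Gamma with shape nu/2 and scale 2)
    sigma2 = nu * s2 / rng.gammavariate(nu / 2, 2.0)
    # theta | sigma2, x
    theta_draws.append(rng.gauss(xbar, (sigma2 / n) ** 0.5))

sim_var = statistics.pvariance(theta_draws)
plug_in_var = s2 / n               # variance if sigma2 were known to equal s2
t_var = (s2 / n) * nu / (nu - 2)   # variance of the t marginal, valid for nu > 2

print(sim_var, plug_in_var, t_var)
```

The simulated variance should sit close to the $t$ value and clearly above the plug-in Normal value. The gap shrinks as $n$ grows because $\nu/(\nu-2)$ tends to one.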

Conjugate Normal-Inverse-Chi-Squared Prior

What Problem Does It Solve?

The non-informative prior is useful, but sometimes we have real prior information about both the mean and the variance.

A conjugate prior is

\theta \mid \sigma^2 \sim N\left(\mu_0,\frac{\sigma^2}{\kappa_0}\right),

and

\sigma^2 \sim \mathrm{ScaleInvChiSq}(\nu_0,\sigma_0^2).

The hyperparameters have useful interpretations:

$\mu_0$ is the prior center for the mean;
$\kappa_0$ is the prior strength for the mean;
$\nu_0$ is the prior degrees of freedom for the variance;
$\sigma_0^2$ is the prior scale for the variance.

The Conjugate Update

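The posterior has the same family. From here on the observed data are written $y$ (with sample mean $\bar{y}$) rather than $x$; $s^2$ is the sample variance as before: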

\theta \mid \sigma^2,y \sim N\left(\mu_n,\frac{\sigma^2}{\kappa_n}\right),

and

\sigma^2 \mid y \sim \mathrm{ScaleInvChiSq}(\nu_n,\sigma_n^2).

The updated parameters are

\kappa_n=\kappa_0+n, \qquad \nu_n=\nu_0+n,

and

\mu_n = \frac{\kappa_0}{\kappa_0+n}\mu_0 + \frac{n}{\kappa_0+n}\bar{y}.

The updated variance scale satisfies

\nu_n\sigma_n^2 = \nu_0\sigma_0^2 + (n-1)s^2 + \frac{\kappa_0 n}{\kappa_0+n}(\bar{y}-\mu_0)^2.

How to Read the Update

The posterior mean $\mu_n$ is a weighted average of the prior mean and the sample mean.

The variance update has three contributions:

prior variance information;
within-sample variation;
disagreement between the prior mean and the sample mean.

The marginal posterior for the mean is

\theta \mid y \sim t_{\nu_n}\left(\mu_n,\frac{\sigma_n^2}{\kappa_n}\right).
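The update formulas translate directly into a few lines of Python. The function name `nix_update` and the example inputs are hypothetical; the arithmetic follows the four update equations above.

```python
import statistics

def nix_update(y, mu0, kappa0, nu0, sigma0_sq):
    """Return (mu_n, kappa_n, nu_n, sigma_n_sq) for the Normal-inverse-chi-squared prior."""
    n = len(y)
    ybar = statistics.fmean(y)
    # Sum of squared deviations, which equals (n - 1) * s^2
    ss = sum((yi - ybar) ** 2 for yi in y)

    kappa_n = kappa0 + n
    nu_n = nu0 + n
    # Weighted average of the prior mean and the sample mean
    mu_n = (kappa0 * mu0 + n * ybar) / kappa_n
    # Prior variance information + within-sample variation + prior/data disagreement
    nu_sigma_n_sq = (
        nu0 * sigma0_sq
        + ss
        + (kappa0 * n / kappa_n) * (ybar - mu0) ** 2
    )
    return mu_n, kappa_n, nu_n, nu_sigma_n_sq / nu_n

# Hypothetical data and hyperparameters
result = nix_update([1.0, 2.0, 3.0], mu0=0.0, kappa0=1, nu0=1, sigma0_sq=1.0)
print(result)
```

With these inputs, $\bar{y}=2$ and $(n-1)s^2=2$, so the update gives $\mu_n=1.5$, $\kappa_n=4$, $\nu_n=4$, and $\sigma_n^2=(1+2+3)/4=1.5$. The third term contributes as much as the other two combined because the prior mean sits far from the sample mean.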

Multinomial Model

What Problem Are We Solving?

The multinomial model handles counts in several categories.

Suppose there are $K$ categories and observed counts

y=(y_1,\ldots,y_K), \qquad \sum_{k=1}^K y_k=n.

Let

\theta=(\theta_1,\ldots,\theta_K)

be the category probabilities, where

\sum_{k=1}^K \theta_k=1.

Ignoring constants that do not depend on $\theta$, the likelihood is

p(y \mid \theta) \propto \prod_{k=1}^K \theta_k^{y_k}.

Dirichlet Prior

Why This Prior?

The Dirichlet distribution is a distribution over probability vectors. It is the multivariate analogue of the Beta distribution.

Write

\theta \sim \mathrm{Dirichlet}(\alpha_1,\ldots,\alpha_K).

Its density kernel is

p(\theta) \propto \prod_{k=1}^K \theta_k^{\alpha_k-1}.

The prior mean for category $k$ is

E(\theta_k) = \frac{\alpha_k}{\sum_{j=1}^K \alpha_j}.

The total

\sum_{j=1}^K \alpha_j

controls how concentrated the prior is: it acts like a prior sample size, so a larger total places the prior more tightly around its mean.
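To see the effect of the total, the sketch below compares two hypothetical Dirichlet priors with the same mean but different totals. It relies on the standard Dirichlet variance formula $\mathrm{Var}(\theta_k) = \frac{m_k(1-m_k)}{\alpha_0+1}$, where $m_k$ is the prior mean and $\alpha_0=\sum_j \alpha_j$; this formula is not stated in the notes above and is brought in here only for illustration.

```python
def dirichlet_mean_sd(alpha):
    """Prior means and standard deviations of each component of a Dirichlet(alpha)."""
    a0 = sum(alpha)
    means = [a / a0 for a in alpha]
    sds = [(m * (1 - m) / (a0 + 1)) ** 0.5 for m in means]
    return means, sds

# Same proportions, different totals (hypothetical values)
weak_means, weak_sds = dirichlet_mean_sd([3, 3, 2, 2])        # total 10
strong_means, strong_sds = dirichlet_mean_sd([15, 15, 10, 10])  # total 50

print(weak_means, weak_sds)
print(strong_means, strong_sds)
```

The means match in both cases, while the standard deviations are smaller for the prior with the larger total.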

The Dirichlet-Multinomial Update

Multiply likelihood and prior:

p(\theta \mid y) \propto \prod_{k=1}^K \theta_k^{y_k} \prod_{k=1}^K \theta_k^{\alpha_k-1}.

Collect powers:

p(\theta \mid y) \propto \prod_{k=1}^K \theta_k^{\alpha_k+y_k-1}.

Therefore

\theta \mid y \sim \mathrm{Dirichlet}(\alpha_1+y_1,\ldots,\alpha_K+y_K).

How to Remember It

Each entry of the parameter vector is updated by its own category's count:

\alpha_k \quad \longrightarrow \quad \alpha_k+y_k.

The prior count for a category is increased by the observed count in that category.

Simulating from a Dirichlet Distribution

The Algorithm

To simulate

z \sim \mathrm{Dirichlet}(\alpha_1,\ldots,\alpha_K),

use independent Gamma draws:

Draw $x_k \sim \mathrm{Gamma}(\alpha_k,1)$ independently for $k=1,\ldots,K$.
Normalize: $z_k = \frac{x_k}{\sum_{j=1}^K x_j}$.

Then

z=(z_1,\ldots,z_K)

has the desired Dirichlet distribution.

Why It Works Intuitively

The Gamma draws create positive category weights. Dividing by their sum turns those weights into probabilities that add to one.
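The two steps can be sketched with Python's standard library, where `random.gammavariate(shape, scale)` supplies the Gamma draws. The helper name `sample_dirichlet` and the parameter vector are hypothetical.

```python
import random

def sample_dirichlet(alpha, rng):
    """One draw from Dirichlet(alpha) by normalizing independent Gamma(alpha_k, 1) draws."""
    gammas = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(gammas)
    return [g / total for g in gammas]

rng = random.Random(42)
alpha = [2.0, 3.0, 5.0]  # hypothetical parameters
draws = [sample_dirichlet(alpha, rng) for _ in range(20000)]

# Monte Carlo estimate of E(z_k), to compare with alpha_k / sum(alpha)
means = [sum(d[k] for d in draws) / len(draws) for k in range(len(alpha))]
print(means)
```

The averaged draws should land near the theoretical means $(0.2, 0.3, 0.5)$, and every individual draw sums to one by construction.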

Example: Market Shares

A survey of 513 smartphone owners gives:

Category	Count
iPhone	180
Android	230
Windows	62
Other	41

An older survey suggested shares of 30%, 30%, 20%, and 20%. Represent that prior as a 50-person pseudo-survey:

\alpha=(15,15,10,10).

The posterior is

\theta \mid y \sim \mathrm{Dirichlet}(195,245,72,51).

The interpretation is direct: prior pseudo-counts plus observed counts.
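The arithmetic of this example is short enough to check in code. The sketch below adds the pseudo-counts to the survey counts and reports the posterior mean share for each category, using the Dirichlet mean formula $\alpha_k / \sum_j \alpha_j$ applied to the updated parameters.

```python
# Prior pseudo-counts and observed survey counts from the example
alpha = {"iPhone": 15, "Android": 15, "Windows": 10, "Other": 10}
counts = {"iPhone": 180, "Android": 230, "Windows": 62, "Other": 41}

# Dirichlet-multinomial update: add each observed count to its pseudo-count
posterior = {k: alpha[k] + counts[k] for k in alpha}

# Posterior mean share per category
total = sum(posterior.values())
post_mean = {k: v / total for k, v in posterior.items()}

print(posterior)
for name, share in post_mean.items():
    print(f"{name}: {share:.3f}")
```

The posterior parameters total 563, which is the 50 pseudo-observations plus the 513 respondents, so the data dominate the prior here.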

Multivariate Normal Model

What Problem Are We Solving?

Now each observation is a vector:

y_i=(y_{i1},\ldots,y_{ip}).

Assume

y_1,\ldots,y_n \overset{\mathrm{iid}}{\sim} N_p(\mu,\Sigma),

where $\Sigma$ is known and $\mu$ is unknown.

The goal is to learn the vector mean $\mu$.

The Likelihood

Let

\bar{y} = \frac{1}{n}\sum_{i=1}^n y_i.

As a function of $\mu$, the likelihood has the same shape as a multivariate Normal density centered at $\bar{y}$:

p(y_1,\ldots,y_n \mid \mu,\Sigma) \propto \exp\left[ -\frac{1}{2} \sum_{i=1}^n (y_i-\mu)^\mathsf{T}\Sigma^{-1}(y_i-\mu) \right].

The matrix $\Sigma^{-1}$ is the data precision matrix for one observation.
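The centering at $\bar{y}$ is not visible in the sum as written. A short worked step, added here as a sketch, makes it explicit: expand each term around $\bar{y}$ and note that the cross terms vanish because $\sum_{i=1}^n (y_i-\bar{y})=0$.

```latex
\sum_{i=1}^n (y_i-\mu)^\mathsf{T}\Sigma^{-1}(y_i-\mu)
  = \sum_{i=1}^n (y_i-\bar{y})^\mathsf{T}\Sigma^{-1}(y_i-\bar{y})
  + n(\bar{y}-\mu)^\mathsf{T}\Sigma^{-1}(\bar{y}-\mu),

p(y_1,\ldots,y_n \mid \mu,\Sigma) \propto
  \exp\left[ -\frac{n}{2} (\mu-\bar{y})^\mathsf{T}\Sigma^{-1}(\mu-\bar{y}) \right]
  \quad \text{as a function of } \mu.
```

The first sum on the right does not involve $\mu$, so it is absorbed into the proportionality constant. What remains is the kernel of $N_p(\bar{y},\Sigma/n)$, which is why the data contribute precision $n\Sigma^{-1}$.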

Normal Prior for the Vector Mean

Use the conjugate prior

\mu \sim N_p(\mu_0,\Lambda_0).

Here $\mu_0$ is the prior mean vector and $\Lambda_0$ is the prior covariance matrix.

The posterior is

\mu \mid y \sim N_p(\mu_n,\Lambda_n).

The posterior precision matrix is

\Lambda_n^{-1} = \Lambda_0^{-1} + n\Sigma^{-1}.

The posterior mean is

\mu_n = \Lambda_n \left( \Lambda_0^{-1}\mu_0 + n\Sigma^{-1}\bar{y} \right).

How to Read the Multivariate Formula

This is the vector version of the one-dimensional Normal update.

The posterior precision equals:

\text{prior precision} + \text{data precision}.

The posterior mean combines:

prior information, $\Lambda_0^{-1}\mu_0$;
data information, $n\Sigma^{-1}\bar{y}$.

If the prior is made very weak, then $\Lambda_0^{-1}$ approaches the zero matrix and the posterior approaches

\mu \mid y \sim N_p\left(\bar{y},\frac{\Sigma}{n}\right).

So the posterior centers on the sample mean vector with covariance shrinking at rate $1/n$.
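The precision-weighted update can be sketched in plain Python for the two-dimensional case. To stay within the standard library, the snippet hand-codes a 2x2 inverse, so it is restricted to $p=2$; a matrix library would be the usual choice for general $p$. The helper names and all numeric inputs are hypothetical.

```python
def inv2(m):
    """Inverse of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def matvec(m, v):
    """Product of a 2x2 matrix and a length-2 vector."""
    return [sum(m[i][j] * v[j] for j in range(2)) for i in range(2)]

def mvn_mean_update(mu0, lambda0, sigma, ybar, n):
    """Posterior mean and covariance of mu when Sigma is known (p = 2 only)."""
    prior_prec = inv2(lambda0)
    data_prec = [[n * e for e in row] for row in inv2(sigma)]
    # Posterior precision = prior precision + data precision
    post_prec = [
        [prior_prec[i][j] + data_prec[i][j] for j in range(2)]
        for i in range(2)
    ]
    lambda_n = inv2(post_prec)
    # Posterior mean = Lambda_n (Lambda_0^{-1} mu_0 + n Sigma^{-1} ybar)
    pm = matvec(prior_prec, mu0)
    dm = matvec(data_prec, ybar)
    mu_n = matvec(lambda_n, [pm[i] + dm[i] for i in range(2)])
    return mu_n, lambda_n

# Hypothetical inputs
mu_n, lambda_n = mvn_mean_update(
    mu0=[0.0, 0.0],
    lambda0=[[4.0, 0.0], [0.0, 4.0]],
    sigma=[[2.0, 0.5], [0.5, 1.0]],
    ybar=[1.0, 2.0],
    n=10,
)
print(mu_n)
print(lambda_n)
```

Making `lambda0` very large (for example `1e8` on the diagonal) is one way to see the weak-prior limit numerically: the returned mean moves toward `ybar` and the covariance toward `sigma` divided by `n`.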

Study Questions

What does it mean to marginalize over a nuisance parameter?
Why does the Normal mean have a $t$ posterior when $\sigma^2$ is unknown?
In the Normal-inverse-chi-squared update, what does $\kappa_0$ control?
How does the Dirichlet prior update after multinomial counts are observed?
How is the multivariate Normal update similar to the one-dimensional Normal update?

Chapter Summary

Multiparameter Bayesian models use joint posterior distributions. When one parameter is not the main target, it is removed by marginalization. In the Normal model with unknown variance, uncertainty about $\sigma^2$ makes the marginal posterior for the mean a Student $t$ distribution. The conjugate Normal-inverse-chi-squared prior adds prior information about both mean and variance. For categorical counts, the Dirichlet prior updates by adding observed counts to prior pseudo-counts. For a multivariate Normal model with known covariance, the vector mean update is the same precision-weighted idea as in the one-dimensional Normal model.