Normal and Poisson Models

Source: Lecture 2 — Normal and Poisson data. Prior elicitation (BDA Ch. 2).

The Bernoulli model handled binary data. This chapter adds two common data types:

  1. Continuous measurements, using the Normal model.
  2. Counts, using the Poisson model.

The Bayesian workflow is the same:

  1. Choose a likelihood.
  2. Choose a prior.
  3. Multiply prior and likelihood.
  4. Read the posterior distribution.

The new idea is that different likelihoods have different convenient conjugate priors.

Suppose the observations are continuous and centered around an unknown mean $\theta$. The observation variance $\sigma^2$ is assumed known.

The model is

$$X_1,\ldots,X_n \mid \theta,\sigma^2 \overset{\mathrm{iid}}{\sim} N(\theta,\sigma^2).$$

The parameter of interest is the mean $\theta$.

For the Normal model, the data enter the likelihood through the sample mean $\bar{x}$.

As a function of $\theta$, the likelihood is proportional to

$$p(x_1,\ldots,x_n \mid \theta,\sigma^2) \propto \exp\left[ -\frac{1}{2(\sigma^2/n)}(\theta-\bar{x})^2 \right].$$

This is the kernel of a Normal density in $\theta$.

It is centered at

$$\bar{x}$$

and has variance

$$\frac{\sigma^2}{n}.$$

So the likelihood favors values of $\theta$ near the sample mean. As $n$ increases, the likelihood becomes narrower.

A flat prior is written

$$p(\theta) \propto c.$$

It gives equal prior weight to all values of $\theta$. This is an improper prior, but in this simple model it leads to a proper posterior when data are observed.

Because the prior is constant, the posterior has the same shape as the likelihood:

$$\theta \mid x_1,\ldots,x_n \sim N\left(\bar{x},\frac{\sigma^2}{n}\right).$$

With a flat prior, the posterior mean is the sample mean and the posterior standard deviation is

$$\frac{\sigma}{\sqrt{n}}.$$

This is the usual standard error, now interpreted as posterior uncertainty about $\theta$.

A Normal prior is natural when prior knowledge says that $\theta$ is probably near some value $\mu_0$, with uncertainty measured by $\tau_0^2$.

Write

$$\theta \sim N(\mu_0,\tau_0^2).$$

This prior is conjugate for the Normal likelihood with known variance.

The posterior is Normal:

$$\theta \mid x \sim N(\mu_n,\tau_n^2).$$

The posterior precision is

$$\frac{1}{\tau_n^2} = \frac{n}{\sigma^2} + \frac{1}{\tau_0^2}.$$

The posterior mean is

$$\mu_n = w\bar{x}+(1-w)\mu_0,$$

where

$$w = \frac{n/\sigma^2}{n/\sigma^2+1/\tau_0^2}.$$

Precision means inverse variance. The posterior precision is:

$$\text{posterior precision} = \text{data precision} + \text{prior precision}.$$

The posterior mean is a weighted average:

  • $\bar{x}$ receives more weight when the data are precise or $n$ is large;
  • $\mu_0$ receives more weight when the prior is precise.

This is the same Bayesian update as before, now expressed on the precision scale.

Bayesian updating can be done one observation at a time.

For three observations,

$$p(\theta \mid x_1,x_2,x_3) \propto p(x_3 \mid \theta)\,p(\theta \mid x_1,x_2).$$

The posterior after the first two observations becomes the prior before the third observation.

This is not a different method. It is the same Bayes’ theorem applied repeatedly.
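The equivalence between one-at-a-time and all-at-once updating can be checked numerically. The following is a minimal sketch for the Normal model with known variance. The function name `normal_update` and all numeric values are hypothetical, chosen only for illustration.

```python
def normal_update(mu0, tau0_sq, xs, sigma_sq):
    """Return the posterior (mean, variance) of a Normal mean.

    Assumes a N(mu0, tau0_sq) prior and observations xs with known
    variance sigma_sq. Works on the precision (inverse variance) scale.
    """
    n = len(xs)
    if n == 0:
        return mu0, tau0_sq
    xbar = sum(xs) / n
    data_prec = n / sigma_sq
    prior_prec = 1.0 / tau0_sq
    post_prec = data_prec + prior_prec
    w = data_prec / post_prec
    return w * xbar + (1 - w) * mu0, 1.0 / post_prec


# Hypothetical data and prior, not taken from the lecture.
xs = [1.2, 0.7, 1.9]
sigma_sq = 1.0
mu0, tau0_sq = 0.0, 4.0

# All three observations at once.
batch = normal_update(mu0, tau0_sq, xs, sigma_sq)

# One observation at a time: each posterior becomes the next prior.
seq = (mu0, tau0_sq)
for x in xs:
    seq = normal_update(seq[0], seq[1], [x], sigma_sq)

print("batch:     ", batch)
print("sequential:", seq)
```

Up to floating-point rounding, the two printed pairs should agree, which is the numerical counterpart of the statement that yesterday's posterior is today's prior.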

Suppose $n=205$ log-wages are modeled as

$$X_i \mid \theta \sim N(\theta,0.4).$$

Use the vague prior

$$\theta \sim N(12,100).$$

The data precision is

$$\frac{n}{\sigma^2} = \frac{205}{0.4} = 512.5.$$

The prior precision is

$$\frac{1}{100} = 0.01.$$

Therefore

$$w = \frac{512.5}{512.5+0.01} \approx 0.99998.$$

The posterior mean is almost entirely determined by the sample mean. The prior is weak compared with the information in 205 observations.
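The arithmetic of the wage example can be reproduced in a few lines. This is only a sketch: the values of $n$, $\sigma^2$, $\mu_0$, and $\tau_0^2$ come from the example, but the sample mean is not reported there, so `xbar_hypothetical` is a made-up placeholder used purely to show how little the prior moves the answer.

```python
# Quantities stated in the wage example.
n = 205
sigma_sq = 0.4            # known observation variance
mu0, tau0_sq = 12.0, 100.0

data_prec = n / sigma_sq          # n / sigma^2
prior_prec = 1.0 / tau0_sq        # 1 / tau_0^2
post_prec = data_prec + prior_prec

w = data_prec / post_prec         # weight on the sample mean
post_sd = (1.0 / post_prec) ** 0.5


def posterior_mean(xbar):
    """Weighted average of the sample mean and the prior mean."""
    return w * xbar + (1 - w) * mu0


# ASSUMPTION: the sample mean is not given in the text; 10.0 is a placeholder.
xbar_hypothetical = 10.0

print("data precision :", data_prec)
print("weight w       :", w)
print("posterior sd   :", post_sd)
print("posterior mean :", posterior_mean(xbar_hypothetical))
```

Even though the placeholder sample mean is two units away from the prior mean, the posterior mean stays very close to the sample mean because $1-w$ is so small.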

The Poisson model is used for counts: number of bids, number of calls, number of defects, or number of events in a fixed exposure period.

Let $\theta$ be the event rate. The model is

$$Y_1,\ldots,Y_n \mid \theta \overset{\mathrm{iid}}{\sim} \mathrm{Pois}(\theta).$$

For one observation,

$$p(y_i \mid \theta) = \frac{\theta^{y_i}e^{-\theta}}{y_i!}.$$

Multiplying over observations gives

$$p(y \mid \theta) \propto \theta^{\sum_{i=1}^n y_i}\exp(-n\theta).$$

The data enter through two quantities:

  • the total count $\sum_i y_i$;
  • the number of observations $n$.

The total count pulls $\theta$ upward. The exposure $n$ appears in $\exp(-n\theta)$ and keeps the rate scaled per observation.

The rate $\theta$ must be positive. The Gamma distribution is flexible on positive values and is conjugate to the Poisson likelihood.

Use the shape-rate parameterization:

$$\theta \sim \mathrm{Gamma}(\alpha,\beta),$$

with kernel

$$p(\theta) \propto \theta^{\alpha-1}\exp(-\beta\theta).$$

The prior mean is

$$E(\theta)=\frac{\alpha}{\beta}.$$

Multiply likelihood and prior:

$$p(\theta \mid y) \propto \theta^{\sum_i y_i}\exp(-n\theta)\, \theta^{\alpha-1}\exp(-\beta\theta).$$

Collect terms:

$$p(\theta \mid y) \propto \theta^{\alpha+\sum_i y_i-1} \exp[-(\beta+n)\theta].$$

Therefore

$$\theta \mid y \sim \mathrm{Gamma}\left(\alpha+\sum_{i=1}^n y_i,\ \beta+n\right).$$

The Gamma-Poisson update is:

$$(\alpha,\beta) \quad \longrightarrow \quad \left(\alpha+\sum_i y_i,\ \beta+n\right).$$

Observed counts add to the shape. Observed exposure adds to the rate.

Suppose $n=1000$ coin auctions have total bid count

$$\sum_{i=1}^{1000} y_i = 3635.$$

Use the prior

$$\theta \sim \mathrm{Gamma}(2,1/2).$$

This prior has mean 4. After observing the data,

$$\theta \mid y \sim \mathrm{Gamma}(3637,1000.5).$$

The posterior mean is

$$E(\theta \mid y) = \frac{3637}{1000.5} \approx 3.635.$$

The posterior is concentrated because the dataset is large. The prior suggested a rate near 4, but 1000 auctions pull the posterior toward the observed average.
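The auction numbers above can be checked with a short sketch of the Gamma-Poisson update in the shape-rate parameterization. The function name `gamma_poisson_update` is hypothetical. The posterior standard deviation uses the standard Gamma variance $\alpha/\beta^2$, which is not stated in the text.

```python
def gamma_poisson_update(alpha, beta, total_count, n):
    """Shape-rate update: counts add to the shape, exposure adds to the rate."""
    return alpha + total_count, beta + n


# Prior and data from the auction example.
alpha0, beta0 = 2.0, 0.5
total_count, n = 3635, 1000

alpha_n, beta_n = gamma_poisson_update(alpha0, beta0, total_count, n)

prior_mean = alpha0 / beta0
sample_mean = total_count / n
post_mean = alpha_n / beta_n
post_sd = alpha_n ** 0.5 / beta_n   # sqrt(alpha / beta^2) for a Gamma(alpha, beta)

print("posterior parameters:", alpha_n, beta_n)
print("prior mean          :", prior_mean)
print("sample mean         :", sample_mean)
print("posterior mean      :", post_mean)
print("posterior sd        :", post_sd)
```

The posterior mean lands between the sample mean and the prior mean, but far closer to the sample mean, because the prior rate parameter 1/2 acts like only half an observation of exposure against 1000 observed auctions.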

If auctions are split by reservation price, the two groups can have very different posterior means. That is a model-building lesson: a single Poisson rate may be too simple when important predictors are ignored.

A posterior distribution is often summarized by an interval containing most of its probability.

A 95% credible interval $[a,b]$ satisfies

$$\Pr(a \le \theta \le b \mid y)=0.95.$$

An equal-tail 95% interval uses the 2.5% and 97.5% posterior quantiles.

It leaves 2.5% posterior probability below the interval and 2.5% above it.

An HPD interval is the shortest interval containing 95% posterior probability.

For symmetric unimodal posteriors, equal-tail and HPD intervals are often the same or very close. For skewed posteriors, such as some Gamma posteriors, they can differ.

When the posterior is approximately Normal,

$$E(\theta \mid y) \pm 1.96\,\mathrm{SD}(\theta \mid y)$$

gives an approximate 95% credible interval.
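As a rough illustration of how equal-tail and HPD intervals can differ for a skewed distribution, the sketch below simulates from $\mathrm{Gamma}(2,1/2)$, the prior used in the auction example, and computes both intervals from sorted draws. The sample-based HPD search (the shortest window of consecutive sorted draws) is a common approximation that assumes a unimodal distribution. The function names, the number of draws, and the seed are illustrative assumptions.

```python
import random

random.seed(0)  # arbitrary seed, only for reproducibility


def tail_count(n_draws, level):
    """Number of draws left out in EACH tail of an equal-tail interval."""
    return round(n_draws * (1 - level) / 2)


def equal_tail(sorted_draws, level=0.95):
    """Cut the same number of draws from the lower and the upper tail."""
    n = len(sorted_draws)
    t = tail_count(n, level)
    return sorted_draws[t], sorted_draws[n - t - 1]


def hpd(sorted_draws, level=0.95):
    """Shortest window holding the same number of draws as equal_tail."""
    n = len(sorted_draws)
    k = n - 2 * tail_count(n, level)          # draws inside the interval
    start = min(range(n - k + 1),
                key=lambda i: sorted_draws[i + k - 1] - sorted_draws[i])
    return sorted_draws[start], sorted_draws[start + k - 1]


# random.gammavariate takes (shape, SCALE); scale = 1 / rate.
shape, rate = 2.0, 0.5
draws = sorted(random.gammavariate(shape, 1.0 / rate) for _ in range(50_000))

et = equal_tail(draws)
hp = hpd(draws)

print("equal-tail:", et, "width", et[1] - et[0])
print("HPD       :", hp, "width", hp[1] - hp[0])
```

Because the equal-tail window is itself one of the candidate windows in the search, the HPD interval found this way can never be wider. For this right-skewed Gamma it shifts toward zero, where the density is higher.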

A prior family is conjugate for a likelihood family if the posterior stays in the same family as the prior.

The main examples so far are:

Likelihood                    | Conjugate prior | Posterior
Bernoulli                     | Beta            | Beta
Normal mean, known variance   | Normal          | Normal
Poisson                       | Gamma           | Gamma
Multinomial                   | Dirichlet       | Dirichlet

Conjugacy gives closed-form updates. This is useful for learning the structure of Bayesian inference and for simple applied models.

But conjugacy is not required for Bayesian inference. When the posterior is not available in closed form, simulation methods can still be used.

Prior elicitation means turning domain knowledge into a probability distribution.

The expert may not think in terms of parameters. The statistician’s job is to ask questions about meaningful quantities and translate the answers into a prior.

Useful questions include:

  • What value seems most plausible?
  • What range would contain the quantity with high probability?
  • Is it plausible that the quantity is below a particular threshold?
  • How surprising would very large or very small values be?

People are affected by anchoring and overconfidence. It helps to show the implied prior distribution back to the expert and check whether its consequences make sense.
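One simple way to turn such answers into a prior, and to show its consequences back to the expert, is sketched below. It assumes the expert's uncertainty is adequately described by a Normal distribution and that the stated range is a central 95% interval. The function name and the numbers (a range of 10 to 14, a threshold of 11) are hypothetical.

```python
from statistics import NormalDist


def normal_prior_from_interval(lower, upper, prob=0.95):
    """Match a Normal prior to a central interval stated by an expert.

    ASSUMPTION: the expert's belief is symmetric and roughly Normal,
    and [lower, upper] is a central interval with probability `prob`.
    """
    z = NormalDist().inv_cdf(0.5 + prob / 2)   # about 1.96 for prob = 0.95
    mu0 = (lower + upper) / 2
    tau0 = (upper - lower) / (2 * z)
    return mu0, tau0


# Hypothetical expert answer: "theta is between 10 and 14 with high probability".
mu0, tau0 = normal_prior_from_interval(10.0, 14.0)
prior = NormalDist(mu0, tau0)

# Consequences to show back to the expert.
p_below_11 = prior.cdf(11.0)
central_50 = (prior.inv_cdf(0.25), prior.inv_cdf(0.75))

print("prior mean, sd       :", mu0, tau0)
print("implied Pr(theta<11) :", p_below_11)
print("implied 50% interval :", central_50)
```

If the implied probability below the threshold or the implied 50% interval surprises the expert, that is a signal to revise the range or to reconsider the Normal shape.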

Sometimes we want an automatic prior that is less tied to a specific parameterization.

Jeffreys’ prior uses the Fisher information:

$$p(\theta) \propto |I(\theta)|^{1/2}.$$

For one parameter,

$$I(\theta) = -E_{x \mid \theta} \left[ \frac{\partial^2}{\partial \theta^2} \log p(x \mid \theta) \right].$$

Jeffreys’ prior is invariant under one-to-one transformations of the parameter.

For Bernoulli observations,

$$I(\theta) = \frac{n}{\theta(1-\theta)}.$$

Therefore

$$p(\theta) \propto \theta^{-1/2}(1-\theta)^{-1/2}.$$

This is the kernel of

$$\mathrm{Beta}(1/2,1/2).$$

The prior puts more density near 0 and 1 than a uniform prior does.
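The Fisher information formula can be checked numerically for a single Bernoulli draw ($n=1$). The sketch below approximates the second derivative of the log probability by a central finite difference and takes the expectation over $x \in \{0,1\}$. The function names and the step size `h` are illustrative choices.

```python
import math


def bernoulli_log_pmf(x, theta):
    """log p(x | theta) for a single Bernoulli observation x in {0, 1}."""
    return x * math.log(theta) + (1 - x) * math.log(1 - theta)


def fisher_information(theta, h=1e-4):
    """Minus the expected second derivative of the log pmf (one draw).

    The second derivative is approximated by a central finite difference.
    """
    info = 0.0
    for x, prob in ((1, theta), (0, 1 - theta)):
        second = (bernoulli_log_pmf(x, theta + h)
                  - 2 * bernoulli_log_pmf(x, theta)
                  + bernoulli_log_pmf(x, theta - h)) / h ** 2
        info -= prob * second
    return info


results = {}
for theta in (0.1, 0.5, 0.9):
    numeric = fisher_information(theta)
    exact = 1 / (theta * (1 - theta))
    kernel = math.sqrt(exact)          # unnormalized Jeffreys density
    results[theta] = (numeric, exact, kernel)
    print(theta, numeric, exact, kernel)
```

The unnormalized Jeffreys density is larger at 0.1 and 0.9 than at 0.5, which is the U-shape of the $\mathrm{Beta}(1/2,1/2)$ kernel.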

Jeffreys’ prior depends on the Fisher information, and the Fisher information can depend on the sampling scheme.

For example, Bernoulli sampling and negative-binomial sampling can give different Jeffreys’ priors even when the likelihoods are proportional after the data are observed.

This is a warning: automatic priors are not free of modeling choices. In multiparameter problems, Jeffreys’ prior can also become difficult or inappropriate.
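The Bernoulli versus negative-binomial contrast can be made concrete with a short calculation. The sketch below assumes the negative-binomial scheme samples until a fixed number $r$ of successes, with $y$ counting the failures observed along the way. This parameterization is an assumption, since the text does not specify one.

```latex
% Negative-binomial sampling: stop at the r-th success, y = number of failures.
\log p(y \mid \theta) = \text{const} + r\log\theta + y\log(1-\theta),
\qquad E(y \mid \theta) = \frac{r(1-\theta)}{\theta}.

% Fisher information:
I(\theta)
  = \frac{r}{\theta^2} + \frac{E(y \mid \theta)}{(1-\theta)^2}
  = \frac{r}{\theta^2} + \frac{r}{\theta(1-\theta)}
  = \frac{r}{\theta^2(1-\theta)}.

% Jeffreys' prior under each sampling scheme:
\text{negative binomial: } p(\theta) \propto \theta^{-1}(1-\theta)^{-1/2},
\qquad
\text{Bernoulli: } p(\theta) \propto \theta^{-1/2}(1-\theta)^{-1/2}.
```

Both schemes give a likelihood proportional to $\theta^{s}(1-\theta)^{f}$ for $s$ successes and $f$ failures, yet the two priors differ, so the resulting posteriors differ as well. Under this parameterization the negative-binomial Jeffreys prior is also improper.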

Bayesian analyses commonly use several kinds of prior information:

  • Expert information from previous studies or domain experience.
  • Weak or vague priors that regularize without dominating the likelihood.
  • Smoothness or shrinkage priors that stabilize complex models.

The right prior depends on the purpose of the analysis. Estimation, prediction, and model comparison can place different demands on the prior.

  1. In the Normal model with known variance, why does the likelihood become narrower as $n$ grows?
  2. How is the Normal posterior mean a weighted average of prior information and data information?
  3. In the Poisson-Gamma update, why does $n$ add to the Gamma rate parameter?
  4. What is the difference between an equal-tail credible interval and an HPD interval?
  5. Why should Jeffreys’ prior be treated with caution?

Normal and Poisson models extend the basic Bayesian update to continuous and count data. With known Normal variance, a Normal prior gives a Normal posterior whose precision is the sum of prior and data precision. For Poisson counts, a Gamma prior gives a Gamma posterior by adding observed counts and exposure. Credible intervals summarize posterior uncertainty, conjugate priors provide closed-form updates, and prior elicitation connects the mathematics to real information. Jeffreys’ prior is useful as an invariant automatic construction, but it still reflects modeling choices.