Introduction to Bayesian Inference

Source: Lecture 1, Introduction and Bernoulli data (BDA Ch. 1, 2.1–2.4).

Bayesian inference starts from a practical problem:

How should our uncertainty change after we see data?

The answer has three pieces:

  1. A model for how data are generated.
  2. A prior distribution for what we believed before seeing the data.
  3. A posterior distribution for what we believe after seeing the data.

The chapter builds these pieces in the simplest useful setting: Bernoulli data, where each observation is either success or failure.

Suppose we observe binary data

$$x_1,\ldots,x_n.$$

Let $x_i=1$ mean success and $x_i=0$ mean failure. The unknown success probability is $\theta$.

The Bernoulli model is

$$X_1,\ldots,X_n \mid \theta \overset{\mathrm{iid}}{\sim} \mathrm{Bern}(\theta).$$

This means:

  • each observation has probability $\theta$ of being 1;
  • each observation has probability $1-\theta$ of being 0;
  • the observations are conditionally independent given $\theta$.

Let

$$s=\sum_{i=1}^n x_i, \qquad f=n-s.$$

Here $s$ is the number of successes and $f$ is the number of failures.

The likelihood tells us which parameter values make the observed data more or less plausible.

For Bernoulli data, multiplying the probabilities of all observations gives

$$p(x_1,\ldots,x_n \mid \theta) = \theta^s(1-\theta)^f.$$

As a function of $\theta$, this is the likelihood function.

The term

$$\theta^s$$

rewards values of $\theta$ that make the observed successes plausible.

The term

$$(1-\theta)^f$$

rewards values of $\theta$ that make the observed failures plausible.

If the data contain many successes, the likelihood is larger for larger $\theta$. If the data contain many failures, the likelihood is larger for smaller $\theta$.

The likelihood is not a probability distribution over $\theta$: as a function of $\theta$, it need not integrate to one.

In the sampling model, $\theta$ is fixed and the data are random. In the likelihood, the data are fixed and $\theta$ is varied. The same expression $p(x \mid \theta)$ is used, but the interpretation changes.
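The two readings of $p(x \mid \theta)$ can be checked numerically. The sketch below evaluates the Bernoulli likelihood on a grid of $\theta$ values for a hypothetical data set of 7 successes and 3 failures (these counts, the grid size, and the function name are illustrative assumptions, not part of the lecture). It reports where the likelihood peaks and the area under the curve, which is far from 1.

```python
def bernoulli_likelihood(theta, s, f):
    """Likelihood theta^s * (1 - theta)^f for s successes and f failures."""
    return theta**s * (1.0 - theta)**f

# Hypothetical data: 7 successes and 3 failures.
s, f = 7, 3

# Evaluate the likelihood on an evenly spaced grid over [0, 1].
grid_size = 1001
grid = [i / (grid_size - 1) for i in range(grid_size)]
values = [bernoulli_likelihood(t, s, f) for t in grid]

# Grid point with the largest likelihood (the sample proportion s / (s + f)).
theta_hat = grid[values.index(max(values))]

# Trapezoid-rule area under the likelihood curve over theta.
step = grid[1] - grid[0]
area = step * (sum(values) - 0.5 * (values[0] + values[-1]))

print(f"maximizer: {theta_hat:.3f}")
print(f"area under likelihood: {area:.6f}")
```

Because the area is not 1, the curve has to be rescaled before it can be read as a density over $\theta$; that rescaling is what the normalizing constant in Bayes' theorem provides.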

Bayesian probability is allowed to describe uncertainty about fixed but unknown quantities.

For example,

$$\Pr(\theta < 0.6 \mid \text{data})$$

is meaningful in Bayesian inference. It means:

After seeing the data, how much posterior probability is below 0.6?

The prior distribution $p(\theta)$ represents uncertainty before observing the current data. It may come from previous studies, expert knowledge, or a deliberately weak starting point.

The important rule is that the prior must be stated before it is combined with the current likelihood.

Bayes’ theorem updates the prior by the likelihood:

$$p(\theta \mid \text{data}) = \frac{p(\text{data} \mid \theta)\,p(\theta)}{p(\text{data})}.$$

The denominator is

$$p(\text{data}) = \int p(\text{data} \mid \theta)\,p(\theta)\,d\theta.$$

It makes the posterior integrate to one.

For many calculations, we first use the proportional form:

$$p(\theta \mid \text{data}) \propto p(\text{data} \mid \theta)\,p(\theta).$$

In words:

$$\text{Posterior} \propto \text{Likelihood} \times \text{Prior}.$$

This is the central update rule of Bayesian inference.

  • The prior says which parameter values were plausible before the data.
  • The likelihood says which parameter values are supported by the data.
  • The posterior combines both sources of information.
  • The normalizing constant makes the result a valid probability distribution.
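The four roles listed above can be traced in a small grid approximation. This is a minimal sketch, assuming a flat prior on $[0,1]$ and hypothetical data of 7 successes and 3 failures; the helper names `grid_posterior` and `flat_prior` are made up for illustration. It multiplies likelihood by prior at each grid point, divides by a Riemann-sum stand-in for $p(\text{data})$, and then reads off a posterior probability.

```python
def grid_posterior(s, f, prior, grid_size=1001):
    """Approximate the posterior on a grid by normalizing likelihood * prior."""
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    unnormalized = [t**s * (1.0 - t)**f * prior(t) for t in grid]
    step = grid[1] - grid[0]
    # Riemann-sum approximation of the normalizing constant p(data).
    evidence = step * sum(unnormalized)
    density = [u / evidence for u in unnormalized]
    return grid, density, step

def flat_prior(theta):
    """Assumed prior for this sketch: constant density on [0, 1]."""
    return 1.0

# Hypothetical data: 7 successes and 3 failures.
grid, density, step = grid_posterior(7, 3, flat_prior)

total = step * sum(density)  # 1 by construction, up to rounding
prob_below = step * sum(d for t, d in zip(grid, density) if t < 0.6)

print(f"total posterior mass: {total:.4f}")
print(f"Pr(theta < 0.6 | data) is roughly {prob_below:.3f}")
```

Swapping in a different `prior` function changes the posterior without touching the likelihood, which keeps the two sources of information visibly separate.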

Let $A$ be the event that a person has a very rare disease. Suppose

$$\Pr(A)=0.0001.$$

Let $B$ be the event that the test is positive. Suppose

$$\Pr(B \mid A)=0.9, \qquad \Pr(B \mid A^c)=0.05.$$

Bayes’ theorem gives

$$\Pr(A \mid B) = \frac{\Pr(B \mid A)\Pr(A)}{\Pr(B \mid A)\Pr(A)+\Pr(B \mid A^c)\Pr(A^c)}.$$

Substituting the numbers,

$$\Pr(A \mid B) = \frac{0.9 \times 0.0001}{0.9 \times 0.0001 + 0.05 \times 0.9999} \approx 0.0018.$$

The positive test matters: the probability increased by a factor of about 18. But the disease is so rare that the posterior probability is still small.

This example shows why both ingredients matter. A strong likelihood signal can still lead to a small posterior probability when the prior probability is extremely low.
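The arithmetic in the disease example is easy to reproduce. The function below is a small sketch of Bayes' theorem for a binary event; its name and argument names are illustrative choices, while the three input probabilities are the ones given in the text.

```python
def posterior_probability(prior, p_pos_given_disease, p_pos_given_healthy):
    """Pr(A | B) for a binary event A (disease) and a positive test B."""
    numerator = p_pos_given_disease * prior
    denominator = numerator + p_pos_given_healthy * (1.0 - prior)
    return numerator / denominator

prior = 0.0001
posterior = posterior_probability(prior, 0.9, 0.05)

print(f"posterior probability of disease: {posterior:.6f}")
print(f"ratio of posterior to prior: {posterior / prior:.1f}")
```

Rerunning with a larger prior, say 0.01, shows how quickly the posterior grows once the disease is less rare, with the test characteristics held fixed.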

The Beta Prior for a Bernoulli Probability

The success probability $\theta$ must lie between 0 and 1. The Beta distribution is a flexible distribution on this interval.

Write

$$\theta \sim \mathrm{Beta}(\alpha,\beta).$$

Its density is

$$p(\theta) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)} \theta^{\alpha-1}(1-\theta)^{\beta-1}, \qquad 0 \le \theta \le 1.$$

The parameters $\alpha$ and $\beta$ control the prior shape. Roughly, $\alpha$ acts like a count of prior successes and $\beta$ like a count of prior failures: larger $\alpha$ shifts prior mass toward 1, and larger $\beta$ shifts it toward 0.

For $s$ successes and $f$ failures,

$$p(x \mid \theta) \propto \theta^s(1-\theta)^f.$$

The Beta prior has the kernel

$$p(\theta) \propto \theta^{\alpha-1}(1-\theta)^{\beta-1}.$$

Bayes’ theorem gives

$$p(\theta \mid x) \propto \theta^s(1-\theta)^f \, \theta^{\alpha-1}(1-\theta)^{\beta-1}.$$

Collecting powers,

$$p(\theta \mid x) \propto \theta^{\alpha+s-1}(1-\theta)^{\beta+f-1}.$$

This is the kernel of a Beta distribution, so

$$\theta \mid x \sim \mathrm{Beta}(\alpha+s,\beta+f).$$

The update is:

$$\mathrm{Beta}(\alpha,\beta) \quad \longrightarrow \quad \mathrm{Beta}(\alpha+s,\beta+f).$$

Add observed successes to the first Beta parameter. Add observed failures to the second Beta parameter.

The Beta prior is conjugate to the Bernoulli likelihood because the posterior is again a Beta distribution.
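The conjugate update can be checked against brute force. The sketch below assumes a $\mathrm{Beta}(2,2)$ prior and hypothetical data of 7 successes and 3 failures (both chosen only for illustration). It normalizes likelihood times prior on a grid and compares the result with the closed-form $\mathrm{Beta}(\alpha+s,\beta+f)$ density, computed from the density formula using `math.lgamma` for the Gamma functions.

```python
import math

def beta_pdf(theta, a, b):
    """Density of Beta(a, b) at theta, for 0 < theta < 1."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    log_kernel = (a - 1) * math.log(theta) + (b - 1) * math.log(1 - theta)
    return math.exp(log_norm + log_kernel)

def beta_update(a, b, s, f):
    """Conjugate update: add successes to a and failures to b."""
    return a + s, b + f

# Assumed illustration values: Beta(2, 2) prior, 7 successes, 3 failures.
a, b, s, f = 2.0, 2.0, 7, 3
a_post, b_post = beta_update(a, b, s, f)

# Brute-force check: normalize likelihood * prior on grid midpoints,
# which keeps theta strictly inside (0, 1).
n_grid = 2000
grid = [(i + 0.5) / n_grid for i in range(n_grid)]
unnorm = [t**s * (1 - t)**f * beta_pdf(t, a, b) for t in grid]
evidence = sum(unnorm) / n_grid
max_gap = max(abs(u / evidence - beta_pdf(t, a_post, b_post))
              for t, u in zip(grid, unnorm))

print(f"posterior: Beta({a_post}, {b_post})")
print(f"posterior mean: {a_post / (a_post + b_post):.4f}")
print(f"largest gap between grid and closed form: {max_gap:.2e}")
```

The posterior mean printed here uses the standard fact that a $\mathrm{Beta}(a,b)$ distribution has mean $a/(a+b)$. The grid is only a check; with a conjugate prior the two-line `beta_update` is the whole computation.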

George examines 4601 emails. Of these, 1813 are spam and 2788 are not spam.

Let $x_i=1$ if email $i$ is spam. The model is

$$X_i \mid \theta \overset{\mathrm{iid}}{\sim} \mathrm{Bern}(\theta).$$

With prior

$$\theta \sim \mathrm{Beta}(\alpha,\beta),$$

the posterior is

$$\theta \mid x \sim \mathrm{Beta}(\alpha+1813,\beta+2788).$$

If $n=10$, different reasonable priors can noticeably change the posterior.

If $n=100$, the prior still matters, but less.

If $n=4601$, the likelihood is so concentrated that reasonable priors give very similar posteriors.

The practical lesson is not that priors are irrelevant. It is that prior influence depends on how much information the likelihood contains.
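A quick way to see this is to compare posterior means under two different priors as $n$ grows. In the sketch below only the $n=4601$ counts come from the spam example; the counts for $n=10$ and $n=100$ are hypothetical samples with a similar spam rate, and the two priors, $\mathrm{Beta}(1,1)$ and $\mathrm{Beta}(8,2)$, are chosen here purely for contrast. The posterior mean uses the standard formula $(\alpha+s)/(\alpha+\beta+s+f)$.

```python
def posterior_mean(alpha, beta, s, f):
    """Mean of the Beta(alpha + s, beta + f) posterior."""
    return (alpha + s) / (alpha + beta + s + f)

# n -> (successes, failures). Only the last pair is from the text; the
# first two are hypothetical smaller samples with a similar spam rate.
datasets = {10: (4, 6), 100: (39, 61), 4601: (1813, 2788)}

# Two illustrative priors, chosen only to disagree with each other.
priors = {"Beta(1, 1)": (1, 1), "Beta(8, 2)": (8, 2)}

gaps = {}
for n, (s, f) in datasets.items():
    means = [posterior_mean(a, b, s, f) for a, b in priors.values()]
    gaps[n] = abs(means[0] - means[1])
    print(f"n={n:5d}  means={means[0]:.4f}, {means[1]:.4f}  gap={gaps[n]:.4f}")
```

The gap between the two posterior means shrinks as $n$ increases, because the prior parameters are fixed while the observed counts keep growing.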

The likelihood principle says:

Once the data have been observed, all evidence about $\theta$ contained in the data is in the likelihood function.

This means that two experiments with proportional likelihoods should lead to the same inference about $\theta$ if the same prior is used.

Suppose the observed data contain $s$ successes and $f$ failures.

In a fixed-order Bernoulli experiment, the likelihood is

$$p(x \mid \theta) = \theta^s(1-\theta)^f.$$

In a binomial experiment with fixed $n$, the likelihood is

$$p(s \mid \theta) = \binom{n}{s}\theta^s(1-\theta)^f.$$

In a negative binomial experiment with fixed $s$, the likelihood is

$$p(n \mid \theta) = \binom{n-1}{s-1}\theta^s(1-\theta)^f.$$

The binomial coefficients do not depend on $\theta$. Therefore all three likelihoods are proportional as functions of $\theta$.

Bayes’ theorem multiplies the likelihood by the prior and then normalizes. Constants that do not depend on $\theta$ cancel during normalization.

So, with the same prior, all three sampling schemes produce the same posterior:

$$p(\theta \mid \text{data}) \propto \theta^s(1-\theta)^f \, p(\theta).$$

This is why Bayesian inference respects the likelihood principle in this setting.
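The cancellation can be seen numerically. The sketch below assumes a flat prior and hypothetical data of 7 successes and 3 failures, builds the three likelihoods on a grid, normalizes each one, and compares the results. The grid size and helper name are illustrative choices.

```python
import math

def normalize(values):
    """Scale positive values so they sum to 1 over the grid."""
    total = sum(values)
    return [v / total for v in values]

# Assumed illustration values: 7 successes, 3 failures, flat prior.
s, f = 7, 3
n = s + f
grid = [(i + 0.5) / 200 for i in range(200)]
kernel = [t**s * (1 - t)**f for t in grid]

# Each sampling scheme multiplies the same kernel by a constant in theta.
constants = {
    "bernoulli": 1,
    "binomial": math.comb(n, s),
    "negative binomial": math.comb(n - 1, s - 1),
}
posteriors = {name: normalize([c * k for k in kernel])
              for name, c in constants.items()}

reference = posteriors["bernoulli"]
for name, post in posteriors.items():
    gap = max(abs(p - r) for p, r in zip(post, reference))
    print(f"{name}: largest difference from Bernoulli posterior = {gap:.2e}")
```

Any differences that show up are at the level of floating-point rounding: the multiplicative constants disappear as soon as each curve is divided by its own sum.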

  1. In the Bernoulli model, what information from the data enters the likelihood?
  2. Why is the likelihood not itself a probability distribution over $\theta$?
  3. What does the normalizing constant in Bayes’ theorem do?
  4. In the Beta-Bernoulli update, why do successes add to $\alpha$ and failures add to $\beta$?
  5. Why do proportional likelihoods lead to the same Bayesian posterior when the prior is the same?

Bayesian inference updates uncertainty by combining a prior distribution with a likelihood. The likelihood measures how well different parameter values explain the observed data, while the prior represents uncertainty before the data. Bayes’ theorem turns these into the posterior distribution. In the Bernoulli-Beta model, the update is especially simple: add observed successes and failures to the two Beta parameters. This example also shows the likelihood principle: Bayesian inference depends on the likelihood as a function of the parameter, not on sampling details that only multiply the likelihood by constants.