Introduction to Bayesian Inference

Source: Lecture 1, Introduction and Bernoulli data (BDA Ch. 1, 2.1-2.4).

The Main Question

Bayesian inference starts from a practical problem:

How should our uncertainty change after we see data?

The answer has three pieces:

A model for how data are generated.
A prior distribution for what we believed before seeing the data.
A posterior distribution for what we believe after seeing the data.

The chapter builds these pieces in the simplest useful setting: Bernoulli data, where each observation is either success or failure.

Step 1: Write Down the Sampling Model

Suppose we observe binary data

x_1,\ldots,x_n.

Let $x_i=1$ mean success and $x_i=0$ mean failure. The unknown success probability is $\theta$.

The Bernoulli model is

X_1,\ldots,X_n \mid \theta \overset{\mathrm{iid}}{\sim} \mathrm{Bern}(\theta).

This means:

each observation has probability $\theta$ of being 1;
each observation has probability $1-\theta$ of being 0;
the observations are conditionally independent given $\theta$.

Let

s=\sum_{i=1}^n x_i, \qquad f=n-s.

Here $s$ is the number of successes and $f$ is the number of failures.
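As a small illustration of the sampling model, the sketch below simulates Bernoulli data and reduces it to the counts $s$ and $f$. The function names, the value $\theta=0.3$, the sample size, and the seed are all assumptions chosen for the example, not part of the lecture.

```python
import random

def simulate_bernoulli(theta, n, seed=0):
    """Draw n iid Bernoulli(theta) observations as a list of 0s and 1s."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    return [1 if rng.random() < theta else 0 for _ in range(n)]

def success_failure_counts(xs):
    """Return (s, f): the number of ones and the number of zeros in xs."""
    s = sum(xs)
    return s, len(xs) - s

# Hypothetical settings, for illustration only.
xs = simulate_bernoulli(theta=0.3, n=20)
s, f = success_failure_counts(xs)
print(xs)
print("successes:", s, "failures:", f)
```

In practice $\theta$ is unknown; the simulation only shows how data would arise if it were known.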

Step 2: Build the Likelihood

What Is It For?

The likelihood tells us which parameter values make the observed data more or less plausible.

For Bernoulli data, multiplying the probabilities of all observations gives

p(x_1,\ldots,x_n \mid \theta) = \theta^s(1-\theta)^f.

As a function of $\theta$, this is the likelihood function.

How to Read the Formula

The term

\theta^s

rewards values of $\theta$ that make the observed successes plausible.

The term

(1-\theta)^f

rewards values of $\theta$ that make the observed failures plausible.

If the data contain mostly successes, the likelihood is larger for larger $\theta$. If the data contain mostly failures, the likelihood is larger for smaller $\theta$. The likelihood is maximized at $\theta=s/n$, the observed proportion of successes.
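One way to see this behaviour is to evaluate the likelihood on a grid of $\theta$ values and find where it is largest. The sketch below does that for two made-up data sets; the counts (8 successes and 2 failures, then the reverse) and the grid spacing are illustrative assumptions.

```python
def bernoulli_likelihood(theta, s, f):
    """Likelihood theta^s (1 - theta)^f for s successes and f failures."""
    return theta**s * (1 - theta)**f

# Evenly spaced candidate values 0.00, 0.01, ..., 1.00.
grid = [i / 100 for i in range(101)]

def best_grid_theta(s, f):
    """Grid value of theta with the largest likelihood."""
    return max(grid, key=lambda t: bernoulli_likelihood(t, s, f))

# Hypothetical counts, for illustration only.
mostly_successes = best_grid_theta(s=8, f=2)
mostly_failures = best_grid_theta(s=2, f=8)
print(mostly_successes, mostly_failures)
```

With mostly successes the largest likelihood sits at a high value of $\theta$, and with mostly failures it sits at a low one.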

A Common Confusion

The likelihood is not a probability distribution over $\theta$: as a function of $\theta$, it generally does not integrate to one.

In the sampling model, $\theta$ is fixed and the data are random. In the likelihood, the data are fixed and $\theta$ is varied. The same expression $p(x \mid \theta)$ is used, but the interpretation changes.
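The difference between the two readings can be checked numerically. In the sketch below, the sampling model sums to one over all possible data sets when $\theta$ is fixed, while the likelihood for one fixed data set does not integrate to one over $\theta$. The values $n=4$, $\theta=0.3$, and the counts $s=3$, $f=1$ are assumptions for the example, and the integral is approximated with a simple midpoint rule.

```python
from itertools import product

def sequence_probability(xs, theta):
    """p(x_1, ..., x_n | theta) for a binary sequence xs."""
    s = sum(xs)
    f = len(xs) - s
    return theta**s * (1 - theta)**f

# Reading 1: theta fixed, data vary. Summing over all 2^n sequences gives one.
n, theta = 4, 0.3
total_over_data = sum(
    sequence_probability(xs, theta) for xs in product([0, 1], repeat=n)
)

# Reading 2: data fixed (s = 3, f = 1), theta varies.
def area_under_likelihood(s, f, m=10_000):
    """Midpoint-rule approximation of the integral of theta^s (1-theta)^f on [0, 1]."""
    h = 1 / m
    return h * sum(((k + 0.5) * h)**s * (1 - (k + 0.5) * h)**f for k in range(m))

area = area_under_likelihood(s=3, f=1)
print(total_over_data)  # close to 1
print(area)             # close to 0.05, not 1
```

For these counts the exact area is $3!\,1!/5! = 0.05$, so the likelihood needs the normalization step of Bayes' theorem before it can be read as a distribution over $\theta$.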

Step 3: Represent Prior Uncertainty

In Bayesian inference, probability is used to describe uncertainty about fixed but unknown quantities.

For example,

\Pr(\theta < 0.6 \mid \text{data})

is meaningful in Bayesian inference. It means:

After seeing the data, how probable is it that $\theta$ is below 0.6?

The prior distribution $p(\theta)$ represents uncertainty before observing the current data. It may come from previous studies, expert knowledge, or a deliberately weak starting point.

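The important rule is that the prior must be specified without looking at the current data, before it is combined with the current likelihood. Otherwise the same data would be used twice: once to choose the prior and again in the likelihood.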

Step 4: Use Bayes’ Theorem

The Formula

Bayes’ theorem updates the prior by the likelihood:

p(\theta \mid \text{data}) = \frac{p(\text{data} \mid \theta)p(\theta)} {p(\text{data})}.

The denominator is

p(\text{data}) = \int p(\text{data} \mid \theta)p(\theta)\,d\theta.

This denominator does not depend on $\theta$. Its only role is to make the posterior integrate to one.

The Working Version

For many calculations, we first use the proportional form:

p(\theta \mid \text{data}) \propto p(\text{data} \mid \theta)p(\theta).

In words:

\text{Posterior} \propto \text{Likelihood} \times \text{Prior}.

This is the central update rule of Bayesian inference.

What Each Piece Does

The prior says which parameter values were plausible before the data.
The likelihood says which parameter values are supported by the data.
The posterior combines both sources of information.
The normalizing constant makes the result a valid probability distribution.
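The four pieces can be traced in a grid approximation: evaluate likelihood times prior at many values of $\theta$, then divide by a numerical estimate of the normalizing constant. This is a minimal sketch, assuming a flat prior and made-up counts of 7 successes and 3 failures; the function name `grid_posterior` is hypothetical.

```python
def grid_posterior(prior, likelihood, grid):
    """Approximate posterior density values on an evenly spaced grid."""
    unnormalized = [likelihood(t) * prior(t) for t in grid]
    width = grid[1] - grid[0]
    # Numerical stand-in for p(data) = integral of likelihood * prior.
    normalizing_constant = sum(unnormalized) * width
    return [u / normalizing_constant for u in unnormalized]

m = 1000
grid = [(k + 0.5) / m for k in range(m)]  # midpoints of m equal cells in (0, 1)
width = grid[1] - grid[0]

# Assumptions for illustration: flat prior, 7 successes and 3 failures.
posterior = grid_posterior(
    prior=lambda t: 1.0,
    likelihood=lambda t: t**7 * (1 - t)**3,
    grid=grid,
)

total_mass = sum(posterior) * width
posterior_mean = sum(t * p for t, p in zip(grid, posterior)) * width
print(total_mass, posterior_mean)
```

The total mass comes out as one by construction, which is exactly the job of the normalizing constant.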

Example: A Rare Disease Test

Let $A$ be the event that a person has a very rare disease. Suppose

\Pr(A)=0.0001.

Let $B$ be the event that the test is positive. Suppose

\Pr(B \mid A)=0.9, \qquad \Pr(B \mid A^c)=0.05.

Bayes’ theorem gives

\Pr(A \mid B) = \frac{\Pr(B \mid A)\Pr(A)} {\Pr(B \mid A)\Pr(A)+\Pr(B \mid A^c)\Pr(A^c)}.

Substituting the numbers,

\Pr(A \mid B) = \frac{0.9 \times 0.0001} {0.9 \times 0.0001 + 0.05 \times 0.9999} \approx 0.0018.

The positive test matters: the probability increased by a factor of about 18. But the disease is so rare that the posterior probability is still small.

This example shows why both ingredients matter. A strong likelihood signal can still lead to a small posterior probability when the prior probability is extremely low.
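The arithmetic can be reproduced in a few lines. The function and argument names below are assumptions for illustration; the numbers are the ones given in the example.

```python
def posterior_probability(prior, p_pos_given_disease, p_pos_given_healthy):
    """Pr(A | B) from Bayes' theorem for a binary event A and a positive test B."""
    numerator = p_pos_given_disease * prior
    evidence = numerator + p_pos_given_healthy * (1 - prior)  # Pr(B)
    return numerator / evidence

prior = 0.0001
posterior = posterior_probability(prior, 0.9, 0.05)
increase_factor = posterior / prior

print(round(posterior, 4))        # 0.0018
print(round(increase_factor, 1))  # 18.0
```

Raising `prior` while keeping the test characteristics fixed shows how strongly the answer depends on how rare the disease is.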

The Beta Prior for a Bernoulli Probability

Why Use a Beta Distribution?

The success probability $\theta$ must lie between 0 and 1. The Beta distribution is a flexible distribution on this interval.

Write

\theta \sim \mathrm{Beta}(\alpha,\beta).

Its density is

p(\theta) = \frac{\Gamma(\alpha+\beta)} {\Gamma(\alpha)\Gamma(\beta)} \theta^{\alpha-1}(1-\theta)^{\beta-1}, \qquad 0 \le \theta \le 1.

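The parameters $\alpha$ and $\beta$ control the prior shape. Roughly, $\alpha$ acts like a prior count of successes and $\beta$ like a prior count of failures: the prior mean is $\alpha/(\alpha+\beta)$, and the prior becomes more concentrated as $\alpha+\beta$ grows.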
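To get a feel for how $\alpha$ and $\beta$ shape the prior, the density can be evaluated directly. This is a sketch using only the standard library; the function names are assumptions, and for simplicity the density is returned as 0 outside the open interval $(0,1)$, which ignores the endpoint behaviour when $\alpha<1$ or $\beta<1$.

```python
from math import exp, lgamma, log

def beta_pdf(theta, alpha, beta):
    """Beta(alpha, beta) density at theta, computed on the log scale for stability."""
    if not 0 < theta < 1:
        return 0.0  # simplification: endpoints are treated as outside the support
    log_norm = lgamma(alpha + beta) - lgamma(alpha) - lgamma(beta)
    return exp(log_norm + (alpha - 1) * log(theta) + (beta - 1) * log(1 - theta))

def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)

# Illustrative parameter choices (assumptions, not from the lecture).
for a, b in [(1, 1), (2, 2), (2, 6), (20, 60)]:
    print((a, b), "mean:", beta_mean(a, b), "density at 0.25:", round(beta_pdf(0.25, a, b), 3))
```

Beta(2, 6) and Beta(20, 60) share the mean 0.25, but the second has a much higher density there, reflecting a more concentrated prior.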

The Bernoulli-Beta Update

Ingredient 1: Likelihood

For $s$ successes and $f$ failures,

p(x \mid \theta) \propto \theta^s(1-\theta)^f.

Ingredient 2: Prior

The Beta prior has the kernel

p(\theta) \propto \theta^{\alpha-1}(1-\theta)^{\beta-1}.

Combining the Ingredients: Multiply

Bayes’ theorem gives

p(\theta \mid x) \propto \theta^s(1-\theta)^f \theta^{\alpha-1}(1-\theta)^{\beta-1}.

Collecting powers,

p(\theta \mid x) \propto \theta^{\alpha+s-1}(1-\theta)^{\beta+f-1}.

This is the kernel of a Beta distribution, so

\theta \mid x \sim \mathrm{Beta}(\alpha+s,\beta+f).

How to Remember It

The update is:

\mathrm{Beta}(\alpha,\beta) \quad \longrightarrow \quad \mathrm{Beta}(\alpha+s,\beta+f).

Add observed successes to the first Beta parameter. Add observed failures to the second Beta parameter.

The Beta prior is conjugate to the Bernoulli likelihood because the posterior is again a Beta distribution.
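The update rule translates into a one-line function. The sketch below also checks that updating one observation at a time gives the same posterior as updating with all the data at once, which follows from the rule because the counts simply add. The prior Beta(2, 2) and the short data sequence are assumptions for illustration.

```python
def beta_bernoulli_update(alpha, beta, xs):
    """Posterior Beta parameters after observing the binary sequence xs."""
    s = sum(xs)        # observed successes
    f = len(xs) - s    # observed failures
    return alpha + s, beta + f

# Hypothetical prior and data.
xs = [1, 0, 1, 1, 0]
batch = beta_bernoulli_update(2, 2, xs)

# Sequential updating: yesterday's posterior is today's prior.
a, b = 2, 2
for x in xs:
    a, b = beta_bernoulli_update(a, b, [x])

print(batch)   # (5, 4)
print((a, b))  # (5, 4)
```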

Example: Spam Email

George examines 4601 emails. Of these, 1813 are spam and 2788 are not spam.

Let $x_i=1$ if email $i$ is spam. The model is

X_i \mid \theta \overset{\mathrm{iid}}{\sim} \mathrm{Bern}(\theta).

With prior

\theta \sim \mathrm{Beta}(\alpha,\beta),

the posterior is

\theta \mid x \sim \mathrm{Beta}(\alpha+1813,\beta+2788).

What This Example Teaches

With only $n=10$ observations, different reasonable priors can noticeably change the posterior.

With $n=100$, the prior still matters, but less.

With all $n=4601$ emails, the likelihood is so concentrated that reasonable priors give very similar posteriors.

The practical lesson is not that priors are irrelevant. It is that prior influence depends on how much information the likelihood contains.
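A rough way to quantify prior influence is to compare posterior means under two different priors as $n$ grows. In the sketch below, only the counts for $n=4601$ come from the example. The counts for $n=10$ and $n=100$ are hypothetical smaller samples with a similar spam proportion, and the two priors, Beta(1, 1) and Beta(8, 2), are likewise assumptions chosen to disagree with each other.

```python
def posterior_mean(alpha, beta, s, f):
    """Mean of the Beta(alpha + s, beta + f) posterior."""
    return (alpha + s) / (alpha + beta + s + f)

# Two illustrative priors (assumptions, not from the lecture).
priors = {"Beta(1, 1)": (1, 1), "Beta(8, 2)": (8, 2)}

# n -> (s, f). Only the n = 4601 row is the real data from the example;
# the n = 10 and n = 100 rows are hypothetical counts.
datasets = {10: (4, 6), 100: (39, 61), 4601: (1813, 2788)}

gaps = {}
for n, (s, f) in datasets.items():
    means = [posterior_mean(a, b, s, f) for a, b in priors.values()]
    gaps[n] = max(means) - min(means)
    print(n, [round(m, 4) for m in means], "gap:", round(gaps[n], 4))
```

The gap between the two posterior means shrinks as $n$ increases, which is the pattern described above.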

The Likelihood Principle

The Principle

The likelihood principle says:

Once the data have been observed, all evidence about $\theta$ contained in the data is in the likelihood function.

This means that two experiments with proportional likelihoods should lead to the same inference about $\theta$ if the same prior is used.

Three Ways to Observe Bernoulli Data

Suppose the observed data contain $s$ successes and $f$ failures.

In a fixed-order Bernoulli experiment, the likelihood is

p(x \mid \theta) = \theta^s(1-\theta)^f.

In a binomial experiment with fixed $n$, the likelihood is

p(s \mid \theta) = \binom{n}{s}\theta^s(1-\theta)^f.

In a negative binomial experiment with fixed $s$, where sampling continues until the $s$-th success and the total number of trials $n$ is random, the likelihood is

p(n \mid \theta) = \binom{n-1}{s-1}\theta^s(1-\theta)^f.

The binomial coefficients do not depend on $\theta$. Therefore all three likelihoods are proportional as functions of $\theta$.

Why the Bayesian Posterior Is the Same

Bayes’ theorem multiplies the likelihood by the prior and then normalizes. Constants that do not depend on $\theta$ cancel during normalization.

So, with the same prior, all three sampling schemes produce the same posterior:

p(\theta \mid \text{data}) \propto \theta^s(1-\theta)^f p(\theta).

This is why Bayesian inference respects the likelihood principle in this setting.
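The cancellation can be verified numerically. The sketch below builds the three likelihoods for the same counts, multiplies each by a flat prior on a grid, normalizes the results to sum to one, and compares them. The counts $s=3$, $f=7$, the flat prior, and the grid size are assumptions for illustration.

```python
from math import comb

# Hypothetical counts, for illustration only.
s, f = 3, 7
n = s + f

likelihoods = {
    "bernoulli": lambda t: t**s * (1 - t)**f,
    "binomial": lambda t: comb(n, s) * t**s * (1 - t)**f,
    "negative binomial": lambda t: comb(n - 1, s - 1) * t**s * (1 - t)**f,
}

grid = [(k + 0.5) / 200 for k in range(200)]

def normalized_weights(likelihood):
    """Posterior weights on the grid under a flat prior, scaled to sum to one."""
    unnormalized = [likelihood(t) * 1.0 for t in grid]  # 1.0 is the flat prior
    total = sum(unnormalized)
    return [u / total for u in unnormalized]

posteriors = {name: normalized_weights(lik) for name, lik in likelihoods.items()}

reference = posteriors["bernoulli"]
max_difference = max(
    abs(p - r)
    for weights in posteriors.values()
    for p, r in zip(weights, reference)
)
print(max_difference)  # zero up to floating-point rounding
```

The same check works with any other prior, as long as it is the same prior for all three schemes.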

Study Questions

In the Bernoulli model, what information from the data enters the likelihood?
Why is the likelihood not itself a probability distribution over $\theta$?
What does the normalizing constant in Bayes’ theorem do?
In the Beta-Bernoulli update, why do successes add to $\alpha$ and failures add to $\beta$ ?
Why do proportional likelihoods lead to the same Bayesian posterior when the prior is the same?

Chapter Summary

Bayesian inference updates uncertainty by combining a prior distribution with a likelihood. The likelihood measures how well different parameter values explain the observed data, while the prior represents uncertainty before the data. Bayes’ theorem turns these into the posterior distribution. In the Bernoulli-Beta model, the update is especially simple: add observed successes and failures to the two Beta parameters. This example also shows the likelihood principle: Bayesian inference depends on the likelihood as a function of the parameter, not on sampling details that only multiply the likelihood by constants.