Glossary

This glossary collects technical terms introduced throughout the textbook.

In Metropolis-Hastings, the probability $\alpha$ of accepting a proposed move, computed as the minimum of 1 and the ratio of posterior densities times the proposal ratio.

The ratio of marginal likelihoods $B_{12} = p_1(y)/p_2(y)$, measuring the relative evidence for model $M_1$ versus $M_2$.

$p(\theta \mid y) \propto p(y \mid \theta)\, p(\theta)$. Combines prior and likelihood into the posterior.

Leave-one-out cross-validation computed from the posterior, asymptotically equivalent to WAIC.

In large samples, the posterior distribution is approximately normal centered at the MLE, regardless of the prior (under regularity conditions).

$p(\theta) \propto \theta^{\alpha-1}(1-\theta)^{\beta-1}$ for $\theta \in [0,1]$. Conjugate prior for Bernoulli/binomial likelihood.

The initial portion of an MCMC chain discarded before the chain has converged to the stationary distribution.

A prior family such that, for a given likelihood, the posterior belongs to the same family. Examples: Beta-Bernoulli, Normal-Normal, Gamma-Poisson, Dirichlet-multinomial.
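As a small illustration of conjugacy, the sketch below performs the Beta-Bernoulli update in closed form: a Beta($\alpha$, $\beta$) prior combined with 0/1 outcomes gives a Beta posterior whose parameters are the prior parameters plus the success and failure counts. The function names and the example data are assumptions made for illustration only.

```python
def beta_bernoulli_update(alpha, beta, data):
    """Return the posterior Beta parameters after observing 0/1 outcomes."""
    successes = sum(data)
    failures = len(data) - successes
    return alpha + successes, beta + failures


def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)


# Hypothetical data: 5 successes and 2 failures, with a uniform Beta(1, 1) prior.
post_alpha, post_beta = beta_bernoulli_update(1.0, 1.0, [1, 0, 1, 1, 0, 1, 1])
post_mean = beta_mean(post_alpha, post_beta)
```

No sampling is needed here, which is the practical appeal of conjugate families.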

An interval $[a, b]$ such that $\Pr(\theta \in [a,b] \mid y) = 1 - \alpha$. Interpreted as a probability statement about $\theta$.

Introducing latent variables to make Gibbs sampling conditionals tractable. Used in mixture models, logistic regression, and probit regression.

Framework for choosing actions $a$ to maximize posterior expected utility $E_{p(\theta \mid y)}[U(a, \theta)]$.

Multivariate generalization of the Beta distribution. Conjugate prior for the multinomial likelihood.

$\mathrm{ESS} = N/\mathrm{IF}$. The number of independent draws equivalent to $N$ autocorrelated MCMC draws.

$I(\theta) = -E_{x \mid \theta}[\partial^2 \ln p(x \mid \theta)/\partial\theta^2]$. Measures the information that data carries about $\theta$.

$p(\theta_j \mid \theta_{-j}, y)$. The distribution of one block of parameters given all others and the data. Used in Gibbs sampling.

$p(\theta) \propto \theta^{\alpha-1}\exp(-\beta\theta)$. Conjugate prior for the Poisson likelihood.

An MCMC algorithm that iteratively draws from full conditional distributions. Converges to the joint target distribution.
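A minimal sketch of the idea, assuming a standard bivariate normal target with correlation `rho`, for which both full conditionals are known: $\theta_1 \mid \theta_2 \sim N(\rho\,\theta_2,\, 1-\rho^2)$, and symmetrically for $\theta_2$. The target, the function names, and the burn-in length are illustrative assumptions.

```python
import math
import random


def gibbs_bivariate_normal(rho, n_draws, seed=0):
    """Alternate draws from the two full conditionals of a bivariate normal."""
    rng = random.Random(seed)
    cond_sd = math.sqrt(1.0 - rho ** 2)
    theta1, theta2 = 0.0, 0.0
    draws = []
    for _ in range(n_draws):
        theta1 = rng.gauss(rho * theta2, cond_sd)  # draw theta1 | theta2
        theta2 = rng.gauss(rho * theta1, cond_sd)  # draw theta2 | theta1
        draws.append((theta1, theta2))
    return draws


def sample_corr(pairs):
    """Sample correlation of a list of (x, y) pairs."""
    n = len(pairs)
    m1 = sum(p[0] for p in pairs) / n
    m2 = sum(p[1] for p in pairs) / n
    s11 = sum((p[0] - m1) ** 2 for p in pairs)
    s22 = sum((p[1] - m2) ** 2 for p in pairs)
    s12 = sum((p[0] - m1) * (p[1] - m2) for p in pairs)
    return s12 / math.sqrt(s11 * s22)


draws = gibbs_bivariate_normal(rho=0.8, n_draws=20000)
kept = draws[1000:]  # discard an (arbitrary) burn-in portion
corr_hat = sample_corr(kept)
```

The sample correlation of the retained draws should be close to `rho`, even though the sampler never draws from the joint distribution directly.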

An MCMC algorithm that uses Hamiltonian dynamics (via the leapfrog integrator) to make distant proposals with high acceptance rates.

The shortest credible interval containing a given probability mass.
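One common way to approximate this interval from posterior draws is to sort them and slide a window containing the required fraction of draws, keeping the narrowest window. The sketch below assumes a unimodal posterior and uses exponential draws as a stand-in for MCMC output; both the function name and the example are illustrative assumptions.

```python
import math
import random


def hpd_interval(draws, mass=0.95):
    """Shortest interval containing `mass` of the draws (unimodal case)."""
    xs = sorted(draws)
    n = len(xs)
    k = max(1, math.ceil(mass * n))  # number of draws inside the window
    # Try every window of k consecutive sorted draws and keep the narrowest.
    start = min(range(n - k + 1), key=lambda i: xs[i + k - 1] - xs[i])
    return xs[start], xs[start + k - 1]


# Stand-in "posterior": Exponential(1) draws, a right-skewed distribution.
rng = random.Random(0)
sample = [rng.expovariate(1.0) for _ in range(20000)]
hpd_low, hpd_high = hpd_interval(sample, mass=0.95)
```

For a skewed distribution like this one, the interval hugs the mode near zero, unlike an equal-tailed interval that would exclude the lowest 2.5% of draws.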

A model where parameters have priors whose hyperparameters themselves have priors, creating a multi-level structure.

$\mathrm{IF} = 1 + 2\sum_{k=1}^\infty \rho_k$, where $\rho_k$ is the lag-$k$ autocorrelation of the chain. Measures the efficiency loss due to MCMC autocorrelation.
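The sketch below estimates the inefficiency factor by truncating the infinite sum at a fixed lag, then converts it to an effective sample size via $\mathrm{ESS} = N/\mathrm{IF}$. The truncation at 50 lags and the AR(1) test chain are assumptions chosen for illustration; for an AR(1) process with coefficient $\phi$, $\rho_k = \phi^k$, so the exact value is $(1+\phi)/(1-\phi)$.

```python
import random


def autocorr(x, lag):
    """Sample autocorrelation of the sequence x at the given lag."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n
    cov = sum((x[t] - mean) * (x[t + lag] - mean) for t in range(n - lag)) / n
    return cov / var


def inefficiency_factor(x, max_lag=50):
    """Truncated estimate: 1 + 2 * (sum of the first max_lag autocorrelations)."""
    return 1.0 + 2.0 * sum(autocorr(x, k) for k in range(1, max_lag + 1))


# Stand-in for MCMC output: an AR(1) chain with phi = 0.5 (exact IF = 3).
rng = random.Random(0)
phi = 0.5
chain = [0.0]
for _ in range(20000):
    chain.append(phi * chain[-1] + rng.gauss(0.0, 1.0))

if_hat = inefficiency_factor(chain)
ess_hat = len(chain) / if_hat
```

Practical implementations usually choose the truncation point adaptively rather than fixing it in advance.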

$p(\theta) \propto |I(\theta)|^{1/2}$. A non-informative prior based on Fisher information. Transformation-invariant but violates the likelihood principle.

A normal approximation to the posterior obtained by Taylor-expanding the log-posterior around the mode.
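A rough one-dimensional sketch of this construction: locate the posterior mode, then set the approximating variance to the negative inverse of the second derivative of the log-posterior at the mode. The example assumes a Beta(6, 3) posterior, a ternary search for the mode, and a finite-difference second derivative; all of these are illustrative choices.

```python
import math


def log_post(theta, a=6.0, b=3.0):
    """Unnormalized log-density of a Beta(a, b) posterior (illustrative)."""
    return (a - 1.0) * math.log(theta) + (b - 1.0) * math.log(1.0 - theta)


def find_mode(f, lo, hi, iters=200):
    """Ternary search for the maximizer of a unimodal function on [lo, hi]."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if f(m1) < f(m2):
            lo = m1
        else:
            hi = m2
    return 0.5 * (lo + hi)


def laplace_approx(f, lo, hi, h=1e-4):
    """Return (mode, variance) of the normal approximation to exp(f)."""
    mode = find_mode(f, lo, hi)
    # Central finite difference for the second derivative at the mode.
    second = (f(mode + h) - 2.0 * f(mode) + f(mode - h)) / h ** 2
    return mode, -1.0 / second


lap_mode, lap_var = laplace_approx(log_post, 1e-6, 1.0 - 1e-6)
```

For this Beta(6, 3) example the exact mode is $5/7$ and the exact curvature-based variance is $10/343$, which gives a check on the numerical result.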

$L_1$-regularized regression. Equivalent to the posterior mode under a Laplace (double-exponential) prior on coefficients.

$p(y \mid \theta)$ viewed as a function of $\theta$ for fixed data. Not a probability distribution over $\theta$.

All evidence about $\theta$ in a sample is contained in the likelihood function. Bayesian inference respects this principle.

$p(y) = \int p(y \mid \theta)\, p(\theta)\, d\theta$. The probability of the data averaged over the prior. Used for model comparison.
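For a binomial likelihood with a Beta prior, this integral has a closed form, $p(y) = \binom{n}{y} B(\alpha + y, \beta + n - y)/B(\alpha, \beta)$, which makes a small worked comparison possible. The sketch below compares a uniform-prior model with a hypothetical point model that fixes $\theta = 0.5$; the data ($y = 7$ successes in $n = 10$ trials) and function names are assumptions for illustration.

```python
import math


def log_beta_fn(a, b):
    """Log of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def marginal_beta_binomial(y, n, a, b):
    """p(y) for y successes in n trials with theta ~ Beta(a, b)."""
    log_p = (math.log(math.comb(n, y))
             + log_beta_fn(a + y, b + n - y)
             - log_beta_fn(a, b))
    return math.exp(log_p)


def marginal_point_model(y, n, theta0=0.5):
    """p(y) when theta is fixed at theta0 (no integration needed)."""
    return math.comb(n, y) * theta0 ** y * (1.0 - theta0) ** (n - y)


y_obs, n_obs = 7, 10
p1 = marginal_beta_binomial(y_obs, n_obs, 1.0, 1.0)  # M1: theta ~ Beta(1, 1)
p2 = marginal_point_model(y_obs, n_obs)              # M2: theta = 0.5
bayes_factor_12 = p1 / p2
```

With a uniform prior, $p(y) = 1/(n+1)$ for every $y$, so here the ratio of the two marginal likelihoods is slightly below 1, mildly favoring the point model.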

Integrating out nuisance parameters from the joint posterior: $p(\theta_1 \mid y) = \int p(\theta_1, \theta_2 \mid y)\, d\theta_2$.

A family of algorithms that construct a Markov chain whose stationary distribution is the target posterior.

A general MCMC algorithm: propose $\theta^p$ from $q(\theta^p \mid \theta)$, accept with probability $\alpha = \min(1, \text{posterior ratio} \times \text{proposal ratio})$.
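A minimal random-walk sketch of this accept/reject rule, assuming a symmetric Gaussian proposal (so the proposal ratio equals 1) and a standard normal target known only up to a constant. The step size, target, and function names are illustrative assumptions.

```python
import math
import random


def log_target(theta):
    """Unnormalized log-density of the target (standard normal here)."""
    return -0.5 * theta ** 2


def metropolis_hastings(log_post, n_draws, step=2.0, seed=0):
    """Random-walk Metropolis-Hastings with a symmetric Gaussian proposal."""
    rng = random.Random(seed)
    theta = 0.0
    draws = []
    n_accept = 0
    for _ in range(n_draws):
        proposal = theta + rng.gauss(0.0, step)
        # Symmetric proposal: alpha = min(1, posterior ratio).
        log_alpha = min(0.0, log_post(proposal) - log_post(theta))
        if rng.random() < math.exp(log_alpha):
            theta = proposal
            n_accept += 1
        draws.append(theta)  # a rejected move repeats the current value
    return draws, n_accept / n_draws


mh_draws, accept_rate = metropolis_hastings(log_target, n_draws=20000)
mh_mean = sum(mh_draws) / len(mh_draws)
mh_var = sum((d - mh_mean) ** 2 for d in mh_draws) / len(mh_draws)
```

Working with log-densities avoids numerical underflow when the posterior is a product of many small terms.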

Combining predictions or inferences across models, weighted by posterior model probabilities.

Approximating $E[g(\theta)]$ by $\frac{1}{N}\sum_{i=1}^N g(\theta^{(i)})$ where $\theta^{(i)} \sim p(\theta)$.
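As a quick illustration, the sketch below estimates $E[\theta^2]$ for $\theta \sim N(0, 1)$, whose exact value is 1, by averaging $g(\theta) = \theta^2$ over independent draws. The helper name and the choice of $g$ are assumptions for illustration.

```python
import random


def monte_carlo_mean(g, sampler, n):
    """Average g over n independent draws produced by sampler()."""
    return sum(g(sampler()) for _ in range(n)) / n


rng = random.Random(1)
mc_estimate = monte_carlo_mean(
    g=lambda theta: theta ** 2,
    sampler=lambda: rng.gauss(0.0, 1.0),
    n=100000,
)
```

The error of such an estimate shrinks at rate $1/\sqrt{N}$ when the draws are independent.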

An adaptive variant of HMC that automatically selects trajectory length, implemented in Stan.

Introducing Pólya-Gamma latent variables to make logistic regression amenable to Gibbs sampling.

$p(\theta \mid y) \propto p(y \mid \theta)\, p(\theta)$. The updated belief about $\theta$ after observing data.

$p(\tilde{y} \mid y) = \int p(\tilde{y} \mid \theta)\, p(\theta \mid y)\, d\theta$. The distribution of future data averaging over posterior uncertainty.

$\Pr[T(y^{\mathrm{rep}}) \ge T(y)]$. Measures whether observed data is extreme relative to model-generated replicates.
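A simulation sketch of this quantity, assuming an i.i.d. Bernoulli model with a Beta(1, 1) prior and taking $T$ to be the number of switches between 0 and 1 in the sequence. Each replicate draws $\theta$ from the posterior and then simulates a data set of the same length. The data, the test statistic, and the function names are hypothetical.

```python
import random


def count_switches(seq):
    """Test statistic T: number of adjacent positions where the value changes."""
    return sum(1 for a, b in zip(seq, seq[1:]) if a != b)


def posterior_predictive_pvalue(y, n_rep=5000, a=1.0, b=1.0, seed=0):
    """Estimate Pr[T(y_rep) >= T(y)] under a Beta-Bernoulli model."""
    rng = random.Random(seed)
    n, s = len(y), sum(y)
    t_obs = count_switches(y)
    exceed = 0
    for _ in range(n_rep):
        theta = rng.betavariate(a + s, b + n - s)  # draw from the posterior
        y_rep = [1 if rng.random() < theta else 0 for _ in range(n)]
        if count_switches(y_rep) >= t_obs:
            exceed += 1
    return exceed / n_rep


# Hypothetical data with long runs: only 3 switches in 20 observations.
y_seq = [1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
ppp = posterior_predictive_pvalue(y_seq)
```

A value near 0 or 1 indicates that the observed statistic lies in the tail of the replicated distribution; here the independence model rarely produces as few switches as the observed sequence.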

The distribution of future observations given observed data and model. Accounts for both parameter and sampling uncertainty.

$p(\theta)$. The belief about $\theta$ before observing data.

A language for specifying probabilistic models and performing automatic Bayesian inference (e.g., Stan).

$\Pr(y_i = 1 \mid x_i) = \Phi(x_i'\beta)$. Alternative to logistic regression using the normal CDF link.

$L_2$-regularized regression. Equivalent to the posterior mode under a Normal prior on coefficients (which coincides with the posterior mean when the likelihood is also Normal).

A conjugate prior for the variance $\sigma^2$ in Normal models.

A prior that shrinks parameters (e.g., spline coefficients) toward zero to prevent overfitting. Controls model flexibility.

A probabilistic programming language that implements HMC/NUTS for Bayesian inference. Written in C++, callable from R.

The distribution $\pi$ satisfying $\pi = \pi P$ for a Markov chain with transition matrix $P$. MCMC constructs chains with the posterior as stationary distribution.

Probability as a degree of belief, not a long-run frequency. Foundation of Bayesian inference.

Binary indicators $I = (I_1, \ldots, I_p)$ determining which covariates are included in a regression model.

Widely Applicable Information Criterion. Measures predictive accuracy using the log pointwise predictive density and an effective number of parameters.