Estimation and Model Fitting: Bayesian Parameter Estimation

Rice Chapter 8.6–8.9

Statistics

Author

Donghyun Ko

Published

May 26, 2026

These lecture notes cover the Bayesian approach to parameter estimation, contrasting it with the frequentist paradigm and developing the full machinery of prior-to-posterior updating. The notes begin by distinguishing the two paradigms: frequentists treat θ as a fixed unknown constant and perform inference through estimators and their sampling distributions, whereas the Bayesian framework models the parameter itself as a random variable Θ, encoding pre-data uncertainty in a prior distribution fΘ(θ) and updating it through Bayes’ theorem once data are observed. The resulting posterior distribution, fΘ|X(θ|x) ∝ Likelihood × Prior, serves as the complete summary of post-data knowledge about the parameter, and all inferential summaries (point estimates, intervals, and predictions) are derived from it.
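The proportionality hides a normalizing constant. Written out in the same notation, Bayes’ theorem for a continuous parameter reads as follows (θ′ is only a dummy integration variable introduced here, not a symbol from the notes):

```latex
f_{\Theta \mid X}(\theta \mid x)
  = \frac{f_{X \mid \Theta}(x \mid \theta)\, f_{\Theta}(\theta)}
         {\int f_{X \mid \Theta}(x \mid \theta')\, f_{\Theta}(\theta')\, d\theta'}
  \;\propto\; f_{X \mid \Theta}(x \mid \theta)\, f_{\Theta}(\theta)
```

The denominator does not depend on θ, which is why it can be dropped whenever only the shape of the posterior is needed.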

The core components of Bayesian inference are developed carefully. The prior distribution captures beliefs before data collection, the likelihood function connects the observed data to the parameter through the model, and Bayes’ theorem combines the two to produce the posterior. A key computational technique, the Bayesian trick, is introduced: rather than evaluating the normalizing constant explicitly, one identifies the posterior family by matching the kernel of the numerator fX|Θ(x|θ)fΘ(θ) to a known distribution. The posterior summaries discussed include the posterior mean (which minimizes posterior expected squared-error loss), the posterior mode or MAP estimator, the posterior variance and standard deviation, and two types of credible intervals: the percentile-based (equal-tailed quantile) interval and the Highest Posterior Density (HPD) interval. Both carry the direct probabilistic interpretation P(L ≤ Θ ≤ U | data) = 1 − α, which frequentist confidence intervals do not.
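The summaries listed above can be sketched numerically. The following is a minimal illustration, not the notes' own code: it assumes a hypothetical Beta(3, 12) posterior, approximates the mean, standard deviation, and both interval types from Monte Carlo draws, and uses the closed-form Beta mode. The function name `posterior_summaries` and all parameter values are assumptions made for this example.

```python
import math
import random


def posterior_summaries(a, b, level=0.95, n_draws=20000, seed=0):
    """Summaries of a Beta(a, b) posterior (hypothetical helper, assumes a, b > 1)."""
    rng = random.Random(seed)
    draws = sorted(rng.betavariate(a, b) for _ in range(n_draws))

    # Posterior mean and standard deviation, estimated from the draws.
    mean = sum(draws) / n_draws
    sd = math.sqrt(sum((d - mean) ** 2 for d in draws) / (n_draws - 1))

    # Posterior mode (MAP): closed form for a Beta density with a, b > 1.
    mode = (a - 1) / (a + b - 2)

    # Both intervals are windows holding k consecutive sorted draws.
    k = round(level * n_draws)

    # Equal-tailed (percentile) interval: same number of draws cut from each tail.
    tail = (n_draws - k) // 2
    equal_tailed = (draws[tail], draws[tail + k - 1])

    # HPD interval: the shortest window of k draws (valid for a unimodal posterior).
    start = min(range(n_draws - k + 1), key=lambda i: draws[i + k - 1] - draws[i])
    hpd = (draws[start], draws[start + k - 1])

    return {"mean": mean, "sd": sd, "mode": mode,
            "equal_tailed": equal_tailed, "hpd": hpd}


summary = posterior_summaries(3, 12)
for name, value in summary.items():
    print(name, value)
```

Because this posterior is right-skewed, the HPD interval sits closer to zero and is no wider than the equal-tailed interval; for a symmetric unimodal posterior the two would agree up to Monte Carlo error.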

Two fully worked conjugate models form the heart of the notes. In the Beta–Binomial model, a Binomial(n, θ) likelihood combined with a Beta(α, β) prior yields a Beta(α + x, β + n − x) posterior, whose mean (α + x)/(α + β + n) is a weighted average of the prior mean α/(α + β) and the MLE x/n, with weights proportional to the prior's effective sample size α + β and the actual sample size n. A manufacturing defects example compares a flat uniform prior (statistician A) with an informative Beta(20, 2) prior (statistician B) on 50 observations, illustrating concretely how prior strength relative to sample size determines the posterior’s location. In the Normal–Normal model, a Normal likelihood with known variance is combined with a Normal prior on the mean; the posterior precision equals the sum of the data precision nξ₀ and the prior precision ξ_prior, and the posterior mean is the precision-weighted combination of the sample mean and the prior mean. Because the posterior is symmetric and unimodal, the equal-tailed credible interval and the HPD interval coincide in this case. A Poisson model with an improper 1/λ prior is also worked through, showing that even when the prior does not integrate to a finite value, the posterior can still be proper (here a Gamma distribution), provided the data supply enough information, which in this model means at least one nonzero count.
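The three updates described above reduce to a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, the notes do not report the number of defects in this summary, so x = 5 out of n = 50 is an assumed value, and the Normal and Poisson inputs are likewise made up for the example.

```python
def beta_binomial_update(alpha, beta, x, n):
    """Beta(alpha, beta) prior with x successes in n trials -> Beta posterior parameters."""
    return alpha + x, beta + n - x


def normal_normal_update(prior_mean, prior_precision, xbar, n, data_precision):
    """Normal likelihood with known precision and a Normal prior on the mean."""
    post_precision = n * data_precision + prior_precision
    post_mean = (n * data_precision * xbar + prior_precision * prior_mean) / post_precision
    return post_mean, post_precision


def poisson_improper_update(counts):
    """Prior proportional to 1/lambda -> Gamma(shape=sum(counts), rate=len(counts))."""
    total = sum(counts)
    if total == 0:
        # The kernel lambda**(-1) * exp(-n * lambda) is not integrable near zero.
        raise ValueError("posterior is improper when every count is zero")
    return total, len(counts)


# Assumed data for the defects example: x = 5 defects in n = 50 items.
x, n = 5, 50
for label, (a0, b0) in {"A (uniform)": (1, 1), "B (Beta(20, 2))": (20, 2)}.items():
    a1, b1 = beta_binomial_update(a0, b0, x, n)
    # Posterior mean as a weighted average of the prior mean and the MLE.
    w = (a0 + b0) / (a0 + b0 + n)
    weighted = w * a0 / (a0 + b0) + (1 - w) * x / n
    print(label, (a1, b1), a1 / (a1 + b1), weighted)

print(normal_normal_update(prior_mean=0.0, prior_precision=1.0,
                           xbar=2.0, n=4, data_precision=1.0))
print(poisson_improper_update([2, 0, 3]))
```

With only 50 observations, statistician B's prior carries an effective sample size of 22 and pulls the posterior mean well away from the MLE, while statistician A's posterior mean stays close to x/n.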

The second half of the notes addresses computational strategies for posterior analysis. Three approaches are presented in order of increasing generality. The first is direct identification: when the posterior kernel matches a known family, no computation is needed beyond recognizing the form. The second is the Laplace approximation: by expanding the log-posterior q(θ) = log fΘ(θ) + ℓ(θ) to second order around its maximum (the posterior mode θ̂), one obtains a Normal approximation Θ|X ≈ N(θ̂, [−q″(θ̂)]⁻¹), which is particularly accurate when the posterior is sharply peaked and the sample is large.

The third approach is Markov Chain Monte Carlo (MCMC), used when neither analytic identification nor the Laplace approximation is adequate. The Metropolis–Hastings algorithm is developed in full: at each iteration a candidate θ* is drawn from a proposal distribution, and the acceptance ratio r = fΘ|X(θ*|x)/fΘ|X(θₜ₋₁|x) is computed using only the posterior kernel, since the normalizing constant cancels. This form of r applies to a symmetric proposal; an asymmetric proposal multiplies it by the ratio of proposal densities. The candidate is accepted with probability min(r, 1), and otherwise the chain stays at θₜ₋₁, so the chain spends more time in high-density regions while still exploring the full posterior. Practical guidance on tuning the proposal standard deviation is given: too small a value leads to slow diffusion, too large a value leads to excessive rejection. Gibbs sampling is then introduced as an alternative that avoids proposal tuning entirely by iteratively drawing each parameter from its full conditional distribution; in multi-parameter settings where these conditionals are standard distributions (Normal, Gamma), every draw is automatically accepted. Both algorithms produce a Markov chain whose stationary distribution is the target posterior; after discarding an initial burn-in, the retained draws are used to estimate posterior means, variances, modes, and credible intervals numerically. The notes close by comparing the two MCMC methods and providing practical guidance on trace plots, burn-in diagnostics, and posterior histogram interpretation.
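To make the Laplace approximation and the accept/reject step concrete, here is a minimal sketch on a target where the exact answer is known. It assumes a hypothetical Beta(6, 46) posterior and a Gaussian random-walk proposal, which is symmetric, so the ratio of posterior kernels is the full acceptance ratio. The function names, step size, chain length, and burn-in are all illustrative choices, not values from the notes.

```python
import math
import random


def log_kernel(theta, a, b):
    """Log of the unnormalized Beta(a, b) density; -inf outside (0, 1)."""
    if not 0.0 < theta < 1.0:
        return -math.inf
    return (a - 1) * math.log(theta) + (b - 1) * math.log(1 - theta)


def laplace_beta(a, b):
    """Normal approximation N(mode, -1/q''(mode)) for a Beta(a, b) kernel, a, b > 1."""
    mode = (a - 1) / (a + b - 2)
    curvature = -(a - 1) / mode ** 2 - (b - 1) / (1 - mode) ** 2  # q''(mode) < 0
    return mode, -1.0 / curvature


def metropolis(a, b, n_iter=20000, burn_in=2000, step_sd=0.05, start=0.5, seed=1):
    """Random-walk Metropolis sampler; returns retained draws and acceptance rate."""
    rng = random.Random(seed)
    theta, current = start, log_kernel(start, a, b)
    draws, accepted = [], 0
    for t in range(n_iter):
        candidate = rng.gauss(theta, step_sd)         # symmetric proposal
        log_r = log_kernel(candidate, a, b) - current  # normalizing constant cancels
        # Accept with probability min(r, 1); otherwise keep the current value.
        if log_r >= 0 or rng.random() < math.exp(log_r):
            theta, current = candidate, current + log_r
            accepted += 1
        if t >= burn_in:
            draws.append(theta)
    return draws, accepted / n_iter


a, b = 6, 46  # assumed posterior, e.g. a uniform prior with 5 successes in 50 trials
mode, var = laplace_beta(a, b)
draws, rate = metropolis(a, b)
mcmc_mean = sum(draws) / len(draws)
print("Laplace mean and sd:", mode, math.sqrt(var))
print("MCMC mean:", mcmc_mean, "exact mean:", a / (a + b))
print("acceptance rate:", rate)
```

Rerunning with a much smaller or much larger `step_sd` is a quick way to see the tuning trade-off in the acceptance rate. A Gibbs sampler is not shown here because it needs a model with at least two parameters and known full conditionals.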

Full notes: Download PDF