Survey Sampling

Rice Chapter 7.1–7.3

Statistics

Author

Donghyun Ko

Published

May 26, 2026

This lecture note develops the foundations of survey sampling: how to draw valid inferences about population-level quantities from a randomly selected subset when measuring the entire population is infeasible. The notes begin by identifying the four core steps of survey sampling (sample selection, data collection, estimation, and inference) and sharply distinguishing probabilistic sampling designs, in which every unit’s selection probability is known and nonzero, from non-probabilistic schemes such as convenience or volunteer samples. The key point is that only probabilistic designs permit valid standard error and confidence interval calculations, because only then is the randomness of the estimator fully characterized by the sampling mechanism. A concrete example with the heights of university students illustrates how a convenience sample taken at the gym can introduce systematic bias that no increase in sample size can fix.
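The point about convenience samples can be illustrated with a small simulation. This is a minimal sketch, not the example from the notes: the population, the group sizes, and the height distributions are all made-up assumptions, and the convenience sample is modeled as a random draw restricted to gym users, so everyone else has selection probability zero.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Hypothetical population of heights in cm (all numbers are invented):
# 8,000 students who never use the gym and 2,000 who do, the latter taller on average.
non_gym = [rng.gauss(168, 7) for _ in range(8000)]
gym = [rng.gauss(176, 7) for _ in range(2000)]
population = non_gym + gym
mu = statistics.fmean(population)  # the fixed population mean we want to estimate


def srs_mean(n):
    """Mean of a simple random sample (without replacement) from the whole population."""
    return statistics.fmean(rng.sample(population, n))


def gym_sample_mean(n):
    """Mean of a convenience sample: only gym users can ever be selected."""
    return statistics.fmean(rng.sample(gym, n))


# Compare estimation errors as the sample size grows.
results = {n: (srs_mean(n), gym_sample_mean(n)) for n in (50, 200, 1000)}
for n, (srs, conv) in results.items():
    print(f"n={n:5d}  SRS error={srs - mu:+.2f}  gym-sample error={conv - mu:+.2f}")
```

Under these assumptions the SRS error shrinks toward zero as n grows, while the gym-sample error settles near the gap between the gym users' mean and the population mean.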

Five common probabilistic designs are surveyed: Simple Random Sampling (SRS) as the conceptual baseline; stratified sampling, which partitions the population into homogeneous strata and samples independently within each, reducing variance when within-stratum variability is small; cluster sampling, which selects groups of units and observes all members within chosen clusters, trading some efficiency for cost savings while requiring appropriate standard-error formulas; multistage sampling, which chains these designs across stages; and systematic random sampling, which selects every k-th unit after a random start. The note then establishes the fundamental distinction between fixed population parameters (mean µ, total τ, variance σ², and proportion p) and sample statistics, which are random variables (estimators) before the data are collected. For binary (0–1) characteristics, it is shown algebraically that the population mean equals the proportion p and the population variance equals p(1 − p), so proportion estimation is a special case of mean estimation. A conceptual random variable X is introduced as a bookkeeping device for the value of a uniformly drawn population unit. Its mean E[X] = µ and variance Var(X) = σ² are derived from first principles using the frequency representation of the finite population.
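The two identities in this paragraph can be checked by direct computation. The sketch below uses exact fractions and two small made-up populations; the function names are illustrative, not taken from the notes.

```python
from collections import Counter
from fractions import Fraction


def population_mean_var(values):
    """Population mean and variance (divisor N) of a finite population, as exact fractions."""
    N = len(values)
    mu = Fraction(sum(values), N)
    sigma2 = sum((x - mu) ** 2 for x in values) / N
    return mu, sigma2


def moments_from_frequencies(values):
    """E[X] and Var(X) for one uniformly drawn unit, using P(X = v) = count(v) / N."""
    N = len(values)
    pmf = {v: Fraction(c, N) for v, c in Counter(values).items()}
    ex = sum(v * prob for v, prob in pmf.items())
    var = sum((v - ex) ** 2 * prob for v, prob in pmf.items())
    return ex, var


# Hypothetical 0-1 population: 3 units have the characteristic, 7 do not.
binary_pop = [1] * 3 + [0] * 7
p = Fraction(sum(binary_pop), len(binary_pop))
mu, sigma2 = population_mean_var(binary_pop)
print(mu == p, sigma2 == p * (1 - p))  # mean is p, variance is p(1 - p)

# Hypothetical numeric population with repeated values.
numeric_pop = [2, 2, 5, 5, 5, 9]
print(moments_from_frequencies(numeric_pop) == population_mean_var(numeric_pop))
```

Working with `Fraction` rather than floats makes the equalities exact, so the comparison is not affected by rounding.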

The variance and inferential properties of the sample mean are then derived under two designs. Under SRS with replacement (WR), the draws X₁, …, Xₙ are i.i.d., which immediately yields E[X̄] = µ by linearity of expectation and Var(X̄) = σ²/n by additivity of variance for independent draws. The standard error σ/√n decreases at rate 1/√n, so halving it requires quadrupling the sample size. The Central Limit Theorem guarantees that (X̄ − µ)/(σ/√n) converges in distribution to N(0, 1) regardless of the population shape. Since σ is unknown in practice, it is replaced by the sample standard deviation S, which is justified because the Bessel-corrected S² (divisor n − 1) is unbiased and consistent for σ². This produces the studentized statistic (X̄ − µ)/(S/√n), whose limiting normality underpins the CLT-based confidence interval X̄ ± z_{1−α/2} S/√n. The bias of the naive variance estimator (divisor n) is derived in full, showing that its expectation is ((n−1)/n)σ², so it underestimates σ² on average, which motivates the n−1 correction.
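For a tiny population, the WR results above can be verified exactly by enumerating all Nⁿ equally likely ordered samples. This is a sketch under assumed inputs: the population values and n = 3 are arbitrary, and `clt_interval` is a hypothetical helper that only shows the arithmetic of the interval (the CLT approximation itself needs a reasonably large n).

```python
from fractions import Fraction
from itertools import product
from math import sqrt
from statistics import NormalDist, mean, stdev

population = [1, 2, 4, 7]  # hypothetical finite population
N, n = len(population), 3
mu = Fraction(sum(population), N)
sigma2 = sum((x - mu) ** 2 for x in population) / N

# Under SRS with replacement every ordered n-tuple has probability 1 / N**n.
samples = list(product(population, repeat=n))


def expectation(stat):
    """Exact expectation of a statistic over all equally likely WR samples."""
    return sum(Fraction(stat(s)) for s in samples) / len(samples)


def xbar(s):
    return Fraction(sum(s), len(s))


def sum_sq_dev(s):
    m = xbar(s)
    return sum((x - m) ** 2 for x in s)


e_xbar = expectation(xbar)
var_xbar = expectation(lambda s: (xbar(s) - mu) ** 2)
e_naive = expectation(lambda s: sum_sq_dev(s) / n)         # divisor n
e_bessel = expectation(lambda s: sum_sq_dev(s) / (n - 1))  # divisor n - 1

print(e_xbar == mu, var_xbar == sigma2 / n)
print(e_naive == Fraction(n - 1, n) * sigma2, e_bessel == sigma2)


def clt_interval(sample, alpha=0.05):
    """Approximate CI: xbar +/- z_{1-alpha/2} * S / sqrt(n), with S using divisor n - 1."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half = z * stdev(sample) / sqrt(len(sample))
    return mean(sample) - half, mean(sample) + half


lo, hi = clt_interval([1, 2, 3, 4, 5])  # made-up data, far too small for the CLT in practice
print(lo, hi)
```

Enumeration is only feasible for toy sizes, but it confirms the identities without any simulation noise.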

Under SRS without replacement (WOR), observations are dependent because previously selected units are removed from the pool. Each Xᵢ still has the same marginal distribution as a uniformly drawn population element (by the symmetry of SRS), so E[X̄] = µ and the estimator remains unbiased. However, the covariance between any two distinct draws is derived to be −σ²/(N−1), reflecting the negative dependence: selecting a large value leaves slightly fewer large values for subsequent draws. Substituting into the variance decomposition gives Var(X̄) = (σ²/n)·(N−n)/(N−1), where the finite population correction (FPC) factor (N−n)/(N−1) quantifies the variance reduction relative to WR sampling. When the sampling fraction f = n/N is negligible, the FPC is close to 1 and WR formulas suffice; when f is non-negligible, ignoring the FPC leads to conservative (inflated) standard errors. In practice, the variance of X̄ under WOR is estimated by (s²/n)(1 − n/N), where s² is the ordinary sample variance (divisor n − 1), and a finite-population CLT justifies the approximate confidence interval X̄ ± z_{1−α/2} s_X̄, where s_X̄ = (s/√n)√(1 − n/N). The variance estimator for the estimated total T = NX̄ and the proportion estimator p̂ = X̄ for 0–1 data are obtained as immediate corollaries. Numerical examples compare FPC values for different population and sample sizes, making concrete when the WOR correction matters and when it can safely be ignored.
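The WOR formulas can likewise be checked exactly on a toy population by enumerating every ordered sample of distinct units. The sketch below is illustrative only: the population values, n = 2, and the (N, n) pairs in the FPC loop are assumptions chosen for the example, not the numerical examples from the notes.

```python
from fractions import Fraction
from itertools import permutations

population = [1, 2, 4, 7]  # hypothetical finite population
N, n = len(population), 2
mu = Fraction(sum(population), N)
sigma2 = sum((x - mu) ** 2 for x in population) / N

# Under SRS without replacement every ordered n-tuple of distinct units is equally likely.
samples = list(permutations(population, n))


def expectation(stat):
    """Exact expectation of a statistic over all equally likely WOR samples."""
    return sum(Fraction(stat(s)) for s in samples) / len(samples)


def xbar(s):
    return Fraction(sum(s), len(s))


def s2(s):
    """Ordinary sample variance with divisor n - 1."""
    m = xbar(s)
    return sum((x - m) ** 2 for x in s) / (len(s) - 1)


cov_12 = expectation(lambda s: (s[0] - mu) * (s[1] - mu))
var_xbar = expectation(lambda s: (xbar(s) - mu) ** 2)
e_var_hat = expectation(lambda s: s2(s) / n * (1 - Fraction(n, N)))

print(cov_12 == -sigma2 / (N - 1))
print(var_xbar == sigma2 / n * Fraction(N - n, N - 1))
print(e_var_hat == var_xbar)  # (s^2 / n)(1 - n/N) is unbiased for Var(xbar)


def fpc(pop_size, sample_size):
    """Finite population correction (N - n) / (N - 1)."""
    return (pop_size - sample_size) / (pop_size - 1)


# Illustrative sizes: a small sampling fraction versus a large one.
for pop_size, sample_size in [(10_000, 100), (1_000, 100), (100, 50)]:
    print(pop_size, sample_size, round(fpc(pop_size, sample_size), 3))
```

With a 1% sampling fraction the correction is close to 1 and barely changes the standard error, while sampling half the population roughly halves the variance relative to the WR formula.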

Full notes: Download PDF