Testing Hypotheses and Assessing Goodness of Fit
Rice Chapter 9
These lecture notes follow Rice Chapter 9 with a textbook-style exposition that emphasizes explanation over formula collection. They begin by motivating hypothesis testing as a principled framework for deciding whether observed data are sufficiently inconsistent with a baseline claim, introducing the language of null and alternative hypotheses, simple versus composite hypotheses, and the two possible conclusions: reject H₀ or fail to reject H₀. The Bayesian perspective is developed first through Bayes factors: the posterior odds equal the prior odds multiplied by the Bayes factor (for simple hypotheses, the likelihood ratio), which quantifies the relative support the data give to the competing hypotheses. This is illustrated with a detailed binomial notification-app example, including explicit computation of rejection regions and error probabilities. The frequentist perspective is then developed carefully: Type I and Type II errors are defined and interpreted via the justice-system analogy, rejection regions are constructed by controlling the significance level under the null distribution, and the power function is derived and analyzed through a fully worked normal-mean example at RDU airport. The p-value is introduced both as the probability, under H₀, of observing data as extreme as or more extreme than the data actually observed, and as the smallest significance level at which H₀ would be rejected, with explicit warnings about common misconceptions and the distinction between statistical and practical significance.
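As a rough illustration of how a rejection region, the power function, and the p-value fit together, here is a minimal sketch for a right-tailed z-test of a normal mean with known standard deviation. The numbers (mu0 = 10, sigma = 4, n = 25) are hypothetical placeholders chosen for this sketch; they are not the figures from the RDU example in the notes.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, sd 1


def critical_value(mu0, sigma, n, alpha):
    """Cutoff c with P(sample mean > c | mu = mu0) = alpha."""
    return mu0 + STD_NORMAL.inv_cdf(1 - alpha) * sigma / sqrt(n)


def power(mu, mu0, sigma, n, alpha):
    """P(reject H0 | true mean = mu) for H0: mu = mu0 vs H1: mu > mu0."""
    c = critical_value(mu0, sigma, n, alpha)
    return 1 - STD_NORMAL.cdf((c - mu) / (sigma / sqrt(n)))


def p_value(xbar, mu0, sigma, n):
    """Right-tail p-value: P(sample mean >= xbar | mu = mu0)."""
    z = (xbar - mu0) / (sigma / sqrt(n))
    return 1 - STD_NORMAL.cdf(z)


# Hypothetical setup (assumed values, for illustration only).
mu0, sigma, n, alpha = 10.0, 4.0, 25, 0.05

# Power equals alpha at mu = mu0 and increases as the true mean moves
# further into the alternative.
for mu in (10.0, 11.0, 12.0):
    print(f"true mean {mu}: power = {power(mu, mu0, sigma, n, alpha):.3f}")

# A sample mean sitting exactly on the cutoff has p-value equal to alpha,
# which is the "smallest level at which H0 is rejected" reading.
c = critical_value(mu0, sigma, n, alpha)
print(f"cutoff = {c:.3f}, p-value at cutoff = {p_value(c, mu0, sigma, n):.3f}")
```

The Type II error probability at any particular alternative is one minus the power evaluated there, so the same function covers both error rates.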
Two general test statistics are then developed in depth. The Wald test standardizes a consistent estimator under H₀ and uses the asymptotic normality of the MLE, with decision rules tabulated for right-tailed, left-tailed, and two-sided alternatives. The Likelihood Ratio Test (LRT) is introduced for simple hypotheses by comparing likelihoods directly, and the Neyman–Pearson Lemma is stated and interpreted: among all level-α tests of a simple null against a simple alternative, the LRT is the most powerful, which motivates the likelihood ratio as the optimal decision criterion. Uniformly Most Powerful (UMP) tests are discussed for one-sided composite alternatives, with examples for the exponential and binomial families showing that the rejection region depends only on a sufficient statistic. The Generalized Likelihood Ratio Test (GLRT) extends these ideas to composite hypotheses by comparing the maximized likelihood under H₀ to the unrestricted maximum. Wilks’ Theorem establishes that, under H₀ and suitable regularity conditions, −2 ln Λ converges in distribution to a chi-squared distribution whose degrees of freedom equal the difference in parameter dimensions between the unrestricted and null models, making approximate rejection regions and p-values computable without knowing the exact null distribution of Λ. One-sided GLRT variants are handled via the signed root statistic.
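To make the GLRT and Wilks' approximation concrete, the following sketch tests H₀: p = p₀ in a binomial model, where the unrestricted model has one free parameter and the null has none, so the reference distribution is chi-squared with 1 degree of freedom. The data (60 successes in 100 trials) are hypothetical, and the Wald statistic shown alongside uses the estimated standard error, which is one common convention and may differ from the one tabulated in the notes.

```python
from math import erfc, log, sqrt


def xlogy(x, y):
    """x * ln(y), using the convention 0 * ln(0) = 0."""
    return 0.0 if x == 0 else x * log(y)


def chi2_sf_df1(stat):
    """P(chi-squared with 1 df > stat), written via the normal tail."""
    return erfc(sqrt(stat / 2.0))


def binomial_glrt(x, n, p0):
    """Return (-2 ln Lambda, approximate p-value) for H0: p = p0, 0 < p0 < 1."""
    p_hat = x / n  # unrestricted MLE
    stat = 2.0 * (xlogy(x, p_hat / p0) + xlogy(n - x, (1 - p_hat) / (1 - p0)))
    stat = max(stat, 0.0)  # guard against tiny negative rounding error
    return stat, chi2_sf_df1(stat)


def binomial_wald(x, n, p0):
    """Return (z, two-sided p-value); requires 0 < x < n."""
    p_hat = x / n
    z = (p_hat - p0) / sqrt(p_hat * (1 - p_hat) / n)
    return z, erfc(abs(z) / sqrt(2.0))


# Hypothetical data (assumed for illustration): 60 successes in 100 trials.
stat, p_lrt = binomial_glrt(60, 100, 0.5)
z, p_wald = binomial_wald(60, 100, 0.5)
print(f"-2 ln Lambda = {stat:.3f}, approximate p-value = {p_lrt:.4f}")
print(f"Wald z = {z:.3f}, two-sided p-value = {p_wald:.4f}")
```

The two statistics generally give similar but not identical p-values in finite samples; they agree only asymptotically.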
The second half of the notes covers model assessment. The duality between two-sided tests and confidence intervals is proved algebraically: failing to reject H₀ at level α is equivalent to the null value lying inside the 100(1−α)% confidence interval, unifying the two major branches of frequentist inference. Pearson’s χ² goodness-of-fit statistic is derived as a large-sample approximation to the GLRT in the multinomial model, and its degrees of freedom are explained as the number of cells minus one minus the number of estimated parameters. Three detailed examples are solved in full: a coin-fairness test with 17,950 tosses, a binomial fit for grouped coin-toss frequencies revealing a striking excess of 4-head outcomes, and a Poisson fit to the classical Bortkiewicz horse-kick data that fails to reject and illustrates the purpose of goodness-of-fit testing. Normal Q–Q plots are then constructed and interpreted as a visual diagnostic for normality: straight lines indicate approximate normality, while bending patterns reveal heavy tails, light tails, right skewness, and left skewness. The Poisson dispersion test is introduced as a check on the mean-equals-variance constraint, and probability plots for general location-scale families are discussed as richer alternatives to single-number test statistics. The notes conclude with a unified summary emphasizing that likelihood ratios organize virtually all of the chapter’s testing procedures, that confidence intervals and tests are dual constructions, and that formal tests and graphical diagnostics should always be used together.
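A minimal sketch of Pearson's χ² statistic and its degrees-of-freedom rule, applied to a two-cell fair-coin test. The counts (530 heads, 470 tails) are invented for illustration and are not the 17,950-toss data analyzed in the notes. The p-value helper covers only the 1-degree-of-freedom case, which is all this example needs.

```python
from math import erfc, sqrt


def pearson_chi2(observed, probs):
    """Sum over cells of (observed - expected)^2 / expected."""
    n = sum(observed)
    return sum((o - n * p) ** 2 / (n * p) for o, p in zip(observed, probs))


def chi2_sf_df1(stat):
    """P(chi-squared with 1 df > stat); valid only for 1 degree of freedom."""
    return erfc(sqrt(stat / 2.0))


# Hypothetical counts (assumed for illustration): heads, tails.
observed = [530, 470]
null_probs = [0.5, 0.5]   # fair coin, fully specified under H0
estimated_params = 0      # nothing estimated from the data

stat = pearson_chi2(observed, null_probs)
df = len(observed) - 1 - estimated_params  # cells - 1 - estimated parameters
p_value = chi2_sf_df1(stat)
print(f"chi-squared = {stat:.3f} on {df} df, p-value = {p_value:.4f}")

# With two cells, Pearson's statistic is the square of the z-statistic
# that uses the null variance n * p0 * (1 - p0).
n = sum(observed)
z = (observed[0] - n * 0.5) / sqrt(n * 0.5 * 0.5)
print(f"z^2 = {z * z:.3f}")
```

When parameters are estimated from the data, as in the binomial and Poisson fits, `estimated_params` would be set to the number of fitted parameters and the expected counts computed from the fitted model, with a chi-squared tail function for the resulting degrees of freedom.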
Full notes: Download PDF