Quick answer. Hypothesis testing is a procedure for deciding whether the patterns you see in data are likely to be real or could be explained by chance alone. You state two opposing claims (null and alternative), pick a statistical test based on your data type, run the test, and interpret the resulting p-value against a pre-decided significance threshold (typically α = 0.05). If your p-value is below that threshold, you reject the null: data this extreme would be unlikely if chance alone were at work. The procedure does not “prove” anything; it caps the long-run rate of false positives at α.
Hypothesis testing is the most-used statistical framework in social-science, health, and natural-science research. It is also one of the most misused. Many of the controversies in published research over the past decade (HARKing, p-hacking, the replication crisis) trace back to mishandled hypothesis tests. This guide explains the logic step by step, names the most common tests by data type, and walks through a worked example you can adapt for your own dissertation. We use plain language and reserve the formulas for where they matter.
The Core Logic
Hypothesis testing answers one specific question: “If chance alone were producing my data, how likely would I be to see this pattern?” If the answer is “very unlikely”, we infer that something other than chance is at work. The test does not tell you what that something is — that interpretation depends on your study design.
The procedure has five steps:
- State the null hypothesis (H₀) — the claim of “no effect” or “no difference”. E.g., “There is no difference in mean essay scores between students using ChatGPT and students not using it.”
- State the alternative hypothesis (H₁) — the claim you actually care about. E.g., “Mean essay scores differ between the two groups.” Can be two-sided (any difference) or one-sided (a specific direction).
- Pick a significance level (α) — the false-positive risk you’re willing to accept. Convention is 0.05 (5%); some fields use 0.01 or 0.001 for higher-stakes claims.
- Run the appropriate test — calculate a test statistic and the associated p-value.
- Compare — if p < α, reject the null; if p ≥ α, fail to reject. Note: “fail to reject” is not the same as “the null is true”.
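The core question ("if chance alone were producing my data, how likely is this pattern?") can be made concrete with a permutation test. The sketch below is one possible illustration, not the only way to run the five steps: the scores are invented, and `permutation_p_value` is a hypothetical helper name. It shuffles group labels many times to see how often chance alone produces a mean difference at least as large as the observed one.

```python
import random

def permutation_p_value(group_a, group_b, n_permutations=10_000, seed=0):
    """Two-sided permutation test for a difference in means (hypothetical helper)."""
    rng = random.Random(seed)
    n_a = len(group_a)
    observed = abs(sum(group_a) / n_a - sum(group_b) / len(group_b))
    pooled = list(group_a) + list(group_b)
    n_b = len(pooled) - n_a
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # reassign scores to groups at random: "chance alone"
        diff = abs(sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / n_b)
        if diff >= observed:
            extreme += 1
    # Add-one correction keeps the estimated p-value strictly above zero
    return (extreme + 1) / (n_permutations + 1)

# Steps 1-2: H0 = no difference in means; H1 = means differ (two-sided)
# Hypothetical essay scores, invented for illustration only
with_tool = [72, 81, 77, 69, 85, 74, 79, 88]
without_tool = [70, 75, 68, 73, 80, 66, 71, 77]

alpha = 0.05                                                         # step 3
p_value = permutation_p_value(with_tool, without_tool)               # step 4
decision = "reject H0" if p_value < alpha else "fail to reject H0"   # step 5
print(f"p = {p_value:.3f} -> {decision}")
```

The classical tests named later in this guide answer the same question with a formula instead of shuffling, which is faster but relies on distributional assumptions.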
Null and Alternative: The Two Claims
The null hypothesis is what your research is set up to falsify. It states no effect, no difference, no relationship. Researchers do not usually believe the null is true — they design studies to test whether the data can refute it.
The alternative hypothesis is the substantive claim. It states the relationship you predict from theory or prior literature. Phrasing matters:
- Two-sided alternative: “Mean A is different from mean B.” Doesn’t predict direction. Default when prior evidence is mixed.
- One-sided alternative: “Mean A is greater than mean B.” Requires strong prior evidence or theory; increases your statistical power if the direction is correctly predicted, but cannot detect an effect in the opposite direction.
- Point alternative: “Mean A equals 7.5.” Rarely used outside specific industrial-quality applications.
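To see how the choice between a two-sided and a one-sided alternative changes the p-value, here is a minimal sketch using `scipy.stats.ttest_ind` and its `alternative` parameter. The data are simulated and the group means are assumptions chosen for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Hypothetical data: group A simulated with a somewhat higher true mean than B
group_a = rng.normal(loc=78, scale=8, size=40)
group_b = rng.normal(loc=74, scale=8, size=40)

# H1: mean A differs from mean B (no direction predicted)
two_sided = stats.ttest_ind(group_a, group_b, alternative="two-sided")
# H1: mean A is greater than mean B (direction predicted in advance)
one_sided = stats.ttest_ind(group_a, group_b, alternative="greater")

print(f"t = {two_sided.statistic:.2f}")
print(f"two-sided p = {two_sided.pvalue:.4f}")
print(f"one-sided p = {one_sided.pvalue:.4f}")
```

When the observed difference falls in the predicted direction, the one-sided p-value is exactly half the two-sided one. When it falls in the opposite direction, the one-sided p-value is large, so the test cannot flag that effect.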
Type I and Type II Errors
Hypothesis testing produces two error types:
| | H₀ is actually TRUE | H₀ is actually FALSE |
|---|---|---|
| Reject H₀ | TYPE I ERROR (α) — false positive | Correct decision (power = 1−β) |
| Fail to reject H₀ | Correct decision | TYPE II ERROR (β) — false negative |
Type I errors (saying there’s an effect when there isn’t) are controlled by your significance threshold α. Set it lower (0.01 instead of 0.05) and Type I errors drop — at the cost of higher Type II errors.
Type II errors (missing a real effect) depend on sample size, effect size, and α. The probability of avoiding a Type II error is your statistical power (typically targeted at 80% or higher). Run a power analysis before you collect data: underpowered studies are a major contributor to irreproducible results.
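Both error rates can be checked by simulation. The sketch below is illustrative only: `rejection_rate` is a hypothetical helper, and the group size of 60 and the true difference of 0.5 standard deviations are assumptions. It simulates many studies and counts how often a two-sided t-test rejects H₀, first when H₀ is true and then when it is false.

```python
import numpy as np
from scipy import stats

def rejection_rate(true_difference, n_per_group=60, alpha=0.05, n_sims=2000, seed=1):
    """Share of simulated studies in which a two-sided t-test rejects H0."""
    rng = np.random.default_rng(seed)
    # Each row is one simulated study; SD is fixed at 1, so the difference is in SD units
    control = rng.normal(0.0, 1.0, size=(n_sims, n_per_group))
    treated = rng.normal(true_difference, 1.0, size=(n_sims, n_per_group))
    p_values = stats.ttest_ind(treated, control, axis=1).pvalue
    return float(np.mean(p_values < alpha))

# H0 true: every rejection is a false positive (Type I error)
type_i_rate = rejection_rate(true_difference=0.0)
# H0 false: every rejection is a correct decision, so the rate estimates power
power = rejection_rate(true_difference=0.5)

print(f"Type I error rate ~ {type_i_rate:.3f}")
print(f"Power ~ {power:.3f}, Type II error rate ~ {1 - power:.3f}")
```

The first rate should land close to α, and the second shows how much power this particular design has. Rerunning with a smaller `alpha` or a smaller `n_per_group` makes the trade-off between the two error types visible.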
Choosing the Right Test
The test you use depends on three things: the type of data (continuous, ordinal, categorical), the number of groups or variables, and whether observations are independent or paired.
| Research question | Outcome type | Test |
|---|---|---|
| Compare 2 group means (independent) | Continuous | Independent-samples t-test |
| Compare 2 group means (matched/paired) | Continuous | Paired t-test |
| Compare 3+ group means | Continuous | One-way ANOVA |
| Compare means across 2+ factors | Continuous | Two-way / factorial ANOVA |
| Test relationship between 2 continuous variables | Continuous | Pearson correlation, simple linear regression |
| Predict continuous outcome from multiple predictors | Continuous | Multiple linear regression |
| Compare 2 group medians (non-normal) | Ordinal / non-normal | Mann–Whitney U |
| Compare 3+ group medians | Ordinal / non-normal | Kruskal–Wallis H |
| Compare 2 paired distributions (non-normal) | Ordinal / non-normal | Wilcoxon signed-rank |
| Test association between 2 categorical variables | Categorical | Chi-square test of independence |
| Predict binary outcome from predictors | Binary | Logistic regression |
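For readers working in Python, many rows of the table have a direct counterpart in `scipy.stats`. The mapping below is a convenience sketch, not an official correspondence; the dictionary name `SCIPY_TESTS` and the 2×2 counts are made up for illustration. Factorial ANOVA and the multiple and logistic regression models are usually fitted with `statsmodels` instead and are left out.

```python
from scipy import stats

# Table rows that have a direct scipy.stats counterpart
SCIPY_TESTS = {
    "Independent-samples t-test": stats.ttest_ind,
    "Paired t-test": stats.ttest_rel,
    "One-way ANOVA": stats.f_oneway,
    "Pearson correlation": stats.pearsonr,
    "Simple linear regression": stats.linregress,
    "Mann-Whitney U": stats.mannwhitneyu,
    "Kruskal-Wallis H": stats.kruskal,
    "Wilcoxon signed-rank": stats.wilcoxon,
    "Chi-square test of independence": stats.chi2_contingency,
}

# Hypothetical 2x2 table: rows = group, columns = pass / fail counts
observed = [[30, 10],
            [22, 18]]

# For 2x2 tables scipy applies the Yates continuity correction by default
chi2, p_value, dof, expected = stats.chi2_contingency(observed)
print(f"chi2({dof}) = {chi2:.2f}, p = {p_value:.3f}")
```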

Worked Example: ChatGPT and Essay Scores
You want to know whether ChatGPT-assisted outlining changes argumentative-essay scores. You collect data from 120 first-year undergraduates: 60 use ChatGPT for 15 minutes before drafting, 60 do not. Rubric scores are continuous (0–100). Two independent groups, continuous outcome — the test is an independent-samples t-test.
- Hypotheses. H₀: mean scores are equal between groups. H₁: mean scores differ (two-sided).
- α = 0.05.
- Check assumptions. Independence (random assignment ensures this); approximate normality (Shapiro–Wilk on each group); equal variances (Levene’s test). All satisfied.
- Run the test. AI group: mean = 76.4, SD = 8.1, n = 60. Control group: mean = 75.9, SD = 7.6, n = 60. Independent t-test gives t(118) = 0.36, p = 0.72.
- Interpret. p > 0.05. Fail to reject H₀. We do not have evidence that ChatGPT-assisted outlining changes overall essay scores. Cohen’s d = 0.06 (essentially zero).
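The test in this example can be re-run from the reported summary statistics alone. This is a sketch using `scipy.stats.ttest_ind_from_stats`; the variable names are illustrative. Because the means and SDs quoted above are rounded, the recomputed t and p may differ from the reported figures in the second decimal place.

```python
from math import sqrt
from scipy import stats

# Summary statistics reported in the worked example
mean_ai, sd_ai, n_ai = 76.4, 8.1, 60
mean_ctrl, sd_ctrl, n_ctrl = 75.9, 7.6, 60

result = stats.ttest_ind_from_stats(
    mean_ai, sd_ai, n_ai,
    mean_ctrl, sd_ctrl, n_ctrl,
    equal_var=True,  # Student's t-test; the example states that equal variances hold
)
df = n_ai + n_ctrl - 2

# Cohen's d = mean difference divided by the pooled standard deviation
pooled_sd = sqrt(((n_ai - 1) * sd_ai**2 + (n_ctrl - 1) * sd_ctrl**2) / df)
cohens_d = (mean_ai - mean_ctrl) / pooled_sd

print(f"t({df}) = {result.statistic:.2f}, p = {result.pvalue:.2f}, d = {cohens_d:.2f}")
```

With raw scores instead of summary statistics, `scipy.stats.ttest_ind(scores_ai, scores_ctrl)` would give the same test directly.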
What this result does not say: ChatGPT has no effect on essays. The study was powered to detect medium effects (d = 0.5); it cannot rule out small ones. To reliably detect a small effect (d = 0.2) at 80% power you would need a sample of around 800 students.
Common Misinterpretations
- “p > 0.05 means no effect.” Wrong. It means insufficient evidence of effect under your sample size. Absence of evidence is not evidence of absence.
- “p < 0.05 means the effect is important.” Wrong. p-values measure statistical reliability, not practical magnitude. A large enough sample finds tiny meaningless effects significant. Report effect sizes (Cohen’s d, r, η², odds ratios) alongside p-values.
- “The p-value is the probability the null is true.” Wrong. The p-value is the probability of seeing data at least this extreme given the null is true. The two probabilities, P(data | H₀) and P(H₀ | data), are not interchangeable.
- “Statistically significant proves causation.” Wrong. Significance in a correlational design supports only an association. Causation requires experimental control or a strong identification strategy (instrumental variables, regression discontinuity, etc.).
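The second misinterpretation is easy to demonstrate. In the sketch below (simulated data; the true difference of 0.02 standard deviations and the group size are assumptions chosen to make the point), a very large sample makes a negligible effect statistically significant while the effect size stays tiny.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n = 200_000  # per group; deliberately huge
# Hypothetical populations whose true means differ by 0.02 standard deviations
group_a = rng.normal(loc=0.02, scale=1.0, size=n)
group_b = rng.normal(loc=0.00, scale=1.0, size=n)

result = stats.ttest_ind(group_a, group_b)

# Cohen's d with a pooled SD (simple average of variances is valid for equal group sizes)
pooled_sd = np.sqrt((group_a.var(ddof=1) + group_b.var(ddof=1)) / 2)
cohens_d = (group_a.mean() - group_b.mean()) / pooled_sd

print(f"p = {result.pvalue:.2g}, Cohen's d = {cohens_d:.3f}")
```

The p-value clears any conventional threshold, yet a d this small would rarely matter in practice, which is why effect sizes belong next to p-values in a report.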
Power Analysis Before You Collect Data
Run a power analysis before recruitment, not after. Inputs:
- Minimum effect size of interest — the smallest effect you would consider meaningful in your field.
- Significance level α — usually 0.05.
- Desired power — usually 0.80 (80%).
- Test type — t-test, ANOVA, regression, etc.
Output: the sample size you need. Free tools (G*Power, R’s `pwr` package, Python’s `statsmodels`) handle the calculation. If the required N is impossible given your timeline or funding, restructure the study now — collecting data with insufficient power wastes participants’ time.
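As one way to run this calculation in Python, the sketch below uses `TTestIndPower` from `statsmodels` for a two-sided independent-samples t-test. The helper name `n_per_group` is hypothetical, and the three effect sizes are Cohen's conventional large, medium, and small benchmarks rather than values specific to any field.

```python
from math import ceil
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Participants needed in EACH group for a two-sided independent t-test."""
    n = analysis.solve_power(
        effect_size=effect_size,  # minimum effect of interest, as Cohen's d
        alpha=alpha,
        power=power,
        ratio=1.0,                # equal group sizes
        alternative="two-sided",
    )
    return ceil(n)  # round up: you cannot recruit a fraction of a participant

for d in (0.8, 0.5, 0.2):
    per_group = n_per_group(d)
    print(f"d = {d}: {per_group} per group, {2 * per_group} in total")
```

The required sample grows quickly as the effect of interest shrinks, which is why the smallest meaningful effect is the input that deserves the most thought.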
Pre-registration and HARKing
Decide your hypotheses, your test, and your significance level BEFORE you look at the data. If you change them afterward to fit what you found, you are HARKing (Hypothesising After Results are Known) — a research-integrity violation.
Pre-register your study on OSF, AsPredicted, or ClinicalTrials.gov. Pre-registration is now expected by many high-impact journals (BMJ, Nature, PLOS Biology) and is the strongest defense against accusations of p-hacking. For exploratory analyses outside the pre-registration, label them clearly as exploratory.
Reporting Hypothesis Tests
Every hypothesis test in a paper or dissertation needs:
- The test name and version (Welch’s t-test, two-tailed; Pearson chi-square with Yates correction; etc.)
- The test statistic with degrees of freedom (t(118) = 0.36; F(2, 117) = 4.21)
- The exact p-value (p = 0.72, not p < 0.05)
- The effect size and its 95% confidence interval
- Whether the test was pre-registered or exploratory
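A small helper can assemble most of these elements into one sentence. The sketch below is an assumption-laden illustration: `report_t_test` is a hypothetical function, it covers only Student's t-test from summary statistics, and it reports the confidence interval for the raw mean difference rather than for Cohen's d.

```python
from math import sqrt
from scipy import stats

def report_t_test(mean1, sd1, n1, mean2, sd2, n2, confidence=0.95):
    """Format a Student's t-test result from summary statistics (hypothetical helper)."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / df
    se_diff = sqrt(pooled_var * (1 / n1 + 1 / n2))       # standard error of the difference
    diff = mean1 - mean2
    t_stat = diff / se_diff
    p_value = 2 * stats.t.sf(abs(t_stat), df)            # two-tailed p-value
    t_crit = stats.t.ppf(1 - (1 - confidence) / 2, df)   # critical value for the CI
    ci_low, ci_high = diff - t_crit * se_diff, diff + t_crit * se_diff
    d = diff / sqrt(pooled_var)                          # Cohen's d
    return (
        f"Student's t-test, two-tailed: t({df}) = {t_stat:.2f}, p = {p_value:.2f}, "
        f"mean difference = {diff:.2f}, {confidence:.0%} CI [{ci_low:.2f}, {ci_high:.2f}], "
        f"d = {d:.2f}"
    )

# Summary statistics from the worked example in this guide
report = report_t_test(76.4, 8.1, 60, 75.9, 7.6, 60)
print(report)
```

Whether the test was pre-registered or exploratory still has to be stated by hand, since no calculation can infer it.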
See our deeper guide to the difference between r and p values for how to choose and report effect sizes correctly.
Frequently Asked Questions
Is hypothesis testing the only way to analyse quantitative data?
No. Bayesian inference (posterior probabilities, Bayes factors) and effect-size estimation (with confidence intervals) are increasingly used alternatives. Most Canadian programmes still teach frequentist hypothesis testing as the baseline; advanced quantitative students are expected to know both approaches.
What if my data is not normally distributed?
Use a non-parametric alternative: Mann–Whitney U instead of t-test, Kruskal–Wallis instead of ANOVA, Spearman’s rho instead of Pearson’s r. Or transform the data (log, square-root) if a meaningful transformation exists. Many parametric tests are robust to mild non-normality with large samples (N > 30 per group).
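Both options can be tried side by side. The following sketch uses simulated right-skewed data (the lognormal parameters are arbitrary assumptions) to run a Shapiro–Wilk check, then a Mann–Whitney U test and a t-test on log-transformed values.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
# Hypothetical right-skewed outcomes (e.g. response times), invented for illustration
group_a = rng.lognormal(mean=1.0, sigma=0.6, size=25)
group_b = rng.lognormal(mean=1.3, sigma=0.6, size=25)

# Shapiro-Wilk: a small p-value suggests the sample departs from normality
for name, sample in (("A", group_a), ("B", group_b)):
    w, p_norm = stats.shapiro(sample)
    print(f"group {name}: W = {w:.3f}, p = {p_norm:.3f}")

# Option 1: rank-based test that does not assume normality
u_stat, p_mwu = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")

# Option 2: t-test on log-transformed values (meaningful here: data are positive and skewed)
t_stat, p_log = stats.ttest_ind(np.log(group_a), np.log(group_b))

print(f"Mann-Whitney U = {u_stat:.1f}, p = {p_mwu:.3f}")
print(f"t-test on logs: t = {t_stat:.2f}, p = {p_log:.3f}")
```

Decide which option you will use before looking at the results; running both and reporting whichever gives the smaller p-value is a form of p-hacking.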
Should I use a one-tailed or two-tailed test?
Two-tailed by default. One-tailed only if you have strong prior evidence the effect can only go in one direction, and only if you pre-registered the prediction. A one-tailed test halves your p-value when the effect falls in the predicted direction, but examiners are sceptical of post-hoc one-tailed claims.
Can I run multiple tests on the same data?
Yes, but each additional test increases your family-wise Type I error rate. Adjust with Bonferroni (divide α by number of tests), Holm–Bonferroni, or false-discovery-rate control (Benjamini–Hochberg). For 20 simultaneous tests at α = 0.05, expect one significant result by chance alone if no real effects exist.
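The arithmetic behind these corrections fits in a few lines of plain Python. This is a sketch with hypothetical function names and invented p-values; the family-wise formula assumes the tests are independent. In practice, `multipletests` in `statsmodels` offers the same corrections ready-made.

```python
def family_wise_error_rate(alpha, n_tests):
    """P(at least one false positive) across independent tests when every null is true."""
    return 1 - (1 - alpha) ** n_tests

def bonferroni_reject(p_values, alpha=0.05):
    """Bonferroni: compare each p-value against alpha divided by the number of tests."""
    m = len(p_values)
    return [p < alpha / m for p in p_values]

def benjamini_hochberg_reject(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure controlling the false-discovery rate."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k (1-based) with p_(k) <= (k / m) * alpha
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_k = rank
    # Reject the k smallest p-values
    reject = [False] * m
    for idx in order[:max_k]:
        reject[idx] = True
    return reject

alpha, n_tests = 0.05, 20
expected_false_positives = alpha * n_tests
fwer = family_wise_error_rate(alpha, n_tests)
print(f"expected false positives: {expected_false_positives:.1f}")
print(f"family-wise error rate:   {fwer:.2f}")

# Hypothetical p-values from five tests, invented for illustration
p_values = [0.001, 0.012, 0.021, 0.038, 0.300]
bonferroni = bonferroni_reject(p_values)
bh = benjamini_hochberg_reject(p_values)
print("Bonferroni:        ", bonferroni)
print("Benjamini-Hochberg:", bh)
```

Bonferroni is the most conservative of the three named methods, so it rejects the fewest nulls here; false-discovery-rate control tolerates a small share of false positives in exchange for more power.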
Where do I run hypothesis tests?
SPSS is the most-used in Canadian dissertations (click-through menus). R is preferred in academic publication (free, replicable scripts). Stata is common in economics and epidemiology. Python (scipy.stats, statsmodels) is rising. The choice depends on your supervisor’s expectation and your discipline’s norms. Our data-analysis software guide compares them.




