Quick answer. A p-value tells you how surprising your data would be if there were no effect; an effect size tells you how big the effect is. A small p-value with a tiny effect size is statistically significant but may be practically meaningless. A non-significant p-value with a large effect size often points to an underpowered study rather than the absence of an effect. Good reporting includes both: every significance test should be accompanied by an effect size (Cohen’s d for mean differences, r for correlations, odds ratios for binary outcomes) and its 95% confidence interval. If you must rank which matters more for interpretation, effect size wins.
One of the most common mistakes in quantitative research is treating “statistically significant” as a synonym for “important”. The American Statistical Association issued a formal warning against this conflation in 2016. Journals from Nature to BMJ now require effect sizes alongside p-values. This guide explains why both metrics are needed, what each one tells you that the other cannot, and how to report them correctly in a dissertation or paper.
What the P-value Actually Tells You
The p-value is the probability of observing data at least as extreme as yours, assuming the null hypothesis is true. Three things follow:
- It is a probability about your DATA under a hypothesis, not a probability about the hypothesis itself.
- It is sensitive to sample size. Large samples can detect tiny effects as “significant”.
- It does not tell you how large the effect is, how important it is, or how likely it is to replicate.
The conventional threshold (p < 0.05) is a social convention, not a natural cutoff. Fisher introduced 0.05 as a “convenient” guideline; we’ve been arguing about it ever since.
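To make the definition concrete, here is a minimal sketch of an exact two-sided p-value for a coin-flip experiment, assuming the null hypothesis is a fair coin. The function name `binomial_two_sided_p` and the example of 60 heads in 100 flips are illustrative choices, not taken from any particular library.

```python
from math import comb

def binomial_two_sided_p(successes: int, trials: int, null_prob: float = 0.5) -> float:
    """Exact two-sided p-value: total probability, under the null, of every
    outcome that is no more likely than the observed one."""
    def pmf(k: int) -> float:
        return comb(trials, k) * null_prob**k * (1 - null_prob) ** (trials - k)

    observed = pmf(successes)
    # Add up outcomes at least as extreme as the observed count; the small
    # tolerance guards against floating-point ties.
    total = sum(pmf(k) for k in range(trials + 1) if pmf(k) <= observed * (1 + 1e-9))
    return min(1.0, total)

p = binomial_two_sided_p(60, 100)
print(f"p = {p:.3f}")
```

The number printed is a statement about the data given a fair coin. It is not the probability that the coin is fair.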
What Effect Size Tells You
An effect size measures the magnitude of a difference or relationship on a scale that does not grow with sample size (although the precision of the estimate does). Common metrics by test type:
| Comparison type | Effect size metric | Small / Medium / Large (Cohen) |
|---|---|---|
| Two-group mean difference | Cohen’s d | 0.2 / 0.5 / 0.8 |
| Correlation between two continuous variables | Pearson’s r | 0.1 / 0.3 / 0.5 |
| 3+ group mean differences (ANOVA) | Eta-squared (η²) or omega-squared | 0.01 / 0.06 / 0.14 |
| Binary outcome (2×2 table) | Odds ratio, risk ratio, Cohen’s h | 1.5 / 2.0 / 3.0 (OR) |
| Predictor importance in regression | Standardised β, semipartial r² | field-specific |
| Categorical association | Cramér’s V or Phi | 0.1 / 0.3 / 0.5 |
Cohen’s thresholds (small / medium / large) are heuristics, not absolutes. A “small” effect of 0.1 may matter enormously in clinical medicine where it represents lives saved; a “large” effect of 0.8 may be unremarkable in a field where comparable interventions routinely produce effects of 1.0 or more.
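As a rough illustration of the first row of the table, the sketch below computes Cohen’s d from two raw samples using the pooled standard deviation, then maps it onto Cohen’s heuristic bands. The function names and the scores are made up for the example; they are not from a real study.

```python
from math import sqrt
from statistics import mean, variance

def cohens_d(group_a: list[float], group_b: list[float]) -> float:
    """Standardised mean difference using the pooled sample SD."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = ((n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)

def cohen_label(d: float) -> str:
    """Map |d| onto Cohen's heuristic bands (0.2 / 0.5 / 0.8)."""
    size = abs(d)
    if size < 0.2:
        return "below small"
    if size < 0.5:
        return "small"
    if size < 0.8:
        return "medium"
    return "large"

# Made-up scores for illustration only.
treatment = [78, 82, 75, 90, 85, 79, 88, 81]
control = [74, 80, 72, 83, 79, 76, 82, 77]
d = cohens_d(treatment, control)
print(f"d = {d:.2f} ({cohen_label(d)})")
```

The label is only a starting point: whether a given d matters still depends on the field and the outcome being measured.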
Four Worked Combinations
| Scenario | P-value | Effect size | Sample | Interpretation |
|---|---|---|---|---|
| A | p = 0.03 | d = 0.85 | n = 60 | Statistically significant and large. The strongest of the four findings, although the confidence interval will still be fairly wide at this sample size. |
| B | p = 0.001 | d = 0.05 | n = 5,000 | Statistically significant, practically trivial. The large sample makes even a tiny effect detectable. Report the effect size and state explicitly that the result is not practically meaningful. |
| C | p = 0.18 | d = 0.70 | n = 25 | Inconclusive and probably underpowered. The estimate is medium-to-large but too imprecise to distinguish from zero. Replicate with a larger N. |
| D | p = 0.65 | d = 0.04 | n = 800 | Good evidence against a meaningful effect. The sample is large enough to estimate d precisely, and both p and d point the same way. |
The four scenarios show why you need both numbers. Scenario B looks impressive on p alone and worthless on d. Scenario C looks worthless on p alone and worth replicating on d. Reporting only one number gives the wrong impression.
Confidence Intervals: The Third Metric
A 95% confidence interval around an effect size shows the range of effects compatible with your data. A study reporting d = 0.50 with a 95% CI of [0.30, 0.70] is far more informative than d = 0.50 alone — the interval excludes “no effect” and is reasonably narrow.
Three things to read off a CI:
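- Does it cross zero? If yes, the result is not statistically significant at α = 0.05. (For ratio measures such as odds ratios, the no-effect value is 1 rather than 0.)
- How wide is it? A narrow CI (e.g., [0.45, 0.55]) signals precision; a wide one (e.g., [0.10, 0.90]) signals an imprecise estimate, usually from a small sample.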
- Where is the lower bound? If the lower bound is above your minimum effect of interest, the result is both statistically significant and practically meaningful.
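The three checks above can be sketched in a few lines. This example assumes a common large-sample approximation for the standard error of Cohen’s d; exact intervals based on the noncentral t distribution will differ slightly. The group sizes of 200 and the minimum effect of interest of 0.20 are hypothetical values chosen for illustration.

```python
from math import sqrt
from statistics import NormalDist

def d_confidence_interval(d: float, n1: int, n2: int, level: float = 0.95) -> tuple[float, float]:
    """Approximate CI for Cohen's d using a large-sample standard error."""
    se = sqrt((n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2)))
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return d - z * se, d + z * se

def read_ci(lower: float, upper: float, min_effect: float) -> dict[str, object]:
    """Answer the three questions for an interval on a difference scale (null = 0)."""
    return {
        "crosses_zero": lower <= 0 <= upper,
        "width": upper - lower,
        "lower_bound_exceeds_min_effect": lower > min_effect,
    }

# Hypothetical study: d = 0.50 with 200 participants per group.
low, high = d_confidence_interval(d=0.50, n1=200, n2=200)
print(f"95% CI [{low:.2f}, {high:.2f}]", read_ci(low, high, min_effect=0.20))
```

The same `read_ci` logic applies to any effect size measured as a difference; for ratio measures the comparison value would be 1 instead of 0.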
Why Sample Size Distorts P-values
Most introductory texts skip this point, but it explains many misleading findings: with a large enough sample, any non-zero effect (no matter how tiny) becomes statistically significant. A height difference of 0.1 cm between two groups is statistically significant if you sample a million people. It is not interesting.
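The sample-size effect described above is easy to see numerically. The sketch below holds a tiny standardised effect fixed and recomputes an approximate p-value as the sample grows. It assumes two equal-sized independent groups and a normal approximation to the test statistic; the function name and the chosen effect of 0.02 are illustrative.

```python
from math import sqrt
from statistics import NormalDist

def approx_two_sided_p(d: float, n_per_group: int) -> float:
    """Normal-approximation p-value for two independent groups of equal size."""
    z = abs(d) * sqrt(n_per_group / 2)  # test statistic implied by d and n
    return 2 * (1 - NormalDist().cdf(z))

tiny_effect = 0.02  # a standardised difference far below Cohen's "small"
for n in (100, 10_000, 1_000_000):
    print(f"n per group = {n:>9,}: p = {approx_two_sided_p(tiny_effect, n):.4f}")
```

The effect size never changes across the three runs; only the sample size does, which is why the p-value alone cannot tell you whether a result matters.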
The fix is to pre-decide what effect size matters in your field. In medical trials, a minimal clinically important difference (MCID) is defined before recruitment. In education, an effect-size benchmark of 0.4 (sometimes interpreted as roughly one year of additional learning) is sometimes used. Decide your threshold of practical importance before you analyse, then check both the p-value AND the effect size against it.
When Effect Size Alone Is Enough
Some modern statisticians argue for abandoning p-values entirely in favour of effect-size estimation. The “New Statistics” movement (Cumming, 2014) advocates:
- Report effect sizes with confidence intervals as the primary result.
- Skip the binary “significant / not significant” verdict.
- Plan studies to estimate effect sizes precisely, not to “reject the null”.
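Planning for precision rather than for rejecting the null can be sketched as follows. This is a rough calculation, assuming two equal groups and the large-sample variance approximation for Cohen’s d, (2 + d²/4) / n per group; it is not a substitute for a dedicated planning tool. The function name, the expected effect, and the target half-width are hypothetical.

```python
from math import ceil
from statistics import NormalDist

def n_per_group_for_precision(expected_d: float, half_width: float, level: float = 0.95) -> int:
    """Equal-group sample size so the CI for d has roughly the target half-width."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    # With n per group, the approximate variance of d is (2 + d^2 / 4) / n.
    return ceil((z / half_width) ** 2 * (2 + expected_d**2 / 4))

# Hypothetical target: estimate an expected d of 0.5 to within +/- 0.2.
print(n_per_group_for_precision(expected_d=0.5, half_width=0.2))
```

Halving the target half-width roughly quadruples the required sample, which is why precise estimation is expensive.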
This is increasingly the norm in psychology, neuroscience, and pre-clinical research. In a dissertation, follow your supervisor’s convention — most still expect both metrics reported. Bayesian researchers report posterior distributions instead.
How to Report Both Correctly
Standard reporting form for a comparison of two means:
Students in the ChatGPT condition (M = 76.4, SD = 8.1) did not differ significantly from controls (M = 75.9, SD = 7.6), t(118) = 0.36, p = .72, Cohen’s d = 0.06, 95% CI [−0.30, 0.42].
Notes:
- Report exact p (p = .72 not p > .05) per APA 7 and most journals.
- Include the effect size with its 95% CI.
- The CI [−0.30, 0.42] crosses zero, which agrees with the non-significant p.
- The CI bounds tell the reader what the study can rule out: the upper bound of 0.42 excludes medium-to-large effects, but small effects remain compatible with the data.
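The effect size and test statistic in a report like the one above can be recomputed from the summary statistics alone. The sketch below assumes 60 participants per group, which is inferred from t(118) and is not stated in the example. Small differences from the reported values can arise because the means and SDs are themselves rounded.

```python
from math import sqrt

def d_and_t_from_summary(m1: float, sd1: float, n1: int,
                         m2: float, sd2: float, n2: int) -> tuple[float, float, int]:
    """Cohen's d (pooled SD), Student's t, and degrees of freedom from summary statistics."""
    df = n1 + n2 - 2
    pooled_sd = sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / df)
    d = (m1 - m2) / pooled_sd
    t = d / sqrt(1 / n1 + 1 / n2)
    return d, t, df

# Group sizes of 60 each are an assumption inferred from t(118).
d, t, df = d_and_t_from_summary(76.4, 8.1, 60, 75.9, 7.6, 60)
print(f"t({df}) = {t:.2f}, d = {d:.2f}")
```

A check like this is a quick way to catch transcription errors before submitting a results section.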
Common Errors
- Reporting only p: discards the magnitude and the precision of the effect.
- Reporting only the effect size: without the CI, readers can’t distinguish a precise estimate from a noisy guess.
- Calling p = 0.06 “marginally significant”: this language is falling out of use. Either pre-declare your threshold and stick to it, or report the exact p and let the reader decide.
- Mistaking p < 0.001 for a “stronger effect”: it only means the data are less compatible with the null hypothesis. The effect size answers the strength question.
- Cherry-picking the test that gives the lowest p: this is p-hacking. Pre-register your analysis.
Frequently Asked Questions
Is a smaller p-value better than a larger effect size?
No. A smaller p-value means more reliable evidence the effect is non-zero. A larger effect size means the effect itself is bigger. Both matter; neither substitutes for the other.
Why does my paper need to report effect sizes if it has significant p-values?
Because journals and reviewers expect it. APA 7 and many journal guidelines ask for an effect size alongside every significance test. Without one, your paper is likely to be returned for revision.
What’s the difference between Cohen’s d and Hedges’ g?
Hedges’ g is a small-sample correction to Cohen’s d. With n > 50 per group the difference is trivial. Report g when sample sizes are small (n < 30 per group).
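To see how quickly the correction fades, the sketch below applies the usual approximate correction factor, 1 − 3 / (4·df − 1) with df = n1 + n2 − 2, to a fixed d of 0.50. The function name is illustrative, and the exact correction (based on gamma functions) differs only marginally from this approximation.

```python
def hedges_g(d: float, n1: int, n2: int) -> float:
    """Apply the approximate small-sample correction to Cohen's d."""
    df = n1 + n2 - 2
    correction = 1 - 3 / (4 * df - 1)
    return d * correction

for n in (10, 30, 60):
    print(f"n per group = {n}: d = 0.50 -> g = {hedges_g(0.50, n, n):.3f}")
```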
How do I calculate effect size in SPSS?
Recent versions of SPSS (27 and later) report Cohen’s d and Hedges’ correction in the t-test output. In older versions you compute d yourself from the means and pooled SD, using a calculator or syntax extension. In R, use the `effsize` or `effectsize` package; in Python, `pingouin`. Our data-analysis software guide covers each.
If my effect size is 0.20 (small) is my study worthless?
No. Small effects can matter enormously when scaled to large populations (vaccine efficacy, education policy, public-health interventions). Context determines whether 0.20 is meaningful.
Should I report effect size for non-significant results?
Yes — especially for non-significant results. The effect size and its CI tell readers whether the study was underpowered (large effect, wide CI, non-significant p) or genuinely null (small effect, narrow CI). Both are useful contributions to the literature.




