A/B testing in SEO: methodology that resists noise
How to run SEO tests with real statistical significance, separating signal from seasonal, algorithmic, and plain-luck noise.
Most SEO tests I see test nothing. Someone swaps a title tag on Tuesday, sees traffic jump 12% by Thursday, and declares victory in Slack. That same week Google rolled out a core update, the main competitor dropped thirty positions, and it was a US holiday. You did not measure the title change; you measured chaos. Serious A/B testing in SEO starts by admitting that the channel is noisy by nature. Without methodology, you are reading static as if it were signal. This is a field manual for tests that survive a skeptical CFO.
First, understand that SEO does not allow classic user-level split tests. You cannot serve URL A to half of Googlebot and URL B to the other half. What you actually run is a page-level split test: divide a homogeneous set of URLs into control and treatment groups, apply the change to the treatment half only, then compare the clicks the treated pages actually received against a forecast of what they would have received without the change. Tools like SearchPilot and GrowthBook with the SEO plugin do this, but you can replicate the logic with BigQuery and GSC. Before testing anything, work through "How to audit on-page SEO without falling into guesswork" to make sure your baseline is clean, because testing on garbage only produces statistically significant garbage.
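The control/treatment split described above can be sketched in a few lines. This is a minimal illustration, not how SearchPilot or any other tool does it: it assumes you already have monthly organic clicks per URL, and it uses traffic-matched pairs (rank pages by clicks, then flip a coin inside each adjacent pair) so both arms start with similar volume. The function name, the seed, and the sample URLs are all made up for the example.

```python
import random


def split_pages(clicks_by_url, seed=42):
    """Assign URLs to control/treatment in traffic-matched pairs.

    clicks_by_url: dict mapping URL -> monthly organic clicks (assumed input shape).
    Returns (control, treatment) lists of equal length.
    """
    rng = random.Random(seed)  # fixed seed so the assignment is reproducible
    ranked = sorted(clicks_by_url, key=clicks_by_url.get, reverse=True)
    control, treatment = [], []
    # Walk the ranking two pages at a time and randomize within each pair.
    for i in range(0, len(ranked) - 1, 2):
        pair = [ranked[i], ranked[i + 1]]
        rng.shuffle(pair)
        control.append(pair[0])
        treatment.append(pair[1])
    # With an odd number of pages, the lowest-traffic page is left out of the test.
    return control, treatment


# Hypothetical PLPs from one category.
pages = {f"/category/shoes/page-{i}": 500 + 37 * i for i in range(10)}
control, treatment = split_pages(pages)
print(len(control), len(treatment))
```

Fixing the seed and storing the resulting lists before the change ships makes it harder to reshuffle the groups after seeing the data.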
Choosing the test set is where 80% of experiments die. You need pages with enough volume, similar behavior, and no cannibalization. For a large ecommerce site, that usually means PLPs from the same category with at least 500 organic clicks per month each. If you only have 50 relevant pages, forget statistical testing and do a careful before/after with causal analysis. With small sets, changes like the ones in "Title tags that convert: 7 patterns tested on real SERPs" and "Does meta description still matter? What CTR data shows" may move the needle, but you cannot prove it with a p-value. Honesty here saves credibility later.
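As a rough sketch of that screening step, the snippet below filters candidate pages by the 500-clicks threshold and drops pages that look cannibalized. Two things are assumptions of mine, not something the article defines: the input shape (one `(url, query, monthly_clicks)` tuple per row, roughly what a GSC export can be reshaped into) and the cannibalization rule (a page is excluded if it shares any query with another candidate page). Real cannibalization checks are usually more nuanced.

```python
from collections import defaultdict

MIN_MONTHLY_CLICKS = 500  # threshold quoted in the text


def eligible_pages(rows, min_clicks=MIN_MONTHLY_CLICKS):
    """Return URLs with enough volume that do not share a query with another URL.

    rows: iterable of (url, query, monthly_clicks) tuples (hypothetical export shape).
    """
    clicks_by_url = defaultdict(int)
    urls_by_query = defaultdict(set)
    for url, query, clicks in rows:
        clicks_by_url[url] += clicks
        urls_by_query[query].add(url)
    # Assumption: sharing any query with another candidate counts as cannibalization.
    cannibalized = {
        url for urls in urls_by_query.values() if len(urls) > 1 for url in urls
    }
    return sorted(
        url
        for url, clicks in clicks_by_url.items()
        if clicks >= min_clicks and url not in cannibalized
    )


# Made-up rows for illustration only.
rows = [
    ("/shoes/running", "running shoes", 620),
    ("/shoes/trail", "trail shoes", 540),
    ("/shoes/kids", "kids shoes", 310),  # below the volume threshold
    ("/shoes/sale", "cheap running shoes", 700),
    ("/shoes/outlet", "cheap running shoes", 650),  # competes with /shoes/sale
]
print(eligible_pages(rows))  # ['/shoes/running', '/shoes/trail']
```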
Google's CausalImpact, built on Bayesian structural time series (BSTS), became the standard because it builds a counterfactual from correlated series. You feed control URL traffic as covariates, the model learns the pre-period relationship between control and treated traffic, and it projects what would have happened to the treated URLs if nothing had changed. The effect is the gap between observed and projected clicks, reported with a credible interval. Run at least 21 days before and 21 days after the change, ideally six weeks on each side. Structural changes like the ones covered in "Headings H1-H6: the structure Google actually reads" and "Canonical tags: common mistakes bleeding your organic traffic" need even longer because Google takes time to reprocess and the effect is not linear.
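To make the counterfactual idea concrete, here is a deliberately simplified stand-in. It is not BSTS and not the CausalImpact package: it fits a plain least-squares line of treated clicks on control clicks over the pre-period, projects that line over the post-period, and reports the relative gap. It gives no interval at all, so treat it as a way to see the mechanics, not as an analysis method. All function names and the synthetic series are assumptions for the example.

```python
from statistics import mean


def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    mx, my = mean(x), mean(y)
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sum(
        (xi - mx) ** 2 for xi in x
    )
    return my - slope * mx, slope


def estimated_lift(control_pre, treated_pre, control_post, treated_post):
    """Relative gap between observed treated clicks and the projected counterfactual."""
    intercept, slope = fit_line(control_pre, treated_pre)
    # Counterfactual: what treated pages would have done, given control behavior.
    expected = [intercept + slope * c for c in control_post]
    return sum(treated_post) / sum(expected) - 1.0


# Synthetic daily clicks, 21 days each side, with a weekly pattern.
control_pre = [1000 + 50 * (d % 7) for d in range(21)]
treated_pre = [0.8 * c + 20 for c in control_pre]
control_post = [1100 + 50 * (d % 7) for d in range(21)]  # market-wide shift hits both groups
treated_post = [1.05 * (0.8 * c + 20) for c in control_post]  # 5% lift built in

lift = estimated_lift(control_pre, treated_pre, control_post, treated_post)
print(round(lift, 3))
```

Because the synthetic data has a 5% lift baked in and no noise, the sketch recovers roughly 0.05 even though the control group itself shifted between periods. Real series are noisy, which is exactly why you want the interval that a proper BSTS model provides.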
Watch out for three traps that invalidate everything. First: control contamination, when you touch interlinking or a global template and the effect leaks into the control group. Second: window cherry-picking, when you run the test for 90 days and pick the best 45. Third: multiple hypotheses without Bonferroni correction, when you test 20 variables and celebrate the one with p<0.05 (with 20 independent tests and no real effect, you expect one false positive on average, and the chance of at least one is about 64%). Writing the hypothesis and window down before you start is the only defense. For tests like the ones in "Image optimization: alt text, weight and LCP in practice" and "Core Web Vitals: beyond LCP, what actually moves the needle", contamination is worse because CDN changes hit the whole site.
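The third trap is easy to check with arithmetic. The sketch below computes the chance of at least one false positive across several tests, assuming the tests are independent, and applies the Bonferroni rule of dividing alpha by the number of hypotheses. The p-values are made up for illustration.

```python
def family_wise_error_rate(n_tests, alpha=0.05):
    """Chance of at least one false positive across n independent null tests."""
    return 1 - (1 - alpha) ** n_tests


def bonferroni_significant(p_values, alpha=0.05):
    """Flag each p-value against the Bonferroni threshold alpha / number of tests."""
    threshold = alpha / len(p_values)
    return [p < threshold for p in p_values]


# Hypothetical results: one "winner" at p = 0.04 among 20 tested variables.
p_values = [0.04] + [0.30] * 19

print(round(family_wise_error_rate(20), 2))    # 0.64
print(any(bonferroni_significant(p_values)))   # False
```

With 20 hypotheses the corrected threshold is 0.0025, so the p = 0.04 "win" does not survive. Bonferroni is conservative; the point here is only that an uncorrected 0.05 is not a meaningful bar once you test many things.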
On sample size: use SearchPilot's calculator, run a power calculation in R with the pwr package, or simulate power from your own click data. To detect a 5% lift in clicks with 80% power and alpha 0.05, you usually need 100+ pages per arm in categories with typical ecommerce variance. Lifts below 3% are nearly impossible to prove on sites with fewer than 500 relevant pages, and that is fine: it means you should be testing bigger changes. Small optimizations like the ones in "CTR benchmark by position: updated 2026 data" and "Search Console: 7 underused reports and what to extract from them" are better treated as iterative rollouts than as formal experiments.
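If you would rather simulate than trust a calculator, a Monte Carlo power check looks roughly like this. It is a sketch under strong assumptions that I am introducing, not the article's method: each page's pre-to-post change in log clicks is drawn from a normal distribution, the treatment adds log(1 + lift) to that change, and arms are compared with a two-sided z-test on the mean change. The noise level `sigma=0.12` is a placeholder, not a measured value; estimate it from your own pre-period data before reading anything into the output.

```python
import math
import random
from statistics import NormalDist


def _mean_var(xs):
    """Sample mean and unbiased sample variance."""
    m = sum(xs) / len(xs)
    return m, sum((x - m) ** 2 for x in xs) / (len(xs) - 1)


def simulated_power(pages_per_arm, lift=0.05, sigma=0.12, alpha=0.05,
                    n_sims=1000, seed=7):
    """Share of simulated experiments in which the lift is detected.

    sigma: assumed std dev of per-page log click change (placeholder value).
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    effect = math.log(1 + lift)
    hits = 0
    for _ in range(n_sims):
        control = [rng.gauss(0.0, sigma) for _ in range(pages_per_arm)]
        treated = [rng.gauss(effect, sigma) for _ in range(pages_per_arm)]
        mc, vc = _mean_var(control)
        mt, vt = _mean_var(treated)
        se = math.sqrt(vc / pages_per_arm + vt / pages_per_arm)
        if abs((mt - mc) / se) > z_crit:
            hits += 1
    return hits / n_sims


for n in (20, 50, 100, 200):
    print(n, simulated_power(n))
```

Rerunning with a smaller `lift` or a larger `sigma` shows how quickly the required number of pages grows. The normal approximation is a little optimistic for very small arms, so read the low-`n` rows as rough.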
Practical takeaway: before your next test, write one A4 page with hypothesis, causal mechanism, primary metric, pre and post window, stopping rule, and what you will do if it comes back non-significant. If you cannot write that down, you do not have a test; you have a hunch. Honest SEO accepts that half your ideas will not move the needle measurably, and that is valuable information. Stop hunting for wins and start hunting for truth. Rankings show up as a consequence.
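If a page of prose feels too loose, the same checklist can live in code next to the analysis. The sketch below is one possible shape, with a hypothetical class name and field names that simply mirror the list above. The only rule it enforces beyond "nothing left blank" is the 21-day minimum window mentioned earlier in the article. The example plan is invented.

```python
from dataclasses import dataclass, fields

MIN_WINDOW_DAYS = 21  # minimum window quoted earlier in the article


@dataclass(frozen=True)  # frozen: the plan cannot be edited after it is written
class ExperimentPlan:
    hypothesis: str
    causal_mechanism: str
    primary_metric: str
    pre_window_days: int
    post_window_days: int
    stopping_rule: str
    if_non_significant: str

    def problems(self):
        """List everything that makes this plan a hunch rather than a test."""
        issues = [
            f"{f.name} is empty"
            for f in fields(self)
            if isinstance(getattr(self, f.name), str)
            and not getattr(self, f.name).strip()
        ]
        for name in ("pre_window_days", "post_window_days"):
            if getattr(self, name) < MIN_WINDOW_DAYS:
                issues.append(f"{name} is shorter than {MIN_WINDOW_DAYS} days")
        return issues


# Invented example with two deliberate gaps.
plan = ExperimentPlan(
    hypothesis="Adding the product count to PLP titles raises organic clicks",
    causal_mechanism="More specific titles earn a higher CTR at the same position",
    primary_metric="organic clicks (GSC)",
    pre_window_days=42,
    post_window_days=14,
    stopping_rule="",
    if_non_significant="Keep the current titles and test a larger change",
)
print(plan.problems())
```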