robots.txt: the traps that silently block indexing
A misplaced Disallow can wipe 40% of organic traffic with no alert. The common robots.txt mistakes and how to audit before they bite.
In February I picked up a fashion e-commerce site that had lost 38% of its organic traffic in six weeks. No manual action, no core update, no content overhaul. The culprit was one line in robots.txt, Disallow: /product-, added by a developer who only wanted to block a staging page called /product-test. Robots.txt rules are prefix matches, so that line behaved as if it ended in a wildcard and knocked 12,000 PDP URLs out of the index in 19 days. Search Console never screamed; it just showed a slow drift in 'Discovered - currently not indexed'. That case sums up the issue: robots.txt is the most underrated and most dangerous file on a site.
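To make the prefix behavior concrete, here is a minimal sketch using Python's standard-library robots parser. The domain and URLs are made up for illustration; only the Disallow rule comes from the case above. The standard-library parser does plain prefix matching, which is all this rule needs, but it should not be relied on for * or $ patterns.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt reproducing the single rule from the case above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /product-
"""

# Made-up URLs on a placeholder domain, for illustration only.
URLS = [
    "https://shop.example/product-test",              # the staging page the developer meant
    "https://shop.example/product-blue-linen-dress",  # a real PDP caught by the prefix
    "https://shop.example/products/",                 # different prefix, not affected
    "https://shop.example/category/dresses",
]

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

for url in URLS:
    verdict = "allowed" if parser.can_fetch("Googlebot", url) else "BLOCKED"
    print(f"{verdict:8} {url}")

# Every URL whose path starts with /product- ends up here.
blocked = [url for url in URLS if not parser.can_fetch("Googlebot", url)]
```

The rule catches every path that starts with /product-, not just the staging page. A narrower rule that spells out the full staging path would have avoided the collateral damage.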
Trap number one is confusing crawl blocking with index blocking. Robots.txt does not remove pages from the index; it just stops Googlebot from reading the content. If the URL already had backlinks or sat in the sitemap, it keeps showing up on the SERP, now wearing that ugly snippet: 'No information is available for this page'. I have watched large brands pay an agency to 'deindex' faceted filters with Disallow only to see the opposite happen: the URLs got stuck in the index with no decent title, cannibalizing the main pages. 'Canonical tags: common mistakes bleeding your organic traffic' explains why canonicals do that job better.
Another recurring mistake: blocking all of /wp-content/ or /assets/ on WordPress and Next.js sites. Google needs to render CSS and JS to understand layout, mobile friendliness and Core Web Vitals. Block those directories and Googlebot sees an unstyled page, flags it as not mobile-friendly, and the CrUX-measured LCP drifts away from your PageSpeed score. On one log-file audit, 23% of Googlebot requests were getting 200 for HTML and 403 for the bundles, tanking perceived quality.
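The kind of log check behind that 23% figure can be sketched in a few lines. This assumes the common 'combined' access-log format and identifies Googlebot by user-agent string only (a real audit would also verify the IP, since the string can be spoofed). The log lines, the regex and the function name are illustrative, not taken from the audit itself.

```python
import re
from collections import Counter

# Assumed 'combined' log format: request, status, size, referrer, user-agent.
LOG_LINE = re.compile(
    r'"(?:GET|HEAD) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)
ASSET_EXTENSIONS = (".css", ".js")

def googlebot_status_counts(lines):
    """Count Googlebot hits by (kind, status), where kind is 'asset' or 'page'."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if not match or "Googlebot" not in match["agent"]:
            continue
        path = match["path"].split("?", 1)[0]  # drop the query string
        kind = "asset" if path.endswith(ASSET_EXTENSIONS) else "page"
        counts[(kind, match["status"])] += 1
    return counts

GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

# Made-up sample lines, for illustration only.
SAMPLE_LOG = [
    f'66.249.66.1 - - [10/Feb/2025:10:00:00 +0000] "GET /dresses HTTP/1.1" 200 5120 "-" "{GOOGLEBOT_UA}"',
    f'66.249.66.1 - - [10/Feb/2025:10:00:01 +0000] "GET /assets/app.js HTTP/1.1" 403 153 "-" "{GOOGLEBOT_UA}"',
    f'66.249.66.1 - - [10/Feb/2025:10:00:01 +0000] "GET /assets/site.css?v=3 HTTP/1.1" 403 153 "-" "{GOOGLEBOT_UA}"',
    f'66.249.66.1 - - [10/Feb/2025:10:00:05 +0000] "GET /product-blue-linen-dress HTTP/1.1" 200 7340 "-" "{GOOGLEBOT_UA}"',
    '203.0.113.9 - - [10/Feb/2025:10:00:06 +0000] "GET /assets/app.js HTTP/1.1" 200 48211 "-" "Mozilla/5.0 (Windows NT 10.0)"',
]

counts = googlebot_status_counts(SAMPLE_LOG)
total = sum(counts.values())
forbidden_assets = counts[("asset", "403")]
print(f"Googlebot requests: {total}, assets answered with 403: {forbidden_assets}")
```

If pages return 200 while their bundles return 403 for the same crawler, Googlebot is rendering something your users never see.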
The third trap lives in the syntax. Robots.txt is not regex, but it accepts two wildcards: asterisk for any sequence and dollar sign for end of URL. Almost nobody uses the dollar sign, and often that is fine: Disallow: /*.pdf catches both /report.pdf and /report.pdf?utm=email, which is the desired behavior. The same prefix logic bites elsewhere, though: Disallow: /search blocks /search, /search-results, /searchengineland-comparison and anything else starting with /search. Testing the rules against 20 real URLs in a robots.txt tester is not optional. Run Screaming Frog with 'Respect robots.txt' and again with 'Ignore robots.txt' to see the delta.
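The two wildcards are simple enough to model by hand. The sketch below translates a robots.txt path pattern into a regular expression with prefix semantics. It is a simplified illustration of the matching described above, not Google's implementation, and rule_matches is a name invented for this example.

```python
import re

def rule_matches(pattern: str, path: str) -> bool:
    """Return True if a robots.txt path pattern matches a URL path (query included)."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape each literal chunk and let every '*' stand for "any characters".
    regex = ".*".join(re.escape(chunk) for chunk in pattern.split("*"))
    if anchored:
        regex += r"\Z"  # '$' in robots.txt means "the URL ends here"
    # re.match anchors at the start only, which gives prefix semantics.
    return re.match(regex, path) is not None

CASES = [
    ("/*.pdf", "/report.pdf"),
    ("/*.pdf", "/report.pdf?utm=email"),
    ("/*.pdf$", "/report.pdf?utm=email"),
    ("/search", "/search-results"),
    ("/search", "/searchengineland-comparison"),
    ("/search/", "/search-results"),
    ("/search$", "/search"),
]

for pattern, path in CASES:
    verdict = "matches" if rule_matches(pattern, path) else "no match"
    print(f"Disallow: {pattern:10} {path:32} {verdict}")
```

A trailing slash (/search/) or a dollar sign (/search$) is usually enough to stop a short prefix from swallowing unrelated URLs.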
Then comes the order and specificity bug. Googlebot does not stop at the first rule in the file: it picks the group with the most specific matching user-agent and, inside that group, the most specific rule. If you have User-agent: * with Disallow: /admin and below it an empty User-agent: Googlebot block, Googlebot ignores the rules under * entirely and assumes Allow: /. That alone has leaked admin panels into the index more than once. One more detail: the file must be UTF-8 and weigh at most 500 KiB. Google silently ignores everything past that limit. The sitemap declared in robots.txt must be a fully qualified absolute URL, protocol included.
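The empty-group leak is easy to reproduce. The sketch below is a deliberately simplified model of group selection and longest-match precedence: it handles plain prefixes only (no wildcards), and the helper names parse_groups, rules_for and is_allowed are invented for this example.

```python
def parse_groups(text):
    """Return {user_agent_token: [(directive, path), ...]} from robots.txt text."""
    groups = {}
    current_agents = []
    last_was_agent = False
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # strip comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if not last_was_agent:  # a new group starts here
                current_agents = []
            current_agents.append(value.lower())
            groups.setdefault(value.lower(), [])
            last_was_agent = True
        elif field in ("allow", "disallow"):
            for agent in current_agents:
                groups[agent].append((field, value))
            last_was_agent = False
    return groups

def rules_for(groups, crawler):
    """Pick the most specific matching group; fall back to '*' only if none matches."""
    crawler = crawler.lower()
    matching = [agent for agent in groups if agent != "*" and agent in crawler]
    if matching:
        return groups[max(matching, key=len)]
    return groups.get("*", [])

def is_allowed(rules, path):
    """Longest matching prefix wins; Allow wins a tie; no match means allowed."""
    best = ("allow", "")
    for directive, rule_path in rules:
        if rule_path and path.startswith(rule_path):
            longer = len(rule_path) > len(best[1])
            tie_for_allow = len(rule_path) == len(best[1]) and directive == "allow"
            if longer or tie_for_allow:
                best = (directive, rule_path)
    return best[0] == "allow"

LEAKY = "User-agent: *\nDisallow: /admin\n\nUser-agent: Googlebot\n"
FIXED = "User-agent: *\nDisallow: /admin\n\nUser-agent: Googlebot\nDisallow: /admin\n"

for label, text in (("leaky", LEAKY), ("fixed", FIXED)):
    groups = parse_groups(text)
    for crawler in ("Googlebot", "Bingbot"):
        allowed = is_allowed(rules_for(groups, crawler), "/admin/login")
        print(f"{label}: {crawler} may fetch /admin/login -> {allowed}")
```

A crawler-specific group replaces the * group instead of extending it, so any rule you still want has to be repeated inside it.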
How to audit without guessing: pull the current robots.txt and run it against the URLs in your sitemap.xml, either with the Python package 'reppy' or by batching checks through the URL Inspection API. Cross the result with the Search Console coverage report filtered by 'Blocked by robots.txt'. If commercial URLs show up there, you have a fire. 'Search Console: 7 underused reports and what to extract from them' walks through that alert. For large sites, log file analysis is the only honest path: you see exactly what Googlebot requested and whether it got a 403 or a 200.
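A minimal version of the sitemap cross-check might look like this. It swaps 'reppy' for Python's standard-library parser to stay dependency-free, which means it only understands plain prefix rules. For files that use * or $, a wildcard-aware parser is needed. The robots.txt, the sitemap and the function names are illustrative.

```python
import xml.etree.ElementTree as ET
from urllib.robotparser import RobotFileParser

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def sitemap_urls(sitemap_xml):
    """Extract every <loc> from a standard (non-index) XML sitemap."""
    root = ET.fromstring(sitemap_xml)
    return [loc.text.strip() for loc in root.findall("sm:url/sm:loc", SITEMAP_NS) if loc.text]

def blocked_urls(robots_txt, urls, user_agent="Googlebot"):
    """Return the URLs the given robots.txt forbids for user_agent (prefix rules only)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if not parser.can_fetch(user_agent, url)]

# Made-up inputs; in a real audit these come from the live site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /product-
Disallow: /cart
"""

SITEMAP_XML = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://shop.example/</loc></url>
  <url><loc>https://shop.example/product-blue-linen-dress</loc></url>
  <url><loc>https://shop.example/category/dresses</loc></url>
</urlset>
"""

conflicts = blocked_urls(ROBOTS_TXT, sitemap_urls(SITEMAP_XML))
for url in conflicts:
    print("In sitemap but blocked by robots.txt:", url)
```

Any URL this prints sits in the sitemap and behind a Disallow at the same time, which is a contradiction worth resolving before anything else.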
Practical takeaway: treat robots.txt as critical infrastructure, not a text file. Put it under Git, require code review before any new Disallow, and wire up an automated test that runs against your top 50 URLs on every deploy and pings Slack the moment one flips to Disallow. In 2026, with Googlebot crawling less thanks to AI budget pressure, every URL you block by mistake takes longer to crawl back into the index once you fix it, sometimes 30 to 60 days. The cost of a wrong Disallow has never been higher.
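One way to sketch that deploy-time guard: compare the robots.txt in production with the one about to ship and list the top URLs that flip from allowed to blocked. The file contents, URLs and function names below are hypothetical. The standard-library parser used here only handles plain prefix rules, and the payload is the simple JSON body with a "text" field that a Slack incoming webhook accepts. Actually sending it is left out.

```python
import json
from urllib.robotparser import RobotFileParser

def allowed_map(robots_txt, urls, user_agent="Googlebot"):
    """Map each URL to True/False depending on whether user_agent may fetch it."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {url: parser.can_fetch(user_agent, url) for url in urls}

def flipped_to_blocked(old_txt, new_txt, urls):
    """Return URLs that were crawlable under old_txt and are blocked under new_txt."""
    before = allowed_map(old_txt, urls)
    after = allowed_map(new_txt, urls)
    return [url for url in urls if before[url] and not after[url]]

def slack_payload(flipped):
    """Build the JSON body for a Slack incoming webhook."""
    lines = "\n".join(flipped)
    return json.dumps({"text": f"robots.txt change blocks {len(flipped)} top URL(s):\n{lines}"})

# Hypothetical inputs: the deployed file, the candidate file, and a top-URL list.
OLD_ROBOTS = "User-agent: *\nDisallow: /cart\n"
NEW_ROBOTS = "User-agent: *\nDisallow: /cart\nDisallow: /product-\n"
TOP_URLS = [
    "https://shop.example/",
    "https://shop.example/product-blue-linen-dress",
    "https://shop.example/category/dresses",
]

flipped = flipped_to_blocked(OLD_ROBOTS, NEW_ROBOTS, TOP_URLS)
if flipped:
    # In CI, post this payload to the webhook and exit non-zero to fail the deploy.
    print(slack_payload(flipped))
```

Comparing against the previous file keeps the alert quiet about URLs that were already blocked on purpose, so it only fires on regressions.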