Technical SEO

Crawl budget: when to worry and how to measure it

By Lucas

Crawl budget becomes a problem before it becomes a metric. Real signals and how to diagnose with logs, GSC and BigQuery without guesswork.

Crawl budget is the kind of topic that only enters the conversation once it has already become a problem. Google wrote in 2017 that sites with fewer than a few thousand URLs rarely need to worry, and that sentence turned into an excuse to ignore the topic even in e-commerce stores with 800k product detail pages (PDPs). The breaking point is rarely the number of URLs in the sitemap, but the ratio of useful pages to all the pages Googlebot actually finds. When that ratio drops below 30%, you are already bleeding crawl, even if organic traffic does not show it yet.

The first practical signal shows up in Search Console, under Settings > Crawl stats. If daily requests fall 40% in two weeks with no server change, or if average response time jumps from 300ms to 1.2s, you very likely have a budget problem. Another classic symptom: new pages taking more than 10 days to be indexed on a site that used to see Googlebot the same day. Before touching anything, it pays off to run an honest on-page audit (see "How to audit on-page SEO without falling into guesswork") to rule out that the issue lives in the content itself.
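The 40% rule above can be checked mechanically once the daily request counts are exported from Crawl stats. The following is a minimal sketch, not a Search Console API call: the function name `crawl_drop` is hypothetical, and it assumes that "fall 40% in two weeks" means the average of the most recent 7 days compared with the average of the 7 days before them.

```python
from statistics import mean

def crawl_drop(daily_requests, threshold=0.40):
    """Compare the last 7 days of Googlebot requests with the 7 days before.

    daily_requests: list of daily request counts, oldest first (assumed to be
    exported by hand from the Crawl stats report).
    Returns (drop_fraction, flagged).
    """
    if len(daily_requests) < 14:
        raise ValueError("need at least 14 days of data")
    previous_week = mean(daily_requests[-14:-7])
    recent_week = mean(daily_requests[-7:])
    if previous_week == 0:
        # Nothing to compare against; treat as no measurable drop.
        return 0.0, False
    drop = (previous_week - recent_week) / previous_week
    return drop, drop >= threshold

if __name__ == "__main__":
    sample = [1000] * 7 + [550] * 7  # illustrative numbers only
    drop, flagged = crawl_drop(sample)
    print(f"drop: {drop:.0%}, flagged: {flagged}")
```

Averaging over a week instead of comparing two single days keeps the weekday/weekend rhythm of crawling from triggering false alarms.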

The real diagnosis lives in server logs. Filtering by Googlebot user-agent across 30 days of Nginx or Cloudflare logs reveals things GSC will never show: how many times the bot hit URLs with ?sort= parameters, how many came back as 404, how many requests went to infinite pagination spawned by faceted filters. For one of our clients, 62% of crawl was going to URLs with three or more parameters, all canonicalized to the clean version. The fix was not noindex. It was correcting internal links and reviewing canonical tags (see "Canonical tags: common mistakes bleeding your organic traffic") to stop the bleeding at the source.
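As an illustration of the log filtering described above, here is a minimal sketch that assumes Nginx's default "combined" log format; Cloudflare exports use a different layout and would need another parser. The function name `summarize_googlebot` and the counter keys are hypothetical. It matches on the user-agent string only, so spoofed bots are not filtered out at this stage.

```python
import re
from collections import Counter
from urllib.parse import urlsplit, parse_qsl

# Request line, status, bytes, referer and user-agent of the combined format.
LOG_LINE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<url>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<ua>[^"]*)"'
)

def summarize_googlebot(lines):
    """Count Googlebot hits on ?sort= URLs, 3+ parameter URLs and 404s."""
    stats = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if not match or "Googlebot" not in match["ua"]:
            continue
        stats["total"] += 1
        params = parse_qsl(urlsplit(match["url"]).query, keep_blank_values=True)
        if any(name == "sort" for name, _ in params):
            stats["sort_param"] += 1
        if len(params) >= 3:
            stats["three_plus_params"] += 1
        if match["status"] == "404":
            stats["not_found"] += 1
    return stats

if __name__ == "__main__":
    # "access.log" is a placeholder path.
    with open("access.log", encoding="utf-8", errors="replace") as handle:
        result = summarize_googlebot(handle)
    total = result["total"] or 1
    for key, value in result.items():
        print(f"{key}: {value} ({value / total:.0%})")
```

Dividing `three_plus_params` by `total` gives the kind of share quoted for the client case above.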

Three technical traps account for most cases. The first is a misconfigured robots.txt blocking resources the renderer needs, which is why "robots.txt: the traps that silently block indexing" is required reading. The second is a bloated sitemap full of 404, redirected or noindexed URLs, covered in "Modern XML sitemaps: priority, lastmod, and what to skip". The third, more subtle, is JavaScript that only renders content after interaction, so Googlebot sees empty pages and burns quota trying to make sense of them. The guide "JavaScript SEO: rendering, hydration, and indexing" covers patterns that work in 2026 with Next.js and Astro.

To measure precisely, build a BigQuery query joining the GSC export with CDN logs. Count distinct URLs crawled per day, group by path pattern (/p/, /c/, /blog/, /tag/), and calculate the ratio between crawled and indexed. If /tag/ represents 35% of crawl but 2% of impressions, you have a clear leak. This kind of analysis is detailed in "BigQuery + GSC: queries your agency won't run" and in "Log file analysis: what Googlebot is actually doing", which shows how to separate real Googlebot from bots that spoof the user-agent (around 18% of traffic identifying as Googlebot, per Cloudflare 2024 data).
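The same aggregation can be prototyped locally before it is written as a BigQuery query. The sketch below is plain Python, not the query itself. It assumes two inputs that are not defined in this article: a list of URLs hit by Googlebot (one entry per hit, from the logs) and a dict of impressions per URL (from the GSC export). The function names are hypothetical. It compares each section's share of crawl with its share of impressions.

```python
from collections import Counter
from urllib.parse import urlsplit

# Path patterns taken from the paragraph above; adjust to the site's URL scheme.
PATTERNS = ("/p/", "/c/", "/blog/", "/tag/")

def section_of(url):
    """Map a URL to its path pattern, or "other" if none matches."""
    path = urlsplit(url).path
    for prefix in PATTERNS:
        if path.startswith(prefix):
            return prefix
    return "other"

def crawl_vs_impressions(crawled_urls, impressions_by_url):
    """Return {section: (share_of_crawl, share_of_impressions)}."""
    crawl = Counter(section_of(url) for url in crawled_urls)
    impressions = Counter()
    for url, count in impressions_by_url.items():
        impressions[section_of(url)] += count
    total_crawl = sum(crawl.values()) or 1
    total_impressions = sum(impressions.values()) or 1
    return {
        section: (
            crawl[section] / total_crawl,
            impressions[section] / total_impressions,
        )
        for section in set(crawl) | set(impressions)
    }

if __name__ == "__main__":
    # Illustrative data only.
    hits = ["/tag/red"] * 35 + ["/p/sneaker-1"] * 65
    gsc = {"/tag/red": 2, "/p/sneaker-1": 98}
    for section, (crawl_share, impr_share) in crawl_vs_impressions(hits, gsc).items():
        print(f"{section}: {crawl_share:.0%} of crawl, {impr_share:.0%} of impressions")
```

A section whose crawl share is many times its impression share, like /tag/ in the example, is the leak the paragraph describes.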

There is a moment when the problem stops being technical and becomes architectural. Sites with more than 100k active URLs need prioritization logic baked into the site itself: reliable lastmod values in the sitemap, internal linking that mirrors the commercial importance of pages, and aggressive removal of variants that do not generate demand. If your main category gets fewer Googlebot hits than an obsolete tag page, the problem is not the budget, it is the design. Practical takeaway: run a 30-day log analysis before next quarter, sort URLs into three buckets (active, dormant, trash), and attack the third bucket first. Crawl budget recovers in 6 to 8 weeks once you stop feeding what should not exist.
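The three-bucket triage can be sketched as follows. The criteria here are assumptions made for illustration, since the article does not define them: "trash" is any URL that does not return 200 or is not its own canonical, "active" is a healthy URL with impressions in the 30-day window, and "dormant" is a healthy URL with none. The `UrlStats` fields and the function names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class UrlStats:
    url: str
    status: int           # last HTTP status served to Googlebot
    googlebot_hits: int   # hits in the 30-day log window
    impressions: int      # GSC impressions in the same window
    is_canonical: bool    # True if the URL is its own canonical

def bucket(stats):
    """Assign a URL to active, dormant or trash (assumed criteria)."""
    if stats.status != 200 or not stats.is_canonical:
        return "trash"
    if stats.impressions > 0:
        return "active"
    return "dormant"

def triage(rows):
    """Group URLs by bucket; trash is sorted by wasted Googlebot hits."""
    buckets = {"active": [], "dormant": [], "trash": []}
    for row in rows:
        buckets[bucket(row)].append(row)
    buckets["trash"].sort(key=lambda row: row.googlebot_hits, reverse=True)
    return buckets

if __name__ == "__main__":
    # Illustrative rows only.
    rows = [
        UrlStats("/c/shoes", 200, 40, 1200, True),
        UrlStats("/tag/old", 200, 90, 0, True),
        UrlStats("/c/shoes?sort=price", 200, 300, 0, False),
        UrlStats("/p/gone", 404, 15, 0, True),
    ]
    for name, items in triage(rows).items():
        print(name, [item.url for item in items])
```

Sorting the trash bucket by Googlebot hits puts the URLs that waste the most crawl at the top, which is where the cleanup should start.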
