Log file analysis: what Googlebot is actually doing
Search Console shows what Google wants to tell you. Server logs show what it actually does. That gap pays the senior SEO salary.
Search Console reports 18k pages crawled per day. Your Nginx log shows 47k Googlebot hits, with 31k of them on URLs you thought you retired back in 2023. Welcome to the part of SEO nobody puts in the pitch deck. Log file analysis is the only source that shows what the crawler actually did, in what order, at what frequency, and with what status code. Everything else is narrative. Before any serious audit, I close the pretty dashboards and open an 8GB .gz file in the terminal.
The starting point is simple: pull 30 days of raw logs, keep only requests whose Googlebot user-agent is validated through reverse DNS (an IP such as 66.249.x.x must reverse-resolve to a hostname under googlebot.com or google.com, and that hostname must resolve forward to the same IP), then aggregate by URL, status code, and timestamp. Tools like Screaming Frog Log File Analyser handle up to 5M lines; above that, BigQuery or a DuckDB pipeline is cheaper. If you are not already running the queries from "BigQuery + GSC: queries your agency won't run" alongside the logs, half the signal stays on the table. The two sources complement each other: GSC tells you which query, the log tells you which URL the bot prioritized.
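The validation step can be sketched in a few lines of Python. This is a minimal sketch, not a hardened implementation: the function name is my own, the accepted suffixes are an assumption based on the hostnames Google documents for its crawlers, and the two resolver arguments exist only so the check can be exercised without touching the network.

```python
import socket

# Assumption: hostnames Google uses for its crawlers end in one of these.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")

def is_verified_googlebot(ip, reverse=socket.gethostbyaddr,
                          forward=socket.gethostbyname_ex):
    """Reverse-resolve the IP, check the domain, then forward-confirm it."""
    try:
        hostname = reverse(ip)[0]          # (hostname, aliases, addresses)
    except OSError:                        # no PTR record, lookup failure
        return False
    if not hostname.endswith(GOOGLE_SUFFIXES):
        return False
    try:
        addresses = forward(hostname)[2]   # IPv4 only with this stdlib call
    except OSError:
        return False
    # The forward lookup must point back at the IP that made the request.
    return ip in addresses
```

On a real log, run it once per distinct IP and cache the answer; a month of Googlebot traffic typically comes from far fewer addresses than requests, so the DNS cost stays small.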
The first discovery is usually uncomfortable. In an e-commerce audit I ran last month, 62% of crawl went to URLs with filter parameters (?color=blue&size=M) that should not exist in the index. That is crawl budget, the subject of "Crawl budget: when to worry and how to measure it", burned on pages with zero search value. The fix was surgical: a canonical pointing to the clean category, the filter parameters disallowed in robots.txt, and a sitemap revision. Within 21 days, crawl on real PDPs jumped 34%. None of this shows up in the coverage report until the damage is done.
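Measuring that share is a one-pass count over the crawled URLs. The sketch below assumes the filter parameters are named color and size, as in the example above; the set and the function name are illustrative, so swap in whatever your faceted navigation actually emits.

```python
from urllib.parse import urlsplit, parse_qs

# Assumption: these are the faceted-navigation parameters on the site.
FILTER_PARAMS = {"color", "size"}

def filter_crawl_share(urls):
    """Return the fraction of crawled URLs carrying a filter parameter."""
    total = 0
    filtered = 0
    for url in urls:
        total += 1
        params = parse_qs(urlsplit(url).query, keep_blank_values=True)
        # A URL counts as filtered if any parameter name is in the set.
        if FILTER_PARAMS.intersection(params):
            filtered += 1
    return filtered / total if total else 0.0
```

Feed it one URL per verified Googlebot hit, not one per distinct URL, so the result reflects where crawl actually went.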
Status codes tell the story nobody wants to read. If 12% of Googlebot requests return 304, you are fine - your Last-Modified is doing its job. If 8% return 301, that is acceptable. Once more than 15% of requests land on chained redirects, you are bleeding equity, as I cover in "301 vs 302 Redirects: The Real Ranking Impact". When sporadic 5xx errors correlate with peak traffic hours, it is not an SEO problem, it is infra - but the ranking impact lands two weeks later. Plot status code by hour on a simple chart and patterns jump out.
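The hourly breakdown behind that chart can be built straight from the raw lines. This sketch assumes the default Nginx "combined" log format and that the lines have already been filtered to verified Googlebot; the regex and function names are mine, and a custom log_format will need a different pattern.

```python
import re
from collections import Counter, defaultdict
from datetime import datetime

# Assumption: default Nginx "combined" format, e.g.
# IP - user [time] "METHOD path PROTO" status bytes "referer" "agent"
LINE_RE = re.compile(
    r'^\S+ \S+ \S+ \[(?P<ts>[^\]]+)\] "\S+ \S+ [^"]*" (?P<status>\d{3}) '
)

def status_by_hour(lines):
    """Map hour of day (0-23) to a Counter of status code -> hits."""
    table = defaultdict(Counter)
    for line in lines:
        match = LINE_RE.match(line)
        if match is None:
            continue  # skip lines that do not fit the assumed format
        ts = datetime.strptime(match["ts"], "%d/%b/%Y:%H:%M:%S %z")
        table[ts.hour][int(match["status"])] += 1
    return table

def status_share(table, hour, status):
    """Fraction of that hour's hits that returned the given status."""
    total = sum(table[hour].values())
    return table[hour][status] / total if total else 0.0
```

The table is small enough (24 rows) to paste into any charting tool; the hours are in whatever timezone the server logs in, so line them up with your traffic peaks before drawing conclusions.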
Crawl frequency by page type is the most underused signal. Categories crawled every 6 hours, PDPs every 3 days, blog posts every 11 days - that delta matters. If a cluster you consider strategic is being visited once a month, you have an internal linking problem (see "Smart interlinking: the internal authority map"), and probably a topical authority problem too (see "Topical authority: how to build clusters that rank"). Cross-reference with the analysis in "Content decay: spotting the posts quietly losing traffic" and you see which URLs lost the bot's attention before they lost traffic - giving you a 30 to 60 day window to act.
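One way to get those per-type intervals is to take the gaps between consecutive Googlebot hits on each URL and report the median per page type. The sketch below is only that: the path prefixes (/c/, /p/, /blog/) are hypothetical stand-ins for your own URL scheme, and the median is my choice of summary, not the only reasonable one.

```python
from collections import defaultdict
from statistics import median

# Assumption: page type can be read from the path prefix on this site.
PAGE_TYPES = (("/c/", "category"), ("/p/", "pdp"), ("/blog/", "blog"))

def page_type(path):
    for prefix, name in PAGE_TYPES:
        if path.startswith(prefix):
            return name
    return "other"

def median_crawl_gap_hours(hits):
    """hits: iterable of (path, datetime). Returns {page type: median gap in hours}."""
    by_url = defaultdict(list)
    for path, ts in hits:
        by_url[path].append(ts)
    gaps = defaultdict(list)
    for path, stamps in by_url.items():
        stamps.sort()
        # Gap between each pair of consecutive hits on the same URL.
        for earlier, later in zip(stamps, stamps[1:]):
            gaps[page_type(path)].append((later - earlier).total_seconds() / 3600)
    return {kind: median(values) for kind, values in gaps.items()}
```

URLs hit only once in the window produce no gap and drop out of the result, which is itself worth listing separately: those are the pages the bot has nearly stopped visiting.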
Three queries I run on every new project. First: top 100 URLs by Googlebot hits that return 404 - that alone pays the engagement. Second: URLs in the sitemap that received no crawl in 60 days - candidates for pruning or rewriting, following "Rewrite or rebuild: making the call with SERP data". Third: URLs with high crawl but zero impressions in GSC - almost always technical pages leaking authority. Document them, prioritize by estimated impact, and ship in two-week sprints. No 400-row spreadsheet nobody will execute.
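For a log small enough to hold in memory, the three queries reduce to a few lines each. This is a sketch under assumptions: hits is a list of (path, status, datetime) tuples for verified Googlebot requests, impressions is a dict built from a GSC export, and the min_hits threshold for "high crawl" is an arbitrary placeholder. At larger volumes the same logic translates directly to SQL in BigQuery or DuckDB.

```python
from collections import Counter
from datetime import timedelta

def top_404s(hits, limit=100):
    """Query 1: most-crawled URLs that return 404."""
    counts = Counter(path for path, status, _ in hits if status == 404)
    return counts.most_common(limit)

def uncrawled_sitemap_urls(sitemap_paths, hits, now, days=60):
    """Query 2: sitemap URLs with no Googlebot hit in the last `days` days."""
    cutoff = now - timedelta(days=days)
    recently_crawled = {path for path, _, ts in hits if ts >= cutoff}
    return sorted(set(sitemap_paths) - recently_crawled)

def crawled_without_impressions(hits, impressions, min_hits=50):
    """Query 3: heavily crawled URLs with zero GSC impressions.

    min_hits is an assumed cut-off for "high crawl"; tune it per site.
    """
    counts = Counter(path for path, _, _ in hits)
    return sorted(
        path for path, n in counts.items()
        if n >= min_hits and impressions.get(path, 0) == 0
    )
```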
Practical takeaway: schedule monthly log extraction, keep a 13-month history (so year-over-year comparison is free of seasonal noise), and set alerts on three metrics - Googlebot 5xx rate above 2%, a 20% drop in hits to strategic URLs, and any new cluster of unwanted URLs appearing in the top 50 by crawl. Log file analysis is not a one-off project, it is continuous instrumentation. Anyone treating it as an annual audit finds out about the problem six months after traffic falls.
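The three alerts can be expressed as a comparison between two periodic snapshots. The sketch below is illustrative only: CrawlSnapshot, its fields, and the is_unwanted callback are names I made up, and how you define "strategic" and "unwanted" URLs is left to the caller.

```python
from dataclasses import dataclass

@dataclass
class CrawlSnapshot:
    total_hits: int       # all verified Googlebot hits in the period
    hits_5xx: int         # of which returned a 5xx status
    strategic_hits: int   # hits on URLs you flagged as strategic
    top_paths: list       # paths ranked by Googlebot hits, highest first

def crawl_alerts(current, previous, is_unwanted):
    """Return a list of alert messages; an empty list means all clear."""
    alerts = []
    if current.total_hits and current.hits_5xx / current.total_hits > 0.02:
        alerts.append("5xx rate above 2%")
    if previous.strategic_hits and current.strategic_hits <= 0.8 * previous.strategic_hits:
        alerts.append("hits to strategic URLs down 20% or more")
    seen_before = set(previous.top_paths)
    new_unwanted = [
        path for path in current.top_paths[:50]
        if is_unwanted(path) and path not in seen_before
    ]
    if new_unwanted:
        alerts.append("new unwanted URLs in top 50: " + ", ".join(new_unwanted))
    return alerts
```

A call might look like crawl_alerts(this_month, last_month, lambda path: "?" in path), with the result piped into whatever channel the team already watches.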