What Is Crawling in SEO? How Search Bots Find Pages
Crawling is how search engine bots discover and read the pages on your site. Learn how it works, how to control it with robots.txt, and how to fix crawl errors.

Crawling is the process by which search engine bots — automated programs also called crawlers or spiders — discover and read the pages on the web by following links and fetching their content. It is the first stage of the search pipeline: an engine must crawl a page before it can index it, and index it before it can rank it. If a crawler can't reach a page, nothing else about that page matters, because the engine never sees it.
This guide explains what crawling is, how a search engine crawls a site, what bots and user-agents are, how to control crawling with tools like robots.txt, what crawl budget means and who should care, the common errors that block crawlers, and how the same mechanics now apply to AI crawlers.
What is crawling, exactly?
Crawling is automated discovery. A search engine runs software — Googlebot for Google, Bingbot for Bing — that requests a page, reads its HTML, extracts the links it contains, and adds those linked URLs to a queue to fetch next. Repeated across billions of pages, this process lets an engine map the web by treating it as a giant graph of pages connected by links.
Two consequences follow immediately. First, links are how crawlers travel, so a page with no links pointing to it (an "orphan" page) is hard to discover. Second, crawling is about access and reading, not understanding or storing; those come later, at indexing. A page can be crawled and then deliberately not indexed, so "Googlebot visited my page" is necessary but not sufficient for it to appear in results.
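The fetch, extract, queue loop can be sketched in a few lines of Python. This is a minimal illustration, not how any real engine is built: the site lives in an in-memory dictionary (the `SITE` mapping and its example.com URLs are made up), so nothing is fetched over the network. It also shows why an orphan page stays undiscovered.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag in an HTML document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def crawl(seed, fetch):
    """Breadth-first crawl from `seed`; `fetch(url)` returns HTML or None."""
    queue = deque([seed])
    seen = {seed}
    crawled = []
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:  # page could not be fetched
            continue
        crawled.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.hrefs:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return crawled


# Hypothetical four-page site; nothing links to /orphan.
SITE = {
    "https://example.com/": '<a href="/about">About</a> <a href="/blog">Blog</a>',
    "https://example.com/about": '<a href="/">Home</a>',
    "https://example.com/blog": '<a href="/about">About</a>',
    "https://example.com/orphan": "<p>No page links here.</p>",
}

crawled = crawl("https://example.com/", SITE.get)
print(crawled)
```

The orphan URL exists in `SITE` but never enters the queue, because no crawled page links to it.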
How does a search engine crawl your site?
Crawling is continuous rather than a one-time sweep. The cycle runs roughly like this:
- Start from known URLs. The engine begins with pages it already knows and URLs submitted via sitemaps.
- Fetch and parse. It requests each page, reads the HTML, and renders it where needed to see content built by JavaScript.
- Discover new links. It extracts the links on the page and adds new URLs to its crawl queue.
- Schedule re-crawls. It decides how often to return, based on how frequently a page changes and how important it appears — a news homepage might be crawled many times a day, a static page rarely.
This is why two things consistently help crawling: a current XML sitemap that lists your important URLs, and a clean internal linking structure that gives crawlers a path to every page that matters.
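As a rough sketch of how a sitemap feeds a crawler's starting list, the snippet below reads the `<loc>` and optional `<lastmod>` values from a small XML sitemap using Python's standard library. The sitemap content and URLs are invented for illustration; only the namespace is the one defined by the sitemaps.org protocol.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sitemap with two URLs, one of them without <lastmod>.
SITEMAP_XML = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
    <lastmod>2024-05-01</lastmod>
  </url>
  <url>
    <loc>https://example.com/pricing</loc>
  </url>
</urlset>
"""


def sitemap_entries(xml_text):
    """Return (url, lastmod) pairs; lastmod is None when the tag is absent."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    entries = []
    for url in root.findall("sm:url", SITEMAP_NS):
        loc = url.findtext("sm:loc", default="", namespaces=SITEMAP_NS).strip()
        lastmod = url.findtext("sm:lastmod", default=None, namespaces=SITEMAP_NS)
        if loc:
            entries.append((loc, lastmod))
    return entries


entries = sitemap_entries(SITEMAP_XML)
for loc, lastmod in entries:
    print(loc, lastmod)
```

Comparing this list against the URLs reachable through internal links is one simple way to spot pages that the sitemap lists but nothing links to.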
What is a crawler, bot, spider or user-agent?
"Crawler", "bot" and "spider" all name the same thing: an automated program that fetches web pages. Each identifies itself with a user-agent string — a label sent with every request that says which bot is visiting. Servers and log files use that string to recognize who is crawling. Major engines also run several specialized crawlers for different content types.
| Crawler | Operated by | Purpose |
|---|---|---|
| Googlebot (Smartphone & Desktop) | Google | Crawls pages for the main index; the smartphone agent is primary under mobile-first indexing |
| Googlebot-Image / Video | Google | Crawls media for image and video results |
| Bingbot | Microsoft Bing | Crawls pages for Bing's index |
| GPTBot | OpenAI | Crawls content for AI model use |
| Google-Extended | Google | A control token to allow/deny use of content for Google's AI training |
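Because the user-agent is self-declared, malicious bots can spoof a legitimate one. Real Googlebot can be verified with a reverse DNS lookup of the requesting IP, followed by a forward lookup confirming that the returned hostname resolves back to the same IP. This is worth knowing when analyzing server logs.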
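A hedged sketch of that check in Python: do a reverse lookup, confirm the hostname sits under a Google crawler domain, then confirm with a forward lookup. The suffixes `.googlebot.com` and `.google.com` follow Google's published verification guidance as I understand it, so treat them as an assumption to confirm. The resolver functions are injectable so the example runs offline; the IPs and hostnames in the demo are made up.

```python
import socket

# Assumed hostname suffixes for genuine Googlebot; confirm against Google's docs.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def is_verified_googlebot(ip, reverse_lookup=None, forward_lookup=None):
    """Reverse-then-forward DNS check. Lookups default to real DNS (IPv4 only)."""
    if reverse_lookup is None:
        reverse_lookup = lambda addr: socket.gethostbyaddr(addr)[0]
    if forward_lookup is None:
        forward_lookup = lambda host: socket.gethostbyname_ex(host)[2]
    try:
        host = reverse_lookup(ip)
    except OSError:  # no PTR record or lookup failure
        return False
    if not host.endswith(GOOGLE_SUFFIXES):
        return False
    try:
        # The hostname must resolve back to the same IP that made the request.
        return ip in forward_lookup(host)
    except OSError:
        return False


# Offline demo with hypothetical DNS data instead of real lookups.
FAKE_PTR = {
    "66.249.66.1": "crawl-66-249-66-1.googlebot.com",
    "203.0.113.9": "crawl.googlebot.com.attacker.example",
}
FAKE_A = {"crawl-66-249-66-1.googlebot.com": ["66.249.66.1"]}


def fake_reverse(ip):
    return FAKE_PTR[ip]


def fake_forward(host):
    return FAKE_A.get(host, [])


genuine = is_verified_googlebot("66.249.66.1", fake_reverse, fake_forward)
spoofed = is_verified_googlebot("203.0.113.9", fake_reverse, fake_forward)
print(genuine, spoofed)
```

The suffix test uses `endswith` rather than a substring match, so a hostname that merely contains "googlebot.com" somewhere in the middle is rejected.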
How do you control crawling?
You can guide what bots crawl, but it's vital to separate crawl control from index control — they are different jobs and a classic source of mistakes.
- robots.txt — a file at your domain root that tells compliant bots which paths they may or may not crawl. It controls crawling, not indexing: a URL disallowed in robots.txt can still be indexed (without its content) if other pages link to it.
- The noindex directive — a meta tag or HTTP header that controls indexing, not crawling. For it to work, the page must be crawlable, so the engine can read the directive. Blocking a page in robots.txt and adding noindex is self-defeating: the bot never reads the noindex.
- nofollow — a link attribute hinting that a crawler shouldn't pass authority or necessarily follow that link.
- Server response codes — 200 (OK), 301/302 (redirects), 404 (not found) and 5xx (server errors) all shape crawler behavior and recrawl scheduling.
A common pitfall: people block a page in robots.txt to keep it out of Google, then are surprised to see it indexed anyway. To remove a page from results, allow crawling and use noindex — or remove the page and return the right status code.
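The robots.txt versus noindex conflict can be checked mechanically. The sketch below uses Python's standard `urllib.robotparser` to test whether a URL is crawlable, then flags pages where a noindex directive could never be read. The robots.txt rules, the URLs and the `audit` helper are hypothetical, and the parser's matching may differ in detail from what a given search engine implements.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /cart/
"""

robots = RobotFileParser()
robots.parse(ROBOTS_TXT.splitlines())


def audit(url, has_noindex, user_agent="Googlebot"):
    """Describe how crawl control and index control interact for one URL."""
    crawlable = robots.can_fetch(user_agent, url)
    if has_noindex and not crawlable:
        return "conflict: blocked from crawling, so noindex is never read"
    if has_noindex:
        return "ok: crawlable, so noindex can be read"
    if not crawlable:
        return "blocked: not crawled, but the URL can still be indexed via links"
    return "ok: crawlable and indexable"


results = {
    "/blog/post": audit("https://example.com/blog/post", has_noindex=False),
    "/thanks": audit("https://example.com/thanks", has_noindex=True),
    "/cart/view": audit("https://example.com/cart/view", has_noindex=True),
    "/admin/users": audit("https://example.com/admin/users", has_noindex=False),
}
for path, verdict in results.items():
    print(path, "->", verdict)
```

Any page reported as a conflict needs one of the two signals removed, depending on whether the goal is to save crawl activity or to keep the page out of results.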
What is crawl budget, and who needs to care?
Crawl budget is the amount of crawling a search engine is willing to spend on your site over a period. It is shaped by two forces: the crawl rate limit (how much the engine can fetch without straining your server) and crawl demand (how much the engine wants to crawl, based on your site's popularity and freshness). For most small and mid-sized sites, crawl budget is a non-issue — Google can comfortably crawl everything.
It becomes real on large sites — e-commerce catalogs, big publishers, anything with tens of thousands of URLs or more. There, crawl budget is wasted by faceted-navigation URLs, infinite parameter combinations, duplicate pages, redirect chains and slow responses. The remedy is to stop bleeding budget on junk: consolidate duplicates, manage URL parameters, keep response times fast, and make sure crawlers spend their visits on the pages that actually earn traffic.
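To get a feel for how parameter combinations multiply URLs, the sketch below collapses variants of the same page by dropping parameters that do not change the content and sorting the rest. Which parameters are safe to ignore depends entirely on the site; the `IGNORED_PARAMS` set and the shop URLs here are assumptions for illustration.

```python
from collections import defaultdict
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed list of parameters that do not change page content on this site.
IGNORED_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "sort", "sessionid"}


def canonicalize(url):
    """Drop ignored parameters, sort the rest, and strip the fragment."""
    parts = urlsplit(url)
    kept = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in IGNORED_PARAMS
    )
    return urlunsplit(
        (parts.scheme, parts.netloc.lower(), parts.path, urlencode(kept), "")
    )


# Hypothetical URLs as they might appear in a crawl or log export.
CRAWLED_URLS = [
    "https://shop.example.com/shoes?color=red&sort=price&utm_source=mail",
    "https://shop.example.com/shoes?utm_source=ad&color=red",
    "https://shop.example.com/shoes?sort=name",
    "https://shop.example.com/shoes",
]

groups = defaultdict(list)
for url in CRAWLED_URLS:
    groups[canonicalize(url)].append(url)

for canonical, variants in groups.items():
    print(canonical, len(variants))
```

Four fetched URLs reduce to two distinct pages here, which is the kind of ratio worth measuring before deciding how to consolidate duplicates.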
What stops bots from crawling? (crawl errors)
When important pages aren't being crawled, the cause is usually one of a handful of issues:
| Issue | Effect on crawling |
|---|---|
| Disallowed in robots.txt | Compliant bots skip the path entirely |
| Server errors (5xx) | The bot can't fetch the page and backs off, slowing crawl |
| Redirect chains / loops | Waste crawl budget and can cause bots to give up |
| Slow response times | The engine reduces crawl rate to avoid overloading the server |
| Orphan pages | No internal links, so crawlers have no path to discover them |
| JavaScript-only links | Links not in the HTML may not be followed reliably |
How do you check crawling?
Three sources give you visibility. Google Search Console's Crawl Stats report shows how often Google crawls your site, response times and any host issues. The URL Inspection tool shows when a specific URL was last crawled and whether it succeeded. And for serious diagnosis, server log analysis is the ground truth: your logs record every bot request, revealing exactly which pages Googlebot fetches, how often, and where it wastes time — information no other tool gives you directly.
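A minimal log-analysis sketch, assuming access logs in the common "combined" format: it counts requests whose user-agent contains "Googlebot", grouped by path and status code. The log lines are invented, real log formats vary by server configuration, and matching on the user-agent string alone does not prove a request came from Google.

```python
import re
from collections import Counter

# Pattern for the "combined" access log format (an assumption about your server).
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

# Hypothetical log lines for illustration.
LOG_LINES = [
    f'66.249.66.1 - - [10/May/2024:06:25:01 +0000] "GET /pricing HTTP/1.1" 200 5120 "-" "{GOOGLEBOT_UA}"',
    f'66.249.66.1 - - [10/May/2024:06:25:07 +0000] "GET /old-page HTTP/1.1" 301 0 "-" "{GOOGLEBOT_UA}"',
    '203.0.113.9 - - [10/May/2024:06:26:00 +0000] "GET /pricing HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"',
    f'66.249.66.1 - - [10/May/2024:06:27:30 +0000] "GET /pricing HTTP/1.1" 200 5120 "-" "{GOOGLEBOT_UA}"',
]


def bot_hits(lines, agent_token="Googlebot"):
    """Count (path, status) pairs for requests whose user-agent has the token."""
    hits = Counter()
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match and agent_token in match.group("agent"):
            hits[(match.group("path"), match.group("status"))] += 1
    return hits


hits = bot_hits(LOG_LINES)
for (path, status), count in hits.most_common():
    print(path, status, count)
```

Sorting the result by count shows where the bot spends its requests; a high share of 301 or 5xx rows is a hint that crawl activity is being wasted.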
How does crawling relate to AI crawlers?
AI systems crawl the web with the same fundamental mechanics, using their own bots — OpenAI's GPTBot, Anthropic's ClaudeBot, Perplexity's crawler, and others — to gather content for training models or to answer live queries. They send user-agent strings, follow links, and respect (or are expected to respect) robots.txt, so the technical hygiene that helps search crawlers also governs whether AI engines can read your content.
This raises a new strategic choice that classic SEO never posed: whether to allow AI crawlers, and how to see what they're doing. Allowing them is what makes your content eligible to be cited in AI answers; blocking them protects content but forfeits that visibility. And because these bots leave traces in your server logs and edge analytics, you can now measure inbound AI-crawler activity directly: which bots visit, how often, and which pages they read.
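One way to sanity-check an allow-or-block policy before publishing it is to evaluate the draft robots.txt per bot. The sketch below uses Python's standard `urllib.robotparser`. The rules and URLs are hypothetical, and whether a given AI crawler honors them is up to that crawler.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical policy: block GPTBot entirely, keep ClaudeBot out of /drafts/,
# and apply a general rule to everyone else.
AI_ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /drafts/

User-agent: *
Disallow: /admin/
"""

policy = RobotFileParser()
policy.parse(AI_ROBOTS_TXT.splitlines())

BOTS = ["Googlebot", "GPTBot", "ClaudeBot"]
URLS = [
    "https://example.com/guides/crawling",
    "https://example.com/drafts/post",
]

# Map each (bot, url) pair to True (allowed) or False (disallowed).
access = {
    (bot, url): policy.can_fetch(bot, url)
    for bot in BOTS
    for url in URLS
}

for (bot, url), allowed in access.items():
    print(f"{bot:10} {'allow' if allowed else 'block':5} {url}")
```

A bot with its own `User-agent` group follows only that group, so ClaudeBot in this draft is not bound by the `/admin/` rule in the wildcard group; that is easy to overlook when writing per-bot policies.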
Crawling checklist
- Make pages reachable. Ensure every important page has internal links pointing to it.
- Audit robots.txt. Confirm you aren't accidentally blocking valuable paths or resources.
- Keep links in the HTML. Don't rely on JavaScript-only navigation for discovery.
- Maintain a clean sitemap. List canonical, indexable URLs and keep it current.
- Fix crawl waste. Resolve redirect chains, 5xx errors and duplicate-URL sprawl, especially on large sites.
- Monitor. Watch Crawl Stats and, where possible, server logs to see what bots actually do.
Frequently asked questions
What is crawling in SEO?
Crawling is the process by which search engine bots discover and read web pages by following links and fetching their content. It is the first stage of the search pipeline, before indexing and ranking.
What is the difference between crawling and indexing?
Crawling is discovering and reading a page; indexing is understanding and storing it. A page must be crawled before it can be indexed, but being crawled does not guarantee it will be indexed.
How do I stop Google from crawling a page?
Disallow the path in robots.txt to prevent crawling. Note that to keep a page out of search results you should instead allow crawling and use a noindex directive, because a page blocked from crawling can still be indexed without its content.
What is crawl budget?
Crawl budget is how much crawling a search engine will spend on your site, set by your server's capacity and the engine's demand. It mainly matters for large sites, where wasted budget on duplicate or low-value URLs can leave important pages uncrawled.
Do AI crawlers work like search engine crawlers?
Yes. AI crawlers such as GPTBot and ClaudeBot follow links, send user-agent strings and generally respect robots.txt, just like search bots. Allowing them makes your content eligible to be used and cited by AI systems.

Written by
Federico Ergang
Cliro cofounder & CEO
Federico Ergang is cofounder and CEO of Cliro, the AI visibility and GEO platform for Latin America.
