Should you block AI crawlers? Protect content without disappearing
Blocking AI crawlers is not a binary decision. You have to separate visibility, training, search, user-triggered actions and real content protection.
- Allow: crawlers that support discovery, citations and useful user journeys.
- Reserve: content you do not want used for model training or specific AI uses.
- Limit: crawlers that are too frequent, opaque or bring no visible benefit to the site.
- Protect: sensitive areas with real controls, not only with robots.txt.
References checked on June 21, 2026. This article is not legal advice: it helps frame an SEO, GEO, technical and editorial decision before changing a robots.txt file.
Short answer
Do not block "AI" as a whole. Decide which uses of your website you allow.
The weak decision is to add a few lines to robots.txt to "block ChatGPT", "block Gemini" or "block all AI crawlers" without separating use cases. The same company may operate several crawlers: one for training, one for search, one for user-triggered actions, and sometimes a robots.txt token that is not a crawler at all.
The better decision starts with a sharper question: do you want to stay visible in search engines, AI answers and assisted journeys while refusing specific training or reuse of your content? In most cases, the answer is neither "open everything" nor "close everything".
Keep visibility open, reserve rights on sensitive content, and technically protect what should never be public.
Decision framework
Four different use cases hide behind the phrase "AI crawler".
Before touching robots.txt, classify each crawler by role. A training crawler does not have the same impact as a search crawler. A user-triggered agent does not have the same status as an automatic crawler. And a traditional search engine crawler can still be the necessary gateway to AI features embedded in that search engine.
Classic indexing
Googlebot, Bingbot and Applebot are primarily used to discover, index and rank pages. Blocking them usually means reducing search visibility.
AI search
Some crawlers support generated answers, citations or conversational search. Blocking them can reduce presence in those answers.
Training
Other crawlers or tokens help control whether content may be used to improve models. This is often the most legitimate perimeter to reserve.
User action
Agents may visit a page because a user asked for research, comparison or an action. Blocking them may break an intentional user journey.
Reference table
The main AI crawlers and tokens to know before blocking.
This table summarizes documented or publicly declared behavior as of June 21, 2026. It must be reviewed regularly: crawler names, roles and access policies move quickly.
| Crawler or token | Main role | Effect of blocking | Recommended decision |
|---|---|---|---|
| Googlebot | Google Search indexing and AI features integrated into Search. | Reduces or cuts Google’s access to your pages for Search, AI Overviews and AI Mode. | Do not block if Google visibility matters. |
| Bingbot | Bing Search indexing and Microsoft AI experiences that rely on Bing results. | Reduces or cuts Bing’s access to your pages for Search and Copilot answers grounded in Bing. | Do not block if Bing, Edge or Copilot visibility matters. |
| Google-Extended | robots.txt token to control some Gemini, Vertex AI and grounding uses outside Search. | Does not affect inclusion or ranking in Google Search, but may limit some Google AI uses outside Search. | Block if you want to reserve AI use without leaving Google Search. |
| GPTBot | Crawl that may be used to train OpenAI models. | Signals that content should not be used for generative model training by OpenAI. | Often the first selective block to consider, while keeping OAI-SearchBot if ChatGPT citation matters. |
| OAI-SearchBot | Automatic crawl related to search and citations in ChatGPT. | May reduce discovery, citation and presence in ChatGPT Search. | Keep it if you want GEO visibility in ChatGPT. |
| ChatGPT-User | Visits triggered by user actions or requests in ChatGPT and some GPTs. | May prevent a ChatGPT-assisted user from accessing content or completing an action. | Block only if agentic use of the site is undesirable. |
| ClaudeBot | Collection that may contribute to training Anthropic models. | Signals exclusion of future content from Anthropic training datasets. | Block if you reserve training uses. |
| Claude-SearchBot | Search crawler used to improve Claude search results and answers. | May reduce visibility and accuracy of your content in Claude search answers. | Decide according to your AI visibility strategy. |
| Claude-User | Web access requested by a Claude user. | May prevent Claude from retrieving your content for a user request. | Keep for useful public content; limit for sensitive areas. |
| Applebot | Discovery for Apple experiences such as Safari, Spotlight, Siri and Search. | May reduce discoverability across the Apple ecosystem. | Keep for strategic public pages. |
| Applebot-Extended | Use control for training Apple foundation models. | Does not stop Applebot from crawling; it refuses some training uses. | Block if you reserve Apple training uses. |
| PerplexityBot / Perplexity-User | Declared Perplexity crawlers for crawling and user access. | May reduce presence in Perplexity, but robots.txt alone may not always control observed traffic. | Manage with robots.txt, logs and network rules if the topic is sensitive. |
| CCBot, Bytespider, meta-externalagent, Amazonbot | Collection, indexing or AI-use crawlers whose role varies by operator and over time. | Direct benefit is often less clear for a business website; impact must be checked in logs. | Monitor, document and block only if the risk-benefit ratio is unfavorable. |
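
The table above can be turned into a small policy generator. The sketch below is illustrative only: the role labels (`indexing`, `ai_search`, `training`, `user_action`) are a simplification of the four use cases, the assignment of each token to a single role is an assumption based on the table, and `render_robots_txt` is a hypothetical helper, not part of any library. Verify every token against the operator's official documentation before publishing the output.

```python
# Assumed mapping from robots.txt token to a simplified role label.
# Some tokens cover more than one use (Google-Extended, for example,
# also concerns grounding outside Search), so treat this as a starting point.
CRAWLER_ROLES = {
    "Googlebot": "indexing",
    "Bingbot": "indexing",
    "Applebot": "indexing",
    "OAI-SearchBot": "ai_search",
    "Claude-SearchBot": "ai_search",
    "PerplexityBot": "ai_search",
    "GPTBot": "training",
    "ClaudeBot": "training",
    "Google-Extended": "training",
    "Applebot-Extended": "training",
    "ChatGPT-User": "user_action",
    "Claude-User": "user_action",
    "Perplexity-User": "user_action",
}


def render_robots_txt(blocked_roles, private_paths=(), sitemap_url=None):
    """Build robots.txt text that disallows every token whose role is blocked."""
    lines = []
    for token, role in CRAWLER_ROLES.items():
        if role in blocked_roles:
            # One group per blocked token, closed by a blank line.
            lines += [f"User-agent: {token}", "Disallow: /", ""]
    # Catch-all group for every crawler without a dedicated group.
    lines.append("User-agent: *")
    if private_paths:
        lines += [f"Disallow: {path}" for path in private_paths]
    else:
        lines.append("Allow: /")
    if sitemap_url:
        lines += ["", f"Sitemap: {sitemap_url}"]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_robots_txt(
        blocked_roles={"training"},
        private_paths=["/admin/", "/private/"],
        sitemap_url="https://www.example.com/sitemap.xml",
    ))
```

Generating the file from a role map keeps the decision ("we reserve training") separate from the token list, which is the part that changes most often.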
The Google trap
Blocking Google-Extended does not remove you from AI Overviews.
This is the most counter-intuitive point. Google-Extended controls some uses related to Gemini, Vertex AI and grounding in Google systems other than Search. Google explicitly says this token does not affect inclusion or ranking in Google Search.
AI features integrated into Search, such as AI Overviews or AI Mode, rely on the usual Search controls. If you want to limit what Google can show in those experiences, Google points to nosnippet, data-nosnippet, max-snippet or noindex. These controls can also reduce the classic snippet or search visibility.
You cannot cleanly opt out of AI Overviews without touching how Google can display your content in Search.
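As an illustration of the Search controls mentioned above, the following sketch builds an `X-Robots-Tag` response header value from the `noindex`, `nosnippet` and `max-snippet` rules. The function name `build_x_robots_tag` is hypothetical, and how the header is attached to a response depends on your server or framework. `data-nosnippet` is an HTML attribute, not a header rule, so it is not covered here.

```python
def build_x_robots_tag(noindex=False, nosnippet=False, max_snippet=None):
    """Return an X-Robots-Tag header value, or None when no rule is requested."""
    rules = []
    if noindex:
        rules.append("noindex")
    if nosnippet:
        rules.append("nosnippet")
    if max_snippet is not None:
        # Google documents -1 as "no limit" and 0 as "no snippet".
        if max_snippet < -1:
            raise ValueError("max_snippet must be -1 or greater")
        rules.append(f"max-snippet:{max_snippet}")
    return ", ".join(rules) if rules else None


if __name__ == "__main__":
    # Example: limit the text snippet to 50 characters on a given page.
    print(build_x_robots_tag(max_snippet=50))
```

Each of these rules also affects the classic Search result, so apply them page by page rather than site-wide.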
Europe
Blocking can also express a rights reservation.
In Europe, the topic is not only technical. Article 4 of Directive 2019/790 on copyright in the Digital Single Market creates an exception for text and data mining, but this exception applies only if the use has not been expressly reserved by rights holders, including through machine-readable means for online content.
The AI Act adds obligations for providers of general-purpose AI models, including a policy to comply with Union copyright law. The GPAI Code of Practice also includes a Copyright chapter designed to help providers demonstrate compliance. In practice, robots.txt can become an editorial and legal governance layer, not only an SEO tool.
Caution remains essential: Edikka is not a law firm. But a European website that publishes proprietary content, studies, document bases or editorial corpora should document explicitly what it allows and what it reserves.
Policy to document
The robots.txt decision should be linked to a content policy.
- DSM Directive: possible rights reservation against some text and data mining uses, including by machine-readable means.
- AI Act: obligations for GPAI model providers, including a policy to comply with Union copyright law.
- GPAI Code: Copyright chapter proposing practical compliance measures for model providers.
- Website: robots.txt, terms of use, logs, CDN and editorial governance should tell the same story.
This reading must be validated according to your activity, rights and target jurisdictions.
Technical limit
robots.txt is a declared preference, not a security wall.
A respectful crawler reads robots.txt and applies the instructions. An opportunistic scraper can ignore it, change IP address, change user-agent or route through third-party providers. Cloudflare documented stealth crawling behavior attributed to Perplexity despite blocking directives and WAF rules.
The conclusion is not that robots.txt should be abandoned. The conclusion is that it needs the right role. It expresses a preference, an access policy and sometimes a machine-readable rights reservation. It does not protect a confidential file, private PDF, internal endpoint, staging environment or back office.
What robots.txt can do: give instructions to declared crawlers, document a preference, reduce some respectful crawls, express a machine-readable reservation.
What robots.txt cannot do: block HTTP access, authenticate an agent, protect sensitive data, stop a hostile scraper or hide an already known URL.
What to use for real protection: authentication, noindex, X-Robots-Tag, WAF, rate limiting, IP verification, logs, alerts and strict separation between public and private areas.
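IP verification, listed among the controls above, can be sketched as a reverse-then-forward DNS check, since a user-agent string alone is trivial to spoof. This is a minimal sketch under stated assumptions: the hostname suffixes are the ones Google documents for its crawlers as far as we know, the function name `is_verified_crawler` is hypothetical, and other operators may publish IP ranges instead of supporting DNS verification. Check each operator's current documentation.

```python
import socket

# Assumption: hostname suffixes documented by Google for its crawlers.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def is_verified_crawler(ip, allowed_suffixes=GOOGLE_SUFFIXES,
                        reverse_lookup=socket.gethostbyaddr,
                        forward_lookup=socket.gethostbyname_ex):
    """Reverse-then-forward DNS check for an IP that claims to be a crawler."""
    try:
        # Step 1: reverse DNS, IP address to hostname.
        hostname = reverse_lookup(ip)[0]
    except OSError:
        return False
    # Step 2: the hostname must end with one of the expected suffixes.
    if not hostname.lower().rstrip(".").endswith(tuple(allowed_suffixes)):
        return False
    try:
        # Step 3: forward DNS, hostname back to its IP addresses.
        forward_ips = forward_lookup(hostname)[2]
    except OSError:
        return False
    # Step 4: the original IP must be among the resolved addresses.
    return ip in forward_ips
```

The resolver functions are parameters so the check can be tested without network access. `socket.gethostbyname_ex` resolves IPv4 only, so an IPv6 deployment would need a different forward lookup.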
Edikka matrix
The right policy depends on the website’s business model.
An agency website, a media publisher, an ecommerce store and a SaaS product do not have the same interest in opening or closing content. Crawl policy should start from content value, visibility need, copy risk and the technical ability to control access for real.
| Website type | Main objective | Recommended policy | Mistake to avoid |
|---|---|---|---|
| B2B showcase site | Be found, understood, recommended and contacted. | Keep AI search crawlers, optionally reserve training, protect forms and endpoints. | Blocking all AI crawlers and losing useful citations. |
| Media or publisher | Preserve editorial value and negotiate content use. | Clear rights reservation, training blocks, premium policy, log monitoring and section-by-section decisions. | Leaving the whole corpus open by inertia. |
| Ecommerce | Stay visible in comparisons, prices, products and shopping assistants. | Open public product pages, control dynamic stock/prices, protect accounts, carts and checkout. | Blocking useful agents or exposing sensitive actions without safeguards. |
| SaaS | Make the offer, documentation and use cases understandable. | Open marketing and public docs, reserve proprietary content, authenticate the app and APIs. | Confusing public documentation with customer data. |
| Training or premium content | Sell access to structured knowledge. | Open excerpts, proof pages and outlines; reserve full modules and paid resources. | Putting paid content only behind an unlinked URL. |
| Intranet, staging, back office | Prevent unauthorized access. | Authentication, IP allowlist, noindex, network blocking, not only robots.txt. | Believing Disallow: / protects a private area. |
robots.txt examples
Three typical configurations to adapt before publication.
These examples are starting points. They must be tested, documented and adapted to your objectives. User-agents change: always verify names in official documentation before going live.
For a website that mainly wants visibility and temporarily accepts AI uses while monitoring logs.
User-agent: *
Allow: /
Sitemap: https://www.example.com/sitemap.xml
To remain visible in search and useful answers while reserving the most obvious training uses.
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: Applebot-Extended
Disallow: /
User-agent: *
Disallow: /admin/
Disallow: /client/
Disallow: /private/
Sitemap: https://www.example.com/sitemap.xml
For a media publisher, premium corpus or site that wants to strongly limit public AI use. Complete it with CDN, WAF and contractual rules.
# Keep Googlebot and Bingbot for Search, block the main declared AI crawlers.
User-agent: GPTBot
Disallow: /
User-agent: OAI-SearchBot
Disallow: /
User-agent: ChatGPT-User
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: Claude-SearchBot
Disallow: /
User-agent: Claude-User
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: Applebot-Extended
Disallow: /
User-agent: PerplexityBot
Disallow: /
User-agent: Perplexity-User
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: Bytespider
Disallow: /
User-agent: meta-externalagent
Disallow: /
User-agent: Amazonbot
Disallow: /
User-agent: *
Disallow: /premium/
Disallow: /private-resources/
Sitemap: https://www.example.com/sitemap.xml
Method
Before blocking, audit what AI crawlers already see.
Classify content by value and risk.
Separate public pages, conversion pages, studies, images, PDFs, documentation, paid resources, staging, back office and APIs. A single policy for the whole domain is rarely optimal.
Read logs before deciding.
Identify user-agents, frequency, requested pages, HTTP statuses, IPs, countries and traffic spikes. A crawler that never appears in logs does not always deserve a priority decision.
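A first pass over the logs can be sketched as follows. The example assumes access logs in the common "combined" format, matches tokens by substring in the User-Agent field, and uses a hypothetical helper named `count_ai_hits`. It counts declared user-agents only: a scraper that hides its identity will not appear in this count.

```python
import re
from collections import Counter

# Tokens to look for in the User-Agent field, taken from the reference table.
AI_TOKENS = [
    "GPTBot", "OAI-SearchBot", "ChatGPT-User",
    "ClaudeBot", "Claude-SearchBot", "Claude-User",
    "PerplexityBot", "Perplexity-User",
    "CCBot", "Bytespider", "meta-externalagent", "Amazonbot",
]

# Combined log format: "METHOD path PROTO" status size "referer" "user-agent"
LOG_PATTERN = re.compile(
    r'"(?:[A-Z]+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)


def count_ai_hits(log_lines):
    """Count requests per (token, HTTP status) for declared AI user-agents."""
    hits = Counter()
    for line in log_lines:
        match = LOG_PATTERN.search(line)
        if not match:
            continue  # Skip lines that are not in the expected format.
        agent = match.group("agent").lower()
        for token in AI_TOKENS:
            if token.lower() in agent:
                hits[(token, match.group("status"))] += 1
                break
    return hits
```

Typical usage is to pass an open file, for example `count_ai_hits(open("access.log", encoding="utf-8", errors="replace"))`, where the log path is an assumption to adapt to your server. Grouping by status shows whether a crawler you believe is blocked still receives 200 responses.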
Choose by use case, not by fear.
Keep crawlers that support discovery and useful citation. Reserve training when content is strategic. Close areas that have no reason to be read by a public crawler.
Test the real effect after publication.
Check that the file is reachable, rules are syntactically valid, target crawlers read them, and visibility in Google, ChatGPT, Claude, Perplexity or Apple evolves as expected.
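Part of this check can be automated before publication. The sketch below uses Python's standard `urllib.robotparser` to compare a robots.txt draft against a list of expectations. The rules, URLs and the helper name `check_rules` are illustrative assumptions. Python's parser does not implement every matching nuance of each crawler, so treat the result as a sanity check, not as proof of how a given crawler will behave.

```python
from urllib.robotparser import RobotFileParser

# Illustrative draft: reserve training crawlers, protect private paths.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: *
Disallow: /admin/
Disallow: /private/
"""

# (user-agent token, URL) -> whether fetching should be allowed.
EXPECTATIONS = {
    ("GPTBot", "https://www.example.com/blog/post"): False,
    ("OAI-SearchBot", "https://www.example.com/blog/post"): True,
    ("OAI-SearchBot", "https://www.example.com/admin/login"): False,
    ("Googlebot", "https://www.example.com/"): True,
}


def check_rules(robots_txt, expectations):
    """Return the (agent, url) pairs whose result differs from the expectation."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    failures = []
    for (agent, url), expected in expectations.items():
        if parser.can_fetch(agent, url) != expected:
            failures.append((agent, url))
    return failures


if __name__ == "__main__":
    # An empty list means every expectation holds for this draft.
    print(check_rules(ROBOTS_TXT, EXPECTATIONS))
```

Keeping the expectations next to the file documents the intent of each rule and catches regressions when tokens are added or renamed.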
Internal path
AI crawler governance completes SEO, GEO and agent-readiness.
Blocking or allowing crawlers only makes sense inside a wider strategy: be found, be cited, be understood by agents, and measure what actually happens. Read this page with the rest of the SEO and AI visibility cluster.
Conclusion
The best AI crawler policy is selective, dated and verified.
A site that blocks everything may protect itself, but it also becomes less visible in environments where users already ask AI systems to search, compare and recommend. A site that opens everything may gain exposure, but it gives its content away without a strategy.
The right level is between the two: open public pages that should be cited, reserve training when the content justifies it, close sensitive areas with real controls, then measure the effect in logs and AI answers.
Do not block AI by reflex. Govern each use as a visibility, rights and security decision.
AI crawler governance is becoming a normal layer of SEO strategy.
The question is no longer only "can we be crawled?". The real question is: which content should remain visible, which content should be reserved, and which content should be protected by more than a declaration?
At Edikka, an AI crawler policy is not a defensive reflex. It is a trade-off between acquisition, rights, trust and security: keep open what should support discovery, reserve what is an editorial asset, and technically protect what should never depend on a simple robots.txt file.
Stay citable
Public pages carrying offers, proof and useful answers should remain accessible to the right engines.
Reserve sensitive uses
Proprietary content can justify an explicit reservation against some training uses.
Verify real access
Logs, CDN and network rules often say more than the robots.txt file alone.