Level: Advanced

Should you block AI crawlers? Protect content without disappearing

GPTBot, Google-Extended, ClaudeBot, Applebot-Extended, Perplexity: a decision method for blocking, allowing or monitoring AI crawlers without breaking search visibility.

AI crawlers · SEO · rights

Blocking AI crawlers is not a binary decision. You have to separate visibility, training, search, user-triggered actions and real content protection.

  • Allow: crawlers that support discovery, citations and useful user journeys.
  • Reserve: content you do not want used for model training or specific AI uses.
  • Limit: crawlers that are too frequent, opaque or bring no visible benefit to the site.
  • Protect: sensitive areas with real controls, not only with robots.txt.

References checked on June 21, 2026. This article is not legal advice: it helps frame an SEO, GEO, technical and editorial decision before changing a robots.txt file.

Short answer

Do not block "AI" as a whole. Decide which uses of your website you allow.

The weak decision is to add a few lines to robots.txt to "block ChatGPT", "block Gemini" or "block all AI crawlers" without separating use cases. The same company may operate several agents: one crawler for training, one for search, one for user-triggered actions, and sometimes a robots.txt token that is not a crawler at all.

The better decision starts with a sharper question: do you want to stay visible in search engines, AI answers and assisted journeys while refusing specific training or reuse of your content? In most cases, the answer is neither "open everything" nor "close everything".

Edikka position

Keep visibility open, reserve rights on sensitive content, and technically protect what should never be public.

Decision framework

Four different use cases hide behind the phrase "AI crawler".

Before touching robots.txt, classify each crawler by role. A training crawler does not have the same impact as a search crawler. A user-triggered agent does not have the same status as an automatic crawler. And a traditional search engine crawler can still be the necessary gateway to AI features embedded in that search engine.

01 · Classic indexing

Googlebot, Bingbot and Applebot are primarily used to discover, index and rank pages. Blocking them often reduces search visibility.

02 · AI search

Some crawlers support generated answers, citations or conversational search. Blocking them can reduce presence in those answers.

03 · Training

Other crawlers or tokens help control whether content may be used to train or improve models. This is often the most defensible scope to reserve.

04 · User action

Agents may visit a page because a user asked for research, comparison or an action. Blocking them may break an intentional user journey.

Reference table

The main AI crawlers and tokens to know before blocking.

This table summarizes documented or publicly declared behavior as of June 21, 2026. It must be reviewed regularly: crawler names, roles and access policies move quickly.

AI crawlers, documented role and likely effect of blocking

| Crawler or token | Main role | Effect of blocking | Recommended decision |
|---|---|---|---|
| Googlebot | Google Search indexing and AI features integrated into Search. | Reduces or cuts Google’s access to your pages for Search, AI Overviews and AI Mode. | Do not block if Google visibility matters. |
| Bingbot | Bing Search indexing and Microsoft AI experiences that rely on Bing results. | Reduces or cuts Bing’s access to your pages for Search and Copilot answers grounded in Bing. | Do not block if Bing, Edge or Copilot visibility matters. |
| Google-Extended | robots.txt token to control some Gemini, Vertex AI and grounding uses outside Search. | Does not affect inclusion or ranking in Google Search, but may limit some Google AI uses outside Search. | Block if you want to reserve AI use without leaving Google Search. |
| GPTBot | Crawl that may be used to train OpenAI models. | Signals that content should not be used for generative model training by OpenAI. | Often the first selective block to consider, while keeping OAI-SearchBot if ChatGPT citation matters. |
| OAI-SearchBot | Automatic crawl related to search and citations in ChatGPT. | May reduce discovery, citation and presence in ChatGPT Search. | Keep it if you want GEO visibility in ChatGPT. |
| ChatGPT-User | Visits triggered by user actions or requests in ChatGPT and some GPTs. | May prevent a ChatGPT-assisted user from accessing content or completing an action. | Block only if agentic use of the site is undesirable. |
| ClaudeBot | Collection that may contribute to training Anthropic models. | Signals exclusion of future content from Anthropic training datasets. | Block if you reserve training uses. |
| Claude-SearchBot | Search crawler used to improve Claude search results and answers. | May reduce visibility and accuracy of your content in Claude search answers. | Decide according to your AI visibility strategy. |
| Claude-User | Web access requested by a Claude user. | May prevent Claude from retrieving your content for a user request. | Keep for useful public content; limit for sensitive areas. |
| Applebot | Discovery for Apple experiences such as Safari, Spotlight, Siri and Search. | May reduce discoverability across the Apple ecosystem. | Keep for strategic public pages. |
| Applebot-Extended | Use control for training Apple foundation models. | Does not stop Applebot from crawling; it refuses some training uses. | Block if you reserve Apple training uses. |
| PerplexityBot / Perplexity-User | Declared Perplexity crawlers for crawling and user access. | May reduce presence in Perplexity, but robots.txt alone may not always control observed traffic. | Manage with robots.txt, logs and network rules if the topic is sensitive. |
| CCBot, Bytespider, meta-externalagent, Amazonbot | Collection, indexing or AI-use crawlers depending on operators and periods. | Direct benefit is often less clear for a business website; impact must be checked in logs. | Monitor, document and block only if the risk-benefit ratio is unfavorable. |

Tracked sources: OpenAI crawlers, Google AI features, Google-Extended, Microsoft Bing / Copilot, Anthropic crawlers, Applebot, Cloudflare · Perplexity
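
The roles in the table above can be kept as a small lookup, so that log analysis and robots.txt decisions share the same vocabulary. The following is a minimal Python sketch, not an official list: `Role`, `CRAWLER_ROLES` and `classify` are illustrative names, and the role given to each token is a simplified reading of the table that should be checked against each operator's documentation. Google-Extended and Applebot-Extended are robots.txt control tokens and should not be expected to appear in a request's User-Agent header.

```python
from enum import Enum
from typing import Optional


class Role(Enum):
    INDEXING = "classic indexing"
    AI_SEARCH = "AI search"
    TRAINING = "training"
    USER_ACTION = "user action"


# Simplified reading of the reference table (assumption: one main role per token).
CRAWLER_ROLES = {
    "Googlebot": Role.INDEXING,
    "Bingbot": Role.INDEXING,
    "Applebot": Role.INDEXING,
    "OAI-SearchBot": Role.AI_SEARCH,
    "Claude-SearchBot": Role.AI_SEARCH,
    "GPTBot": Role.TRAINING,
    "ClaudeBot": Role.TRAINING,
    "Google-Extended": Role.TRAINING,    # control token, AI uses outside Search
    "Applebot-Extended": Role.TRAINING,  # control token, not a crawler
    "ChatGPT-User": Role.USER_ACTION,
    "Claude-User": Role.USER_ACTION,
}


def classify(user_agent: str) -> Optional[Role]:
    """Return the role of the first known token found in a User-Agent string."""
    ua = user_agent.lower()
    # Longest tokens first, so "Applebot-Extended" is never read as "Applebot".
    for token in sorted(CRAWLER_ROLES, key=len, reverse=True):
        if token.lower() in ua:
            return CRAWLER_ROLES[token]
    return None  # unknown agent: inspect it in logs before deciding


if __name__ == "__main__":
    print(classify("Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)"))
```

Substring matching on a User-Agent header is only a first filter: the header is self-declared and can be spoofed, which is why the later sections pair it with logs and network checks.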

The Google trap

Blocking Google-Extended does not remove you from AI Overviews.

This is the most counter-intuitive point. Google-Extended controls some uses related to Gemini, Vertex AI and grounding in Google systems other than Search. Google explicitly says this token does not affect inclusion or ranking in Google Search.

AI features integrated into Search, such as AI Overviews or AI Mode, rely on the usual Search controls. If you want to limit what Google can show in those experiences, Google points to nosnippet, data-nosnippet, max-snippet or noindex. These controls can also reduce the classic snippet or search visibility.
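
The Search controls named above can also be sent as an `X-Robots-Tag` HTTP response header, which is convenient for PDFs and other non-HTML files. Below is a minimal Python sketch, assuming headers are set per path prefix in your own application code. `SNIPPET_POLICY`, its path prefixes, its values and `robots_header_for` are hypothetical names chosen for illustration, not recommendations. `data-nosnippet` is an HTML attribute and is not covered by this header approach.

```python
from typing import Optional

# Hypothetical per-section policy: path prefix -> robots directives.
SNIPPET_POLICY = {
    "/studies/": ["max-snippet:160"],  # allow a short excerpt only
    "/premium/": ["nosnippet"],        # no text snippet at all
    "/drafts/": ["noindex"],           # keep out of the index entirely
}


def robots_header_for(path: str) -> Optional[str]:
    """Return an X-Robots-Tag value for the longest matching prefix, or None."""
    matches = [prefix for prefix in SNIPPET_POLICY if path.startswith(prefix)]
    if not matches:
        return None  # no header sent: default Search behaviour applies
    longest = max(matches, key=len)
    return ", ".join(SNIPPET_POLICY[longest])


if __name__ == "__main__":
    for path in ("/studies/report.pdf", "/premium/guide", "/blog/post"):
        print(path, "->", robots_header_for(path))
```

As the paragraph above notes, each of these directives also changes how the classic result is displayed, so they are best applied section by section.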

Key takeaway

You cannot cleanly opt out of AI Overviews without touching how Google can display your content in Search.

Europe

In Europe, the topic is not only technical. Article 4 of Directive 2019/790 on copyright in the Digital Single Market creates an exception for text and data mining, but this exception applies only if the use has not been expressly reserved by rights holders, including through machine-readable means for online content.

The AI Act adds obligations for providers of general-purpose AI models, including a policy to comply with Union copyright law. The GPAI Code of Practice also includes a Copyright chapter designed to help providers demonstrate compliance. In practice, robots.txt can become an editorial and legal governance layer, not only an SEO tool.

Caution remains essential: Edikka is not a law firm. But a European website that publishes proprietary content, studies, document bases or editorial corpora should document explicitly what it allows and what it reserves.

Policy to document

The robots.txt decision should be linked to a content policy.

  1. DSM Directive

    Possible rights reservation against some text and data mining uses, including by machine-readable means.

  2. AI Act

    Obligations for GPAI model providers, including a policy to comply with Union copyright law.

  3. GPAI Code

    Copyright chapter proposing practical compliance measures for model providers.

  4. Website

    robots.txt, terms of use, logs, CDN and editorial governance should tell the same story.

This reading must be validated according to your activity, rights and target jurisdictions.

EU sources: Directive 2019/790 · article 4, AI Act · article 53, GPAI Code of Practice

Technical limit

robots.txt is a declared preference, not a security wall.

A respectful crawler reads robots.txt and applies the instructions. An opportunistic scraper can ignore it, change IP address, change user-agent or route through third-party providers. Cloudflare documented stealth crawling behavior attributed to Perplexity despite blocking directives and WAF rules.

The conclusion is not that robots.txt should be abandoned, but that it should be given the right role. It expresses a preference, an access policy and sometimes a machine-readable rights reservation. It does not protect a confidential file, private PDF, internal endpoint, staging environment or back office.

What robots.txt can do

Give instructions to declared crawlers, document a preference, reduce some respectful crawls, express a machine-readable reservation.

What it cannot do

Block HTTP access, authenticate an agent, protect sensitive data, stop a hostile scraper or hide an already known URL.

What to add

Authentication, noindex, X-Robots-Tag, WAF, rate limiting, IP verification, logs, alerts and strict separation between public and private areas.
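
As an illustration of the "IP verification" item, a declared user-agent can be cross-checked against the address ranges an operator publishes, where it publishes any. The sketch below uses Python's standard `ipaddress` module. The `PUBLISHED_RANGES` table and the `is_verified` helper are hypothetical, and the CIDR blocks are documentation-only placeholder ranges, not real crawler addresses: load the real ranges from each operator's official source.

```python
import ipaddress

# Placeholder ranges (reserved for documentation), NOT real crawler addresses.
PUBLISHED_RANGES = {
    "GPTBot": ["192.0.2.0/24"],
    "ClaudeBot": ["198.51.100.0/24", "2001:db8::/32"],
}


def is_verified(token: str, remote_ip: str) -> bool:
    """True if remote_ip falls inside a range listed for the declared token."""
    ip = ipaddress.ip_address(remote_ip)
    for cidr in PUBLISHED_RANGES.get(token, []):
        network = ipaddress.ip_network(cidr)
        # Compare only addresses of the same IP version.
        if ip.version == network.version and ip in network:
            return True
    return False  # unknown token or address outside every listed range


if __name__ == "__main__":
    print(is_verified("GPTBot", "192.0.2.10"))   # inside the placeholder range
    print(is_verified("GPTBot", "203.0.113.5"))  # outside it
```

A request that claims a known token but fails this check is a candidate for rate limiting or a WAF rule rather than a robots.txt change.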

Edikka matrix

The right policy depends on the website’s business model.

An agency website, a media publisher, an ecommerce store and a SaaS product do not have the same interest in opening or closing content. Crawl policy should start from content value, visibility needs, the risk of copying and the technical ability to actually control access.

Recommended AI crawler policy by website type

| Website type | Main objective | Recommended policy | Mistake to avoid |
|---|---|---|---|
| B2B showcase site | Be found, understood, recommended and contacted. | Keep AI search crawlers, optionally reserve training, protect forms and endpoints. | Blocking all AI crawlers and losing useful citations. |
| Media or publisher | Preserve editorial value and negotiate content use. | Clear rights reservation, training blocks, premium policy, log monitoring and section-by-section decisions. | Leaving the whole corpus open by inertia. |
| Ecommerce | Stay visible in comparisons, prices, products and shopping assistants. | Open public product pages, control dynamic stock/prices, protect accounts, carts and checkout. | Blocking useful agents or exposing sensitive actions without safeguards. |
| SaaS | Make the offer, documentation and use cases understandable. | Open marketing and public docs, reserve proprietary content, authenticate the app and APIs. | Confusing public documentation with customer data. |
| Training or premium content | Sell access to structured knowledge. | Open excerpts, proof pages and outlines; reserve full modules and paid resources. | Putting paid content only behind an unlinked URL. |
| Intranet, staging, back office | Prevent unauthorized access. | Authentication, IP allowlist, noindex, network blocking, not only robots.txt. | Believing Disallow: / protects a private area. |

robots.txt examples

Three typical configurations to adapt before publication.

These examples are starting points. They must be tested, documented and adapted to your objectives. User-agents change: always verify names in official documentation before going live.

01 · Open and measured

For a website that mainly wants visibility and temporarily accepts AI uses while monitoring logs.

User-agent: *
Allow: /

Sitemap: https://www.example.com/sitemap.xml

02 · Selective Edikka

To remain visible in search and useful answers while reserving the most obvious training uses.

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: *
Disallow: /admin/
Disallow: /client/
Disallow: /private/

Sitemap: https://www.example.com/sitemap.xml

03 · Strong protection

For a media publisher, premium corpus or site that wants to strongly limit public AI use. Complete it with CDN, WAF and contractual rules.

# Keep Googlebot and Bingbot for Search, block the main declared AI crawlers.

User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Claude-SearchBot
Disallow: /

User-agent: Claude-User
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: Perplexity-User
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: meta-externalagent
Disallow: /

User-agent: Amazonbot
Disallow: /

User-agent: *
Disallow: /premium/
Disallow: /private-resources/

Sitemap: https://www.example.com/sitemap.xml
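
Before publication, a configuration can be checked offline with Python's standard `urllib.robotparser`. The sketch below parses configuration 02 and asks which declared tokens may fetch a sample URL. The `allowed` helper and the sample paths are illustrative. Python's parser only approximates how each real crawler interprets the file, so treat the result as a sanity check, not as proof of crawler behavior.

```python
from urllib.robotparser import RobotFileParser

# Configuration 02 from this article, copied as-is.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: *
Disallow: /admin/
Disallow: /client/
Disallow: /private/

Sitemap: https://www.example.com/sitemap.xml
"""


def allowed(token: str, url: str) -> bool:
    """Ask the standard-library parser whether `token` may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(token, url)


if __name__ == "__main__":
    page = "https://www.example.com/blog/article"
    for token in ("GPTBot", "ClaudeBot", "OAI-SearchBot", "Googlebot"):
        verdict = "allow" if allowed(token, page) else "block"
        print(f"{token:15} {verdict}")
```

With this configuration the training tokens are refused everywhere, while search crawlers fall through to the `*` group and are only kept out of the listed private paths.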

Method

Before blocking, audit what AI crawlers already see.

1 · Map

Classify content by value and risk.

Separate public pages, conversion pages, studies, images, PDFs, documentation, paid resources, staging, back office and APIs. A single policy for the whole domain is rarely optimal.

2 · Observe

Read logs before deciding.

Identify user-agents, request frequency, pages requested, HTTP statuses, IPs, countries and traffic spikes. A crawler that never appears in your logs rarely needs a priority decision.
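
A first pass over the logs can be as simple as counting requests per declared token and HTTP status. The sketch below assumes an access log in the common "combined" format used by default in many Apache and Nginx setups. The `LINE` pattern, the `TOKENS` list and the sample lines are illustrative, and the sample lines are invented for the example, not real traffic.

```python
import re
from collections import Counter

# Combined log format: ip - user [date] "METHOD path proto" status size "referer" "ua"
LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<ua>[^"]*)"'
)

# Tokens taken from this article's table; extend or trim as needed.
TOKENS = ["GPTBot", "OAI-SearchBot", "ChatGPT-User", "ClaudeBot", "Claude-SearchBot",
          "Claude-User", "PerplexityBot", "Perplexity-User", "CCBot", "Bytespider",
          "meta-externalagent", "Amazonbot"]


def count_crawler_hits(lines):
    """Count requests per (token, status) for lines whose user-agent names a token."""
    hits = Counter()
    for line in lines:
        match = LINE.match(line)
        if not match:
            continue  # not in combined format: skip rather than guess
        ua = match.group("ua").lower()
        for token in TOKENS:
            if token.lower() in ua:
                hits[(token, match.group("status"))] += 1
                break
    return hits


if __name__ == "__main__":
    sample = [  # invented lines, for illustration only
        '203.0.113.7 - - [21/Jun/2026:10:00:00 +0000] "GET /blog/a HTTP/1.1" 200 5123 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
        '203.0.113.9 - - [21/Jun/2026:10:00:02 +0000] "GET /private/x HTTP/1.1" 403 312 "-" "Mozilla/5.0 (compatible; ClaudeBot/1.0)"',
    ]
    for (token, status), count in count_crawler_hits(sample).items():
        print(token, status, count)
```

Counting by status is deliberate: a run of 403 or 429 responses shows which rules are already biting, while 200 responses on private paths point to a gap that robots.txt will not close.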

3 · Decide

Choose by use case, not by fear.

Keep crawlers that support discovery and useful citation. Reserve training when content is strategic. Close areas that have no reason to be read by a public crawler.
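
Deciding by use case can be made explicit by generating robots.txt from a per-role decision instead of editing the file by hand. This is a minimal sketch under stated assumptions: `TOKENS_BY_ROLE`, the `"block"` / `"allow"` vocabulary and `build_robots_txt` are hypothetical names, and the grouping of tokens by role follows this article's table rather than any official classification.

```python
# Assumed grouping of declared tokens by role, following the article's table.
TOKENS_BY_ROLE = {
    "training": ["GPTBot", "ClaudeBot", "Google-Extended", "Applebot-Extended"],
    "ai_search": ["OAI-SearchBot", "Claude-SearchBot"],
    "user_action": ["ChatGPT-User", "Claude-User"],
}


def build_robots_txt(decisions, private_paths, sitemap_url):
    """Render robots.txt text from {"role": "block" | "allow"} decisions."""
    lines = []
    for role, tokens in TOKENS_BY_ROLE.items():
        if decisions.get(role) == "block":
            for token in tokens:
                lines += [f"User-agent: {token}", "Disallow: /", ""]
    # Catch-all group: every other crawler, limited only on private paths.
    lines.append("User-agent: *")
    if private_paths:
        lines += [f"Disallow: {path}" for path in private_paths]
    else:
        lines.append("Allow: /")
    lines += ["", f"Sitemap: {sitemap_url}"]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(build_robots_txt(
        {"training": "block", "ai_search": "allow", "user_action": "allow"},
        ["/admin/", "/client/", "/private/"],
        "https://www.example.com/sitemap.xml",
    ))
```

Called with the arguments shown, the function reproduces the rules of configuration 02. Keeping the decisions in one dated, versioned place also gives you the documented policy recommended earlier.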

4 · Verify

Test the real effect after publication.

Check that the file is reachable, that the rules are syntactically valid, that target crawlers read them, and that visibility in Google, ChatGPT, Claude, Perplexity or Apple changes as expected.
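
The "syntactically valid" part of this check can be partly automated. The sketch below is a deliberately small linter, not a full implementation of the Robots Exclusion Protocol: `KNOWN_FIELDS` and `lint_robots_txt` are hypothetical names, and `crawl-delay` is accepted only because it is commonly seen, although not every crawler honors it.

```python
KNOWN_FIELDS = {"user-agent", "allow", "disallow", "sitemap", "crawl-delay"}


def lint_robots_txt(text: str) -> list:
    """Return (line_number, message) pairs for obvious syntax problems."""
    problems = []
    in_group = False  # becomes True once a User-agent line has been seen
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        field, sep, value = line.partition(":")
        field = field.strip().lower()
        if not sep:
            problems.append((number, "missing ':' separator"))
        elif field not in KNOWN_FIELDS:
            problems.append((number, f"unknown field '{field}'"))
        elif field == "user-agent":
            if not value.strip():
                problems.append((number, "empty User-agent value"))
            in_group = True
        elif field in {"allow", "disallow"} and not in_group:
            problems.append((number, "rule before any User-agent line"))
    return problems


if __name__ == "__main__":
    broken = "Disallow: /x\nUser agent GPTBot\nDisalow: /\n"
    for number, message in lint_robots_txt(broken):
        print(f"line {number}: {message}")
```

A misspelled directive such as `Disalow` is typically ignored by parsers without any warning, so a rule you believe is active may never apply. That is the kind of mistake this check is meant to surface before publication.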

Internal path

Blocking or allowing crawlers only makes sense inside a wider strategy: be found, be cited, be understood by agents, and measure what actually happens. Read this page with the rest of the SEO and AI visibility cluster.

Conclusion

The best AI crawler policy is selective, dated and verified.

A site that blocks everything may protect itself, but it also becomes less visible in environments where users already ask AI systems to search, compare and recommend. A site that opens everything may gain exposure, but it lets its content be reused without any strategy.

The right level is between the two: open public pages that should be cited, reserve training when the content justifies it, close sensitive areas with real controls, then measure the effect in logs and AI answers.

Final decision

Do not block AI by reflex. Govern each use as a visibility, rights and security decision.

Edikka vision

AI crawler governance is becoming a normal layer of SEO strategy.

The question is no longer only "can we be crawled?". The real question is: which content should remain visible, which content should be reserved, and which content should be protected by more than a declaration?

At Edikka, an AI crawler policy is not a defensive reflex. It is a trade-off between acquisition, rights, trust and security: keep open what should support discovery, reserve what is an editorial asset, and technically protect what should never depend on a simple robots.txt file.

01 Visibility

Stay citable

Public pages carrying offers, proof and useful answers should remain accessible to the right engines.

02 Rights

Reserve sensitive uses

Proprietary content can justify an explicit reservation against some training uses.

03 Control

Verify real access

Logs, CDN and network rules often say more than the robots.txt file alone.
