Robots.txt to Allow LLM AIs Like ChatGPT for Search Only Using Content Signals

How to allow LLMs like ChatGPT to crawl your website for search purposes while preventing AI model training by using Cloudflare’s Content Signals

Previously, I wrote about how to block all LLM AIs like ChatGPT and then decided to allow them to visit again.

I thought most AI companies didn’t crawl very frequently, but then I checked my Cloudflare dashboard.

Look at the image below, taken from Cloudflare's AI Crawl Control. The top crawler is ChatGPT, and interestingly, it crawls the most while actual user visits from ChatGPT remain the lowest.

Cloudflare AI Crawl Control

To be clear, I’m not against AI. I use it and I like it. But some AI companies behave unethically. The relationship between creators and these AIs is often parasitic, benefiting only the AI companies while giving nothing back to creators.

But the genie is out of the bottle and we can’t put it back. From now on, it’s adapt or be left behind.

So what’s a fair solution?

For now, I think Cloudflare’s Content Signals mechanism is the best option.

It adds Content-Signal directives to your robots.txt file. There are three available signals (see the per-crawler example after this list):

  • search: building a search index and providing search results (e.g., returning hyperlinks and short excerpts from your content). This does not include AI-generated search summaries.
  • ai-input: using your content as input to AI models (e.g., RAG, grounding, or real-time ingestion for generative AI answers).
  • ai-train: training or fine-tuning AI models.
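As I read Cloudflare's Content Signals Policy, a Content-Signal line applies to the User-Agent group it appears in, like any other robots.txt rule, so you can give different crawlers different signals. Here's a hypothetical sketch (the crawler name is only illustrative; omitting a signal neither grants nor restricts that use, per clause (c) of the policy text further down):

User-Agent: Googlebot
# No ai-input signal here: that use is neither granted nor restricted.
Content-Signal: search=yes, ai-train=no
Allow: /

User-Agent: *
Content-Signal: search=yes, ai-input=no, ai-train=no
Allow: /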

Here’s what my robots.txt file looks like. It basically grants permission only for search purposes:

# As a condition of accessing this website, you agree to abide by the following
# content signals:

# (a)  If a content-signal = yes, you may collect content for the corresponding
#      use.
# (b)  If a content-signal = no, you may not collect content for the
#      corresponding use.
# (c)  If the website operator does not include a content signal for a
#      corresponding use, the website operator neither grants nor restricts
#      permission via content signal with respect to the corresponding use.

# The content signals and their meanings are:

# search:   building a search index and providing search results (e.g., returning
#           hyperlinks and short excerpts from your website's contents). Search does not
#           include providing AI-generated search summaries.
# ai-input: inputting content into one or more AI models (e.g., retrieval
#           augmented generation, grounding, or other real-time taking of content for
#           generative AI search answers).
# ai-train: training or fine-tuning AI models.

# ANY RESTRICTIONS EXPRESSED VIA CONTENT SIGNALS ARE EXPRESS RESERVATIONS OF
# RIGHTS UNDER ARTICLE 4 OF THE EUROPEAN UNION DIRECTIVE 2019/790 ON COPYRIGHT
# AND RELATED RIGHTS IN THE DIGITAL SINGLE MARKET.

User-Agent: *
Content-Signal: ai-train=no, search=yes, ai-input=no
Allow: /
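Once the signals are in place, it's worth checking what your site actually serves, especially since Cloudflare can manage robots.txt on your behalf. Here's a minimal Python sketch that fetches robots.txt and prints any Content-Signal lines it finds; the domain is a placeholder, so substitute your own:

import urllib.request

# Placeholder domain: replace with your own site.
URL = "https://example.com/robots.txt"

with urllib.request.urlopen(URL) as resp:
    body = resp.read().decode("utf-8", errors="replace")

# Print every Content-Signal directive the server returns.
for line in body.splitlines():
    if line.strip().lower().startswith("content-signal"):
        print(line.strip())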

Of course, robots.txt is only a request, not an enforcement mechanism. Crawlers can simply ignore it, and there's no guarantee they'll comply.

But it’s the best we can do for now, and we can hope that future standards and regulations improve this situation.

Conclusion

If you have any questions or a better method, leave a comment below. Thanks for reading.
