Should You Block AI Crawlers? Honest SaaS Answer

A founder sent me his robots.txt last month with a proud note: “Locked the AI bots out. Protecting our content.” He had copied a block list from a popular post, dropped in GPTBot, ClaudeBot, CCBot, PerplexityBot, Google-Extended, the whole roster, and felt like he had taken a stand. Then, two sentences later in the same message, he asked why his company never showed up when he asked ChatGPT for tools in his category. He did not see that those were the same fact. He had asked to be left out of the answer, and the answer left him out.

The block-the-AI-crawlers conversation has a genre now. Master lists, “what to allow and what to block” tables, robots.txt snippets you can paste in thirty seconds. Almost all of it is written as if every website has the same interests. It does not. For a B2B SaaS company, most of that advice is answering a question you should not be asking, and a couple of the snippets will quietly cost you the thing you actually want.

There is not one AI crawler. There are two, and they do opposite jobs.

The single most useful thing to understand is that “AI crawler” covers two different machines with two different purposes.

One is the training crawler. GPTBot, ClaudeBot, CCBot, plus Google-Extended and Applebot-Extended, which are strictly robots.txt tokens that control training use rather than separate bots. Its job is to pull content that may be used to train or improve a model. It reads you now so the model might know you later, with no link back, no citation, no traffic. This is the crawler the “protect your content” posts are worried about.

The other is the retrieval crawler. OAI-SearchBot crawls pages so they can appear in ChatGPT’s search answers. PerplexityBot does the same for Perplexity. And Google’s AI Overviews are built on the same index as normal Google search, crawled by the same Googlebot, which means there is no separate “AI part” of Google you can wall off on its own. The retrieval crawlers are the ones that determine whether you exist at all when an AI engine decides what to cite.

Most block lists lump the two together under “AI bots” and tell you to deny all of them. That instruction makes sense for exactly one kind of site, and it is not yours.

“Protect your content” is a publisher’s problem. You are not a publisher.

The whole protect-your-content framing was built by and for media companies. The big publishers whose content is the product they sell. When a training crawler ingests their archive, it is arguably taking the actual asset, the thing readers pay for. Blocking GPTBot is a rational business decision when your words are your revenue.

Your words are not your revenue. If you sell software, your blog posts and docs and comparison pages are marketing. Their entire job is to be read, repeated, and recommended. The idea that you must guard them from being ingested has the economics exactly backwards. You are not protecting inventory. You are hiding your sales collateral from the rooms where buyers now build their shortlist, which is the whole problem I laid out in “your buyers are choosing in a chat you never see.”

So the training-crawler question, the one the genre obsesses over, barely matters for you. Block GPTBot if it makes you feel better. It protects almost nothing you were selling anyway, and the measured traffic effect of blocking training bots sits inside normal fluctuation. Allow it and you keep the small chance the model carries your name into its memory on the next refresh, the slow clock I described in “how long it takes to get cited.” Either way it is a rounding error, not the decision worth agonizing over.

The one rule that actually matters: never block the retrieval crawlers.

Here is the part the paste-this-robots.txt posts get dangerously wrong. When you block “all AI bots,” you almost always catch the retrieval crawlers in the same net. And blocking a retrieval crawler is not content protection. It is you opting out of AI search.

If OAI-SearchBot cannot fetch your page, ChatGPT’s search has nothing of yours to read when a buyer asks who to use in your category. If PerplexityBot is denied, Perplexity cites your competitors instead, because they are what it can reach. And because AI Overviews ride the regular Google index, the only way to keep “AI” out of Google is to remove yourself from Google, which nobody sane wants. In every case you did not protect anything. You made yourself uncitable in the exact place your buyers are deciding, and you will never see the impression you lost, because a query that never surfaced you leaves no trace in your analytics.

That is the founder from the opening. He blocked everything to protect content that did not need protecting, and the collateral damage was the retrieval crawlers, so he disappeared from the answers and could not work out why. The robots.txt was the reason, sitting in a file he had not looked at since the day he pasted it.
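If you want to run the same check on your own file, here is a minimal sketch using Python’s standard-library urllib.robotparser. The sample block list, the example.com URL, and the function name blocked_retrieval_agents are my own illustrations, not his actual file. The standard-library parser also only approximates how each engine reads robots.txt, so treat the result as a first pass, not a verdict.

```python
from urllib.robotparser import RobotFileParser

# Retrieval crawlers named in this post. Adjust the tuple for the
# engines you care about.
RETRIEVAL_AGENTS = ("OAI-SearchBot", "PerplexityBot", "Googlebot")

# Hypothetical "block all AI bots" snippet of the kind that gets pasted
# from block-list posts. Assumption: illustrative only.
PASTED_BLOCK_LIST = """
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: CCBot
User-agent: PerplexityBot
User-agent: OAI-SearchBot
User-agent: Google-Extended
Disallow: /
"""


def blocked_retrieval_agents(robots_txt: str,
                             url: str = "https://example.com/") -> list[str]:
    """Return the retrieval crawlers that this robots.txt denies for `url`."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return [agent for agent in RETRIEVAL_AGENTS
            if not parser.can_fetch(agent, url)]


if __name__ == "__main__":
    print(blocked_retrieval_agents(PASTED_BLOCK_LIST))
    # → ['OAI-SearchBot', 'PerplexityBot']
```

An empty list means none of the three is denied for that URL. Anything else is a crawler you have told to stay away from the page you wanted cited.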

What I got wrong

The first time a client asked me about this, I gave the reflexive answer. Block the training bots, protect your IP, it is the responsible default. I wrote them a tidy robots.txt with GPTBot and the rest denied, felt thorough, and moved on.

Then I actually thought about what I had protected. Nothing. Their content was lead-gen content, written to be spread as widely as possible. I had spent effort defending an asset that wanted to be taken, on a reflex imported from publishers whose situation was the exact opposite of my client’s. Worse, an earlier version of that file had denied a retrieval crawler in the same block, which would have started quietly walling them out of the answers we were at that moment paying to get them into.

Now I invert the default. For a software company, the starting robots.txt allows the retrieval crawlers without exception, and treats the training crawlers as an optional, low-stakes preference the founder can set either way over a coffee. The energy goes into being retrievable, not into building a wall around marketing copy nobody was trying to steal.
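As a rough sketch of that inverted default, the snippet below assembles a robots.txt that always allows the retrieval crawlers and only denies the training side when a flag is set. The function name build_robots_txt and the block_training flag are hypothetical, and the agent lists are simply the names used in this post, so check each vendor’s current documentation before relying on them.

```python
# Names taken from this post. Assumption: vendors may rename or add agents.
RETRIEVAL_AGENTS = ("OAI-SearchBot", "PerplexityBot", "Googlebot")
# Google-Extended and Applebot-Extended are robots.txt tokens that control
# training use; they are listed here because they are set the same way.
TRAINING_AGENTS = ("GPTBot", "ClaudeBot", "CCBot",
                   "Google-Extended", "Applebot-Extended")


def build_robots_txt(block_training: bool = False) -> str:
    """Build a robots.txt that never denies the retrieval crawlers."""
    lines = []

    # Retrieval crawlers: one group, always allowed.
    lines += [f"User-agent: {agent}" for agent in RETRIEVAL_AGENTS]
    lines += ["Allow: /", ""]

    # Training crawlers: the optional, low-stakes preference.
    if block_training:
        lines += [f"User-agent: {agent}" for agent in TRAINING_AGENTS]
        lines += ["Disallow: /", ""]

    # Everyone else stays allowed.
    lines += ["User-agent: *", "Allow: /"]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(build_robots_txt(block_training=True))
```

The design choice is that the retrieval group is written unconditionally, so flipping the training preference can never touch it.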

Why this matters

The block-AI-crawlers question feels like a security decision, which is why founders answer it with a security instinct: lock it down, deny by default, be safe. But for a SaaS it is a distribution decision wearing a security costume. The real risk is not that a model reads your marketing. The real risk is that a copied block list makes you invisible to the engines your buyers now ask first, and that the file doing it sits in a place nobody checks for months.

It is worth seeing that this is the exact mirror image of the llms.txt mistake. There, people add a file hoping it makes engines read them more. Here, people add a file that makes engines read them less, while believing they are protecting themselves. Same folder at the root of the site, opposite errors, both made by pasting someone else’s technical advice without once asking whether it was written for a business like yours.

If you want to know whether your own robots.txt is quietly keeping you out of AI answers, that is one of the first things I check in an engagement, and the full method is on the methodology page.

Should you block AI crawlers? If you sell software, you are asking the wrong question.