AI Data Poisoning

Cloudflare has a new feature—available to free users as well—that uses AI to generate random pages to feed to AI web crawlers:

Instead of simply blocking bots, Cloudflare’s new system lures them into a “maze” of realistic-looking but irrelevant pages, wasting the crawler’s computing resources. The approach is a notable shift from the standard block-and-defend strategy used by most website protection services. Cloudflare says blocking bots sometimes backfires because it alerts the crawler’s operators that they’ve been detected.

“When we detect unauthorized crawling, rather than blocking the request, we will link to a series of AI-generated pages that are convincing enough to entice a crawler to traverse them,” writes Cloudflare. “But while real looking, this content is not actually the content of the site we are protecting, so the crawler wastes time and resources.”

The company says the content served to bots is deliberately irrelevant to the website being crawled, but it is carefully sourced or generated using real scientific facts—such as neutral information about biology, physics, or mathematics—to avoid spreading misinformation (whether this approach effectively prevents misinformation, however, remains unproven).
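To make the mechanism concrete, here is a minimal sketch in Python of how such a link maze might work. This is not Cloudflare’s implementation: the bot check is a crude stand-in, and the filler text would in practice be drawn from a pre-generated corpus rather than templated on the fly.

    import hashlib
    from flask import Flask, request

    app = Flask(__name__)

    # Crude stand-in heuristic; real systems rely on behavioral and
    # network-level signals, not the User-Agent string.
    SUSPECT_AGENTS = ("python-requests", "scrapy", "curl")

    def looks_like_bot(req):
        ua = req.headers.get("User-Agent", "").lower()
        return any(token in ua for token in SUSPECT_AGENTS)

    def maze_page(token):
        # Child links are derived by hashing the current token, so the maze
        # is effectively infinite but needs no server-side state or storage.
        children = [hashlib.sha256(f"{token}/{i}".encode()).hexdigest()[:16]
                    for i in range(3)]
        links = "".join(f'<p><a href="/maze/{c}">Further reading</a></p>'
                        for c in children)
        # Filler text would come from a pre-generated, factually neutral corpus.
        return f"<html><body><h1>Notes on cell biology</h1>{links}</body></html>"

    @app.route("/")
    @app.route("/maze/<token>")
    def serve(token="root"):
        if looks_like_bot(request):
            return maze_page(token)
        return "<html><body>The real site content goes here.</body></html>"

    if __name__ == "__main__":
        app.run()

Because each page’s links are just hashes of its own URL, a crawler can descend forever while the server stores nothing and spends almost nothing per request.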

It’s basically an AI-generated honeypot. And AI scraping is a growing problem:

The scale of AI crawling on the web appears substantial, according to Cloudflare’s data that lines up with anecdotal reports we’ve heard from sources. The company says that AI crawlers generate more than 50 billion requests to their network daily, amounting to nearly 1 percent of all web traffic they process. Many of these crawlers collect website data to train large language models without permission from site owners….

Presumably the crawlers will now have to up both their scraping stealth and their ability to filter out AI-generated content like this. Which means the honeypots will have to get better at detecting scrapers and more stealthy in their fake content. This arms race is likely to go back and forth, wasting a lot of energy in the process.

Posted on March 26, 2025 at 7:07 AM

Comments

finagle March 26, 2025 12:56 PM

Assuming, of course, that Cloudflare can tell the difference between bots and real users, or doesn’t have an agenda w.r.t. VPNs, for instance, which it appears to have. So now I’ll have to contend with deliberate misinformation being farmed out as well. Thanks.

lurker March 26, 2025 1:13 PM

@finagle

Don’t assume that Cloudflare can tell the difference between bots and real users. Their script “Just testing that you are not a robot” has choked two of my favourite browsers for about a month now.

Victor Serge March 26, 2025 2:02 PM

@finagle, lurker

industrial scale, “tailor-made targeting of competitors” and privacy destruction seem EVER to be the goals of several “holy” agents, blindly endorsed by boot licking public servants, and horrified citizens, and many other rubber-stamp wielding bed-wetters; the sum of which unfortunately makes a majority, where (technically living, breathing) regulators actually care.

“if it doesn’t kill us, it will make us stronger” – global redneck

“my bugs will kill your bugs” – the arrogant Imperialist Compliante

“Leave the weeds alone: since while you go about plucking them up, you also spoil some wheat. Let both grow together till harvest comes, then I will burn the weeds.” – Jesus

“frankly it IS time to burn it all to the ground” – 4eagleHeadmark2015

Thanks Bruce.

Clive Robinson March 26, 2025 2:42 PM

@ lurker, ALL,

With respect to Cloud-Fail and their “proof of life” tests, of which you observe,

“Don’t assume that Cloudflare can tell the difference between bots and real users.”

You are assuming that there is not an ulterior reason…

Whilst I’m not saying it’s true by any means, some observers think people are being used by Cloud-Fail to “Classify Images” for free, to build up the input corpus for their own AI.

The thing is, true or not we cannot tell… So it makes for a good conspiracy theory that might be true…

finagle March 26, 2025 5:20 PM

@Clive
Nope, not assuming there is no ulterior motive. But I’m also not ruling out incompetence.

Another thing that occurs to me: how long before their AI serves something out that clashes with the domain they are ‘protecting’, or is legally actionable? No way as a site owner would I want an AI pushing out unmoderated, unknown content from my domain. Or, for that matter, humans, but it goes double for AI. I can see this backfiring spectacularly.

Dave March 26, 2025 8:02 PM

Remember when spammers were harvesting lots of email addresses off of publicly available webpages? There were several programs available to counter exactly this: they lured harvesters into a never-ending maze of pages to keep them distracted, and fed them fake email addresses.

Daisy L. March 26, 2025 11:36 PM

This article brilliantly highlights the escalating arms race in AI security—where data poisoning evolves from a theoretical concern to a practical weapon. The case studies demonstrating how subtle training data manipulations can corrupt model outputs (e.g., misclassifying stop signs) reveal alarming vulnerabilities in our AI-dependent infrastructure.

What’s most troubling is the asymmetry: poisoning attacks require minimal adversarial effort compared to the monumental cost of prevention. Schneier’s proposed ‘immunization’ frameworks—through cryptographic data provenance and decentralized verification—feel essential, yet their adoption lags far behind AI deployment speeds.

This isn’t just an ML problem but a systemic risk. As generative AI proliferates, could we see ‘feedback loop poisoning’ where corrupted outputs become future training inputs? The policy implications demand urgent attention.

Clive Robinson March 27, 2025 4:00 AM

@ Bruce, ALL,

This will almost certainly turn into an “arms race” of the ECM, ECCM, ECCCM form, which ended up where nobody could count or remember how many “counter, counter, counters” had been racked up.

The only reason it stopped was not that it became rapidly ineffective even with exponentially rising costs to both sides; it was that it rapidly outstripped the technology limitations (you needed a 747 to carry equipment that was supposed to fit in high-performance combined-role fighter airframes…).

It’s one of the reasons that interest in “anti radiation weapons” such as radar emitter seeking missiles became preferential.

The real issue however is the two basic costs,

1, Sunk cost
2, Lost opportunity cost

And that any returns are actually minimal whilst both costs rise almost exponentially.

Thus a change of tactics is probably needed.

What would affect these “pillage bot operators” the most is in effect being jailed or, to use an older idea, “sent to Coventry”.

They can only work because “network operators” agree to give them network access. If such access is denied by “cutting them off” then the problem would significantly reduce.

Which is possibly why some big tech Silicon Valley corps are investing as much as they are in back-bone networking.

They are however dependent on “the last mile” “off ramp”, and so the requisite “peering agreements”, in exactly the same way the old Telcos were and still are.

If such unlawful “data pillaging” was declared “unacceptable use” then it could be “blocked upstream”.

It’s interesting to note who is against “Net neutrality rules”; maybe it’s time they found out the rules are “two way”.

Clive Robinson March 27, 2025 10:43 PM

@ Daisy L., ALL,

You pose the problem of,

“As generative AI proliferates, could we see ‘feedback loop poisoning’ where corrupted outputs become future training inputs?”

We are already seeing it.

In theory, as the output of an LLM is based on a probability curve, up to a point,

“Correct data will outnumber incorrect data.”

However “noise”, which is what “stochastic” can be viewed as, will “broaden the curve” (lower the Q or quality factor and broaden the skirt). This will change the statistics and increasingly favour “incorrect data” that would otherwise have “been outside the skirt”.

Thus with feedback of incorrect data the Q will drop and the skirt broaden and as a consequence “flatten the statistics”.

Which is what encryption by a stream cipher aims to do. However, it will be as though an OTP has been used for which “the pad” is lost, so the output cannot be decrypted…

ResearcherZero March 28, 2025 5:10 AM

@Victor Serge

ipso facto

Profit-driven AI models operating without boundaries.

‘https://theconversation.com/trumps-push-for-ai-deregulation-could-put-financial-markets-at-risk-251208

OpenAI wants to profit by freely violating copyright law to take what it wants.
https://carey.jhu.edu/research/whats-yours-isnt-mine-aI-intellectual-property

OpenAI argues its models do not replicate original works but instead “create” fantastic outputs… which it attempts to describe without mentioning cloning and replication.

https://arstechnica.com/tech-policy/2025/03/openai-urges-trump-either-settle-ai-copyright-debate-or-lose-ai-race-to-china/

Only human beings qualify as authors. AI then spits it back in fragments.
https://www.jonesday.com/en/insights/2023/08/court-finds-aigenerated-work-not-copyrightable-for-failure-to-meet-human-authorship-requirementbut-questions-remain

ResearcherZero March 28, 2025 5:29 AM

@Daisy L, ALL

There is solid proof of what Clive was saying. A study of ten major AI chatbots found that one-third of the time they regurgitated arguments made by the Pravda network.

This Russian disinformation network has flooded crawlers and search engines with 3.6 million articles.

‘https://www.axios.com/2025/03/06/exclusive-russian-disinfo-floods-ai-chatbots-study-finds

Ouroboros: a snake eating its own tail

Once ingested by LLMs and other repositories, laundered information may be regurgitated in perpetuity if a way to prevent the contamination of datasets and resources cannot be found.
https://thebulletin.org/2025/03/russian-networks-flood-the-internet-with-propaganda-aiming-to-corrupt-ai-chatbots/

A Nonny Bunny March 28, 2025 3:52 PM

However “noise”, which is what “stochastic” can be viewed as, will “broaden the curve” (lower the Q or quality factor and broaden the skirt). This will change the statistics and increasingly favour “incorrect data” that would otherwise have “been outside the skirt”.

What research shows happening when you feed AI-generated data back into an AI is actually a narrowing, because outliers keep dropping off in each generation cycle (because they’re improbable to generate).
Though the end result is still a drop in quality. There’s a marked difference between a probable linguistic pattern and a true statement. It also increases biases and stereotypes.
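That narrowing is easy to demonstrate with a toy simulation: fit a distribution to samples drawn from the previous generation’s model, discard the improbable tail, and repeat. This is a cartoon of the published model-collapse results, not an LLM experiment; the 2-sigma cutoff is an arbitrary stand-in for truncated sampling.

    import random
    import statistics

    # Each "model" is just a Gaussian fitted to the previous model's output,
    # with low-probability outliers discarded (a crude analogue of truncated
    # sampling in generative models).
    mu, sigma = 0.0, 1.0

    for generation in range(1, 21):
        raw = [random.gauss(mu, sigma) for _ in range(10_000)]
        # Outliers are improbable to generate, so they vanish from the next
        # generation's training data.
        kept = [x for x in raw if abs(x - mu) < 2 * sigma]
        mu = statistics.fmean(kept)
        sigma = statistics.stdev(kept)
        print(f"generation {generation:2d}: sigma = {sigma:.3f}")

The fitted spread shrinks by a bit over ten percent per cycle, so after twenty generations most of the original variation, i.e. the rare and surprising content, is gone.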

A Nonny Bunny March 28, 2025 3:57 PM

Aside from whether all that crawling should be allowed in the first place, maybe it would be good if they just worked together and shared data. Then websites only need to be crawled by one crawler, and not by everyone that wants the data.

Steven Griffin March 28, 2025 4:30 PM

Interestingly, Neal Stephenson wrote about this exact scenario in his novel Anathem. He called it “Artificial Inanity”, which seems like an apropos term for this kind of poison-the-well approach to site scraping.

ResearcherZero March 29, 2025 12:54 AM

I’m just going to put this here.

The black lakes of Baotou.

‘https://www.bbc.com/future/article/20150402-the-worst-place-on-earth

Rare earths are actually more common than silver – but require a lot of highly toxic refining. Rare earth minerals are processed primarily from ores and minerals that also naturally contain uranium and thorium. Processing rare earth minerals involves the separation and removal of uranium and thorium, which results in TENORM wastes. ☢️

“In mineral-rich regions of China, poisoned water and soil have caused abnormal disease rates in “cancer villages” from which impoverished residents cannot afford to move. Crops and animals have died around a crusty lake of radioactive black sludge formed from mining waste near a major mining site in Baotou, Inner Mongolia.”

https://www.latimes.com/world-nation/story/2019-07-28/china-rare-earth-tech-pollution-supply-chain-trade

The vast sludge lake can even be seen on Google Maps.
https://abcnews.go.com/Technology/toxic-lake-black-sludge-result-mining-create-tech/story?id=30122911

Marcus Butler April 15, 2025 7:43 AM

I’m the author of one of the content-obfuscation tools meant to confound badly behaved bots (including AI scrapers). One of the main differences between my tool (Quixotic) and most of the others out there is that, while it does include an optional link maze/tarpit, the default mode of operation is to serve pre-generated, obfuscated content. The reason for this is to avoid wasting energy and other resources on these bots. The Quixotic Markov generator makes one run against your content when you deploy to your site; after that, it’s just your web server serving static HTML files.
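As a toy illustration of the general technique, and not Quixotic’s actual code, a word-level Markov generator of this kind fits in a couple of dozen lines; the input filename below is a hypothetical stand-in.

    import random
    from collections import defaultdict

    def build_chain(text, order=2):
        # Map each run of `order` words to the words observed to follow it.
        words = text.split()
        chain = defaultdict(list)
        for i in range(len(words) - order):
            chain[tuple(words[i:i + order])].append(words[i + order])
        return chain

    def generate(chain, order=2, length=200):
        # Walk the chain from a random starting state, producing text that
        # is locally plausible but globally meaningless.
        state = random.choice(list(chain))
        out = list(state)
        for _ in range(length - order):
            followers = chain.get(tuple(out[-order:]))
            if not followers:
                break
            out.append(random.choice(followers))
        return " ".join(out)

    # "site_content.txt" is a hypothetical stand-in for a site's real text;
    # one offline pass yields static decoy pages to serve to bad bots.
    corpus = open("site_content.txt", encoding="utf-8").read()
    decoy = generate(build_chain(corpus))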

But, alas, there is limited signal on an individual web server to distinguish bots vs human user agents. As Bruce noted, this is an arms race, and the bots already have years of defenses built up from prior scraping experience. The current crop of AI bots will switch to a legit-looking user agent header if blocked. Many will rotate amongst a large range of IP addresses, sending no more than one or two requests from a given IP when scraping a site.

So, I think that’s where something like Cloudflare’s system becomes valuable. They can look at bot activity in aggregate, across many sites, to detect patterns that will be invisible to the operator of a single site.
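Here is a crude sketch of the kind of cross-site signal an aggregator could compute; the log format and thresholds are invented for illustration. An IP that touches dozens of unrelated sites, one or two requests each, looks nothing like a human browsing session, even though no single site ever sees enough traffic to notice.

    from collections import defaultdict

    def flag_distributed_scrapers(events, min_sites=50, max_per_site=2):
        """events: iterable of (client_ip, site) pairs pooled across many sites."""
        sites_hit = defaultdict(set)
        hits = defaultdict(int)
        for ip, site in events:
            sites_hit[ip].add(site)
            hits[(ip, site)] += 1
        # Flag IPs spread thinly across many distinct sites.
        return {ip for ip, sites in sites_hit.items()
                if len(sites) >= min_sites
                and all(hits[(ip, s)] <= max_per_site for s in sites)}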

I’ve thought about something like a crowdsec module for distributed hosts to better detect bots by sharing intelligence, but for now, Cloudflare’s system seems to be the most practical option if something like a content obfuscator isn’t enough for a site operator.
