How AI companies are secretly collecting training data from the web (and why it matters)

Like most people, my wife types a search into Google many times each day. We work from home, so our family room doubles as a conference room. Whenever we’re in a meeting, and a question about anything comes up, she Googles it.
This is the same as it’s been for years. But what happens next has changed.
Instead of clicking on one of the search result links, she more often than not reads the AI summary. These days, she rarely clicks on any of the sites that provide the original information that Google’s AI summarizes.
When I spoke to her about this, my wife, Denise, acknowledged that she does visit sites less frequently. But she also pointed out that, for topics where she’s well-versed, she has noticed the AI is sometimes wrong. She takes the AI results with a grain of salt, but they often provide enough basic information that she doesn’t need to look further. If in doubt, she digs deeper.
So that’s where we are today. More and more users are like my wife, getting data from the AI and never visiting websites (and therefore never giving content creators a chance to be compensated for their work).
Worse, more and more people are trusting AI, so not only are they making it harder for content creators to make a living, but they are often getting hallucinated or otherwise incorrect information. Since they never visit the original sources, they have little impetus to cross-check or verify what they read.
The impact of AI scraping
Cloudflare CEO Matthew Prince offered some devastating statistics. His metric: the ratio of pages a company crawls to the number of visitors it sends back to the sites it crawled.
As a baseline, he said that 10 years ago, for every two pages Google crawled, it sent one visitor to a content creator’s site. Six months ago, that ratio was six pages crawled to one visitor sent to a content site. Now, just six months later, it’s 18 pages crawled to one visitor sent to a content site.
The numbers, according to Prince, are far worse for AI companies, which derive substantial value from information they’ve scraped from the rest of us. Six months ago, the ratio of pages scraped to visitors referred back by OpenAI was 250 to 1. Now, as people have become more comfortable trusting AI answers (or too indifferent to care about their inaccuracies), the ratio is 1,500 to 1.
In many ways, AI is becoming an existential threat to content creators. By vacuuming up content produced by hard-working teams all across the world, and then feeding that content back to readers as summaries, AI companies are draining revenue and influence from publishers and writers. Many creators are also losing motivation, because if they can’t make a living from their work, or at least build a following, why bother?
Some publishers, like Ziff Davis (ZDNET’s parent company) and the New York Times, are suing OpenAI for copyright infringement. You’ve probably seen the disclaimer on ZDNET that says, “Disclosure: Ziff Davis, ZDNET’s parent company, filed an April 2025 lawsuit against OpenAI, alleging it infringed Ziff Davis copyrights in training and operating its AI systems.”
Other publishers, including the Wall Street Journal, the Financial Times, the Atlantic, and the Washington Post, have licensed their content to OpenAI and some other AI large language models.
The damage to society as a whole that AI intermediation can cause is profound and worth an article all on its own. But this article is more practical. Here, we acknowledge the threat AI presents to publishing, and focus on technical ways to fight back.
In other words, if the AIs can’t scrape, they can’t give away published and copyrighted content without publishers’ permission.
Robots.txt: Your first defense
The simplest, most direct, and possibly least effective defense is the robots.txt file. This is a file you put at the root of your website’s directory; it tells spiders, crawlers, and bots whether they have permission to access your site. Because its rules are keyed to each crawler’s User-Agent string, this approach is also called User-Agent filtering.
This file has a number of interesting implications. First, only well-behaved crawlers will pay attention to its specifications. It doesn’t provide any security against access, so compliance is completely voluntary on the part of the bots.
Second, you need to be careful which bots you send away. For example, if you use robots.txt to deny access to Googlebot, your site won’t get indexed for searching on Google. Say goodbye to all Google referrals. On the other hand, if you use robots.txt to deny access to Google-Extended, you’ll block Gemini from indexing and using your site for Gemini training.
This site has an index of those bots you might want to deny access to. This is OpenAI’s guide on how to prevent OpenAI’s bots from crawling your site.
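To make that concrete, here’s a minimal robots.txt sketch that keeps Google Search indexing while turning away several commonly cited AI-training crawlers. The User-Agent names below (Googlebot, Google-Extended, GPTBot, CCBot) are the ones those operators have documented, but bot names change, so verify them against each vendor’s current documentation before relying on this.

```
# Allow Google's search crawler so the site stays in Google Search
User-agent: Googlebot
Allow: /

# Block Google's AI-training crawler (used for Gemini)
User-agent: Google-Extended
Disallow: /

# Block OpenAI's training crawler
User-agent: GPTBot
Disallow: /

# Block Common Crawl, a frequent source of AI training data
User-agent: CCBot
Disallow: /
```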
But what about web scrapers that ignore robots.txt? How do you prevent them from scraping your site?
How can you prevent rogue scraping?
It’s here that site operators need to use a belts-and-suspenders strategy. You’re basically in an arms race to find a way to defend against scraping, while the scrapers are trying to find a way to suck down all your site’s data. In this section, I’ll list a few techniques. This is far from a complete list. Techniques change constantly, both on the part of the defenders and the scrapers.
Rate limit requests: Modify your server to limit how many pages a given IP address can request in a period of time. Humans aren’t likely to request hundreds of pages per minute. This, like most of the techniques itemized in this section, differs from server to server, so you’ll have to check your server’s documentation to find out how to configure it. It may also annoy your site’s visitors so much that they stop visiting. So, there’s that.
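As an illustration of the idea, here’s a minimal sketch of a per-IP sliding-window limiter in Python. It’s not how you’d do this in production; normally you’d configure rate limiting in the web server or CDN itself, and the 60-requests-per-minute budget below is an arbitrary assumption.

```python
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60   # look-back window
MAX_REQUESTS = 60     # arbitrary per-IP budget within the window (an assumption)

# Request timestamps per client IP
_requests = defaultdict(deque)


def allow_request(client_ip: str) -> bool:
    """Return True if this IP is still under its request budget."""
    now = time.monotonic()
    timestamps = _requests[client_ip]

    # Drop timestamps that have aged out of the window
    while timestamps and now - timestamps[0] > WINDOW_SECONDS:
        timestamps.popleft()

    if len(timestamps) >= MAX_REQUESTS:
        return False  # over budget; the caller should respond with HTTP 429

    timestamps.append(now)
    return True
```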
Use CAPTCHAs: Keep in mind that CAPTCHAs tend to inconvenience users, but they can reduce some types of crawler access to your site. Of course, the irony is that if you’re trying to block AI crawlers, it’s the AIs that are most likely to be able to defeat the CAPTCHAs. So there’s that.
Selective IP bans: If you find there are IP ranges that overwhelm your site with access requests, you can ban them at the firewall level. FireHOL (an open source firewall toolset) maintains blocklists of IP addresses. Most of those lists target general cybersecurity threats rather than scrapers, but they can get you started on a block list. Be careful, though. Don’t use blanket IP bans, or legitimate visitors will be blocked from your site. So, there’s that, too.
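To sketch how such a blocklist might be used, the Python snippet below parses a FireHOL-style .netset file (one IP or CIDR range per line, with # comments) and checks whether a visitor’s address falls inside any listed range. The filename is a placeholder, and in practice you’d usually feed these ranges straight to your firewall rather than check them in application code.

```python
import ipaddress


def load_blocklist(path):
    """Parse a FireHOL-style .netset file: one IP or CIDR range per line, '#' comments."""
    networks = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            networks.append(ipaddress.ip_network(line, strict=False))
    return networks


def is_blocked(client_ip, blocklist):
    """Return True if the client address falls inside any blocked range."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in net for net in blocklist)


# Hypothetical usage -- the filename is a placeholder:
# blocklist = load_blocklist("firehol_level1.netset")
# if is_blocked("203.0.113.7", blocklist):
#     pass  # return HTTP 403, or better, drop the range at the firewall
```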
The rise of anti-scraping services
There are a growing number of anti-scraping services that will attempt to defend your site for a fee. They include:
- QRATOR: Network-layer filtering and DDoS-aware bot blocking
- Cloudflare: Reputation-tracking, fingerprinting, and behavioral analysis
- Akamai Bot Manager: Identity, intent, and behavioral modeling
- DataDome: Machine learning plus real-time response
- HUMAN Security: JavaScript sensors with AI backend
- Kasada: Adaptive challenges and so-called tamper-proof JavaScript telemetry
- Imperva: Threat intelligence plus browser fingerprinting
- Fastly: Rule-based filtering with edge logic
- Fingerprint: Cross-session fingerprinting and user tracking
- Link11: Behavioral analysis and traffic sandboxing
- Netacea: Intent-based detection and server-side analytics
Here’s a quick overview of some of the techniques these services use.
Behavior matching: This technique analyzes more than headers; it looks at how a client actually behaves, including what it requests, how fast, and in what order. It’s essentially a combination of header analysis and per-client request limiting.
JavaScript challenges: Beyond JavaScript-based CAPTCHAs, these challenges often run invisibly in the background of a web page. They require the client to execute scripts, or they measure the pacing of interaction on the page, before allowing further access, so simple scrapers that never run JavaScript fail the check.
Honeypot traps: These are elements buried in a web page, such as invisible fields or links, that are designed to catch bots. A human user never sees or clicks them, but a bot that grabs everything on a page does; when the trap is triggered, the server blocks that client.
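Here’s a minimal honeypot sketch, written in Python with Flask purely for brevity; the same idea works in any framework. A link no human should ever see, because it’s hidden with CSS, points at a trap URL, and any client that requests it gets its IP flagged. The trap path and styling are arbitrary examples, not a standard.

```python
from flask import Flask, abort, request

app = Flask(__name__)
flagged_ips = set()  # in production, persist this or push it to your firewall


@app.route("/")
def index():
    # The trap link is invisible to humans but present in the HTML that bots parse.
    return (
        "<html><body>"
        "<h1>Welcome</h1>"
        '<a href="/old-backups" style="display:none">old backups</a>'
        "</body></html>"
    )


@app.route("/old-backups")
def honeypot():
    # No legitimate visitor should ever request this path.
    flagged_ips.add(request.remote_addr)
    abort(403)


@app.before_request
def block_flagged():
    # Anything further from a flagged IP is refused.
    if request.remote_addr in flagged_ips:
        abort(403)
```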
Overall behavioral analysis: This is where AIs are fighting AIs. AIs running on behalf of your website monitor access behavior, and use machine learning to identify access patterns that are not human. Those malicious accesses can then be blocked.
Browser fingerprinting: Browsers provide a wide range of data about themselves to the sites they access. Bots generally attempt to spoof the fingerprints of legitimate users. But they often inadvertently provide their own fingerprints, which blocking services can aggregate and then use to block the bots.
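Real fingerprinting is far more elaborate than anything that fits in a short snippet, but a toy version of the underlying idea is to check whether the headers a client sends are consistent with the browser it claims to be. The specific checks below are simplified assumptions for illustration, not what the commercial services actually do.

```python
def looks_suspicious(headers: dict) -> bool:
    """Toy consistency checks between the claimed browser and the rest of the request."""
    ua = headers.get("User-Agent", "")

    # No User-Agent at all is a strong signal of a naive script.
    if not ua:
        return True

    # Real browsers send Accept and Accept-Language headers; many simple bots don't.
    if "Mozilla" in ua and ("Accept" not in headers or "Accept-Language" not in headers):
        return True

    # Some automation tools announce themselves outright.
    if "HeadlessChrome" in ua or "python-requests" in ua:
        return True

    return False


# Hypothetical usage with a framework's request headers:
# if looks_suspicious(dict(request.headers)):
#     pass  # challenge or block the client
```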
Decoy traps: These are mazes of decoy pages filled with autogenerated, useless content, linked together in a pattern that causes bots to waste their time or get stuck following links. The links into these mazes are usually tagged rel="nofollow", so search engines don’t index the decoy pages and your SEO ranking isn’t hurt. Of course, malicious bots are learning how to identify these traps and counter them, but they do offer limited protection.
The big trade-off of blocking scraping for AI training
As an author who makes my living directly from my creative output, I find the prospect of AIs using my work as training data to be offensive. How dare a company like OpenAI make billions off the backs of all of us creatives! They then turn around and provide a product that could potentially put many of us out of work.
And yet, I have to acknowledge that AI has saved me time in many different ways. I use a text editor or a word processor every day. But back when I started my career, the publications I wrote for had typesetting operators who converted my written words into publishable content. Now, the blogging tools and content management systems do that work. An entire profession vanished in the space of a few years. Such is the price of new technology.
I’ve been involved with AI innovation for decades. After writing about generative AI since it boomed in early 2023, I’m convinced it’s here to stay.
AI chatbots like Google Gemini and ChatGPT are making token efforts to be good citizens. They scrape all our content and make billions off of it, but they’re willing to provide links back to our work for the very few who bother to check sources.
Some of the big AI companies contend that they provide value back to publishers. An OpenAI spokesperson told Columbia Journalism Review, “We support publishers and creators by helping 400M weekly ChatGPT users discover quality content through summaries, quotes, clear links, and attribution.”
Quoted in Digiday, David Carr, senior insights manager at data analytics company Similarweb, said, “ChatGPT sent 243.8 million visits to 250 news and media websites in April 2025, up 98% from 123.2 million visits this January.”
Those numbers sound big, but only out of context. Google gets billions of visits a day, and before AI, nearly all of those visits resulted in referrals out to other sites. With Google’s referral percentages dropping precipitously, and OpenAI’s referrals amounting to a tiny fraction of the traffic that would otherwise reach content producers, the problem is very real.
Yes, those links are mere table scraps, but do we block them? If you enable web scraping blocks on your website, will it do anything other than “cut off your nose to spite your face,” as my mother used to say?
Unless every site blocks AI scrapers, effectively locking AI data sets to 2025 and earlier, blocking your own site from the AIs will accomplish little more than preventing you from getting what little traffic there is from the AI services. So should you?
In the long term, this practice of AI scraping is unsustainable. If AIs prevent creatives from deriving value from their hard work, the creatives won’t have an incentive to keep creating. At that point, the quality of the AI-generated content will begin to decline. It will become a vicious circle, with fewer creatives able to monetize their skills and the AIs providing ever-worsening content quality.
So, what do we do about it? If we are to survive into the future, our entire industry needs to ask and attempt to answer that question. If not, welcome to Idiocracy.
What about you? Have you taken any steps to block AI bots from scraping your site? Are you concerned about how your content might be used to train generative models? Do you think the trade-off between visibility and protection is worth it? What kinds of tools or services, if any, are you using to monitor or limit scraping? Let us know in the comments below.
You can follow my day-to-day project updates on social media. Be sure to subscribe to my weekly update newsletter, and follow me on Twitter/X at @DavidGewirtz, on Facebook at Facebook.com/DavidGewirtz, on Instagram at Instagram.com/DavidGewirtz, on Bluesky at @DavidGewirtz.com, and on YouTube at YouTube.com/DavidGewirtzTV.