The internet went dark for AI after 2022 — and the race for clean data has never been fiercer

Why the year ChatGPT launched may have also been the year the best training data started running out — and what big tech is doing about it.

Vintage archive drawers and old hard drives in a dimly lit library vault

“Everything before 2022 is safe, fine, clean. Everything after that is dirty.” — AI researcher, as quoted in Futurism (2025)

The 2022 cutoff — why that year changed everything

When OpenAI launched ChatGPT in late November 2022, it didn’t just change how people interact with computers. It fundamentally altered the composition of the internet itself. Within months, AI-generated articles, forum posts, product descriptions, and social content began flooding the web at a scale that human writers simply could not match.

For AI researchers, this created a paradox: the tool that made AI famous also began poisoning the well from which future AI models must drink. Pre-2022 web data — written overwhelmingly by humans, for humans — is now considered the gold standard for training. Researchers describe it simply: data collected before 2022 has minimal, if any, contamination from generative AI. Everything after is suspect.


Model collapse — what happens when AI trains on AI

The technical term for what happens when AI models train on AI-generated data is “model collapse.” It’s not a metaphor — it describes a real, measurable degradation in output quality across generations of models.

Think of it like repeatedly photocopying a photocopy. Each generation loses fidelity. The outputs become more homogeneous, more prone to error, and less grounded in the richness of genuine human expression. Researchers have found that even under ideal laboratory conditions, model collapse is essentially inevitable when synthetic data dominates training sets.
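
The photocopy effect can be sketched with a toy simulation. This is an illustration under loud assumptions, not how any real model is trained: the "model" here is just the empirical word distribution of its training corpus, and the starting corpus of 1,000 distinct made-up words is hypothetical. Each generation "trains" on the previous generation's output and "writes" a new corpus of the same size.

```python
import random


def distinct_tokens_per_generation(corpus, generations, seed=0):
    """Count distinct tokens in each generation of a resampled corpus."""
    rng = random.Random(seed)
    current = list(corpus)
    history = [len(set(current))]
    for _ in range(generations):
        # "Train" on the current corpus, then "generate" a new one of the
        # same size by sampling its tokens with replacement.
        current = rng.choices(current, k=len(current))
        history.append(len(set(current)))
    return history


# Hypothetical starting point: 1,000 distinct "human-written" tokens.
human_corpus = [f"word{i}" for i in range(1000)]
history = distinct_tokens_per_generation(human_corpus, generations=20)
print(history)
```

A word that fails to be sampled in one generation can never come back, so the count of distinct words can only fall. That is the homogenization described above, in miniature.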

“The accumulation of AI-generated data across the internet may jeopardize the integrity of next-generation models — from small domain-specific tools to massive foundation models.”

By the end of 2025, the balance between human-written and AI-generated content online had shifted dramatically. In 2018, nearly all training data came from human sources: photographers, journalists, developers, students. Today, AI-generated content is cheaper, faster, and available at scales that human-created content simply cannot match. The feedback loop intensifies with every article written by a chatbot.


The data drought — how close are we really?

Milestone | Detail
📅 2022 | Year AI content began flooding the web
📉 ~10% | Chance of data exhaustion, per Anthropic CEO Dario Amodei
2026 | Projected year quality data runs out (Epoch AI)

Dario Amodei, CEO of Anthropic, has publicly acknowledged a roughly 10% chance that AI development could be constrained by data scarcity. “We’re not that far from running out of data,” he told interviewer Dwarkesh Patel. Epoch AI, an independent AI research institute, puts it more bluntly: all high-quality training data could be exhausted by 2026.

The crunch is compounded by a legal and policy shift. Since ChatGPT’s launch, hundreds of major websites have quietly updated their terms of service and robots.txt files to block AI crawlers. Publishers, once passive about data scraping, are now suing. The pool of legally accessible, high-quality, human-written text is shrinking from both ends — contaminated from above, and restricted from below.
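
To make the robots.txt shift concrete, here is a minimal sketch using Python's standard-library urllib.robotparser. The robots.txt content and the example.com URL are hypothetical, written for illustration rather than copied from any real site. GPTBot and CCBot are the user-agent names used by OpenAI's and Common Crawl's crawlers; "Mozilla" stands in for an ordinary browser.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: blocks two AI crawlers, allows everyone else.
EXAMPLE_ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""


def blocked_agents(robots_txt, agents, url):
    """Return the user agents from `agents` that may not fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [agent for agent in agents if not parser.can_fetch(agent, url)]


blocked = blocked_agents(
    EXAMPLE_ROBOTS_TXT,
    agents=["GPTBot", "CCBot", "Mozilla"],
    url="https://example.com/articles/some-story",
)
print(blocked)
```

Worth noting: robots.txt is a request, not an enforcement mechanism. It only restricts crawlers that choose to honor it, which is part of why publishers have also turned to terms of service and the courts.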


Who benefits — and who loses

This is where the story gets politically uncomfortable. The companies that scraped the web extensively before 2022 — think Google, Meta, OpenAI — are sitting on archives of clean, pre-contamination human data that cannot be recreated. Late entrants to the AI race, no matter how well-funded, simply cannot go back in time.

Some researchers have drawn a pointed analogy: pre-2022 datasets are becoming like pre-World War II steel — prized precisely because they predate a form of contamination (in steel’s case, radioactive isotopes from nuclear testing) that permanently altered everything produced afterward. The analogy is apt. Clean data is a finite, non-renewable resource.

AI startups and academics are already hammering the Internet Archive and other repositories for older data. The race is quiet, but it is very real.


What comes next

Big tech’s proposed solutions range from pragmatic to audacious. OpenAI has reportedly explored training on transcriptions of public YouTube videos. Others are investing heavily in “synthetic data” — having current models generate training data for future models — though this is precisely the approach most likely to accelerate model collapse if done carelessly.

The most promising near-term fix, according to recent research, is deceptively simple: introducing even a single human-generated data point into a pool of synthetic training data can meaningfully slow the collapse. The implication is that human-generated content doesn’t need to dominate — it just needs to be present. But as AI content grows and human-authored content becomes relatively scarcer, even that minimum threshold becomes harder to guarantee.
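
The intuition can be sketched in a toy setting. This is not the setup of the research mentioned above, and every detail is an assumption made for illustration: the "model" is a Gaussian fitted to ten numbers, the "human" data is a hypothetical pool of 1,000 draws from a standard normal distribution, and each generation trains on the previous model's samples plus an optional handful of fresh human points.

```python
import random
import statistics


def final_spread(human_points_per_generation, generations=500,
                 sample_size=10, seed=0):
    """Std. dev. of the training set after repeated refit-and-resample."""
    rng = random.Random(seed)
    # Hypothetical pool of "human" data drawn from N(0, 1).
    human_pool = [rng.gauss(0.0, 1.0) for _ in range(1000)]
    data = rng.sample(human_pool, sample_size)
    synthetic_count = sample_size - human_points_per_generation
    for _ in range(generations):
        # Fit the toy "model" to the current training set.
        mu = statistics.fmean(data)
        sigma = statistics.stdev(data)
        # The next training set: synthetic samples from the fitted model...
        data = [rng.gauss(mu, sigma) for _ in range(synthetic_count)]
        # ...plus a few fresh human points (possibly none).
        data += rng.sample(human_pool, human_points_per_generation)
    return statistics.stdev(data)


without_human = final_spread(human_points_per_generation=0)
with_human = final_spread(human_points_per_generation=1)
print(without_human, with_human)
```

In this toy, the purely synthetic chain loses almost all of its spread, while one human point per generation keeps re-injecting variety. Whether that carries over to large models is exactly what the research has to establish; the sketch only shows why the idea is plausible.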

What’s clear is that the AI industry’s hunger for data has created a structural vulnerability that no one anticipated when ChatGPT launched to breathless headlines three years ago. The year that changed everything may also be the year the clock started ticking.


Tags: Artificial Intelligence · Machine Learning · Data Science · Model Collapse · Big Tech · AI Training Data
